Why Memory Interleaving Patterns Are the Hidden Culprit Behind Non-Deterministic AI Training

May 19·9 min read·AI-assisted · human-reviewed

When a deep learning training run produces slightly different results each time you execute it, the usual response is to blame floating-point non-associativity or stochastic operations like dropout. You set random seeds, you pin CUDA operations, you even try TensorFlow's determinism flags. Yet the variance persists. What many engineers overlook is the memory subsystem itself. Virtual-to-physical page mappings, NUMA placement, and the DRAM banks that a given buffer lands on are not fixed from run to run. The interleaving scheme is set by the memory controller and firmware, but the operating system decides which physical pages back your buffers, and on multi-socket servers and on GPUs with high-bandwidth memory (HBM) that choice can silently change the order in which memory requests complete. Timing alone does not change arithmetic; it matters wherever the order of floating-point operations depends on timing, as with atomic adds or dynamically scheduled reductions. This article walks through the mechanisms by which memory interleaving can introduce non-determinism into AI training, and gives practical methods to detect, measure, and control it.

DRAM Bank Conflicts and the Order of Atomic Operations

Modern DDR4 and DDR5 memory is organized into banks, rows, and columns. When two accesses target different rows of the same bank, the second must wait while the bank closes one row and opens the other: a bank conflict. CPUs and GPUs with multiple memory channels issue requests in parallel, but the order in which those requests finish depends on which physical addresses map to the same bank. The operating system assigns physical pages to virtual addresses at runtime from whatever free pages are available, so the mapping differs across reboots and across process launches. Address-space layout randomization (ASLR) adds a second source of variation by moving the virtual addresses themselves. For training loops that use atomic operations (for example, gradient accumulation with floating-point atomic adds, or synchronization in data-loading pipelines), the timing of bank-conflict resolution can change the order in which the atomics are applied. Because floating-point addition is not associative, a different order can produce a different final result.
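To make the last step concrete, here is a minimal sketch of why the order of accumulation matters. It does not model DRAM or atomics at all; the accumulate helper and the three values are illustrative assumptions, and the only point is that the same numbers added in a different order can round differently.

```python
def accumulate(values, order):
    """Add values[i] into a running float sum, visiting indices in `order`."""
    total = 0.0
    for i in order:
        total += values[i]
    return total


# Three "gradient contributions" (illustrative values only).
grads = [0.1, 0.2, 0.3]

in_order = accumulate(grads, [0, 1, 2])   # (0.1 + 0.2) + 0.3
reordered = accumulate(grads, [1, 2, 0])  # (0.2 + 0.3) + 0.1

# float.hex() shows the exact bit pattern, which is what a bitwise
# comparison of checkpoints ultimately compares.
print(in_order.hex(), reordered.hex())
print("bitwise equal:", in_order == reordered)  # prints: bitwise equal: False
```

With atomic adds, the hardware decides the visiting order at run time, which is why the memory timing effects described above can surface as differences in the last bits of a sum.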

How to Test for Bank-Conflict-Induced Variance

Disable ASLR for the training process using personality(ADDR_NO_RANDOMIZE) or the setarch x86_64 -R wrapper. Run the same training twice and compare bitwise outputs.
Pin memory to specific NUMA nodes with numactl --membind=0 to reduce cross-socket variance. If results stabilize, interleaving across sockets was the source of non-determinism.
Use huge pages (2 MB or 1 GB) so that each buffer is backed by far fewer, larger, physically contiguous regions. This also reduces TLB pressure. Fewer pages means fewer independent placement decisions that can differ between runs.
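The first two steps can be scripted. The sketch below builds the wrapped launch command and compares two output files by hash. The helper names and the train.py script are hypothetical, and it assumes each run writes a raw dump of its final weights; checkpoint formats that embed timestamps or other metadata will differ even when the tensors match.

```python
import hashlib
import shlex


def build_command(train_argv, numa_node=None, disable_aslr=True):
    """Prefix a training command with the setarch/numactl wrappers."""
    prefix = []
    if disable_aslr:
        # setarch -R sets ADDR_NO_RANDOMIZE for the child process.
        prefix += ["setarch", "x86_64", "-R"]
    if numa_node is not None:
        prefix += ["numactl", f"--membind={numa_node}"]
    return prefix + list(train_argv)


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large dumps need little memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def bitwise_identical(path_a, path_b):
    """True when both files have exactly the same bytes."""
    return file_digest(path_a) == file_digest(path_b)


# "train.py" is a placeholder for your own entry point.
cmd = build_command(["python", "train.py", "--seed", "0"], numa_node=0)
print(shlex.join(cmd))
```

Run the printed command twice, writing the weights to two different paths, then call bitwise_identical on the two files.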

NUMA Node Interleaving on Multi-Socket Servers

On a dual-socket AMD EPYC or Intel Xeon system, each CPU socket has its own local memory controller. When a process runs on socket 0 but touches memory that is physically placed on socket 1, every such access incurs cross-socket latency. The Linux kernel’s default memory policy is local allocation: a page is placed on the node of the CPU that first touches it. If the training process spawns worker threads that the scheduler migrates between sockets, the pages they touch can end up on different nodes. Worse, the kernel’s automatic NUMA balancing (enabled by default on many distributions when more than one node is present) may migrate pages to a different socket mid-training. This migration is not deterministic between runs, and it changes the latency profile of memory accesses, which in turn changes the timing of thread synchronizations and the order of atomic gradient updates.
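One way to see where a process's pages actually landed is /proc/<pid>/numa_maps, where each mapping line carries N<node>=<pages> fields. The sketch below sums those fields per node. The sample text is made up for illustration, and the page counts are in units of each mapping's own page size, so treat the totals as a rough placement picture rather than byte counts.

```python
import re
from collections import Counter

# Matches fields such as "N0=384" in a numa_maps line.
_NODE_FIELD = re.compile(r"\bN(\d+)=(\d+)\b")


def pages_per_node(numa_maps_text):
    """Sum the N<node>=<pages> fields over all mappings."""
    totals = Counter()
    for line in numa_maps_text.splitlines():
        for node, pages in _NODE_FIELD.findall(line):
            totals[int(node)] += int(pages)
    return dict(totals)


def current_process_pages():
    """Placement of this process; {} where numa_maps is unavailable."""
    try:
        with open("/proc/self/numa_maps") as f:
            return pages_per_node(f.read())
    except OSError:
        return {}


# Illustrative sample, not captured from a real run.
sample = (
    "7f3a00000000 default anon=512 dirty=512 N0=384 N1=128 kernelpagesize_kB=4\n"
    "7f3a10000000 default anon=256 dirty=256 N0=256 kernelpagesize_kB=4\n"
)
print(pages_per_node(sample))  # {0: 640, 1: 128}
```

Sampling this a few times during training, and across runs, shows whether pages are split across sockets and whether the split is stable.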

Controlling NUMA Behavior for Reproducible Training

Disable automatic NUMA balancing with echo 0 > /proc/sys/kernel/numa_balancing.
Bind the training process and all threads to a single socket using numactl --cpunodebind=0 --membind=0. This eliminates cross-socket memory traffic and fixes the node on which every page is placed, although the exact physical pages within that node can still differ between runs.
Use thread pinning (e.g., taskset, or KMP_AFFINITY=granularity=fine,compact,1,0 for the Intel OpenMP runtime) to prevent thread migration within the socket.
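If you prefer to apply the CPU side of these controls from inside the training script, the following sketch reads a node's CPU list from sysfs and pins the caller to it. It is Linux-only, the helper names are my own, and it covers CPU affinity only: memory policy still needs numactl (or libnuma), which the standard library does not expose.

```python
import os


def parse_cpulist(text):
    """Expand a kernel cpulist string such as '0-3,8-11' into a sorted list."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def node_cpus(node):
    """CPUs that belong to one NUMA node, as reported by sysfs."""
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        return parse_cpulist(f.read())


def numa_balancing_enabled():
    """True/False from /proc/sys/kernel/numa_balancing, None if absent."""
    try:
        with open("/proc/sys/kernel/numa_balancing") as f:
            return f.read().strip() != "0"
    except OSError:
        return None


def pin_to_node(node):
    """Restrict the caller to one node's CPUs; threads started later inherit it."""
    cpus = node_cpus(node)
    os.sched_setaffinity(0, cpus)
    return cpus


print(parse_cpulist("0-3,8-11"))  # [0, 1, 2, 3, 8, 9, 10, 11]
```

Call pin_to_node(0) before the framework creates its worker threads, and log numa_balancing_enabled() with each run so the setting is recorded alongside the results.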

Page Coloring and Cache Slice Placement in Modern Intel CPUs

Intel’s last-level cache (LLC) is sliced, typically with one slice per core, and the hash function that maps a physical address to a slice is deterministic but not officially documented. The physical address itself depends on OS page allocation. Two consecutive runs of the same training script may get different physical pages for the model weights, so the same weights land in different cache slices and sets. Because slice contention depends on which cores are accessing which slices, effective memory bandwidth can vary between runs; the 5–15% range should be treated as a rough figure and measured on your own hardware. For bandwidth-bound operations, this variance translates into different numbers of stalled cycles and different instruction retirement rates. Timing by itself does not change arithmetic, but wherever the accumulation order depends on which thread finishes first (atomics, dynamically scheduled reductions), it changes the floating-point accumulation order and hence the final loss values.

Mitigation via Page Coloring and Huge Pages

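Lock allocations in RAM using mlockall(MCL_CURRENT | MCL_FUTURE), combined with posix_memalign to 2 MB boundaries. Locking prevents pages from being swapped out; it does not by itself stop the kernel from migrating them (compaction and NUMA balancing can still move locked pages), so combine it with disabled NUMA balancing and a fixed memory binding.
Use 1 GB huge pages where possible (requires kernel boot parameters default_hugepagesz=1G hugepagesz=1G hugepages=X). Within a 1 GB page, the low 30 bits of the physical address match the virtual address, so only the high-order bits vary between runs. Cache lines inside the page are still spread across slices, but far fewer of the address bits that feed the slice hash can change, which greatly reduces inter-run variance.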
Measure cache miss rates with perf stat -e LLC-load-misses,LLC-store-misses across multiple runs. If the miss count fluctuates by more than 2%, memory placement is likely a non-determinism source.
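A small helper can turn this measurement into a pass/fail check. The sketch below assumes perf was run in CSV mode (perf stat -x, ...), where the counter value is the first field and the event name the third; the function names and the sample line are illustrative, and the 2% threshold is simply the rule of thumb above.

```python
def parse_perf_csv(text):
    """Parse `perf stat -x,` output into {event_name: count}."""
    counts = {}
    for line in text.splitlines():
        if line.startswith("#"):
            continue  # header comment written when perf uses -o
        fields = line.split(",")
        if len(fields) < 3:
            continue
        value, event = fields[0], fields[2]
        if value.isdigit():  # skips "<not counted>" and "<not supported>"
            counts[event] = int(value)
    return counts


def relative_spread(values):
    """(max - min) / mean, a simple run-to-run variation measure."""
    mean = sum(values) / len(values)
    return 0.0 if mean == 0 else (max(values) - min(values)) / mean


def unstable_events(runs, threshold=0.02):
    """Events whose count varies by more than `threshold` across runs."""
    flagged = {}
    for event in runs[0]:
        values = [run[event] for run in runs if event in run]
        spread = relative_spread(values)
        if spread > threshold:
            flagged[event] = spread
    return flagged


# Illustrative line in the CSV layout, not real measurement data.
sample = "1234567,,LLC-load-misses,1000000,100.00,,\n"
print(parse_perf_csv(sample))  # {'LLC-load-misses': 1234567}
```

Collect one CSV per run (for example with perf's -o option), parse each file, and pass the resulting list of dictionaries to unstable_events.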

HBM Interleaving on GPU-Accelerated Nodes

NVIDIA data-center GPUs spread their HBM (HBM2 or HBM2E on A100, HBM3 on H100 SXM) across multiple channels and pseudo-channels. The GPU’s memory controller uses a hash-based interleaving scheme to distribute cache lines across channels. This interleaving is deterministic for a given physical page, but the driver’s page allocator is not guaranteed to return the same physical pages across runs. When a training script creates tensors in a non-deterministic order (for example, with dynamic batching or variable-length inputs), the GPU memory pages assigned to those tensors can differ between runs. The resulting interleaving pattern changes the bank-level parallelism, which changes the latency of atomic gradient updates during backpropagation. On A100 and H100 GPUs the effect is expected to be small. There is no documented switch that makes the driver's physical page placement repeatable, so in practice bitwise identical results come from making the allocation order repeatable and from avoiding kernels whose result depends on timing.

Making GPU Memory Allocation Deterministic

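Set CUDA_CACHE_DISABLE=1 if you want every run to take the same JIT compilation path. This variable disables the driver's cache of JIT-compiled kernels; any effect on allocation order is indirect, so treat it as an optional control.
Pre-allocate GPU buffers in a fixed order before training starts. If you use cudaMallocManaged with cudaMemAdvise and cudaMemAdviseSetPreferredLocation, keep in mind that managed memory is still migrated on demand, so plain up-front allocations are the more predictable choice. Wrap all tensor allocations in a factory that records the allocation sequence.
Enable kernel-level determinism with CUBLAS_WORKSPACE_CONFIG=:4096:8 plus the framework's own switch (tf.config.experimental.enable_op_determinism() or TF_DETERMINISTIC_OPS=1 in TensorFlow, torch.use_deterministic_algorithms(True) in PyTorch). cuDNN itself does not document a CUDNN_DETERMINISTIC environment variable. Note that these settings only control which kernels are selected, not page placement.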
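The allocation-recording factory mentioned above can be as simple as a log of (name, shape, dtype) tuples in call order. This is a sketch under that assumption: AllocationLog and first_divergence are hypothetical names, and the log captures only the order of requests, not the physical pages the driver hands back.

```python
import hashlib


class AllocationLog:
    """Records (name, shape, dtype) for every allocation, in call order."""

    def __init__(self):
        self.entries = []

    def record(self, name, shape, dtype):
        self.entries.append((name, tuple(shape), str(dtype)))

    def fingerprint(self):
        """Hash of the whole sequence; equal hashes mean equal order."""
        digest = hashlib.sha256()
        for entry in self.entries:
            digest.update(repr(entry).encode())
        return digest.hexdigest()


def first_divergence(log_a, log_b):
    """Index of the first differing entry, or None if the sequences match."""
    for i, (a, b) in enumerate(zip(log_a.entries, log_b.entries)):
        if a != b:
            return i
    if len(log_a.entries) != len(log_b.entries):
        return min(len(log_a.entries), len(log_b.entries))
    return None


# Two runs that allocate the same tensors in a different order.
run_a, run_b = AllocationLog(), AllocationLog()
run_a.record("weights", (1024, 1024), "float16")
run_a.record("batch", (32, 128), "int64")
run_b.record("batch", (32, 128), "int64")
run_b.record("weights", (1024, 1024), "float16")

print(run_a.fingerprint() == run_b.fingerprint())  # False
print(first_divergence(run_a, run_b))              # 0
```

In a real script, the factory would call log.record(...) immediately before the framework allocation (for example torch.empty), and each run would write its fingerprint next to its results.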

Virtual Memory Layout Randomization and Its Effect on Data Locality

Linux’s ASLR randomizes the base address of the stack, heap, and mmap regions. For a Python training script that uses PyTorch or TensorFlow, the heap layout determines the adjacency of tensors in virtual address space. Virtual adjacency carries over to physical memory only within a page (4 KB by default, 2 MB or 1 GB with huge pages), so two tensors allocated consecutively share DRAM rows only to the extent that they share pages; mainline Linux does not implement a page coloring policy. When ASLR is active, the offsets of tensors within their pages and their alignment relative to page boundaries change between runs, altering the probability of row-buffer hits in DRAM. At the device level a row-buffer hit has roughly one half to one third the latency of a miss or conflict, so a shift in row-hit rate changes average memory latency. As a rough illustration, a 10% change in row-hit rate might shift training wall-clock time by a few percent, which is enough to change the scheduling of data loader threads and the ordering of timing-dependent gradient updates.
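The latency argument is simple weighted-average arithmetic. The sketch below works it through with assumed timings (15 ns for a row hit, 45 ns for a miss or conflict); these numbers are placeholders chosen for easy arithmetic, not measurements of any particular DIMM.

```python
def average_latency_ns(row_hit_rate, t_hit_ns, t_miss_ns):
    """Expected DRAM access latency for a given row-buffer hit rate."""
    return row_hit_rate * t_hit_ns + (1.0 - row_hit_rate) * t_miss_ns


# Assumed device timings, for illustration only.
T_HIT_NS = 15.0
T_MISS_NS = 45.0  # three times the hit latency

base = average_latency_ns(0.60, T_HIT_NS, T_MISS_NS)
worse = average_latency_ns(0.50, T_HIT_NS, T_MISS_NS)

print(f"hit rate 60%: {base:.1f} ns")   # 27.0 ns
print(f"hit rate 50%: {worse:.1f} ns")  # 30.0 ns
print(f"increase: {100 * (worse - base) / base:.1f}%")
```

Under these assumptions, a ten-point drop in hit rate raises average DRAM latency by about 11%. How much of that reaches wall-clock time depends on how memory-bound the workload is, which is why the end-to-end effect is much smaller.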

Practical Steps to Stabilize Virtual Memory Layout

Disable ASLR entirely for the training process using setarch x86_64 -R. This ensures the heap starts at the same virtual address every time.
Pre-allocate all tensors before training begins (rather than using dynamic growth). This prevents the OS from interleaving pages from other allocations between your tensor data.
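Use a custom memory allocator like jemalloc with MALLOC_CONF=background_thread:false,narenas:1 to reduce allocator-induced variance. The percpu_arena option takes disabled, percpu, or phycpu rather than true or false; a single arena avoids placement that depends on which CPU a thread happens to run on.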
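Before relying on these steps, it helps to confirm that the layout really is stable. The sketch below reports the system-wide ASLR setting and starts fresh interpreters to compare the address of a newly created object; the helper names are my own. Note that setarch -R changes the personality of one process tree and does not alter the sysctl, so the probe has to be run under setarch to see its effect.

```python
import subprocess
import sys

# Meanings of /proc/sys/kernel/randomize_va_space.
ASLR_MODES = {
    0: "disabled",
    1: "stack, mmap and vDSO randomized",
    2: "stack, mmap, vDSO and heap (brk) randomized",
}


def describe_aslr(value):
    """Human-readable meaning of a randomize_va_space value."""
    return ASLR_MODES.get(int(value), "unknown")


def system_aslr_mode():
    """System-wide ASLR setting, or 'unavailable' off Linux."""
    try:
        with open("/proc/sys/kernel/randomize_va_space") as f:
            return describe_aslr(f.read().strip())
    except (OSError, ValueError):
        return "unavailable"


def probe_heap_addresses(runs=2):
    """Start fresh interpreters and return the address of a new object in each."""
    code = "print(id(object()))"
    addresses = []
    for _ in range(runs):
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, check=True,
        )
        addresses.append(int(result.stdout))
    return addresses


print("system ASLR:", system_aslr_mode())
```

Calling probe_heap_addresses() from a script launched normally will usually return different addresses; launched under setarch x86_64 -R, the addresses should match if the rest of the environment is identical.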

Why Memory Interleaving Matters More for Large-Scale Distributed Training

In single-GPU training, memory interleaving variance is typically at the level of 1e-7 relative difference in loss values, often ignored as “numerical noise.” In distributed data-parallel training with gradient all-reduce, small differences are amplified. For a fixed algorithm, protocol, and topology, NCCL's ring and tree all-reduce apply reductions in a fixed order, so arrival timing alone should not change the sum. The reduction order does change if NCCL selects a different algorithm, protocol, or ring layout between runs, or if the framework groups gradients into communication buckets according to the order in which they become ready, which is timing-dependent. If GPU 0 on node A finishes gradient chunk 1 slightly faster because of favorable DRAM interleaving, that readiness order can differ from a run in which chunk 3 finished first. Over hundreds of steps, this difference can lead to divergent model weights across runs, especially when combined with stochastic rounding or mixed-precision training. I have personally debugged a multi-node BERT training run where the final validation accuracy varied by 0.3% across five identical runs, a difference that vanished once we pinned memory layout.

Checklist for Distributed Training Non-Determinism

Pin NCCL’s algorithm and protocol so they cannot vary between runs: NCCL_ALGO=Ring and NCCL_PROTO=Simple.
Set NCCL_DEBUG=INFO and verify that the tree/ring topology is identical across runs (same GPU ranks in the same order).
Pin each process to a fixed set of CPU cores and NUMA node using numactl --physcpubind and --membind.
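Disable InfiniBand adaptive routing if you use RDMA, since adaptive routing changes message timing. This is a fabric setting configured in the subnet manager; on the NCCL side, NCCL_IB_ADAPTIVE_ROUTING=0 asks NCCL not to use it. Setting OMPI_MCA_btl=^openib only excludes Open MPI's openib transport and does not control adaptive routing.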
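The checklist can be folded into the launcher. The sketch below builds the environment and the numactl prefix for each local rank. The NCCL variables are the ones named above; the rank-to-node and node-to-CPU mappings are hypothetical and must come from your own topology (for example from nvidia-smi topo -m).

```python
def nccl_env(base_env=None):
    """Copy of the environment with NCCL algorithm and protocol pinned."""
    env = dict(base_env or {})
    env.update({
        "NCCL_ALGO": "Ring",
        "NCCL_PROTO": "Simple",
        "NCCL_DEBUG": "INFO",  # logs the chosen topology for comparison
    })
    return env


def rank_command(local_rank, train_argv, rank_to_node, node_to_cpus):
    """numactl-wrapped command binding one rank to its node's CPUs and memory."""
    node = rank_to_node[local_rank]
    cpus = node_to_cpus[node]
    return [
        "numactl",
        f"--physcpubind={cpus}",
        f"--membind={node}",
    ] + list(train_argv)


# Hypothetical dual-socket layout: GPU 0 near node 0, GPU 1 near node 1.
RANK_TO_NODE = {0: 0, 1: 1}
NODE_TO_CPUS = {0: "0-15", 1: "16-31"}

for rank in RANK_TO_NODE:
    print(rank, rank_command(rank, ["python", "train.py"], RANK_TO_NODE, NODE_TO_CPUS))
```

Pass nccl_env(os.environ) and the command for each rank to your process launcher, and keep the mapping tables under version control so every run uses the same binding.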

A Systematic Diagnostic Protocol for Memory-Induced Variance

If you suspect memory interleaving is causing non-deterministic training, here is a step-by-step protocol to isolate it. First, run the training twice with the same software environment and record the final loss and accuracy. If they differ, the source is either timing-dependent (memory, threads, network) or a missed seed. Second, disable ASLR, pin to one NUMA node, use huge pages, and set all applicable CUDA and cuDNN deterministic flags. Rerun twice. If the results now match, one of these controls removed the variance, and memory layout is a likely cause. Third, to identify which specific level (DRAM, cache slice, NUMA) matters most, run a series of single-variable tests, changing one setting at a time relative to the fully controlled configuration:

Test A: re-enable ASLR but keep NUMA balancing disabled. If variance reappears, the culprit is virtual memory layout (and through it, DRAM bank mapping).
Test B: keep NUMA balancing disabled and pin CPU cores. If variance still exists, the issue is likely cache-slice placement (page coloring).
Test C: use 1 GB huge pages (which sharply reduce page coloring effects). If variance drops to zero, cache slices were the problem.
Test D: on multi-node runs, disable adaptive routing in the network fabric (a subnet-manager setting on InfiniBand; NCCL_IB_ADAPTIVE_ROUTING=0 on the NCCL side). If variance disappears, message timing was a contributor.
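Recording the outcome of each test in a uniform way makes the comparison less error-prone. The sketch below checks a set of final losses for bitwise equality and reports the largest relative difference; the function names and the loss values are hypothetical, chosen only to show the output shape.

```python
def bit_pattern(value):
    """Exact representation of a float, suitable for bitwise comparison."""
    return float(value).hex()


def is_bitwise_reproducible(values):
    """True when every run produced exactly the same float."""
    return len({bit_pattern(v) for v in values}) == 1


def max_relative_difference(values):
    """Largest spread between runs, relative to the largest magnitude."""
    low, high = min(values), max(values)
    scale = max(abs(low), abs(high))
    return 0.0 if scale == 0 else (high - low) / scale


def summarize(results):
    """Per-test verdict for a {test_name: [final_loss, ...]} mapping."""
    return {
        name: {
            "reproducible": is_bitwise_reproducible(losses),
            "max_rel_diff": max_relative_difference(losses),
        }
        for name, losses in results.items()
    }


# Hypothetical final losses from two runs per configuration.
results = {
    "baseline": [0.4213561, 0.4213574],
    "all controls": [0.4213561, 0.4213561],
    "test A (ASLR on)": [0.4213561, 0.4213569],
}
for name, verdict in summarize(results).items():
    print(f"{name}: {verdict}")
```

Comparing float.hex() strings avoids the trap of two losses that print identically at default precision but differ in the last bits.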

The key insight is that hardware is not a deterministic black box. The operating system and memory controller make micro-decisions that are not repeatable unless you explicitly constrain them. By controlling page placement, NUMA policies, and cache slice mapping, you can remove much of the hidden variance that undermines reproducibility in AI training.

About this article. This piece was drafted with the help of an AI writing assistant and reviewed by a human editor for accuracy and clarity before publication. It is general information only, not professional medical, financial, legal or engineering advice.