When a deep learning training run produces slightly different results each time you execute it, the typical response is to blame floating-point non-associativity or stochastic operations like dropout. You set random seeds, you pin CUDA operations, you even enable TensorFlow's determinism flags. Yet the variance persists. What many engineers overlook is the hardware memory subsystem itself. The way data is laid out across DRAM banks, NUMA domains, and virtual-to-physical page mappings is not deterministic by default, and on modern multi-socket servers with high-bandwidth memory (HBM) stacks, the interleaving pattern that results from the operating system's placement choices can silently change the order in which memory requests complete. This article walks through the specific mechanisms by which memory interleaving introduces non-determinism into AI training, and gives you practical methods to detect, measure, and control it.
Modern DDR4 and DDR5 memory is organized into banks, rows, and columns. When two memory accesses target the same bank, the second must wait for the first to complete; this is a bank conflict. GPUs and CPUs with multiple memory channels can issue requests in parallel, but the order in which those requests finish depends on which physical addresses happen to map to the same bank. The operating system assigns physical pages to virtual addresses at runtime, and that mapping is not repeatable across reboots or even across process launches; when address-space layout randomization (ASLR) is active, the virtual layout changes as well. For AI training loops that use atomic operations (for example, gradient accumulation with atomic adds, or synchronizations in data-loading pipelines), the non-deterministic ordering of bank-conflict resolution can change the order in which updates are applied and, because floating-point addition is not associative, the final numerical result.
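To make the last point concrete, here is a minimal sketch of why completion order matters numerically. It does not model DRAM at all; it simply treats each permutation of a list as one possible order in which atomic adds might land. The `contributions` values are hypothetical and chosen so that rounding is visible in double precision.

```python
from itertools import permutations


def accumulate(values):
    """Add values one at a time, like atomic adds landing in this order."""
    total = 0.0
    for v in values:
        total += v
    return total


# Hypothetical per-thread gradient contributions (picked to expose rounding).
contributions = [1e16, 1.0, -1e16, 1.0]

# Each permutation stands for one possible completion order of the adds.
results = {accumulate(order) for order in permutations(contributions)}

print("distinct sums:", sorted(results))
```

The mathematically exact sum is 2.0, yet only some orders produce it. Real gradients are far less extreme, so the differences usually show up in the last few bits rather than in the leading digit.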
- Disable ASLR with personality(ADDR_NO_RANDOMIZE) or the setarch x86_64 -R wrapper. Run the same training twice and compare bitwise outputs.
- Use numactl --membind=0 to reduce cross-socket variance. If results stabilize, interleaving across sockets was the source of non-determinism.

On a dual-socket AMD EPYC or Intel Xeon system, each CPU socket has its own local memory controller. When a process runs on socket 0 but allocates memory that gets physically placed on socket 1, every memory access incurs cross-socket latency. The Linux kernel's default memory policy is "local allocation": it places pages on the node where the allocating thread runs. But if the training process spawns worker threads that migrate between sockets (due to scheduler decisions), the memory pages they allocate can end up on different nodes. Worse, the kernel's automatic NUMA balancing (enabled by default on many distributions) may migrate pages to a different socket mid-training. This migration is not deterministic between runs, and it changes the latency profile of every memory access, which in turn changes the timing of thread synchronizations and the order of atomic gradient updates.
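Before changing any policy, it helps to see where a process's pages actually live. The sketch below assumes a Linux system that exposes /proc/self/numa_maps and /proc/sys/kernel/numa_balancing; the helper names `pages_per_node` and `read_optional` are made up for this example. Counts are in units of each mapping's page size, so treat them as a rough picture rather than an exact byte count.

```python
import re
from collections import Counter
from pathlib import Path

# numa_maps lines contain fields such as "N0=256 N1=12" (pages on each node).
NODE_FIELD = re.compile(r"\bN(\d+)=(\d+)\b")


def pages_per_node(numa_maps_text):
    """Sum the per-node page counts over all mappings in a numa_maps dump."""
    counts = Counter()
    for line in numa_maps_text.splitlines():
        for node, pages in NODE_FIELD.findall(line):
            counts[int(node)] += int(pages)
    return dict(counts)


def read_optional(path):
    """Return the file's text, or None if it does not exist or is unreadable."""
    try:
        return Path(path).read_text()
    except OSError:
        return None


if __name__ == "__main__":
    balancing = read_optional("/proc/sys/kernel/numa_balancing")
    maps = read_optional("/proc/self/numa_maps")
    print("numa_balancing:", balancing.strip() if balancing else "unavailable")
    print("pages per node:", pages_per_node(maps) if maps else "unavailable")
```

Running this inside the training process at a few checkpoints, and comparing the output across runs, shows whether pages are spread over more than one node and whether that spread changes from run to run.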
- Disable automatic NUMA balancing: echo 0 > /proc/sys/kernel/numa_balancing.
- Bind the process to one node with numactl --cpunodebind=0 --membind=0. This eliminates cross-socket memory traffic and confines page placement to a single node.
- Pin threads to cores (taskset or KMP_AFFINITY=granularity=fine,compact,1,0 for OpenMP) to prevent thread migration within the socket.

Intel's last-level cache (LLC) is sliced: each core has a slice, and the hash function that maps physical address to cache slice is deterministic but undocumented. However, the physical address itself depends on OS page allocation. Two consecutive runs of the same training script may get different physical pages for the model weights, which means the weights land in different cache slices. Because cache slice contention is a function of which cores happen to be accessing which slices, the effective memory bandwidth can vary by 5–15% between runs. For bandwidth-bound operations like convolution or attention, this variance directly translates into different numbers of stalled cycles, different instruction retirement rates, and ultimately different floating-point accumulation order, and hence different final loss values.
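To decide whether a hardware counter really moves between runs, you need a number rather than an impression. The following is a small sketch, assuming you have already collected one counter value per run (for example an LLC miss count); the function names and sample values are invented for illustration, and the 2% default mirrors the rule of thumb this article uses for LLC miss counts.

```python
from statistics import mean


def relative_spread(samples):
    """Return (max - min) / mean for a list of per-run counter values."""
    if not samples:
        raise ValueError("need at least one sample")
    avg = mean(samples)
    if avg == 0:
        return 0.0
    return (max(samples) - min(samples)) / avg


def placement_suspect(samples, threshold=0.02):
    """True if the run-to-run spread exceeds the chosen threshold."""
    return relative_spread(samples) > threshold


# Made-up miss counts from three runs of the same script.
llc_misses = [1_000_000, 1_015_000, 990_000]
print(f"spread: {relative_spread(llc_misses):.2%}")
print("suspect memory placement:", placement_suspect(llc_misses))
```

Max-minus-min is deliberately crude; with only a handful of runs it is easier to interpret than a standard deviation, but it is also sensitive to a single outlier run.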
- Lock memory with mlockall(MCL_CURRENT | MCL_FUTURE) combined with posix_memalign to 2 MB boundaries. This keeps the pages resident after allocation (they cannot be swapped out) and, together with disabled NUMA balancing, reduces the chance that they move.
- Use 1 GB huge pages (boot parameters default_hugepagesz=1G hugepagesz=1G hugepages=X). Within a 1 GB page the low 30 bits of every physical address are fixed by the virtual offset, so far fewer address bits can vary between runs and the cache-slice mapping varies much less.
- Compare perf stat -e LLC-load-misses,LLC-store-misses across multiple runs. If the miss count fluctuates by more than 2%, memory placement is likely a non-determinism source.

NVIDIA GPUs with HBM2E spread memory across multiple channels and pseudo-channels internally. The GPU's memory controller uses a hash-based interleaving algorithm to distribute cache lines across channels. This interleaving is deterministic for a given physical page, but the driver's page allocator on the GPU is not guaranteed to be repeatable across allocations. When a training script creates tensors in a non-deterministic order (for example, when using dynamic batching or variable-length inputs), the GPU memory pages assigned to those tensors can differ between runs. The resulting interleaving pattern changes the bank-level parallelism, which changes the latency of atomic gradient updates during backpropagation. On A100 and H100 GPUs, this effect is small but measurable: bitwise identical results require making the order of GPU memory allocations repeatable.
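A first step toward a repeatable allocation order is simply recording it. The sketch below is framework-agnostic and uses no GPU API: `RecordingAllocator` is a hypothetical wrapper, and `bytearray` stands in for whatever real allocation call your framework provides. It logs the name and size of every request and reduces the sequence to a fingerprint that you can compare between runs.

```python
import hashlib


class RecordingAllocator:
    """Wrap an allocation callable and log every request in order."""

    def __init__(self, allocate):
        self._allocate = allocate  # stand-in for a real device allocator
        self.log = []              # list of (name, nbytes) in request order

    def alloc(self, name, nbytes):
        self.log.append((name, nbytes))
        return self._allocate(nbytes)

    def fingerprint(self):
        """SHA-256 over the ordered allocation log."""
        digest = hashlib.sha256()
        for name, nbytes in self.log:
            digest.update(f"{name}:{nbytes}\n".encode())
        return digest.hexdigest()


# Two hypothetical runs that request the same buffers in a different order.
run_a = RecordingAllocator(bytearray)
run_a.alloc("weights", 1024)
run_a.alloc("grads", 1024)

run_b = RecordingAllocator(bytearray)
run_b.alloc("grads", 1024)
run_b.alloc("weights", 1024)

print("same allocation order:", run_a.fingerprint() == run_b.fingerprint())
```

If the fingerprints differ between two runs of the same script, the allocation order itself is not repeatable, and no allocator setting will give you identical page placement until that is fixed.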
- Set CUDA_CACHE_DISABLE=1 to turn off the driver's cache of JIT-compiled kernels (which can change allocation order).
- Use cudaMallocManaged with cudaMemAdviseSetPreferredLocation to pre-allocate GPU pages in a fixed order. Wrap all tensor allocations in a factory that records the allocation sequence.
- Set CUBLAS_WORKSPACE_CONFIG=:4096:8 and enable your framework's cuDNN determinism switch (torch.backends.cudnn.deterministic = True in PyTorch, TF_CUDNN_DETERMINISTIC=1 in older TensorFlow releases), though note that these only control kernel-level determinism, not page placement.

Linux's ASLR randomizes the base address of the stack, heap, and mmap regions. For a Python training script that uses PyTorch or TensorFlow, the heap layout determines the adjacency of tensors in virtual address space. Two tensors allocated consecutively in virtual memory are more likely to be placed in the same DRAM row or on the same NUMA node, depending on the kernel's page coloring policy. When ASLR is active, the relative virtual positions of tensors change between runs, altering the probability of row-buffer hits in DRAM. Row-buffer hits are about 3–4x faster than misses, so a 10% change in row-hit rate can shift training wall-clock time by 2–3%, which cascades into different scheduling of data loader threads and different ordering of gradient updates.
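You can check the current ASLR setting and build a launch command without it in a few lines. This sketch assumes Linux and the util-linux setarch tool; `describe_aslr` and `no_aslr_command` are names invented here, and train.py is a placeholder for your own entry point.

```python
import platform
import sys
from pathlib import Path

# Meaning of /proc/sys/kernel/randomize_va_space per the kernel documentation.
ASLR_MODES = {
    0: "disabled",
    1: "stack, mmap base and vDSO randomized",
    2: "full randomization (including heap)",
}


def describe_aslr(raw):
    """Translate the contents of randomize_va_space into a description."""
    return ASLR_MODES.get(int(raw.strip()), "unknown")


def no_aslr_command(argv, machine=None):
    """Prefix argv with `setarch <machine> -R`, which disables ASLR for the child."""
    machine = machine or platform.machine()
    return ["setarch", machine, "-R", *argv]


if __name__ == "__main__":
    try:
        raw = Path("/proc/sys/kernel/randomize_va_space").read_text()
        print("ASLR:", describe_aslr(raw))
    except OSError:
        print("ASLR: setting not readable on this system")
    # train.py is a placeholder script name.
    print(" ".join(no_aslr_command([sys.executable, "train.py"])))
```

Using setarch affects only the launched process and its children, which is usually preferable to changing the system-wide setting on a shared machine.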
- Launch the training process under setarch x86_64 -R. This ensures the heap starts at the same virtual address every time.
- Use jemalloc with MALLOC_CONF=background_thread:false,percpu_arena:percpu to reduce allocator-induced variance.

In single-GPU training, memory interleaving variance is typically at the level of 1e-7 relative difference in loss values, which is often ignored as "numerical noise." But in distributed data-parallel training with gradient all-reduce, the interaction between memory interleaving and communication is amplified. The NCCL library uses ring or tree algorithms that are sensitive to the order in which gradient chunks arrive at each GPU. If GPU 0 on node A gets gradient chunk 1 slightly faster due to favorable DRAM interleaving, the all-reduce operation may use a different reduction order than if chunk 3 arrived first. Over hundreds of steps, this difference can lead to divergent model weights across runs, especially when combined with stochastic rounding or mixed-precision training. I have personally debugged a multi-node BERT training run where the final validation accuracy varied by 0.3% across five identical runs, a difference that vanished once we pinned memory layout.
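The effect of reduction order is easy to reproduce in isolation. The following toy model is not NCCL's implementation; it only assumes that a ring reduction adds one rank's value after another, starting from some rank, and rounds to float32 after every add. The `grads` values are hypothetical and exaggerated so the difference is obvious.

```python
import struct


def f32(x):
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def ring_reduce(per_rank, start):
    """Sum one value per rank around a ring, beginning at rank `start`."""
    n = len(per_rank)
    total = f32(per_rank[start])
    for step in range(1, n):
        rank = (start + step) % n
        total = f32(total + f32(per_rank[rank]))  # float32 accumulate
    return total


# Hypothetical gradient element held by each of four ranks.
grads = [1e8, 1.0, -1e8, 1.0]

for start in range(len(grads)):
    print(f"start at rank {start}: {ring_reduce(grads, start)}")
```

With real gradients the discrepancy is typically in the low-order bits of each element, but the mechanism is the same: identical inputs, different order, different rounded sum.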
- Fix the NCCL algorithm and protocol with NCCL_ALGO=Ring and NCCL_PROTO=Simple.
- Set NCCL_DEBUG=INFO and verify that the tree/ring topology is identical across runs (same GPU ranks in the same order).
- Pin each rank's CPU cores and memory with numactl --physcpubind and --membind.
- Use MPI_PRELOAD with OMPI_MCA_btl=^openib (which excludes Open MPI's openib transport) to avoid InfiniBand's adaptive routing if using RDMA (adaptive routing changes message timing).

If you suspect memory interleaving is causing non-deterministic training, here is a step-by-step protocol to isolate it. First, run the training twice with the same software environment and record the final loss and accuracy. If they differ, the source is either timing-dependent (memory, threads, network) or a missed seed. Second, disable ASLR, pin to one NUMA node, use huge pages, and set all applicable CUDA and cuDNN deterministic flags. Rerun twice. If results converge, the issue was memory layout. Third, to identify which specific level (DRAM, cache slice, NUMA) matters most, run a series of single-variable tests:
- MLNX_NON_PREFERRED=1 for Mellanox adapters).

The key insight is that hardware is not a deterministic black box. The operating system and memory controller make micro-decisions that are not repeatable unless you explicitly constrain them. By controlling page placement, NUMA policies, and cache slice mapping, you can eliminate the hidden variance that undermines reproducibility in AI training.
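The first step of the diagnostic protocol, comparing two runs, only works if the comparison is truly bitwise. Here is a minimal sketch, assuming each run has logged its per-step loss as a list of Python floats; `bit_pattern` and `first_divergence` are names chosen for this example.

```python
import struct


def bit_pattern(x):
    """Hex string of the IEEE 754 double encoding of x."""
    return struct.pack("<d", x).hex()


def first_divergence(run_a, run_b):
    """Index of the first step whose values differ bitwise, or None if identical."""
    for step, (a, b) in enumerate(zip(run_a, run_b)):
        if bit_pattern(a) != bit_pattern(b):
            return step
    if len(run_a) != len(run_b):
        # One log is a strict prefix of the other.
        return min(len(run_a), len(run_b))
    return None


# Hypothetical per-step losses from two runs.
loss_a = [0.6931, 0.5, 0.1 + 0.2]
loss_b = [0.6931, 0.5, 0.3]

print("first diverging step:", first_divergence(loss_a, loss_b))
```

Comparing bit patterns rather than using `==` or a tolerance means that 0.0 and -0.0 count as different and that two identical NaN values count as equal, which is what you want when the question is whether two runs took exactly the same numerical path.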