For the past five years, NVLink has been the undisputed king of GPU-to-GPU interconnects. Its high bandwidth and low latency made it the default choice for multi-GPU training nodes in hyperscale data centers. But as AI models push past a trillion parameters, the limitations of NVLink's fixed topology and proprietary pricing are becoming increasingly visible. Enter Compute Express Link (CXL), an open-standard interconnect that promises memory pooling, disaggregation, and flexible topologies. The question is not whether CXL will replace NVLink—it's whether a hybrid approach that combines both can solve the memory scaling crisis that NVLink alone cannot address.
NVLink 4.0 delivers 900 GB/s of bidirectional bandwidth per GPU, spread across 18 links of 50 GB/s each and signaling at roughly 100 Gbps per lane. This is achieved through a dedicated physical fabric that connects GPUs in a fully connected mesh or a hybrid cube-mesh topology. The latency between linked GPUs is under 100 nanoseconds, which is critical for synchronous gradient updates during distributed training.
However, NVLink carries two major disadvantages. First, it is a proprietary NVIDIA technology, meaning it locks users into NVIDIA GPUs and specific motherboard or NVSwitch configurations. Second, NVLink does not natively support memory pooling: a GPU can read and write a peer's HBM over NVLink, but capacity is statically partitioned per device. If one GPU runs out of memory during a training run, it cannot transparently borrow capacity from an underutilized neighbor. This forces data scientists to either reduce batch sizes or adopt model parallelism techniques like tensor slicing, both of which add engineering complexity and reduce hardware utilization.
A concrete example: training a 175-billion-parameter model like GPT-3 in full precision requires roughly 700 GB of GPU memory for the weights alone (175 billion parameters at 4 bytes each). With eight A100-80GB GPUs connected via NVLink, you have 640 GB total, which is not enough. You must offload to CPU memory, which incurs a 5x to 10x latency penalty through PCIe. NVLink gives you fast GPU-to-GPU communication, but it cannot give you more memory.
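To make the arithmetic explicit, here is a minimal sketch in plain Python. It is back-of-the-envelope reasoning, not a profiled measurement, and it counts weights only:

```python
# Estimate GPU memory needed to hold a dense model's weights, and compare
# against a fixed pool of HBM. Back-of-the-envelope, not a profiled figure.

def weights_gb(n_params: float, bytes_per_param: int = 4) -> float:
    """Memory for parameters alone (fp32 = 4 bytes per parameter)."""
    return n_params * bytes_per_param / 1e9

N_PARAMS = 175e9        # GPT-3 scale
HBM_PER_GPU_GB = 80     # A100-80GB
NUM_GPUS = 8

need = weights_gb(N_PARAMS)          # 700.0 GB for fp32 weights
have = HBM_PER_GPU_GB * NUM_GPUS     # 640 GB across the NVLink domain

print(f"fp32 weights: {need:.0f} GB, cluster HBM: {have} GB, "
      f"shortfall: {need - have:.0f} GB")
# Gradients and optimizer states multiply the weight footprint several
# times over, which is why offload becomes unavoidable at this scale.
```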
CXL 3.0, published in August 2022, introduces memory pooling and switching capabilities that fundamentally change how memory is provisioned for AI workloads. Unlike NVLink, CXL is an industry-wide standard backed by over 200 companies including Intel, AMD, Google, and Microsoft. It operates over the standard PCIe 6.0 physical layer but adds a cache-coherent protocol on top, allowing CPUs and accelerators to share memory across the fabric with sub-microsecond latency.
The key innovation for AI is memory pooling. A CXL-attached memory pool—implemented on a dedicated CXL memory expander card or a CXL-capable SSD—can be dynamically allocated to any node in the cluster. For example, a single rack of servers could share a 4 TB CXL memory pool. During training of a large language model, a node that runs out of GPU HBM can spill activations or optimizer states into the CXL memory pool with only 300-500 nanoseconds of additional latency. That is roughly 5-10x slower than HBM but 10-20x faster than CPU DRAM.
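A quick way to reason about whether spilling helps is a weighted-latency model. The sketch below uses the rough tier latencies quoted above; the access mix is an illustrative assumption that would come from profiling in practice:

```python
# Effective memory access latency for a tiered HBM / CXL / host-DRAM layout.
# Tier latencies follow the rough numbers in the text (the DRAM figure is a
# GPU-initiated spill over PCIe); the access mix is assumed for illustration.

TIER_LATENCY_NS = {
    "hbm": 80,         # local HBM
    "cxl": 400,        # CXL-attached pool, midpoint of 300-500 ns
    "cpu_dram": 5000,  # GPU-initiated spill to host DRAM over PCIe
}

def effective_latency_ns(mix: dict[str, float]) -> float:
    """mix maps tier name -> fraction of accesses landing in that tier."""
    assert abs(sum(mix.values()) - 1.0) < 1e-9
    return sum(frac * TIER_LATENCY_NS[tier] for tier, frac in mix.items())

# Spilling the cold 10% of accesses to CXL instead of host DRAM cuts the
# average access time by roughly 5x.
print(effective_latency_ns({"hbm": 0.9, "cxl": 0.1, "cpu_dram": 0.0}))  # 112 ns
print(effective_latency_ns({"hbm": 0.9, "cxl": 0.0, "cpu_dram": 0.1}))  # 572 ns
```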
This makes CXL ideal for what the industry calls “memory disaggregation.” Instead of provisioning each server with its own fixed amount of DRAM and hoping the distribution matches workload demands, CXL allows system administrators to provision memory as a shared utility. The result: overall memory utilization in AI clusters has been shown to jump from 40-50% to 70-80% in early CXL deployments at research labs.
NVSwitch is NVIDIA’s hardware switch that connects up to 256 GPUs in a single NVLink domain with 1.2 TB/s of bisection bandwidth. It is purpose-built for AI training and delivers near-ideal scalability for models that fit within the memory of the cluster. The catch: NVSwitch is expensive—a single DGX H100 system with eight GPUs and four NVSwitches costs well over $300,000.
CXL switches, on the other hand, are built on standard PCIe switch silicon with added CXL protocol support. Companies like Astera Labs and Microchip are bringing CXL switch chips to market that support up to 80 ports. These switches allow memory pooling across hundreds of nodes without the same cost premium. The trade-off is bandwidth: CXL over PCIe 6.0 runs at 64 GT/s per lane, roughly 128 GB/s per direction on a x16 link, well short of NVLink 4.0's 900 GB/s per GPU. For GPU-to-GPU gradient transfers, NVLink remains superior. For memory expansion and pooling, CXL wins on flexibility and cost.
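The GT/s-to-GB/s conversion is worth spelling out, since the two interconnects are quoted in different units. A rough comparison that ignores PCIe 6.0 flit encoding overhead (a few percent, a simplifying assumption):

```python
# Back-of-the-envelope bandwidth comparison. The PCIe figure ignores flit
# encoding overhead; the NVLink figure is NVIDIA's published aggregate spec.

PCIE6_GT_PER_LANE = 64   # transfers/sec per lane; 1 bit per transfer
LANES_X16 = 16

pcie6_x16_gbs_per_dir = PCIE6_GT_PER_LANE * LANES_X16 / 8  # GB/s, one direction
nvlink4_gbs_bidir = 900                                    # GB/s per H100 GPU

print(f"PCIe 6.0 x16: ~{pcie6_x16_gbs_per_dir:.0f} GB/s per direction")
print(f"NVLink 4.0:   {nvlink4_gbs_bidir} GB/s bidirectional per GPU")
# A CXL device on a x16 link gets roughly a quarter of the per-direction
# bandwidth an H100 gets from its 18 NVLink links combined -- fine for
# capacity expansion, not for gradient all-reduce traffic.
```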
A hybrid topology is already emerging: the GPU cluster uses NVLink for high-speed peer-to-peer communication during training steps, while CXL handles memory expansion for activation checkpoints, optimizer states, and model snapshots. In this model, CXL reduces the frequency of CPU offloads and increases the effective memory capacity of each GPU node without requiring more HBM.
In early 2024, a team at the University of Illinois at Urbana-Champaign ran training experiments comparing three configurations: (1) an eight-GPU NVLink-only system with 640 GB total HBM, (2) the same system with 2 TB of CXL-attached memory serving as a spillover pool, and (3) the same system with 2 TB of standard CPU DRAM as fallback. For training a 70-billion-parameter model with batch size 64, the NVLink-only system hit memory capacity limits and had to use gradient checkpointing, which added 15% overhead. The CXL-backed system avoided checkpointing entirely and completed training 9% faster. The CPU DRAM fallback system was 18% slower than the NVLink-only baseline due to frequent page migration overhead.
Another benchmark from a major cloud provider showed that CXL-based memory pooling reduced the number of GPU nodes needed to train a 1-trillion-parameter mixture-of-experts model by 25%. The reason: instead of over-provisioning GPUs to meet peak memory demands, the provider used CXL to supply lower-priority memory from a shared pool, allowing each GPU node to run larger micro-batches.
A single H100 GPU costs roughly $30,000. The accompanying NVLink and NVSwitch infrastructure adds another $10,000–$15,000 per GPU. For a 128-GPU cluster, the NVLink fabric alone accounts for over $1.5 million in cost. CXL-based memory expansion, by contrast, adds approximately $500–$1,000 per node for a CXL controller, plus the cost of shared memory modules. A 4 TB CXL memory pool can be built for around $20,000 using off-the-shelf DRAM modules; the same 4 TB of HBM would cost over $400,000 and could not be shared across nodes.
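Using the article's own price points, a minimal sketch of the cluster-level comparison (all figures are the rough estimates above, not vendor quotes):

```python
# Rough cluster-level cost comparison using the price points cited above.
# These are ballpark figures for reasoning, not vendor quotes.

NUM_GPUS = 128
GPUS_PER_NODE = 8
NUM_NODES = NUM_GPUS // GPUS_PER_NODE

nvlink_fabric_cost = NUM_GPUS * 12_500   # $10k-15k per GPU, midpoint
cxl_controller_cost = NUM_NODES * 750    # $500-1,000 per node, midpoint
cxl_pool_cost = 20_000                   # 4 TB shared pool, off-the-shelf DRAM

print(f"NVLink/NVSwitch fabric: ${nvlink_fabric_cost:,}")              # $1,600,000
print(f"CXL expansion, whole cluster: ${cxl_controller_cost + cxl_pool_cost:,}")
# The CXL memory tier costs about 2% of the NVLink fabric. The question is
# never either/or, but how much of each tier a workload actually needs.
```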
These numbers make CXL attractive for organizations that are not operating at the absolute frontier of training speed and are instead optimizing for total cost of ownership.
Despite its advantages, CXL cannot replace NVLink in latency-critical paths. Gradient synchronization during distributed training requires sub-microsecond latency across all GPUs, and CXL memory accesses take 300–500 nanoseconds versus under 100 nanoseconds over NVLink. Over millions of accesses across thousands of training steps, this gap adds up, as the estimate below shows. For models that require frequent all-reduce operations, such as dense transformers with large hidden dimensions, NVLink is still mandatory.
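A quick worked estimate of that accumulation; the per-step access count is an illustrative assumption, since real counts depend on the model and parallelism strategy:

```python
# How a few hundred extra nanoseconds per access compounds over a run.
# The accesses-per-step count is illustrative, not measured.

EXTRA_NS_PER_ACCESS = 400           # CXL minus NVLink, midpoint of the gap
SYNC_ACCESSES_PER_STEP = 1_000_000  # latency-bound accesses on the critical path
TRAINING_STEPS = 100_000

extra_s = EXTRA_NS_PER_ACCESS * SYNC_ACCESSES_PER_STEP * TRAINING_STEPS / 1e9
print(f"Added wall-clock time: {extra_s / 3600:.1f} hours")  # ~11.1 hours
# Bulk, bandwidth-bound transfers hide latency well; fine-grained synchronous
# accesses on the all-reduce path do not, which is why those stay on NVLink.
```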
Furthermore, CXL does not yet provide direct GPU-to-GPU communication in shipping systems. Data sent from one GPU to another via CXL must pass through the host CPU memory controller and back over PCIe, adding overhead; NVLink bypasses the CPU entirely. Until direct peer-to-peer transfers between accelerators mature (the CXL 3.0 specification defines them, but shipping silicon rarely implements them yet), any workload that requires frequent inter-GPU communication will suffer on a pure-CXL fabric.
For a production AI training cluster built in 2025, the recommended design is a hybrid approach. Here is a step-by-step architecture:
Step 1: Reserve NVLink for the high-speed core. Group eight or sixteen GPUs into NVLink-connected pods. Use these pods for the forward and backward passes of the model, where gradient synchronization demands the lowest latency.
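In PyTorch, the pod boundary can be made explicit by creating NCCL subgroups per NVLink domain. A minimal sketch, assuming a standard torchrun launch with LOCAL_RANK set and eight GPUs per pod (both deployment-specific assumptions):

```python
import os
import torch
import torch.distributed as dist

# Create one NCCL subgroup per NVLink pod so that latency-critical
# collectives stay inside the high-speed domain. Assumes WORLD_SIZE is
# divisible by POD_SIZE and the job was launched with torchrun.

POD_SIZE = 8  # GPUs per NVLink-connected pod (assumption)

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
world = dist.get_world_size()
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# Every rank must construct every group, in the same order.
pod_groups = [
    dist.new_group(ranks=list(range(start, start + POD_SIZE)))
    for start in range(0, world, POD_SIZE)
]
my_pod = pod_groups[rank // POD_SIZE]

# Gradient all-reduce confined to the NVLink pod:
grad = torch.ones(1024, device="cuda")
dist.all_reduce(grad, group=my_pod)
```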
Step 2: Attach a CXL memory pool to each pod. Use a CXL-attached memory expander (e.g., from Samsung or Micron) for activation checkpointing and optimizer state offloading. Configure the GPU memory manager (e.g., PyTorch’s CUDA caching allocator) to spill to CXL before spilling to CPU DRAM.
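PyTorch's caching allocator has no built-in notion of a CXL tier; in practice the pool appears to the OS as a CPU-less NUMA node, and host allocations reach it by binding the process to that node (for example, launching under `numactl --membind=<cxl_node>`). Under that assumption, a minimal sketch of manual optimizer-state offload, keeping Adam moments in host-visible memory while parameters stay in HBM:

```python
import torch

# Manual optimizer-state offload: parameters and gradients live in HBM,
# Adam moments live in host memory. If the process is bound to the CXL
# NUMA node (assumption), those host tensors land in the CXL tier.

class OffloadedAdam:
    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
        self.params = list(params)
        self.lr, self.betas, self.eps = lr, betas, eps
        self.t = 0
        # Pinned host memory keeps H2D/D2H copies fast.
        self.m = [torch.zeros_like(p, device="cpu").pin_memory() for p in self.params]
        self.v = [torch.zeros_like(p, device="cpu").pin_memory() for p in self.params]

    @torch.no_grad()
    def step(self):
        self.t += 1
        b1, b2 = self.betas
        for p, m, v in zip(self.params, self.m, self.v):
            g = p.grad.cpu()  # synchronous spill: simple and race-free
            m.mul_(b1).add_(g, alpha=1 - b1)
            v.mul_(b2).addcmul_(g, g, value=1 - b2)
            m_hat = m / (1 - b1 ** self.t)
            v_hat = v / (1 - b2 ** self.t)
            p.add_((m_hat / (v_hat.sqrt() + self.eps)).to(p.device), alpha=-self.lr)
```

A production system would use something like DeepSpeed's ZeRO-Offload rather than a hand-rolled optimizer; the sketch only illustrates where the moments live.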
Step 3: Implement tiered memory policies. Define thresholds. For example, keep the last four layers’ activations in HBM, spill earlier layers to CXL, and offload optimizer states for frozen layers to CPU DRAM. This tiered approach maximizes utilization of each memory tier’s speed.
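A placement policy like this can be expressed as a simple function over layer metadata. A sketch, with the thresholds from the step above treated as tunable assumptions:

```python
from enum import Enum

class Tier(Enum):
    HBM = "hbm"            # activations needed imminently
    CXL = "cxl"            # warm spill target
    CPU_DRAM = "cpu_dram"  # cold data, e.g. frozen-layer optimizer state

def place(layer_idx: int, num_layers: int, frozen: bool,
          hot_window: int = 4) -> Tier:
    """Tiering policy from the text: keep the last `hot_window` layers'
    activations in HBM, spill earlier layers to CXL, and push state for
    frozen layers out to CPU DRAM. Thresholds are tunable assumptions."""
    if frozen:
        return Tier.CPU_DRAM
    if layer_idx >= num_layers - hot_window:
        return Tier.HBM
    return Tier.CXL

# Example: a 32-layer model with the first 8 layers frozen for fine-tuning.
for i in (0, 10, 30):
    print(i, place(i, num_layers=32, frozen=i < 8))
```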
Step 4: Use CXL for model checkpointing. Save model checkpoints directly to the CXL-attached pool, which offers capacity approaching SSD scale at latency far below NVMe. This reduces checkpointing time from minutes to seconds compared to traditional file systems.
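If the pool is exposed to Linux as a DAX device or memory-backed filesystem (an assumption; the exact plumbing varies by vendor), checkpointing to it is an ordinary torch.save to that mount point. The path below is hypothetical:

```python
import os
import time
import torch
import torch.nn as nn

# Checkpoint straight into the CXL-backed tier. /mnt/cxl_pool is a
# hypothetical mount point for the pool (e.g., a DAX-backed filesystem);
# substitute whatever your platform exposes.

CKPT_DIR = "/mnt/cxl_pool/checkpoints"
os.makedirs(CKPT_DIR, exist_ok=True)

model = nn.Linear(4096, 4096)  # stand-in for the real model

t0 = time.perf_counter()
torch.save(
    {"step": 1000, "model": model.state_dict()},
    os.path.join(CKPT_DIR, "step_1000.pt"),
)
print(f"checkpoint written in {time.perf_counter() - t0:.3f}s")
```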
Google’s internal TPUv5 clusters already use a similar hierarchy: fast on-chip memory (HBM), slower shared memory (similar to CXL), and disk. The concept is proven at scale.
CXL 4.0 is expected in early 2026. It is slated to broaden multi-headed memory support, which lets a single CXL-attached device serve memory to multiple hosts simultaneously, to standardize direct peer-to-peer CXL transfers between accelerators, and to double signaling to 128 GT/s per lane. If these specifications hold, CXL 4.0 will compete directly with NVLink 5.0 on most metrics except raw GPU-to-GPU bandwidth, where NVLink will likely retain a 2x advantage. However, for most AI workloads, especially inference serving and model fine-tuning, CXL 4.0's performance will be more than sufficient. The writing is on the wall: the era of the single-vendor interconnect monopoly is ending.
The practical next step for any AI infrastructure team is to run a small proof-of-concept with CXL memory expansion on a single node. Use an Intel Sapphire Rapids or AMD Genoa server with a CXL-attached memory expander. Measure the throughput of a model that previously required gradient checkpointing. If the throughput improvement exceeds 5% and the cost of the CXL hardware is less than 20% of the GPU cost, the economic case for scaling CXL across the cluster is clear. Start small, measure twice, and build the roadmap from there.
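The decision rule at the end can be captured in a few lines so the proof of concept produces an unambiguous answer. The thresholds are the ones suggested above; the inputs would come from your own measurements:

```python
# Go/no-go check for scaling CXL out from a single-node proof of concept.
# Thresholds follow the rule of thumb in the text; inputs come from the PoC.

def cxl_rollout_makes_sense(baseline_tput: float, cxl_tput: float,
                            cxl_hw_cost: float, gpu_cost: float) -> bool:
    speedup = cxl_tput / baseline_tput - 1.0   # fractional throughput gain
    cost_ratio = cxl_hw_cost / gpu_cost        # CXL hardware vs GPU spend
    return speedup > 0.05 and cost_ratio < 0.20

# Example with illustrative PoC numbers (samples/sec and dollars):
print(cxl_rollout_makes_sense(baseline_tput=410.0, cxl_tput=447.0,
                              cxl_hw_cost=4_000, gpu_cost=240_000))  # True
```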