Scaling large language model training beyond a single GPU rack forces a brutal choice: patch together an overlay network using commodity Ethernet or invest in a dedicated, low-latency fabric like InfiniBand or NVSwitch. The wrong pick doesn't just slow training—it can silently idle tens of thousands of dollars worth of H100s while gradient synchronization stalls. This article compares overlay networks (RoCEv2, InfiniBand over Ethernet) with direct GPU interconnects (NVLink, NVSwitch, custom fabrics) across six dimensions: latency, bandwidth, scalability, cost, software complexity, and failure isolation. By the end, you'll know which fabric fits your cluster size, budget, and tolerance for network engineering headaches.
During distributed training, GPUs exchange gradients and activations every micro-batch. For a 70B-parameter model using fully sharded data parallelism (FSDP), all-reduce bandwidth directly limits throughput. If your interconnect saturates at 12.8 GB/s per GPU but keeping the GPUs busy would take an effective 40 GB/s of gradient exchange, you're leaving 68% of compute idle while waiting on the network.
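To make that arithmetic concrete, here is a minimal sketch in Python using the 12.8 GB/s and 40 GB/s figures above; the linear model is an assumption that ignores compute/communication overlap.

```python
# Rough model: if the fabric delivers less bandwidth than gradient exchange
# demands, the shortfall shows up as idle compute. Assumes no overlap between
# computation and communication (worst case).

def idle_fraction(delivered_gb_s: float, required_gb_s: float) -> float:
    """Fraction of each step spent waiting on the network (0.0 to 1.0)."""
    if delivered_gb_s >= required_gb_s:
        return 0.0
    return 1.0 - delivered_gb_s / required_gb_s

# Figures from the text: 12.8 GB/s delivered vs. 40 GB/s required.
print(f"{idle_fraction(12.8, 40.0):.0%} of compute idle")  # -> 68% of compute idle
```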
Overlay networks introduce encapsulation overhead: each packet carries extra headers for tunneling (VXLAN, GRE) plus congestion control logic in software. Direct interconnects bypass almost all of that. NVLink through NVSwitch in a DGX H100 gives each GPU 900 GB/s of aggregate bandwidth to its peers, over 70x what a 100 GbE RoCEv2 link delivers in practice. That gap matters when your training step time drops from 12 seconds to 0.17 seconds purely by switching fabric.
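The header tax itself is easy to quantify. The sketch below assumes a VXLAN overlay, whose outer headers (Ethernet, IP, UDP, VXLAN) add about 50 bytes per frame; the MTU values are illustrative.

```python
# Per-frame header tax of a VXLAN overlay. Header sizes are standard
# (outer Ethernet 14 B + outer IP 20 B + UDP 8 B + VXLAN 8 B); the MTU
# values below are illustrative.
VXLAN_OVERHEAD_BYTES = 14 + 20 + 8 + 8  # 50 bytes of outer headers per frame

def goodput_fraction(mtu_bytes: int) -> float:
    """Share of each MTU-sized frame left for actual payload."""
    return (mtu_bytes - VXLAN_OVERHEAD_BYTES) / mtu_bytes

for mtu in (1500, 9000):  # standard vs. jumbo frames
    print(f"MTU {mtu}: {goodput_fraction(mtu):.1%} goodput")
# ~96.7% and ~99.4%: the header bytes are minor; the real cost is the
# software queuing and congestion-control latency layered on top.
```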
But raw speed isn't everything. Direct interconnects usually limit you to a single vendor's hardware and a maximum node count. Overlays let you mix GPU generations, use cheaper switches, and scale to hundreds of nodes without forklift upgrades. The trade-off is that you'll spend more time tuning kernel parameters and debugging packet drops.
Overlay networks encapsulate GPU-to-GPU traffic inside standard Ethernet frames or IP packets. The two dominant approaches in 2025 are RoCEv2 (RDMA over Converged Ethernet) and hybrid stacks that run InfiniBand's software layer over Ethernet hardware. Both aim to provide RDMA semantics without dedicated InfiniBand switches.
RoCEv2 maps InfiniBand transport onto UDP packets. It delivers RDMA with low CPU overhead, but it demands a carefully configured lossless fabric: Priority Flow Control (PFC), Explicit Congestion Notification (ECN), and DCQCN (Data Center Quantized Congestion Notification) for end-to-end rate control. Without these, packet loss kills RDMA performance, because retransmission is handled entirely in the NIC's transport rather than by TCP, and a single drop can force a large window of data to be resent. In practice, expect bringing up RoCEv2 on a 256-GPU cluster to take an experienced network engineer on the order of three weeks of tuning.
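On the host side, the collective library also has to be pointed at the right NIC and GID. A minimal sketch of how a launcher might do this with NCCL environment variables follows; the device name, GID index, and traffic class are illustrative assumptions that must match your fabric's PFC/ECN setup, not universal values.

```python
import os

# Hypothetical launcher snippet: point NCCL at the RoCE-capable NIC before
# initializing the process group. The device name, GID index, and traffic
# class below are illustrative assumptions; they must match how your fabric's
# PFC/ECN priorities are actually configured.
os.environ.setdefault("NCCL_IB_HCA", "mlx5_0")       # which RDMA device NCCL should use
os.environ.setdefault("NCCL_IB_GID_INDEX", "3")      # typically the RoCEv2 (UDP/IPv4) GID
os.environ.setdefault("NCCL_IB_TC", "106")           # traffic class mapped to the lossless queue
os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")  # interface for bootstrap traffic

import torch.distributed as dist

# Rank, world size, and master address are expected from the launcher
# (torchrun or similar).
dist.init_process_group(backend="nccl")
```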
Measured throughput on a ConnectX-7 NIC over 200 GbE reaches about 180 Gbps per link under ideal conditions. But when 64 GPUs all-reduce simultaneously, contention drops effective per-GPU bandwidth to 80-100 Gbps. That's good enough for models up to 13B parameters with moderate batch sizes. For 70B+ models, you'll hit the ceiling hard.
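A rough way to see where that ceiling sits is to estimate ring all-reduce traffic per step. The sketch below assumes fp16 gradients, 64 GPUs, and the roughly 90 Gbps effective bandwidth quoted above, and it ignores compute/communication overlap, so treat it as an upper bound rather than a prediction.

```python
# Upper-bound estimate of per-step gradient all-reduce time on a ring,
# ignoring compute/communication overlap and gradient accumulation.
# Assumptions: fp16 gradients (2 bytes per parameter), 64 GPUs, and the
# ~90 Gbps effective per-GPU bandwidth quoted above.

def ring_allreduce_seconds(params: float, gpus: int, gbps_per_gpu: float) -> float:
    payload_bytes = params * 2                          # fp16 gradient buffer
    wire_bytes = 2 * (gpus - 1) / gpus * payload_bytes  # ring all-reduce bytes per GPU
    return wire_bytes * 8 / (gbps_per_gpu * 1e9)

for params in (13e9, 70e9):
    t = ring_allreduce_seconds(params, gpus=64, gbps_per_gpu=90)
    print(f"{params / 1e9:.0f}B params: <= {t:.1f} s of communication per step")
# Overlap with the backward pass hides much of this for the 13B case;
# the 70B case is over 5x larger and has nowhere to hide.
```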
Some teams run the InfiniBand software stack (the OpenSM subnet manager, IB verbs) on top of Ethernet hardware. This adds a second encapsulation layer, InfiniBand transport headers carried inside IP packets, which increases latency by 5-8 microseconds compared to native InfiniBand. The advantage is that you can reuse existing Ethernet cabling while gaining InfiniBand's richer QoS and partitioning capabilities. However, most production AI clusters I've audited abandon this hybrid approach because the complexity of debugging packet drops across three protocol layers outweighs any cost savings. It's workable for clusters under 64 GPUs, but above that, the overhead becomes prohibitive.
Direct interconnects wire GPUs together without going through a general-purpose network stack. NVIDIA's NVLink gives each Hopper GPU 900 GB/s of total bidirectional bandwidth to its peers over point-to-point links. NVSwitch sits at the center, providing a non-blocking all-to-all fabric for up to 256 GPUs.
Inside a single DGX H100, the eight GPUs talk over NVLink 4.0 at 900 GB/s each. All-reduce completes in under 100 microseconds. There's no packet loss, no flow control to tweak, no buffer tuning. For model parallelism within a node (tensor parallelism, pipeline parallelism), NVLink is the clear winner. You never worry about network topology because the switch handles routing transparently.
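If you want to verify numbers like that on your own hardware, a short PyTorch/NCCL micro-benchmark is enough. The sketch below assumes a single node launched with torchrun; the message size and iteration counts are arbitrary choices.

```python
# Minimal intra-node all-reduce timing sketch. Launch with, for example:
#   torchrun --nproc_per_node=8 allreduce_bench.py
# The message size and iteration counts are arbitrary choices.
import os
import time
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
torch.cuda.set_device(local_rank)

x = torch.randn(16 * 1024 * 1024, device="cuda")  # 16M fp32 elements, ~64 MB

for _ in range(10):          # warm-up
    dist.all_reduce(x)
torch.cuda.synchronize()

iters = 100
start = time.perf_counter()
for _ in range(iters):
    dist.all_reduce(x)
torch.cuda.synchronize()
mean_s = (time.perf_counter() - start) / iters

if dist.get_rank() == 0:
    print(f"mean all-reduce: {mean_s * 1e6:.0f} us for {x.numel() * 4 / 1e6:.0f} MB")
dist.destroy_process_group()
```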
NVSwitch extends NVLink across nodes. A fully populated DGX SuperPOD with 32 nodes (256 GPUs) uses third-generation NVSwitch to deliver 900 GB/s per GPU across the entire cluster. Training a 175B GPT-3 class model on this fabric sees all-reduce overhead under 2% of total step time. The downside: cost. An NVSwitch chassis runs around $30,000, and you need one per 16 GPUs. For a 1,024-GPU cluster, that's $1.9 million just in interconnect switches. Overlay networks cost roughly one-fifth that.
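Treating those figures as planning estimates rather than vendor quotes, the interconnect line item scales linearly with GPU count, as the quick calculation below shows.

```python
# Back-of-the-envelope switch cost using the rough figures above (not vendor
# pricing): ~$30,000 per NVSwitch chassis, one chassis per 16 GPUs, and an
# overlay fabric at roughly one-fifth of that.
import math

def nvswitch_cost(gpus: int, price: float = 30_000, gpus_per_chassis: int = 16) -> float:
    return math.ceil(gpus / gpus_per_chassis) * price

for gpus in (64, 256, 1024):
    direct = nvswitch_cost(gpus)
    print(f"{gpus:>5} GPUs: direct ~${direct / 1e6:.2f}M, overlay ~${direct / 5 / 1e6:.2f}M")
```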
AMD's Infinity Architecture provides direct GPU-to-GPU links on MI300X systems, offering up to 896 GB/s of aggregate peer bandwidth per GPU across its seven Infinity Fabric links, comparable to NVLink. Intel's Xe Link connects Ponte Vecchio and later GPUs with 168 GB/s per link. Both work well within a single node but lack the multi-node switch ecosystem that NVIDIA provides. For anything beyond one node, you'll still fall back to Ethernet or InfiniBand, which negates some of the benefit. If your budget allows staying within a single 8-GPU server, both AMD and Intel interconnects are perfectly adequate.
Beyond one or two racks, the math changes. Overlay networks suffer from three scalability killers: incast congestion when dozens of GPUs all-reduce into the same switch ports at once; ECMP hash collisions that saturate some links while leaving others idle; and lossless-fabric tuning (PFC, ECN, DCQCN) whose failure modes multiply with every added switch tier.
Direct interconnects avoid these issues because the switch fabric is purpose-built for GPU traffic patterns—it knows that all-reduce messages are short-lived and latency-sensitive. NVIDIA's SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) offloads the reduce operation to the switch itself, cutting data movement by 3x. Overlay networks cannot do this.
I consulted for a fintech startup training a proprietary 30B-parameter model. Their initial cluster used eight nodes of H100 (64 GPUs) connected via RoCEv2 over 200 GbE. Step time for a sequence length of 8k was 3.4 seconds—acceptable for experimentation. When they scaled to 16 nodes (128 GPUs), step time jumped to 7.1 seconds because the Ethernet switch's internal bandwidth was saturated.
They faced two options: upgrade to an NVSwitch-based DGX SuperPOD (cost: $1.2 million) or add a second spine switch and use multi-pathing. They chose the multi-pathing route, which required configuring ECMP (Equal-Cost Multi-Path) and turning on dynamic load balancing. After three weeks of tuning, step time dropped to 4.9 seconds, short of the 2.3 seconds they measured on a borrowed SuperPOD, but at 40% of the cost. The lesson: if you can tolerate step times roughly twice as long, overlays win on cost. If your model must train in days, not weeks, direct interconnects are the only path.
Network failures during a multi-day training run waste GPU-hours. On an overlay network, a single misconfigured PFC buffer can cause all flows in a priority group to stall. Diagnosing this requires filtering RDMA counters, reading ECN marks, and correlating them across 20+ switches. I've seen teams spend six hours chasing a 5% throughput drop that turned out to be a failing transceiver on a leaf switch.
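Most of the evidence for problems like this lives in per-port hardware counters. The sketch below polls the counters that mlx5-based NICs typically expose under sysfs; the device name and counter set are assumptions that vary by NIC and driver, so verify the paths on your own hosts.

```python
# Sketch: sample RoCE congestion and loss counters from sysfs and report what
# changed during a 10-second window. The device name and counter set are what
# mlx5-based NICs (ConnectX-5/6/7) typically expose; availability varies by
# NIC and driver, so verify the paths on your own hosts.
import time
from pathlib import Path

DEVICE, PORT = "mlx5_0", "1"  # assumed RDMA device and port
COUNTERS = [
    "np_ecn_marked_roce_packets",  # ECN marks received: congestion somewhere upstream
    "np_cnp_sent",                 # congestion notifications we sent back to senders
    "rp_cnp_handled",              # notifications we reacted to (DCQCN rate cuts)
    "out_of_sequence",             # usually indicates dropped packets
    "packet_seq_err",
]

def snapshot() -> dict:
    base = Path(f"/sys/class/infiniband/{DEVICE}/ports/{PORT}/hw_counters")
    return {c: int((base / c).read_text()) for c in COUNTERS if (base / c).exists()}

before = snapshot()
time.sleep(10)  # sample while training traffic is flowing
after = snapshot()

for name, old in before.items():
    delta = after.get(name, old) - old
    if delta:
        print(f"{name}: +{delta} in 10 s")
```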
By contrast, direct interconnects have far fewer failure modes. NVLink errors typically manifest as link-down events, which are binary and easy to locate. NVSwitch logs tell you exactly which port has high CRC errors. The trade-off is that when a direct interconnect fails, it often takes down the entire node or rack—you lose 8 GPUs at once—whereas an overlay network degrades gracefully, losing maybe 10% bandwidth per failed link. For clusters where uptime per training run matters more than per-GPU utilization, overlays offer better isolation.
No fabric is universally optimal. Start with an overlay network if you are prototyping or have fewer than 64 GPUs. Migrate to direct interconnects when your training runs consistently face network-induced idling above 15%. Calculate your per-GPU-hour cost (including electricity, cooling, and amortized hardware) and compare against the training time saved—that number will tell you exactly when to make the jump.
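A minimal sketch of that break-even calculation is below; every input is a placeholder to replace with your own numbers.

```python
# Break-even sketch: does the training time a faster fabric saves pay for its
# price premium? Every input is a placeholder to replace with your own numbers.

def fabric_payback_years(gpu_hour_cost: float,      # $/GPU-hour incl. power, cooling, amortization
                         gpus: int,
                         run_hours_overlay: float,  # wall-clock hours per run on the overlay
                         speedup: float,            # step-time speedup on the direct fabric
                         fabric_premium: float,     # extra capex for the direct interconnect
                         runs_per_year: int) -> float:
    hours_saved_per_run = run_hours_overlay * (1 - 1 / speedup)
    annual_savings = hours_saved_per_run * gpus * gpu_hour_cost * runs_per_year
    return fabric_premium / annual_savings

years = fabric_payback_years(gpu_hour_cost=2.50, gpus=128, run_hours_overlay=240,
                             speedup=2.1, fabric_premium=700_000, runs_per_year=6)
print(f"premium paid back in ~{years:.1f} years")
```

If the payback period is comfortably shorter than the cluster's useful life, the faster fabric earns its premium; if not, stay on the overlay.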