Every millisecond matters when training large models, yet in workloads built from many small kernels a large share of step time goes not to computation but to kernel launch overhead. For a typical Transformer training loop, the CPU issues thousands of tiny kernel invocations per step, each carrying overhead from driver calls, argument marshalling, and scheduling. NVIDIA's CUDA Graphs attack this by collapsing the launch sequence into a single operation, bypassing the CPU bottleneck. But the technique is not universally faster; it imposes constraints on control flow and memory allocation that can hurt dynamic workloads. This article compares CUDA Graphs against traditional dynamic execution across real training scenarios, covering when each makes sense and how to measure the trade-off in your own pipeline.
GPU kernel launches are not free. Each cudaLaunchKernel call requires the CPU to serialize arguments and push them through the driver into a command buffer before the GPU can start work. The launch itself is asynchronous, but the CPU-side cost is paid on every call. For a ResNet-50 training step, profiling reveals roughly 400–600 kernel launches per iteration. In PyTorch's eager mode, each operator in the forward pass, backward pass, and weight update triggers at least one separate launch. Over a 100-epoch run on ImageNet, the CPU can spend hours just pushing kernels to the GPU, time that could go to data loading or model-parallel orchestration.
The real cost surfaces in mixed-precision training with gradient scaling. Automatic Mixed Precision (AMP) adds loss-scaling and overflow-check operations, along with extra synchronisation points, which raise launch counts. On an A100, each kernel launch incurs roughly 3–15 microseconds of overhead depending on argument size. With 500 launches per step, that is 1.5–7.5 milliseconds per iteration of pure control overhead, enough to consume 10–20% of total step time at small-to-medium batch sizes.
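The arithmetic behind that estimate is easy to reproduce. The sketch below is a back-of-the-envelope helper, not a profiler; the function name and its parameters are illustrative assumptions, and the inputs are the figures quoted above.

```python
def launch_overhead_ms(launches_per_step: int, per_launch_us: float) -> float:
    """Return total CPU-side launch overhead per step, in milliseconds."""
    return launches_per_step * per_launch_us / 1000.0


# Figures from the text: 500 launches per step at 3 to 15 microseconds each.
low = launch_overhead_ms(500, 3.0)
high = launch_overhead_ms(500, 15.0)
print(f"launch overhead per step: {low:.1f} to {high:.1f} ms")
```

Plugging in launch counts and per-launch latencies measured on your own hardware gives a quick first estimate before reaching for a profiler.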
The insight behind CUDA Graphs is that many training loops are repetitive: the same sequence of kernels runs with the same dependencies every step. By capturing the graph once and replaying it, you eliminate per-step CPU involvement. The driver pre-computes the execution plan and submits it in bulk, reducing launch overhead by up to 90% in ideal cases.
CUDA Graphs work by recording a snapshot of GPU operations — kernel launches, memory copies, and synchronisation events — into a cudaGraph_t object. Once captured, you instantiate it into an executable graph via cudaGraphInstantiate. The replay function cudaGraphLaunch submits the entire graph as a single unit.
In PyTorch, capture begins with a torch.cuda.CUDAGraph object. You wrap your training step inside the torch.cuda.graph context manager, which records the operations instead of running them eagerly. A key requirement: input tensors must be allocated before capture and refilled in place afterwards, because every replay reads and writes the same memory addresses. This means static shapes and fixed memory pools. For models with variable-length inputs, which are common in NLP, you must pad to a maximum size or maintain separate graphs per bucket.
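The per-bucket approach for variable-length inputs can be sketched in plain Python. The bucket boundaries, function names, and pad id below are hypothetical choices for illustration; in a real pipeline each bucket length would correspond to one captured graph.

```python
from bisect import bisect_left

# Hypothetical bucket boundaries; one CUDA graph would be captured per entry.
BUCKETS = [64, 128, 256, 512]


def pick_bucket(seq_len: int, buckets=BUCKETS) -> int:
    """Return the smallest bucket length that can hold seq_len tokens."""
    idx = bisect_left(buckets, seq_len)
    if idx == len(buckets):
        raise ValueError(
            f"sequence length {seq_len} exceeds largest bucket {buckets[-1]}"
        )
    return buckets[idx]


def pad_to_bucket(token_ids: list, pad_id: int = 0, buckets=BUCKETS) -> list:
    """Right-pad a token list so its length matches its bucket."""
    target = pick_bucket(len(token_ids), buckets)
    return token_ids + [pad_id] * (target - len(token_ids))
```

Fewer buckets mean fewer graphs to store but more wasted compute on padding, so the boundaries are worth tuning against the length distribution of your data.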
Tests on a single A100 with batch size 32 for BERT-Large show that CUDA Graphs reduce step time from 480ms to 390ms — a 19% improvement. The gain is larger at smaller batches because launch overhead dominates. At batch size 8, the improvement jumps to 35%. However, the first iteration after graph capture includes the recording overhead, so benefit appears from the second step onward.
Each instantiated graph stores the full sequence of GPU commands and memory addresses. For a typical Transformer training step with 200 kernels, the graph object consumes 5–20 MB of GPU-accessible memory. The host-side graph object adds another 1–2 MB. This is negligible on a 40 GB or 80 GB A100, but on a 16 GB T4 or a 24 GB RTX 4090 it competes with model weights and activations.
Dynamic execution is the default in PyTorch eager mode and TensorFlow eager execution. Each operation launches its kernel immediately, allowing Python control flow, dynamic shapes, and early exit conditions. This flexibility is critical for models with branching logic — think of reinforcement learning agents that choose different network paths based on state, or transformer decoders that stop at a variable-length sequence.
For training loops with data-dependent control flow, such as adaptive computation time, conditional gradient scaling, or loss masking, dynamic execution is the only practical option. CUDA Graphs require every branch to be captured ahead of time, which forces you either to pad every branch to the longest path or to keep a separate graph instance per branch configuration.
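Keeping one graph per branch configuration amounts to a cache keyed by configuration. The sketch below shows only that bookkeeping in plain Python; GraphCache and the stand-in capture function are hypothetical, and a real capture function would record a CUDA graph for the given branch.

```python
from typing import Any, Callable, Dict, Hashable


class GraphCache:
    """Capture once per configuration key, then reuse the captured object."""

    def __init__(self, capture_fn: Callable[[Hashable], Any]):
        self._capture_fn = capture_fn  # expensive: records a graph for one config
        self._graphs: Dict[Hashable, Any] = {}
        self.captures = 0  # how many times capture actually ran

    def get(self, key: Hashable) -> Any:
        if key not in self._graphs:
            self._graphs[key] = self._capture_fn(key)
            self.captures += 1
        return self._graphs[key]


# Stand-in capture function; a real one would record a CUDA graph for this branch.
cache = GraphCache(lambda key: f"graph-for-{key}")
for branch in ["short_path", "long_path", "short_path"]:
    cache.get(branch)
print(cache.captures)  # 2
```

The cost of this pattern grows with the number of distinct configurations, since each cached graph holds its own memory, which is why it suits a handful of branches rather than open-ended control flow.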
Dynamic execution pays for flexibility per call. Profiling an EfficientNet-B0 training step reveals that 30% of the step time is spent in driver overhead — cudaLaunchKernel plus argument parsing. As GPU compute speeds increase, this overhead becomes a larger fraction of step time. For Hopper H100 GPUs with FP8 tensor cores, the compute-to-launch ratio worsens because kernels finish faster while launch latency stays constant.
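A simple model makes the trend concrete. The sketch below assumes launch overhead and GPU compute do not overlap, which overstates the effect when the CPU runs ahead of the GPU; the function name and the millisecond values are hypothetical, chosen so the baseline matches the 30% figure above.

```python
def launch_share(launch_ms: float, compute_ms: float) -> float:
    """Fraction of step time spent on launch overhead, assuming no overlap."""
    return launch_ms / (launch_ms + compute_ms)


# Hypothetical step: 3 ms of launch overhead and 7 ms of GPU compute.
baseline = launch_share(3.0, 7.0)
# Same launch cost, but the kernels finish twice as fast on a newer GPU.
faster_gpu = launch_share(3.0, 3.5)
print(f"{baseline:.0%} -> {faster_gpu:.0%}")
```

Halving compute time while launch cost stays fixed pushes the overhead share from under a third to nearly half of the step in this toy case.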
In dynamic execution, tensors are allocated and freed per operation. PyTorch's caching allocator serves most of these requests from cached blocks rather than calling the driver each time, but allocation patterns that vary from step to step can still fragment its pools over long training runs. CUDA Graphs freeze memory allocations at capture time, avoiding fragmentation but preventing reuse patterns that vary across steps.
In a Vision Transformer training setup with gradient checkpointing, dynamic execution peaks at 23.4 GB on an A100-40. The CUDA Graphs variant with the same batch size peaks at 24.1 GB — 3% higher — because the graph reserves memory for the worst-case allocation of each tensor in the capture. The extra 700 MB comes from padding allocations in tensors that vary in shape between steps (e.g., attention masks).
Dynamic execution performs an average of 1,200 cudaMalloc calls per 100 steps (about 12 per step) in a ResNet-50 training loop. A captured graph performs no allocations during replay. The allocation overhead itself is not large (~0.2 ms per call, or roughly 2.4 ms per step), but it contributes to memory bus contention when combined with data transfer and convolution streams.
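Not every workload benefits. CUDA Graphs fit best when the same kernel sequence repeats every step, tensor shapes and memory addresses stay fixed, and launch overhead is a meaningful share of step time, as it is with small batches and many small kernels.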
Conversely, avoid CUDA Graphs for: models with dynamic padding per batch (variable-length LSTMs), reinforcement learning with policy-dependent computation graphs, or any pipeline where you frequently change hyperparameters mid-training (e.g., learning rate schedules that modify kernel configurations).
Modern frameworks support mixing captured graphs with eager execution. PyTorch's torch.compile with the Inductor backend traces the static portions of a model into compiled regions, which can be replayed through CUDA Graphs, while leaving sections it cannot trace in eager mode. This hybrid model is gaining traction in 2025 as the default for production training pipelines.
In a GPT-2 training loop, you can capture the forward pass through the attention mechanism — which has fixed operations — while leaving the loss computation and backpropagation in dynamic mode. The captured attention graph reduces launch count by 40% while preserving the flexibility to change loss functions without recapturing. This approach yields a 12% step-time improvement on an RTX 4090.
CUDA Graphs support parameter updates after capture: cudaGraphKernelNodeSetParams edits a kernel node before instantiation, and cudaGraphExecKernelNodeSetParams does the same on an already instantiated executable graph. This allows adjusting values such as learning rate or momentum without rebuilding the entire graph. The catch: an update changes the parameters of existing nodes but cannot add, remove, or reorder them, so any change to the sequence of operations or to tensor shapes requires recapturing the graph.
The simplest production path uses PyTorch's built-in graph capture wrappers. For a standard training loop, add:
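import torch

# Assumes model, loss_fn, optimizer, input_batch, and target already live on the
# GPU, and that a few eager warm-up steps have run on a side stream.
graph = torch.cuda.CUDAGraph()
optimizer.zero_grad(set_to_none=True)
with torch.cuda.graph(graph):
    # Everything in this block is recorded into the graph, not run eagerly.
    output = model(input_batch)
    loss = loss_fn(output, target)
    loss.backward()
    optimizer.step()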
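After capture, run graph.replay() each step. Replay reuses the memory addresses recorded during capture, so load each new batch by copying it into the captured tensors in place (for example, input_batch.copy_(next_batch)) rather than rebinding the names to new tensors. Monitor memory usage via torch.cuda.memory_summary() to catch allocation creep. Also keep every tensor used in the captured step on the same device as at capture time, since the graph is tied to the device and addresses it recorded.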
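A fuller version of the capture-and-replay cycle, including the warm-up that should precede capture, might look like the sketch below. It follows the whole-network capture pattern described in PyTorch's CUDA Graphs documentation; the function name make_graphed_step, its parameters, and the warm-up count are assumptions for illustration, and it needs a CUDA-capable GPU to run.

```python
import torch


def make_graphed_step(model, loss_fn, optimizer, static_input, static_target,
                      warmup_iters=3):
    """Capture one training step and return a function that replays it.

    All arguments are assumed to live on the same CUDA device already.
    """
    # Warm up on a side stream so lazy initialisation happens before capture.
    side = torch.cuda.Stream()
    side.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side):
        for _ in range(warmup_iters):
            optimizer.zero_grad(set_to_none=True)
            loss_fn(model(static_input), static_target).backward()
            optimizer.step()
    torch.cuda.current_stream().wait_stream(side)

    # Capture: the step is recorded into the graph rather than run eagerly.
    graph = torch.cuda.CUDAGraph()
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.graph(graph):
        static_loss = loss_fn(model(static_input), static_target)
        static_loss.backward()
        optimizer.step()

    def step(batch_input, batch_target):
        # Refill the captured tensors in place; their addresses must not change.
        static_input.copy_(batch_input)
        static_target.copy_(batch_target)
        graph.replay()
        return static_loss  # overwritten by the next replay; clone to keep it

    return step
```

Each call to the returned step function costs two device copies and one graph launch. Reading the loss on the CPU forces a synchronisation, so it is worth doing only every few steps.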
For teams already using torch.compile, adding mode='reduce-overhead' enables CUDA graph capture automatically for the compiled region. This is the recommended entry point for 2025 codebases because it handles most of the static-vs-dynamic trade-off transparently, falling back to dynamic execution where graphs cannot apply.
The decision between CUDA Graphs and dynamic execution ultimately comes down to how static your workload is. Measure launch overhead with nsys profiling on a single training iteration. If kernel launch time exceeds 15% of total step time and your shapes are stable, adopt graphs. If your pipeline evolves every few steps — whether from data-dependent branching or adaptive scheduling — dynamic execution remains the practical choice, and you should look instead to reduce driver overhead through better batching of operations at the Python level.
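The rule of thumb above can be written down as a small helper to apply to numbers read off a profile. This is a sketch of the heuristic only; the function name, its parameters, and the returned labels are hypothetical, and the 15% default is the threshold suggested above rather than a universal constant.

```python
def recommend_execution_mode(launch_ms: float, step_ms: float,
                             shapes_stable: bool,
                             threshold: float = 0.15) -> str:
    """Apply the launch-overhead rule of thumb to measured timings."""
    if step_ms <= 0:
        raise ValueError("step_ms must be positive")
    launch_fraction = launch_ms / step_ms
    if shapes_stable and launch_fraction > threshold:
        return "cuda_graphs"
    return "dynamic"


# Hypothetical measurement: 6 ms of launch time in a 30 ms step, stable shapes.
print(recommend_execution_mode(6.0, 30.0, shapes_stable=True))
```

Treat the output as a prompt to run a real before-and-after benchmark, not as a verdict.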