AI & Technology

Why Speculative Decoding Is Quietly Halving LLM Latency Without Quality Loss

May 12 · 8 min read · AI-assisted · human-reviewed

Large language models are getting faster, but not fast enough for latency-critical applications like real-time chatbots, interactive coding assistants, and voice-driven AI. Most optimization techniques force a trade-off: quantization reduces precision, pruning loses parameters, and caching helps only with repeated queries. Speculative decoding takes a different path. Instead of modifying the target model, it pairs it with a smaller, faster draft model that guesses multiple tokens at once. The target model then verifies those guesses in parallel, turning sequential token generation into a batch verification step. In practice, this cuts end-to-end latency by 40 to 60 percent on consumer GPUs like the NVIDIA RTX 4090 and server-grade H100s alike, with zero change to the final output. This article walks through the mechanics, the deployment pitfalls, and the specific conditions where speculative decoding delivers—or fails to deliver—on its promise.

How speculative decoding differs from every other latency trick

Standard autoregressive decoding generates one token at a time. Each forward pass through the model computes probabilities over the entire vocabulary, the sampler picks one token, and that token becomes the input for the next pass. The GPU spends most of its time moving weights from memory to the compute units rather than doing math, a regime known as memory-bandwidth-bound inference. Speculative decoding sidesteps this by having a lightweight draft model, often 10 to 100 times smaller, propose a block of k candidate tokens; the draft still generates them one at a time, but each of its passes is far cheaper than a target pass. The target model then runs a single forward pass over the entire proposed block and accepts or rejects each token based on its own probability distribution. Accepted tokens are kept; the first rejected token triggers a resample from a corrected distribution. The net effect is that the target model processes multiple tokens for roughly the cost of one-and-a-half forward passes rather than k forward passes. This is not a heuristic or a lossy approximation: the output distribution provably matches the target model's exactly, no matter how well the draft is aligned. Alignment only determines how many tokens get accepted, and therefore how large the speedup is.
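
At the level of control flow, the whole technique fits in a short loop. Here is a minimal sketch, with draft_propose and target_verify as hypothetical stand-ins for the two models; a real implementation also manages KV caches, sampling settings, and stop conditions.

```python
def speculative_generate(prompt_tokens, draft_propose, target_verify,
                         k=5, max_new_tokens=256):
    """Sketch of the speculative decoding loop (not a real serving path)."""
    tokens = list(prompt_tokens)
    while len(tokens) - len(prompt_tokens) < max_new_tokens:
        # 1. The small draft model generates k candidate tokens (cheap passes).
        draft_tokens = draft_propose(tokens, k)

        # 2. The target model scores the whole candidate block in ONE forward
        #    pass and returns the accepted prefix plus one corrected token.
        accepted = target_verify(tokens, draft_tokens)

        # 3. Keep whatever survived verification and continue from there.
        tokens.extend(accepted)
    return tokens
```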

The rejection sampling mechanism that guarantees quality

The core insight comes from a pair of 2023 papers by researchers at Google and DeepMind, who formalized speculative decoding using rejection sampling. If the draft model proposes a token with probability q, and the target model assigns it probability p, the token is accepted with probability p/q (capped at 1). When a token is rejected, the target model resamples from the residual distribution max(0, p−q), normalized. This guarantees the final distribution is identical to the target model's, no matter how bad the draft model is. A poor draft model therefore costs you speed, not quality: at very low acceptance rates, the draft's own overhead can push latency back toward, or even past, plain autoregressive decoding. Good draft models, those fine-tuned on the same data distribution, achieve acceptance rates above 0.8 per token, meaning the target model verifies a block of 4–6 tokens in one pass and accepts most of them outright.
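
The accept/resample rule is compact enough to demonstrate numerically. The sketch below uses NumPy with made-up toy distributions rather than real model logits:

```python
import numpy as np

def verify_token(p, q, proposed, rng):
    """Accept or reject one drafted token.

    p: target model's probability vector over the vocabulary
    q: draft model's probability vector over the vocabulary
    proposed: token id sampled by the draft model (from q)
    Returns (token_id, accepted_flag). The returned token is distributed
    exactly according to p, whatever q looks like.
    """
    accept_prob = min(1.0, p[proposed] / q[proposed])
    if rng.random() < accept_prob:
        return proposed, True
    # Rejected: resample from the residual distribution max(0, p - q), normalized.
    residual = np.maximum(p - q, 0.0)
    residual /= residual.sum()
    return rng.choice(len(p), p=residual), False

# Toy example with a 4-token vocabulary and a deliberately mismatched draft.
rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.1, 0.1])       # target distribution
q = np.array([0.25, 0.25, 0.25, 0.25])   # draft distribution
proposed = rng.choice(4, p=q)
token, accepted = verify_token(p, q, proposed, rng)
print(token, accepted)
```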

Choosing the right draft model for your workload

Not all draft models are created equal. The ideal draft model is fast, small, and aligned with the target model's output distribution. For a 70B-parameter target, a 7B or 8B draft is common, but teams deploying to edge devices or high-throughput APIs often use draft models as small as 1.5B parameters. The trade-off is acceptance rate versus draft speed. A 1.5B draft runs extremely fast, often 10x faster than the target on the same GPU, but its guesses are less accurate, so fewer of them are accepted. An 8B draft gets more of its tokens accepted but costs more to run itself. The optimal size depends on the latency target and the batch size. For single-stream inference with a 50-millisecond latency budget, a smaller draft with a lower acceptance rate can still win, because the target model's verification pass is the expensive part. For batched inference, where the GPU is already saturated, the draft model's overhead can erode the gains.
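
One way to reason about that trade-off is the expected-speedup model from the original speculative decoding analysis: with per-token acceptance rate α, block size k, and a draft pass costing a fraction c of a target pass, the expected number of tokens produced per verification cycle is (1 − α^(k+1)) / (1 − α). The helper below treats these as idealized quantities, so read the output as a rough upper bound rather than a measured speedup.

```python
def expected_speedup(alpha, k, c):
    """Rough speedup estimate for speculative decoding.

    alpha: per-token acceptance rate (0 < alpha < 1)
    k:     number of tokens drafted per cycle
    c:     cost of one draft forward pass relative to one target pass
    Assumes independent acceptances and one target pass per cycle.
    """
    expected_tokens = (1 - alpha ** (k + 1)) / (1 - alpha)
    cost_per_cycle = k * c + 1.0   # k draft passes plus one verification pass
    return expected_tokens / cost_per_cycle

# Compare a tiny, fast draft with a larger, better-aligned one.
print(expected_speedup(alpha=0.6, k=5, c=0.02))   # ~2.2x: cheap draft, lower acceptance
print(expected_speedup(alpha=0.85, k=5, c=0.10))  # ~2.8x: pricier draft, higher acceptance
```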

Matching draft and target architectures

The draft model should share the same vocabulary and tokenizer as the target model. If they don't, you pay the cost of detokenizing and re-tokenizing intermediate states, which adds 5–15 percent overhead. Most production setups use a smaller variant of the same model family, for example Llama 3.1 8B as a draft for Llama 3.1 70B, or DeepSeek-Coder 1.3B for the 33B coding model. Cross-family pairing, such as using a TinyLlama draft for a Mistral target, works but often underperforms because the output distributions diverge. Some teams fine-tune the draft model on a small set of target model outputs using distillation, which boosts acceptance rates by 5–10 percentage points without changing the draft architecture.
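
A quick compatibility check is worth running before wiring a draft in. This sketch assumes both checkpoints are available through Hugging Face transformers; the model names are placeholders for your own pair.

```python
from transformers import AutoTokenizer

def same_tokenizer(target_name: str, draft_name: str) -> bool:
    """Sanity-check that the draft and target share a vocabulary.

    Identical vocab sizes and token-to-id mappings mean the draft's proposed
    token ids can be verified by the target directly, with no
    detokenize/re-tokenize step in the middle.
    """
    target_tok = AutoTokenizer.from_pretrained(target_name)
    draft_tok = AutoTokenizer.from_pretrained(draft_name)
    return (target_tok.vocab_size == draft_tok.vocab_size
            and target_tok.get_vocab() == draft_tok.get_vocab())

# Placeholder checkpoint names; substitute your own target/draft pair.
print(same_tokenizer("meta-llama/Llama-3.1-70B-Instruct",
                     "meta-llama/Llama-3.1-8B-Instruct"))
```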

Hardware and batching: where speculative decoding shines and stumbles

Speculative decoding's performance depends heavily on hardware characteristics and batch size. On high-memory-bandwidth GPUs like the H100 (3.35 TB/s), single-token decoding is already comparatively quick, and the gain comes mainly from cutting the number of target-model passes. On older GPUs like the A100 (2.0 TB/s) or consumer cards like the RTX 4090 (roughly 1.0 TB/s), the memory-bandwidth constraint bites harder, and speculative decoding yields proportionally larger speedups. However, for batched inference with batch sizes above 16, the GPU is already well utilized, and the overhead of running a separate draft model per sequence can outweigh the benefit. A 2024 benchmark by a large cloud provider showed that at a batch size of 32, speculative decoding reduced latency by only 12 percent, compared to 55 percent for single-stream inference. Teams running high-throughput APIs should profile both modes: speculative decoding may not help throughput-heavy workloads with large batch sizes, but it is almost always beneficial for latency-critical single-user requests.
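
That profiling does not need framework support: time the same prompt set through both serving paths and compare percentiles. The sketch below assumes each path is exposed as a plain Python callable (generate_standard and generate_speculative are hypothetical wrappers around your server or client).

```python
import statistics
import time

def profile_latency(generate_fn, prompts, warmup=3):
    """Measure per-request wall-clock latency for a generation callable."""
    for p in prompts[:warmup]:              # warm up caches and CUDA kernels
        generate_fn(p)
    latencies = []
    for p in prompts:
        start = time.perf_counter()
        generate_fn(p)
        latencies.append(time.perf_counter() - start)
    latencies.sort()
    return {
        "p50": statistics.median(latencies),
        "p95": latencies[int(0.95 * (len(latencies) - 1))],
    }

# Hypothetical callables wrapping the two serving configurations:
# baseline = profile_latency(generate_standard, prompts)
# speculative = profile_latency(generate_speculative, prompts)
# Run both at your production batch size before committing either way.
```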

Fallback strategies when acceptance rates drop

Domain shift is the biggest risk. If your deployment receives inputs from a distribution the draft model has not been fine-tuned on—for instance, legal documents in a chat-oriented pipeline—acceptance rates can fall below 0.4, and speculative decoding may actually increase latency due to overhead. The fix is to implement a fallback: monitor the running acceptance rate over a sliding window of 200 tokens, and if it falls below a threshold like 0.5, revert to standard autoregressive decoding for that request. The best production systems log this switch and alert the team, triggering a draft model update or fine-tuning cycle. Some frameworks, like vLLM and TensorRT-LLM, already support speculative decoding with automatic fallback.
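
A minimal sketch of that monitor is below; it assumes the serving loop reports each verification outcome to it, and the window size and threshold mirror the numbers above.

```python
from collections import deque

class AcceptanceMonitor:
    """Track draft-token acceptance over a sliding window and flag fallback."""

    def __init__(self, window=200, threshold=0.5):
        self.window = deque(maxlen=window)
        self.threshold = threshold

    def record(self, accepted: bool) -> None:
        self.window.append(1 if accepted else 0)

    def should_fall_back(self) -> bool:
        # Only decide once the window holds enough samples to be meaningful.
        if len(self.window) < self.window.maxlen:
            return False
        return sum(self.window) / len(self.window) < self.threshold

# Inside the serving loop (pseudocode):
# monitor.record(token_was_accepted)
# if monitor.should_fall_back():
#     switch this request to standard decoding, log the event, alert the team
```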

Real-world deployment: tools and frameworks in 2025

Most major inference servers now include speculative decoding as a built-in feature. vLLM added experimental support in late 2024 and made it stable in version 0.7.0, with a simple configuration flag that accepts a draft model checkpoint. TensorRT-LLM supports it via the Medusa extension, which uses multiple draft heads instead of a separate draft model—a variation called Medusa decoding. For custom deployments, Hugging Face's Text Generation Inference (TGI) added speculative decoding in version 2.5, and the open-source library SpecInfer provides a standalone implementation. The ecosystem has matured enough that building from scratch is rarely justified unless you need tight integration with a proprietary model or custom hardware.
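
In vLLM, for instance, the pairing is a constructor argument on the LLM class. The argument names for the speculative settings have shifted across releases, so treat this as a sketch and check the docs for the version you deploy; the model names are placeholders.

```python
from vllm import LLM, SamplingParams

# Sketch of a vLLM setup with a separate draft model; speculative argument
# names vary by vLLM version, so verify against your installed release.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",              # target model
    speculative_model="meta-llama/Llama-3.1-8B-Instruct",   # draft model
    num_speculative_tokens=5,                               # block size k
    tensor_parallel_size=2,
)

outputs = llm.generate(
    ["Explain speculative decoding in one paragraph."],
    SamplingParams(temperature=0.7, max_tokens=256),
)
print(outputs[0].outputs[0].text)
```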

Measuring the real cost: latency vs. throughput vs. hardware overhead

Speculative decoding is not free. Running two models means two sets of weights must reside in GPU memory. A 70B target plus an 8B draft requires roughly 156 GB at FP16, which fits on a single H100 (80 GB) only with aggressive quantization or sharding across two GPUs. For smaller teams on a single A100 80 GB, the draft model may need to be 2–3 GB at INT4, which limits accuracy. The overhead also includes compute for the draft model's forward pass and the verification pass. In benchmarks with a 7B target and 1.5B draft on an A100, total FLOPs increased by 22 percent, but wall-clock latency dropped by 48 percent because memory bandwidth was the bottleneck. The trade-off makes sense for latency-critical deployments but not for throughput-optimized batch jobs where memory is already saturated. A 2025 paper from a major AI startup reported that speculative decoding cut their chatbot's p95 latency from 2.1 seconds to 1.1 seconds on H100s, with 99.97 percent output match verified via exact string comparison across 10,000 prompts.
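
The memory arithmetic is simple enough to sanity-check directly. The helper below only counts weights, multiplying parameter counts by bytes per parameter; KV cache, activations, and framework overhead come on top.

```python
def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate GPU memory for weights only (billions of params x bytes each)."""
    return params_billion * bytes_per_param

print(weight_memory_gb(70, 2.0) + weight_memory_gb(8, 2.0))   # 70B + 8B at FP16 ~= 156 GB
print(weight_memory_gb(70, 0.5) + weight_memory_gb(8, 0.5))   # both at INT4 ~= 39 GB
print(weight_memory_gb(1.5, 2.0))                             # a 1.5B draft at FP16 ~= 3 GB
```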

When not to use speculative decoding

The preceding sections imply the exclusions, but they are worth stating plainly. Skip speculative decoding when your workload runs at large batch sizes on an already compute-saturated GPU, because the draft overhead can erase the gain. Skip it when no available draft shares the target's tokenizer and you cannot afford to distill one. Skip it when GPU memory cannot hold both sets of weights without quantizing the draft to the point of uselessness, and treat it cautiously when your input distribution shifts faster than you can re-align the draft model.

Draft model distillation: the next frontier for acceptance rates

Teams that invest in fine-tuning their draft model see substantially better results. The process is straightforward: collect a few hundred thousand outputs from the target model on a representative dataset, then train the draft model to imitate those outputs using standard supervised fine-tuning. This distillation step aligns the draft's distribution more closely with the target's, raising acceptance rates from 0.6 to 0.85 for typical chat domains. Some teams go further with reinforcement learning from human feedback on the draft, but that is overkill for most use cases. The cost is modest—a single training run on 4 A100s for a few hours—and the latency improvement is often 10–15 percentage points higher than using an off-the-shelf draft. For production teams serious about latency, this is the highest-ROI investment after basic speculative decoding integration.
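
A compressed sketch of that workflow with Hugging Face transformers is below. The dataset handling is elided, the checkpoint name is a placeholder, and a real run would use a proper trainer with padding masks, evaluation, and more than one pass over the data.

```python
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint name; substitute your real draft model.
DRAFT = "meta-llama/Llama-3.1-8B-Instruct"

# Step 1 (offline, not shown): generate completions with the TARGET model on a
# representative prompt set and store the concatenated prompt + completion texts.
distillation_texts = ["<prompt plus target-model completion goes here>"]

# Step 2: standard supervised fine-tuning of the draft on those texts.
tokenizer = AutoTokenizer.from_pretrained(DRAFT)
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token
draft = AutoModelForCausalLM.from_pretrained(
    DRAFT, torch_dtype=torch.bfloat16, device_map="auto")
optimizer = torch.optim.AdamW(draft.parameters(), lr=1e-5)

loader = DataLoader(distillation_texts, batch_size=4, shuffle=True)
for batch in loader:
    enc = tokenizer(list(batch), return_tensors="pt", padding=True,
                    truncation=True, max_length=2048).to(draft.device)
    # Causal LM loss against the target model's own outputs teaches the draft
    # to imitate the distribution it will later have to predict. A production
    # run would mask padding tokens out of the labels.
    loss = draft(**enc, labels=enc["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

draft.save_pretrained("draft-distilled")
```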

Speculative decoding is not a silver bullet, but it is one of the few optimization techniques that improves latency without any trade-off in output quality. That combination—strict mathematical correctness, zero retraining of the production model, and immediate deployment on existing GPU hardware—makes it a strong candidate for any latency-sensitive AI pipeline in 2025. Start by profiling your target model's average acceptance rate with a lightweight draft on a representative workload. If the rate exceeds 0.5, the gain will be significant. If not, invest in distillation before committing to speculative decoding at scale.

About this article. This piece was drafted with the help of an AI writing assistant and reviewed by a human editor for accuracy and clarity before publication. It is general information only — not professional medical, financial, legal or engineering advice. Spotted an error? Tell us. Read more about how we work and our editorial disclaimer.
