Speculative Decoding
A technique that speeds up AI text generation by having a smaller, faster model propose tokens while a larger model verifies them, reducing wait time without sacrificing quality.
What is Speculative Decoding?
Speculative decoding is an inference optimization technique for large language models (LLMs) that enables faster token generation by leveraging a small, fast draft model to propose multiple tokens ahead of time, while a larger, accurate target model verifies the draft and accepts the longest matching prefix. This process maintains the target model’s output distribution, ensuring that the results are mathematically identical to pure sequential decoding, but with much lower latency.
The key insight is that many tokens in LLM output sequences can be guessed correctly by a much smaller model, and verifying a batch of tokens together with the large model is more efficient than generating each token in strict sequence.
Speculative decoding is structured as a draft-then-verify process that maintains the exact output quality of the target model while achieving 2–3x+ latency reduction in production systems.
How Speculative Decoding Works
Draft-Then-Verify Process
1. Draft Step
The draft model, which is a smaller and faster LLM, generates a batch of candidate next tokens (often 3–8 at a time). This model is designed to “guess ahead,” producing tokens that the target model is likely to accept.
2. Verification Step
The large, accurate target model processes the context (input plus all generated tokens so far) together with the proposed draft tokens in a single forward pass, computing its own probability distribution at each draft position. It then checks each draft token against its own predictions and accepts the longest valid prefix, which could be all, some, or none of the draft tokens.
3. Continuation
If the draft is fully accepted, the verification pass also supplies one additional token from the target model and generation continues. If there is a mismatch, the target model supplies the correct token at the first rejected position itself; either way, the extended sequence becomes the context for the next speculative round.
4. Repeat
This process continues until the output sequence is complete.
Guarantee:
The final output is provably identical to what the target model would produce via naïve, sequential decoding.
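This guarantee comes from the acceptance rule applied during verification. The sketch below illustrates the standard rejection-sampling check for one speculative round; it assumes q_draft and p_target are arrays of per-position probability vectors (with p_target holding one extra position for a bonus token), and the function name and signature are illustrative rather than any particular library's API.
import numpy as np

def verify_draft(draft_tokens, q_draft, p_target, rng=np.random.default_rng()):
    """Accept a prefix of draft_tokens so the combined output follows the
    target model's distribution exactly. q_draft[i] and p_target[i] are the
    draft/target probability vectors at draft position i; p_target has one
    extra entry used for the bonus token. Illustrative sketch only."""
    accepted = []
    for i, tok in enumerate(draft_tokens):
        # Accept the draft token with probability min(1, p_target/q_draft).
        if rng.random() < min(1.0, p_target[i][tok] / q_draft[i][tok]):
            accepted.append(tok)
        else:
            # On rejection, resample from the normalized residual
            # distribution max(p_target - q_draft, 0) at this position.
            residual = np.maximum(p_target[i] - q_draft[i], 0.0)
            corrected = rng.choice(len(residual), p=residual / residual.sum())
            return accepted, int(corrected)
    # Every draft token was accepted: draw a bonus token from the target's
    # distribution at the position right after the last draft token.
    bonus = rng.choice(len(p_target[-1]), p=p_target[-1])
    return accepted, int(bonus)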
Core Terminology
Autoregressive Generation:
Generating tokens one at a time, with each depending on all previous tokens. This is the standard for LLMs like GPT, T5, Llama.
Draft Model:
Small, fast model used to propose candidate tokens in advance. Should be trained or fine-tuned to match the output distribution of the target model as closely as possible.
Target Model:
The large, accurate LLM whose output must be preserved exactly.
Draft Tokens:
Batch of tokens guessed by the draft model for the next sequence positions.
Rejection Sampling:
Statistical mechanism used during verification: each draft token is accepted with probability min(1, p_target/p_draft), and on rejection a replacement is sampled from an adjusted target distribution. This preserves the target's output distribution exactly; under greedy decoding it reduces to accepting only draft tokens that match the target's most probable predictions.
Acceptance Rate (α):
Fraction of draft tokens accepted by the target model. High α means the draft model is well aligned to the target model.
Speculative Token Count (γ):
Number of tokens generated by the draft model in each speculative round.
Acceptance Length (τ):
Expected number of tokens produced per speculative round (accepted draft tokens plus the one token the target model supplies), assuming each draft token is accepted independently with probability α: τ = (1 - α^(γ+1)) / (1 - α). A worked example follows this list.
Inter-Token Latency (ITL):
Time between emitting one output token and the next.
EAGLE-3:
An advanced speculative decoding technique that attaches a lightweight prediction head to the internal layers of the target model itself, eliminating the need for a separate draft model.
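As a quick worked example of the acceptance-length formula above (illustrative values, assuming each draft token is accepted independently with probability α):
def expected_tokens_per_round(alpha: float, gamma: int) -> float:
    """tau = (1 - alpha**(gamma + 1)) / (1 - alpha). With alpha = 1 every
    draft token is accepted and each round yields gamma + 1 tokens."""
    if alpha == 1.0:
        return gamma + 1
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

print(expected_tokens_per_round(alpha=0.8, gamma=5))  # ≈ 3.69 tokens/round
print(expected_tokens_per_round(alpha=0.5, gamma=5))  # ≈ 1.97 tokens/round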
Motivation
LLM Inference Bottleneck
Problem:
- Generating each new token requires a full forward pass through the entire target model, which is slow (especially for models >10B parameters)
- This sequential dependency results in high latency and underutilization of parallel compute in modern GPUs
Solution:
- By allowing a draft model to “guess ahead” and only invoking the large model for verification, token generation can be parallelized
- This reduces overall latency, increases throughput, and enables more responsive LLM-powered applications (chatbots, code assistants, real-time summarization)
Industry Need:
Real-world products such as Google Search’s AI Overviews rely on speculative decoding to serve billions of users with high-quality, low-latency results.
Core Algorithm
Pseudocode
# Greedy-verification variant, shown for clarity; exact sampling uses the
# rejection-sampling rule described above instead of argmax matching.
context = input_tokens
finished = False
while not finished:
    # Draft model proposes γ candidate tokens.
    draft_tokens = draft_model.generate(context, num_tokens=γ)
    # One target forward pass scores the context plus all draft tokens;
    # target_distributions[i] is the target's distribution for the token
    # following context + draft_tokens[:i].
    target_distributions = target_model.predict(context + draft_tokens)
    accepted = []
    for i in range(γ):
        if draft_tokens[i] == argmax(target_distributions[i]):
            accepted.append(draft_tokens[i])
        else:
            break
    context += accepted
    # The same target pass supplies the next token: the correction at the
    # first rejected position, or a bonus token if every draft was accepted.
    next_token = argmax(target_distributions[len(accepted)])
    context.append(next_token)
    if next_token == stop_token:
        finished = True
Detailed Algorithm
1. Initialization:
The target model generates the first token using standard decoding.
2. Draft:
The draft model receives the current context and proposes γ tokens.
3. Verification:
The target model runs a single forward pass over the context plus the γ draft tokens, computing its probability distribution at each position, and compares these with the draft's proposals. It accepts the longest matching prefix; at the first mismatch it stops accepting further draft tokens.
4. Target Token Generation:
After accepting h tokens (h ≤ γ), the target model supplies the next token itself: the corrected token at the first rejected position, or an extra token when all γ drafts were accepted.
5. Repeat:
Context is updated and the process continues until the sequence ends.
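A toy trace of a single round under greedy verification helps make these steps concrete; the tokens and predictions below are invented for illustration (strings stand in for token ids).
# Made-up example: context so far is "The capital of France is".
draft_tokens  = [" Paris", ",", " the", " city"]
target_argmax = [" Paris", ",", " which", " is", " home"]  # last slot = bonus

accepted = []
for d, t in zip(draft_tokens, target_argmax):
    if d == t:
        accepted.append(d)
    else:
        break

next_token = target_argmax[len(accepted)]   # correction (or bonus) token
print(accepted, next_token)                 # [' Paris', ','] ' which'
# This round advances the output by 3 tokens for one target forward pass.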
Performance Metrics
Key Metrics
Acceptance Rate (α):
α = (number of accepted draft tokens) / (number of tokens proposed by draft model)
Speculative Token Count (γ):
Number of tokens proposed per speculative round. Tuning γ impacts speedup and resource use.
Acceptance Length (τ):
Average number of draft tokens accepted per speculative round
Inter-Token Latency (ITL):
Time between generated tokens
Throughput:
Number of tokens generated per second
Practical Example
Suppose:
- Target model generation: 10 ms/token
- Draft model: 1 ms for γ=4 tokens
- Target verification: 1 ms for γ=4 tokens
If an average of 2.5 draft tokens are accepted per round, then each round takes 12 ms (1 ms draft + 1 ms verify + 10 ms for the target-generated token) and yields 3.5 tokens (2.5 accepted plus the target's token).
Effective token time: ≈3.4 ms/token (vs. 10 ms/token baseline)
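A short sanity check of this example's arithmetic, using the same illustrative timings (not measured figures):
draft_ms, verify_ms, target_token_ms = 1.0, 1.0, 10.0
accepted_per_round = 2.5                    # average draft tokens accepted
tokens_per_round = accepted_per_round + 1   # plus the target-supplied token
round_ms = draft_ms + verify_ms + target_token_ms
print(round_ms / tokens_per_round)          # ≈ 3.4 ms/token vs. 10 ms baseline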
Key Factors
Draft Model Alignment:
Higher acceptance rates come from draft models whose output distribution closely matches the target model’s.
Model Size/Architecture:
Larger models benefit more from speculative decoding; the draft model should be significantly faster, but not so small that it poorly predicts the target.
Hardware Constraints:
Both models and their key-value caches must fit in memory.
Batch Size:
Speculative decoding is most effective at low batch sizes (latency-critical applications).
Orchestration Overhead:
Efficient communication and scheduling between models is critical.
Benefits
2–3x+ Latency Reduction:
Empirically demonstrated speedups in Google products and academic benchmarks.
Guaranteed Output Quality:
Outputs are mathematically identical to target model sequential decoding.
Better Hardware Utilization:
Unlocks latent compute power on GPUs/TPUs by batching token checks.
No Retraining Required:
Any pre-trained models can be used as draft/target, though fine-tuning the draft for higher α is beneficial.
Lower Serving Costs:
Fewer machines required for the same throughput.
Used in Major Production Systems:
Google Search AI Overviews, code assistants, summarization tools.
Limitations and Caveats
Increased Memory Use:
Both models (with caches) must fit in memory, reducing batch size.
Throughput Tradeoffs:
At high batch sizes, speculative decoding may not improve and may even decrease throughput due to contention.
Waste if Draft is Poorly Aligned:
If acceptance rate is low (e.g., <0.5), speculative decoding adds overhead without speedup.
Model Compatibility Constraints:
Draft and target should use the same tokenizer and similar architectures for best results.
Orchestration Complexity:
Requires careful engineering for efficient model interaction and cache management.
Less Effective for Small Models or High Batch Loads:
Speedup is most pronounced for large models and latency-sensitive applications.
Implementation Guidance
When to Use Speculative Decoding
Latency-Critical Applications:
Chatbots, code completion, real-time summarization.
Large Models:
Models with >10B parameters, where per-token latency is highest.
Low to Moderate Batch Sizes:
Where user-facing latency is more important than throughput.
When to Avoid
GPU Memory Maxed Out:
Large batch sizes, long context windows.
Low Draft Acceptance Rate:
If draft model struggles to mimic the target.
Small LLMs:
Marginal gain does not justify complexity.
Configuration & Tuning
Draft Model Selection:
Start with a smaller version of your target, fine-tune if possible.
Speculative Token Count (γ):
Typical: 3–8 per round. Tune for your workload.
Acceptance Rate Monitoring:
Track α in production. If α < 0.6, consider tuning the draft model or falling back to standard decoding (see the sketch after this list).
Memory Management:
Monitor GPU memory; use quantization or split models if needed.
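The acceptance-rate monitoring and fallback points above can be wired up with a few counters. The class below is a hypothetical sketch, not any serving framework's API; the threshold and sample-window values are illustrative.
class SpecDecodeMonitor:
    """Tracks the running acceptance rate and signals when to fall back."""

    def __init__(self, alpha_threshold: float = 0.6, min_proposed: int = 1000):
        self.alpha_threshold = alpha_threshold  # illustrative threshold
        self.min_proposed = min_proposed        # wait for enough samples
        self.proposed = 0
        self.accepted = 0

    def record_round(self, num_proposed: int, num_accepted: int) -> None:
        self.proposed += num_proposed
        self.accepted += num_accepted

    def should_fall_back(self) -> bool:
        # Only decide once enough draft tokens have been observed.
        if self.proposed < self.min_proposed:
            return False
        return (self.accepted / self.proposed) < self.alpha_threshold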
Example: vLLM Python API
from vllm import LLM, SamplingParams
# Note: the speculative-decoding arguments vary across vLLM versions; recent
# releases take a speculative_config dict (shown here), while older releases
# use speculative_model= and num_speculative_tokens= directly on LLM(...).
llm = LLM(model="your-target-model",
          speculative_config={"model": "your-draft-model",
                              "num_speculative_tokens": 4})
params = SamplingParams(max_tokens=100)
outputs = llm.generate(["Prompt text"], sampling_params=params)
print(outputs[0].outputs[0].text)
Use Cases and Examples
Real-World Deployments
Google Search AI Overviews:
Powers high-quality, low-latency summaries for billions of users.
Code Generation Tools:
Used by IDE assistants for fast code completion.
Enterprise Chatbots:
Improves user experience and reduces serving cost for high-volume customer support.
Batch Translation/Summarization:
Speeds up long-document translation and summarization, where per-sample latency would otherwise be high.
Research Benchmarks
T5-XXL (11B) with T5-small (60M):
Shows 2–3x speedup on translation tasks.
Llama 70B:
Reports significant latency improvements with speculative decoding.
Best Practices and Tuning
1. Draft Model Selection:
Use a smaller model from the same family and tokenizer; fine-tune for your use-case if possible.
2. Tune γ:
Start with γ=3–5 and increase only if α remains high (see the sweep sketch after this list).
3. Monitor α:
If α drops below 0.5, reduce γ or re-align draft model.
4. Optimize Memory:
Use quantization, multi-GPU, or reduce batch/context if needed.
5. Benchmark:
Test under real workload and hardware.
6. Automate Fallback:
Disable speculative decoding if α or memory pressure crosses thresholds.
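To make the tuning advice above concrete, a small sweep over γ using the acceptance-length formula and an assumed cost model (one target forward pass plus γ draft-token generations per round; the timings are illustrative, not benchmarks) can suggest a starting point before measuring on real hardware:
def expected_ms_per_token(alpha: float, gamma: int,
                          draft_ms_per_token: float = 1.0,
                          target_pass_ms: float = 10.0) -> float:
    """Simplified cost model: per-round cost divided by expected tokens per
    round, tau = (1 - alpha**(gamma + 1)) / (1 - alpha)."""
    tau = (1 - alpha ** (gamma + 1)) / (1 - alpha)
    return (gamma * draft_ms_per_token + target_pass_ms) / tau

for gamma in (2, 4, 6, 8):
    print(gamma, round(expected_ms_per_token(alpha=0.7, gamma=gamma), 2))
# With these assumed costs, gamma around 4 minimizes ms/token; larger gamma
# adds draft work that the acceptance rate no longer pays back.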
Open Source Tools and Frameworks
vLLM:
High-throughput LLM inference engine with speculative decoding.
BentoML:
Guides and framework integration.
Modular MAX:
Out-of-the-box support via config.
TensorRT-LLM (Baseten):
High-performance deployment with speculative decoding.
References
- Google Research Blog: Looking Back at Speculative Decoding
- arXiv: Fast Inference from Transformers via Speculative Decoding
- NVIDIA Technical Blog: An Introduction to Speculative Decoding
- BentoML: Speculative Decoding
- Baseten: A Quick Introduction to Speculative Decoding
- arXiv: Speculative Sampling (Stern et al., 2018)
- arXiv: Distributed Speculative Decoding
- vLLM Documentation: Speculative Decoding
- BentoML Blog: Structured Decoding in vLLM
- Modular MAX: Speculative Decoding Documentation
- GitHub: vLLM Project
Related Terms
PagedAttention
A memory management technique that divides the AI model's cache into fixed-size blocks, allowing it to use GPU memory more efficiently and with less fragmentation.