Speculative Decoding: Fast LLM Inference Without Quality Loss

Michael Brenndoerfer · January 16, 2026 · 37 min read

Accelerate LLM inference by 2-3x using speculative decoding. Learn how draft models and parallel verification overcome memory bottlenecks without quality loss.


Speculative Decoding

Autoregressive generation, as we covered in Part XVIII, produces text one token at a time. Each forward pass through a large language model generates a single token, which then becomes input for the next pass. For a 70-billion parameter model generating a 500-token response, this means 500 sequential forward passes through the entire network. The process is slow, not because the GPU lacks computational power, but because it spends most of its time waiting for model weights to load from memory. This fundamental bottleneck has motivated the search for clever ways to extract more tokens from each expensive forward pass through the large model.

Speculative decoding attacks this problem with a clever insight: what if we could generate multiple tokens per forward pass of the large model? The approach uses a small, fast draft model to speculatively generate several candidate tokens, then verifies them all at once with the large target model. When the draft model's guesses align with what the target model would have produced, we get multiple tokens for the cost of one large-model forward pass. This technique can deliver 2-3x speedups without any approximation or quality loss. The target model's output distribution remains exactly preserved, meaning the generated text is statistically indistinguishable from what standard autoregressive generation would have produced.

The Memory Bandwidth Bottleneck

To understand why speculative decoding works, we need to understand why autoregressive generation is slow. Modern GPUs have enormous computational throughput. An NVIDIA A100 can perform 312 trillion floating-point operations per second. Yet generating tokens with a large language model barely scratches this capacity. The bottleneck isn't computation: it's memory bandwidth. Understanding this distinction is crucial because it reveals that the solution to slow inference isn't more computational power, but rather smarter use of the data movement that dominates generation time.

During inference, the GPU must load every model weight from memory for each forward pass. A 70B parameter model in 16-bit precision requires loading 140 GB of data per token generated. The A100's memory bandwidth of 2 TB/s means this takes roughly 70 milliseconds, during which the actual matrix multiplications complete almost instantly. The GPU spends over 95% of its time waiting for data transfer. This massive imbalance between data movement and computation is the key insight that speculative decoding exploits.

This phenomenon is captured by the arithmetic intensity metric, which measures the ratio of floating-point operations to bytes transferred. Training has high arithmetic intensity because gradients and activations reuse the same weights many times across large batch sizes. Inference has low arithmetic intensity because each token requires loading all weights for minimal computation. We say inference is memory-bound rather than compute-bound. This distinction is critical because memory-bound workloads cannot be sped up by adding more computational units; they can only be accelerated by reducing memory transfers or amortizing them across more useful work.

In[3]:
Code
# Model parameters for a 70B model
model_params = 70e9  # 70 billion parameters
bytes_per_param = 2  # FP16 precision
model_size_bytes = model_params * bytes_per_param

# Hardware specifications (A100 80GB)
memory_bandwidth = 2e12  # 2 TB/s
compute_throughput = 312e12  # 312 TFLOPS FP16

# Time to load model weights once
load_time = model_size_bytes / memory_bandwidth

# FLOPs for one forward pass (roughly 2 * params for single token)
flops_per_token = 2 * model_params
compute_time = flops_per_token / compute_throughput

# Arithmetic intensity
arithmetic_intensity = flops_per_token / model_size_bytes
Out[4]:
Console
Model size: 140 GB
Memory load time per token: 70.0 ms
Compute time per token: 0.449 ms
Arithmetic intensity: 1.0 FLOPs/byte

Memory loading is 156x slower than compute
Out[5]:
Visualization
Comparison of time spent on memory loading versus computation for a single token generation (70B model on A100). The significant disparity, with over 95% of time spent on data transfer, illustrates the memory bandwidth bottleneck inherent in large model inference.

The arithmetic intensity of 1 FLOP per byte is far below the A100's compute-to-bandwidth ratio of roughly 156 FLOPs per byte, the threshold above which a workload becomes compute-bound. This confirms that inference is deeply memory-bound. We're moving data constantly but barely computing anything with it. The GPU's computational cores sit idle most of the time, waiting for the next batch of weights to arrive from memory.

The key insight for speculative decoding emerges from this analysis: if we process multiple tokens simultaneously, we can amortize the memory loading cost. Loading the model weights once and processing 5 tokens takes nearly the same time as processing 1 token, but produces 5x the output. The challenge is that autoregressive generation seems inherently sequential, since each token depends on all previous tokens. Speculative decoding overcomes this apparent limitation through a clever draft-and-verify strategy that preserves sequential correctness while enabling parallel verification.
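
To see how much this amortization buys, here is a small back-of-the-envelope sketch reusing the A100 numbers from above. Treating compute as scaling linearly with the number of tokens is a simplifying assumption that ignores attention overhead.

# Rough cost model: per-token time when verifying k tokens in one forward pass.
# Assumption: weights are loaded once per pass; compute scales linearly with k.
def time_per_token_ms(k, load_ms=70.0, compute_ms=0.45):
    return (load_ms + k * compute_ms) / k

for k in [1, 2, 5, 8]:
    print(f"k={k}: {time_per_token_ms(k):.1f} ms per token")
# In this idealized model, k=1 costs ~70 ms per token while k=5 drops to ~14.5 ms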

The Speculative Decoding Paradigm

Speculative decoding exploits this observation through a two-model architecture. A small, fast draft model generates multiple candidate tokens. The large target model then verifies all candidates in a single forward pass. If the draft model's predictions match what the target would have produced, we keep them. If they diverge, we reject the mismatched tokens and use the target model's correction. This approach transforms the problem from "how do we make the large model faster" to "how do we predict what the large model will say, so we can verify those predictions efficiently."

Draft Model

A smaller language model used to quickly generate candidate tokens for verification. The draft model should share the same vocabulary as the target model and ideally produce similar probability distributions. The quality of the draft model's alignment with the target determines how many tokens we can accept per round.

Target Model

The large language model whose output distribution we want to preserve exactly. The target model verifies draft tokens and provides corrections when drafts are rejected. Importantly, the target model's output distribution is never approximated; speculative decoding is a lossless acceleration technique.

The process works in rounds. Each round proceeds as follows:

  1. Draft phase: The draft model autoregressively generates $K$ candidate tokens (typically $K = 4$ to $8$)
  2. Verify phase: The target model processes all $K$ candidates in one forward pass, computing the probability of each candidate given the preceding context
  3. Accept/reject phase: Compare draft and target probabilities to decide which candidates to keep
  4. Correction phase: If a candidate is rejected, sample a correction token from an adjusted distribution

The approach is simple: we make educated guesses about what the target model will say, then check those guesses efficiently. When our guesses are good (the draft model is well-aligned with the target), we save significant time. When our guesses are wrong, we still make progress by accepting the target model's correction.

In[6]:
Code
import random


def speculative_decoding_round(
    draft_model, target_model, input_ids, num_draft_tokens=4
):
    """
    One round of speculative decoding.
    Returns accepted tokens and the number accepted.
    """
    # Phase 1: Draft K candidate tokens
    draft_tokens = []
    draft_probs = []
    current_ids = input_ids.copy()

    for _ in range(num_draft_tokens):
        # Get draft model's probability distribution
        p_draft = draft_model.get_next_token_probs(current_ids)
        # Sample from draft distribution
        token = sample_from_distribution(p_draft)
        draft_tokens.append(token)
        draft_probs.append(p_draft[token])
        current_ids = current_ids + [token]

    # Phase 2: Verify all K tokens with target model in one pass
    # Target model computes p(token_i | input + tokens_0..i-1) for all i
    target_probs = target_model.get_all_next_token_probs(
        input_ids, draft_tokens
    )

    # Phase 3 & 4: Accept/reject and correct
    accepted_tokens = []
    for i, token in enumerate(draft_tokens):
        p_target = target_probs[i][token]
        p_draft = draft_probs[i]

        # Accept with probability min(1, p_target / p_draft)
        if random.random() < min(1.0, p_target / p_draft):
            accepted_tokens.append(token)
        else:
            # Reject: sample correction from adjusted distribution
            correction = sample_adjusted_distribution(
                target_probs[i],
                draft_model.get_next_token_probs(input_ids + accepted_tokens),
            )
            accepted_tokens.append(correction)
            break  # Stop after first rejection

    return accepted_tokens

The key efficiency gain comes from the verification phase. When we pass $K$ draft tokens to the target model, we leverage the parallelism of the transformer architecture. Computing attention over a sequence of $K$ additional tokens requires nearly the same memory bandwidth as computing attention for 1 token, since we load the model weights just once. The computational cost increases linearly with $K$, but as we established, the computational cost is negligible compared to memory transfer. This means that verifying 5 draft tokens costs almost the same wall-clock time as verifying 1 token, creating the opportunity for substantial speedups when draft tokens are accepted.

Parallel Verification with Causal Masking

How does the target model verify multiple tokens in one forward pass? The answer lies in the causal attention mask we covered in Part X, Chapter 4. This mechanism, originally designed to enable efficient training on entire sequences, turns out to be exactly what we need for parallel verification during inference.

When we feed the sequence $[\text{prompt}, \text{draft}_1, \text{draft}_2, \text{draft}_3, \text{draft}_4]$ to the target model, the causal mask ensures each position only attends to preceding tokens. This creates a natural structure where each position independently computes the probability of its token given everything that came before.

At position $\text{draft}_1$, the model computes $p(\text{draft}_1 | \text{prompt})$. At position $\text{draft}_2$, it computes $p(\text{draft}_2 | \text{prompt}, \text{draft}_1)$. Each position's output gives us the probability the target model assigns to the draft token at that position, conditioned on everything before it. Crucially, these computations happen simultaneously in a single forward pass, not sequentially as they would in standard autoregressive generation.

In[7]:
Code
import torch
import torch.nn.functional as F


def parallel_verification(target_model, input_ids, draft_tokens):
    """
    Verify K draft tokens in one forward pass.
    Returns probability of each draft token under target model.
    """
    # Concatenate input with draft tokens
    full_sequence = torch.cat([input_ids, draft_tokens], dim=-1)

    # Single forward pass with causal masking
    with torch.no_grad():
        logits = target_model(full_sequence).logits

    # Extract logits at positions where we predict draft tokens
    # Position i predicts token i+1, so we need positions just before each draft token
    num_input = input_ids.shape[-1]
    num_draft = draft_tokens.shape[-1]

    verification_logits = logits[
        :, num_input - 1 : num_input + num_draft - 1, :
    ]

    # Get probability of each draft token
    probs = F.softmax(verification_logits, dim=-1)

    draft_probs = []
    for i in range(num_draft):
        token_id = draft_tokens[0, i].item()
        draft_probs.append(probs[0, i, token_id].item())

    return draft_probs, probs

This parallel verification is what makes speculative decoding efficient. Without it, verifying $K$ tokens would require $K$ forward passes through the target model, eliminating any speedup. With it, we amortize the memory bandwidth cost across $K$ potential tokens. The causal mask ensures that even though we process all positions simultaneously, each position's output depends only on previous positions, maintaining the autoregressive property that makes language model outputs coherent and consistent.

Draft Model Selection

The choice of draft model critically affects speculative decoding performance. A good draft model balances three competing requirements: speed, alignment with the target model, and vocabulary compatibility. Getting this balance right is often the most challenging aspect of deploying speculative decoding in practice.

Speed Requirements

The draft model must be fast enough that generating $K$ tokens takes less time than one target model forward pass. If the draft model is too slow, the combined draft-plus-verify time exceeds standard autoregressive generation, producing a slowdown rather than speedup. This constraint places an upper bound on draft model size, typically limiting it to 10-15% of the target model's parameters.

Consider a target model requiring 100ms per forward pass. If we draft $K=5$ tokens and each draft forward pass takes 15ms, drafting costs 75ms. Total round time is 175ms. For this to beat standard generation, we need to accept more than 1.75 tokens per round on average, since standard generation takes 100ms per token. This threshold determines whether speculative decoding provides a net benefit or a net loss.
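
The break-even point falls directly out of this arithmetic. Here is a minimal sketch using the timing figures assumed in this example:

# Break-even: a speculative round must produce at least as many tokens per unit
# time as standard generation (one token per 100ms target forward pass).
target_ms = 100  # time per target forward pass (assumed)
draft_ms = 15    # time per draft forward pass (assumed)
K = 5            # number of draft tokens per round

round_ms = K * draft_ms + target_ms   # 175 ms per speculative round
break_even = round_ms / target_ms     # tokens needed per round to break even
print(f"Round time: {round_ms} ms, break-even: {break_even:.2f} tokens per round")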

In[8]:
Code
def calculate_speedup(
    target_time_ms, draft_time_ms, num_draft_tokens, acceptance_rate
):
    """
    Calculate speedup from speculative decoding.

    acceptance_rate: probability each draft token is accepted
    """
    # Time for one speculative decoding round
    draft_phase_time = num_draft_tokens * draft_time_ms
    verify_phase_time = target_time_ms  # One target forward pass
    round_time = draft_phase_time + verify_phase_time

    # Expected tokens per round with per-token acceptance rate γ:
    # accepted drafts contribute γ + γ² + ... + γ^K, and one more token is
    # always produced (a correction on rejection, or an extra token when all
    # drafts are accepted), giving (1 - γ^(K+1)) / (1 - γ) in total.
    expected_tokens = sum(
        acceptance_rate**i for i in range(1, num_draft_tokens + 1)
    )
    expected_tokens += (
        1  # Always get at least one token (correction if all rejected)
    )

    # Standard generation time for same tokens
    standard_time = expected_tokens * target_time_ms

    speedup = standard_time / round_time
    return speedup, expected_tokens
In[9]:
Code
target_ms = 100
draft_ms = 15
num_drafts = 5
rates = [0.5, 0.6, 0.7, 0.8, 0.9]

speedup_results = []
for rate in rates:
    speedup, expected = calculate_speedup(target_ms, draft_ms, num_drafts, rate)
    speedup_results.append((rate, expected, speedup))
Out[10]:
Console
Speedup analysis for different acceptance rates:

Target model: 100ms/token, Draft model: 15ms/token, K=5 drafts

Acceptance Rate      Expected Tokens      Speedup   
--------------------------------------------------
0.5                  1.97                 1.12x
0.6                  2.38                 1.36x
0.7                  2.94                 1.68x
0.8                  3.69                 2.11x
0.9                  4.69                 2.68x

The table reveals a crucial insight: acceptance rate dramatically affects speedup. At 90% acceptance, we achieve roughly 2.7x speedup; at 50% acceptance, speedup drops to about 1.1x. This makes draft model alignment the dominant factor in speculative decoding performance. The time invested in finding or training a well-aligned draft model pays dividends throughout the system's deployment lifetime.

Alignment with Target Model

A draft model is well-aligned when its probability distribution closely matches the target model's. If the draft model assigns high probability to the same tokens the target model prefers, acceptance rates will be high. If they diverge significantly, most draft tokens will be rejected. This alignment is measured empirically by running both models on the same inputs and comparing their probability distributions.

Common approaches to obtaining aligned draft models include:

  • Distillation: Train a small model specifically to match the target model's output distribution. This creates the best alignment but requires training infrastructure.
  • Model families: Use a smaller model from the same family (e.g., LLaMA-7B drafting for LLaMA-70B). Same training data and architecture often lead to aligned predictions.
  • Same-model layers: Use early exit from the target model itself, treating the first N layers as a draft model.

The model family approach is most practical for deployment. LLaMA-2-7B achieves approximately 70-80% acceptance rates when drafting for LLaMA-2-70B on typical text. Distilled models can achieve 85%+ acceptance rates. The choice between these approaches depends on the available resources and the importance of maximizing speedup.

Vocabulary Compatibility

The draft and target models must share exactly the same vocabulary and tokenizer. If they tokenize text differently, the draft tokens cannot be verified by the target model, because the token IDs would refer to different subwords.

This requirement constrains draft model selection significantly. You cannot use a GPT-2 draft model with a LLaMA target model, even if they might produce similar text. The tokens simply don't correspond. This constraint means that speculative decoding works best within model families that share tokenizers, and organizations often need to train custom draft models when working with proprietary target models that use unique tokenization schemes.
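
One practical way to check compatibility before committing to a draft model is to compare the tokenizers directly. The sketch below uses Hugging Face tokenizers; the model names are only examples, and any candidate draft-target pair can be substituted.

from transformers import AutoTokenizer

draft_tok = AutoTokenizer.from_pretrained("gpt2")          # candidate draft model
target_tok = AutoTokenizer.from_pretrained("gpt2-medium")  # target model

# Matching vocabulary size is necessary but not sufficient: the token-to-ID
# mapping must be identical, so compare the full vocabularies as well.
same_size = draft_tok.vocab_size == target_tok.vocab_size
same_mapping = draft_tok.get_vocab() == target_tok.get_vocab()
print(f"Same vocab size: {same_size}, identical token mapping: {same_mapping}")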

The Verification Procedure

The verification procedure determines which draft tokens to accept and how to correct rejections. We can construct an acceptance criterion that exactly preserves the target model's output distribution while maximizing the number of accepted tokens. This is not an approximation or heuristic; it is a mathematically guaranteed property that makes speculative decoding a lossless acceleration technique.

Acceptance Criterion

For each draft token, we compare the draft model's probability $q(x)$ with the target model's probability $p(x)$. The acceptance probability is:

\alpha(x) = \min\left(1, \frac{p(x)}{q(x)}\right)

where:

  • $\alpha(x)$: the probability of accepting candidate token $x$
  • $p(x)$: the probability assigned to token $x$ by the target model
  • $q(x)$: the probability assigned to token $x$ by the draft model
  • $1$: the upper bound, ensuring the acceptance probability never exceeds certainty

This formula has an elegant interpretation. When $p(x) \geq q(x)$ (the target model likes this token as much or more than the draft model), we always accept. This makes sense because the target model considers this token at least as likely as the draft model did, so accepting it aligns with the target's preferences. When $p(x) < q(x)$ (the draft model overestimated this token's likelihood), we accept with probability proportional to how much it overestimated. The more the draft model overestimated, the lower our acceptance probability, ensuring we don't bias our output toward tokens the draft model unfairly favored.

In[11]:
Code
def compute_acceptance_probability(p_target, q_draft):
    """
    Compute acceptance probability for a draft token.

    p_target: probability under target model
    q_draft: probability under draft model
    """
    return min(1.0, p_target / q_draft)


# Example: target and draft agree closely
p_target_agree = 0.15
q_draft_agree = 0.18
accept_prob_agree = compute_acceptance_probability(
    p_target_agree, q_draft_agree
)

# Example: target strongly prefers this token
p_target_prefer = 0.25
q_draft_prefer = 0.10
accept_prob_prefer = compute_acceptance_probability(
    p_target_prefer, q_draft_prefer
)

# Example: draft overestimates significantly
p_target_over = 0.05
q_draft_over = 0.20
accept_prob_over = compute_acceptance_probability(p_target_over, q_draft_over)
Out[12]:
Console
Acceptance probabilities for different agreement levels:

Scenario                  p(target)    q(draft)     Accept prob 
------------------------------------------------------------
Close agreement           0.15         0.18         0.83        
Target prefers more       0.25         0.10         1.00        
Draft overestimates       0.05         0.20         0.25        

The results show that we accept with certainty when the target model prefers a token (probability 1.00), but only probabilistically when the draft model is overconfident. This mechanism ensures the accepted tokens strictly follow the target distribution. The mathematical proof of this property, which we will cover in the next chapter, shows that the combination of this acceptance criterion with the correction distribution produces samples that are exactly distributed according to the target model.

Rejection and Correction

When a draft token is rejected, we need to sample a correction token. Simply sampling from the target model's distribution would bias the output: we would undersample tokens that were already accepted and oversample tokens that were never drafted. The correction distribution must account for the tokens that would have been accepted under the draft-then-accept procedure.

The correction distribution adjusts for the tokens that would have been accepted:

p_{\text{correction}}(x) = \frac{\max(0, p(x) - q(x))}{\sum_{x'} \max(0, p(x') - q(x'))}

where:

  • $p_{\text{correction}}(x)$: the adjusted probability of sampling token $x$ as a correction
  • $p(x) - q(x)$: the difference between target and draft probabilities
  • $\max(0, \dots)$: a filter ensuring we only consider under-sampled tokens (where target > draft)
  • $\sum_{x'} \dots$: the normalization constant, computed over the entire vocabulary

This distribution has an intuitive interpretation. It samples only from tokens where the target model assigns more probability than the draft model. These are precisely the tokens that were "under-represented" by the draft model's proposal. By sampling from this residual distribution, we fill in the probability mass that was "missed" by the draft model, ensuring that the overall output distribution matches the target exactly.

In[13]:
Code
import numpy as np


def compute_correction_distribution(p_target, q_draft, vocab_size):
    """
    Compute the correction distribution for rejected tokens.

    p_target: array of target model probabilities over vocabulary
    q_draft: array of draft model probabilities over vocabulary
    """
    # Compute max(0, p - q) for each token
    residual = np.maximum(0, p_target - q_draft)

    # Normalize to get valid distribution
    total = np.sum(residual)
    if total > 0:
        correction_dist = residual / total
    else:
        # Fallback: use target distribution (happens when q >= p everywhere)
        correction_dist = p_target

    return correction_dist


# Create example distributions over small vocabulary
vocab_size = 10
np.random.seed(42)

# Target model distribution
p_target_dist = np.array(
    [0.3, 0.2, 0.15, 0.1, 0.08, 0.07, 0.05, 0.03, 0.01, 0.01]
)

# Draft model distribution (slightly different)
q_draft_dist = np.array(
    [0.25, 0.25, 0.12, 0.12, 0.08, 0.06, 0.05, 0.04, 0.02, 0.01]
)

correction_dist = compute_correction_distribution(
    p_target_dist, q_draft_dist, vocab_size
)
Out[14]:
Console
Distribution comparison and correction:

Token    p(target)    q(draft)     p - q        Correction  
--------------------------------------------------------
0        0.300        0.250        0.050        0.556       
1        0.200        0.250        -0.050       0.000       
2        0.150        0.120        0.030        0.333       
3        0.100        0.120        -0.020       0.000       
4        0.080        0.080        0.000        0.000       
5        0.070        0.060        0.010        0.111       
6        0.050        0.050        0.000        0.000       
7        0.030        0.040        -0.010       0.000       
8        0.010        0.020        -0.010       0.000       
9        0.010        0.010        0.000        0.000       
Out[15]:
Visualization
Probability distributions for target model, draft model, and the resulting correction. The correction distribution (green) selectively samples from tokens where the draft model underestimated the probability ($p > q$), ensuring the final output matches the target distribution.

Notice how the correction distribution emphasizes token 0 and token 2, where the target model assigns more probability than the draft. Token 1 and token 3 receive zero correction probability because the draft model already over-sampled them. This selective correction ensures that when we combine the accepted draft tokens with the correction samples, the overall distribution perfectly matches what the target model would have produced through standard autoregressive generation.

Sequential Acceptance

During verification, we process draft tokens sequentially and stop at the first rejection. If draft token 3 is rejected, we don't verify tokens 4 and beyond. This is because the correction at position 3 changes the context, invalidating the draft model's predictions for subsequent positions. The draft model generated tokens 4 and beyond assuming token 3 would be accepted, so once we substitute a different correction token, those subsequent predictions become meaningless.

This sequential processing means expected accepted tokens follow a geometric-like distribution. With acceptance rate $\gamma$ per token, the expected total number of tokens generated per round is:

\begin{aligned} \mathbb{E}[\text{tokens}] &= 1 + \sum_{k=1}^{K} \gamma^k && \text{(correction + accepted drafts)} \\ &= \sum_{k=0}^{K} \gamma^k && \text{(combine into single series)} \\ &= \frac{1 - \gamma^{K+1}}{1 - \gamma} && \text{(geometric series sum)} \end{aligned}

where:

  • $\mathbb{E}[\text{tokens}]$: the expected number of tokens produced (drafts + correction)
  • $1$: the correction token, which is always generated
  • $\sum_{k=1}^{K} \gamma^k$: the expected number of accepted draft tokens
  • $\gamma$: the probability of accepting a single draft token
  • $K$: the number of draft tokens attempted

This formula sums the always-present correction token with the geometric series of accepted drafts. The geometric series arises because each subsequent draft token can only be accepted if all previous draft tokens were also accepted. The probability of accepting the first two drafts is $\gamma^2$, the first three is $\gamma^3$, and so on. This cumulative structure explains why acceptance rate has such a dramatic effect on speedup, as small improvements in $\gamma$ compound across multiple positions.
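
As a quick sanity check, the closed form agrees with the explicit sum used in calculate_speedup earlier:

# Expected tokens per round: closed form vs. explicit geometric sum, K = 5
K = 5
for gamma in [0.5, 0.7, 0.9]:
    closed_form = (1 - gamma ** (K + 1)) / (1 - gamma)
    explicit_sum = 1 + sum(gamma ** k for k in range(1, K + 1))
    print(f"gamma={gamma}: closed form {closed_form:.2f}, sum {explicit_sum:.2f}")
# gamma=0.5 -> 1.97, 0.7 -> 2.94, 0.9 -> 4.69, matching the speedup table above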

Acceptance Rate

The acceptance rate $\gamma$ is the probability that a given draft token is accepted. It depends on how well the draft model's distribution matches the target model's. This metric captures draft model quality and determines whether speculative decoding will provide meaningful speedups.

Measuring Acceptance Rate

Empirical acceptance rate is measured by running speculative decoding on representative text and tracking the fraction of draft tokens accepted:

In[16]:
Code
def measure_acceptance_rate(draft_probs_list, target_probs_list):
    """
    Measure empirical acceptance rate from recorded probabilities.

    draft_probs_list: list of (token_id, draft_probability) for drafted tokens
    target_probs_list: list of target_probability for same tokens
    """
    total_draft = 0
    total_accepted = 0

    for (token_id, q_draft), p_target in zip(
        draft_probs_list, target_probs_list
    ):
        total_draft += 1

        # Acceptance probability
        accept_prob = min(1.0, p_target / q_draft)

        # Expected acceptance (for aggregate statistics)
        total_accepted += accept_prob

    return total_accepted / total_draft if total_draft > 0 else 0.0


# Simulate acceptance measurement
np.random.seed(123)
n_tokens = 1000

# Simulate well-aligned models (high correlation)
base_probs = np.random.exponential(0.1, n_tokens)
base_probs = base_probs / base_probs.sum() * n_tokens  # Scale

draft_probs = base_probs + np.random.normal(0, 0.02, n_tokens)
draft_probs = np.maximum(draft_probs, 0.001)  # Ensure positive

target_probs = base_probs + np.random.normal(0, 0.015, n_tokens)
target_probs = np.maximum(target_probs, 0.001)


measured_rate = measure_acceptance_rate(
    [(i, d) for i, d in enumerate(draft_probs)], list(target_probs)
)
Out[17]:
Console
Simulated acceptance rate: 97.1%
Based on 1000 draft tokens
Out[18]:
Visualization
Acceptance probability landscape comparing target versus draft probabilities for 1000 simulated tokens. Tokens above the diagonal (where $p > q$) are always accepted, while those below are accepted probabilistically based on the ratio $p/q$.

This simulated rate of roughly 97% is optimistic because the synthetic draft and target probabilities were constructed to track each other very closely. Well-aligned draft models in practice typically land in the 70-85% range, and acceptance rates vary significantly based on the text being generated, the sampling temperature, and the specific draft-target model pairing.

Factors Affecting Acceptance Rate

Several factors influence acceptance rate in practice, and understanding these factors helps you optimize your speculative decoding deployments:

Model family similarity: Draft models from the same family as the target (e.g., LLaMA-7B for LLaMA-70B) typically achieve 70-85% acceptance rates. Models from different families may drop to 40-60%. This improvement comes from shared training data, similar architectures, and aligned tokenization schemes that cause both models to "think" in similar ways about the same inputs.

Temperature: Higher sampling temperatures increase randomness, generally reducing acceptance rates. At temperature 0 (greedy decoding), acceptance is deterministic based on whether draft and target agree on the argmax. As temperature increases, both models become less confident, and their probability distributions flatten, making disagreements more likely even when they agree on the most likely tokens.

Context type: Acceptance rates vary by content. Factual text with predictable continuations achieves higher acceptance than creative writing or code with many valid alternatives. When there is a clear "right answer" that both models recognize, acceptance is high. When multiple reasonable continuations exist, the models may prefer different alternatives, lowering acceptance.

Sequence position: Early tokens in a response may have lower acceptance as the models "warm up" to the context. Later tokens often show higher acceptance once both models have established similar interpretations. This pattern suggests that the models' internal representations converge as they process more context together.

In[19]:
Code
def simulate_acceptance_by_temperature(temperatures, base_alignment=0.8):
    """
    Simulate how acceptance rate changes with temperature.
    Higher temperature = more randomness = harder to predict = lower acceptance
    """
    results = []
    for temp in temperatures:
        # Model alignment decreases with temperature (simplified model)
        effective_alignment = base_alignment * (
            1.0 / (1.0 + 0.5 * (temp - 1.0))
        )
        effective_alignment = max(0.3, min(0.95, effective_alignment))
        results.append(effective_alignment)
    return results


temperatures = [0.1, 0.5, 0.7, 1.0, 1.2, 1.5, 2.0]
acceptance_rates = simulate_acceptance_by_temperature(temperatures)
Out[20]:
Console
Acceptance rate vs temperature (simulated):

Temperature     Acceptance Rate
------------------------------
0.1             95.0%          
0.5             95.0%          
0.7             94.1%          
1.0             80.0%          
1.2             72.7%          
1.5             64.0%          
2.0             53.3%          

As temperature increases, the distributions flatten and diverge, causing the acceptance rate to drop significantly. This highlights why speculative decoding is most effective for lower-temperature, more deterministic generation tasks. For applications like creative writing that benefit from higher temperatures, the speedup from speculative decoding will be more modest.

Acceptance Rate and Speedup Relationship

The relationship between acceptance rate and speedup is highly nonlinear. Small improvements in acceptance rate yield disproportionate speedup gains because more consecutive tokens pass verification. This nonlinearity arises from the geometric nature of sequential acceptance: improving acceptance from 70% to 80% doesn't just improve each token's acceptance by 10%, it dramatically increases the probability of long acceptance runs.

In[21]:
Code
import numpy as np


def theoretical_speedup(gamma, K, draft_ratio=0.1):
    """
    Calculate theoretical speedup for given acceptance rate.

    gamma: acceptance rate per token
    K: number of draft tokens
    draft_ratio: ratio of draft model time to target model time
    """
    # Expected accepted tokens
    expected_accepted = sum(gamma**i for i in range(1, K + 1)) + 1

    # Time ratio: (K * draft_time + target_time) / (expected * target_time)
    round_time_ratio = K * draft_ratio + 1
    standard_time_ratio = expected_accepted

    return standard_time_ratio / round_time_ratio


# Calculate speedups for range of acceptance rates
gammas = np.linspace(0.3, 0.95, 50)
K_values = [4, 6, 8]
draft_ratio = 0.15  # Draft model is 15% of target model time
Out[22]:
Visualization
Theoretical speedup factors as a function of acceptance rate (γ) for different numbers of draft tokens (K). The non-linear relationship shows that improvements in draft model alignment yield disproportionate gains in inference speed, especially when drafting larger batches of tokens.

The plot demonstrates the importance of draft model alignment. Moving from 70% to 85% acceptance rate approximately doubles the speedup. This nonlinear relationship motivates significant investment in draft model quality. Even modest improvements in alignment can translate to substantial real-world performance gains, making draft model optimization a high-leverage activity for production deployments.

Code Implementation

This section presents a complete speculative decoding implementation using the Hugging Face transformers library. We'll use actual models to demonstrate the concept with measurable results. This implementation captures all the key components we've discussed: draft generation, parallel verification, and the acceptance/correction procedure.

In[23]:
Code
%pip install transformers
In[24]:
Code
from transformers import AutoTokenizer, AutoModelForCausalLM

Setting Up Models

For demonstration, we'll use small models that can run on limited hardware. The concepts apply identically to larger production models. The key requirement is that both models share the same vocabulary, which GPT-2 and GPT-2-medium naturally satisfy since they use the same tokenizer.

In[36]:
Code
# Load draft model (smaller)
draft_model_name = "gpt2"  # 124M parameters
draft_tokenizer = AutoTokenizer.from_pretrained(draft_model_name)
draft_model = AutoModelForCausalLM.from_pretrained(draft_model_name)

# Load target model (larger)
target_model_name = "gpt2-medium"  # 355M parameters
target_tokenizer = AutoTokenizer.from_pretrained(target_model_name)
target_model = AutoModelForCausalLM.from_pretrained(target_model_name)

# Ensure same tokenizer (required for speculative decoding)
assert draft_tokenizer.vocab_size == target_tokenizer.vocab_size

# Move to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
draft_model = draft_model.to(device).eval()
target_model = target_model.to(device).eval()

Speculative Decoding Core

The core implementation encapsulates the three main operations: drafting candidate tokens, verifying them against the target model, and applying the acceptance/rejection logic. Each method is designed to be modular and easy to understand.

In[25]:
Code
class SpeculativeDecoder:
    def __init__(self, draft_model, target_model, tokenizer, device="cpu"):
        self.draft_model = draft_model
        self.target_model = target_model
        self.tokenizer = tokenizer
        self.device = device

    @torch.no_grad()
    def draft_tokens(self, input_ids, num_tokens, temperature=1.0):
        """Generate K draft tokens autoregressively."""
        draft_tokens = []
        draft_probs = []
        current_ids = input_ids.clone()

        for _ in range(num_tokens):
            outputs = self.draft_model(current_ids)
            logits = outputs.logits[:, -1, :] / temperature
            probs = F.softmax(logits, dim=-1)

            # Sample from draft distribution
            token = torch.multinomial(probs, num_samples=1)
            draft_tokens.append(token.item())
            draft_probs.append(probs[0, token.item()].item())

            current_ids = torch.cat([current_ids, token], dim=-1)

        return draft_tokens, draft_probs

    @torch.no_grad()
    def verify_tokens(self, input_ids, draft_tokens, temperature=1.0):
        """Verify all draft tokens in one forward pass."""
        # Build sequence with draft tokens
        draft_tensor = torch.tensor([draft_tokens], device=self.device)
        full_ids = torch.cat([input_ids, draft_tensor], dim=-1)

        # Single forward pass
        outputs = self.target_model(full_ids)
        logits = outputs.logits / temperature

        # Extract probabilities at verification positions
        num_input = input_ids.shape[-1]
        num_draft = len(draft_tokens)

        target_probs = []
        target_distributions = []

        for i in range(num_draft):
            pos = num_input + i - 1  # Position predicting draft token i
            probs = F.softmax(logits[:, pos, :], dim=-1)
            target_probs.append(probs[0, draft_tokens[i]].item())
            target_distributions.append(probs[0].cpu().numpy())

        return target_probs, target_distributions

    def accept_reject(
        self,
        draft_tokens,
        draft_probs,
        target_probs,
        target_distributions,
        draft_distributions,
    ):
        """Determine which tokens to accept and generate correction if needed."""
        accepted = []

        for i, (token, q, p) in enumerate(
            zip(draft_tokens, draft_probs, target_probs)
        ):
            # Acceptance probability
            accept_prob = min(1.0, p / q)

            if torch.rand(1).item() < accept_prob:
                accepted.append(token)
            else:
                # Rejection: sample from correction distribution
                p_dist = target_distributions[i]
                q_dist = draft_distributions[i]

                # Correction distribution: max(0, p - q), normalized
                correction = np.maximum(0, p_dist - q_dist)
                correction_sum = correction.sum()

                if correction_sum > 0:
                    correction = correction / correction_sum
                    correction_token = np.random.choice(
                        len(correction), p=correction
                    )
                else:
                    correction_token = np.argmax(p_dist)

                accepted.append(correction_token)
                break  # Stop at first rejection

        # If all accepted, we can sample one more from target
        if len(accepted) == len(draft_tokens):
            # Bonus token from target's next prediction
            pass  # Simplified: skip bonus token for clarity

        return accepted

Running Speculative Decoding

This function orchestrates the complete generation process, repeatedly running speculative decoding rounds until we've generated the desired number of tokens. It also tracks statistics that help us understand the system's performance.

In[26]:
Code
def speculative_generate(
    decoder, prompt, max_tokens=50, num_draft=4, temperature=1.0
):
    """
    Generate text using speculative decoding.
    Returns generated text and statistics.
    """
    input_ids = decoder.tokenizer.encode(prompt, return_tensors="pt").to(
        decoder.device
    )
    generated_tokens = []
    stats = {"rounds": 0, "drafted": 0, "accepted": 0}

    while len(generated_tokens) < max_tokens:
        stats["rounds"] += 1

        # Build current context
        if generated_tokens:
            current_ids = torch.cat(
                [
                    input_ids,
                    torch.tensor([generated_tokens], device=decoder.device),
                ],
                dim=-1,
            )
        else:
            current_ids = input_ids

        # Draft phase
        draft_tokens, draft_probs = decoder.draft_tokens(
            current_ids, num_draft, temperature
        )
        stats["drafted"] += len(draft_tokens)

        # Get draft distributions for correction
        draft_distributions = []
        temp_ids = current_ids.clone()
        for token in draft_tokens:
            outputs = decoder.draft_model(temp_ids)
            probs = F.softmax(outputs.logits[:, -1, :] / temperature, dim=-1)
            draft_distributions.append(probs[0].cpu().numpy())
            temp_ids = torch.cat(
                [temp_ids, torch.tensor([[token]], device=decoder.device)],
                dim=-1,
            )

        # Verify phase
        target_probs, target_distributions = decoder.verify_tokens(
            current_ids, draft_tokens, temperature
        )

        # Accept/reject phase
        accepted = decoder.accept_reject(
            draft_tokens,
            draft_probs,
            target_probs,
            target_distributions,
            draft_distributions,
        )

        generated_tokens.extend(accepted)
        stats["accepted"] += len(accepted)

        # Check for EOS
        if decoder.tokenizer.eos_token_id in accepted:
            break

    # Decode generated tokens
    full_ids = torch.cat(
        [
            input_ids,
            torch.tensor(
                [generated_tokens[:max_tokens]], device=decoder.device
            ),
        ],
        dim=-1,
    )
    generated_text = decoder.tokenizer.decode(
        full_ids[0], skip_special_tokens=True
    )

    return generated_text, stats

Comparing with Standard Generation

To measure the benefit of speculative decoding, we need a baseline. This standard autoregressive generation function provides that baseline, using the same target model but generating one token at a time.

In[27]:
Code
def standard_generate(
    model, tokenizer, prompt, max_tokens=50, temperature=1.0, device="cpu"
):
    """Standard autoregressive generation for comparison."""
    input_ids = tokenizer.encode(prompt, return_tensors="pt").to(device)

    generated = []
    current_ids = input_ids

    for _ in range(max_tokens):
        with torch.no_grad():
            outputs = model(current_ids)
            logits = outputs.logits[:, -1, :] / temperature
            probs = F.softmax(logits, dim=-1)
            token = torch.multinomial(probs, num_samples=1)

        generated.append(token.item())
        current_ids = torch.cat([current_ids, token], dim=-1)

        if token.item() == tokenizer.eos_token_id:
            break

    full_ids = torch.cat(
        [input_ids, torch.tensor([generated], device=device)], dim=-1
    )
    return tokenizer.decode(full_ids[0], skip_special_tokens=True)

Example Usage

Since loading actual models requires significant resources, we demonstrate the workflow with a simulation that captures the key dynamics. This simulation explores how different acceptance rates affect performance without requiring large language models.

In[28]:
Code
import numpy as np


class SimulatedSpeculativeDecoder:
    """
    Simulates speculative decoding to demonstrate the algorithm
    without requiring actual model weights.
    """

    def __init__(self, acceptance_rate=0.75, vocab_size=1000):
        self.acceptance_rate = acceptance_rate
        self.vocab_size = vocab_size

    def simulate_round(self, num_draft=5):
        """Simulate one round of speculative decoding."""
        accepted = 0

        for i in range(num_draft):
            # Simulate acceptance based on rate
            if np.random.random() < self.acceptance_rate:
                accepted += 1
            else:
                # Rejection - add correction token and stop
                accepted += 1  # Correction always added
                break
        else:
            # All accepted - add bonus token in real implementation
            pass

        return accepted

    def simulate_generation(
        self, total_tokens, num_draft=5, draft_time_ms=5, target_time_ms=100
    ):
        """Simulate generating total_tokens with speculative decoding."""
        tokens_generated = 0
        rounds = 0
        total_time_ms = 0

        while tokens_generated < total_tokens:
            rounds += 1
            accepted = self.simulate_round(num_draft)
            tokens_generated += accepted

            # Time for this round: draft all tokens + one target pass
            round_time = num_draft * draft_time_ms + target_time_ms
            total_time_ms += round_time

        # Standard generation time for comparison
        standard_time_ms = total_tokens * target_time_ms

        return {
            "tokens": tokens_generated,
            "rounds": rounds,
            "speculative_time_ms": total_time_ms,
            "standard_time_ms": standard_time_ms,
            "speedup": standard_time_ms / total_time_ms,
            "tokens_per_round": tokens_generated / rounds,
        }


# Run simulations for different acceptance rates
np.random.seed(42)
total_tokens = 100
num_simulations = 50

# Simulation parameters
draft_ms = 5
target_ms = 100
num_drafts = 5
acceptance_rates = [0.5, 0.6, 0.7, 0.8, 0.9]
simulation_results = []

for rate in acceptance_rates:
    decoder = SimulatedSpeculativeDecoder(acceptance_rate=rate)
    rate_stats = []

    for _ in range(num_simulations):
        stats = decoder.simulate_generation(
            total_tokens,
            num_draft=num_drafts,
            draft_time_ms=draft_ms,
            target_time_ms=target_ms,
        )
        rate_stats.append(stats)

    simulation_results.append(
        {
            "rate": rate,
            "speedup_mean": np.mean([s["speedup"] for s in rate_stats]),
            "speedup_std": np.std([s["speedup"] for s in rate_stats]),
            "tokens_mean": np.mean([s["tokens_per_round"] for s in rate_stats]),
            "tokens_std": np.std([s["tokens_per_round"] for s in rate_stats]),
        }
    )
Out[29]:
Console
Speculative Decoding Simulation Results
============================================================
Generating 100 tokens, K=5 draft tokens
Draft model: 5ms/token, Target model: 100ms/token

Acceptance rate: 50%
  Average speedup: 1.55x (±0.14)
  Tokens per round: 1.95 (±0.18)

Acceptance rate: 60%
  Average speedup: 1.90x (±0.15)
  Tokens per round: 2.41 (±0.19)

Acceptance rate: 70%
  Average speedup: 2.20x (±0.17)
  Tokens per round: 2.78 (±0.21)

Acceptance rate: 80%
  Average speedup: 2.67x (±0.22)
  Tokens per round: 3.39 (±0.27)

Acceptance rate: 90%
  Average speedup: 3.18x (±0.21)
  Tokens per round: 4.06 (±0.26)

The simulation confirms our theoretical analysis. At an 80% acceptance rate, we achieve roughly 2.7x speedup, generating over 3 tokens per round on average (the speedups here exceed the earlier table because this simulation assumes a faster 5ms draft model rather than 15ms). This is in line with reported results from production speculative decoding systems, suggesting that the theoretical framework predicts real-world behavior well.

Key Parameters

The key parameters for Speculative Decoding are:

  • draft_model: The smaller, faster model used to generate candidate tokens.
  • target_model: The larger model used to verify candidates and guarantee output distribution.
  • acceptance_rate ($\gamma$): The probability that a draft token matches the target model's preference.
  • num_draft_tokens ($K$): The number of candidate tokens generated per round. Typically 4-8.
  • temperature: Controls randomness. Higher temperature usually reduces acceptance rate.
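
Assuming the GPT-2 models from the setup cells are loaded, these parameters come together roughly as follows. This is a sketch against the functions defined in this chapter, not a library API, and the prompt is arbitrary.

# Wire the components together using the classes and functions defined above.
decoder = SpeculativeDecoder(
    draft_model, target_model, target_tokenizer, device=device
)

text, stats = speculative_generate(
    decoder,
    prompt="Speculative decoding speeds up inference because",
    max_tokens=50,    # total tokens to generate
    num_draft=4,      # K: draft tokens per round
    temperature=0.7,  # lower temperature tends to raise acceptance
)

# Rough empirical acceptance: the accepted count includes correction tokens,
# so this slightly overstates the true per-token acceptance rate.
print(f"Rounds: {stats['rounds']}, acceptance ~ {stats['accepted'] / stats['drafted']:.0%}")
print(text)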

Limitations and Practical Considerations

Speculative decoding offers compelling speedups but comes with important limitations that affect deployment decisions. Understanding these tradeoffs helps you decide when and how to apply the technique.

The most significant constraint is the requirement for a well-aligned draft model. Finding or training a draft model that achieves 70%+ acceptance rate while being fast enough to provide speedups is non-trivial. For proprietary models without available smaller variants, this can be a blocking issue. Some organizations train dedicated draft models using distillation, but this adds significant infrastructure overhead and requires access to the target model's output distribution during training. The alternative of using early layers of the target model (self-speculative decoding) avoids this problem but requires architectural modifications.

Memory requirements increase because both models must be loaded simultaneously. For a 70B target model with a 7B draft model, total memory increases by roughly 10%. On memory-constrained deployments, this overhead may prevent using speculative decoding entirely or force quantization of one or both models. The interaction between quantization (covered in Chapters 5-9 of this part) and speculative decoding remains an active research area: quantizing the draft model more aggressively than the target model can preserve quality while minimizing the memory overhead.

Batched inference presents complications that can negate speculative decoding benefits. When serving multiple concurrent requests, the sequences in a batch may have different acceptance patterns. One sequence might accept all 5 draft tokens while another rejects after 2. Handling this efficiently requires sophisticated orchestration, which we'll explore in the upcoming chapter on Continuous Batching. The simpler approach of running speculative decoding independently per sequence underutilizes batch parallelism.

Impact on Inference Efficiency

Despite these limitations, speculative decoding has become a standard technique in production LLM serving. The 2-3x speedups it provides translate directly to reduced latency for users and reduced cost for providers. For conversational applications where response time critically affects user experience, shaving 50-70% off generation time is transformative.

The technique also demonstrates a broader principle: the memory-bound nature of LLM inference creates opportunities for clever algorithmic improvements that don't require hardware upgrades or model changes. Speculative decoding preserves the exact output distribution of the target model while improving efficiency, a rare win-win in machine learning optimization.

Looking ahead, the mathematical foundations of speculative decoding that guarantee distribution preservation are covered in the next chapter. Understanding these proofs explains why the acceptance criterion takes the form it does and enables extensions like tree-structured speculation where multiple draft paths are explored simultaneously.

Summary

Speculative decoding accelerates autoregressive generation by parallelizing token verification. A small draft model generates multiple candidate tokens, which the large target model verifies in a single forward pass. The technique exploits the memory-bound nature of LLM inference, where processing multiple tokens costs nearly the same as processing one.

The key components form an interconnected system: the draft model must be fast and well-aligned with the target, the acceptance criterion must preserve the target distribution, and the correction distribution must compensate for rejected tokens. The acceptance rate emerges as the critical metric, with small improvements yielding disproportionate speedup gains due to the geometric accumulation of consecutive acceptances.

Draft model selection balances speed against alignment quality. Model family relationships provide the most practical path, with smaller models from the same training pipeline achieving 70-80% acceptance rates. Distilled models can push this higher at the cost of training infrastructure.

The verification procedure guarantees that speculative decoding produces exactly the same output distribution as standard autoregressive generation from the target model. This mathematical guarantee, which we'll prove in the next chapter, makes speculative decoding a lossless acceleration technique: improving efficiency without compromising quality.

