Speculative Decoding Math: Algorithms & Speedup Limits

Michael Brenndoerfer · January 17, 2026 · 32 min read

Learn the mathematical framework for speculative decoding, including the exact acceptance criterion, rejection sampling logic, and deriving optimal draft lengths.

Speculative Decoding Math

In the previous chapter, we introduced speculative decoding as a technique for accelerating autoregressive generation by having a small draft model propose multiple tokens that a larger target model verifies in parallel. The key insight was that verification is faster than sequential generation because it allows parallel processing. However, a critical question remains: how do we decide which draft tokens to accept?

The answer involves carefully designed acceptance criteria that guarantee mathematical correctness. We need a rejection sampling scheme that preserves the target model's output distribution exactly, not approximately. This chapter develops the mathematical framework for speculative decoding, including the acceptance criterion, expected speedup analysis, draft quality effects, and optimal draft length selection.

The Acceptance Criterion

The fundamental challenge in speculative decoding is accepting draft tokens in a way that produces the exact same distribution as sampling directly from the target model. If we simply accept tokens when the draft and target models agree, we would bias the output toward the intersection of their distributions. Instead, we need a principled rejection sampling approach.

Setting Up the Problem

To understand the acceptance criterion, we must first appreciate the subtle problem it solves. When a draft model proposes a token, we face a fundamental question: how do we decide whether to keep that token while ensuring our final output looks exactly as if we had sampled directly from the target model? This is not merely about accepting "good" tokens and rejecting "bad" ones; it is about maintaining a precise mathematical relationship between what we accept and what the target model would have produced on its own.

Let p(x) denote the target model's probability distribution over the next token, and let q(x) denote the draft model's distribution. When the draft model proposes token x, we need an acceptance probability α(x) such that the final output distribution equals p(x).

The key insight comes from rejection sampling, a classical technique in computational statistics that allows us to sample from one distribution by filtering samples from another. The fundamental idea is to sample from a "proposal" distribution (our draft model) and then selectively keep or reject those samples in a way that reshapes the distribution to match our target. If we accept with probability:

\alpha(x) = \min\left(1, \frac{p(x)}{q(x)}\right)

where:

  • α(x): acceptance probability for token x
  • p(x): target model probability of token x
  • q(x): draft model probability of token x

then tokens where p(x) ≥ q(x) are always accepted, while tokens where p(x) < q(x) are accepted with probability proportional to how much the target model favors them relative to the draft model.

To build intuition for this formula, consider what happens in two contrasting scenarios. In the first scenario, suppose the target model assigns probability 0.3 to token x while the draft model only assigns probability 0.1. The ratio p(x)/q(x) = 3 exceeds 1, so we set α(x) = 1 and always accept. This makes sense: the draft model is underproposing this token relative to what the target wants, so we should accept it whenever it appears. In the second scenario, suppose the target model assigns probability 0.1 to token x while the draft model assigns probability 0.3. Now the ratio p(x)/q(x) = 1/3, so we accept with probability 1/3. The draft model is overproposing this token, so we need to reject some instances to bring its frequency down to what the target model expects.
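The two scenarios above can be checked directly with a few lines; a minimal sketch of the acceptance rule:

```python
def accept_prob(p_x: float, q_x: float) -> float:
    """Acceptance probability alpha(x) = min(1, p(x)/q(x))."""
    return min(1.0, p_x / q_x)

# Scenario 1: target favors the token more than the draft does -> always accept
print(accept_prob(0.3, 0.1))  # 1.0

# Scenario 2: draft overproposes the token -> accept only one third of the time
print(accept_prob(0.1, 0.3))
```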

Why This Works

Let's verify that this acceptance criterion produces the correct distribution. The verification proceeds through careful probability calculations that track what happens when we combine sampling from the draft model with our acceptance decision. When we sample x ∼ q(x) and accept with probability α(x), the probability of accepting a specific token x is:

\begin{aligned} P(\text{accept } x) &= q(x) \cdot \alpha(x) && \text{(definition of joint probability)} \\ &= q(x) \cdot \min\left(1, \frac{p(x)}{q(x)}\right) && \text{(substitute } \alpha(x) \text{)} \\ &= \min\left(q(x), q(x) \cdot \frac{p(x)}{q(x)}\right) && \text{(distribute } q(x) \text{)} \\ &= \min(q(x), p(x)) && \text{(simplify)} \end{aligned}

where:

  • P(accept x): probability of generating and accepting token x
  • q(x): probability of drafting token x
  • α(x): probability of accepting drafted token x
  • p(x): probability of token x under the target distribution

This derivation reveals that the probability of accepting any particular token x is the minimum of the two distributions at that point. This creates a kind of "clipping" effect where we keep only the overlapping probability mass between the draft and target distributions.

The total acceptance probability across all tokens is:

P(\text{accept}) = \sum_x \min(q(x), p(x))

where:

  • P(accept): probability that a drafted token is accepted
  • ∑_x: sum over the entire vocabulary
  • q(x): draft model probability of token x
  • p(x): target model probability of token x

This sum has a geometric interpretation. If you plot both distributions as histograms over the vocabulary, P(accept) equals the total area of overlap between the two histograms. When the distributions are identical, every token is accepted (total overlap equals 1). When they share no common support, no tokens are accepted (overlap equals 0).
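The overlap interpretation is easy to verify numerically; a small sketch with made-up distributions:

```python
import numpy as np

# Two toy distributions over a 5-token vocabulary (illustrative values)
p = np.array([0.4, 0.3, 0.15, 0.1, 0.05])  # target
q = np.array([0.3, 0.3, 0.2, 0.1, 0.1])    # draft

# P(accept) equals the pointwise-minimum mass shared by the two histograms
overlap = np.minimum(p, q).sum()
print(round(float(overlap), 3))  # 0.9

# An identical pair overlaps completely
print(round(float(np.minimum(p, p).sum()), 3))  # 1.0
```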

Conditional on acceptance, the distribution over accepted tokens is:

P(x \mid \text{accept}) = \frac{\min(q(x), p(x))}{\sum_{x'} \min(q(x'), p(x'))}

where:

  • P(x | accept): probability of token x given it was accepted
  • ∑_{x'}: normalization sum over all possible tokens x'
  • q(x): draft model probability
  • p(x): target model probability

This does not yet equal p(x). The distribution is corrected by handling rejections appropriately.

The Residual Distribution

When we reject a draft token, we don't simply redraft. Instead, we sample from a carefully constructed residual distribution that "fills in" the probability mass that rejection sampling missed. This residual distribution is the mathematical key that makes speculative decoding exact rather than approximate.

To understand why we need a residual distribution, consider what happens with pure rejection sampling. When each token's accepted mass is min(q(x), p(x)), we capture all of the target's probability mass at tokens where the draft distribution meets or exceeds the target. But what about tokens where p(x) > q(x)? For these tokens, the draft model underestimates the target probability, and our rejection sampling only captures the portion up to q(x). The residual distribution accounts for this missing mass.

Define:

p_{\text{resid}}(x) = \frac{\max(0, p(x) - q(x))}{Z}

where:

  • p_resid(x): residual distribution probability for token x
  • p(x): target model probability of token x
  • q(x): draft model probability of token x
  • p(x) − q(x): difference between target and draft probabilities
  • Z: normalizing constant, ∑_x max(0, p(x) − q(x))

The residual distribution captures the probability mass where p(x) > q(x), which is exactly the mass that acceptance sampling misses. Geometrically, if you imagine the target distribution as a histogram and the draft distribution as another histogram overlaid on top, the residual distribution represents the portions of the target histogram that "stick out above" the draft histogram. By sampling from p_resid on rejection, we ensure the overall distribution matches p(x).

Notice that the normalizing constant Z equals 1 − P(accept), which is the total rejection probability. The probability mass captured by the residual distribution exactly equals the probability mass that rejection sampling misses, ensuring everything adds up correctly.
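The identity Z = 1 − P(accept) holds for any pair of distributions; a quick numerical confirmation on random ones:

```python
import numpy as np

rng = np.random.default_rng(0)

# Random toy target/draft distributions over a 20-token vocabulary
p = rng.random(20); p /= p.sum()
q = rng.random(20); q /= q.sum()

accept_mass = np.minimum(p, q).sum()   # P(accept)
Z = np.maximum(0.0, p - q).sum()       # residual normalizer

# The residual mass is exactly the rejection probability
print(abs(Z - (1.0 - accept_mass)))
```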

Out[2]:
Visualization
Decomposition of rejection sampling probability mass into accepted, residual, and rejected components. The green area represents the shared mass between models, while the blue and red areas illustrate the necessary adjustments to maintain the target distribution's exactness. The total overlapping area directly determines the overall acceptance rate for draft tokens.

Complete Algorithm for One Token

The full acceptance procedure for a single token position works as follows. This algorithm combines the acceptance criterion with the residual distribution to guarantee exact sampling from the target model:

  1. Sample draft token x ∼ q(x)
  2. Sample uniform u ∼ Uniform(0, 1)
  3. If u < p(x)/q(x), accept x
  4. Otherwise, sample x ∼ p_resid(x)

This procedure guarantees that the output follows distribution p(x) exactly. To see why, consider that there are two mutually exclusive ways to generate a token x. The first path is through acceptance: we draft x with probability q(x) and accept it with probability min(1, p(x)/q(x)), contributing min(q(x), p(x)) to the total probability. The second path is through rejection and resampling: with probability 1 − P(accept) we reject and then sample x from the residual with probability max(0, p(x) − q(x))/Z, contributing max(0, p(x) − q(x)) to the total probability. Adding these two contributions gives min(q(x), p(x)) + max(0, p(x) − q(x)) = p(x), exactly as desired.
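The accounting in this paragraph reduces to a pointwise identity, which we can check on random distributions; a minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
p = rng.random(50); p /= p.sum()   # toy target distribution
q = rng.random(50); q /= q.sum()   # toy draft distribution

# Path 1: mass generated via acceptance
accepted = np.minimum(q, p)
# Path 2: mass generated via rejection + residual resampling
resampled = np.maximum(0.0, p - q)

# The two paths together reproduce the target distribution exactly
print(np.allclose(accepted + resampled, p))  # True
```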

Rejection Sampling Guarantee

The acceptance criterion with residual resampling produces outputs that are statistically indistinguishable from sampling directly from the target model. This is not an approximation; it is mathematically exact.

Extending to Multiple Tokens

In practice, the draft model proposes k tokens at once. We verify them sequentially from left to right, accepting tokens until the first rejection. This sequential verification is essential because language models produce conditional distributions: the probability of each token depends on all preceding tokens.

Let x₁, x₂, …, x_k be the draft sequence. For position i, conditioning on the prefix x₁, …, x_{i−1}:

\alpha_i = \min\left(1, \frac{p(x_i \mid x_1, \ldots, x_{i-1})}{q(x_i \mid x_1, \ldots, x_{i-1})}\right)

where:

  • α_i: acceptance probability for the i-th token
  • x_i: the i-th token in the draft sequence
  • p(x_i | …): target model conditional probability
  • q(x_i | …): draft model conditional probability

If position i rejects, we sample from the residual distribution and discard positions i+1, …, k. This maintains correctness because each accepted token was drawn from the correct conditional distribution. The key insight is that we cannot "skip" a rejection and continue verifying later tokens, because those tokens were conditioned on the rejected token. Once we sample a different token from the residual distribution, the entire subsequent sequence becomes invalid and must be regenerated.

This left-to-right verification creates an important efficiency consideration. Even though we verify all k positions in a single parallel forward pass of the target model, we can only use tokens up to the first rejection. However, because we always sample from the residual distribution upon rejection, we are guaranteed to produce at least one valid token per iteration, even if all k draft tokens are rejected. In fact, when all tokens are accepted, we get k + 1 tokens, since the target model's forward pass also provides the distribution for the next position beyond the draft sequence.
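The left-to-right loop can be sketched as follows. Here `draft_dists` and `target_dists` stand in for the per-position conditional distributions a real system would obtain from its two models, and `bonus_dist` for the target's distribution at position k+1; all names are illustrative, and the demo uses identical flat distributions so every draft token is accepted:

```python
import numpy as np

def verify_draft(draft_tokens, draft_dists, target_dists, bonus_dist, rng):
    """Accept draft tokens left to right; stop at the first rejection."""
    out = []
    for i, x in enumerate(draft_tokens):
        p_i, q_i = target_dists[i], draft_dists[i]
        if rng.uniform() < min(1.0, p_i[x] / q_i[x]):
            out.append(x)                        # accepted: keep and continue
        else:
            resid = np.maximum(0.0, p_i - q_i)   # rejected: resample residual
            out.append(rng.choice(len(p_i), p=resid / resid.sum()))
            return out                           # later drafts are now invalid
    out.append(rng.choice(len(bonus_dist), p=bonus_dist))  # all k accepted: bonus token
    return out

rng = np.random.default_rng(0)
vocab, k = 8, 3
q = [np.full(vocab, 1 / vocab) for _ in range(k)]   # flat draft distributions
p = [np.full(vocab, 1 / vocab) for _ in range(k)]   # identical target distributions
tokens = verify_draft([0, 1, 2], q, p, np.full(vocab, 1 / vocab), rng)
print(len(tokens))  # 4: all three drafts accepted plus the bonus token
```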

Expected Speedup Analysis

Understanding the expected speedup from speculative decoding requires analyzing how many tokens we expect to accept from each draft sequence and how this translates to wall-clock improvements. This analysis provides both theoretical insights and practical guidance for deploying speculative decoding systems.

Token Acceptance Probability

Let α denote the average acceptance probability for a single token. Under simplifying assumptions where each position has the same acceptance rate, we can derive a clean expression for this probability. While real-world acceptance rates vary by position and context, this i.i.d. assumption provides valuable analytical insights.

\begin{aligned} \alpha &= \mathbb{E}_{x \sim q}\left[\min\left(1, \frac{p(x)}{q(x)}\right)\right] && \text{(definition of expectation)} \\ &= \sum_x q(x) \cdot \min\left(1, \frac{p(x)}{q(x)}\right) && \text{(expand expectation)} \\ &= \sum_x \min\left(q(x), q(x) \cdot \frac{p(x)}{q(x)}\right) && \text{(distribute } q(x) \text{)} \\ &= \sum_x \min(q(x), p(x)) && \text{(simplify)} \end{aligned}

where:

  • α: expected acceptance probability
  • E_{x∼q}: expectation over tokens proposed by the draft model
  • q(x): draft model probability of token x
  • p(x): target model probability of token x
  • ∑_x min(q(x), p(x)): overlap between the two distributions

This quantity measures the overlap between the draft and target distributions. When the distributions are identical, α = 1. When they are completely disjoint, α = 0. In practice, α typically falls between 0.6 and 0.9 for well-matched draft and target model pairs.

Expected Accepted Tokens

Given a draft of length k, the number of accepted tokens follows a truncated geometric distribution. This distribution arises because we accept tokens sequentially until the first rejection, but we stop at position k even if no rejection has occurred. Let N be the number of tokens we generate (accepted drafts plus one from rejection sampling or the (k+1)-th position). The expected value is:

\mathbb{E}[N] = \sum_{i=1}^{k} i \cdot \alpha^{i-1}(1-\alpha) + (k+1) \cdot \alpha^k

where:

  • E[N]: expected number of accepted tokens (plus one verification)
  • k: draft sequence length
  • α: probability of accepting a single token
  • i: position of the first rejection
  • α^{i−1}(1−α): probability that the first rejection occurs at position i
  • (k+1) · α^k: contribution from the case where all k tokens are accepted

The first term accounts for rejecting at position i (we keep i tokens including the resampled one). The second term accounts for accepting all k drafts (we get k+1 tokens including the bonus verification token).

We can simplify this sum by observing that N is 1 plus the count of accepted draft tokens. Since the probability of accepting the first j tokens is α^j for all j ≤ k (under the simplified i.i.d. assumption), we can use the linearity of expectation:

\begin{aligned} \mathbb{E}[N] &= 1 + \sum_{j=1}^{k} P(\text{token } j \text{ is accepted}) && \text{(linearity of expectation)} \\ &= 1 + \sum_{j=1}^{k} \alpha^j && \text{(i.i.d. assumption)} \\ &= \sum_{j=0}^{k} \alpha^j && \text{(include } j=0 \text{ term)} \\ &= \frac{1 - \alpha^{k+1}}{1 - \alpha} && \text{(geometric series sum)} \end{aligned}

where:

  • E[N]: closed-form expected number of tokens generated per iteration
  • α: acceptance probability
  • k: draft length

This closed-form expression shows that expected tokens per iteration grow as a geometric series in the acceptance probability. When α is close to 1, almost all draft tokens are accepted, and E[N] approaches k+1. When α is close to 0, most drafts are rejected immediately, and E[N] approaches 1 (we generate just one token from the residual distribution).
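Both expressions for E[N] should agree; a quick numerical check of the truncated-geometric sum against the closed form:

```python
def expected_tokens_sum(alpha: float, k: int) -> float:
    """E[N] via the truncated geometric distribution (first-rejection form)."""
    total = sum(i * alpha ** (i - 1) * (1 - alpha) for i in range(1, k + 1))
    return total + (k + 1) * alpha ** k

def expected_tokens_closed(alpha: float, k: int) -> float:
    """E[N] via the geometric-series closed form."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

for alpha in (0.3, 0.7, 0.9):
    for k in (1, 4, 8):
        assert abs(expected_tokens_sum(alpha, k) - expected_tokens_closed(alpha, k)) < 1e-12
print("formulas agree")
```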

Out[3]:
Visualization
Expected number of valid tokens generated per iteration (E[N]) as a function of acceptance probability (alpha) for different draft lengths (k). The expected yield scales as a geometric sum of the acceptance probability, approaching the maximum limit of k+1 tokens as the draft model quality nears perfection.

The Speedup Formula

Let c be the cost ratio, defined as the time for one target model forward pass divided by the time for one draft model forward pass. Typically c ≫ 1 since the target model is much larger. For example, if the target model has 70 billion parameters and the draft model has 7 billion parameters, we might expect c to be roughly 10, depending on hardware and batching configurations.

Without speculative decoding, generating n tokens requires n target model passes, taking time n · T_target.

With speculative decoding, each iteration requires:

  • One draft model pass generating k tokens, taking time k · T_draft
  • One target model pass verifying all k positions, taking time T_target

The iteration produces E[N] = (1 − α^{k+1})/(1 − α) tokens on average and takes time T_target + k · T_draft = T_target(1 + k/c).

The speedup ratio S compares how fast we generate tokens with speculative decoding versus standard autoregressive generation. We derive this by computing the time per token under both approaches:

\begin{aligned} S &= \frac{T_{\text{target}}}{\frac{T_{\text{target}}(1 + k/c)}{\mathbb{E}[N]}} && \text{(ratio of time per token)} \\ &= \frac{\mathbb{E}[N]}{1 + k/c} && \text{(cancel } T_{\text{target}} \text{)} \\ &= \frac{(1 - \alpha^{k+1})/(1 - \alpha)}{1 + k/c} && \text{(substitute } \mathbb{E}[N] \text{)} \end{aligned}

where:

  • S: speedup ratio
  • T_target: time per token for standard autoregressive generation
  • k: draft length
  • c: cost ratio (T_target / T_draft)
  • α: acceptance probability (used in E[N])
  • 1 + k/c: cost of one speculative iteration in units of target model passes
  • E[N]: average tokens produced per iteration

This formula captures the essential tradeoff in speculative decoding. The numerator measures benefit, specifically how many tokens we expect to generate per iteration. The denominator measures cost, specifically how much computation that iteration requires. Speedup greater than 1 means speculative decoding is faster than standard generation.

When the draft model is very fast (c → ∞), this simplifies to:

S \approx \frac{1 - \alpha^{k+1}}{1 - \alpha}

where:

  • S: theoretical maximum speedup with a zero-cost draft model
  • α: acceptance probability
  • k: draft length

This limiting case represents the best possible speedup when the draft model adds negligible overhead. In practice, draft models incur costs, so the actual speedup is lower than this bound. However, this formula provides a useful upper limit for what speculative decoding can achieve.

Out[4]:
Visualization
Theoretical speedup S as a function of acceptance probability alpha for different cost ratios c (draft length k=5). Faster draft models (higher c) enable greater potential speedup factors, provided the acceptance probability exceeds the breakeven threshold where S=1.

Draft Quality Effects

The acceptance probability α directly depends on how well the draft model approximates the target model. Understanding this relationship helps us choose appropriate draft models and predict performance. The choice of draft model is a critical practical decision in deploying speculative decoding.

Measuring Draft Quality

Draft quality can be quantified by the expected acceptance probability:

\alpha = \sum_x \min(q(x), p(x))

where:

  • α: measure of draft quality (higher is better)
  • q(x): draft distribution
  • p(x): target distribution

This quantity has a clear interpretation: it is the probability mass that the two distributions share. A high-quality draft model assigns similar probabilities to most tokens as the target model, resulting in large overlap. A poor draft model might assign high probability to tokens the target model dislikes, or vice versa, resulting in small overlap.

This is related to the total variation distance between distributions, one of the most fundamental measures of dissimilarity between probability distributions:

\begin{aligned} TV(p, q) &= \frac{1}{2}\sum_x |p(x) - q(x)| && \text{(definition of TV distance)} \\ &= \frac{1}{2}\sum_x (p(x) + q(x) - 2\min(p(x), q(x))) && \text{(algebraic identity)} \\ &= \frac{1}{2}\left(\sum_x p(x) + \sum_x q(x) - 2\sum_x \min(p(x), q(x))\right) && \text{(split sums)} \\ &= \frac{1}{2}(1 + 1 - 2\alpha) && \text{(probabilities sum to 1)} \\ &= 1 - \alpha && \text{(simplify)} \end{aligned}

where:

  • TV(p, q): total variation distance
  • p(x): target distribution probability for token x
  • q(x): draft distribution probability for token x
  • |p(x) − q(x)|: absolute difference in probability for token x
  • α: acceptance probability

Higher draft quality means smaller total variation distance, which means higher acceptance probability. This connection to total variation distance indicates that the acceptance probability captures the most fundamental notion of distributional similarity: the maximum probability that an adversary could distinguish between samples from the two distributions.
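The identity TV(p, q) = 1 − α is straightforward to confirm numerically; a minimal sketch on random distributions:

```python
import numpy as np

rng = np.random.default_rng(2)
p = rng.random(100); p /= p.sum()   # toy target distribution
q = rng.random(100); q /= q.sum()   # toy draft distribution

alpha = np.minimum(p, q).sum()      # acceptance probability (overlap)
tv = 0.5 * np.abs(p - q).sum()      # total variation distance

print(abs(tv - (1.0 - alpha)))      # ~0: the two quantities are complementary
```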

Quality-Size Tradeoffs

Larger draft models tend to produce distributions closer to the target, increasing α. However, larger models are slower, decreasing c. The optimal draft model balances these factors, and finding this balance requires understanding the tradeoffs involved.

Consider three scenarios:

  • Very small draft model: Fast inference (c large) but poor distribution match (α small). We accept few tokens per iteration, limiting speedup. As a concrete example, imagine using a tiny 125M parameter model as the draft for a 70B parameter target. The draft model runs 100 times faster than the target, but it might only match the target's distribution 40% of the time. Most iterations produce just one or two tokens before rejection.

  • Medium draft model: Moderate speed and moderate acceptance rate. Often the sweet spot for practical applications. A 7B draft model for a 70B target might run 10 times faster while achieving 75% acceptance rates. This combination often yields the highest speedups in practice.

  • Large draft model: Good distribution match (α close to 1) but slow (c small). The overhead of running the draft model eats into speedup gains. A 30B draft for a 70B target might achieve 90% acceptance rates, but if it only runs 2-3 times faster than the target, the iteration cost is too high to achieve meaningful speedup.

Empirical Observations

In practice, acceptance rates typically range from 0.6 to 0.9 depending on:

  • Model family similarity (using a smaller model from the same family yields higher α)
  • Task difficulty (easier, more predictable text has higher α)
  • Temperature settings (lower temperature often increases α)

These factors interact in interesting ways. For instance, at very low temperatures where sampling becomes nearly deterministic, both draft and target models tend to select the same highest-probability token, resulting in acceptance rates approaching 1. At higher temperatures where sampling explores more of the distribution, the models are more likely to disagree, reducing acceptance rates.
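The temperature effect can be illustrated with a toy pair of logit vectors; the numbers below are arbitrary, chosen only so the two "models" share the same top token:

```python
import numpy as np

def softmax(logits, temperature):
    """Convert logits to a probability distribution at a given temperature."""
    z = np.array(logits) / temperature
    z -= z.max()                    # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

target_logits = [2.0, 1.0, 0.5, 0.0]   # illustrative target-model logits
draft_logits = [1.8, 1.2, 0.4, 0.1]    # similar draft model, same top token

for T in (0.25, 1.0):
    p = softmax(target_logits, T)
    q = softmax(draft_logits, T)
    alpha = np.minimum(p, q).sum()      # acceptance probability at this T
    print(f"T={T}: alpha={alpha:.3f}")
```

At these particular values the lower temperature gives the larger overlap, since both models concentrate on the shared top token. Note this is only illustrative: α is not monotone in temperature in general, because at very high temperatures both distributions flatten toward uniform and the overlap rises again.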

Optimal Draft Length

Choosing the draft length k involves balancing the cost of generating more drafts against the diminishing probability of accepting them all. This optimization problem has both theoretical and practical importance.

The Optimization Problem

We want to find k* that maximizes speedup:

\begin{aligned} k^* &= \arg\max_k \frac{\mathbb{E}[N_k]}{1 + k/c} && \text{(maximize speedup)} \\ &= \arg\max_k \frac{(1 - \alpha^{k+1})/(1 - \alpha)}{1 + k/c} && \text{(substitute } \mathbb{E}[N_k] \text{)} \end{aligned}

where:

  • k*: optimal draft length
  • E[N_k]: expected tokens generated with draft length k
  • c: cost ratio

Taking the derivative with respect to k and setting it to zero is analytically intractable, but we can understand the behavior through analysis of the numerator and denominator.

Diminishing Returns

The expected tokens E[N_k] grows sublinearly in k:

\mathbb{E}[N_k] = \frac{1 - \alpha^{k+1}}{1 - \alpha} \approx \frac{1}{1-\alpha} \text{ as } k \to \infty

where:

  • E[N_k]: expected tokens for draft length k
  • 1/(1−α): asymptotic limit of expected tokens
  • α < 1: acceptance probability

The growth rate decreases exponentially because α^k decays exponentially. Meanwhile, the cost 1 + k/c grows linearly in k. This tension creates an interior optimum. Intuitively, each additional draft token has a diminishing marginal benefit (it is only useful if all previous tokens were accepted, which becomes increasingly unlikely) but a constant marginal cost (one more draft model forward pass). At some point, the marginal cost exceeds the marginal benefit, and we should stop drafting.

Approximate Optimal k

For large c (fast draft model), we can derive an approximate optimal draft length. When c → ∞, the optimal k is infinite since there's no cost to drafting more tokens. For finite c, the optimal k satisfies:

\frac{d}{dk}\left[\frac{1 - \alpha^{k+1}}{(1-\alpha)(1 + k/c)}\right] = 0

where:

  • d/dk: derivative with respect to draft length k
  • α: acceptance probability
  • c: cost ratio
  • The expression in brackets is the speedup function

Applying the quotient rule for differentiation:

\begin{aligned} \frac{d}{dk} \left( \frac{1 - \alpha^{k+1}}{1 + k/c} \right) &= \frac{-\alpha^{k+1} \ln(\alpha) \cdot (1 + k/c) - (1 - \alpha^{k+1}) \cdot (1/c)}{(1 + k/c)^2} && \text{(quotient rule)} \\ &= 0 && \text{(set derivative to zero)} \end{aligned}

where:

  • α: acceptance probability
  • k: draft length
  • c: cost ratio
  • −α^{k+1} ln(α): derivative of the numerator term (1 − α^{k+1})
  • 1/c: derivative of the denominator term (1 + k/c)
  • (1 + k/c)²: squared denominator from the quotient rule

Setting the numerator to zero gives:

-\alpha^{k+1} \ln(\alpha) = \frac{1 - \alpha^{k+1}}{c(1 + k/c)}

where:

  • α: acceptance probability
  • k: draft length
  • c: cost ratio
  • −α^{k+1} ln(α): marginal gain term
  • Right side: marginal cost term

This equation balances marginal benefit against marginal cost. The left side represents the expected additional tokens from extending the draft by one position. The right side represents the cost of that extension in terms of reduced speedup per existing token. While this equation cannot be solved in closed form, it can be solved numerically for any specific values of α and c.

For typical values (α ≈ 0.7, c ≈ 10), optimal k ranges from 4 to 8 tokens.
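Since the optimum has no closed form, a direct scan over integer draft lengths is the practical approach; a small sketch using the speedup formula derived above:

```python
def speedup(alpha: float, k: int, c: float) -> float:
    """Speedup S = E[N] / (1 + k/c) under the i.i.d. acceptance model."""
    expected_n = (1 - alpha ** (k + 1)) / (1 - alpha)
    return expected_n / (1 + k / c)

def optimal_k(alpha: float, c: float, k_max: int = 32) -> int:
    """Integer draft length maximizing the speedup formula."""
    return max(range(1, k_max + 1), key=lambda k: speedup(alpha, k, c))

best = optimal_k(0.7, 10.0)
print(best, round(speedup(0.7, best, 10.0), 2))  # 4 1.98
```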

Adaptive Draft Length

In practice, the optimal k varies based on context. Some systems use adaptive strategies:

  • Fixed k: Simple to implement, works well when acceptance rates are stable
  • Adaptive k: Adjust based on recent acceptance rates
  • Tree-based drafting: Generate multiple candidate continuations, increasing effective acceptance probability

The adaptive approach monitors acceptance rates during generation and increases k when rates are high (indicating the draft model is performing well on the current content) or decreases k when rates are low (indicating more challenging content where longer drafts would be wasteful).
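One simple way to realize this is a small controller that nudges k after each iteration based on the fraction of drafts accepted; the thresholds and step sizes below are arbitrary illustrative choices, not values from the literature:

```python
class AdaptiveDraftLength:
    """Adjust draft length k from observed per-iteration acceptance rates."""

    def __init__(self, k: int = 5, k_min: int = 1, k_max: int = 12):
        self.k, self.k_min, self.k_max = k, k_min, k_max

    def update(self, accepted: int, drafted: int) -> int:
        rate = accepted / drafted
        if rate > 0.8:                       # draft model doing well: draft more
            self.k = min(self.k + 1, self.k_max)
        elif rate < 0.4:                     # frequent rejections: draft less
            self.k = max(self.k - 1, self.k_min)
        return self.k

ctrl = AdaptiveDraftLength()
print(ctrl.update(5, 5))   # high acceptance -> k grows from 5 to 6
print(ctrl.update(1, 6))   # low acceptance -> k shrinks back to 5
```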

Code Implementation

Let's implement the speculative decoding mathematics to see these concepts in action.

In[5]:
Code
import numpy as np

np.random.seed(42)

First, we'll implement the acceptance criterion for a single token:

In[6]:
Code
import numpy as np
from typing import Tuple


def acceptance_probability(p: np.ndarray, q: np.ndarray) -> float:
    """
    Calculate the expected acceptance probability alpha.

    Args:
        p: Target model distribution over vocabulary
        q: Draft model distribution over vocabulary

    Returns:
        Expected acceptance probability
    """
    return np.sum(np.minimum(p, q))


def sample_with_rejection(
    p: np.ndarray, q: np.ndarray, draft_token: int
) -> Tuple[int, bool]:
    """
    Apply acceptance criterion to a draft token.

    Args:
        p: Target model distribution
        q: Draft model distribution
        draft_token: Token proposed by draft model

    Returns:
        Tuple of (final_token, was_accepted)
    """
    # Calculate acceptance probability for this specific token
    accept_prob = min(1.0, p[draft_token] / q[draft_token])

    # Sample uniform and compare
    u = np.random.uniform()

    if u < accept_prob:
        return draft_token, True
    else:
        # Sample from residual distribution
        residual = np.maximum(0, p - q)
        residual = residual / residual.sum()  # Normalize
        new_token = np.random.choice(len(p), p=residual)
        return new_token, False

Let's verify that this produces the correct distribution:

In[7]:
Code
# Create example distributions
vocab_size = 10
p = np.array([0.3, 0.25, 0.15, 0.1, 0.08, 0.05, 0.03, 0.02, 0.01, 0.01])
q = np.array([0.2, 0.2, 0.2, 0.15, 0.1, 0.05, 0.04, 0.03, 0.02, 0.01])

# Run many samples to verify distribution
n_samples = 100000
final_tokens = []

for _ in range(n_samples):
    # Draft model proposes a token
    draft_token = np.random.choice(vocab_size, p=q)
    # Apply acceptance criterion
    final_token, _ = sample_with_rejection(p, q, draft_token)
    final_tokens.append(final_token)

# Calculate empirical distribution
empirical_dist = np.bincount(final_tokens, minlength=vocab_size) / n_samples
Out[8]:
Console
Target distribution p(x): [0.3  0.25 0.15 0.1  0.08 0.05 0.03 0.02 0.01 0.01]
Draft distribution q(x): [0.2  0.2  0.2  0.15 0.1  0.05 0.04 0.03 0.02 0.01]
Empirical distribution:   [0.299 0.251 0.151 0.099 0.081 0.05  0.029 0.02  0.01  0.009]

Max deviation from target: 0.0010

The empirical distribution closely matches the target, confirming our acceptance criterion preserves the correct distribution.
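The match is guaranteed algebraically, not just empirically: each token x is produced through acceptance with probability min(p(x), q(x)), and the rejection path contributes the remaining max(0, p(x) - q(x)), because the total rejected mass 1 - \alpha exactly equals the residual normalizer. A quick numerical sketch of this identity, reusing the p and q above:

```python
import numpy as np

p = np.array([0.3, 0.25, 0.15, 0.1, 0.08, 0.05, 0.03, 0.02, 0.01, 0.01])
q = np.array([0.2, 0.2, 0.2, 0.15, 0.1, 0.05, 0.04, 0.03, 0.02, 0.01])

accepted_mass = np.minimum(p, q)   # q(x) * min(1, p(x)/q(x)) per token
alpha = accepted_mass.sum()        # overall acceptance probability
residual = np.maximum(0, p - q)    # unnormalized residual distribution

# The rejection probability (1 - alpha) equals residual.sum(), so the
# accepted and rejected paths together reproduce p exactly
reconstructed = accepted_mass + (1 - alpha) * residual / residual.sum()
assert np.allclose(reconstructed, p)
print(f"alpha = {alpha:.2f}")  # alpha = 0.85
```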

Now let's calculate expected speedup for different configurations:

In[9]:
Code
def expected_tokens(alpha: float, k: int) -> float:
    """Calculate expected number of tokens per iteration."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)


def speedup(alpha: float, k: int, c: float) -> float:
    """
    Calculate speedup ratio.

    Args:
        alpha: Acceptance probability
        k: Draft length
        c: Cost ratio (target time / draft time)

    Returns:
        Speedup factor
    """
    expected_n = expected_tokens(alpha, k)
    iteration_cost = 1 + k / c
    return expected_n / iteration_cost
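As a hand-checkable sanity test, the same formula can be evaluated inline for one arbitrary configuration (\alpha = 0.7, k = 5, c = 20, values chosen only for illustration):

```python
# Inline evaluation of the speedup formula for alpha=0.7, k=5, c=20
alpha, k, c = 0.7, 5, 20

expected_n = (1 - alpha ** (k + 1)) / (1 - alpha)  # expected tokens per iteration
s = expected_n / (1 + k / c)                        # speedup ratio

print(round(expected_n, 2), round(s, 2))  # 2.94 2.35
```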

Let's visualize how speedup varies with draft length for different acceptance rates:

In[10]:
Code
k_values = np.arange(1, 16)
c = 20  # Target model is 20x slower than draft
alphas = [0.5, 0.6, 0.7, 0.8, 0.9]

# Calculate speedups for each alpha
speedup_results = {}
for alpha in alphas:
    speedup_results[alpha] = [speedup(alpha, k, c) for k in k_values]
Out[11]:
Visualization
Line plot showing speedup curves that rise then flatten as draft length increases.
Speedup factor as a function of draft length for different acceptance probabilities. Higher acceptance rates allow longer effective draft sequences. All curves assume a cost ratio c=20.

The plot shows that the optimal draft length increases with the acceptance probability. For \alpha = 0.9 the curve is still rising at 10 or more draft tokens, while \alpha = 0.5 shows diminishing returns after just 3 to 4 tokens.

Let's find the optimal k for different scenarios:

In[12]:
Code
def find_optimal_k(
    alpha: float, c: float, max_k: int = 20
) -> Tuple[int, float]:
    """Find the optimal draft length and corresponding speedup."""
    best_k = 1
    best_speedup = speedup(alpha, 1, c)

    for k in range(2, max_k + 1):
        s = speedup(alpha, k, c)
        if s > best_speedup:
            best_speedup = s
            best_k = k

    return best_k, best_speedup
In[13]:
Code
configs = []
for alpha in [0.6, 0.7, 0.8, 0.9]:
    for c in [10, 20, 50]:
        opt_k, opt_speedup = find_optimal_k(alpha, c)
        configs.append((alpha, c, opt_k, opt_speedup))
Out[14]:
Console
Optimal draft length for different configurations:
--------------------------------------------------
α        c        Optimal k    Speedup   
--------------------------------------------------
0.6      10       3            1.67x
0.6      20       4            1.92x
0.6      50       6            2.17x
0.7      10       4            1.98x
0.7      20       6            2.35x
0.7      50       8            2.76x
0.8      10       6            2.47x
0.8      20       8            3.09x
0.8      50       11           3.82x
0.9      10       10           3.43x
0.9      20       13           4.67x
0.9      50       19           6.37x

The results confirm our analysis: higher acceptance rates and faster draft models (larger c) support longer draft sequences and achieve greater speedup.

Finally, let's visualize the acceptance rate as a function of distribution similarity:

In[15]:
Code
# Generate random distribution pairs with varying divergence
np.random.seed(42)
alphas = []
kl_divs = []
n_pairs = 200

for _ in range(n_pairs):
    # Create base distribution
    base = np.random.dirichlet(np.ones(20))

    # Add noise to create draft distribution with varying divergence
    noise_scale = np.random.uniform(0, 2)
    noise = np.random.dirichlet(np.ones(20) * (1 / (noise_scale + 0.1)))
    mix = np.random.uniform(0.3, 1.0)
    q = mix * base + (1 - mix) * noise
    q = q / q.sum()

    p = base

    # Calculate alpha and KL divergence
    alpha = np.sum(np.minimum(p, q))

    # KL divergence (with smoothing to avoid log(0))
    eps = 1e-10
    kl = np.sum(p * np.log((p + eps) / (q + eps)))

    alphas.append(alpha)
    kl_divs.append(kl)
Out[16]:
Visualization
Scatter plot showing acceptance probability decreasing as KL divergence increases.
Relationship between acceptance probability and KL divergence between draft and target distributions. As distributions diverge, acceptance rates drop, reducing speculative decoding effectiveness.

The relationship between distribution divergence and acceptance probability is clear: as the draft model's distribution diverges from the target, acceptance rates drop substantially. This underscores the importance of choosing draft models that approximate the target well.
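The tight relationship in the scatter plot is not just empirical. Since \min(a, b) = \frac{a + b - |a - b|}{2}, summing over the vocabulary gives \alpha = 1 - \frac{1}{2}\sum_x |p(x) - q(x)|: the acceptance probability is exactly one minus the total variation distance between draft and target. A quick numerical check of the identity on random distributions (a sketch, independent of the cells above):

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(20))  # random target distribution
q = rng.dirichlet(np.ones(20))  # random draft distribution

alpha = np.sum(np.minimum(p, q))   # acceptance probability
tv = 0.5 * np.sum(np.abs(p - q))   # total variation distance

# Identity: alpha = 1 - TV(p, q)
assert np.isclose(alpha, 1 - tv)
```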

Key Parameters

The key parameters for the speculative decoding implementation are:

  • k: Draft length. The number of tokens proposed by the draft model in each step.
  • c: Cost ratio (T_{\text{target}} / T_{\text{draft}}). Represents how much faster the draft model is compared to the target model.
  • alpha: Acceptance probability (\alpha). The probability that a draft token is accepted by the target model, derived from the overlap between the draft and target distributions.

Limitations and Practical Considerations

While the mathematics of speculative decoding provides strong theoretical guarantees, several practical challenges affect real-world performance.

The acceptance criterion assumes we can efficiently compute both p(x) and q(x) for all tokens. In practice, this means running the full forward pass of both models, even though we only need the probability of specific draft tokens. Some implementations optimize this by caching intermediate states, but the full softmax computation remains a bottleneck. The residual distribution sampling also requires computing the full target distribution, making it difficult to apply speculative decoding with certain memory-efficient generation techniques.

The i.i.d. assumption in our speedup analysis rarely holds in practice. Acceptance rates vary significantly based on context: function names and common phrases have high acceptance rates, while creative or technical content sees more rejections. This variance means actual speedups may differ from theoretical predictions, and systems should monitor acceptance rates to adapt draft lengths dynamically. Additionally, the tree-structured extensions to speculative decoding (generating multiple candidate continuations) can improve effective acceptance rates but add implementation complexity.

Hardware considerations also affect practical speedup. The theoretical model assumes draft and target passes don't interfere with each other's memory or compute. On GPUs with limited memory bandwidth, loading both models' weights can create contention. Some deployments use separate GPUs for draft and target models, while others time-multiplex on a single device. Continuous batching systems, which we'll explore in the next chapter, add another layer of complexity to speculative decoding integration.

Summary

This chapter developed the mathematical foundations of speculative decoding, providing tools to understand and optimize this important inference acceleration technique.

The acceptance criterion uses rejection sampling with a residual distribution to guarantee that speculative decoding produces outputs identical in distribution to standard autoregressive sampling. The acceptance probability \alpha(x) = \min(1, p(x)/q(x)) ensures we accept tokens where the draft model underestimates the target probability while probabilistically rejecting overestimated tokens.

Expected speedup depends on the acceptance probability \alpha, draft length k, and cost ratio c. The formula S = \frac{(1 - \alpha^{k+1})/(1 - \alpha)}{1 + k/c} captures the tradeoff between generating more draft tokens and the diminishing probability of accepting them all.

Optimal draft length balances these factors. Higher acceptance rates support longer drafts, with practical optima typically ranging from 4 to 10 tokens depending on model quality and hardware configuration.

Draft model quality directly impacts speedup through the acceptance probability \alpha = \sum_x \min(p(x), q(x)). Smaller models within the same family often provide the best balance of speed and distribution similarity.


Reference

BibTeX
@misc{speculativedecodingmathalgorithmsspeeduplimits,
  author = {Michael Brenndoerfer},
  title = {Speculative Decoding Math: Algorithms \& Speedup Limits},
  year = {2026},
  url = {https://mbrenndoerfer.com/writing/speculative-decoding-math-acceptance-criterion},
  organization = {mbrenndoerfer.com},
  note = {Accessed: 2025-01-01}
}