Master smoothing techniques that solve the zero probability problem in n-gram models, including Laplace, add-k, Good-Turing, interpolation, and Kneser-Ney smoothing with Python implementations.

This article is part of the free-to-read Language AI Handbook
Smoothing Techniques
N-gram language models have a fatal flaw: they assign zero probability to any sequence containing an n-gram not seen during training. A single unseen word combination causes the entire sentence probability to collapse to zero, making the model useless for real-world text. Smoothing techniques solve this problem by redistributing probability mass from observed n-grams to unseen ones, ensuring every possible sequence receives some non-zero probability.
This chapter explores the evolution of smoothing methods, from simple add-one smoothing to the sophisticated Kneser-Ney algorithm that powered speech recognition and machine translation systems for decades.
The Zero Probability Problem
Consider a bigram model trained on a small corpus. You calculate probabilities by counting how often word pairs appear:

$$P(w_i \mid w_{i-1}) = \frac{C(w_{i-1}, w_i)}{C(w_{i-1})}$$

where:
- $P(w_i \mid w_{i-1})$: probability of word $w_i$ given the preceding word $w_{i-1}$
- $C(w_{i-1}, w_i)$: count of the bigram (how often the word pair appears together)
- $C(w_{i-1})$: count of the preceding word (how often it appears as context)

The problem emerges immediately. If your training corpus contains "the cat sat on the mat" but never "the cat ate the fish," then $C(\text{cat}, \text{ate}) = 0$ and therefore $P(\text{ate} \mid \text{cat}) = 0$. When computing the probability of a new sentence using the chain rule, a single zero multiplies everything to zero:

$$P(\text{the cat ate the fish}) = P(\text{the}) \cdot P(\text{cat} \mid \text{the}) \cdot \underbrace{P(\text{ate} \mid \text{cat})}_{=\,0} \cdot P(\text{the} \mid \text{ate}) \cdot P(\text{fish} \mid \text{the}) = 0$$
This is catastrophic. The sentence is perfectly grammatical, yet our model claims it's impossible.
Smoothing: The process of adjusting probability estimates to reserve some probability mass for unseen events. Smoothing ensures that no valid sequence receives zero probability, making the model robust to novel inputs.
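To see the collapse concretely, here is a minimal sketch that scores an unseen sentence with unsmoothed bigram probabilities. The toy one-sentence corpus is an assumption for illustration, not the chapter's data:

```python
from collections import Counter

# Toy training data (a stand-in, not the chapter's corpus): one sentence only
corpus = ["<s>", "the", "cat", "sat", "on", "the", "mat", "</s>"]

bigram_counts = Counter(zip(corpus, corpus[1:]))
unigram_counts = Counter(corpus)

def mle_prob(prev, word):
    """Unsmoothed bigram probability: C(prev, word) / C(prev)."""
    if unigram_counts[prev] == 0:
        return 0.0  # unknown context: no counts at all
    return bigram_counts[(prev, word)] / unigram_counts[prev]

# Score an unseen but grammatical sentence with the chain rule
sentence = ["<s>", "the", "cat", "ate", "the", "fish", "</s>"]
prob = 1.0
for prev, word in zip(sentence, sentence[1:]):
    prob *= mle_prob(prev, word)

print(prob)  # 0.0: the single unseen bigram ('cat', 'ate') zeroes the product
```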
Add-One (Laplace) Smoothing
The simplest solution is to pretend every n-gram occurred at least once. Add one to all counts before computing probabilities:

$$P_{\text{Laplace}}(w_i \mid w_{i-1}) = \frac{C(w_{i-1}, w_i) + 1}{C(w_{i-1}) + V}$$

where $V$ is the vocabulary size. We add $V$ to the denominator because we're adding 1 to each of the $V$ possible bigrams that could follow $w_{i-1}$.
Let's trace through the math. Suppose "the cat" appears 5 times in our corpus, "the" appears 100 times, and our vocabulary has 10,000 words.
Without smoothing:

$$P(\text{cat} \mid \text{the}) = \frac{5}{100} = 0.05$$

With Laplace smoothing:

$$P_{\text{Laplace}}(\text{cat} \mid \text{the}) = \frac{5 + 1}{100 + 10{,}000} = \frac{6}{10{,}100} \approx 0.00059$$
The probability dropped by nearly two orders of magnitude. This is the fundamental problem with Laplace smoothing: it steals too much probability mass from observed events to give to unseen ones.
Worked Example: Laplace Smoothing
Consider a tiny corpus with vocabulary $V = \{a, b, c\}$ and these bigram counts:
| Bigram | Count |
|---|---|
| (a, b) | 3 |
| (a, c) | 1 |
| (b, a) | 2 |
| (b, c) | 0 |
| (c, a) | 1 |
| (c, b) | 0 |
The unigram counts (as contexts) are: $C(a) = 4$, $C(b) = 2$, $C(c) = 1$.
For MLE probabilities:

$$P(b \mid a) = \frac{3}{4} = 0.75 \qquad P(c \mid a) = \frac{1}{4} = 0.25 \qquad P(c \mid b) = \frac{0}{2} = 0$$

For Laplace-smoothed probabilities (with $V = 3$):

$$P_{\text{Laplace}}(b \mid a) = \frac{3 + 1}{4 + 3} = \frac{4}{7} \approx 0.571 \qquad P_{\text{Laplace}}(c \mid a) = \frac{1 + 1}{4 + 3} = \frac{2}{7} \approx 0.286 \qquad P_{\text{Laplace}}(c \mid b) = \frac{0 + 1}{2 + 3} = \frac{1}{5} = 0.2$$

And for the unseen bigram (a, a):

$$P_{\text{Laplace}}(a \mid a) = \frac{0 + 1}{4 + 3} = \frac{1}{7} \approx 0.143$$
Notice how the probabilities for observed bigrams decreased significantly to make room for the unseen one.
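The arithmetic is easy to check in code. Here is a minimal sketch (not the chapter's original implementation) that reproduces the worked example:

```python
from collections import Counter

# Bigram and context counts from the worked example; vocabulary V = {a, b, c}
bigram_counts = Counter({("a", "b"): 3, ("a", "c"): 1, ("b", "a"): 2, ("c", "a"): 1})
context_counts = Counter({"a": 4, "b": 2, "c": 1})
V = 3  # vocabulary size

def laplace(prev, word):
    """Add-one (Laplace) smoothed bigram probability."""
    return (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + V)

print(f"P(b|a) = {laplace('a', 'b'):.3f}")  # 4/7, about 0.571 (MLE would be 0.75)
print(f"P(a|a) = {laplace('a', 'a'):.3f}")  # 1/7, about 0.143 (MLE would be 0.0)
```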
Add-k Smoothing
A natural improvement is to add a smaller constant $k$ instead of 1:

$$P_{\text{add-}k}(w_i \mid w_{i-1}) = \frac{C(w_{i-1}, w_i) + k}{C(w_{i-1}) + kV}$$

With $k = 0.1$ and our earlier example:

$$P_{\text{add-}k}(\text{cat} \mid \text{the}) = \frac{5 + 0.1}{100 + 0.1 \times 10{,}000} = \frac{5.1}{1{,}100} \approx 0.0046$$
This is still far from the original 0.05, but it's less aggressive than Laplace smoothing.
Tuning k
The optimal value of $k$ depends on your corpus and vocabulary size. You can tune it by:
- Holding out a development set
- Computing perplexity for different values of $k$
- Selecting the $k$ that minimizes perplexity
Typical values range from 0.001 to 0.5. Smaller corpora often need larger values, while larger corpora can use smaller ones.
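The outputs in this section come from a small toy corpus. The chapter's exact sentences aren't reproduced here, so the sketch below builds bigram counts from stand-in sentences of the same flavor; its printed summary will differ slightly from the numbers shown next because the sentences are not identical:

```python
from collections import Counter

# Stand-in training sentences, padded with start/end markers so that
# sentence boundaries become bigrams too.
sentences = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat chased the dog",
    "the dog chased the cat",
]
tokenized = [["<s>"] + s.split() + ["</s>"] for s in sentences]

unigram_counts = Counter(tok for sent in tokenized for tok in sent)
bigram_counts = Counter(pair for sent in tokenized for pair in zip(sent, sent[1:]))

print("Vocabulary size:", len(unigram_counts))
print("Total bigrams:", sum(bigram_counts.values()))
print("Top 5 bigrams:", bigram_counts.most_common(5))
```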
Vocabulary size: 12
Total bigrams: 33
Top 5 bigrams: [(('<s>', 'the'), 5), (('the', 'cat'), 4), (('the', 'dog'), 4), (('sat', 'on'), 2), (('on', 'the'), 2)]
Our small corpus yields a vocabulary of 12 unique tokens and 33 bigram instances. The most frequent bigram is ("<s>", "the"), appearing 5 times, followed by ("the", "cat") and ("the", "dog") with 4 occurrences each, which makes sense given our cat-and-dog themed sentences. This manageable size lets us trace through the smoothing calculations by hand if needed.
Let's implement add-k smoothing and see how different values of $k$ affect probability estimates.
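Here is a minimal add-k sketch (not necessarily the chapter's original code) that plugs in the counts shown in the output below, C('the', 'cat') = 4, C('the') = 10, and V = 12, and produces the same sweep:

```python
def add_k_prob(bigram_count, context_count, k, vocab_size):
    """Add-k smoothed bigram probability: (C + k) / (C_context + k * V)."""
    return (bigram_count + k) / (context_count + k * vocab_size)

# Counts taken from the corpus summary above
C_the_cat = 4   # C('the', 'cat')
C_the = 10      # C('the')
V = 12          # vocabulary size

print("P(cat|the) for different k values:")
for k in [1.0, 0.5, 0.1, 0.01, 0.001]:
    print(f"k = {k:5.3f}: P(cat|the) = {add_k_prob(C_the_cat, C_the, k, V):.6f}")

print("\nP(elephant|the) for different k values (unseen bigram):")
for k in [1.0, 0.5, 0.1, 0.01, 0.001]:
    print(f"k = {k:5.3f}: P(elephant|the) = {add_k_prob(0, C_the, k, V):.6f}")
```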
Bigram counts: C('the', 'cat') = 4
Unigram count: C('the') = 10
P(cat|the) for different k values:
----------------------------------------
k = 1.000: P(cat|the) = 0.227273
k = 0.500: P(cat|the) = 0.281250
k = 0.100: P(cat|the) = 0.366071
k = 0.010: P(cat|the) = 0.396245
k = 0.001: P(cat|the) = 0.399620
P(elephant|the) for different k values (unseen bigram):
----------------------------------------
k = 1.000: P(elephant|the) = 0.045455
k = 0.500: P(elephant|the) = 0.031250
k = 0.100: P(elephant|the) = 0.008929
k = 0.010: P(elephant|the) = 0.000988
k = 0.001: P(elephant|the) = 0.000100
The MLE probability for P(cat|the) would be 4/10 = 0.4. With k=1.0 (Laplace), this drops to about 0.227, losing more than 40% of the probability mass. As k decreases to 0.001, the probability climbs back to about 0.3996, much closer to the MLE.
For the unseen bigram "the elephant," we see the flip side: larger k values give it more probability (0.045 with k=1.0), while smaller values give it less (0.0001 with k=0.001). This is the core trade-off in smoothing: how much probability to steal from observed events to cover unseen ones.

The plot reveals the fundamental trade-off. As $k$ increases, the blue curve (observed bigram) falls while the red curve (unseen bigram) rises. The gray dotted line shows the MLE estimate we're trying to preserve for observed events. Choosing $k$ means finding the right balance: small enough to keep observed probabilities reasonable, large enough to give unseen events meaningful coverage.
Good-Turing Smoothing
Add-k smoothing has an arbitrary feel to it. Why add 0.1 rather than 0.05 or 0.2? Good-Turing smoothing takes a fundamentally different approach: instead of picking a constant, it uses the statistical structure of the data itself to determine how much probability mass should go to unseen events.
The Key Insight: Learning from Singletons
The core idea is elegant. If we collected more data, some n-grams that currently appear zero times would appear once. Some that appear once would appear twice. Good-Turing asks: can we use the current distribution of counts to predict this future behavior?
The answer is yes. Consider n-grams that appear exactly once in our corpus (called singletons or hapax legomena). These are the n-grams that barely made it into our observations. If we had slightly less data, many of them would have count zero. Conversely, if we had slightly more data, many currently unseen n-grams would become singletons.
This suggests that the number of singletons tells us something about unseen n-grams. More precisely: the probability mass that should go to unseen events is proportional to the number of singletons.
The count $N_c$ represents how many distinct n-grams appear exactly $c$ times in the corpus. For example, $N_1$ is the number of n-grams that appear exactly once (hapax legomena), and $N_0$ is the number of n-grams that never appear.
The Good-Turing Formula
Good-Turing formalizes this intuition by replacing the observed count $c$ with an adjusted count $c^*$:

$$c^* = (c + 1)\,\frac{N_{c+1}}{N_c}$$

The formula works by looking one step ahead in the count distribution. To estimate what a count of $c$ is really worth, we look at how many n-grams have count $c+1$ compared to how many have count $c$. If there are relatively few n-grams with count $c+1$, then count $c$ is probably an overestimate, and we should discount it.
For unseen n-grams (count 0), the adjusted count becomes:

$$c^*_0 = (0 + 1)\,\frac{N_1}{N_0} = \frac{N_1}{N_0}$$

And the total probability mass reserved for all unseen events is simply:

$$P(\text{unseen}) = \frac{N_1}{N}$$

where $N$ is the total number of n-gram tokens. This elegant result says: the fraction of probability mass for unseen events equals the fraction of tokens that are singletons.
Worked Example: Good-Turing
Suppose we have a corpus with these frequency counts:
| Count $c$ | $N_c$ (n-grams with this count) |
|---|---|
| 0 | 74,671,100,000 |
| 1 | 2,018,046 |
| 2 | 449,721 |
| 3 | 188,933 |
| 4 | 105,668 |
| 5 | 68,379 |
For bigrams that appear exactly once:

$$c^*_1 = (1 + 1)\,\frac{N_2}{N_1} = 2 \times \frac{449{,}721}{2{,}018{,}046} \approx 0.446$$

This means a bigram seen once should be treated as if it appeared only 0.446 times. The "missing" 0.554 goes to unseen bigrams.
For unseen bigrams:

$$c^*_0 = \frac{N_1}{N_0} = \frac{2{,}018{,}046}{74{,}671{,}100{,}000} \approx 0.000027$$
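Here is a sketch of the Good-Turing adjustment (an assumed implementation, not necessarily the chapter's code). It takes the frequency-of-frequency table $N_c$ and applies $c^* = (c+1)N_{c+1}/N_c$; counts whose $N_{c+1}$ happens to be zero collapse to zero, which is exactly the sparsity problem discussed below:

```python
def good_turing_adjusted(freq_of_freq):
    """Good-Turing adjusted counts: c* = (c + 1) * N_{c+1} / N_c.

    freq_of_freq maps each count c to N_c, the number of distinct n-grams
    observed exactly c times. In practice it can be built from a Counter
    of n-gram counts, e.g. Counter(bigram_counts.values()).
    """
    adjusted = {}
    for c, n_c in sorted(freq_of_freq.items()):
        n_next = freq_of_freq.get(c + 1, 0)
        adjusted[c] = (c + 1) * n_next / n_c
    return adjusted

# Frequency-of-frequency values reported for this chapter's bigram corpus
N = {1: 14, 2: 3, 4: 2, 5: 1}
for c, c_star in good_turing_adjusted(N).items():
    print(f"c = {c} -> c* = {c_star:.3f}")
```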
Frequency of frequencies for bigrams:
----------------------------------------
N_1 = 14 (bigrams appearing 1 time(s))
N_2 = 3 (bigrams appearing 2 time(s))
N_4 = 2 (bigrams appearing 4 time(s))
N_5 = 1 (bigrams appearing 5 time(s))

Good-Turing adjusted counts:
----------------------------------------
c = 1 -> c* = 0.429
c = 2 -> c* = 0.000
c = 4 -> c* = 2.500
Our corpus shows a typical pattern: many bigrams appear only once (singletons), fewer appear twice, and so on. The adjusted count for singletons (c=1) drops significantly because Good-Turing uses the ratio of doubletons to singletons. When this ratio is less than 0.5, singletons get discounted heavily.

The distribution follows the expected pattern: most bigrams are rare (appearing only once or twice), while a few appear frequently. This Zipfian distribution is why Good-Turing works: the steep decline from $N_1$ to $N_2$ to $N_3$ provides the ratios needed to estimate appropriate discounts.
Good-Turing works well for small counts but becomes unreliable for larger counts, where the $N_c$ values become sparse or zero. In practice, it's often combined with other techniques.
Backoff and Interpolation
Instead of modifying counts, we can use information from lower-order n-grams when higher-order ones are unreliable. Two strategies exist:
Backoff: Use the lower-order model only when the higher-order count is zero. If we've never seen the trigram "the cat ate," back off to the bigram "cat ate."
Interpolation: Always combine evidence from multiple n-gram orders using weighted averaging.
Interpolation
Linear interpolation mixes unigram, bigram, and trigram probabilities:

$$P_{\text{interp}}(w_i \mid w_{i-2}, w_{i-1}) = \lambda_1 P(w_i) + \lambda_2 P(w_i \mid w_{i-1}) + \lambda_3 P(w_i \mid w_{i-2}, w_{i-1})$$

where:
- $\lambda_1, \lambda_2, \lambda_3$: interpolation weights that sum to 1 ($\lambda_1 + \lambda_2 + \lambda_3 = 1$)
- $P(w_i)$: unigram probability
- $P(w_i \mid w_{i-1})$: bigram probability
- $P(w_i \mid w_{i-2}, w_{i-1})$: trigram probability
The weights are typically learned from held-out data using the EM algorithm or grid search.
Stupid Backoff
A simpler approach that works surprisingly well for large-scale systems is "stupid backoff." It doesn't produce true probabilities but scores that work well for ranking:

$$S(w_i \mid w_{i-n+1}^{i-1}) = \begin{cases} \dfrac{C(w_{i-n+1}^{i})}{C(w_{i-n+1}^{i-1})} & \text{if } C(w_{i-n+1}^{i}) > 0 \\[2ex] \alpha \, S(w_i \mid w_{i-n+2}^{i-1}) & \text{otherwise} \end{cases}$$

where:
- $S(w_i \mid w_{i-n+1}^{i-1})$: score for word $w_i$ given the preceding words
- $C(w_{i-n+1}^{i})$: count of the full n-gram
- $\alpha$: backoff weight (typically 0.4)
Google used this for their web-scale language models because it's fast and the scores work well for ranking despite not being proper probabilities.
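To see interpolation in action, here is a minimal sketch of the calculation behind the output below (not necessarily the chapter's original code). The trigram and bigram counts come from the corpus summary; the unigram probability uses C(cat) = 4 over 33 total tokens, an assumption chosen to match the printed numbers:

```python
def interpolated_prob(lambdas, p_uni, p_bi, p_tri):
    """Linear interpolation of unigram, bigram, and trigram estimates.

    lambdas = (l1, l2, l3) weight the unigram, bigram, and trigram
    probabilities respectively and must sum to 1.
    """
    l1, l2, l3 = lambdas
    return l1 * p_uni + l2 * p_bi + l3 * p_tri

# Component estimates for P(cat | <s>, the), using counts from this chapter's corpus
p_tri = 3 / 5    # C(<s>, the, cat) / C(<s>, the)
p_bi = 4 / 10    # C(the, cat) / C(the)
p_uni = 4 / 33   # C(cat) / total tokens (assumed to be 33 here)

for lambdas in [(0.1, 0.3, 0.6), (0.2, 0.4, 0.4), (0.5, 0.3, 0.2)]:
    p = interpolated_prob(lambdas, p_uni, p_bi, p_tri)
    print(f"lambda = {lambdas}: P(cat|<s>,the) = {p:.6f}")
```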
Testing trigram: ('<s>', 'the', 'cat')
Trigram count: 3
Bigram count ('<s>', 'the'): 5
λ = (0.1, 0.3, 0.6): P(cat|<s>,the) = 0.492121
λ = (0.2, 0.4, 0.4): P(cat|<s>,the) = 0.424242
λ = (0.5, 0.3, 0.2): P(cat|<s>,the) = 0.300606
The weight distribution matters. When we favor trigrams ($\lambda = (0.1, 0.3, 0.6)$, with most of the weight on the trigram term), the probability is higher because the trigram "<s> the cat" appears in our corpus. When we favor unigrams ($\lambda = (0.5, 0.3, 0.2)$), we rely more on how often "cat" appears overall, which dilutes the context-specific information.
Interpolation provides a smooth fallback mechanism. Even when the trigram count is zero, the bigram and unigram components contribute to a non-zero probability.
Kneser-Ney Smoothing
The smoothing methods we've seen so far share a common limitation: they treat the backoff distribution as a simple unigram model based on word frequency. But word frequency alone can be misleading. Kneser-Ney smoothing addresses this with two innovations that, together, create the most effective n-gram smoothing method ever developed.
The Problem with Frequency-Based Backoff
To understand why Kneser-Ney matters, consider the phrase "San Francisco." In any corpus about California, the word "Francisco" appears frequently. A standard unigram model would assign it high probability.
But here's the catch: "Francisco" almost always follows "San." It rarely appears in other contexts. If we're backing off from an unseen trigram like "visited beautiful Francisco," a frequency-based unigram model would say "Francisco" is likely. Yet intuitively, we know "Francisco" is an unlikely word to appear after "beautiful" because it almost never starts a new context on its own.
This observation leads to Kneser-Ney's key insight: when backing off, we shouldn't ask "how often does this word appear?" but rather "in how many different contexts does this word appear?"
Innovation 1: Absolute Discounting
The first innovation is straightforward. Instead of using Good-Turing's complex count adjustments, Kneser-Ney subtracts a fixed discount $d$ from every observed count:

$$P_{\text{AbsDisc}}(w_i \mid w_{i-1}) = \frac{\max\big(C(w_{i-1}, w_i) - d,\ 0\big)}{C(w_{i-1})} + \lambda(w_{i-1})\, P(w_i)$$

The formula has two terms:
- Discounted probability: The observed count minus $d$, divided by the context count. The $\max(\cdot, 0)$ ensures we never go negative.
- Backoff contribution: An interpolation weight $\lambda(w_{i-1})$ times a lower-order probability.
The discount $d$ is typically 0.75, a value derived from theoretical analysis of Good-Turing estimates. By subtracting a fixed amount rather than a percentage, absolute discounting treats all observed n-grams more uniformly. An n-gram seen 5 times loses 0.75 (15%), while one seen 2 times loses 0.75 (37.5%). This makes sense: rare observations are less reliable and should be discounted more aggressively in relative terms.
Innovation 2: Continuation Probability
The second innovation is where Kneser-Ney truly shines. For the backoff distribution $P(w_i)$, it doesn't use raw word frequency. Instead, it uses continuation probability: the proportion of unique contexts in which a word appears.

$$P_{\text{continuation}}(w) = \frac{\big|\{w' : C(w', w) > 0\}\big|}{\big|\{(w', w'') : C(w', w'') > 0\}\big|}$$

Let's unpack this:
- Numerator: Count how many unique words precede $w$. For "Francisco," this might be just 1 (only "San" precedes it). For "the," it might be hundreds.
- Denominator: The total number of unique bigram types in the corpus.
Returning to our example: "Francisco" has high frequency but low continuation count (appears after few unique words). "The" has high frequency and high continuation count (appears after many unique words). When backing off, "the" deserves higher probability than "Francisco" because it's a more versatile word that genuinely appears in diverse contexts.
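A small sketch of this distinction, with counts invented purely for illustration: "francisco" has high raw frequency but only one left context, while "the" follows many different words:

```python
from collections import Counter, defaultdict

# Toy bigram counts (assumed for illustration)
bigram_counts = Counter({
    ("san", "francisco"): 50,
    ("in", "the"): 20, ("on", "the"): 15, ("at", "the"): 10, ("of", "the"): 25,
})

# Continuation count: how many distinct left contexts each word appears after
preceding_types = defaultdict(set)
for (prev, word) in bigram_counts:
    preceding_types[word].add(prev)

total_bigram_types = len(bigram_counts)
for word in ["francisco", "the"]:
    raw = sum(c for (p, w), c in bigram_counts.items() if w == word)
    cont = len(preceding_types[word])
    print(f"{word}: raw freq = {raw}, continuation count = {cont}, "
          f"P_continuation = {cont / total_bigram_types:.2f}")
```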
Putting It Together: The Full Formula
The complete Kneser-Ney formula for the highest-order n-gram combines both innovations:

$$P_{\text{KN}}(w_i \mid w_{i-n+1}^{i-1}) = \frac{\max\big(C(w_{i-n+1}^{i}) - d,\ 0\big)}{C(w_{i-n+1}^{i-1})} + \lambda(w_{i-n+1}^{i-1})\, P_{\text{KN}}(w_i \mid w_{i-n+2}^{i-1})$$

where:
- $w_{i-n+1}^{i-1}$: the context (preceding words)
- $C(w_{i-n+1}^{i})$: count of the full n-gram
- $d$: discount value (typically 0.75)
The interpolation weight $\lambda$ ensures probability mass is conserved. It equals the total amount discounted from all observed continuations:

$$\lambda(w_{i-n+1}^{i-1}) = \frac{d}{C(w_{i-n+1}^{i-1})} \,\Big|\big\{w : C(w_{i-n+1}^{i-1}, w) > 0\big\}\Big|$$

This formula says: take the discount $d$, multiply by the number of unique words that follow this context, and divide by the total context count. The result is exactly the probability mass we "stole" from observed n-grams, which we now redistribute through the backoff distribution.
For lower-order distributions (used in the recursive backoff), raw counts are replaced with continuation counts. This is what makes the "San Francisco" example work correctly: "Francisco" has low continuation count, so it receives low probability in the backoff distribution despite its high raw frequency.
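Here is a sketch of bigram-level Kneser-Ney (an assumed implementation, not necessarily the chapter's exact code). The highest-order term uses raw counts; the continuation counts enter only through the backoff term. With this chapter's bigram counts it reproduces the probabilities shown in the output below:

```python
from collections import Counter, defaultdict

def kneser_ney_bigram(bigram_counts, d=0.75):
    """Build a bigram Kneser-Ney probability function from a dict of bigram counts.

    P_KN(w | prev) = max(C(prev, w) - d, 0) / C(prev)
                     + lambda(prev) * P_continuation(w)
    """
    context_totals = Counter()    # C(prev): how often prev occurs as a context
    followers = defaultdict(set)  # unique words observed after each context
    preceders = defaultdict(set)  # unique words observed before each word
    for (prev, w), count in bigram_counts.items():
        context_totals[prev] += count
        followers[prev].add(w)
        preceders[w].add(prev)
    total_bigram_types = len(bigram_counts)

    def prob(prev, w):
        # Discounted higher-order term (uses raw counts)
        discounted = max(bigram_counts.get((prev, w), 0) - d, 0) / context_totals[prev]
        # Mass freed by discounting, redistributed via the continuation distribution
        lam = d * len(followers[prev]) / context_totals[prev]
        p_cont = len(preceders[w]) / total_bigram_types
        return discounted + lam * p_cont

    return prob

# usage: p = kneser_ney_bigram(bigram_counts); p("the", "cat")
```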
Continuation counts (unique preceding contexts):
--------------------------------------------------
'</s>': appears after 5 unique word(s)
'the': appears after 4 unique word(s)
'sat': appears after 2 unique word(s)
'chased': appears after 2 unique word(s)
'cat': appears after 1 unique word(s)
'on': appears after 1 unique word(s)
'mat': appears after 1 unique word(s)
'dog': appears after 1 unique word(s)

Total unique bigram types: 20

Kneser-Ney probabilities (d = 0.75):
--------------------------------------------------
P(cat|the): 0.340000 (raw count: 4)
P(dog|the): 0.340000 (raw count: 4)
P(sat|cat): 0.137500 (raw count: 1)
P(elephant|the): 0.000000 (raw count: 0)
The continuation counts reveal which words are versatile. Words like "the," "sat," and "chased" appear after multiple different words, giving them higher continuation probability. The end-of-sentence marker "</s>" has the highest continuation count because every sentence ends with it, each time after a different word.

The scatter plot illustrates Kneser-Ney's key insight. Words near or above the diagonal line are "versatile," appearing in many different contexts relative to their frequency. Words below the line are context-dependent, appearing frequently but in limited contexts. In a larger corpus, "Francisco" would appear far below the diagonal (high frequency, low continuation count), while common function words like "the" would appear on or above it.
Kneser-Ney keeps observed bigrams like "the cat" close to their MLE estimates while redistributing the discounted mass through the continuation distribution, so unseen bigrams of known words (such as "the sat") receive small but non-zero probability. Note that "the elephant" still gets zero: "elephant" never appears anywhere in the corpus, so its continuation count is zero, and truly out-of-vocabulary words need separate handling (for example, an <UNK> token). The discount of 0.75 is subtracted from each observed count, and the freed probability mass flows to the continuation distribution.
Modified Kneser-Ney
Modified Kneser-Ney improves on the original by using three different discount values based on the count:

$$d(c) = \begin{cases} d_1 & \text{if } c = 1 \\ d_2 & \text{if } c = 2 \\ d_3 & \text{if } c \geq 3 \end{cases}$$

These are estimated from the training data:

$$Y = \frac{N_1}{N_1 + 2N_2}, \qquad d_1 = 1 - 2Y\frac{N_2}{N_1}, \qquad d_2 = 2 - 3Y\frac{N_3}{N_2}, \qquad d_3 = 3 - 4Y\frac{N_4}{N_3}$$

where:
- $Y$: a normalization factor based on singleton and doubleton counts
- $N_c$: number of n-grams appearing exactly $c$ times
The intuition is that n-grams seen once are more likely to be noise than those seen twice, which are more likely to be noise than those seen three or more times. Different discounts capture this.
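A sketch of how these discounts can be estimated from the frequency-of-frequency counts (an assumed implementation of the standard formulas; the chapter's own code may handle zero denominators differently, which is why tiny corpora can land on boundary values like those below):

```python
def modified_kn_discounts(freq_of_freq):
    """Estimate Modified Kneser-Ney discounts from N_1..N_4.

    Y  = N1 / (N1 + 2*N2)
    d1 = 1 - 2*Y*N2/N1,  d2 = 2 - 3*Y*N3/N2,  d3 = 3 - 4*Y*N4/N3
    Zero denominators are guarded by falling back to 0.0.
    """
    n1, n2 = freq_of_freq.get(1, 0), freq_of_freq.get(2, 0)
    n3, n4 = freq_of_freq.get(3, 0), freq_of_freq.get(4, 0)

    y = n1 / (n1 + 2 * n2) if (n1 + 2 * n2) > 0 else 0.0
    d1 = 1 - 2 * y * n2 / n1 if n1 > 0 else 0.0
    d2 = 2 - 3 * y * n3 / n2 if n2 > 0 else 0.0
    d3 = 3 - 4 * y * n4 / n3 if n3 > 0 else 0.0
    return y, d1, d2, d3

# usage: y, d1, d2, d3 = modified_kn_discounts(Counter(bigram_counts.values()))
```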
Modified Kneser-Ney discount parameters:
----------------------------------------
Y = 0.7000
d1 (for count=1): 0.7000
d2 (for count=2): 1.3000
d3 (for count≥3): 0.0000
The Y factor normalizes based on the ratio of singletons to doubletons. With our small corpus, the discount values are at their boundary conditions. On larger corpora, typical values are around $d_1 \approx 0.7$, $d_2 \approx 1.0$, and $d_3 \approx 1.3$. The progression makes intuitive sense: we discount singletons more aggressively because they're more likely to be noise, while n-grams seen three or more times are more reliable and get smaller relative discounts.
Comparing Smoothing Methods
Let's compare how different smoothing methods handle the same probability estimation task.

The chart reveals the trade-offs clearly. MLE gives the highest probabilities to observed words but zeros for unseen ones. Add-1 smoothing flattens everything too much. Add-0.01 and Kneser-Ney strike a better balance, keeping observed words relatively high while providing coverage for unseen words.
Probability estimates for P(word|the):
===========================================================================
Word          MLE       Add-1     Add-0.1   Add-0.01  KN
---------------------------------------------------------------------------
cat           0.400000  0.227273  0.366071  0.396245  0.340000
dog           0.400000  0.227273  0.366071  0.396245  0.340000
sat           0.000000  0.045455  0.008929  0.000988  0.030000
on            0.000000  0.045455  0.008929  0.000988  0.015000
mat           0.100000  0.090909  0.098214  0.099802  0.040000
elephant      0.000000  0.045455  0.008929  0.000988  0.000000
zebra         0.000000  0.045455  0.008929  0.000988  0.000000
The table quantifies what we saw in the chart. For observed words like "cat" and "dog," MLE gives the highest probability, but Add-1 drops it dramatically. Kneser-Ney maintains probabilities closer to MLE for observed words while still providing coverage for unseen ones like "elephant" and "zebra." This balance is why Kneser-Ney became the standard for production language models.
Limitations and Impact
Limitations
Smoothing techniques, despite their sophistication, have inherent limitations:
- Fixed context window: All n-gram smoothing methods are bounded by the Markov assumption. No amount of smoothing helps when the relevant context is beyond the n-gram window.
- Vocabulary dependence: Performance degrades with vocabulary size. As $V$ grows, the number of possible unseen n-grams explodes.
- Domain sensitivity: Smoothing parameters tuned on one domain may perform poorly on another. A model trained on news text with optimal smoothing for that domain may still struggle with social media text.
- No semantic awareness: Smoothing treats "the cat ate" and "the cat flew" identically if both are unseen. It cannot use semantic knowledge that cats are more likely to eat than fly.
Impact
Despite these limitations, smoothing techniques were foundational to practical NLP:
- Speech recognition: Kneser-Ney smoothing powered speech recognition systems for decades, enabling reliable transcription even for novel word combinations.
- Machine translation: Statistical MT systems relied heavily on smoothed language models to score translation candidates.
- Spell correction: Smoothed models help distinguish between plausible and implausible word sequences for context-aware spelling correction.
- Established evaluation practices: The development of smoothing techniques led to rigorous evaluation methods like perplexity comparison on held-out data.
The techniques in this chapter represent the pinnacle of count-based language modeling. They squeezed remarkable performance from simple statistics. But they also revealed the ceiling: no matter how clever the smoothing, n-gram models cannot capture long-range dependencies or semantic similarity. This limitation motivated the shift to neural language models, which learn continuous representations that naturally generalize to unseen word combinations.
Summary
Smoothing techniques solve the zero probability problem in n-gram language models by redistributing probability mass from observed to unseen events.
Key takeaways:
- Laplace (add-one) smoothing is simple but steals too much probability from observed events
- Add-k smoothing improves by using smaller constants, with $k$ tuned on held-out data
- Good-Turing smoothing uses frequency of frequencies to estimate probability mass for unseen events
- Interpolation combines evidence from multiple n-gram orders using learned weights
- Kneser-Ney smoothing uses absolute discounting and continuation probability, making it the gold standard for n-gram models
- Modified Kneser-Ney refines discounting with count-dependent values
The choice of smoothing method matters significantly. On large corpora, Kneser-Ney typically outperforms simpler methods by 10-20% in perplexity. For small corpora or quick experiments, add-k smoothing provides a reasonable baseline with minimal implementation effort.
These techniques represent the culmination of decades of statistical NLP research. While neural language models have largely superseded n-gram models for most applications, understanding smoothing provides essential intuition about probability estimation, the bias-variance trade-off, and the importance of handling unseen events, concepts that remain central to modern NLP.
Key Parameters
When implementing smoothing techniques, these parameters have the most impact on model performance:
| Parameter | Typical Range | Effect |
|---|---|---|
| $k$ (add-k smoothing) | 0.001 - 0.5 | Controls probability mass given to unseen events. Smaller values stay closer to MLE; larger values flatten the distribution. Tune on held-out data. |
| $d$ (Kneser-Ney discount) | 0.75 | Fixed discount subtracted from all counts. The standard value of 0.75 works well across most corpora. |
| $d_1, d_2, d_3$ (Modified KN) | 0.7, 1.0, 1.3 | Count-dependent discounts. Computed from corpus statistics using the formulas above. No manual tuning needed. |
| $\lambda$ weights (interpolation) | Sum to 1.0 | Control contribution of each n-gram order. Learn from held-out data using EM or grid search. Higher-order weights should generally be larger. |
| $\alpha$ (stupid backoff) | 0.4 | Backoff penalty applied when falling back to lower-order n-grams. The value 0.4 was empirically determined by Google. |
| $n$ (n-gram order) | 3 - 5 | Higher orders capture more context but suffer more from sparsity. Trigrams are common; 5-grams require very large corpora. |