Learn how n-gram language models assign probabilities to word sequences using the chain rule and Markov assumption, with implementations for text generation and scoring.

N-gram Language Models
What makes some word sequences sound natural while others feel wrong? "The cat sat on the mat" flows effortlessly, but "mat the on sat cat the" is gibberish. Native speakers intuitively know which sequences are likely, even for sentences they've never encountered. N-gram language models capture this intuition mathematically, assigning probabilities to sequences of words based on patterns learned from text.
Language models are foundational to NLP. They power spell checkers that suggest "their" over "thier," autocomplete systems that predict your next word, and speech recognizers that disambiguate "recognize speech" from "wreck a nice beach." Before neural networks dominated the field, n-gram models were the workhorses of language modeling, and understanding them provides essential intuition for modern approaches.
This chapter builds n-gram language models from first principles. You'll learn the probability theory that makes them work, implement them from scratch, and discover both their power and fundamental limitations.
The Language Modeling Problem
A language model assigns probabilities to sequences of words. Given a sequence $w_1, w_2, \ldots, w_n$, the model estimates $P(w_1, w_2, \ldots, w_n)$, the probability of that exact sequence occurring in natural language.
A language model is a probability distribution over sequences of words. It assigns higher probabilities to sequences that are more likely to occur in natural language and lower probabilities to unlikely or ungrammatical sequences.
Why is this useful? Consider a speech recognition system that hears an ambiguous audio signal. It might produce two candidate transcriptions: "I saw a van" and "eyes awe of an." A language model knows the first is far more probable and helps the system choose correctly.
Language models also enable text generation. If we can compute $P(w_1, \ldots, w_n)$ for any sequence, we can generate new text by sampling words according to their probabilities given the preceding context.
The Chain Rule of Probability
Computing the probability of a full sequence seems daunting at first. How could we possibly estimate $P(\text{the cat sat on the mat})$ directly? We'd need to count how often this exact six-word sequence appears in our training data, then divide by the total number of six-word sequences. But most sentences are unique. Even in a corpus of billions of words, we're unlikely to see any specific sentence more than a handful of times.
The chain rule of probability offers an escape from this dilemma. Instead of estimating the joint probability directly, we decompose it into a product of conditional probabilities. Each conditional asks a simpler question: given everything that came before, how likely is the next word?

$$P(w_1, w_2, \ldots, w_n) = P(w_1) \, P(w_2 \mid w_1) \, P(w_3 \mid w_1, w_2) \cdots P(w_n \mid w_1, \ldots, w_{n-1})$$

This decomposition follows directly from the definition of conditional probability applied repeatedly. We can write it more compactly using product notation:

$$P(w_1, w_2, \ldots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1})$$
where:
- $w_i$ is the $i$-th word in the sequence
- $P(w_i \mid w_1, \ldots, w_{i-1})$ is the probability of word $w_i$ given all preceding words
- The product $\prod$ multiplies all these conditional probabilities together
This decomposition is exact, not an approximation. The probability of any sequence equals the product of each word's probability given its entire preceding context.
Let's see this decomposition in action with a concrete example:
Sentence: "the cat sat on the mat"
Chain rule decomposition:
P(the cat sat on the mat) =
P(the)
× P(cat|the)
× P(sat|the cat)
× P(on|the cat sat)
× P(the|the cat sat on)
× P(mat|the cat sat on the)
Each term in this product answers a progressively more specific question. The first asks: how likely is "the" to start a sentence? The second asks: given we started with "the," how likely is "cat" to follow? By the time we reach the final word, we're asking: given the entire context "the cat sat on the," how likely is "mat"?
The chain rule is mathematically exact, but it creates a severe practical problem. To compute $P(w_i \mid w_1, \ldots, w_{i-1})$, we need the probability of word $w_i$ given the entire preceding context. For long sentences, this context can be arbitrarily long. A 20-word sentence requires us to estimate the probability of the 20th word given all 19 preceding words. We'll never have enough data to estimate these long-context probabilities reliably, since most 19-word contexts appear only once, if ever.
This is where the Markov assumption comes to the rescue.
The Markov Assumption
The chain rule gives us an exact decomposition, but it demands something impossible: reliable estimates of probabilities conditioned on arbitrarily long histories. The key insight of n-gram models is to make a simplifying assumption that trades some accuracy for tractability.
The Markov assumption states that the probability of a word depends only on a fixed-length history of preceding words, not the entire sequence. An n-gram model assumes the probability of word $w_i$ depends only on the previous $n-1$ words.
The assumption is named after the Russian mathematician Andrey Markov, who studied stochastic processes with this "memoryless" property. In the context of language, it means we pretend the future depends only on the recent past, not the distant past.
For a bigram model ($n=2$), each word depends only on the single word immediately before it:

$$P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-1})$$
For a trigram model ($n=3$), the memory extends to two previous words:

$$P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-2}, w_{i-1})$$
The pattern continues for higher orders: an n-gram model conditions on exactly $n-1$ previous words. The approximation symbol $\approx$ reminds us this is a simplification, not an equality.
Why does this help? Consider the difference in what we need to estimate:
- Full history: We need $P(\text{mat} \mid \text{the cat sat on the})$, requiring counts of the specific five-word context "the cat sat on the"
- Trigram: We need $P(\text{mat} \mid \text{on the})$, requiring only counts of the two-word context "on the"
- Bigram: We need $P(\text{mat} \mid \text{the})$, requiring only counts of the single word "the"
Shorter contexts appear far more frequently in training data, giving us more reliable probability estimates.
Markov Approximations for the final word:
------------------------------------------------------------
Full history:  P(mat | the cat sat on the)  → exact but impractical
Trigram (n=3): P(mat | on the)              → considers 2 previous words
Bigram (n=2):  P(mat | the)                 → considers 1 previous word
Unigram (n=1): P(mat)                       → ignores all context
Is the Markov assumption reasonable? Strictly speaking, no. The probability of "mat" after "the cat sat on the" depends on "cat" and "sat," which establish that we're talking about an animal sitting somewhere. A bigram model, seeing only "the," has no idea whether we're discussing cats on mats, birds in trees, or books on shelves.
Yet this approximation works well in practice, especially for larger values of $n$. Language has enough local structure that nearby words carry most of the predictive information. The word "on" strongly predicts a location or surface will follow. The phrase "sat on" even more strongly suggests something like "mat," "chair," or "floor." We lose some information, but we gain the ability to estimate probabilities reliably.
Maximum Likelihood Estimation
With the Markov assumption in hand, we've reduced our problem to estimating conditional probabilities like $P(w_i \mid w_{i-1})$. But how do we actually compute these from data?
The most straightforward approach is maximum likelihood estimation (MLE): count how often each n-gram appears in our training corpus, then normalize. The probability of word $w$ following context $h$ equals the fraction of times we observed $h$ followed by $w$.
For bigrams, the MLE estimate is:

$$P(w_i \mid w_{i-1}) = \frac{C(w_{i-1}, w_i)}{C(w_{i-1})}$$
where:
- $C(w_{i-1}, w_i)$ is the count of the bigram $(w_{i-1}, w_i)$ in the training corpus
- $C(w_{i-1})$ is the count of the context word $w_{i-1}$
The formula says: to find the probability of word $w_i$ following word $w_{i-1}$, count how many times the bigram $(w_{i-1}, w_i)$ appears in the corpus, then divide by the total count of $w_{i-1}$.
For trigrams, we extend the same logic to two-word contexts:

$$P(w_i \mid w_{i-2}, w_{i-1}) = \frac{C(w_{i-2}, w_{i-1}, w_i)}{C(w_{i-2}, w_{i-1})}$$
where:
- $C(w_{i-2}, w_{i-1}, w_i)$ is the count of the trigram
- $C(w_{i-2}, w_{i-1})$ is the count of the preceding bigram context
This pattern applies to any n-gram order: divide the count of the full n-gram by the count of its $(n-1)$-gram prefix.
Let's implement this and see it work on a small corpus:
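Here is a minimal sketch of that counting step, assuming simple whitespace tokenization over the four toy sentences used throughout this chapter:

```python
from collections import Counter

# The tiny corpus used throughout this chapter
corpus = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "the dog sat on the mat",
    "the dog chased the cat",
]

# Tokenize by whitespace and flatten into one token stream
tokens = [word for sentence in corpus for word in sentence.split()]

# Count unigrams and bigrams
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

print(f"Corpus tokens: {len(tokens)}")
print(f"Unique unigrams: {len(unigram_counts)}")
print(f"Unique bigrams: {len(bigram_counts)}")
print("Top 10 bigrams:")
for bigram, count in bigram_counts.most_common(10):
    print(f"  {bigram}: {count}")
```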
Corpus tokens: 22
Unique unigrams: 9
Unique bigrams: 14
Top 10 bigrams:
('the', 'cat'): 3
('sat', 'on'): 2
('on', 'the'): 2
('the', 'mat'): 2
('mat', 'the'): 2
('the', 'dog'): 2
('cat', 'sat'): 1
('cat', 'ate'): 1
('ate', 'the'): 1
('the', 'fish'): 1
Our small corpus contains 22 tokens but only 9 unique words. The most frequent bigram is ("the", "cat"), appearing three times, while several others, such as ("sat", "on") and ("the", "dog"), appear twice. Notice how "the" dominates as the first element of many bigrams, reflecting its role as the most common English word.

The frequency distributions reveal a pattern common to all natural language: a few items dominate while most appear rarely. "The" appears 8 times, more than any other word, while most bigrams appear only once. This Zipf-like distribution explains why n-gram models struggle with data sparsity. As we move to higher-order n-grams, the long tail grows longer, and most n-grams remain unseen even in large corpora.
Now let's apply the MLE formula to compute actual probabilities:
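A sketch of the MLE computation, reusing the `unigram_counts` and `bigram_counts` from the previous snippet:

```python
def bigram_probability(w1, w2, bigram_counts, unigram_counts):
    """MLE estimate P(w2 | w1) = C(w1, w2) / C(w1)."""
    if unigram_counts[w1] == 0:
        return 0.0
    return bigram_counts[(w1, w2)] / unigram_counts[w1]

print("Bigram Probabilities P(w2|w1):")
print("-" * 40)
for w1, w2 in [("the", "cat"), ("the", "dog"), ("the", "mat"),
               ("cat", "sat"), ("sat", "on"), ("on", "the")]:
    p = bigram_probability(w1, w2, bigram_counts, unigram_counts)
    print(f"P({w2}|{w1}) = {bigram_counts[(w1, w2)]}/{unigram_counts[w1]} = {p:.3f}")
```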
Bigram Probabilities P(w2|w1):
----------------------------------------
P(cat|the) = 3/8 = 0.375
P(dog|the) = 2/8 = 0.250
P(mat|the) = 2/8 = 0.250
P(sat|cat) = 1/3 = 0.333
P(on|sat)  = 2/2 = 1.000
P(the|on)  = 2/2 = 1.000
The probabilities reveal patterns in our corpus. After "the," the word "cat" follows 37.5% of the time, while "dog" and "mat" each follow 25% of the time. After "sat," the word "on" appears with 100% probability, reflecting the fixed phrase "sat on" that appears in every relevant sentence. After "on," the word "the" follows every time, since our corpus always uses "on the" before the final noun.
These probabilities capture the local structure of our tiny corpus. A model trained on this data would predict that "sat" is always followed by "on," and "on" is always followed by "the." Of course, with only four sentences, these patterns are artifacts of our limited data. A larger corpus would reveal more nuanced distributions.
We can visualize the complete bigram probability matrix as a heatmap, where each cell shows $P(w_2 \mid w_1)$, the probability of word $w_2$ following word $w_1$:

The heatmap reveals the sparse structure of bigram probabilities. Most cells are empty (zero probability), indicating word pairs that never occurred in our corpus. The non-zero cells cluster around specific patterns: "the" can be followed by several words, but "sat" is always followed by "on." This sparsity becomes even more pronounced with larger vocabularies and higher-order n-grams.
Start and End Tokens
Sentences have beginnings and endings. To model this, we add special tokens: <s> marks the start of a sentence and </s> marks the end. These tokens let us model which words typically start sentences and which words end them.
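A sketch of the padding step for a bigram model, reusing the `corpus` list from earlier:

```python
from collections import Counter

# Pad each sentence with n-1 start tokens and one end token (bigram: n=2)
n = 2
padded_sentences = [["<s>"] * (n - 1) + s.split() + ["</s>"] for s in corpus]

print("Sentences with markers (bigram model):")
for i, sent in enumerate(padded_sentences, 1):
    print(f"  {i}. {' '.join(sent)}")

# Bigrams over the padded, concatenated token stream
padded_tokens = [tok for sent in padded_sentences for tok in sent]
padded_bigrams = Counter(zip(padded_tokens, padded_tokens[1:]))

print("Bigrams involving markers:")
for bigram, count in padded_bigrams.most_common():
    if "<s>" in bigram or "</s>" in bigram:
        print(f"  {bigram}: {count}")
```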
Sentences with markers (bigram model):
1. <s> the cat sat on the mat </s>
2. <s> the cat ate the fish </s>
3. <s> the dog sat on the mat </s>
4. <s> the dog chased the cat </s>
Bigrams involving markers:
('<s>', 'the'): 4
('</s>', '<s>'): 3
('mat', '</s>'): 2
('fish', '</s>'): 1
('cat', '</s>'): 1
The start marker <s> lets us compute P(w | <s>), the probability of a word starting a sentence. In our corpus, "the" starts every sentence, so P(the | <s>) = 1.0. The end marker </s> lets us model sentence endings: "mat," "fish," and "cat" all end sentences.
For higher-order n-grams, we add multiple start tokens. A trigram model needs two <s> tokens at the beginning to provide context for the first two words.
Computing Sequence Probabilities
With n-gram probabilities estimated, we can compute the probability of any sequence. We apply the chain rule with our Markov approximation:

$$P(w_1, \ldots, w_m) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})$$
where $m$ is the sequence length, $n$ is the n-gram order, and each term conditions on the previous $n-1$ words rather than the full history. For positions where $i < n$, we use start tokens <s> to provide the required context.
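A sketch of this scoring step for a bigram model, building on the `padded_tokens` from the previous snippet and using natural logarithms:

```python
import math
from collections import Counter

# Unigram and bigram counts over the padded token stream (markers included)
padded_unigrams = Counter(padded_tokens)
padded_bigrams = Counter(zip(padded_tokens, padded_tokens[1:]))

def sentence_log_probability(sentence):
    """Natural-log probability of a sentence under the bigram MLE model."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    log_prob = 0.0
    for w1, w2 in zip(words, words[1:]):
        count = padded_bigrams[(w1, w2)]
        if count == 0:
            return float("-inf")  # unseen bigram: the whole sentence gets P = 0
        log_prob += math.log(count / padded_unigrams[w1])
    return log_prob

for s in ["the dog sat on the mat", "the cat chased the dog"]:
    print(f'log P = {sentence_log_probability(s):>6.2f}  "{s}"')
```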
Sentence Log-Probabilities:
------------------------------------------------------------
log P = -3.47   "the dog sat on the mat"
log P = -3.47   "the cat sat on the mat"
log P = -inf    "the cat chased the dog"
log P = -inf    "the fish ate the cat"
Notice we use log probabilities. Multiplying many small probabilities leads to numerical underflow, where numbers become too small to represent. Taking logarithms converts products to sums, keeping values in a manageable range.

The visualization shows why log probabilities are essential. With an average per-word probability of 0.1, raw probabilities become vanishingly small after about 30 words, approaching the limits of floating-point representation. Log probabilities decrease linearly and remain computationally tractable for sentences of any length.
The rankings reveal what our model learned. Sentences similar to the training data ("the cat sat on the mat") get higher probabilities. Sentences with unseen bigrams ("the fish ate") may get zero probability, resulting in negative infinity log probability.
The Zero Probability Problem
What happens when we encounter an n-gram that never appeared in training?
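A small diagnostic sketch that walks through each bigram of an unseen sentence, reusing the padded counts from above:

```python
sentence = "the bird flew over the house"
words = ["<s>"] + sentence.split() + ["</s>"]

print(f'Analyzing: "{sentence}"')
print("-" * 50)
total_prob = 1.0
for w1, w2 in zip(words, words[1:]):
    count = padded_bigrams[(w1, w2)]
    context = padded_unigrams[w1]
    prob = count / context if context else 0.0
    flag = "✓" if count > 0 else "✗ UNSEEN"
    print(f"  ({w1}, {w2}): count={count}, P={prob:.3f} {flag}")
    total_prob *= prob
print(f"Total probability: {total_prob}")
```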
Analyzing: "the bird flew over the house" -------------------------------------------------- (<s>, the): count=4, P=1.000 ✓ (the, bird): count=0, P=0.000 ✗ UNSEEN (bird, flew): count=0, P=0.000 ✗ UNSEEN (flew, over): count=0, P=0.000 ✗ UNSEEN (over, the): count=0, P=0.000 ✗ UNSEEN (the, house): count=0, P=0.000 ✗ UNSEEN (house, </s>): count=0, P=0.000 ✗ UNSEEN Total probability: 0.0
A single unseen bigram makes the entire sequence probability zero. This is a serious problem: perfectly valid sentences get zero probability just because our training data didn't contain every possible word combination.
This is the zero probability problem. It motivates the smoothing techniques covered in the next chapter. Raw MLE estimates are too brittle for real applications.
Building a Complete N-gram Model
Let's build a complete, reusable n-gram language model class:
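A compact sketch of how such a class might look. The method names match the summary below, but the internals (whitespace tokenization, <s>/</s> padding, natural-log scoring, and the class structure itself) are assumptions rather than the handbook's exact implementation:

```python
import math
from collections import Counter


class NgramLanguageModel:
    """Simple MLE n-gram language model with <s>/</s> sentence padding."""

    def __init__(self, n=2):
        self.n = n
        self.ngram_counts = Counter()    # counts of full n-grams
        self.context_counts = Counter()  # counts of (n-1)-gram contexts
        self.vocab = set()

    def _pad(self, tokens):
        return ["<s>"] * (self.n - 1) + tokens + ["</s>"]

    def train(self, sentences):
        """sentences: an iterable of whitespace-tokenizable strings."""
        for sentence in sentences:
            tokens = self._pad(sentence.split())
            self.vocab.update(tokens)
            for i in range(len(tokens) - self.n + 1):
                ngram = tuple(tokens[i:i + self.n])
                self.ngram_counts[ngram] += 1
                self.context_counts[ngram[:-1]] += 1

    def probability(self, word, context):
        """MLE estimate of P(word | context); context is a sequence of n-1 words."""
        context = tuple(context)
        if self.context_counts[context] == 0:
            return 0.0
        return self.ngram_counts[context + (word,)] / self.context_counts[context]

    def sentence_log_probability(self, sentence):
        """Natural-log probability of a sentence; -inf if any n-gram is unseen."""
        tokens = self._pad(sentence.split())
        log_prob = 0.0
        for i in range(self.n - 1, len(tokens)):
            p = self.probability(tokens[i], tokens[i - self.n + 1:i])
            if p == 0.0:
                return float("-inf")
            log_prob += math.log(p)
        return log_prob
```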
NgramLanguageModel class defined
Methods: train(), probability(), sentence_log_probability()
The class encapsulates all the logic we've built: n-gram counting, context tracking, and probability computation. The train() method processes text and builds the count tables. The probability() and sentence_log_probability() methods query the trained model.
Let's train models of different orders and compare them:
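One possible way to run the comparison against the class sketch above; `training_corpus` stands in for whatever sentence list produced the numbers below and is only a hypothetical placeholder:

```python
training_corpus = [
    # hypothetical training sentences; the exact list used for the table below is not shown
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird flew over the tree",
]

models = {order: NgramLanguageModel(n=order) for order in (1, 2, 3)}
for model in models.values():
    model.train(training_corpus)

test_sentences = ["the cat sat on the mat", "the cat ate the bird"]
for sentence in test_sentences:
    scores = [models[order].sentence_log_probability(sentence) for order in (1, 2, 3)]
    print(f"{sentence:<30}" + "".join(f"{score:>10.2f}" for score in scores))
```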
Model Comparison (log probabilities):
======================================================================
Sentence                          Unigram    Bigram   Trigram
----------------------------------------------------------------------
the cat sat on the mat             -16.24     -5.30     -3.11
the dog chased the cat             -13.35     -5.30     -3.62
the bird flew over the tree        -19.83     -5.99     -2.71
the cat ate the bird               -13.17     -5.99      -inf
Higher-order models assign more varied probabilities because they consider more context. The trigram model can distinguish between sentences that differ only in longer-range patterns.
Generating Text from N-gram Models
N-gram models can generate new text by sampling words according to their conditional probabilities. Starting from the start token, we repeatedly sample the next word given the current context.
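A sketch of one possible sampling loop, written against the NgramLanguageModel sketch above; the helper name `generate_sentence` and the `max_words` cutoff are assumptions:

```python
import random


def generate_sentence(model, max_words=20):
    """Sample words left to right from an NgramLanguageModel until </s>."""
    context = ["<s>"] * (model.n - 1)
    words = []
    for _ in range(max_words):
        # Every word observed after this context, with its MLE probability
        candidates = {
            ngram[-1]: count / model.context_counts[tuple(context)]
            for ngram, count in model.ngram_counts.items()
            if ngram[:-1] == tuple(context)
        }
        if not candidates:
            break
        word = random.choices(list(candidates), weights=list(candidates.values()))[0]
        if word == "</s>":
            break
        words.append(word)
        context = (context + [word])[1:]  # slide the context window by one word
    return " ".join(words)


bigram_model = NgramLanguageModel(n=2)
bigram_model.train(training_corpus)  # hypothetical list of training sentences
for i in range(5):
    print(f"{i + 1}. {generate_sentence(bigram_model)}")
```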
Generated sentences (bigram model):
--------------------------------------------------
1. the bird sat on the fish
2. the bird
3. the tree
4. the dog chased the dog sat on the cat ate the dog watched the dog sat on the dog ate
5. the cat
Generated sentences (trigram model):
--------------------------------------------------
1. the dog
2. the bird
3. the bird flew over the tree
4. the dog chased the cat sat on the mat
5. the cat
The generated text reflects patterns in the training data. Bigram models produce locally coherent text but may wander incoherently over longer spans. Trigram models maintain coherence over slightly longer distances, but neither captures long-range dependencies like subject-verb agreement across clauses.
Temperature Sampling
We can control the randomness of generation with a temperature parameter. Lower temperatures make the model more deterministic, always choosing high-probability words. Higher temperatures increase diversity:
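One common way to implement this is to raise each candidate probability to the power 1/T and renormalize before sampling. A sketch, assuming a dictionary of word probabilities like the ones produced inside the generation loop above:

```python
import random


def sample_with_temperature(candidates, temperature=1.0):
    """candidates: dict mapping word -> probability. Returns one sampled word."""
    # Raise each probability to the power 1/T, then renormalize
    scaled = {w: p ** (1.0 / temperature) for w, p in candidates.items()}
    total = sum(scaled.values())
    weights = [v / total for v in scaled.values()]
    return random.choices(list(scaled), weights=weights)[0]


# Example: the bigram distribution after the context "the" in our toy corpus
dist = {"cat": 0.375, "dog": 0.25, "mat": 0.25, "fish": 0.125}
for t in (0.5, 1.0, 1.5):
    samples = [sample_with_temperature(dist, t) for _ in range(8)]
    print(f"T={t}: {samples}")
```

With T below 1 the exponent 1/T exceeds 1, which sharpens the distribution toward the most likely words; with T above 1 the distribution flattens and rarer words are sampled more often.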
Effect of Temperature on Generation:
============================================================
Temperature = 0.5:
  1. the dog
  2. the food
  3. the bird flew over the dog ate the dog ate the cat
Temperature = 1.0:
  1. the bird sat on the fish
  2. the bird
  3. the tree
Temperature = 1.5:
  1. the bird sat on the fish
  2. the bird
  3. the branch
At low temperatures, the model gravitates toward the most common patterns. At high temperatures, it explores less likely paths and sometimes produces more creative but less coherent output.

The visualization shows how temperature reshapes probability distributions. At low temperature (0.5), the highest-probability words become even more dominant, making generation more predictable. At high temperature (1.5), the distribution flattens, giving rare words a fighting chance. This trade-off between coherence and diversity is fundamental to text generation with language models.
Model Storage and Efficiency
N-gram models can grow large. With a vocabulary of $V$ words, there are $V^n$ possible n-grams. For English with $V = 100{,}000$ and $n = 3$, that's $10^{15}$ potential trigrams, so efficient storage is essential.
Sparse Storage
Most possible n-grams never occur. We store only observed n-grams using dictionaries or hash tables:
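A rough sketch of how such statistics might be gathered from the class above; the 100-bytes-per-entry constant is an arbitrary assumption for illustration, not a measured value, and `training_corpus` is again a hypothetical placeholder:

```python
def storage_stats(model, bytes_per_entry=100):
    """Rough footprint: only observed n-grams are stored, never all V**n entries."""
    n_ngrams = len(model.ngram_counts)
    n_contexts = len(model.context_counts)
    est_mb = (n_ngrams + n_contexts) * bytes_per_entry / 1e6  # assumed average cost
    return n_ngrams, n_contexts, len(model.vocab), est_mb


print(f"{'Model':<10}{'N-grams':>10}{'Contexts':>10}{'Vocab':>8}{'Est. MB':>10}")
for name, order in [("Unigram", 1), ("Bigram", 2), ("Trigram", 3)]:
    model = NgramLanguageModel(n=order)
    model.train(training_corpus)  # hypothetical list of training sentences
    print("{:<10}{:>10}{:>10}{:>8}{:>10.4f}".format(name, *storage_stats(model)))
```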
Model Storage Statistics:
------------------------------------------------------------
Model      N-grams   Contexts   Vocab   Est. MB
------------------------------------------------------------
Unigram         19          1      19    0.0023
Bigram          36         19      20    0.0068
Trigram         44         29      20    0.0120
The trigram model stores more n-grams than the bigram model because it tracks longer sequences. Our toy corpus produces tiny models measured in kilobytes, but real-world models trained on billions of words can require gigabytes of storage. This makes efficient data structures essential.
Trie-Based Storage
For faster lookups, n-grams can be stored in a trie (prefix tree). Each path from root to node represents an n-gram prefix. Nodes store counts:
Trie-based N-gram Counts:
----------------------------------------
('the',): 60
('the', 'cat'): 12
('the', 'cat', 'sat'): 1
('the', 'dog'): 10
('cat', 'sat', 'on'): 1
The trie efficiently retrieves counts for any prefix. Notice how "the" has a high count because it appears in every sentence, while longer n-grams like "the cat sat" are less frequent. Tries share prefixes between n-grams, reducing memory for models with many overlapping contexts. They also enable efficient prefix queries like "What words can follow 'the cat'?"
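A minimal nested-dictionary trie sketch that supports exactly these prefix queries; the node layout and helper names are assumptions, and the token stream reuses the `tokens` list from the counting sketch earlier:

```python
class TrieNode:
    """One node per word; children keyed by the next word in the n-gram."""

    def __init__(self):
        self.children = {}
        self.count = 0


def insert_ngram(root, ngram):
    """Increment the count stored at every prefix of the n-gram."""
    node = root
    for word in ngram:
        node = node.children.setdefault(word, TrieNode())
        node.count += 1


def prefix_count(root, prefix):
    """Count of the given prefix, or 0 if it was never observed."""
    node = root
    for word in prefix:
        if word not in node.children:
            return 0
        node = node.children[word]
    return node.count


def continuations(root, prefix):
    """Words observed immediately after the prefix, with their counts."""
    node = root
    for word in prefix:
        node = node.children.get(word, TrieNode())
    return {word: child.count for word, child in node.children.items()}


# Build a trigram trie from the toy token stream used earlier
trie_root = TrieNode()
for i in range(len(tokens) - 2):
    insert_ngram(trie_root, tokens[i:i + 3])

print(prefix_count(trie_root, ("the", "cat")))
print(continuations(trie_root, ("the", "cat")))  # what can follow "the cat"?
```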
Visualizing N-gram Distributions
Let's visualize how n-gram probabilities distribute across possible continuations:

The distributions reveal how context constrains possibilities. After "the," many words are plausible, but after "sat," almost all probability mass concentrates on "on." Longer contexts generally produce more peaked distributions, reflecting stronger constraints.
Working with NLTK
NLTK provides built-in n-gram functionality:
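A sketch of how this might look with NLTK's nltk.lm module (available since NLTK 3.4). It trains on the toy `corpus` from earlier; the exact numbers below depend on the training corpus used:

```python
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

# Tokenized training sentences (here: the toy corpus from earlier)
tokenized = [sentence.split() for sentence in corpus]

order = 3
train_data, vocab = padded_everygram_pipeline(order, tokenized)

model = MLE(order)
model.fit(train_data, vocab)

print(f"NLTK {order}-gram model trained")
print(f"Vocabulary size: {len(model.vocab)}")

# Conditional probabilities from the fitted model
print(f"P(cat | the) = {model.score('cat', ['the']):.3f}")
print(f"P(the | sat, on) = {model.score('the', ['sat', 'on']):.3f}")
```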
NLTK 3-gram model trained Vocabulary size: 21
NLTK Model Predictions:
--------------------------------------------------
Context: ['the']
  P(cat)  = 0.300
  P(dog)  = 0.250
  P(bird) = 0.150
  P(mat)  = 0.100
  P(fish) = 0.050

Context: ['the', 'cat']
  P(</s>)     = 0.333
  P(sat)      = 0.167
  P(ate)      = 0.167
  P(ran)      = 0.167
  P(watched)  = 0.167

Context: ['sat', 'on']
  P(the) = 1.000
NLTK's language modeling module handles the bookkeeping of n-gram counting, vocabulary management, and probability computation. This lets you focus on higher-level tasks.
Limitations of N-gram Models
N-gram models have served NLP well, but they face fundamental limitations.
Limited context: The Markov assumption ignores long-range dependencies. In "The cat that the dog chased ran away," the verb "ran" agrees with "cat," but a trigram model only sees "chased ran away."
Data sparsity: Most possible n-grams never occur in training data. Even with billions of words, many valid word combinations remain unseen.
No generalization: N-gram models can't recognize that "the cat sat" and "the dog sat" share similar structure. Each n-gram is treated independently.
Storage requirements: Large vocabularies and high-order n-grams require substantial memory. A trigram model for a 100,000-word vocabulary could theoretically have $100{,}000^3 = 10^{15}$ parameters.
No semantic understanding: The model doesn't know that "cat" and "feline" are related, or that "bank" has multiple meanings.

Despite these limitations, n-gram models remain useful. They're fast, interpretable, and work well for many applications. They also provide the conceptual foundation for understanding more sophisticated models.
What N-gram Models Unlocked
Before neural language models, n-grams powered critical applications.
Speech recognition: N-gram language models helped disambiguate acoustically similar phrases and improved recognition accuracy.
Machine translation: Statistical MT systems used n-gram models to score translation fluency, preferring outputs that sounded natural.
Spelling correction: N-gram probabilities identify which corrections produce more likely word sequences.
Text prediction: Early autocomplete systems used n-grams to suggest likely next words.
Information retrieval: N-gram statistics help identify important phrases and improve search ranking.
The concepts introduced by n-gram models remain central to modern language modeling: conditional probability, the chain rule, and the trade-off between context length and data sparsity. Transformers and large language models are sophisticated solutions to the problems n-gram models first revealed.
Key Parameters
When building n-gram language models, several parameters affect model behavior and performance.
n (n-gram order)
The order determines how many words of context the model considers. Common choices:
- n=1 (unigram): No context, just word frequencies. Fast but ignores word order entirely.
- n=2 (bigram): One word of context. Good balance of simplicity and performance for many tasks.
- n=3 (trigram): Two words of context. Standard choice for production systems before neural models.
- n=4 or higher: Rarely used due to severe data sparsity. Requires massive training corpora.
Higher values capture more context but exponentially increase the number of possible n-grams. This leads to sparse counts and zero probabilities for valid sequences.
temperature (for generation)
Controls randomness during text generation:
- temperature < 1.0: More deterministic, favors high-probability words. Use for predictable, conservative output.
- temperature = 1.0: Samples according to true model probabilities. Balanced creativity and coherence.
- temperature > 1.0: More random, explores low-probability words. Use for creative, diverse output at the cost of coherence.
Values between 0.7 and 1.2 work well for most applications.
Start and end tokens
- <s>: Marks sentence beginnings. Add n-1 copies for an n-gram model to provide context for the first words.
- </s>: Marks sentence endings. Enables the model to learn which words typically end sentences.
Vocabulary handling
- Case normalization: Converting to lowercase reduces vocabulary size but loses information (e.g., "Apple" vs "apple").
- Unknown token <UNK>: Replacing rare words with a special token handles out-of-vocabulary words during inference.
- Minimum frequency threshold: Excluding words that appear fewer than k times reduces noise from typos and rare terms.
Summary
N-gram language models assign probabilities to word sequences by decomposing joint probabilities into conditional probabilities using the chain rule. The Markov assumption limits context to the previous $n-1$ words, making estimation tractable.
Key takeaways:
- Chain rule decomposition breaks sequence probability into a product of conditional probabilities
- Markov assumption limits history to $n-1$ words, trading accuracy for tractability
- Maximum likelihood estimation computes probabilities as normalized n-gram counts
- Start and end tokens model sentence boundaries explicitly
- Log probabilities prevent numerical underflow when multiplying many small values
- Zero probability problem occurs when test data contains unseen n-grams
- Text generation samples words according to conditional probabilities
- Storage efficiency requires sparse data structures like tries
N-gram models reveal the fundamental tension in language modeling: longer context improves predictions but requires exponentially more data. This insight motivates both smoothing techniques, covered next, and the neural approaches that eventually superseded n-gram models.