N-gram Language Models: Probability-Based Text Generation & Prediction

Michael Brenndoerfer · December 7, 2025

Learn how n-gram language models assign probabilities to word sequences using the chain rule and Markov assumption, with implementations for text generation and scoring.


N-gram Language Models

What makes some word sequences sound natural while others feel wrong? "The cat sat on the mat" flows effortlessly, but "mat the on sat cat the" is gibberish. Native speakers intuitively know which sequences are likely, even for sentences they've never encountered. N-gram language models capture this intuition mathematically, assigning probabilities to sequences of words based on patterns learned from text.

Language models are foundational to NLP. They power spell checkers that suggest "their" over "thier," autocomplete systems that predict your next word, and speech recognizers that disambiguate "recognize speech" from "wreck a nice beach." Before neural networks dominated the field, n-gram models were the workhorses of language modeling, and understanding them provides essential intuition for modern approaches.

This chapter builds n-gram language models from first principles. You'll learn the probability theory that makes them work, implement them from scratch, and discover both their power and fundamental limitations.

The Language Modeling Problem

A language model assigns probabilities to sequences of words. Given a sequence $w_1, w_2, \ldots, w_n$, the model estimates $P(w_1, w_2, \ldots, w_n)$, the probability of that exact sequence occurring in natural language.

Language Model

A language model is a probability distribution over sequences of words. It assigns higher probabilities to sequences that are more likely to occur in natural language and lower probabilities to unlikely or ungrammatical sequences.

Why is this useful? Consider a speech recognition system that hears an ambiguous audio signal. It might produce two candidate transcriptions: "I saw a van" and "eyes awe of an." A language model knows the first is far more probable and helps the system choose correctly.

Language models also enable text generation. If we can compute $P(w_1, \ldots, w_n)$ for any sequence, we can generate new text by sampling words according to their probabilities given the preceding context.

The Chain Rule of Probability

Computing the probability of a full sequence seems daunting at first. How could we possibly estimate $P(\text{"the cat sat on the mat"})$ directly? We'd need to count how often this exact six-word sequence appears in our training data, then divide by the total number of six-word sequences. But most sentences are unique. Even in a corpus of billions of words, we're unlikely to see any specific sentence more than a handful of times.

The chain rule of probability offers an escape from this dilemma. Instead of estimating the joint probability directly, we decompose it into a product of conditional probabilities. Each conditional asks a simpler question: given everything that came before, how likely is the next word?

$$P(w_1, w_2, \ldots, w_n) = P(w_1) \times P(w_2|w_1) \times P(w_3|w_1, w_2) \times \cdots \times P(w_n|w_1, \ldots, w_{n-1})$$

This decomposition follows directly from the definition of conditional probability applied repeatedly. We can write it more compactly using product notation:

$$P(w_1, \ldots, w_n) = \prod_{i=1}^{n} P(w_i|w_1, \ldots, w_{i-1})$$

where:

  • $w_i$ is the $i$-th word in the sequence
  • $P(w_i|w_1, \ldots, w_{i-1})$ is the probability of word $w_i$ given all preceding words
  • The product $\prod$ multiplies all these conditional probabilities together

This decomposition is exact, not an approximation. The probability of any sequence equals the product of each word's probability given its entire preceding context.

Let's see this decomposition in action with a concrete example:

In[2]:
# Illustrate the chain rule decomposition
sentence = "the cat sat on the mat"
words = sentence.split()

# The chain rule breaks the joint probability into conditional probabilities
decomposition = []
for i, word in enumerate(words):
    if i == 0:
        condition = "∅"  # No conditioning for first word
        term = f"P({word})"
    else:
        context = " ".join(words[:i])
        term = f"P({word}|{context})"
    decomposition.append(term)
Out[3]:
Sentence: "the cat sat on the mat"

Chain rule decomposition:
P(the cat sat on the mat) =
    P(the)
  × P(cat|the)
  × P(sat|the cat)
  × P(on|the cat sat)
  × P(the|the cat sat on)
  × P(mat|the cat sat on the)

Each term in this product answers a progressively more specific question. The first asks: how likely is "the" to start a sentence? The second asks: given we started with "the," how likely is "cat" to follow? By the time we reach the final word, we're asking: given the entire context "the cat sat on the," how likely is "mat"?

The chain rule is mathematically exact, but it creates a severe practical problem. To compute $P(w_n|w_1, \ldots, w_{n-1})$, we need the probability of word $w_n$ given the entire preceding context. For long sentences, this context can be arbitrarily long. A 20-word sentence requires us to estimate the probability of the 20th word given all 19 preceding words. We'll never have enough data to estimate these long-context probabilities reliably, since most 19-word contexts appear only once, if ever.

This is where the Markov assumption comes to the rescue.

The Markov Assumption

The chain rule gives us an exact decomposition, but it demands something impossible: reliable estimates of probabilities conditioned on arbitrarily long histories. The key insight of n-gram models is to make a simplifying assumption that trades some accuracy for tractability.

Markov Assumption

The Markov assumption states that the probability of a word depends only on a fixed-length history of preceding words, not the entire sequence. An n-gram model assumes the probability of word $w_i$ depends only on the previous $n-1$ words.

The assumption is named after the Russian mathematician Andrey Markov, who studied stochastic processes with this "memoryless" property. In the context of language, it means we pretend the future depends only on the recent past, not the distant past.

For a bigram model ($n=2$), each word depends only on the single word immediately before it:

$$P(w_i|w_1, \ldots, w_{i-1}) \approx P(w_i|w_{i-1})$$

For a trigram model ($n=3$), the memory extends to two previous words:

$$P(w_i|w_1, \ldots, w_{i-1}) \approx P(w_i|w_{i-2}, w_{i-1})$$

The pattern continues for higher orders: an n-gram model conditions on exactly $n-1$ previous words. The approximation symbol $\approx$ reminds us this is a simplification, not an equality.

Why does this help? Consider the difference in what we need to estimate:

  • Full history: We need $P(\text{mat}|\text{the cat sat on the})$, requiring counts of the specific five-word context "the cat sat on the"
  • Trigram: We need $P(\text{mat}|\text{on the})$, requiring only counts of the two-word context "on the"
  • Bigram: We need $P(\text{mat}|\text{the})$, requiring only counts of the single word "the"

Shorter contexts appear far more frequently in training data, giving us more reliable probability estimates.

In[4]:
# Compare full history vs Markov approximations
sentence = "the cat sat on the mat"
words = sentence.split()

# Different n-gram approximations for P(mat | the cat sat on the)
full_history = "P(mat | the cat sat on the)"
trigram = "P(mat | on the)"
bigram = "P(mat | the)"
unigram = "P(mat)"

approximations = [
    ("Full history", full_history, "exact but impractical"),
    ("Trigram (n=3)", trigram, "considers 2 previous words"),
    ("Bigram (n=2)", bigram, "considers 1 previous word"),
    ("Unigram (n=1)", unigram, "ignores all context"),
]
Out[5]:
Markov Approximations for the final word:
------------------------------------------------------------

Full history:
  P(mat | the cat sat on the)
  → exact but impractical

Trigram (n=3):
  P(mat | on the)
  → considers 2 previous words

Bigram (n=2):
  P(mat | the)
  → considers 1 previous word

Unigram (n=1):
  P(mat)
  → ignores all context

Is the Markov assumption reasonable? Strictly speaking, no. The probability of "mat" after "the cat sat on the" depends on "cat" and "sat," which establish that we're talking about an animal sitting somewhere. A bigram model, seeing only "the," has no idea whether we're discussing cats on mats, birds in trees, or books on shelves.

Yet this approximation works well in practice, especially for larger values of $n$. Language has enough local structure that nearby words carry most of the predictive information. The word "on" strongly predicts a location or surface will follow. The phrase "sat on" even more strongly suggests something like "mat," "chair," or "floor." We lose some information, but we gain the ability to estimate probabilities reliably.

Maximum Likelihood Estimation

With the Markov assumption in hand, we've reduced our problem to estimating conditional probabilities like $P(\text{mat}|\text{on the})$. But how do we actually compute these from data?

The most straightforward approach is maximum likelihood estimation (MLE): count how often each n-gram appears in our training corpus, then normalize. The probability of word $w$ following context $c$ equals the fraction of times we observed $c$ followed by $w$.

For bigrams, the MLE estimate is:

$$P(w_i|w_{i-1}) = \frac{C(w_{i-1}, w_i)}{C(w_{i-1})}$$

where:

  • $C(w_{i-1}, w_i)$ is the count of the bigram $(w_{i-1}, w_i)$ in the training corpus
  • $C(w_{i-1})$ is the count of the context word $w_{i-1}$

The formula says: to find the probability of word $w_i$ following word $w_{i-1}$, count how many times the bigram $(w_{i-1}, w_i)$ appears in the corpus, then divide by the total count of $w_{i-1}$.

For trigrams, we extend the same logic to two-word contexts:

$$P(w_i|w_{i-2}, w_{i-1}) = \frac{C(w_{i-2}, w_{i-1}, w_i)}{C(w_{i-2}, w_{i-1})}$$

where:

  • $C(w_{i-2}, w_{i-1}, w_i)$ is the count of the trigram
  • $C(w_{i-2}, w_{i-1})$ is the count of the preceding bigram context

This pattern applies to any n-gram order: divide the count of the full n-gram by the count of its $(n-1)$-gram prefix.

Let's implement this and see it work on a small corpus:

In[6]:
from collections import defaultdict, Counter

def count_ngrams(tokens, n):
    """Count n-grams in a token sequence."""
    ngram_counts = Counter()
    for i in range(len(tokens) - n + 1):
        ngram = tuple(tokens[i:i+n])
        ngram_counts[ngram] += 1
    return ngram_counts

# Sample corpus
corpus = """
the cat sat on the mat
the cat ate the fish
the dog sat on the mat
the dog chased the cat
"""

# Tokenize
tokens = corpus.lower().split()

# Count unigrams and bigrams
unigram_counts = count_ngrams(tokens, 1)
bigram_counts = count_ngrams(tokens, 2)
Out[7]:
Corpus tokens: 22
Unique unigrams: 9
Unique bigrams: 14

Top 10 bigrams:
  ('the', 'cat'): 3
  ('sat', 'on'): 2
  ('on', 'the'): 2
  ('the', 'mat'): 2
  ('mat', 'the'): 2
  ('the', 'dog'): 2
  ('cat', 'sat'): 1
  ('cat', 'ate'): 1
  ('ate', 'the'): 1
  ('the', 'fish'): 1

Our small corpus contains 22 tokens but only 9 unique words. The most frequent bigram is ("the", "cat"), which appears three times; several others, such as ("sat", "on") and ("the", "dog"), appear twice. Notice how "the" dominates as the first element of many bigrams, reflecting its role as the most common English word.

Out[8]:
Visualization
Bar charts showing frequency distributions for unigrams and bigrams, both exhibiting long-tail patterns.
N-gram frequency distributions for unigrams and bigrams. Both follow a Zipf-like pattern: a few n-grams are very common, while most appear only once or twice. This long-tail distribution is characteristic of natural language and explains why data sparsity is such a challenge for n-gram models.

The frequency distributions reveal a pattern common to all natural language: a few items dominate while most appear rarely. "The" appears 8 times, more than any other word, while most bigrams appear only once. This Zipf-like distribution explains why n-gram models struggle with data sparsity. As we move to higher-order n-grams, the long tail grows longer, and most n-grams remain unseen even in large corpora.

Now let's apply the MLE formula to compute actual probabilities:

In[9]:
def bigram_probability(w1, w2, bigram_counts, unigram_counts):
    """Calculate P(w2|w1) using MLE."""
    bigram = (w1, w2)
    unigram = (w1,)
    
    bigram_count = bigram_counts.get(bigram, 0)
    unigram_count = unigram_counts.get(unigram, 0)
    
    if unigram_count == 0:
        return 0.0
    
    return bigram_count / unigram_count

# Calculate some example probabilities
examples = [
    ("the", "cat"),
    ("the", "dog"),
    ("the", "mat"),
    ("cat", "sat"),
    ("sat", "on"),
    ("on", "the"),
]

probabilities = [(w1, w2, bigram_probability(w1, w2, bigram_counts, unigram_counts)) 
                 for w1, w2 in examples]
Out[10]:
Bigram Probabilities P(w2|w1):
----------------------------------------
P(cat|the) = 3/8 = 0.375
P(dog|the) = 2/8 = 0.250
P(mat|the) = 2/8 = 0.250
P(sat|cat) = 1/3 = 0.333
P(on|sat) = 2/2 = 1.000
P(the|on) = 2/2 = 1.000

The probabilities reveal patterns in our corpus. After "the," the word "cat" follows 37.5% of the time, while "dog" and "mat" each follow 25% of the time. After "sat," the word "on" appears with 100% probability, reflecting the fixed phrase "sat on" that appears in every relevant sentence. After "on," the word "the" follows every time, since our corpus always uses "on the" before the final noun.

These probabilities capture the local structure of our tiny corpus. A model trained on this data would predict that "sat" is always followed by "on," and "on" is always followed by "the." Of course, with only four sentences, these patterns are artifacts of our limited data. A larger corpus would reveal more nuanced distributions.
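
As a quick sanity check, the MLE conditional probabilities out of any given context should sum to 1 over the continuations observed in the corpus. The short sketch below verifies this for the context "the," reusing the bigram_probability function and the count dictionaries defined above.

# Sanity check: P(w|context) sums to 1 over observed continuations
context = "the"
total = sum(
    bigram_probability(w1, w2, bigram_counts, unigram_counts)
    for (w1, w2) in bigram_counts
    if w1 == context
)
print(f"Sum of P(w|{context}) over observed continuations: {total:.3f}")
# Expected: 1.000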

We can visualize the complete bigram probability matrix as a heatmap, where each cell shows $P(w_j|w_i)$, the probability of word $w_j$ following word $w_i$:

Out[11]:
Visualization
Heatmap showing bigram transition probabilities between words in the corpus.
Bigram transition probability matrix. Each row represents a context word, and each column represents a possible next word. Darker cells indicate higher probabilities. The sparse structure reveals that most word pairs never occur together, while a few transitions dominate (e.g., 'sat' → 'on' has probability 1.0).

The heatmap reveals the sparse structure of bigram probabilities. Most cells are empty (zero probability), indicating word pairs that never occurred in our corpus. The non-zero cells cluster around specific patterns: "the" can be followed by several words, but "sat" is always followed by "on." This sparsity becomes even more pronounced with larger vocabularies and higher-order n-grams.

Start and End Tokens

Sentences have beginnings and endings. To model this, we add special tokens: <s> marks the start of a sentence and </s> marks the end. These tokens let us model which words typically start sentences and which words end them.

In[12]:
def add_sentence_markers(text, n):
    """Add start and end markers to sentences."""
    sentences = text.strip().split('\n')
    marked_sentences = []
    
    for sentence in sentences:
        if sentence.strip():
            tokens = sentence.lower().split()
            # Add n-1 start tokens and 1 end token
            marked = ['<s>'] * (n - 1) + tokens + ['</s>']
            marked_sentences.append(marked)
    
    return marked_sentences

# Process corpus with markers for bigram model
marked_corpus = add_sentence_markers(corpus, n=2)

# Flatten for counting
all_tokens = [token for sentence in marked_corpus for token in sentence]

# Recount with markers
unigram_counts_marked = count_ngrams(all_tokens, 1)
bigram_counts_marked = count_ngrams(all_tokens, 2)
Out[13]:
Sentences with markers (bigram model):
  1. <s> the cat sat on the mat </s>
  2. <s> the cat ate the fish </s>
  3. <s> the dog sat on the mat </s>
  4. <s> the dog chased the cat </s>

Bigrams involving markers:
  ('<s>', 'the'): 4
  ('</s>', '<s>'): 3
  ('mat', '</s>'): 2
  ('fish', '</s>'): 1
  ('cat', '</s>'): 1

The start marker <s> lets us compute $P(w_1|\text{<s>})$, the probability of a word starting a sentence. In our corpus, "the" starts every sentence, so $P(\text{the}|\text{<s>}) = 1.0$. The end marker </s> lets us model sentence endings: "mat," "fish," and "cat" all end sentences.

For higher-order n-grams, we add multiple start tokens. A trigram model needs two <s> tokens at the beginning to provide context for the first two words.
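
For example, calling the add_sentence_markers helper defined above with n=3 pads each sentence with two start tokens:

# Trigram padding: two <s> tokens give the first word a full two-token context
marked_trigram = add_sentence_markers(corpus, n=3)
print(" ".join(marked_trigram[0]))
# <s> <s> the cat sat on the mat </s>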

Computing Sequence Probabilities

With n-gram probabilities estimated, we can compute the probability of any sequence. We apply the chain rule with our Markov approximation:

$$P(w_1, \ldots, w_m) \approx \prod_{i=1}^{m} P(w_i|w_{i-n+1}, \ldots, w_{i-1})$$

where $m$ is the sequence length, $n$ is the n-gram order, and each term conditions on the previous $n-1$ words rather than the full history. For positions where $i < n$, we use start tokens <s> to provide the required context.

In[14]:
import math

def sequence_probability(sentence, bigram_counts, unigram_counts, log=True):
    """Calculate the probability of a sentence using bigram model."""
    # Add markers
    tokens = ['<s>'] + sentence.lower().split() + ['</s>']
    
    if log:
        # Use log probabilities to avoid underflow
        log_prob = 0.0
        for i in range(1, len(tokens)):
            w1, w2 = tokens[i-1], tokens[i]
            prob = bigram_probability(w1, w2, bigram_counts, unigram_counts)
            if prob == 0:
                return float('-inf')  # Log of zero
            log_prob += math.log(prob)
        return log_prob
    else:
        # Regular probability (prone to underflow)
        prob = 1.0
        for i in range(1, len(tokens)):
            w1, w2 = tokens[i-1], tokens[i]
            prob *= bigram_probability(w1, w2, bigram_counts, unigram_counts)
        return prob

# Test sentences
test_sentences = [
    "the cat sat on the mat",
    "the dog sat on the mat",
    "the cat chased the dog",
    "the fish ate the cat",
]

results = [(s, sequence_probability(s, bigram_counts_marked, unigram_counts_marked, log=True))
           for s in test_sentences]
Out[15]:
Sentence Log-Probabilities:
------------------------------------------------------------
  log P =   -3.47  "the dog sat on the mat"
  log P =   -3.47  "the cat sat on the mat"
  log P =    -inf  "the cat chased the dog"
  log P =    -inf  "the fish ate the cat"

Notice we use log probabilities. Multiplying many small probabilities leads to numerical underflow, where numbers become too small to represent. Taking logarithms converts products to sums, keeping values in a manageable range.
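
The sketch below makes the underflow concrete. It assumes a constant per-word probability of 0.1 purely for illustration: the raw product eventually underflows to zero in double-precision floating point, while the log-probability sum stays easily representable.

import math

# Raw products of small probabilities underflow; sums of logs do not
p = 0.1  # assumed average per-word probability (illustrative)
for length in (10, 100, 500):
    raw = p ** length               # shrinks by a factor of 10 per word
    log_sum = length * math.log(p)  # decreases linearly
    print(f"{length:4d} words: raw={raw:.3e}   log={log_sum:9.1f}")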

Out[16]:
Visualization
Two line plots comparing raw probabilities versus log probabilities across sentence lengths.
Why log probabilities are essential. The left panel shows how raw probabilities shrink exponentially as sentence length increases, quickly becoming too small to represent. The right panel shows the same data on a log scale, where the values remain in a computationally tractable range.

The visualization shows why log probabilities are essential. With an average per-word probability of 0.1, the raw sequence probability shrinks by a factor of ten with every additional word, quickly becoming too small to work with and eventually underflowing to zero. Log probabilities decrease linearly and remain computationally tractable for sentences of any length.

The rankings reveal what our model learned. Sentences similar to the training data ("the cat sat on the mat") get higher probabilities. Sentences containing unseen bigrams, such as "the fish ate," receive zero probability, which corresponds to a log probability of negative infinity.

The Zero Probability Problem

What happens when we encounter an n-gram that never appeared in training?

In[17]:
# Test with an unseen sentence
unseen_sentence = "the bird flew over the house"
unseen_tokens = ['<s>'] + unseen_sentence.lower().split() + ['</s>']

# Check each bigram
bigram_analysis = []
for i in range(1, len(unseen_tokens)):
    w1, w2 = unseen_tokens[i-1], unseen_tokens[i]
    bigram = (w1, w2)
    count = bigram_counts_marked.get(bigram, 0)
    prob = bigram_probability(w1, w2, bigram_counts_marked, unigram_counts_marked)
    bigram_analysis.append((w1, w2, count, prob))
Out[18]:
Analyzing: "the bird flew over the house"
--------------------------------------------------
  (<s>, the): count=4, P=1.000 ✓
  (the, bird): count=0, P=0.000 ✗ UNSEEN
  (bird, flew): count=0, P=0.000 ✗ UNSEEN
  (flew, over): count=0, P=0.000 ✗ UNSEEN
  (over, the): count=0, P=0.000 ✗ UNSEEN
  (the, house): count=0, P=0.000 ✗ UNSEEN
  (house, </s>): count=0, P=0.000 ✗ UNSEEN

Total probability: 0.0

A single unseen bigram makes the entire sequence probability zero. This is a serious problem: perfectly valid sentences get zero probability just because our training data didn't contain every possible word combination.

This is the zero probability problem. It motivates the smoothing techniques covered in the next chapter. Raw MLE estimates are too brittle for real applications.
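
As a small preview of where the next chapter goes, here is a minimal sketch of add-one (Laplace) smoothing, one of the simplest smoothing techniques. It reuses the marked count dictionaries from above and pretends every bigram was seen one extra time, so no probability is ever exactly zero. The function name and the choice of add-one smoothing are illustrative, not the only fix.

def laplace_bigram_probability(w1, w2, bigram_counts, unigram_counts, vocab_size):
    """Add-one smoothed estimate of P(w2|w1): (C(w1,w2) + 1) / (C(w1) + V)."""
    bigram_count = bigram_counts.get((w1, w2), 0)
    unigram_count = unigram_counts.get((w1,), 0)
    return (bigram_count + 1) / (unigram_count + vocab_size)

vocab_size = len(unigram_counts_marked)
# The unseen bigram ("the", "bird") now gets a small but nonzero probability
print(laplace_bigram_probability("the", "bird", bigram_counts_marked, unigram_counts_marked, vocab_size))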

Building a Complete N-gram Model

Let's build a complete, reusable n-gram language model class:

In[19]:
class NgramLanguageModel:
    """An n-gram language model with MLE estimation."""
    
    def __init__(self, n=2):
        self.n = n
        self.ngram_counts = Counter()
        self.context_counts = Counter()
        self.vocab = set()
        
    def train(self, text):
        """Train the model on text."""
        sentences = text.strip().split('\n')
        
        for sentence in sentences:
            if not sentence.strip():
                continue
            
            # Tokenize and add markers
            tokens = sentence.lower().split()
            tokens = ['<s>'] * (self.n - 1) + tokens + ['</s>']
            
            # Update vocabulary
            self.vocab.update(tokens)
            
            # Count n-grams and (n-1)-grams
            for i in range(len(tokens) - self.n + 1):
                ngram = tuple(tokens[i:i + self.n])
                context = tuple(tokens[i:i + self.n - 1])
                
                self.ngram_counts[ngram] += 1
                self.context_counts[context] += 1
    
    def probability(self, word, context):
        """Calculate P(word|context) using MLE."""
        if isinstance(context, str):
            context = tuple(context.split())
        
        ngram = context + (word,)
        
        ngram_count = self.ngram_counts.get(ngram, 0)
        context_count = self.context_counts.get(context, 0)
        
        if context_count == 0:
            return 0.0
        
        return ngram_count / context_count
    
    def sentence_log_probability(self, sentence):
        """Calculate log probability of a sentence."""
        tokens = sentence.lower().split()
        tokens = ['<s>'] * (self.n - 1) + tokens + ['</s>']
        
        log_prob = 0.0
        for i in range(self.n - 1, len(tokens)):
            context = tuple(tokens[i - self.n + 1:i])
            word = tokens[i]
            
            prob = self.probability(word, context)
            if prob == 0:
                return float('-inf')
            
            log_prob += math.log(prob)
        
        return log_prob
Out[20]:
NgramLanguageModel class defined
Methods: train(), probability(), sentence_log_probability()

The class encapsulates all the logic we've built: n-gram counting, context tracking, and probability computation. The train() method processes text and builds the count tables. The probability() and sentence_log_probability() methods query the trained model.

Let's train models of different orders and compare them:

In[21]:
# Larger training corpus
training_corpus = """
the cat sat on the mat
the cat ate the fish
the dog sat on the mat
the dog chased the cat
the cat ran from the dog
the dog ate the food
the bird flew over the tree
the bird sat on the branch
the cat watched the bird
the dog watched the cat
"""

# Train unigram, bigram, and trigram models
unigram_model = NgramLanguageModel(n=1)
bigram_model = NgramLanguageModel(n=2)
trigram_model = NgramLanguageModel(n=3)

unigram_model.train(training_corpus)
bigram_model.train(training_corpus)
trigram_model.train(training_corpus)

# Compare on test sentences
test_sentences = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird flew over the tree",
    "the cat ate the bird",
]

model_comparison = []
for sentence in test_sentences:
    scores = {
        'unigram': unigram_model.sentence_log_probability(sentence),
        'bigram': bigram_model.sentence_log_probability(sentence),
        'trigram': trigram_model.sentence_log_probability(sentence),
    }
    model_comparison.append((sentence, scores))
Out[22]:
Model Comparison (log probabilities):
======================================================================
Sentence                               Unigram     Bigram    Trigram
----------------------------------------------------------------------
the cat sat on the mat                  -16.24      -5.30      -3.11
the dog chased the cat                  -13.35      -5.30      -3.62
the bird flew over the tree             -19.83      -5.99      -2.71
the cat ate the bird                    -13.17      -5.99       -inf

Higher-order models assign more varied probabilities because they consider more context, and the trigram model can distinguish sentences that differ only in longer-range patterns. The last row also shows the cost of that extra context: "the cat ate the bird" receives negative infinity from the trigram model because the trigram "ate the bird" never appears in training, while the bigram model, which has seen every bigram in the sentence, still assigns it a finite score.

Generating Text from N-gram Models

N-gram models can generate new text by sampling words according to their conditional probabilities. Starting from the start token, we repeatedly sample the next word given the current context.

In[23]:
import random

def generate_sentence(model, max_length=20, seed=None):
    """Generate a sentence from an n-gram model."""
    if seed is not None:
        random.seed(seed)
    
    # Start with n-1 start tokens
    tokens = ['<s>'] * (model.n - 1)
    
    for _ in range(max_length):
        context = tuple(tokens[-(model.n - 1):])
        
        # Get all possible next words and their probabilities
        candidates = []
        for ngram, count in model.ngram_counts.items():
            if ngram[:-1] == context:
                word = ngram[-1]
                prob = model.probability(word, context)
                candidates.append((word, prob))
        
        if not candidates:
            break
        
        # Sample according to probabilities
        words, probs = zip(*candidates)
        next_word = random.choices(words, weights=probs, k=1)[0]
        
        if next_word == '</s>':
            break
        
        tokens.append(next_word)
    
    # Remove start tokens and return
    return ' '.join(tokens[model.n - 1:])

# Generate sentences from bigram model
generated_bigram = [generate_sentence(bigram_model, seed=i) for i in range(5)]
Out[24]:
Generated sentences (bigram model):
--------------------------------------------------
  1. the bird sat on the fish
  2. the bird
  3. the tree
  4. the dog chased the dog sat on the cat ate the dog watched the dog sat on the dog ate
  5. the cat
In[25]:
# Generate from trigram model
generated_trigram = [generate_sentence(trigram_model, seed=i) for i in range(5)]
Out[26]:
Generated sentences (trigram model):
--------------------------------------------------
  1. the dog
  2. the bird
  3. the bird flew over the tree
  4. the dog chased the cat sat on the mat
  5. the cat

The generated text reflects patterns in the training data. Bigram models produce locally coherent text but may wander incoherently over longer spans. Trigram models maintain coherence over slightly longer distances, but neither captures long-range dependencies like subject-verb agreement across clauses.

Temperature Sampling

We can control the randomness of generation with a temperature parameter. Lower temperatures make the model more deterministic, always choosing high-probability words. Higher temperatures increase diversity:

In[27]:
def generate_with_temperature(model, temperature=1.0, max_length=20, seed=None):
    """Generate with temperature-scaled probabilities."""
    if seed is not None:
        random.seed(seed)
    
    tokens = ['<s>'] * (model.n - 1)
    
    for _ in range(max_length):
        context = tuple(tokens[-(model.n - 1):])
        
        candidates = []
        for ngram, count in model.ngram_counts.items():
            if ngram[:-1] == context:
                word = ngram[-1]
                prob = model.probability(word, context)
                candidates.append((word, prob))
        
        if not candidates:
            break
        
        words, probs = zip(*candidates)
        
        # Apply temperature scaling
        if temperature != 1.0:
            log_probs = [math.log(p + 1e-10) / temperature for p in probs]
            max_log = max(log_probs)
            scaled_probs = [math.exp(lp - max_log) for lp in log_probs]
            total = sum(scaled_probs)
            probs = tuple(p / total for p in scaled_probs)
        
        next_word = random.choices(words, weights=probs, k=1)[0]
        
        if next_word == '</s>':
            break
        
        tokens.append(next_word)
    
    return ' '.join(tokens[model.n - 1:])

# Compare different temperatures
temperatures = [0.5, 1.0, 1.5]
temp_samples = {t: [generate_with_temperature(bigram_model, temperature=t, seed=i) 
                    for i in range(3)] 
                for t in temperatures}
Out[28]:
Effect of Temperature on Generation:
============================================================

Temperature = 0.5:
  1. the dog
  2. the food
  3. the bird flew over the dog ate the dog ate the cat

Temperature = 1.0:
  1. the bird sat on the fish
  2. the bird
  3. the tree

Temperature = 1.5:
  1. the bird sat on the fish
  2. the bird
  3. the branch

At low temperatures, the model gravitates toward the most common patterns. At high temperatures, it explores less likely paths and sometimes produces more creative but less coherent output.

Out[29]:
Visualization
Three bar charts showing how temperature affects the probability distribution over next words.
Effect of temperature on probability distributions. At temperature 0.5, the distribution becomes sharper, concentrating probability on the most likely words. At temperature 1.0, we use the original probabilities. At temperature 1.5, the distribution flattens, giving more weight to unlikely words.

The visualization shows how temperature reshapes probability distributions. At low temperature (0.5), the highest-probability words become even more dominant, making generation more predictable. At high temperature (1.5), the distribution flattens, giving rare words a fighting chance. This trade-off between coherence and diversity is fundamental to text generation with language models.
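
The reshaping is easy to see on a toy distribution. The sketch below applies essentially the same scaling used in generate_with_temperature, raising each probability to the power 1/T and renormalizing; the three base probabilities are made-up values for illustration.

import math

def apply_temperature(probs, temperature):
    """Rescale a distribution as p^(1/T), then renormalize."""
    scaled = [math.exp(math.log(p) / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]

base = [0.6, 0.3, 0.1]  # hypothetical P(next word | context)
for t in (0.5, 1.0, 1.5):
    reshaped = [round(p, 3) for p in apply_temperature(base, t)]
    print(f"T={t}: {reshaped}")
# Low T sharpens the distribution; high T flattens it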

Model Storage and Efficiency

N-gram models can grow large. With a vocabulary of $V$ words, there are $V^n$ possible n-grams. For English with $V = 100{,}000$ and $n = 3$, that's $10^{15}$ potential trigrams, so efficient storage is essential.

Sparse Storage

Most possible n-grams never occur. We store only observed n-grams using dictionaries or hash tables:

In[30]:
import sys

def estimate_model_size(model):
    """Estimate memory usage of an n-gram model."""
    # Count stored n-grams
    num_ngrams = len(model.ngram_counts)
    num_contexts = len(model.context_counts)
    vocab_size = len(model.vocab)
    
    # Estimate memory (rough approximation)
    # Each n-gram tuple: ~n * 50 bytes (string references)
    # Each count: 28 bytes (Python int)
    ngram_memory = num_ngrams * (model.n * 50 + 28)
    context_memory = num_contexts * ((model.n - 1) * 50 + 28)
    vocab_memory = vocab_size * 50
    
    total_bytes = ngram_memory + context_memory + vocab_memory
    
    return {
        'num_ngrams': num_ngrams,
        'num_contexts': num_contexts,
        'vocab_size': vocab_size,
        'estimated_mb': total_bytes / (1024 * 1024)
    }

# Check model sizes
sizes = {
    'Unigram': estimate_model_size(unigram_model),
    'Bigram': estimate_model_size(bigram_model),
    'Trigram': estimate_model_size(trigram_model),
}
Out[31]:
Model Storage Statistics:
------------------------------------------------------------
Model         N-grams   Contexts    Vocab    Est. MB
------------------------------------------------------------
Unigram            19          1       19     0.0023
Bigram             36         19       20     0.0068
Trigram            44         29       20     0.0120

The trigram model stores more n-grams than the bigram model because it tracks longer sequences. Our toy corpus produces tiny models measured in kilobytes, but real-world models trained on billions of words can require gigabytes of storage. This makes efficient data structures essential.

Trie-Based Storage

For faster lookups, n-grams can be stored in a trie (prefix tree). Each path from root to node represents an n-gram prefix. Nodes store counts:

In[32]:
class TrieNode:
    """A node in the n-gram trie."""
    def __init__(self):
        self.children = {}
        self.count = 0

class TrieNgramModel:
    """N-gram model using trie storage."""
    
    def __init__(self, n=2):
        self.n = n
        self.root = TrieNode()
        self.vocab = set()
    
    def _add_ngram(self, ngram):
        """Add an n-gram to the trie."""
        node = self.root
        for token in ngram:
            if token not in node.children:
                node.children[token] = TrieNode()
            node = node.children[token]
            node.count += 1
    
    def train(self, text):
        """Train on text."""
        sentences = text.strip().split('\n')
        
        for sentence in sentences:
            if not sentence.strip():
                continue
            
            tokens = sentence.lower().split()
            tokens = ['<s>'] * (self.n - 1) + tokens + ['</s>']
            self.vocab.update(tokens)
            
            # Add all n-grams up to order n
            for i in range(len(tokens)):
                for j in range(1, min(self.n + 1, len(tokens) - i + 1)):
                    ngram = tuple(tokens[i:i+j])
                    self._add_ngram(ngram)
    
    def get_count(self, ngram):
        """Get count of an n-gram."""
        if isinstance(ngram, str):
            ngram = tuple(ngram.split())
        
        node = self.root
        for token in ngram:
            if token not in node.children:
                return 0
            node = node.children[token]
        return node.count

# Build trie model
trie_model = TrieNgramModel(n=3)
trie_model.train(training_corpus)
Out[33]:
Trie-based N-gram Counts:
----------------------------------------
  ('the',): 60
  ('the', 'cat'): 12
  ('the', 'cat', 'sat'): 1
  ('the', 'dog'): 10
  ('cat', 'sat', 'on'): 1

The trie efficiently retrieves a count for any stored prefix. Note that a node's count reflects how many stored n-grams pass through that prefix: "the" occurs 20 times in the training corpus, but its node shows 60 because each occurrence starts a unigram, a bigram, and a trigram, while longer prefixes like "the cat sat" are correspondingly rarer. Tries share prefixes between n-grams, reducing memory for models with many overlapping contexts. They also enable efficient prefix queries like "What words can follow 'the cat'?"
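
Prefix queries fall out of the trie structure naturally: walk down the path for the context, then read off that node's children. The helper below is a small sketch added on top of the TrieNgramModel defined above (it is not part of the class).

def successors(model, prefix):
    """Return the words observed after a prefix, with their trie counts."""
    node = model.root
    for token in prefix:
        if token not in node.children:
            return {}
        node = node.children[token]
    return {word: child.count for word, child in node.children.items()}

# What words can follow "the cat"?
print(successors(trie_model, ("the", "cat")))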

Visualizing N-gram Distributions

Let's visualize how n-gram probabilities distribute across possible continuations:

Out[34]:
Visualization
Bar charts showing probability distributions for next word given three different contexts.
Probability distribution over next words given different contexts. The bigram model shows how context constrains the set of likely continuations. After 'the,' many words are possible. After 'sat,' 'on' dominates. More specific contexts yield more peaked distributions.

The distributions reveal how context constrains possibilities. After "the," many words are plausible, but after "sat," almost all probability mass concentrates on "on." Longer contexts generally produce more peaked distributions, reflecting stronger constraints.

Working with NLTK

NLTK provides built-in n-gram functionality:

In[35]:
import nltk
from nltk.util import ngrams
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import MLE

# Download required data
try:
    nltk.data.find('tokenizers/punkt_tab')
except LookupError:
    nltk.download('punkt_tab', quiet=True)

# Prepare training data
train_sentences = [s.lower().split() for s in training_corpus.strip().split('\n') if s.strip()]

# Create padded n-grams for training
n = 3
train_data, padded_vocab = padded_everygram_pipeline(n, train_sentences)

# Train MLE model
nltk_model = MLE(n)
nltk_model.fit(train_data, padded_vocab)
Out[36]:
NLTK 3-gram model trained
Vocabulary size: 21
In[37]:
# Query the NLTK model
test_contexts = [
    ['the'],
    ['the', 'cat'],
    ['sat', 'on'],
]

nltk_results = []
for context in test_contexts:
    # Get probability distribution
    probs = {}
    for word in nltk_model.vocab:
        try:
            p = nltk_model.score(word, context)
            if p > 0:
                probs[word] = p
        except:
            pass
    
    # Get top predictions
    top_words = sorted(probs.items(), key=lambda x: -x[1])[:5]
    nltk_results.append((context, top_words))
Out[38]:
NLTK Model Predictions:
--------------------------------------------------

Context: ['the']
  P(cat) = 0.300
  P(dog) = 0.250
  P(bird) = 0.150
  P(mat) = 0.100
  P(fish) = 0.050

Context: ['the', 'cat']
  P(</s>) = 0.333
  P(sat) = 0.167
  P(ate) = 0.167
  P(ran) = 0.167
  P(watched) = 0.167

Context: ['sat', 'on']
  P(the) = 1.000

NLTK's language modeling module handles the bookkeeping of n-gram counting, vocabulary management, and probability computation. This lets you focus on higher-level tasks.
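
NLTK's language models can also generate text. The call below uses the generate() method exposed by nltk.lm models; the seed word, word count, and random seed are arbitrary choices, and the exact output depends on the NLTK version.

# Sample 10 words from the trained trigram model, seeded with "the"
generated = nltk_model.generate(num_words=10, text_seed=["the"], random_seed=42)
print(" ".join(w for w in generated if w not in ("<s>", "</s>")))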

Limitations of N-gram Models

N-gram models have served NLP well, but they face fundamental limitations.

Limited context: The Markov assumption ignores long-range dependencies. In "The cat that the dog chased ran away," the verb "ran" agrees with "cat," but a trigram model predicting "ran" conditions only on "dog chased" and never sees the subject "cat."

Data sparsity: Most possible n-grams never occur in training data. Even with billions of words, many valid word combinations remain unseen.

No generalization: N-gram models can't recognize that "the cat sat" and "the dog sat" share similar structure. Each n-gram is treated independently.

Storage requirements: Large vocabularies and high-order n-grams require substantial memory. A trigram model for a 100,000-word vocabulary could theoretically have $10^{15}$ parameters.

No semantic understanding: The model doesn't know that "cat" and "feline" are related, or that "bank" has multiple meanings.

Out[39]:
Visualization
Line chart showing decreasing n-gram coverage as order increases from 1 to 5.
The trade-off between n-gram order and data sparsity. Higher-order n-grams capture more context but suffer from increased sparsity. The chart shows the percentage of test n-grams that were seen during training, illustrating how coverage drops dramatically as n increases.

Despite these limitations, n-gram models remain useful. They're fast, interpretable, and work well for many applications. They also provide the conceptual foundation for understanding more sophisticated models.

What N-gram Models Unlocked

Before neural language models, n-grams powered critical applications.

Speech recognition: N-gram language models helped disambiguate acoustically similar phrases and improved recognition accuracy.

Machine translation: Statistical MT systems used n-gram models to score translation fluency, preferring outputs that sounded natural.

Spelling correction: N-gram probabilities identify which corrections produce more likely word sequences.

Text prediction: Early autocomplete systems used n-grams to suggest likely next words.

Information retrieval: N-gram statistics help identify important phrases and improve search ranking.

The concepts introduced by n-gram models remain central to modern language modeling: conditional probability, the chain rule, and the trade-off between context length and data sparsity. Transformers and large language models are sophisticated solutions to the problems n-gram models first revealed.

Key Parameters

When building n-gram language models, several parameters affect model behavior and performance.

n (n-gram order)

The order determines how many words of context the model considers. Common choices:

  • n=1 (unigram): No context, just word frequencies. Fast but ignores word order entirely.
  • n=2 (bigram): One word of context. Good balance of simplicity and performance for many tasks.
  • n=3 (trigram): Two words of context. Standard choice for production systems before neural models.
  • n=4 or higher: Rarely used due to severe data sparsity. Requires massive training corpora.

Higher values capture more context but exponentially increase the number of possible n-grams. This leads to sparse counts and zero probabilities for valid sequences.
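
A quick back-of-the-envelope calculation shows how fast the space of possible n-grams grows with the order:

# Number of possible n-grams for a 100,000-word vocabulary
V = 100_000
for n in range(1, 5):
    print(f"n={n}: {V ** n:.0e} possible n-grams")
# n=3 already yields 1e+15, matching the estimate in the text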

temperature (for generation)

Controls randomness during text generation:

  • temperature < 1.0: More deterministic, favors high-probability words. Use for predictable, conservative output.
  • temperature = 1.0: Samples according to true model probabilities. Balanced creativity and coherence.
  • temperature > 1.0: More random, explores low-probability words. Use for creative, diverse output at the cost of coherence.

Values between 0.7 and 1.2 work well for most applications.

Start and end tokens

  • <s>: Marks sentence beginnings. Add n-1 copies for an n-gram model to provide context for the first words.
  • </s>: Marks sentence endings. Enables the model to learn which words typically end sentences.

Vocabulary handling

  • Case normalization: Converting to lowercase reduces vocabulary size but loses information (e.g., "Apple" vs "apple").
  • Unknown token <UNK>: Replacing rare words with a special token handles out-of-vocabulary words during inference.
  • Minimum frequency threshold: Excluding words that appear fewer than k times reduces noise from typos and rare terms (see the sketch below).
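
The sketch below combines the last two ideas: count word frequencies, then map anything below a threshold to <UNK>. The function name and threshold are illustrative choices, not a fixed recipe.

from collections import Counter

def replace_rare_words(tokens, min_count=2, unk="<UNK>"):
    """Map tokens seen fewer than min_count times to the unknown token."""
    counts = Counter(tokens)
    return [t if counts[t] >= min_count else unk for t in tokens]

tokens = "the cat sat on the mat the cat ate the fish".split()
print(replace_rare_words(tokens, min_count=2))
# words seen only once ("sat", "on", "mat", "ate", "fish") become <UNK>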

Summary

N-gram language models assign probabilities to word sequences by decomposing joint probabilities into conditional probabilities using the chain rule. The Markov assumption limits context to the previous $n-1$ words, making estimation tractable.

Key takeaways:

  • Chain rule decomposition breaks sequence probability into a product of conditional probabilities
  • Markov assumption limits history to $n-1$ words, trading accuracy for tractability
  • Maximum likelihood estimation computes probabilities as normalized n-gram counts
  • Start and end tokens model sentence boundaries explicitly
  • Log probabilities prevent numerical underflow when multiplying many small values
  • Zero probability problem occurs when test data contains unseen n-grams
  • Text generation samples words according to conditional probabilities
  • Storage efficiency requires sparse data structures like tries

N-gram models reveal the fundamental tension in language modeling: longer context improves predictions but requires exponentially more data. This insight motivates both smoothing techniques, covered next, and the neural approaches that eventually superseded n-gram models.

