N-gram Language Models: Probability-Based Text Generation & Prediction

Michael Brenndoerfer

Language AI Handbook Machine Learning Data, Analytics & AI

Learn how n-gram language models assign probabilities to word sequences using the chain rule and Markov assumption, with implementations for text generation and scoring.

Part of Language AI Handbook

This article is part of the free-to-read Language AI Handbook

Reading Level

Choose your expertise level to adjust how many terms are explained. Beginners see more tooltips, experts see fewer to maintain reading flow. Hover over underlined terms for instant definitions.

N-gram Language ModelsLink Copied

What makes some word sequences sound natural while others feel wrong? "The cat sat on the mat" flows effortlessly, but "mat the on sat cat the" is gibberish. Native speakers intuitively know which sequences are likely, even for sentences they've never encountered. N-gram language models capture this intuition mathematically, assigning probabilities to sequences of words based on patterns learned from text.

Language models are foundational to NLP. They power spell checkers that suggest "their" over "thier," autocomplete systems that predict your next word, and speech recognizers that disambiguate "recognize speech" from "wreck a nice beach." Before neural networks dominated the field, n-gram models were the workhorses of language modeling, and understanding them provides essential intuition for modern approaches.

This chapter builds n-gram language models from first principles. You'll learn the probability theory that makes them work, implement them from scratch, and discover both their power and fundamental limitations.

The Language Modeling ProblemLink Copied

A language model assigns probabilities to sequences of words. Given a sequence $w_1, w_2, \ldots, w_n$ , the model estimates $P(w_1, w_2, \ldots, w_n)$ , the probability of that exact sequence occurring in natural language.

Language Model

A language model is a probability distribution over sequences of words. It assigns higher probabilities to sequences that are more likely to occur in natural language and lower probabilities to unlikely or ungrammatical sequences.

Why is this useful? Consider a speech recognition system that hears an ambiguous audio signal. It might produce two candidate transcriptions: "I saw a van" and "eyes awe of an." A language model knows the first is far more probable and helps the system choose correctly.

Language models also enable text generation. If we can compute $P(w_1, \ldots, w_n)$ for any sequence, we can generate new text by sampling words according to their probabilities given the preceding context.

The Chain Rule of ProbabilityLink Copied

Computing the probability of a full sequence seems daunting at first. How could we possibly estimate $P(\text{"the cat sat on the mat"})$ directly? We'd need to count how often this exact six-word sequence appears in our training data, then divide by the total number of six-word sequences. But most sentences are unique. Even in a corpus of billions of words, we're unlikely to see any specific sentence more than a handful of times.

The chain rule of probability offers an escape from this dilemma. Instead of estimating the joint probability directly, we decompose it into a product of conditional probabilities. Each conditional asks a simpler question: given everything that came before, how likely is the next word?

P(w_1, w_2, \ldots, w_n) = P(w_1) \times P(w_2|w_1) \times P(w_3|w_1, w_2) \times \cdots \times P(w_n|w_1, \ldots, w_{n-1})

This decomposition follows directly from the definition of conditional probability applied repeatedly. We can write it more compactly using product notation:

P(w_1, \ldots, w_n) = \prod_{i=1}^{n} P(w_i|w_1, \ldots, w_{i-1})

where:

$w_i$ is the $i$ -th word in the sequence
$P(w_i|w_1, \ldots, w_{i-1})$ is the probability of word $w_i$ given all preceding words
The product $\prod$ multiplies all these conditional probabilities together

This decomposition is exact, not an approximation. The probability of any sequence equals the product of each word's probability given its entire preceding context.

Let's see this decomposition in action with a concrete example:

In[2]:

# Illustrate the chain rule decomposition
sentence = "the cat sat on the mat"
words = sentence.split()

# The chain rule breaks the joint probability into conditional probabilities
decomposition = []
for i, word in enumerate(words):
    if i == 0:
        condition = "∅"  # No conditioning for first word
        term = f"P({word})"
    else:
        context = " ".join(words[:i])
        term = f"P({word}|{context})"
    decomposition.append(term)

# Illustrate the chain rule decomposition
sentence = "the cat sat on the mat"
words = sentence.split()

# The chain rule breaks the joint probability into conditional probabilities
decomposition = []
for i, word in enumerate(words):
    if i == 0:
        condition = "∅"  # No conditioning for first word
        term = f"P({word})"
    else:
        context = " ".join(words[:i])
        term = f"P({word}|{context})"
    decomposition.append(term)

Out[3]:

Sentence: "the cat sat on the mat"

Chain rule decomposition:
P(the cat sat on the mat) =
    P(the)
  × P(cat|the)
  × P(sat|the cat)
  × P(on|the cat sat)
  × P(the|the cat sat on)
  × P(mat|the cat sat on the)

Each term in this product answers a progressively more specific question. The first asks: how likely is "the" to start a sentence? The second asks: given we started with "the," how likely is "cat" to follow? By the time we reach the final word, we're asking: given the entire context "the cat sat on the," how likely is "mat"?

The chain rule is mathematically exact, but it creates a severe practical problem. To compute $P(w_n|w_1, \ldots, w_{n-1})$ , we need the probability of word $w_n$ given the entire preceding context. For long sentences, this context can be arbitrarily long. A 20-word sentence requires us to estimate the probability of the 20th word given all 19 preceding words. We'll never have enough data to estimate these long-context probabilities reliably, since most 19-word contexts appear only once, if ever.

This is where the Markov assumption comes to the rescue.

The Markov AssumptionLink Copied

The chain rule gives us an exact decomposition, but it demands something impossible: reliable estimates of probabilities conditioned on arbitrarily long histories. The key insight of n-gram models is to make a simplifying assumption that trades some accuracy for tractability.

Markov Assumption

The Markov assumption states that the probability of a word depends only on a fixed-length history of preceding words, not the entire sequence. An n-gram model assumes the probability of word $w_i$ depends only on the previous $n-1$ words.

The assumption is named after the Russian mathematician Andrey Markov, who studied stochastic processes with this "memoryless" property. In the context of language, it means we pretend the future depends only on the recent past, not the distant past.

For a bigram model ( $n=2$ ), each word depends only on the single word immediately before it:

P(w_i|w_1, \ldots, w_{i-1}) \approx P(w_i|w_{i-1})

For a trigram model ( $n=3$ ), the memory extends to two previous words:

P(w_i|w_1, \ldots, w_{i-1}) \approx P(w_i|w_{i-2}, w_{i-1})

The pattern continues for higher orders: an n-gram model conditions on exactly $n-1$ previous words. The approximation symbol $\approx$ reminds us this is a simplification, not an equality.

Why does this help? Consider the difference in what we need to estimate:

Full history: We need $P(\text{mat}|\text{the cat sat on the})$ , requiring counts of the specific five-word context "the cat sat on the"
Trigram: We need $P(\text{mat}|\text{on the})$ , requiring only counts of the two-word context "on the"
Bigram: We need $P(\text{mat}|\text{the})$ , requiring only counts of the single word "the"

Shorter contexts appear far more frequently in training data, giving us more reliable probability estimates.

In[4]:

# Compare full history vs Markov approximations
sentence = "the cat sat on the mat"
words = sentence.split()

# Different n-gram approximations for P(mat | the cat sat on the)
full_history = "P(mat | the cat sat on the)"
trigram = "P(mat | on the)"
bigram = "P(mat | the)"
unigram = "P(mat)"

approximations = [
    ("Full history", full_history, "exact but impractical"),
    ("Trigram (n=3)", trigram, "considers 2 previous words"),
    ("Bigram (n=2)", bigram, "considers 1 previous word"),
    ("Unigram (n=1)", unigram, "ignores all context"),
]

# Compare full history vs Markov approximations
sentence = "the cat sat on the mat"
words = sentence.split()

# Different n-gram approximations for P(mat | the cat sat on the)
full_history = "P(mat | the cat sat on the)"
trigram = "P(mat | on the)"
bigram = "P(mat | the)"
unigram = "P(mat)"

approximations = [
    ("Full history", full_history, "exact but impractical"),
    ("Trigram (n=3)", trigram, "considers 2 previous words"),
    ("Bigram (n=2)", bigram, "considers 1 previous word"),
    ("Unigram (n=1)", unigram, "ignores all context"),
]

Out[5]:

Markov Approximations for the final word:
------------------------------------------------------------

Full history:
  P(mat | the cat sat on the)
  → exact but impractical

Trigram (n=3):
  P(mat | on the)
  → considers 2 previous words

Bigram (n=2):
  P(mat | the)
  → considers 1 previous word

Unigram (n=1):
  P(mat)
  → ignores all context

Is the Markov assumption reasonable? Strictly speaking, no. The probability of "mat" after "the cat sat on the" depends on "cat" and "sat," which establish that we're talking about an animal sitting somewhere. A bigram model, seeing only "the," has no idea whether we're discussing cats on mats, birds in trees, or books on shelves.

Yet this approximation works well in practice, especially for larger values of $n$ . Language has enough local structure that nearby words carry most of the predictive information. The word "on" strongly predicts a location or surface will follow. The phrase "sat on" even more strongly suggests something like "mat," "chair," or "floor." We lose some information, but we gain the ability to estimate probabilities reliably.

Maximum Likelihood EstimationLink Copied

With the Markov assumption in hand, we've reduced our problem to estimating conditional probabilities like $P(\text{mat}|\text{on the})$ . But how do we actually compute these from data?

The most straightforward approach is maximum likelihood estimation (MLE): count how often each n-gram appears in our training corpus, then normalize. The probability of word $w$ following context $c$ equals the fraction of times we observed $c$ followed by $w$ .

For bigrams, the MLE estimate is:

P(w_i|w_{i-1}) = \frac{C(w_{i-1}, w_i)}{C(w_{i-1})}

where:

$C(w_{i-1}, w_i)$ is the count of the bigram $(w_{i-1}, w_i)$ in the training corpus
$C(w_{i-1})$ is the count of the context word $w_{i-1}$

The formula says: to find the probability of word $w_i$ following word $w_{i-1}$ , count how many times the bigram $(w_{i-1}, w_i)$ appears in the corpus, then divide by the total count of $w_{i-1}$ .

For trigrams, we extend the same logic to two-word contexts:

P(w_i|w_{i-2}, w_{i-1}) = \frac{C(w_{i-2}, w_{i-1}, w_i)}{C(w_{i-2}, w_{i-1})}

where:

$C(w_{i-2}, w_{i-1}, w_i)$ is the count of the trigram
$C(w_{i-2}, w_{i-1})$ is the count of the preceding bigram context

This pattern applies to any n-gram order: divide the count of the full n-gram by the count of its $(n-1)$ -gram prefix.

Let's implement this and see it work on a small corpus:

In[6]:

from collections import defaultdict, Counter

def count_ngrams(tokens, n):
    """Count n-grams in a token sequence."""
    ngram_counts = Counter()
    for i in range(len(tokens) - n + 1):
        ngram = tuple(tokens[i:i+n])
        ngram_counts[ngram] += 1
    return ngram_counts

# Sample corpus
corpus = """
the cat sat on the mat
the cat ate the fish
the dog sat on the mat
the dog chased the cat
"""

# Tokenize
tokens = corpus.lower().split()

# Count unigrams and bigrams
unigram_counts = count_ngrams(tokens, 1)
bigram_counts = count_ngrams(tokens, 2)

from collections import defaultdict, Counter

def count_ngrams(tokens, n):
    """Count n-grams in a token sequence."""
    ngram_counts = Counter()
    for i in range(len(tokens) - n + 1):
        ngram = tuple(tokens[i:i+n])
        ngram_counts[ngram] += 1
    return ngram_counts

# Sample corpus
corpus = """
the cat sat on the mat
the cat ate the fish
the dog sat on the mat
the dog chased the cat
"""

# Tokenize
tokens = corpus.lower().split()

# Count unigrams and bigrams
unigram_counts = count_ngrams(tokens, 1)
bigram_counts = count_ngrams(tokens, 2)

Out[7]:

Corpus tokens: 22
Unique unigrams: 9
Unique bigrams: 14

Top 10 bigrams:
  ('the', 'cat'): 3
  ('sat', 'on'): 2
  ('on', 'the'): 2
  ('the', 'mat'): 2
  ('mat', 'the'): 2
  ('the', 'dog'): 2
  ('cat', 'sat'): 1
  ('cat', 'ate'): 1
  ('ate', 'the'): 1
  ('the', 'fish'): 1

Our small corpus contains 24 tokens but only 9 unique words. The most frequent bigrams are ("the", "cat") and ("the", "dog"), each appearing twice. Notice how "the" dominates as the first element of many bigrams, reflecting its role as the most common English word.

Out[8]:

Visualization

Bar charts showing frequency distributions for unigrams and bigrams, both exhibiting long-tail patterns. — N-gram frequency distributions for unigrams and bigrams. Both follow a Zipf-like pattern: a few n-grams are very common, while most appear only once or twice. This long-tail distribution is characteristic of natural language and explains why data sparsity is such a challenge for n-gram models.

The frequency distributions reveal a pattern common to all natural language: a few items dominate while most appear rarely. "The" appears 8 times, more than any other word, while most bigrams appear only once. This Zipf-like distribution explains why n-gram models struggle with data sparsity. As we move to higher-order n-grams, the long tail grows longer, and most n-grams remain unseen even in large corpora.

Now let's apply the MLE formula to compute actual probabilities:

In[9]:

def bigram_probability(w1, w2, bigram_counts, unigram_counts):
    """Calculate P(w2|w1) using MLE."""
    bigram = (w1, w2)
    unigram = (w1,)
    
    bigram_count = bigram_counts.get(bigram, 0)
    unigram_count = unigram_counts.get(unigram, 0)
    
    if unigram_count == 0:
        return 0.0
    
    return bigram_count / unigram_count

# Calculate some example probabilities
examples = [
    ("the", "cat"),
    ("the", "dog"),
    ("the", "mat"),
    ("cat", "sat"),
    ("sat", "on"),
    ("on", "the"),
]

probabilities = [(w1, w2, bigram_probability(w1, w2, bigram_counts, unigram_counts)) 
                 for w1, w2 in examples]

def bigram_probability(w1, w2, bigram_counts, unigram_counts):
    """Calculate P(w2|w1) using MLE."""
    bigram = (w1, w2)
    unigram = (w1,)
    
    bigram_count = bigram_counts.get(bigram, 0)
    unigram_count = unigram_counts.get(unigram, 0)
    
    if unigram_count == 0:
        return 0.0
    
    return bigram_count / unigram_count

# Calculate some example probabilities
examples = [
    ("the", "cat"),
    ("the", "dog"),
    ("the", "mat"),
    ("cat", "sat"),
    ("sat", "on"),
    ("on", "the"),
]

probabilities = [(w1, w2, bigram_probability(w1, w2, bigram_counts, unigram_counts)) 
                 for w1, w2 in examples]

Out[10]:

Bigram Probabilities P(w2|w1):
----------------------------------------
P(cat|the) = 3/8 = 0.375
P(dog|the) = 2/8 = 0.250
P(mat|the) = 2/8 = 0.250
P(sat|cat) = 1/3 = 0.333
P(on|sat) = 2/2 = 1.000
P(the|on) = 2/2 = 1.000

The probabilities reveal patterns in our corpus. After "the," both "cat" and "dog" appear with equal probability (25% each), while "mat" also appears 25% of the time. After "sat," the word "on" appears with 100% probability, reflecting the fixed phrase "sat on" that appears in every relevant sentence. After "on," the word "the" follows every time, since our corpus always uses "on the" before the final noun.

These probabilities capture the local structure of our tiny corpus. A model trained on this data would predict that "sat" is always followed by "on," and "on" is always followed by "the." Of course, with only four sentences, these patterns are artifacts of our limited data. A larger corpus would reveal more nuanced distributions.

We can visualize the complete bigram probability matrix as a heatmap, where each cell shows $P(w_j|w_i)$ , the probability of word $w_j$ following word $w_i$ :

Out[11]:

Visualization

Heatmap showing bigram transition probabilities between words in the corpus. — Bigram transition probability matrix. Each row represents a context word, and each column represents a possible next word. Darker cells indicate higher probabilities. The sparse structure reveals that most word pairs never occur together, while a few transitions dominate (e.g., 'sat' → 'on' has probability 1.0).

The heatmap reveals the sparse structure of bigram probabilities. Most cells are empty (zero probability), indicating word pairs that never occurred in our corpus. The non-zero cells cluster around specific patterns: "the" can be followed by several words, but "sat" is always followed by "on." This sparsity becomes even more pronounced with larger vocabularies and higher-order n-grams.

Start and End TokensLink Copied

Sentences have beginnings and endings. To model this, we add special tokens: <s> marks the start of a sentence and </s> marks the end. These tokens let us model which words typically start sentences and which words end them.

In[12]:

def add_sentence_markers(text, n):
    """Add start and end markers to sentences."""
    sentences = text.strip().split('\n')
    marked_sentences = []
    
    for sentence in sentences:
        if sentence.strip():
            tokens = sentence.lower().split()
            # Add n-1 start tokens and 1 end token
            marked = ['<s>'] * (n - 1) + tokens + ['</s>']
            marked_sentences.append(marked)
    
    return marked_sentences

# Process corpus with markers for bigram model
marked_corpus = add_sentence_markers(corpus, n=2)

# Flatten for counting
all_tokens = [token for sentence in marked_corpus for token in sentence]

# Recount with markers
unigram_counts_marked = count_ngrams(all_tokens, 1)
bigram_counts_marked = count_ngrams(all_tokens, 2)

def add_sentence_markers(text, n):
    """Add start and end markers to sentences."""
    sentences = text.strip().split('\n')
    marked_sentences = []
    
    for sentence in sentences:
        if sentence.strip():
            tokens = sentence.lower().split()
            # Add n-1 start tokens and 1 end token
            marked = ['<s>'] * (n - 1) + tokens + ['</s>']
            marked_sentences.append(marked)
    
    return marked_sentences

# Process corpus with markers for bigram model
marked_corpus = add_sentence_markers(corpus, n=2)

# Flatten for counting
all_tokens = [token for sentence in marked_corpus for token in sentence]

# Recount with markers
unigram_counts_marked = count_ngrams(all_tokens, 1)
bigram_counts_marked = count_ngrams(all_tokens, 2)

Out[13]:

Sentences with markers (bigram model):
  1. <s> the cat sat on the mat </s>
  2. <s> the cat ate the fish </s>
  3. <s> the dog sat on the mat </s>
  4. <s> the dog chased the cat </s>

Bigrams involving markers:
  ('<s>', 'the'): 4
  ('</s>', '<s>'): 3
  ('mat', '</s>'): 2
  ('fish', '</s>'): 1
  ('cat', '</s>'): 1

The start marker <s> lets us compute $P(w_1|\text{<s>})$ , the probability of a word starting a sentence. In our corpus, "the" starts every sentence, so $P(\text{the}|\text{<s>}) = 1.0$ . The end marker </s> lets us model sentence endings: "mat," "fish," and "cat" all end sentences.

For higher-order n-grams, we add multiple start tokens. A trigram model needs two <s> tokens at the beginning to provide context for the first two words.

Computing Sequence ProbabilitiesLink Copied

With n-gram probabilities estimated, we can compute the probability of any sequence. We apply the chain rule with our Markov approximation:

P(w_1, \ldots, w_m) \approx \prod_{i=1}^{m} P(w_i|w_{i-n+1}, \ldots, w_{i-1})

where $m$ is the sequence length, $n$ is the n-gram order, and each term conditions on the previous $n-1$ words rather than the full history. For positions where $i < n$ , we use start tokens <s> to provide the required context.

In[14]:

import math

def sequence_probability(sentence, bigram_counts, unigram_counts, log=True):
    """Calculate the probability of a sentence using bigram model."""
    # Add markers
    tokens = ['<s>'] + sentence.lower().split() + ['</s>']
    
    if log:
        # Use log probabilities to avoid underflow
        log_prob = 0.0
        for i in range(1, len(tokens)):
            w1, w2 = tokens[i-1], tokens[i]
            prob = bigram_probability(w1, w2, bigram_counts, unigram_counts)
            if prob == 0:
                return float('-inf')  # Log of zero
            log_prob += math.log(prob)
        return log_prob
    else:
        # Regular probability (prone to underflow)
        prob = 1.0
        for i in range(1, len(tokens)):
            w1, w2 = tokens[i-1], tokens[i]
            prob *= bigram_probability(w1, w2, bigram_counts, unigram_counts)
        return prob

# Test sentences
test_sentences = [
    "the cat sat on the mat",
    "the dog sat on the mat",
    "the cat chased the dog",
    "the fish ate the cat",
]

results = [(s, sequence_probability(s, bigram_counts_marked, unigram_counts_marked, log=True))
           for s in test_sentences]

import math

def sequence_probability(sentence, bigram_counts, unigram_counts, log=True):
    """Calculate the probability of a sentence using bigram model."""
    # Add markers
    tokens = ['<s>'] + sentence.lower().split() + ['</s>']
    
    if log:
        # Use log probabilities to avoid underflow
        log_prob = 0.0
        for i in range(1, len(tokens)):
            w1, w2 = tokens[i-1], tokens[i]
            prob = bigram_probability(w1, w2, bigram_counts, unigram_counts)
            if prob == 0:
                return float('-inf')  # Log of zero
            log_prob += math.log(prob)
        return log_prob
    else:
        # Regular probability (prone to underflow)
        prob = 1.0
        for i in range(1, len(tokens)):
            w1, w2 = tokens[i-1], tokens[i]
            prob *= bigram_probability(w1, w2, bigram_counts, unigram_counts)
        return prob

# Test sentences
test_sentences = [
    "the cat sat on the mat",
    "the dog sat on the mat",
    "the cat chased the dog",
    "the fish ate the cat",
]

results = [(s, sequence_probability(s, bigram_counts_marked, unigram_counts_marked, log=True))
           for s in test_sentences]

Out[15]:

Sentence Log-Probabilities:
------------------------------------------------------------
  log P =   -3.47  "the dog sat on the mat"
  log P =   -3.47  "the cat sat on the mat"
  log P =    -inf  "the cat chased the dog"
  log P =    -inf  "the fish ate the cat"

Notice we use log probabilities. Multiplying many small probabilities leads to numerical underflow, where numbers become too small to represent. Taking logarithms converts products to sums, keeping values in a manageable range.

Out[16]:

Visualization

Two line plots comparing raw probabilities versus log probabilities across sentence lengths. — Why log probabilities are essential. The left panel shows how raw probabilities shrink exponentially as sentence length increases, quickly becoming too small to represent. The right panel shows the same data on a log scale, where the values remain in a computationally tractable range.

The visualization shows why log probabilities are essential. With an average per-word probability of 0.1, raw probabilities become vanishingly small after about 30 words, approaching the limits of floating-point representation. Log probabilities decrease linearly and remain computationally tractable for sentences of any length.

The rankings reveal what our model learned. Sentences similar to the training data ("the cat sat on the mat") get higher probabilities. Sentences with unseen bigrams ("the fish ate") may get zero probability, resulting in negative infinity log probability.

The Zero Probability ProblemLink Copied

What happens when we encounter an n-gram that never appeared in training?

In[17]:

# Test with an unseen sentence
unseen_sentence = "the bird flew over the house"
unseen_tokens = ['<s>'] + unseen_sentence.lower().split() + ['</s>']

# Check each bigram
bigram_analysis = []
for i in range(1, len(unseen_tokens)):
    w1, w2 = unseen_tokens[i-1], unseen_tokens[i]
    bigram = (w1, w2)
    count = bigram_counts_marked.get(bigram, 0)
    prob = bigram_probability(w1, w2, bigram_counts_marked, unigram_counts_marked)
    bigram_analysis.append((w1, w2, count, prob))

# Test with an unseen sentence
unseen_sentence = "the bird flew over the house"
unseen_tokens = ['<s>'] + unseen_sentence.lower().split() + ['</s>']

# Check each bigram
bigram_analysis = []
for i in range(1, len(unseen_tokens)):
    w1, w2 = unseen_tokens[i-1], unseen_tokens[i]
    bigram = (w1, w2)
    count = bigram_counts_marked.get(bigram, 0)
    prob = bigram_probability(w1, w2, bigram_counts_marked, unigram_counts_marked)
    bigram_analysis.append((w1, w2, count, prob))

Out[18]:

Analyzing: "the bird flew over the house"
--------------------------------------------------
  (<s>, the): count=4, P=1.000 ✓
  (the, bird): count=0, P=0.000 ✗ UNSEEN
  (bird, flew): count=0, P=0.000 ✗ UNSEEN
  (flew, over): count=0, P=0.000 ✗ UNSEEN
  (over, the): count=0, P=0.000 ✗ UNSEEN
  (the, house): count=0, P=0.000 ✗ UNSEEN
  (house, </s>): count=0, P=0.000 ✗ UNSEEN

Total probability: 0.0

A single unseen bigram makes the entire sequence probability zero. This is a serious problem: perfectly valid sentences get zero probability just because our training data didn't contain every possible word combination.

This is the zero probability problem. It motivates the smoothing techniques covered in the next chapter. Raw MLE estimates are too brittle for real applications.

Building a Complete N-gram ModelLink Copied

Let's build a complete, reusable n-gram language model class:

In[19]:

class NgramLanguageModel:
    """An n-gram language model with MLE estimation."""
    
    def __init__(self, n=2):
        self.n = n
        self.ngram_counts = Counter()
        self.context_counts = Counter()
        self.vocab = set()
        
    def train(self, text):
        """Train the model on text."""
        sentences = text.strip().split('\n')
        
        for sentence in sentences:
            if not sentence.strip():
                continue
            
            # Tokenize and add markers
            tokens = sentence.lower().split()
            tokens = ['<s>'] * (self.n - 1) + tokens + ['</s>']
            
            # Update vocabulary
            self.vocab.update(tokens)
            
            # Count n-grams and (n-1)-grams
            for i in range(len(tokens) - self.n + 1):
                ngram = tuple(tokens[i:i + self.n])
                context = tuple(tokens[i:i + self.n - 1])
                
                self.ngram_counts[ngram] += 1
                self.context_counts[context] += 1
    
    def probability(self, word, context):
        """Calculate P(word|context) using MLE."""
        if isinstance(context, str):
            context = tuple(context.split())
        
        ngram = context + (word,)
        
        ngram_count = self.ngram_counts.get(ngram, 0)
        context_count = self.context_counts.get(context, 0)
        
        if context_count == 0:
            return 0.0
        
        return ngram_count / context_count
    
    def sentence_log_probability(self, sentence):
        """Calculate log probability of a sentence."""
        tokens = sentence.lower().split()
        tokens = ['<s>'] * (self.n - 1) + tokens + ['</s>']
        
        log_prob = 0.0
        for i in range(self.n - 1, len(tokens)):
            context = tuple(tokens[i - self.n + 1:i])
            word = tokens[i]
            
            prob = self.probability(word, context)
            if prob == 0:
                return float('-inf')
            
            log_prob += math.log(prob)
        
        return log_prob

class NgramLanguageModel:
    """An n-gram language model with MLE estimation."""
    
    def __init__(self, n=2):
        self.n = n
        self.ngram_counts = Counter()
        self.context_counts = Counter()
        self.vocab = set()
        
    def train(self, text):
        """Train the model on text."""
        sentences = text.strip().split('\n')
        
        for sentence in sentences:
            if not sentence.strip():
                continue
            
            # Tokenize and add markers
            tokens = sentence.lower().split()
            tokens = ['<s>'] * (self.n - 1) + tokens + ['</s>']
            
            # Update vocabulary
            self.vocab.update(tokens)
            
            # Count n-grams and (n-1)-grams
            for i in range(len(tokens) - self.n + 1):
                ngram = tuple(tokens[i:i + self.n])
                context = tuple(tokens[i:i + self.n - 1])
                
                self.ngram_counts[ngram] += 1
                self.context_counts[context] += 1
    
    def probability(self, word, context):
        """Calculate P(word|context) using MLE."""
        if isinstance(context, str):
            context = tuple(context.split())
        
        ngram = context + (word,)
        
        ngram_count = self.ngram_counts.get(ngram, 0)
        context_count = self.context_counts.get(context, 0)
        
        if context_count == 0:
            return 0.0
        
        return ngram_count / context_count
    
    def sentence_log_probability(self, sentence):
        """Calculate log probability of a sentence."""
        tokens = sentence.lower().split()
        tokens = ['<s>'] * (self.n - 1) + tokens + ['</s>']
        
        log_prob = 0.0
        for i in range(self.n - 1, len(tokens)):
            context = tuple(tokens[i - self.n + 1:i])
            word = tokens[i]
            
            prob = self.probability(word, context)
            if prob == 0:
                return float('-inf')
            
            log_prob += math.log(prob)
        
        return log_prob

Out[20]:

NgramLanguageModel class defined
Methods: train(), probability(), sentence_log_probability()

The class encapsulates all the logic we've built: n-gram counting, context tracking, and probability computation. The train() method processes text and builds the count tables. The probability() and sentence_log_probability() methods query the trained model.

Let's train models of different orders and compare them:

In[21]:

# Larger training corpus
training_corpus = """
the cat sat on the mat
the cat ate the fish
the dog sat on the mat
the dog chased the cat
the cat ran from the dog
the dog ate the food
the bird flew over the tree
the bird sat on the branch
the cat watched the bird
the dog watched the cat
"""

# Train unigram, bigram, and trigram models
unigram_model = NgramLanguageModel(n=1)
bigram_model = NgramLanguageModel(n=2)
trigram_model = NgramLanguageModel(n=3)

unigram_model.train(training_corpus)
bigram_model.train(training_corpus)
trigram_model.train(training_corpus)

# Compare on test sentences
test_sentences = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird flew over the tree",
    "the cat ate the bird",
]

model_comparison = []
for sentence in test_sentences:
    scores = {
        'unigram': unigram_model.sentence_log_probability(sentence),
        'bigram': bigram_model.sentence_log_probability(sentence),
        'trigram': trigram_model.sentence_log_probability(sentence),
    }
    model_comparison.append((sentence, scores))

# Larger training corpus
training_corpus = """
the cat sat on the mat
the cat ate the fish
the dog sat on the mat
the dog chased the cat
the cat ran from the dog
the dog ate the food
the bird flew over the tree
the bird sat on the branch
the cat watched the bird
the dog watched the cat
"""

# Train unigram, bigram, and trigram models
unigram_model = NgramLanguageModel(n=1)
bigram_model = NgramLanguageModel(n=2)
trigram_model = NgramLanguageModel(n=3)

unigram_model.train(training_corpus)
bigram_model.train(training_corpus)
trigram_model.train(training_corpus)

# Compare on test sentences
test_sentences = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird flew over the tree",
    "the cat ate the bird",
]

model_comparison = []
for sentence in test_sentences:
    scores = {
        'unigram': unigram_model.sentence_log_probability(sentence),
        'bigram': bigram_model.sentence_log_probability(sentence),
        'trigram': trigram_model.sentence_log_probability(sentence),
    }
    model_comparison.append((sentence, scores))

Out[22]:

Model Comparison (log probabilities):
======================================================================
Sentence                               Unigram     Bigram    Trigram
----------------------------------------------------------------------
the cat sat on the mat                  -16.24      -5.30      -3.11
the dog chased the cat                  -13.35      -5.30      -3.62
the bird flew over the tree             -19.83      -5.99      -2.71
the cat ate the bird                    -13.17      -5.99       -inf

Higher-order models assign more varied probabilities because they consider more context. The trigram model can distinguish between sentences that differ only in longer-range patterns.

Generating Text from N-gram ModelsLink Copied

N-gram models can generate new text by sampling words according to their conditional probabilities. Starting from the start token, we repeatedly sample the next word given the current context.

In[23]:

import random

def generate_sentence(model, max_length=20, seed=None):
    """Generate a sentence from an n-gram model."""
    if seed is not None:
        random.seed(seed)
    
    # Start with n-1 start tokens
    tokens = ['<s>'] * (model.n - 1)
    
    for _ in range(max_length):
        context = tuple(tokens[-(model.n - 1):])
        
        # Get all possible next words and their probabilities
        candidates = []
        for ngram, count in model.ngram_counts.items():
            if ngram[:-1] == context:
                word = ngram[-1]
                prob = model.probability(word, context)
                candidates.append((word, prob))
        
        if not candidates:
            break
        
        # Sample according to probabilities
        words, probs = zip(*candidates)
        next_word = random.choices(words, weights=probs, k=1)[0]
        
        if next_word == '</s>':
            break
        
        tokens.append(next_word)
    
    # Remove start tokens and return
    return ' '.join(tokens[model.n - 1:])

# Generate sentences from bigram model
generated_bigram = [generate_sentence(bigram_model, seed=i) for i in range(5)]

import random

def generate_sentence(model, max_length=20, seed=None):
    """Generate a sentence from an n-gram model."""
    if seed is not None:
        random.seed(seed)
    
    # Start with n-1 start tokens
    tokens = ['<s>'] * (model.n - 1)
    
    for _ in range(max_length):
        context = tuple(tokens[-(model.n - 1):])
        
        # Get all possible next words and their probabilities
        candidates = []
        for ngram, count in model.ngram_counts.items():
            if ngram[:-1] == context:
                word = ngram[-1]
                prob = model.probability(word, context)
                candidates.append((word, prob))
        
        if not candidates:
            break
        
        # Sample according to probabilities
        words, probs = zip(*candidates)
        next_word = random.choices(words, weights=probs, k=1)[0]
        
        if next_word == '</s>':
            break
        
        tokens.append(next_word)
    
    # Remove start tokens and return
    return ' '.join(tokens[model.n - 1:])

# Generate sentences from bigram model
generated_bigram = [generate_sentence(bigram_model, seed=i) for i in range(5)]

Out[24]:

Generated sentences (bigram model):
--------------------------------------------------
  1. the bird sat on the fish
  2. the bird
  3. the tree
  4. the dog chased the dog sat on the cat ate the dog watched the dog sat on the dog ate
  5. the cat

In[25]:

# Generate from trigram model
generated_trigram = [generate_sentence(trigram_model, seed=i) for i in range(5)]

# Generate from trigram model
generated_trigram = [generate_sentence(trigram_model, seed=i) for i in range(5)]

Out[26]:

Generated sentences (trigram model):
--------------------------------------------------
  1. the dog
  2. the bird
  3. the bird flew over the tree
  4. the dog chased the cat sat on the mat
  5. the cat

The generated text reflects patterns in the training data. Bigram models produce locally coherent text but may wander incoherently over longer spans. Trigram models maintain coherence over slightly longer distances, but neither captures long-range dependencies like subject-verb agreement across clauses.

Temperature SamplingLink Copied

We can control the randomness of generation with a temperature parameter. Lower temperatures make the model more deterministic, always choosing high-probability words. Higher temperatures increase diversity:

In[27]:

def generate_with_temperature(model, temperature=1.0, max_length=20, seed=None):
    """Generate with temperature-scaled probabilities."""
    if seed is not None:
        random.seed(seed)
    
    tokens = ['<s>'] * (model.n - 1)
    
    for _ in range(max_length):
        context = tuple(tokens[-(model.n - 1):])
        
        candidates = []
        for ngram, count in model.ngram_counts.items():
            if ngram[:-1] == context:
                word = ngram[-1]
                prob = model.probability(word, context)
                candidates.append((word, prob))
        
        if not candidates:
            break
        
        words, probs = zip(*candidates)
        
        # Apply temperature scaling
        if temperature != 1.0:
            log_probs = [math.log(p + 1e-10) / temperature for p in probs]
            max_log = max(log_probs)
            scaled_probs = [math.exp(lp - max_log) for lp in log_probs]
            total = sum(scaled_probs)
            probs = tuple(p / total for p in scaled_probs)
        
        next_word = random.choices(words, weights=probs, k=1)[0]
        
        if next_word == '</s>':
            break
        
        tokens.append(next_word)
    
    return ' '.join(tokens[model.n - 1:])

# Compare different temperatures
temperatures = [0.5, 1.0, 1.5]
temp_samples = {t: [generate_with_temperature(bigram_model, temperature=t, seed=i) 
                    for i in range(3)] 
                for t in temperatures}

def generate_with_temperature(model, temperature=1.0, max_length=20, seed=None):
    """Generate with temperature-scaled probabilities."""
    if seed is not None:
        random.seed(seed)
    
    tokens = ['<s>'] * (model.n - 1)
    
    for _ in range(max_length):
        context = tuple(tokens[-(model.n - 1):])
        
        candidates = []
        for ngram, count in model.ngram_counts.items():
            if ngram[:-1] == context:
                word = ngram[-1]
                prob = model.probability(word, context)
                candidates.append((word, prob))
        
        if not candidates:
            break
        
        words, probs = zip(*candidates)
        
        # Apply temperature scaling
        if temperature != 1.0:
            log_probs = [math.log(p + 1e-10) / temperature for p in probs]
            max_log = max(log_probs)
            scaled_probs = [math.exp(lp - max_log) for lp in log_probs]
            total = sum(scaled_probs)
            probs = tuple(p / total for p in scaled_probs)
        
        next_word = random.choices(words, weights=probs, k=1)[0]
        
        if next_word == '</s>':
            break
        
        tokens.append(next_word)
    
    return ' '.join(tokens[model.n - 1:])

# Compare different temperatures
temperatures = [0.5, 1.0, 1.5]
temp_samples = {t: [generate_with_temperature(bigram_model, temperature=t, seed=i) 
                    for i in range(3)] 
                for t in temperatures}

Out[28]:

Effect of Temperature on Generation:
============================================================

Temperature = 0.5:
  1. the dog
  2. the food
  3. the bird flew over the dog ate the dog ate the cat

Temperature = 1.0:
  1. the bird sat on the fish
  2. the bird
  3. the tree

Temperature = 1.5:
  1. the bird sat on the fish
  2. the bird
  3. the branch

At low temperatures, the model gravitates toward the most common patterns. At high temperatures, it explores less likely paths and sometimes produces more creative but less coherent output.

Out[29]:

Visualization

Three bar charts showing how temperature affects the probability distribution over next words. — Effect of temperature on probability distributions. At temperature 0.5, the distribution becomes sharper, concentrating probability on the most likely words. At temperature 1.0, we use the original probabilities. At temperature 1.5, the distribution flattens, giving more weight to unlikely words.

The visualization shows how temperature reshapes probability distributions. At low temperature (0.5), the highest-probability words become even more dominant, making generation more predictable. At high temperature (1.5), the distribution flattens, giving rare words a fighting chance. This trade-off between coherence and diversity is fundamental to text generation with language models.

Model Storage and EfficiencyLink Copied

N-gram models can grow large. With a vocabulary of $V$ words, there are $V^n$ possible n-grams. For English with $V = 100,000$ and $n = 3$ , that's $10^{15}$ potential trigrams, so efficient storage is essential.

Sparse StorageLink Copied

Most possible n-grams never occur. We store only observed n-grams using dictionaries or hash tables:

In[30]:

import sys

def estimate_model_size(model):
    """Estimate memory usage of an n-gram model."""
    # Count stored n-grams
    num_ngrams = len(model.ngram_counts)
    num_contexts = len(model.context_counts)
    vocab_size = len(model.vocab)
    
    # Estimate memory (rough approximation)
    # Each n-gram tuple: ~n * 50 bytes (string references)
    # Each count: 28 bytes (Python int)
    ngram_memory = num_ngrams * (model.n * 50 + 28)
    context_memory = num_contexts * ((model.n - 1) * 50 + 28)
    vocab_memory = vocab_size * 50
    
    total_bytes = ngram_memory + context_memory + vocab_memory
    
    return {
        'num_ngrams': num_ngrams,
        'num_contexts': num_contexts,
        'vocab_size': vocab_size,
        'estimated_mb': total_bytes / (1024 * 1024)
    }

# Check model sizes
sizes = {
    'Unigram': estimate_model_size(unigram_model),
    'Bigram': estimate_model_size(bigram_model),
    'Trigram': estimate_model_size(trigram_model),
}

import sys

def estimate_model_size(model):
    """Estimate memory usage of an n-gram model."""
    # Count stored n-grams
    num_ngrams = len(model.ngram_counts)
    num_contexts = len(model.context_counts)
    vocab_size = len(model.vocab)
    
    # Estimate memory (rough approximation)
    # Each n-gram tuple: ~n * 50 bytes (string references)
    # Each count: 28 bytes (Python int)
    ngram_memory = num_ngrams * (model.n * 50 + 28)
    context_memory = num_contexts * ((model.n - 1) * 50 + 28)
    vocab_memory = vocab_size * 50
    
    total_bytes = ngram_memory + context_memory + vocab_memory
    
    return {
        'num_ngrams': num_ngrams,
        'num_contexts': num_contexts,
        'vocab_size': vocab_size,
        'estimated_mb': total_bytes / (1024 * 1024)
    }

# Check model sizes
sizes = {
    'Unigram': estimate_model_size(unigram_model),
    'Bigram': estimate_model_size(bigram_model),
    'Trigram': estimate_model_size(trigram_model),
}

Out[31]:

Model Storage Statistics:
------------------------------------------------------------
Model         N-grams   Contexts    Vocab    Est. MB
------------------------------------------------------------
Unigram            19          1       19     0.0023
Bigram             36         19       20     0.0068
Trigram            44         29       20     0.0120

The trigram model stores more n-grams than the bigram model because it tracks longer sequences. Our toy corpus produces tiny models measured in kilobytes, but real-world models trained on billions of words can require gigabytes of storage. This makes efficient data structures essential.

Trie-Based StorageLink Copied

For faster lookups, n-grams can be stored in a trie (prefix tree). Each path from root to node represents an n-gram prefix. Nodes store counts:

In[32]:

class TrieNode:
    """A node in the n-gram trie."""
    def __init__(self):
        self.children = {}
        self.count = 0

class TrieNgramModel:
    """N-gram model using trie storage."""
    
    def __init__(self, n=2):
        self.n = n
        self.root = TrieNode()
        self.vocab = set()
    
    def _add_ngram(self, ngram):
        """Add an n-gram to the trie."""
        node = self.root
        for token in ngram:
            if token not in node.children:
                node.children[token] = TrieNode()
            node = node.children[token]
            node.count += 1
    
    def train(self, text):
        """Train on text."""
        sentences = text.strip().split('\n')
        
        for sentence in sentences:
            if not sentence.strip():
                continue
            
            tokens = sentence.lower().split()
            tokens = ['<s>'] * (self.n - 1) + tokens + ['</s>']
            self.vocab.update(tokens)
            
            # Add all n-grams up to order n
            for i in range(len(tokens)):
                for j in range(1, min(self.n + 1, len(tokens) - i + 1)):
                    ngram = tuple(tokens[i:i+j])
                    self._add_ngram(ngram)
    
    def get_count(self, ngram):
        """Get count of an n-gram."""
        if isinstance(ngram, str):
            ngram = tuple(ngram.split())
        
        node = self.root
        for token in ngram:
            if token not in node.children:
                return 0
            node = node.children[token]
        return node.count

# Build trie model
trie_model = TrieNgramModel(n=3)
trie_model.train(training_corpus)

class TrieNode:
    """A node in the n-gram trie."""
    def __init__(self):
        self.children = {}
        self.count = 0

class TrieNgramModel:
    """N-gram model using trie storage."""
    
    def __init__(self, n=2):
        self.n = n
        self.root = TrieNode()
        self.vocab = set()
    
    def _add_ngram(self, ngram):
        """Add an n-gram to the trie."""
        node = self.root
        for token in ngram:
            if token not in node.children:
                node.children[token] = TrieNode()
            node = node.children[token]
            node.count += 1
    
    def train(self, text):
        """Train on text."""
        sentences = text.strip().split('\n')
        
        for sentence in sentences:
            if not sentence.strip():
                continue
            
            tokens = sentence.lower().split()
            tokens = ['<s>'] * (self.n - 1) + tokens + ['</s>']
            self.vocab.update(tokens)
            
            # Add all n-grams up to order n
            for i in range(len(tokens)):
                for j in range(1, min(self.n + 1, len(tokens) - i + 1)):
                    ngram = tuple(tokens[i:i+j])
                    self._add_ngram(ngram)
    
    def get_count(self, ngram):
        """Get count of an n-gram."""
        if isinstance(ngram, str):
            ngram = tuple(ngram.split())
        
        node = self.root
        for token in ngram:
            if token not in node.children:
                return 0
            node = node.children[token]
        return node.count

# Build trie model
trie_model = TrieNgramModel(n=3)
trie_model.train(training_corpus)

Out[33]:

Trie-based N-gram Counts:
----------------------------------------
  ('the',): 60
  ('the', 'cat'): 12
  ('the', 'cat', 'sat'): 1
  ('the', 'dog'): 10
  ('cat', 'sat', 'on'): 1

The trie efficiently retrieves counts for any prefix. Notice how "the" has a high count because it appears in every sentence, while longer n-grams like "the cat sat" are less frequent. Tries share prefixes between n-grams, reducing memory for models with many overlapping contexts. They also enable efficient prefix queries like "What words can follow 'the cat'?"

Visualizing N-gram DistributionsLink Copied

Let's visualize how n-gram probabilities distribute across possible continuations:

Out[34]:

Visualization

Bar charts showing probability distributions for next word given three different contexts. — Probability distribution over next words given different contexts. The bigram model shows how context constrains the set of likely continuations. After 'the,' many words are possible. After 'sat,' 'on' dominates. More specific contexts yield more peaked distributions.

The distributions reveal how context constrains possibilities. After "the," many words are plausible, but after "sat," almost all probability mass concentrates on "on." Longer contexts generally produce more peaked distributions, reflecting stronger constraints.

Working with NLTKLink Copied

NLTK provides built-in n-gram functionality:

In[35]:

import nltk
from nltk.util import ngrams
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import MLE

# Download required data
try:
    nltk.data.find('tokenizers/punkt_tab')
except LookupError:
    nltk.download('punkt_tab', quiet=True)

# Prepare training data
train_sentences = [s.lower().split() for s in training_corpus.strip().split('\n') if s.strip()]

# Create padded n-grams for training
n = 3
train_data, padded_vocab = padded_everygram_pipeline(n, train_sentences)

# Train MLE model
nltk_model = MLE(n)
nltk_model.fit(train_data, padded_vocab)

import nltk
from nltk.util import ngrams
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import MLE

# Download required data
try:
    nltk.data.find('tokenizers/punkt_tab')
except LookupError:
    nltk.download('punkt_tab', quiet=True)

# Prepare training data
train_sentences = [s.lower().split() for s in training_corpus.strip().split('\n') if s.strip()]

# Create padded n-grams for training
n = 3
train_data, padded_vocab = padded_everygram_pipeline(n, train_sentences)

# Train MLE model
nltk_model = MLE(n)
nltk_model.fit(train_data, padded_vocab)

Out[36]:

NLTK 3-gram model trained
Vocabulary size: 21

In[37]:

# Query the NLTK model
test_contexts = [
    ['the'],
    ['the', 'cat'],
    ['sat', 'on'],
]

nltk_results = []
for context in test_contexts:
    # Get probability distribution
    probs = {}
    for word in nltk_model.vocab:
        try:
            p = nltk_model.score(word, context)
            if p > 0:
                probs[word] = p
        except:
            pass
    
    # Get top predictions
    top_words = sorted(probs.items(), key=lambda x: -x[1])[:5]
    nltk_results.append((context, top_words))

# Query the NLTK model
test_contexts = [
    ['the'],
    ['the', 'cat'],
    ['sat', 'on'],
]

nltk_results = []
for context in test_contexts:
    # Get probability distribution
    probs = {}
    for word in nltk_model.vocab:
        try:
            p = nltk_model.score(word, context)
            if p > 0:
                probs[word] = p
        except:
            pass
    
    # Get top predictions
    top_words = sorted(probs.items(), key=lambda x: -x[1])[:5]
    nltk_results.append((context, top_words))

Out[38]:

NLTK Model Predictions:
--------------------------------------------------

Context: ['the']
  P(cat) = 0.300
  P(dog) = 0.250
  P(bird) = 0.150
  P(mat) = 0.100
  P(fish) = 0.050

Context: ['the', 'cat']
  P(</s>) = 0.333
  P(sat) = 0.167
  P(ate) = 0.167
  P(ran) = 0.167
  P(watched) = 0.167

Context: ['sat', 'on']
  P(the) = 1.000

NLTK's language modeling module handles the bookkeeping of n-gram counting, vocabulary management, and probability computation. This lets you focus on higher-level tasks.

Limitations of N-gram ModelsLink Copied

N-gram models have served NLP well, but they face fundamental limitations.

Limited context: The Markov assumption ignores long-range dependencies. In "The cat that the dog chased ran away," the verb "ran" agrees with "cat," but a trigram model only sees "chased ran away."

Data sparsity: Most possible n-grams never occur in training data. Even with billions of words, many valid word combinations remain unseen.

No generalization: N-gram models can't recognize that "the cat sat" and "the dog sat" share similar structure. Each n-gram is treated independently.

Storage requirements: Large vocabularies and high-order n-grams require substantial memory. A trigram model for a 100,000-word vocabulary could theoretically have $10^{15}$ parameters.

No semantic understanding: The model doesn't know that "cat" and "feline" are related, or that "bank" has multiple meanings.

Out[39]:

Visualization

Line chart showing decreasing n-gram coverage as order increases from 1 to 5. — The trade-off between n-gram order and data sparsity. Higher-order n-grams capture more context but suffer from increased sparsity. The chart shows the percentage of test n-grams that were seen during training, illustrating how coverage drops dramatically as n increases.

Despite these limitations, n-gram models remain useful. They're fast, interpretable, and work well for many applications. They also provide the conceptual foundation for understanding more sophisticated models.

What N-gram Models UnlockedLink Copied

Before neural language models, n-grams powered critical applications.

Speech recognition: N-gram language models helped disambiguate acoustically similar phrases and improved recognition accuracy.

Machine translation: Statistical MT systems used n-gram models to score translation fluency, preferring outputs that sounded natural.

Spelling correction: N-gram probabilities identify which corrections produce more likely word sequences.

Text prediction: Early autocomplete systems used n-grams to suggest likely next words.

Information retrieval: N-gram statistics help identify important phrases and improve search ranking.

The concepts introduced by n-gram models remain central to modern language modeling: conditional probability, the chain rule, and the trade-off between context length and data sparsity. Transformers and large language models are sophisticated solutions to the problems n-gram models first revealed.

Key ParametersLink Copied

When building n-gram language models, several parameters affect model behavior and performance.

n (n-gram order)

The order determines how many words of context the model considers. Common choices:

n=1 (unigram): No context, just word frequencies. Fast but ignores word order entirely.
n=2 (bigram): One word of context. Good balance of simplicity and performance for many tasks.
n=3 (trigram): Two words of context. Standard choice for production systems before neural models.
n=4 or higher: Rarely used due to severe data sparsity. Requires massive training corpora.

Higher values capture more context but exponentially increase the number of possible n-grams. This leads to sparse counts and zero probabilities for valid sequences.

temperature (for generation)

Controls randomness during text generation:

temperature < 1.0: More deterministic, favors high-probability words. Use for predictable, conservative output.
temperature = 1.0: Samples according to true model probabilities. Balanced creativity and coherence.
temperature > 1.0: More random, explores low-probability words. Use for creative, diverse output at the cost of coherence.

Values between 0.7 and 1.2 work well for most applications.

Start and end tokens

<s>: Marks sentence beginnings. Add n-1 copies for an n-gram model to provide context for the first words.
</s>: Marks sentence endings. Enables the model to learn which words typically end sentences.

Vocabulary handling

Case normalization: Converting to lowercase reduces vocabulary size but loses information (e.g., "Apple" vs "apple").
Unknown token <UNK>: Replacing rare words with a special token handles out-of-vocabulary words during inference.
Minimum frequency threshold: Excluding words that appear fewer than k times reduces noise from typos and rare terms.

SummaryLink Copied

N-gram language models assign probabilities to word sequences by decomposing joint probabilities into conditional probabilities using the chain rule. The Markov assumption limits context to the previous $n-1$ words, making estimation tractable.

Key takeaways:

Chain rule decomposition breaks sequence probability into a product of conditional probabilities
Markov assumption limits history to $n-1$ words, trading accuracy for tractability
Maximum likelihood estimation computes probabilities as normalized n-gram counts
Start and end tokens model sentence boundaries explicitly
Log probabilities prevent numerical underflow when multiplying many small values
Zero probability problem occurs when test data contains unseen n-grams
Text generation samples words according to conditional probabilities
Storage efficiency requires sparse data structures like tries

N-gram models reveal the fundamental tension in language modeling: longer context improves predictions but requires exponentially more data. This insight motivates both smoothing techniques, covered next, and the neural approaches that eventually superseded n-gram models.

QuizLink Copied

Ready to test your understanding? Take this quick quiz to reinforce what you've learned about n-gram language models.

Loading component...

Back to Language AI Handbook

Reference

BIBTEXAcademic

@misc{ngramlanguagemodelsprobabilitybasedtextgenerationprediction, author = {Michael Brenndoerfer}, title = {N-gram Language Models: Probability-Based Text Generation & Prediction}, year = {2025}, url = {https://mbrenndoerfer.com/writing/n-gram-language-models-probability-text-generation}, organization = {mbrenndoerfer.com}, note = {Accessed: 2025-12-07} }

APAAcademic

Michael Brenndoerfer (2025). N-gram Language Models: Probability-Based Text Generation & Prediction. Retrieved from https://mbrenndoerfer.com/writing/n-gram-language-models-probability-text-generation

MLAAcademic

Michael Brenndoerfer. "N-gram Language Models: Probability-Based Text Generation & Prediction." 2025. Web. 12/7/2025. <https://mbrenndoerfer.com/writing/n-gram-language-models-probability-text-generation>.

CHICAGOAcademic

Michael Brenndoerfer. "N-gram Language Models: Probability-Based Text Generation & Prediction." Accessed 12/7/2025. https://mbrenndoerfer.com/writing/n-gram-language-models-probability-text-generation.

HARVARDAcademic

Michael Brenndoerfer (2025) 'N-gram Language Models: Probability-Based Text Generation & Prediction'. Available at: https://mbrenndoerfer.com/writing/n-gram-language-models-probability-text-generation (Accessed: 12/7/2025).

SimpleBasic

Michael Brenndoerfer (2025). N-gram Language Models: Probability-Based Text Generation & Prediction. https://mbrenndoerfer.com/writing/n-gram-language-models-probability-text-generation

Direct link:

https://mbrenndoerfer.com/writing/n-gram-language-models-probability-text-generation

Part of Language AI Handbook

This article is part of the free-to-read Language AI Handbook

About the author: Michael Brenndoerfer

All opinions expressed here are my own and do not reflect the views of my employer.

Michael currently works as an Associate Director of Data Science at EQT Partners in Singapore, where he drives AI and data initiatives across private capital investments.

With over a decade of experience spanning private equity, management consulting, and software engineering, he specializes in building and scaling analytics capabilities from the ground up. He has published research in leading AI conferences and holds expertise in machine learning, natural language processing, and value creation through data.

View Full Resume Publications Contact Books

InteractiveN-gram Language Models: Probability-Based Text Generation & Prediction

N-gram Language ModelsLink Copied

The Language Modeling ProblemLink Copied

The Chain Rule of ProbabilityLink Copied

The Markov AssumptionLink Copied

Maximum Likelihood EstimationLink Copied

Start and End TokensLink Copied

Computing Sequence ProbabilitiesLink Copied

The Zero Probability ProblemLink Copied

Building a Complete N-gram ModelLink Copied

Generating Text from N-gram ModelsLink Copied

Temperature SamplingLink Copied

Model Storage and EfficiencyLink Copied

Sparse StorageLink Copied

Trie-Based StorageLink Copied

Visualizing N-gram DistributionsLink Copied

Working with NLTKLink Copied

Limitations of N-gram ModelsLink Copied

What N-gram Models UnlockedLink Copied

Key ParametersLink Copied

SummaryLink Copied

QuizLink Copied

Reference

About the author: Michael Brenndoerfer

Related Content

Smoothing Techniques for N-gram Language Models: From Laplace to Kneser-Ney

Bag of Words: Document-Term Matrices, Vocabulary Construction & Sparse Representations

N-grams: Capturing Word Order in Text with Bigrams, Trigrams & Skip-grams

Stay updated