N-gram Language Models: Probability-Based Text Generation & Prediction

Michael Brenndoerfer · Updated March 26, 2025 · 42 min read

Learn how n-gram language models assign probabilities to word sequences using the chain rule and Markov assumption, with implementations for text generation and scoring.


N-gram Language Models

What makes some word sequences sound natural while others feel wrong? "The cat sat on the mat" flows effortlessly, but "mat the on sat cat the" is gibberish. Native speakers intuitively know which sequences are likely, even for sentences they've never encountered. N-gram language models capture this intuition mathematically, assigning probabilities to sequences of words based on patterns learned from text.

Language models are foundational to NLP. They power spell checkers that suggest "their" over "thier," autocomplete systems that predict your next word, and speech recognizers that disambiguate "recognize speech" from "wreck a nice beach." Before neural networks dominated the field, n-gram models were the workhorses of language modeling, and understanding them provides essential intuition for modern approaches.

This chapter builds n-gram language models from first principles. You'll learn the probability theory that makes them work, implement them from scratch, and discover both their power and fundamental limitations.

The Language Modeling Problem

A language model assigns probabilities to sequences of words. Given a sequence of $n$ words, denoted $w_1, w_2, \ldots, w_n$, the model estimates $P(w_1, w_2, \ldots, w_n)$, the probability of that exact sequence occurring in natural language. Here $w_i$ represents the $i$-th word in the sequence (e.g., $w_1$ might be "the" and $w_2$ might be "cat").

Language Model

A language model is a probability distribution over sequences of words. It assigns higher probabilities to sequences that are more likely to occur in natural language and lower probabilities to unlikely or ungrammatical sequences.

Why is this useful? Consider a speech recognition system that hears an ambiguous audio signal. It might produce two candidate transcriptions: "I saw a van" and "eyes awe of an." A language model knows the first is far more probable and helps the system choose correctly.

Language models also enable text generation. If we can compute $P(w_1, \ldots, w_n)$ for any sequence, we can generate new text by sampling words according to their probabilities given the preceding context.

The Chain Rule of Probability

Computing the probability of a full sequence seems daunting at first. How could we possibly estimate $P(\text{"the cat sat on the mat"})$ directly? We'd need to count how often this exact six-word sequence appears in our training data, then divide by the total number of six-word sequences. But most sentences are unique. Even in a corpus of billions of words, we're unlikely to see any specific sentence more than a handful of times.

The chain rule of probability offers an escape from this dilemma. Instead of estimating the joint probability directly, we decompose it into a product of conditional probabilities. Each conditional asks a simpler question: given everything that came before, how likely is the next word?

The chain rule derives from the definition of conditional probability. Recall that conditional probability is defined as:

$$P(A|B) = \frac{P(A, B)}{P(B)}$$

where:

  • $P(A|B)$: the probability of event $A$ occurring given that event $B$ has already occurred
  • $P(A, B)$: the joint probability of both $A$ and $B$ occurring together
  • $P(B)$: the probability of event $B$ occurring (the denominator normalizes by the probability of the condition)

Rearranging, we get the multiplication rule: $P(A, B) = P(B) \times P(A|B)$. This says the probability of both events equals the probability of the first event times the probability of the second event given the first. For a sequence of words, we apply this rule repeatedly to break down the joint probability into simpler conditional probabilities:

Step 1: Start with the joint probability of all words. This is what we ultimately want to compute:

$$P(w_1, w_2, \ldots, w_n)$$

Step 2: Apply the multiplication rule to separate the last word. We treat "all words up to $w_{n-1}$" as event $B$ and "$w_n$ appears next" as event $A$:

$$P(w_1, \ldots, w_n) = P(w_1, \ldots, w_{n-1}) \times P(w_n|w_1, \ldots, w_{n-1})$$

Step 3: Repeat for the remaining words, working backwards. Each application peels off one more word from the joint probability until we reach the first word:

$$P(w_1, w_2, \ldots, w_n) = P(w_1) \times P(w_2|w_1) \times P(w_3|w_1, w_2) \times \cdots \times P(w_n|w_1, \ldots, w_{n-1})$$

We can write this more compactly using product notation:

$$P(w_1, \ldots, w_n) = \prod_{i=1}^{n} P(w_i|w_1, \ldots, w_{i-1})$$

where:

  • $w_i$: the $i$-th word in the sequence (e.g., $w_1 = \text{"the"}$, $w_2 = \text{"cat"}$)
  • $P(w_i|w_1, \ldots, w_{i-1})$: the conditional probability of word $w_i$ given all preceding words (the "history")
  • $\prod_{i=1}^{n}$: the product operator, which multiplies all $n$ terms together

The intuition is straightforward: the probability of a sequence equals the probability of the first word, times the probability of the second word given the first, times the probability of the third word given the first two, and so on. Each step conditions on all previously observed words.

This decomposition is exact, not an approximation. The probability of any sequence equals the product of each word's probability given its entire preceding context.

Let's see this decomposition in action with a concrete example:

In[2]:
Code
# Illustrate the chain rule decomposition
sentence = "the cat sat on the mat"
words = sentence.split()

# The chain rule breaks the joint probability into conditional probabilities
decomposition = []
for i, word in enumerate(words):
    if i == 0:
        condition = "∅"  # No conditioning for first word
        term = f"P({word})"
    else:
        context = " ".join(words[:i])
        term = f"P({word}|{context})"
    decomposition.append(term)
Out[3]:
Console
Sentence: "the cat sat on the mat"

Chain rule decomposition:
P(the cat sat on the mat) =
    P(the)
  × P(cat|the)
  × P(sat|the cat)
  × P(on|the cat sat)
  × P(the|the cat sat on)
  × P(mat|the cat sat on the)

Each term in this product answers a progressively more specific question. The first asks: how likely is "the" to start a sentence? The second asks: given we started with "the," how likely is "cat" to follow? By the time we reach the final word, we're asking: given the entire context "the cat sat on the," how likely is "mat"?

The chain rule is mathematically exact, but it creates a severe practical problem. To compute $P(w_n|w_1, \ldots, w_{n-1})$, we need the probability of word $w_n$ given the entire preceding context. For long sentences, this context can be arbitrarily long. A 20-word sentence requires us to estimate the probability of the 20th word given all 19 preceding words. We'll never have enough data to estimate these long-context probabilities reliably, since most 19-word contexts appear only once, if ever.

This is where the Markov assumption comes to the rescue.

The Markov Assumption

The chain rule gives us an exact decomposition, but it demands something impossible: reliable estimates of probabilities conditioned on arbitrarily long histories. The key insight of n-gram models is to make a simplifying assumption that trades some accuracy for tractability.

Markov Assumption

The Markov assumption states that the probability of a word depends only on a fixed-length history of preceding words, not the entire sequence. An n-gram model assumes the probability of word $w_i$ depends only on the previous $n-1$ words.

The assumption is named after the Russian mathematician Andrey Markov, who studied stochastic processes with this "memoryless" property. In the context of language, it means we pretend the future depends only on the recent past, not the distant past.

For a bigram model ($n=2$), each word depends only on the single word immediately before it:

$$P(w_i|w_1, \ldots, w_{i-1}) \approx P(w_i|w_{i-1})$$

where:

  • $P(w_i|w_1, \ldots, w_{i-1})$: the true conditional probability given the full history (left side)
  • $P(w_i|w_{i-1})$: the bigram approximation using only the immediately preceding word (right side)
  • $\approx$: indicates this is an approximation, not an equality

For a trigram model ($n=3$), the memory extends to two previous words:

$$P(w_i|w_1, \ldots, w_{i-1}) \approx P(w_i|w_{i-2}, w_{i-1})$$

where:

  • $w_{i-2}$: the word two positions before the current word
  • $w_{i-1}$: the word immediately before the current word
  • The context $(w_{i-2}, w_{i-1})$ forms a two-word history

In general, for an n-gram model, we approximate:

$$P(w_i|w_1, \ldots, w_{i-1}) \approx P(w_i|w_{i-n+1}, \ldots, w_{i-1})$$

where the context window spans exactly $n-1$ preceding words. The approximation symbol $\approx$ reminds us this is a simplification: we're discarding information from the distant past to make estimation tractable.

Why does this help? Consider the difference in what we need to estimate:

  • Full history: We need $P(\text{mat}|\text{the cat sat on the})$, requiring counts of the specific five-word context "the cat sat on the"
  • Trigram: We need $P(\text{mat}|\text{on the})$, requiring only counts of the two-word context "on the"
  • Bigram: We need $P(\text{mat}|\text{the})$, requiring only counts of the single word "the"

Shorter contexts appear far more frequently in training data, giving us more reliable probability estimates.

In[4]:
Code
# Compare full history vs Markov approximations
sentence = "the cat sat on the mat"
words = sentence.split()

# Different n-gram approximations for P(mat | the cat sat on the)
full_history = "P(mat | the cat sat on the)"
trigram = "P(mat | on the)"
bigram = "P(mat | the)"
unigram = "P(mat)"

approximations = [
    ("Full history", full_history, "exact but impractical"),
    ("Trigram (n=3)", trigram, "considers 2 previous words"),
    ("Bigram (n=2)", bigram, "considers 1 previous word"),
    ("Unigram (n=1)", unigram, "ignores all context"),
]
Out[5]:
Console
Markov Approximations for the final word:
------------------------------------------------------------

Full history:
  P(mat | the cat sat on the)
  → exact but impractical

Trigram (n=3):
  P(mat | on the)
  → considers 2 previous words

Bigram (n=2):
  P(mat | the)
  → considers 1 previous word

Unigram (n=1):
  P(mat)
  → ignores all context

Is the Markov assumption reasonable? Strictly speaking, no. The probability of "mat" after "the cat sat on the" depends on "cat" and "sat," which establish that we're talking about an animal sitting somewhere. A bigram model, seeing only "the," has no idea whether we're discussing cats on mats, birds in trees, or books on shelves.

Yet this approximation works well in practice, especially for larger values of $n$. Language has enough local structure that nearby words carry most of the predictive information. The word "on" strongly predicts a location or surface will follow. The phrase "sat on" even more strongly suggests something like "mat," "chair," or "floor." We lose some information, but we gain the ability to estimate probabilities reliably.

Maximum Likelihood Estimation

With the Markov assumption in hand, we've reduced our problem to estimating conditional probabilities like $P(\text{mat}|\text{on the})$. But how do we actually compute these from data?

The most straightforward approach is maximum likelihood estimation (MLE): count how often each n-gram appears in our training corpus, then normalize. The intuition comes from the definition of conditional probability:

$$P(A|B) = \frac{P(A \text{ and } B)}{P(B)}$$

where:

  • $P(A|B)$: the probability of $A$ given $B$
  • $P(A \text{ and } B)$: the joint probability of both events occurring
  • $P(B)$: the probability of the conditioning event

For language, this translates to: the probability of word $w$ following context $c$ equals the fraction of times we observed $c$ followed by $w$, out of all the times we observed $c$.

For bigrams, the MLE estimate is:

$$P(w_i|w_{i-1}) = \frac{C(w_{i-1}, w_i)}{C(w_{i-1})}$$

where:

  • $C(w_{i-1}, w_i)$: the count of times the bigram $(w_{i-1}, w_i)$ appears consecutively in the training corpus
  • $C(w_{i-1})$: the count of times the context word $w_{i-1}$ appears (in any context)
  • The ratio gives the proportion of $w_{i-1}$ occurrences that are followed by $w_i$

Example: If "the" appears 100 times in our corpus, and "the cat" appears 8 times, then:

$$P(\text{cat}|\text{the}) = \frac{C(\text{the}, \text{cat})}{C(\text{the})} = \frac{8}{100} = 0.08$$

This means 8% of the time "the" is followed by "cat."

For trigrams, we extend the same logic to two-word contexts:

$$P(w_i|w_{i-2}, w_{i-1}) = \frac{C(w_{i-2}, w_{i-1}, w_i)}{C(w_{i-2}, w_{i-1})}$$

where:

  • $C(w_{i-2}, w_{i-1}, w_i)$: the count of the trigram (three consecutive words)
  • $C(w_{i-2}, w_{i-1})$: the count of the bigram context (the first two words of the trigram)

The general pattern for any n-gram order is:

$$P(w_i|w_{i-n+1}, \ldots, w_{i-1}) = \frac{C(w_{i-n+1}, \ldots, w_{i-1}, w_i)}{C(w_{i-n+1}, \ldots, w_{i-1})}$$

where:

  • $w_i$: the word whose probability we want to estimate
  • $w_{i-n+1}, \ldots, w_{i-1}$: the context of $n-1$ words immediately preceding $w_i$
  • $C(w_{i-n+1}, \ldots, w_{i-1}, w_i)$: the count of the complete n-gram (context plus target word) in the training corpus
  • $C(w_{i-n+1}, \ldots, w_{i-1})$: the count of the context alone (how often this $(n-1)$-gram appears)

The formula always divides the count of the full n-gram by the count of its $(n-1)$-gram prefix (the context). This relative frequency estimate is the maximum likelihood solution because it maximizes the probability of observing the training data under the model.

Let's implement this and see it work on a small corpus:

In[6]:
Code
from collections import Counter


def count_ngrams(tokens, n):
    """Count n-grams in a token sequence."""
    ngram_counts = Counter()
    for i in range(len(tokens) - n + 1):
        ngram = tuple(tokens[i : i + n])
        ngram_counts[ngram] += 1
    return ngram_counts


# Sample corpus
corpus = """
the cat sat on the mat
the cat ate the fish
the dog sat on the mat
the dog chased the cat
"""

# Tokenize
tokens = corpus.lower().split()

# Count unigrams and bigrams
unigram_counts = count_ngrams(tokens, 1)
bigram_counts = count_ngrams(tokens, 2)
Out[7]:
Console
Corpus tokens: 22
Unique unigrams: 9
Unique bigrams: 14

Top 10 bigrams:
  ('the', 'cat'): 3
  ('sat', 'on'): 2
  ('on', 'the'): 2
  ('the', 'mat'): 2
  ('mat', 'the'): 2
  ('the', 'dog'): 2
  ('cat', 'sat'): 1
  ('cat', 'ate'): 1
  ('ate', 'the'): 1
  ('the', 'fish'): 1

Our small corpus contains 22 tokens but only 9 unique words. The most frequent bigram is ("the", "cat"), which appears three times; several others appear twice. Notice how "the" dominates as the first element of many bigrams, reflecting its role as the most common English word.

Unigram frequency distribution. "The" dominates with 8 occurrences, exhibiting the Zipf-like pattern typical of natural language.

| Word | Count |
| --- | --- |
| the | 8 |
| cat | 3 |
| dog | 2 |
| sat | 2 |
| on | 2 |
| mat | 2 |
| chased | 1 |
| ate | 1 |
| fish | 1 |
Bigram frequency distribution (top 12). The bigram "the → cat" appears three times, several bigrams involving "the" appear twice, and most appear only once.

| Bigram | Count |
| --- | --- |
| the → cat | 3 |
| the → dog | 2 |
| the → mat | 2 |
| on → the | 2 |
| sat → on | 2 |
| mat → the | 2 |
| cat → sat | 1 |
| cat → ate | 1 |
| ate → the | 1 |
| the → fish | 1 |
| dog → sat | 1 |
| dog → chased | 1 |

The frequency distributions reveal a pattern common to all natural language: a few items dominate while most appear rarely. "The" appears 8 times, more than any other word, while most bigrams appear only once. This Zipf-like distribution explains why n-gram models struggle with data sparsity. As we move to higher-order n-grams, the long tail grows longer, and most n-grams remain unseen even in large corpora.

Now let's apply the MLE formula to compute actual probabilities:

In[9]:
Code
def bigram_probability(w1, w2, bigram_counts, unigram_counts):
    """Calculate P(w2|w1) using MLE."""
    bigram = (w1, w2)
    unigram = (w1,)

    bigram_count = bigram_counts.get(bigram, 0)
    unigram_count = unigram_counts.get(unigram, 0)

    if unigram_count == 0:
        return 0.0

    return bigram_count / unigram_count


# Calculate some example probabilities
examples = [
    ("the", "cat"),
    ("the", "dog"),
    ("the", "mat"),
    ("cat", "sat"),
    ("sat", "on"),
    ("on", "the"),
]

probabilities = [
    (w1, w2, bigram_probability(w1, w2, bigram_counts, unigram_counts))
    for w1, w2 in examples
]
Out[10]:
Console
Bigram Probabilities P(w2|w1):
----------------------------------------
P(cat|the) = 3/8 = 0.375
P(dog|the) = 2/8 = 0.250
P(mat|the) = 2/8 = 0.250
P(sat|cat) = 1/3 = 0.333
P(on|sat) = 2/2 = 1.000
P(the|on) = 2/2 = 1.000

The probabilities reveal patterns in our corpus. After "the," the word "cat" follows 37.5% of the time, while "dog" and "mat" each follow 25% of the time. After "sat," the word "on" appears with 100% probability, reflecting the fixed phrase "sat on" that appears in every relevant sentence. After "on," the word "the" follows every time, since our corpus always uses "on the" before the final noun.

These probabilities capture the local structure of our tiny corpus. A model trained on this data would predict that "sat" is always followed by "on," and "on" is always followed by "the." Of course, with only four sentences, these patterns are artifacts of our limited data. A larger corpus would reveal more nuanced distributions.

We can visualize the complete bigram probability matrix as a heatmap, where each cell shows $P(w_j|w_i)$, the probability of word $w_j$ following word $w_i$:

Out[11]:
Visualization
Heatmap showing bigram transition probabilities between words in the corpus.
Bigram transition probability matrix. Each row represents a context word, and each column represents a possible next word. Darker cells indicate higher probabilities. The sparse structure reveals that most word pairs never occur together, while a few transitions dominate (e.g., 'sat' → 'on' has probability 1.0).

The heatmap reveals the sparse structure of bigram probabilities. Most cells are empty (zero probability), indicating word pairs that never occurred in our corpus. The non-zero cells cluster around specific patterns: "the" can be followed by several words, but "sat" is always followed by "on." This sparsity becomes even more pronounced with larger vocabularies and higher-order n-grams.
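The plotting code itself isn't shown above. As a minimal sketch, the matrix behind such a heatmap could be assembled from the counts and helper defined earlier (`bigram_probability`, `unigram_counts`, `bigram_counts`); the variable names below are illustrative:

```python
# Sketch: assemble the bigram transition matrix a heatmap like the one above
# would display. Reuses bigram_probability, unigram_counts, bigram_counts.
vocab_words = sorted(w for (w,) in unigram_counts)  # unique words, stable order

transition_matrix = [
    [bigram_probability(w1, w2, bigram_counts, unigram_counts) for w2 in vocab_words]
    for w1 in vocab_words
]

# Print rows as aligned probabilities; most entries are 0.00 (unseen pairs).
for w1, row in zip(vocab_words, transition_matrix):
    print(f"{w1:>8}  " + "  ".join(f"{p:.2f}" for p in row))
```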

Start and End Tokens

Sentences have beginnings and endings. To model this, we add special tokens: <s> marks the start of a sentence and </s> marks the end. These tokens let us model which words typically start sentences and which words end them.

In[12]:
Code
def add_sentence_markers(text, n):
    """Add start and end markers to sentences."""
    sentences = text.strip().split("\n")
    marked_sentences = []

    for sentence in sentences:
        if sentence.strip():
            tokens = sentence.lower().split()
            # Add n-1 start tokens and 1 end token
            marked = ["<s>"] * (n - 1) + tokens + ["</s>"]
            marked_sentences.append(marked)

    return marked_sentences


# Process corpus with markers for bigram model
marked_corpus = add_sentence_markers(corpus, n=2)

# Flatten for counting
all_tokens = [token for sentence in marked_corpus for token in sentence]

# Recount with markers
unigram_counts_marked = count_ngrams(all_tokens, 1)
bigram_counts_marked = count_ngrams(all_tokens, 2)
Out[13]:
Console
Sentences with markers (bigram model):
  1. <s> the cat sat on the mat </s>
  2. <s> the cat ate the fish </s>
  3. <s> the dog sat on the mat </s>
  4. <s> the dog chased the cat </s>

Bigrams involving markers:
  ('<s>', 'the'): 4
  ('</s>', '<s>'): 3
  ('mat', '</s>'): 2
  ('fish', '</s>'): 1
  ('cat', '</s>'): 1

The start marker <s> lets us compute $P(w_1|\text{<s>})$, the probability of a word starting a sentence. In our corpus, "the" starts every sentence, so $P(\text{the}|\text{<s>}) = 1.0$. The end marker </s> lets us model sentence endings: "mat," "fish," and "cat" all end sentences.

For higher-order n-grams, we add multiple start tokens. A trigram model needs two <s> tokens at the beginning to provide context for the first two words.
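For instance, calling the `add_sentence_markers` helper defined above with n=3 pads each sentence with two start tokens:

```python
# Trigram padding: two <s> tokens give the first real word a full two-word context.
trigram_marked = add_sentence_markers("the cat sat on the mat", n=3)
print(trigram_marked[0])
# ['<s>', '<s>', 'the', 'cat', 'sat', 'on', 'the', 'mat', '</s>']
```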

Computing Sequence Probabilities

With n-gram probabilities estimated, we can compute the probability of any sequence. We combine the chain rule decomposition with our Markov approximation to get:

$$P(w_1, \ldots, w_m) \approx \prod_{i=1}^{m} P(w_i|w_{i-n+1}, \ldots, w_{i-1})$$

where:

  • $m$: the total number of words in the sequence
  • $n$: the n-gram order (e.g., 2 for bigrams, 3 for trigrams)
  • $w_i$: the $i$-th word in the sequence
  • $w_{i-n+1}, \ldots, w_{i-1}$: the context of $n-1$ words immediately preceding $w_i$
  • $\prod_{i=1}^{m}$: multiply together all $m$ conditional probabilities

For a bigram model ($n=2$), this simplifies to:

$$P(w_1, \ldots, w_m) \approx P(w_1|\text{<s>}) \times P(w_2|w_1) \times P(w_3|w_2) \times \cdots \times P(w_m|w_{m-1})$$

where:

  • $P(w_1|\text{<s>})$: the probability of the first word given the start-of-sentence marker
  • $P(w_i|w_{i-1})$: the probability of word $i$ given only the single preceding word
  • The product includes $m$ terms for a sentence of $m$ words (plus the end token if we include it)

Each term uses only the single preceding word as context. For the first word in the sequence, we use the start token <s> to provide the required context. This ensures every word has a well-defined conditioning context, even at sentence boundaries.

In[14]:
Code
import math


def sequence_probability(sentence, bigram_counts, unigram_counts, log=True):
    """Calculate the probability of a sentence using bigram model."""
    # Add markers
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]

    if log:
        # Use log probabilities to avoid underflow
        log_prob = 0.0
        for i in range(1, len(tokens)):
            w1, w2 = tokens[i - 1], tokens[i]
            prob = bigram_probability(w1, w2, bigram_counts, unigram_counts)
            if prob == 0:
                return float("-inf")  # Log of zero
            log_prob += math.log(prob)
        return log_prob
    else:
        # Regular probability (prone to underflow)
        prob = 1.0
        for i in range(1, len(tokens)):
            w1, w2 = tokens[i - 1], tokens[i]
            prob *= bigram_probability(w1, w2, bigram_counts, unigram_counts)
        return prob


# Test sentences
test_sentences = [
    "the cat sat on the mat",
    "the dog sat on the mat",
    "the cat chased the dog",
    "the fish ate the cat",
]

results = [
    (
        s,
        sequence_probability(
            s, bigram_counts_marked, unigram_counts_marked, log=True
        ),
    )
    for s in test_sentences
]
Out[15]:
Console
Sentence Log-Probabilities:
------------------------------------------------------------
  log P =   -3.47  "the dog sat on the mat"
  log P =   -3.47  "the cat sat on the mat"
  log P =    -inf  "the cat chased the dog"
  log P =    -inf  "the fish ate the cat"

Notice we use log probabilities. Multiplying many small probabilities leads to numerical underflow, where numbers become too small to represent in floating-point arithmetic. Taking logarithms converts products to sums, keeping values in a manageable range.

The mathematical transformation works as follows. Instead of computing a product of probabilities (which can underflow):

$$P(w_1, \ldots, w_m) = \prod_{i=1}^{m} P(w_i|w_{i-1})$$

we compute a sum of log probabilities (which remains numerically stable):

$$\log P(w_1, \ldots, w_m) = \sum_{i=1}^{m} \log P(w_i|w_{i-1})$$

where:

  • $\log$: the natural logarithm function (base $e \approx 2.718$)
  • $\sum_{i=1}^{m}$: sum over all $m$ words in the sequence
  • $\log P(w_i|w_{i-1})$: the log probability of each bigram conditional (always negative since $P < 1$)
  • The result is the log of the original sequence probability

This transformation works because of the logarithm's fundamental property: $\log(a \times b) = \log(a) + \log(b)$. The log of a product equals the sum of the logs. While individual probabilities might be 0.01 or smaller, their log values are manageable negative numbers (e.g., $\log(0.01) \approx -4.6$). Summing these values is numerically stable even for very long sequences. To recover the original probability if needed, we compute $P = \exp(\log P)$.
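To make the underflow issue concrete, here is a small standalone demonstration; the per-word probability of 0.001 is an arbitrary illustrative value:

```python
import math

# Multiplying 200 per-word probabilities of 0.001 underflows to exactly 0.0,
# because the true value (1e-600) lies far below the smallest positive double (~1e-308).
p_word = 0.001
raw = 1.0
for _ in range(200):
    raw *= p_word
print(raw)  # 0.0

# Summing the log probabilities instead stays perfectly representable.
log_prob = sum(math.log(p_word) for _ in range(200))
print(log_prob)  # ≈ -1381.55
```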

Out[16]:
Visualization
Raw probabilities shrink exponentially as sentence length increases. By 30 words, values approach floating-point underflow limits (~10⁻³⁰⁸).
Log probabilities decrease linearly, remaining computationally tractable for sentences of any length.

The visualization shows why log probabilities are essential. Raw sequence probabilities shrink exponentially as each word's probability is multiplied in, quickly becoming vanishingly small and eventually approaching the limits of floating-point representation. Log probabilities decrease linearly and remain computationally tractable for sentences of any length.

The rankings reveal what our model learned. Sentences similar to the training data ("the cat sat on the mat") get higher probabilities. Sentences containing unseen bigrams (such as "the fish ate") get zero probability, resulting in a log probability of negative infinity.

The Zero Probability Problem

What happens when we encounter an n-gram that never appeared in training?

In[17]:
Code
# Test with an unseen sentence
unseen_sentence = "the bird flew over the house"
unseen_tokens = ["<s>"] + unseen_sentence.lower().split() + ["</s>"]

# Check each bigram
bigram_analysis = []
for i in range(1, len(unseen_tokens)):
    w1, w2 = unseen_tokens[i - 1], unseen_tokens[i]
    bigram = (w1, w2)
    count = bigram_counts_marked.get(bigram, 0)
    prob = bigram_probability(
        w1, w2, bigram_counts_marked, unigram_counts_marked
    )
    bigram_analysis.append((w1, w2, count, prob))
Out[18]:
Console
Analyzing: "the bird flew over the house"
--------------------------------------------------
  (<s>, the): count=4, P=1.000 ✓
  (the, bird): count=0, P=0.000 ✗ UNSEEN
  (bird, flew): count=0, P=0.000 ✗ UNSEEN
  (flew, over): count=0, P=0.000 ✗ UNSEEN
  (over, the): count=0, P=0.000 ✗ UNSEEN
  (the, house): count=0, P=0.000 ✗ UNSEEN
  (house, </s>): count=0, P=0.000 ✗ UNSEEN

Total probability: 0.0

A single unseen bigram makes the entire sequence probability zero. This is a serious problem: perfectly valid sentences get zero probability just because our training data didn't contain every possible word combination.

This is the zero probability problem. It motivates the smoothing techniques covered in the next chapter. Raw MLE estimates are too brittle for real applications.

Building a Complete N-gram Model

Let's build a complete, reusable n-gram language model class:

In[19]:
Code
class NgramLanguageModel:
    """An n-gram language model with MLE estimation."""

    def __init__(self, n=2):
        self.n = n
        self.ngram_counts = Counter()
        self.context_counts = Counter()
        self.vocab = set()

    def train(self, text):
        """Train the model on text."""
        sentences = text.strip().split("\n")

        for sentence in sentences:
            if not sentence.strip():
                continue

            # Tokenize and add markers
            tokens = sentence.lower().split()
            tokens = ["<s>"] * (self.n - 1) + tokens + ["</s>"]

            # Update vocabulary
            self.vocab.update(tokens)

            # Count n-grams and (n-1)-grams
            for i in range(len(tokens) - self.n + 1):
                ngram = tuple(tokens[i : i + self.n])
                context = tuple(tokens[i : i + self.n - 1])

                self.ngram_counts[ngram] += 1
                self.context_counts[context] += 1

    def probability(self, word, context):
        """Calculate P(word|context) using MLE."""
        if isinstance(context, str):
            context = tuple(context.split())

        ngram = context + (word,)

        ngram_count = self.ngram_counts.get(ngram, 0)
        context_count = self.context_counts.get(context, 0)

        if context_count == 0:
            return 0.0

        return ngram_count / context_count

    def sentence_log_probability(self, sentence):
        """Calculate log probability of a sentence."""
        tokens = sentence.lower().split()
        tokens = ["<s>"] * (self.n - 1) + tokens + ["</s>"]

        log_prob = 0.0
        for i in range(self.n - 1, len(tokens)):
            context = tuple(tokens[i - self.n + 1 : i])
            word = tokens[i]

            prob = self.probability(word, context)
            if prob == 0:
                return float("-inf")

            log_prob += math.log(prob)

        return log_prob

The NgramLanguageModel class encapsulates all the logic we've built: n-gram counting, context tracking, and probability computation. The train() method processes text and builds the count tables. The probability() method returns $P(\text{word}|\text{context})$ using MLE, and sentence_log_probability() computes the log probability of an entire sentence.

Let's train models of different orders and compare them:

In[20]:
Code
# Larger training corpus
training_corpus = """
the cat sat on the mat
the cat ate the fish
the dog sat on the mat
the dog chased the cat
the cat ran from the dog
the dog ate the food
the bird flew over the tree
the bird sat on the branch
the cat watched the bird
the dog watched the cat
"""

# Train unigram, bigram, and trigram models
unigram_model = NgramLanguageModel(n=1)
bigram_model = NgramLanguageModel(n=2)
trigram_model = NgramLanguageModel(n=3)

unigram_model.train(training_corpus)
bigram_model.train(training_corpus)
trigram_model.train(training_corpus)

# Compare on test sentences
test_sentences = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird flew over the tree",
    "the cat ate the bird",
]

model_comparison = []
for sentence in test_sentences:
    scores = {
        "unigram": unigram_model.sentence_log_probability(sentence),
        "bigram": bigram_model.sentence_log_probability(sentence),
        "trigram": trigram_model.sentence_log_probability(sentence),
    }
    model_comparison.append((sentence, scores))
Out[21]:
Console
Model Comparison (log probabilities):
======================================================================
Sentence                               Unigram     Bigram    Trigram
----------------------------------------------------------------------
the cat sat on the mat                  -16.24      -5.30      -3.11
the dog chased the cat                  -13.35      -5.30      -3.62
the bird flew over the tree             -19.83      -5.99      -2.71
the cat ate the bird                    -13.17      -5.99       -inf

The results reveal key differences between model orders. The unigram model assigns similar probabilities to all sentences because it ignores word order entirely, treating each word independently. The bigram model shows more variation, penalizing sentences with rare or unseen word transitions. The trigram model provides the finest discrimination, since it captures two-word context patterns.

Notice that some sentences receive -inf (negative infinity) log probability when they contain unseen n-grams. The sentence "the cat ate the bird" might score well with lower-order models but fail with the trigram if the specific three-word sequence never appeared in training. This illustrates the data sparsity problem: higher-order models are more expressive but require more training data.

Generating Text from N-gram Models

N-gram models can generate new text by sampling words according to their conditional probabilities. Starting from the start token, we repeatedly sample the next word given the current context.

In[22]:
Code
import random


def generate_sentence(model, max_length=20, seed=None):
    """Generate a sentence from an n-gram model."""
    if seed is not None:
        random.seed(seed)

    # Start with n-1 start tokens
    tokens = ["<s>"] * (model.n - 1)

    for _ in range(max_length):
        context = tuple(tokens[-(model.n - 1) :])

        # Get all possible next words and their probabilities
        candidates = []
        for ngram, count in model.ngram_counts.items():
            if ngram[:-1] == context:
                word = ngram[-1]
                prob = model.probability(word, context)
                candidates.append((word, prob))

        if not candidates:
            break

        # Sample according to probabilities
        words, probs = zip(*candidates)
        next_word = random.choices(words, weights=probs, k=1)[0]

        if next_word == "</s>":
            break

        tokens.append(next_word)

    # Remove start tokens and return
    return " ".join(tokens[model.n - 1 :])


# Generate sentences from bigram model
generated_bigram = [generate_sentence(bigram_model, seed=i) for i in range(5)]
Out[23]:
Console
Generated sentences (bigram model):
--------------------------------------------------
  1. the bird sat on the fish
  2. the bird
  3. the tree
  4. the dog chased the dog sat on the cat ate the dog watched the dog sat on the dog ate
  5. the cat

The bigram model generates sentences that reflect local word patterns from the training data. Each transition depends only on the immediately preceding word, so the model can produce grammatically plausible fragments but may lose coherence over longer spans. Notice how sentences often loop back to common patterns like "the cat" or "the dog."

Now let's generate with the trigram model, which uses two words of context:

In[24]:
Code
# Generate from trigram model
generated_trigram = [generate_sentence(trigram_model, seed=i) for i in range(5)]
Out[25]:
Console
Generated sentences (trigram model):
--------------------------------------------------
  1. the dog
  2. the bird
  3. the bird flew over the tree
  4. the dog chased the cat sat on the mat
  5. the cat

The generated text reflects patterns in the training data. Bigram models produce locally coherent text but may wander incoherently over longer spans. Trigram models maintain coherence over slightly longer distances, but neither captures long-range dependencies like subject-verb agreement across clauses.

Temperature Sampling

We can control the randomness of generation with a temperature parameter $T$. Given a probability distribution over next words, temperature scaling adjusts the distribution's "sharpness" before sampling.

The idea is to divide the log probabilities by $T$ before re-normalizing. Since log probabilities are negative (probabilities are less than 1), dividing by a small $T$ amplifies the differences between them. For example, if $\log P(w_1) = -1$ and $\log P(w_2) = -2$, dividing by $T = 0.5$ gives $-2$ and $-4$, doubling the gap. Conversely, dividing by a large $T$ shrinks differences, making the distribution more uniform.

$$P_T(w_i) = \frac{\exp(\log P(w_i) / T)}{\sum_j \exp(\log P(w_j) / T)}$$

where:

  • $P(w_i)$: the original probability of word $w_i$ from the language model
  • $T$: the temperature parameter (a positive real number, typically between 0.1 and 2.0)
  • $P_T(w_i)$: the temperature-adjusted probability used for sampling
  • $\exp$: the exponential function, which converts log probabilities back to probabilities
  • $\log P(w_i) / T$: the scaled log probability (dividing by $T$ controls how much we amplify or compress differences)
  • $\sum_j \exp(\log P(w_j) / T)$: the normalization constant, ensuring probabilities sum to 1

The temperature controls how "peaked" or "flat" the distribution becomes:

  • $T < 1$ (low temperature): Dividing log probabilities by a small number makes differences larger, sharpening the distribution and concentrating probability on the most likely words
  • $T = 1$: Uses the original probabilities unchanged
  • $T > 1$ (high temperature): Dividing by a large number shrinks differences, flattening the distribution and giving rare words more chance of being selected

Lower temperatures make the model more deterministic, always choosing high-probability words. Higher temperatures increase diversity:

In[26]:
Code
def generate_with_temperature(model, temperature=1.0, max_length=20, seed=None):
    """Generate with temperature-scaled probabilities."""
    if seed is not None:
        random.seed(seed)

    tokens = ["<s>"] * (model.n - 1)

    for _ in range(max_length):
        context = tuple(tokens[-(model.n - 1) :])

        candidates = []
        for ngram, count in model.ngram_counts.items():
            if ngram[:-1] == context:
                word = ngram[-1]
                prob = model.probability(word, context)
                candidates.append((word, prob))

        if not candidates:
            break

        words, probs = zip(*candidates)

        # Apply temperature scaling
        if temperature != 1.0:
            log_probs = [math.log(p + 1e-10) / temperature for p in probs]
            max_log = max(log_probs)
            scaled_probs = [math.exp(lp - max_log) for lp in log_probs]
            total = sum(scaled_probs)
            probs = tuple(p / total for p in scaled_probs)

        next_word = random.choices(words, weights=probs, k=1)[0]

        if next_word == "</s>":
            break

        tokens.append(next_word)

    return " ".join(tokens[model.n - 1 :])


# Compare different temperatures
temperatures = [0.5, 1.0, 1.5]
temp_samples = {
    t: [
        generate_with_temperature(bigram_model, temperature=t, seed=i)
        for i in range(3)
    ]
    for t in temperatures
}
Out[27]:
Console
Effect of Temperature on Generation:
============================================================

Temperature = 0.5:
  1. the dog
  2. the food
  3. the bird flew over the dog ate the dog ate the cat

Temperature = 1.0:
  1. the bird sat on the fish
  2. the bird
  3. the tree

Temperature = 1.5:
  1. the bird sat on the fish
  2. the bird
  3. the branch

At low temperatures, the model gravitates toward the most common patterns. At high temperatures, it explores less likely paths and sometimes produces more creative but less coherent output.

Out[28]:
Visualization
T=0.5 (Focused): Low temperature sharpens the distribution, concentrating probability on the most likely words.
T=1.0 (Original): The unmodified probability distribution from the language model.
T=1.5 (Diverse): High temperature flattens the distribution, giving rare words more chance.

The visualization shows how temperature reshapes probability distributions. At low temperature (0.5), the highest-probability words become even more dominant, making generation more predictable. At high temperature (1.5), the distribution flattens, giving rare words a fighting chance. This trade-off between coherence and diversity is fundamental to text generation with language models.

Model Storage and Efficiency

N-gram models can grow large. The number of possible n-grams grows exponentially with $n$. Given a vocabulary of $V$ unique words, the total number of possible n-grams is:

$$\text{Possible n-grams} = V^n$$

where:

  • $V$: the vocabulary size (number of unique words)
  • $n$: the n-gram order
  • $V^n$: $V$ multiplied by itself $n$ times (exponential growth)

This exponential growth creates a storage challenge. For English with $V = 100{,}000$ words and $n = 3$ (trigrams), we have $100{,}000^3 = 10^{15}$ potential trigrams. Even storing just one byte per trigram would require a petabyte of memory. Fortunately, most of these combinations never occur in real text.
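The arithmetic behind that estimate, spelled out:

```python
# Back-of-the-envelope storage estimate for a dense trigram table.
V = 100_000          # vocabulary size
n = 3                # n-gram order

possible_trigrams = V ** n                   # 10^15 potential trigrams
bytes_needed = possible_trigrams             # at one byte per trigram

print(f"{possible_trigrams:.0e} possible trigrams")
print(f"{bytes_needed / 1e15:.0f} petabyte(s) at one byte per trigram")
```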

Sparse Storage

Most possible n-grams never occur. We store only observed n-grams using dictionaries or hash tables:

In[29]:
Code
def estimate_model_size(model):
    """Estimate memory usage of an n-gram model."""
    # Count stored n-grams
    num_ngrams = len(model.ngram_counts)
    num_contexts = len(model.context_counts)
    vocab_size = len(model.vocab)

    # Estimate memory (rough approximation)
    # Each n-gram tuple: ~n * 50 bytes (string references)
    # Each count: 28 bytes (Python int)
    ngram_memory = num_ngrams * (model.n * 50 + 28)
    context_memory = num_contexts * ((model.n - 1) * 50 + 28)
    vocab_memory = vocab_size * 50

    total_bytes = ngram_memory + context_memory + vocab_memory

    return {
        "num_ngrams": num_ngrams,
        "num_contexts": num_contexts,
        "vocab_size": vocab_size,
        "estimated_mb": total_bytes / (1024 * 1024),
    }


# Check model sizes
sizes = {
    "Unigram": estimate_model_size(unigram_model),
    "Bigram": estimate_model_size(bigram_model),
    "Trigram": estimate_model_size(trigram_model),
}
Out[30]:
Console
Model Storage Statistics:
------------------------------------------------------------
Model         N-grams   Contexts    Vocab    Est. MB
------------------------------------------------------------
Unigram            19          1       19     0.0023
Bigram             36         19       20     0.0068
Trigram            44         29       20     0.0120

The trigram model stores more n-grams than the bigram model because it tracks longer sequences. Our toy corpus produces tiny models measured in kilobytes, but real-world models trained on billions of words can require gigabytes of storage. This makes efficient data structures essential.

Trie-Based Storage

For faster lookups, n-grams can be stored in a trie (prefix tree). Each path from root to node represents an n-gram prefix. Nodes store counts:

In[31]:
Code
class TrieNode:
    """A node in the n-gram trie."""

    def __init__(self):
        self.children = {}
        self.count = 0


class TrieNgramModel:
    """N-gram model using trie storage."""

    def __init__(self, n=2):
        self.n = n
        self.root = TrieNode()
        self.vocab = set()

    def _add_ngram(self, ngram):
        """Add an n-gram to the trie."""
        node = self.root
        for token in ngram:
            if token not in node.children:
                node.children[token] = TrieNode()
            node = node.children[token]
            node.count += 1

    def train(self, text):
        """Train on text."""
        sentences = text.strip().split("\n")

        for sentence in sentences:
            if not sentence.strip():
                continue

            tokens = sentence.lower().split()
            tokens = ["<s>"] * (self.n - 1) + tokens + ["</s>"]
            self.vocab.update(tokens)

            # Add all n-grams up to order n
            for i in range(len(tokens)):
                for j in range(1, min(self.n + 1, len(tokens) - i + 1)):
                    ngram = tuple(tokens[i : i + j])
                    self._add_ngram(ngram)

    def get_count(self, ngram):
        """Get count of an n-gram."""
        if isinstance(ngram, str):
            ngram = tuple(ngram.split())

        node = self.root
        for token in ngram:
            if token not in node.children:
                return 0
            node = node.children[token]
        return node.count


# Build trie model
trie_model = TrieNgramModel(n=3)
trie_model.train(training_corpus)
Out[32]:
Console
Trie-based N-gram Counts:
----------------------------------------
  ('the',): 60
  ('the', 'cat'): 12
  ('the', 'cat', 'sat'): 1
  ('the', 'dog'): 10
  ('cat', 'sat', 'on'): 1

The trie efficiently retrieves counts for any prefix. Notice how "the" has a high count because it appears in every sentence, while longer n-grams like "the cat sat" are less frequent. Tries share prefixes between n-grams, reducing memory for models with many overlapping contexts. They also enable efficient prefix queries like "What words can follow 'the cat'?"
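A prefix query like that could be written as a small helper on top of the `TrieNgramModel` above. This is a sketch: `continuations` is illustrative and not part of the class as defined.

```python
def continuations(trie_model, prefix):
    """Return {next_word: count} for every word observed right after the prefix."""
    node = trie_model.root
    for token in prefix:
        if token not in node.children:
            return {}  # prefix never observed
        node = node.children[token]
    return {word: child.count for word, child in node.children.items()}


# What words can follow "the cat" in the training corpus?
print(continuations(trie_model, ("the", "cat")))
# e.g. {'sat': 1, 'ate': 1, '</s>': 2, 'ran': 1, 'watched': 1}
```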

Visualizing N-gram Distributions

How do n-gram probabilities distribute across possible continuations? The following tables show how different contexts constrain the next word:

Bigram distribution after "the". Probability spreads across many possible continuations.

| Next Word | P(word \| "the") |
| --- | --- |
| cat | 0.30 |
| dog | 0.25 |
| bird | 0.15 |
| mat | 0.10 |
| fish | 0.05 |
| food | 0.05 |
| tree | 0.05 |
| branch | 0.05 |

Bigram distribution after "sat". Context strongly constrains the next word; "on" is the only observed continuation.

| Next Word | P(word \| "sat") |
| --- | --- |
| on | 1.00 |

Trigram distribution after "the cat". Longer context yields a peaked distribution over the observed word continuations (end-of-sentence token excluded).

| Next Word | P(word \| "the cat") |
| --- | --- |
| sat | 0.25 |
| ate | 0.25 |
| ran | 0.25 |
| watched | 0.25 |

The distributions reveal how context constrains possibilities. After "the," many words are plausible, but after "sat," all of the observed probability mass concentrates on "on." Longer contexts generally produce more peaked distributions, reflecting stronger constraints.
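The same kind of distribution can be read directly off the models trained earlier. The helper below is a sketch (`next_word_distribution` is illustrative, not a method of `NgramLanguageModel`); note that, unlike the tables above, it also counts the end-of-sentence token as a possible continuation:

```python
def next_word_distribution(model, context):
    """Return P(word | context) for every continuation observed in training."""
    context = tuple(context)
    dist = {
        ngram[-1]: model.probability(ngram[-1], context)
        for ngram in model.ngram_counts
        if ngram[:-1] == context
    }
    return dict(sorted(dist.items(), key=lambda kv: -kv[1]))


print(next_word_distribution(bigram_model, ("the",)))
# {'cat': 0.3, 'dog': 0.25, 'bird': 0.15, 'mat': 0.1, ...}
print(next_word_distribution(trigram_model, ("the", "cat")))
# {'</s>': 0.33..., 'sat': 0.17..., 'ate': 0.17..., 'ran': 0.17..., 'watched': 0.17...}
```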

Working with NLTK

NLTK provides built-in n-gram functionality:

In[34]:
Code
import nltk
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import MLE

# Download required data
try:
    nltk.data.find("tokenizers/punkt_tab")
except LookupError:
    nltk.download("punkt_tab", quiet=True)

# Prepare training data
train_sentences = [
    s.lower().split() for s in training_corpus.strip().split("\n") if s.strip()
]

# Create padded n-grams for training
n = 3
train_data, padded_vocab = padded_everygram_pipeline(n, train_sentences)

# Train MLE model
nltk_model = MLE(n)
nltk_model.fit(train_data, padded_vocab)
Out[35]:
Console
NLTK 3-gram model trained
Vocabulary size: 21
In[36]:
Code
# Query the NLTK model
test_contexts = [
    ["the"],
    ["the", "cat"],
    ["sat", "on"],
]

nltk_results = []
for context in test_contexts:
    # Get probability distribution
    probs = {}
    for word in nltk_model.vocab:
        try:
            p = nltk_model.score(word, context)
            if p > 0:
                probs[word] = p
        except:
            pass

    # Get top predictions
    top_words = sorted(probs.items(), key=lambda x: -x[1])[:5]
    nltk_results.append((context, top_words))
Out[37]:
Console
NLTK Model Predictions:
--------------------------------------------------

Context: ['the']
  P(cat) = 0.300
  P(dog) = 0.250
  P(bird) = 0.150
  P(mat) = 0.100
  P(fish) = 0.050

Context: ['the', 'cat']
  P(</s>) = 0.333
  P(sat) = 0.167
  P(ate) = 0.167
  P(ran) = 0.167
  P(watched) = 0.167

Context: ['sat', 'on']
  P(the) = 1.000

The NLTK predictions match what we'd expect from our training data. Given the context "the," the model predicts words like "cat," "dog," and "bird" that frequently follow "the" in the corpus. Given "sat on," the model strongly predicts "the" since that's the only continuation seen in training.

NLTK's language modeling module handles the bookkeeping of n-gram counting, vocabulary management, and probability computation. This lets you focus on higher-level tasks rather than reimplementing the fundamentals.
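NLTK's MLE model can also sample text via its generate method, mirroring the sampling loop we wrote by hand earlier (output will vary with the seed):

```python
# Sample a few tokens from the trained NLTK model, seeding the context with
# start-of-sentence padding. Generated output may include </s> tokens.
generated = nltk_model.generate(6, text_seed=["<s>", "<s>"], random_seed=42)
print(" ".join(generated))
```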

Limitations of N-gram Models

N-gram models have served NLP well, but they face fundamental limitations:

  • Limited context: The Markov assumption ignores long-range dependencies. In "The cat that the dog chased ran away," the verb "ran" agrees with "cat," but a trigram model predicting "ran" sees only the context "dog chased."
  • Data sparsity: Most possible n-grams never occur in training data. Even with billions of words, many valid word combinations remain unseen.
  • No generalization: N-gram models can't recognize that "the cat sat" and "the dog sat" share similar structure. Each n-gram is treated independently.
  • Storage requirements: Large vocabularies and high-order n-grams require substantial memory. A trigram model for a 100,000-word vocabulary could theoretically have $10^{15}$ parameters.
  • No semantic understanding: The model doesn't know that "cat" and "feline" are related, or that "bank" has multiple meanings.

The following table quantifies the data sparsity problem. As n-gram order increases, the percentage of test n-grams seen during training drops dramatically:

N-gram order vs. test coverage. Higher-order n-grams capture more context but suffer from exponentially increasing sparsity. By order 4, fewer than half of test n-grams were seen during training.

| N-gram Order | Model Type | Test Coverage | Notes |
| --- | --- | --- | --- |
| 1 | Unigram | 98% | Almost all words seen in training |
| 2 | Bigram | 85% | Most word pairs covered |
| 3 | Trigram | 62% | Significant gaps emerge |
| 4 | 4-gram | 35% | Majority unseen |
| 5 | 5-gram | 15% | Severe sparsity; most test sequences never occurred in training |

Despite these limitations, n-gram models remain useful. They're fast, interpretable, and work well for many applications. They also provide the conceptual foundation for understanding more sophisticated models.

What N-gram Models Unlocked

Before neural language models, n-grams powered critical applications:

  • Speech recognition: N-gram language models helped disambiguate acoustically similar phrases and improved recognition accuracy.
  • Machine translation: Statistical MT systems used n-gram models to score translation fluency, preferring outputs that sounded natural.
  • Spelling correction: N-gram probabilities identify which corrections produce more likely word sequences.
  • Text prediction: Early autocomplete systems used n-grams to suggest likely next words.
  • Information retrieval: N-gram statistics help identify important phrases and improve search ranking.

The concepts introduced by n-gram models remain central to modern language modeling: conditional probability, the chain rule, and the trade-off between context length and data sparsity. Transformers and large language models are sophisticated solutions to the problems n-gram models first revealed.

Key Parameters

When building n-gram language models, several parameters affect model behavior and performance.

n (n-gram order)

The order determines how many words of context the model considers. Common choices:

  • n=1 (unigram): No context, just word frequencies. Fast but ignores word order entirely.
  • n=2 (bigram): One word of context. Good balance of simplicity and performance for many tasks.
  • n=3 (trigram): Two words of context. Standard choice for production systems before neural models.
  • n=4 or higher: Rarely used due to severe data sparsity. Requires massive training corpora.

Higher values capture more context but exponentially increase the number of possible n-grams. This leads to sparse counts and zero probabilities for valid sequences.

temperature (for generation)

Controls randomness during text generation:

  • temperature < 1.0: More deterministic, favors high-probability words. Use for predictable, conservative output.
  • temperature = 1.0: Samples according to true model probabilities. Balanced creativity and coherence.
  • temperature > 1.0: More random, explores low-probability words. Use for creative, diverse output at the cost of coherence.

Values between 0.7 and 1.2 work well for most applications.

Start and end tokens

Special tokens mark sentence boundaries:

  • <s>: Marks sentence beginnings. Add n-1 copies for an n-gram model to provide context for the first words.
  • </s>: Marks sentence endings. Enables the model to learn which words typically end sentences.

Vocabulary handling

Several preprocessing choices affect vocabulary size and model coverage:

  • Case normalization: Converting to lowercase reduces vocabulary size but loses information (e.g., "Apple" vs "apple").
  • Unknown token <UNK>: Replacing rare words with a special token handles out-of-vocabulary words during inference (sketched after this list).
  • Minimum frequency threshold: Excluding words that appear fewer than k times reduces noise from typos and rare terms.
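A minimal sketch of the <UNK> replacement and frequency-threshold ideas combined; the helper name and threshold below are illustrative:

```python
from collections import Counter


def replace_rare_words(sentences, min_count=2):
    """Replace words appearing fewer than min_count times with <UNK>."""
    counts = Counter(word for sent in sentences for word in sent)
    return [
        [word if counts[word] >= min_count else "<UNK>" for word in sent]
        for sent in sentences
    ]


sents = [s.lower().split() for s in ["the cat sat", "the dog sat", "a cat ran"]]
print(replace_rare_words(sents))
# [['the', 'cat', 'sat'], ['the', '<UNK>', 'sat'], ['<UNK>', 'cat', '<UNK>']]
```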

Summary

N-gram language models assign probabilities to word sequences by decomposing joint probabilities into conditional probabilities using the chain rule. The Markov assumption limits context to the previous $n-1$ words, making estimation tractable.

Key takeaways:

  • Chain rule decomposition breaks sequence probability into a product of conditional probabilities
  • Markov assumption limits history to $n-1$ words, trading accuracy for tractability
  • Maximum likelihood estimation computes probabilities as normalized n-gram counts
  • Start and end tokens model sentence boundaries explicitly
  • Log probabilities prevent numerical underflow when multiplying many small values
  • Zero probability problem occurs when test data contains unseen n-grams
  • Text generation samples words according to conditional probabilities
  • Storage efficiency requires sparse data structures like tries

N-gram models reveal the fundamental tension in language modeling: longer context improves predictions but requires exponentially more data. This insight motivates both smoothing techniques, covered next, and the neural approaches that eventually superseded n-gram models.

