Perplexity: The Standard Metric for Evaluating Language Models

Michael Brenndoerfer · Updated March 29, 2025 · 43 min read

Learn how perplexity measures language model quality through cross-entropy and information theory. Understand the branching factor interpretation, implement perplexity for n-gram models, and discover when perplexity predicts downstream performance.

Perplexity

You've built an n-gram language model and applied smoothing techniques. Now comes the critical question: how good is your model? Perplexity provides the answer. It's the standard metric for evaluating language models, used everywhere from academic papers to production systems. Understanding perplexity shows not just how to measure model quality, but what it means for a model to "understand" language.

This chapter develops perplexity from first principles. We'll derive it from information theory, connect it to intuitive concepts like "branching factor," implement it from scratch, and explore both its power and its limitations.

The Evaluation Problem

Consider two language models trained on the same corpus. Model A assigns P(\text{"the cat sat on the mat"}) = 0.001. Model B assigns P(\text{"the cat sat on the mat"}) = 0.0001. Which model is better?

At first glance, Model A seems superior because it assigns higher probability to a grammatical sentence. But this comparison is misleading. Model A might assign high probability to everything, including nonsense like "mat the on sat cat the." Model B might be more discriminating, reserving high probability for truly likely sequences.

We need a metric that rewards models for assigning high probability to actual language while penalizing them for wasting probability mass on unlikely sequences. Perplexity does exactly this by measuring how well a model predicts held-out test data.

Held-Out Evaluation

The practice of evaluating a model on data it wasn't trained on. This tests whether the model learned generalizable patterns rather than memorizing the training data. The held-out data is called the test set or evaluation set.

Cross-Entropy: The Foundation

To measure how well a language model predicts text, we need a principled way to quantify "prediction quality." This is where information theory provides the perfect framework. The key insight is that prediction and compression are two sides of the same coin: if you can predict what comes next, you can compress it efficiently. Cross-entropy formalizes this connection, and perplexity translates it into an intuitive scale.

Let's build up to perplexity step by step, starting with the most fundamental question: how do we measure uncertainty?

Entropy: Quantifying Uncertainty

Imagine you're playing a word-guessing game. Your friend thinks of a word, and you have to guess it using only yes/no questions. How many questions do you need?

The answer depends on how predictable the word is. If your friend always picks from {"cat", "dog"} with equal probability, you need exactly one question ("Is it cat?"). But if they pick from a thousand equally likely words, you need about 10 questions (since 2^{10} = 1024). This number of questions is exactly what entropy measures.

For a probability distribution P over a vocabulary V, entropy is:

H(P) = -\sum_{w \in V} P(w) \log_2 P(w)

where:

  • H(P): entropy of distribution P (measured in bits when using \log_2)
  • P(w): probability of word w
  • The sum runs over all words in the vocabulary

The formula might look abstract, but it captures a clear intuition. The term -\log_2 P(w) is called the surprisal or information content of word w. It measures how many bits of information you gain by learning that w occurred:

  • A very likely word (say P(w) = 0.5) has surprisal -\log_2(0.5) = 1 bit
  • An unlikely word (say P(w) = 0.0625) has surprisal -\log_2(0.0625) = 4 bits
  • A certain word (P(w) = 1) has surprisal -\log_2(1) = 0 bits (no surprise!)

Entropy is the expected surprisal: we weight each word's surprisal by how often it occurs, P(w), and sum up. Rare words contribute more bits because they're more surprising, while common words contribute fewer.

Let's see this in action with a simple two-word vocabulary:

In[2]:
Code
import numpy as np


def entropy(probs):
    """Calculate entropy of a probability distribution."""
    # Filter out zero probabilities to avoid log(0)
    probs = np.array([p for p in probs if p > 0])
    return -np.sum(probs * np.log2(probs))


# Different probability distributions over two words
distributions = [
    ([0.5, 0.5], "Equal probability"),
    ([0.9, 0.1], "Skewed (90/10)"),
    ([0.99, 0.01], "Very skewed (99/1)"),
    ([1.0, 0.0], "Certain"),
]
Out[3]:
Console
Entropy for different distributions over {cat, dog}:
--------------------------------------------------
Equal probability         H = 1.0000 bits
Skewed (90/10)            H = 0.4690 bits
Very skewed (99/1)        H = 0.0808 bits
Certain                   H = -0.0000 bits

Notice the pattern: when both words are equally likely, entropy is maximal at 1 bit. You need one yes/no question. As one word dominates, entropy drops because the outcome becomes more predictable. When the outcome is certain, entropy is zero: no questions needed, no uncertainty remains.

Out[4]:
Visualization
Line plot showing entropy on y-axis versus probability of first outcome on x-axis, forming an inverted U shape with peak at 0.5.
Entropy as a function of probability distribution skew. When one outcome dominates (probability near 0 or 1), entropy approaches zero because the result is predictable. Maximum entropy occurs at equal probabilities (p=0.5), where uncertainty is highest.

This gives us our first key insight: entropy measures the inherent unpredictability of a distribution. A good language model should have low entropy on real text because it can predict what comes next.

Cross-Entropy: Measuring Model Mismatch

Entropy tells us about the true distribution, but we don't have access to that. We only have our model's predictions. This is where cross-entropy enters the picture.

Cross-entropy asks: "If the true distribution is P, but we use model Q to make predictions, how many bits do we need on average?" The answer is always at least as many as entropy, and usually more, because our model isn't perfect.

H(P, Q) = -\sum_{w \in V} P(w) \log_2 Q(w)

where:

  • H(P, Q): cross-entropy of P relative to Q (measured in bits)
  • P(w): true probability of word w (from the data)
  • Q(w): model's predicted probability of word w
  • V: the vocabulary (set of all possible words)
  • The sum computes the expected number of bits, weighting each word's encoding cost -\log_2 Q(w) by its true frequency P(w)

The key relationship is: H(P, Q) \geq H(P). Cross-entropy is always at least as large as entropy. This inequality holds because using the "wrong" distribution Q for encoding is never better than using the true distribution P. The gap between them, called KL divergence, measures exactly how much our model's predictions differ from reality.

Think of it this way:

  • Entropy H(P): the minimum bits needed with a perfect model
  • Cross-entropy H(P, Q): the bits needed with our actual model Q
  • KL divergence H(P, Q) - H(P): the "cost" of using an imperfect model

Cross-entropy decomposition showing how model quality affects the KL divergence component. The entropy (2.5 bits) represents the inherent uncertainty in the data, the minimum bits needed with a perfect model. The KL divergence measures the additional cost of using an imperfect model. Better models have smaller KL divergence.

Model Quality    | Entropy H(P) | KL Divergence | Cross-Entropy
Perfect Model    | 2.5 bits     | 0.0 bits      | 2.5 bits
Good Model       | 2.5 bits     | 0.5 bits      | 3.0 bits
Average Model    | 2.5 bits     | 1.5 bits      | 4.0 bits
Poor Model       | 2.5 bits     | 3.0 bits      | 5.5 bits
Random Guessing  | 2.5 bits     | 5.0 bits      | 7.5 bits
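
To make the decomposition concrete, here is a small numerical check. The two distributions below are made up for illustration, not taken from this chapter's data; the point is that cross-entropy always equals entropy plus the KL divergence.

import numpy as np

# Hypothetical "true" distribution P and model distribution Q
# over a four-word vocabulary (illustrative numbers only).
P = np.array([0.50, 0.25, 0.15, 0.10])
Q = np.array([0.40, 0.30, 0.20, 0.10])

h_p = -np.sum(P * np.log2(P))        # H(P): bits with a perfect model
h_pq = -np.sum(P * np.log2(Q))       # H(P, Q): bits with model Q
kl_pq = np.sum(P * np.log2(P / Q))   # D_KL(P || Q): the gap between them

print(f"H(P)       = {h_p:.4f} bits")
print(f"H(P, Q)    = {h_pq:.4f} bits")
print(f"KL(P || Q) = {kl_pq:.4f} bits")
print(f"H(P) + KL  = {h_p + kl_pq:.4f} bits  (equals H(P, Q))")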

In practice, we don't know the true distribution P. Instead, we approximate it using the empirical distribution from our test data. If word w appears C(w) times in a test corpus of N words, we estimate:

\hat{P}(w) = \frac{C(w)}{N}

where:

  • \hat{P}(w): empirical probability of word w (the "hat" notation indicates an estimate)
  • C(w): count of how many times word w appears in the test corpus
  • N: total number of words in the test corpus

When we substitute this empirical distribution into the cross-entropy formula, something elegant happens. Since each word w_i in the corpus contributes equally to the empirical distribution, the weighted sum simplifies to a simple average:

H(\hat{P}, Q) = -\frac{1}{N} \sum_{i=1}^{N} \log_2 Q(w_i)

where:

  • H(\hat{P}, Q): cross-entropy between the empirical distribution \hat{P} and model Q
  • N: total number of words in the test corpus
  • w_i: the i-th word in the test corpus
  • Q(w_i): the probability our model assigns to word w_i

In words: cross-entropy is the average negative log probability that our model assigns to each word in the test corpus. Lower is better because it means the model assigned higher probabilities to the words that actually appeared.

Let's compute this for a simple unigram model:

In[5]:
Code
def cross_entropy(test_tokens, model_probs):
    """
    Calculate cross-entropy of a model on test data.

    Args:
        test_tokens: List of tokens from test corpus
        model_probs: Function that returns P(token | context)
    """
    log_prob_sum = 0.0
    count = 0

    for i, token in enumerate(test_tokens):
        prob = model_probs(token, i, test_tokens)
        if prob > 0:
            log_prob_sum += np.log2(prob)
            count += 1
        else:
            # Handle zero probability (model assigns impossible)
            return float("inf")

    return -log_prob_sum / count


# Simple unigram model for demonstration
train_corpus = """
the cat sat on the mat
the dog sat on the rug
the cat chased the dog
the dog chased the cat
""".lower().split()

from collections import Counter

word_counts = Counter(train_corpus)
total_words = len(train_corpus)


def unigram_prob(token, position, tokens):
    """Return unigram probability P(token)."""
    count = word_counts.get(token, 0)
    # Add-1 smoothing to handle unseen words
    vocab_size = len(word_counts) + 1  # +1 for unknown
    return (count + 1) / (total_words + vocab_size)


# Test on held-out data
test_corpus = "the cat sat on the rug".split()
Out[6]:
Console
Test corpus: the cat sat on the rug
Cross-entropy: 2.8692 bits per word

A cross-entropy around 2.9 bits per word indicates moderate predictability. For comparison, a uniform distribution over our vocabulary of 8 unique words would give \log_2(8) = 3 bits. Our unigram model does slightly better because it learned that some words (like "the") are more common than others.

The cross-entropy tells us how many bits on average our model needs to encode each word. This number has a concrete interpretation: if we used our model's probabilities to design a compression scheme, each word would require about this many bits on average.

From Cross-Entropy to Perplexity

Cross-entropy is useful, but "2.9 bits per word" doesn't immediately convey how good a model is. Is that good? Bad? It depends on the vocabulary size and the inherent predictability of the text.

Perplexity solves this by converting bits back to a count, specifically the number of equally likely choices. The transformation is simple:

\text{PP}(W) = 2^{H(P, Q)}

where:

  • \text{PP}(W): perplexity of the model on word sequence W
  • H(P, Q): cross-entropy between the true distribution P and model distribution Q

Why raise 2 to the power of cross-entropy? Because entropy measures bits, and each bit doubles the number of possibilities. A cross-entropy of 3 bits means 2^3 = 8 equally likely choices; 10 bits means 2^{10} = 1024 choices.

Expanding this using our practical cross-entropy formula:

\text{PP}(W) = 2^{-\frac{1}{N} \sum_{i=1}^{N} \log_2 Q(w_i)}

where:

  • N: total number of words in the test sequence
  • w_i: the i-th word in the sequence
  • Q(w_i): model's probability for word w_i (given its context)

Using properties of logarithms, we can derive an equivalent product form. The derivation proceeds as follows:

Step 1: Start with the exponential form and use the property that 2^{\log_2 x} = x:

\text{PP}(W) = 2^{-\frac{1}{N} \sum_{i=1}^{N} \log_2 Q(w_i)}

Step 2: Move the -\frac{1}{N} inside using 2^{ab} = (2^a)^b:

\text{PP}(W) = \left( 2^{\sum_{i=1}^{N} \log_2 Q(w_i)} \right)^{-\frac{1}{N}}

Step 3: Apply 2^{\sum_i \log_2 x_i} = \prod_i x_i (a sum of logs becomes the log of a product):

\text{PP}(W) = \left( \prod_{i=1}^{N} Q(w_i) \right)^{-\frac{1}{N}}

This shows that perplexity is the geometric mean of the inverse probabilities. Equivalently, we can write:

\text{PP}(W) = \frac{1}{\sqrt[N]{\prod_{i=1}^{N} Q(w_i)}}

where \sqrt[N]{\cdot} denotes the N-th root.

The geometric mean matters here for two reasons. First, unlike the arithmetic mean, it's sensitive to very small probabilities. A single word with near-zero probability will dramatically increase perplexity: if any Q(w_i) \approx 0, then \prod_i Q(w_i) \approx 0, making \text{PP} \rightarrow \infty. This is exactly what we want, since a good language model shouldn't assign tiny probabilities to words that actually occur. Second, the geometric mean behaves proportionally at every scale: doubling all probabilities halves perplexity, regardless of the base probability level.
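
To see the first point in numbers, the following sketch compares two hypothetical sequences of per-word probabilities with the same arithmetic mean; the one containing a single near-miss (a probability of 0.001) ends up with a much higher perplexity.

import numpy as np

def perplexity_from_probs(probs):
    """Perplexity as the geometric mean of the inverse probabilities."""
    log_probs = np.log2(probs)
    return 2 ** (-np.mean(log_probs))

# Two made-up sequences of per-word probabilities with the same
# arithmetic mean (0.1), chosen purely for illustration.
steady = [0.1, 0.1, 0.1, 0.1]
one_bad_miss = [0.133, 0.133, 0.133, 0.001]

print(f"Arithmetic means: {np.mean(steady):.3f} vs {np.mean(one_bad_miss):.3f}")
print(f"Perplexity (steady):       {perplexity_from_probs(steady):.2f}")
print(f"Perplexity (one bad miss): {perplexity_from_probs(one_bad_miss):.2f}")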

Perplexity

The perplexity of a language model on a test set is the inverse probability of the test set, normalized by the number of words. It can be interpreted as the weighted average number of choices the model faces at each step. Lower perplexity means the model is less "perplexed" by the test data.

For language models that condition on context (like n-gram models), we use conditional probabilities. The formula above assumed each word probability Q(w_i) was independent, but real language models predict each word based on its history. The perplexity formula becomes:

\text{PP}(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i | w_1, \ldots, w_{i-1})}}

where:

  • W = w_1, w_2, \ldots, w_N: the test sequence of N words
  • P(w_i | w_1, \ldots, w_{i-1}): the probability of word w_i given all preceding words (the model's prediction)
  • \sqrt[N]{\cdot}: the N-th root (geometric mean)

In practice, we work with log probabilities to avoid numerical underflow. When multiplying many small probabilities (each less than 1), the product quickly approaches zero, causing floating-point arithmetic to lose precision. By working in log space, we convert products to sums:

\text{PP}(W) = 2^{-\frac{1}{N} \sum_{i=1}^{N} \log_2 P(w_i | w_1, \ldots, w_{i-1})}

This is numerically stable because we're adding log probabilities (negative numbers around -3 to -15 for typical words) rather than multiplying tiny probabilities (numbers like 10^{-5} to 10^{-20}).

In[7]:
Code
def perplexity(test_tokens, model_log_prob):
    """
    Calculate perplexity of a model on test data.

    Args:
        test_tokens: List of tokens from test corpus
        model_log_prob: Function returning log2 P(token | context)
    """
    log_prob_sum = 0.0
    count = 0

    for i, token in enumerate(test_tokens):
        log_p = model_log_prob(token, i, test_tokens)
        if log_p == float("-inf"):
            return float("inf")  # Zero probability encountered
        log_prob_sum += log_p
        count += 1

    cross_ent = -log_prob_sum / count
    return 2**cross_ent


def unigram_log_prob(token, position, tokens):
    """Return log2 P(token) for unigram model with smoothing."""
    count = word_counts.get(token, 0)
    vocab_size = len(word_counts) + 1
    prob = (count + 1) / (total_words + vocab_size)
    return np.log2(prob)
Out[8]:
Console
Test corpus: the cat sat on the rug
Perplexity: 7.31
Effective branching factor: ~7 choices per word

Now the interpretation is immediate: a perplexity around 7 means the model faces roughly the same uncertainty as choosing uniformly among 7 equally likely words at each position. For a simple unigram model on a small corpus, this is reasonable. The model has learned that some words (like "the") are much more common than others.

The Branching Factor Interpretation

The most useful insight about perplexity comes from thinking of it as the effective branching factor. Imagine language as a tree where each node represents a word, and branches represent possible next words. At each step, the model must choose which branch to follow.

If all branches were equally likely, the number of branches would be the vocabulary size, potentially tens of thousands. But language isn't uniform. After "the," words like "cat" and "dog" are much more likely than words like "xylophone" or "quasar." A good model exploits this structure to effectively prune the tree.

Perplexity tells you the effective width of this tree. If perplexity is 100, the model is as uncertain as if it were choosing uniformly among 100 words at each step, even though the vocabulary might contain 50,000 words. The model has effectively eliminated 99.8% of possibilities based on context.
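
A quick back-of-the-envelope check makes this concrete. The vocabulary size and perplexity below are illustrative assumptions, not measurements from this chapter:

import math

vocab_size = 50_000   # hypothetical vocabulary
model_pp = 100        # hypothetical model perplexity

# Uniform guessing over the full vocabulary vs. the model's effective choices
bits_uniform = math.log2(vocab_size)   # bits per word for a uniform model
bits_model = math.log2(model_pp)       # bits per word implied by the perplexity

fraction_remaining = model_pp / vocab_size
print(f"Uniform model: {bits_uniform:.1f} bits/word ({vocab_size:,} choices)")
print(f"Trained model: {bits_model:.1f} bits/word (~{model_pp} effective choices)")
print(f"Possibilities effectively eliminated: {1 - fraction_remaining:.1%}")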

Out[9]:
Visualization
PP = 2: The model chooses among 2 equally likely options at each step.
PP = 4: The model chooses among 4 equally likely options at each step.
PP = 8: The model chooses among 8 equally likely options at each step.

This interpretation explains why perplexity is so useful as an evaluation metric. A vocabulary of 50,000 words could theoretically produce perplexity of 50,000 (uniform distribution). A good language model achieves perplexity of 50-200 on typical text, meaning it has effectively reduced the uncertainty by orders of magnitude, from tens of thousands of possibilities to just dozens or hundreds.

Worked Example: Tracing Through the Calculation

Let's make this concrete with a step-by-step calculation. Consider evaluating a bigram model on the sentence "the cat sat." We'll trace through exactly how perplexity emerges from the individual predictions.

Suppose our bigram model gives these probabilities:

Bigram model probabilities for the sentence "the cat sat." Each row shows the probability of the next word given its immediate predecessor.

Prediction    | Probability | Interpretation
P(the | <s>)  | 0.20        | "the" is a common sentence starter
P(cat | the)  | 0.10        | "cat" is one of many words following "the"
P(sat | cat)  | 0.05        | "sat" is a plausible but not dominant verb
P(</s> | sat) | 0.10        | sentences often end after simple verbs

The probability of the entire sequence is the product of the individual conditional probabilities (using the chain rule of probability):

P(W) = P(\text{the}|\text{<s>}) \times P(\text{cat}|\text{the}) \times P(\text{sat}|\text{cat}) \times P(\text{</s>}|\text{sat})

Substituting our values:

P(W) = 0.2 \times 0.1 \times 0.05 \times 0.1 = 0.0001

This tiny number (10^{-4}) is hard to interpret directly. How "good" is a probability of 0.0001? The answer depends on the sequence length: a longer sequence naturally has lower probability because it's more specific.

Perplexity normalizes by the sequence length, computing the geometric mean of inverse probabilities. With N = 4 tokens:

Step 1: Compute the inverse of the total probability:

\frac{1}{P(W)} = \frac{1}{0.0001} = 10{,}000

Step 2: Take the N-th root to normalize by sequence length:

\text{PP} = \left( 10{,}000 \right)^{\frac{1}{4}} = 10{,}000^{0.25} = 10

The model faces an average of 10 equally likely choices at each step. This matches our intuition from the table: some transitions are more predictable (like "the" starting a sentence with probability 0.2, equivalent to choosing among 1/0.2 = 5 options) while others are harder (like predicting "sat" after "cat" with probability 0.05, equivalent to choosing among 1/0.05 = 20 options). The geometric mean balances these out to 10.

Per-word contribution to perplexity for the sequence "the cat sat." Each word contributes differently based on how predictable it is in context. Words with lower probability (like "sat" after "cat") contribute more to the overall perplexity. The geometric mean of the inverse probabilities gives the final perplexity of 10.

Word           | Probability P | Inverse 1/P | Contribution
"the"          | 0.20          | 5           | Relatively predictable
"cat"          | 0.10          | 10          | Moderately uncertain
"sat"          | 0.05          | 20          | Most uncertain
"</s>"         | 0.10          | 10          | Moderately uncertain
Geometric Mean | -             | 10.0        | = Perplexity
In[10]:
Code
# Verify the calculation
probs = [0.2, 0.1, 0.05, 0.1]
n = len(probs)

# Method 1: Direct calculation
product = np.prod(probs)
pp_direct = (1 / product) ** (1 / n)

# Method 2: Log probability
log_probs = [np.log2(p) for p in probs]
avg_log_prob = sum(log_probs) / n
pp_log = 2 ** (-avg_log_prob)
Out[11]:
Console
Perplexity calculation for 'the cat sat':
--------------------------------------------------
Token probabilities: [0.2, 0.1, 0.05, 0.1]
Product of probabilities: 0.000100
Number of tokens: 4

Direct calculation: PP = (1/0.000100)^(1/4) = 10.00
Log probability method: PP = 2^(3.3219) = 10.00

Both methods give identical results, confirming our formulas are consistent. The log probability method is preferred in practice because it avoids numerical underflow. When computing products of many small probabilities, the result can become so small that floating-point arithmetic loses precision. Summing log probabilities keeps the numbers in a manageable range.

Bits Per Character and Bits Per Word

Perplexity can be expressed in different units depending on what you're predicting. Since perplexity and cross-entropy are related by \text{PP} = 2^{\text{cross-entropy}}, we can also report results directly as cross-entropy (bits).

Bits per word (BPW): This is the cross-entropy when predicting words. Since \text{PP} = 2^{\text{BPW}}, we have \text{BPW} = \log_2(\text{PP}). A perplexity of 100 corresponds to \log_2(100) = 6.64 bits per word.

Bits per character (BPC): When predicting characters instead of words, we use bits per character. Character-level models typically achieve 1-2 BPC on English text (corresponding to perplexity of 2-4 per character).

The relationship between word-level and character-level metrics depends on average word length. If the average word has L characters, a rough approximation is:

\text{BPC} \approx \frac{\text{BPW}}{L + 1}

where:

  • \text{BPC}: bits per character (cross-entropy at character level)
  • \text{BPW}: bits per word (cross-entropy at word level)
  • L: average word length in characters
  • The +1 accounts for the space between words (which is also a character to predict)

This approximation assumes the model's uncertainty is distributed roughly uniformly across characters within words. In practice, character-level models often achieve better compression than this formula suggests because they can exploit subword regularities (like common prefixes and suffixes).
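
As a rough numerical illustration (the bits-per-word figure and the average word length of 4.7 characters are assumptions, not measurements from this chapter), the conversion looks like this:

import math

bpw = 6.64          # hypothetical word-level cross-entropy (perplexity of 100)
avg_word_len = 4.7  # rough average English word length, in characters

# Spread the word-level bits over the word's characters plus one space
bpc_estimate = bpw / (avg_word_len + 1)
char_perplexity = 2 ** bpc_estimate

print(f"Bits per word:           {bpw:.2f}")
print(f"Estimated bits per char: {bpc_estimate:.2f}")
print(f"Implied char perplexity: {char_perplexity:.2f}")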

In[12]:
Code
def perplexity_to_bits(pp):
    """Convert perplexity to bits (cross-entropy)."""
    return np.log2(pp)


def bits_to_perplexity(bits):
    """Convert bits (cross-entropy) to perplexity."""
    return 2**bits


# Example conversions
perplexities = [10, 50, 100, 500, 1000]
Out[13]:
Console
Perplexity to Bits Conversion:
----------------------------------------
  Perplexity   Bits per word
----------------------------------------
          10            3.32
          50            5.64
         100            6.64
         500            8.97
        1000            9.97
Out[14]:
Visualization
Line plot showing the exponential relationship between bits per word and perplexity.
Relationship between perplexity and bits per word. Because bits grow logarithmically with perplexity, halving perplexity always saves exactly one bit per word: a reduction from 100 to 50 perplexity saves 1 bit, just as a reduction from 1000 to 500 does.

The logarithmic scale shows an important pattern: improvements in perplexity become harder as models get better. Reducing perplexity from 1000 to 500 saves the same number of bits as reducing from 100 to 50, but the latter is typically much harder to achieve.

Implementing Perplexity for N-gram Models

Let's build a complete perplexity evaluation system for n-gram language models.

In[15]:
Code
from collections import Counter
import math


class NgramModel:
    """N-gram language model with add-k smoothing."""

    def __init__(self, n=2, k=0.01):
        self.n = n
        self.k = k
        self.ngram_counts = Counter()
        self.context_counts = Counter()
        self.vocab = set()

    def train(self, sentences):
        """Train on a list of tokenized sentences."""
        for tokens in sentences:
            # Add start and end markers
            padded = ["<s>"] * (self.n - 1) + tokens + ["</s>"]
            self.vocab.update(padded)

            # Count n-grams and contexts
            for i in range(len(padded) - self.n + 1):
                ngram = tuple(padded[i : i + self.n])
                context = tuple(padded[i : i + self.n - 1])
                self.ngram_counts[ngram] += 1
                self.context_counts[context] += 1

    def log_prob(self, word, context):
        """Return log2 P(word | context) with add-k smoothing."""
        if isinstance(context, list):
            context = tuple(context)

        ngram = context + (word,)
        ngram_count = self.ngram_counts.get(ngram, 0)
        context_count = self.context_counts.get(context, 0)

        V = len(self.vocab)

        # Add-k smoothing
        prob = (ngram_count + self.k) / (context_count + self.k * V)

        return math.log2(prob)

    def sentence_log_prob(self, tokens):
        """Return total log probability of a sentence."""
        padded = ["<s>"] * (self.n - 1) + tokens + ["</s>"]

        total_log_prob = 0.0
        for i in range(self.n - 1, len(padded)):
            context = tuple(padded[i - self.n + 1 : i])
            word = padded[i]
            total_log_prob += self.log_prob(word, context)

        return total_log_prob

    def perplexity(self, test_sentences):
        """Calculate perplexity on test sentences."""
        total_log_prob = 0.0
        total_words = 0

        for tokens in test_sentences:
            padded = ["<s>"] * (self.n - 1) + tokens + ["</s>"]

            for i in range(self.n - 1, len(padded)):
                context = tuple(padded[i - self.n + 1 : i])
                word = padded[i]
                total_log_prob += self.log_prob(word, context)
                total_words += 1

        avg_log_prob = total_log_prob / total_words
        return 2 ** (-avg_log_prob)


# Prepare training data
train_text = """
the cat sat on the mat
the dog sat on the rug  
the cat chased the dog
the dog chased the cat
the bird flew over the tree
the cat watched the bird
the dog watched the cat
a cat sat on a mat
a dog sat on a rug
"""

train_sentences = [
    line.lower().split()
    for line in train_text.strip().split("\n")
    if line.strip()
]

# Train models of different orders
models = {}
for n in [1, 2, 3]:
    model = NgramModel(n=n, k=0.01)
    model.train(train_sentences)
    models[n] = model
Out[16]:
Console
Training corpus: 9 sentences
Vocabulary size: 16 tokens

1-gram model:
  Unique 1-grams: 15
  Unique contexts: 1
2-gram model:
  Unique 2-grams: 32
  Unique contexts: 15
3-gram model:
  Unique 3-grams: 43
  Unique contexts: 27

With 9 training sentences, we have a small but workable corpus. The vocabulary of 16 tokens is typical for such toy examples. Notice how higher-order models have more unique n-grams: the trigram model distinguishes more context patterns than the bigram model.

Now let's evaluate these models on held-out test data.

In[17]:
Code
# Test sentences (some seen, some novel)
test_sentences = [
    "the cat sat on the mat".split(),  # Seen in training
    "the dog chased the bird".split(),  # Partially novel
    "a bird flew over the mat".split(),  # More novel combinations
    "the cat and the dog played".split(),  # Novel structure
]

# Calculate perplexity for each model
results = []
for n, model in models.items():
    pp = model.perplexity(test_sentences)
    results.append((n, pp))
Out[18]:
Console
Perplexity on Test Set:
----------------------------------------
       Model      Perplexity    Bits/Word
----------------------------------------
1-gram                 17.72         4.15
2-gram                  5.09         2.35
3-gram                  6.34         2.66

The results reveal a common pattern: higher-order models achieve lower perplexity when they have enough training data. The bigram model outperforms the unigram model because it captures local word dependencies (like "the cat" being more likely than "the xylophone"). The trigram model does not improve further here: with only nine training sentences, most trigram contexts appear once or not at all, so it suffers from data sparsity.

Per-Sentence Perplexity Analysis

Aggregate perplexity hides important details. Let's examine how perplexity varies across individual sentences.

In[19]:
Code
def sentence_perplexity(model, tokens):
    """Calculate perplexity for a single sentence."""
    padded = ["<s>"] * (model.n - 1) + tokens + ["</s>"]

    total_log_prob = 0.0
    num_tokens = 0

    for i in range(model.n - 1, len(padded)):
        context = tuple(padded[i - model.n + 1 : i])
        word = padded[i]
        total_log_prob += model.log_prob(word, context)
        num_tokens += 1

    avg_log_prob = total_log_prob / num_tokens
    return 2 ** (-avg_log_prob)


# Analyze each test sentence with bigram model
bigram_model = models[2]
sentence_analysis = []

for tokens in test_sentences:
    pp = sentence_perplexity(bigram_model, tokens)
    sentence_analysis.append((" ".join(tokens), pp))
Out[20]:
Console
Per-Sentence Perplexity (Bigram Model):
------------------------------------------------------------
PP =   2.32  "the cat sat on the mat"
PP =   2.69  "the dog chased the bird"
PP =   5.02  "a bird flew over the mat"
PP =  19.47  "the cat and the dog played"

Sentences that closely match training patterns have lower perplexity. Novel combinations increase perplexity because the model is less certain about them. The output above shows this clearly: sentences with familiar bigram patterns (like "the cat sat on the mat") achieve lower perplexity than those with novel word combinations.

Held-Out Evaluation Methodology

Proper evaluation requires careful data splitting. The standard approach uses three sets:

  1. Training set: Used to estimate model parameters (n-gram counts)
  2. Development set (dev set): Used to tune hyperparameters (like smoothing constants)
  3. Test set: Used only for final evaluation, never touched during development
In[21]:
Code
def split_data(sentences, train_ratio=0.7, dev_ratio=0.15, seed=42):
    """Split data into train/dev/test sets."""
    import random

    random.seed(seed)

    shuffled = sentences.copy()
    random.shuffle(shuffled)

    n = len(shuffled)
    train_end = int(n * train_ratio)
    dev_end = int(n * (train_ratio + dev_ratio))

    return {
        "train": shuffled[:train_end],
        "dev": shuffled[train_end:dev_end],
        "test": shuffled[dev_end:],
    }


# Create a larger corpus for demonstration
larger_corpus = """
the quick brown fox jumps over the lazy dog
a quick brown dog runs through the green field
the lazy cat sleeps on the warm mat
a small bird sings in the tall tree
the big dog chases the small cat
a brown fox hides behind the old tree
the green field stretches to the blue horizon
a lazy dog lies in the warm sun
the tall tree provides cool shade
a quick cat catches the small mouse
the warm sun shines on the green grass
a big bird flies over the blue lake
the old tree stands in the open field
a small mouse hides in the tall grass
the blue lake reflects the clear sky
""".strip().split("\n")

corpus_sentences = [line.lower().split() for line in larger_corpus]
splits = split_data(corpus_sentences)
Out[22]:
Console
Data Split:
----------------------------------------
   train: 10 sentences
     dev: 2 sentences
    test: 3 sentences

With this split, most data goes to training, a smaller portion to development for tuning, and the rest is held out for final evaluation. The test set remains untouched until we've finalized all model choices.

Now let's use the development set to tune the smoothing parameter.

In[23]:
Code
def tune_smoothing(train_sentences, dev_sentences, n=2):
    """Find optimal smoothing parameter k using dev set."""
    k_values = [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0]
    results = []

    for k in k_values:
        model = NgramModel(n=n, k=k)
        model.train(train_sentences)
        pp = model.perplexity(dev_sentences)
        results.append((k, pp))

    return results


tuning_results = tune_smoothing(splits["train"], splits["dev"], n=2)
best_k, best_pp = min(tuning_results, key=lambda x: x[1])
Out[24]:
Console
Smoothing Parameter Tuning (Bigram Model):
----------------------------------------
         k  Dev Perplexity
----------------------------------------
     0.001          111.06
     0.005           59.89
     0.010           47.33
     0.050           32.07
     0.100           29.75 <-- best
     0.500           30.79
     1.000           33.06

The optimal smoothing parameter balances two competing effects. Too little smoothing (small k) assigns very low probabilities to unseen n-grams, causing high perplexity when the test set contains novel combinations. Too much smoothing (large k) flattens the probability distribution, making all words nearly equally likely regardless of context.

Out[25]:
Visualization
Line plot showing perplexity on development set as a function of smoothing parameter k, with a U-shaped curve.
Tuning the smoothing parameter k on held-out development data. Too little smoothing (small k) causes high perplexity due to very low probabilities for unseen n-grams. Too much smoothing (large k) flattens the distribution excessively. The optimal k balances these effects.

Finally, we evaluate on the test set using the tuned parameter.

In[26]:
Code
# Train final model with best k
final_model = NgramModel(n=2, k=best_k)
final_model.train(splits["train"])

# Evaluate on test set
test_pp = final_model.perplexity(splits["test"])
Out[27]:
Console
Final Evaluation:
----------------------------------------
Best smoothing parameter: k = 0.1
Development perplexity: 29.75
Test perplexity: 17.58
Out[28]:
Console

Test perplexity is 40.9% lower than development, suggesting the test set may be easier or more similar to training data.

The test perplexity may differ from development perplexity because the test set contains different sentences. If test perplexity is much higher, it could indicate overfitting to the development set during tuning.

Comparing Models with Perplexity

Perplexity enables fair comparison between different models, but several caveats apply. For perplexity to be meaningful across models, certain conditions must be met: identical vocabulary, identical test set, and consistent tokenization. Violating these conditions leads to misleading comparisons.

Same Vocabulary Requirement

Models must use the same vocabulary for perplexity to be comparable. A model with a larger vocabulary faces a harder prediction problem because it has more choices at each step.

In[29]:
Code
# Demonstrate vocabulary effect
def create_model_with_vocab_limit(sentences, n=2, k=0.01, max_vocab=None):
    """Create model with optional vocabulary limit."""
    model = NgramModel(n=n, k=k)

    if max_vocab:
        # Count word frequencies
        word_freq = Counter()
        for tokens in sentences:
            word_freq.update(tokens)

        # Keep only top words
        top_words = set(w for w, _ in word_freq.most_common(max_vocab))

        # Replace rare words with <UNK>
        filtered_sentences = []
        for tokens in sentences:
            filtered = [w if w in top_words else "<unk>" for w in tokens]
            filtered_sentences.append(filtered)

        model.train(filtered_sentences)
    else:
        model.train(sentences)

    return model


# Compare models with different vocabulary sizes
# Use sizes smaller than full vocab to show clear progression
vocab_sizes = [5, 10, 15, 20, 30, None]  # None means full vocabulary
vocab_comparison = []

for max_v in vocab_sizes:
    model = create_model_with_vocab_limit(
        splits["train"], n=2, k=0.01, max_vocab=max_v
    )

    # Need to process test set with same vocabulary
    if max_v:
        word_freq = Counter()
        for tokens in splits["train"]:
            word_freq.update(tokens)
        top_words = set(w for w, _ in word_freq.most_common(max_v))
        test_filtered = [
            [w if w in top_words else "<unk>" for w in tokens]
            for tokens in splits["test"]
        ]
    else:
        test_filtered = splits["test"]

    pp = model.perplexity(test_filtered)
    vocab_comparison.append((max_v if max_v else len(model.vocab), pp))
Out[30]:
Console
Effect of Vocabulary Size on Perplexity:
----------------------------------------
  Vocab Size   Perplexity
----------------------------------------
           5         2.19
          10         2.74
          15         3.64
          20         6.89
          30         7.55
          43        17.89

Smaller vocabularies generally yield lower perplexity because the model has fewer options to choose from at each step. This illustrates why vocabulary size must be controlled when comparing models: a model with a 10-word vocabulary will always outperform one with 10,000 words on raw perplexity, even if the larger-vocabulary model is objectively better.

Out[31]:
Visualization
Line plot showing perplexity increasing as vocabulary size grows, demonstrating the relationship between vocabulary size and model perplexity.
Effect of vocabulary size on perplexity. Smaller vocabularies yield lower perplexity because the model has fewer choices at each step. This demonstrates why vocabulary size must be controlled when comparing models. Larger vocabulary means a harder prediction task.

Same Test Set Requirement

Perplexity scores are only comparable when computed on the same test set. Different test sets may have different inherent difficulty levels.

Statistical Significance

Small differences in perplexity may not be meaningful. Always consider whether improvements are statistically significant, especially when comparing on small test sets.
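
One common way to check significance is a paired bootstrap over test sentences: resample the sentences with replacement, recompute both models' perplexities on each resample, and see how often the ranking flips. The sketch below assumes two models exposing the perplexity interface of the NgramModel class defined earlier; the resampling logic, not any particular pair of models, is the point.

import random

def paired_bootstrap(model_a, model_b, test_sents, n_resamples=1000, seed=0):
    """Fraction of bootstrap resamples on which model_a has lower perplexity."""
    rng = random.Random(seed)
    a_wins = 0
    for _ in range(n_resamples):
        # Resample whole sentences with replacement; both models see the same sample
        sample = [rng.choice(test_sents) for _ in test_sents]
        if model_a.perplexity(sample) < model_b.perplexity(sample):
            a_wins += 1
    return a_wins / n_resamples

# Hypothetical usage with the models and test sentences trained earlier:
# win_rate = paired_bootstrap(models[2], models[3], test_sentences)
# print(f"Bigram wins on {win_rate:.1%} of resamples")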

Perplexity vs Downstream Performance

Perplexity measures how well a model predicts text, but this doesn't always translate to better performance on downstream tasks. The relationship between perplexity and task-specific metrics depends heavily on whether the task fundamentally involves text prediction. Understanding when perplexity is a useful proxy for downstream performance helps guide model selection and evaluation strategy.

When Perplexity Correlates with Task Performance

Perplexity tends to correlate well with tasks that directly involve predicting or generating text:

  • Speech recognition: Lower perplexity language models produce better transcriptions
  • Machine translation: Language model perplexity correlates with translation fluency
  • Text generation: Lower perplexity models generate more coherent text

When Perplexity Doesn't Tell the Whole Story

For other tasks, perplexity may be a poor predictor:

  • Sentiment analysis: A model might have excellent perplexity but poor sentiment predictions
  • Question answering: Predicting the next word well doesn't mean understanding content
  • Named entity recognition: Local word prediction doesn't capture entity boundaries

Illustrative correlation between perplexity and downstream task performance. Tasks that directly involve predicting or generating text (speech recognition, translation, generation) correlate strongly with perplexity. Tasks requiring semantic understanding (QA, sentiment) show weaker correlation because next-word prediction doesn't capture deeper comprehension.

Task                | Correlation | Strength
Speech Recognition  | 0.9         | Strong
Machine Translation | 0.85        | Strong
Text Generation     | 0.8         | Strong
Summarization       | 0.6         | Moderate
Question Answering  | 0.4         | Weak
Sentiment Analysis  | 0.3         | Weak

Limitations and Caveats

While perplexity is the standard language model metric, it has important limitations. Factors like out-of-vocabulary words, sentence length, and domain mismatch can significantly affect perplexity scores and their interpretation. Understanding these caveats helps avoid common pitfalls in model evaluation.

Out-of-Vocabulary Words

When the test set contains words not in the training vocabulary, standard perplexity calculation breaks. Solutions include:

  • Replace with <UNK>: Map unknown words to a special token
  • Character-level models: Avoid the OOV problem entirely
  • Open vocabulary: Use subword tokenization (BPE, WordPiece)
In[32]:
Code
def perplexity_with_oov_handling(model, test_sentences, oov_token="<unk>"):
    """Calculate perplexity with OOV word handling."""
    total_log_prob = 0.0
    total_words = 0
    oov_count = 0

    for tokens in test_sentences:
        # Replace OOV words
        processed = []
        for w in tokens:
            if w in model.vocab:
                processed.append(w)
            else:
                processed.append(oov_token)
                oov_count += 1

        padded = ["<s>"] * (model.n - 1) + processed + ["</s>"]

        for i in range(model.n - 1, len(padded)):
            context = tuple(padded[i - model.n + 1 : i])
            word = padded[i]
            total_log_prob += model.log_prob(word, context)
            total_words += 1

    avg_log_prob = total_log_prob / total_words
    pp = 2 ** (-avg_log_prob)

    return pp, oov_count, total_words


# Test with sentences containing OOV words
oov_test = [
    "the elephant danced gracefully".split(),
    "a penguin swam in the ocean".split(),
]

pp, oov, total = perplexity_with_oov_handling(final_model, oov_test)
Out[33]:
Console
OOV Handling Results:
----------------------------------------
Test sentences: 2
Total words: 12
OOV words: 6 (50.0%)
Perplexity (with OOV handling): 28.71

Warning: OOV rate of 50.0% is high. Perplexity may be unreliable.

The OOV rate here reflects that these test sentences contain words like "elephant," "penguin," and "ocean" that never appeared in our training corpus. Mapping these to <unk> allows perplexity calculation to proceed, but the resulting value is less meaningful because the model treats all unknown words identically regardless of their actual likelihood.

Sentence Length Effects

Perplexity can vary with sentence length. Very short sentences may have artificially low perplexity because they contain only common words. Very long sentences may have higher perplexity due to accumulated uncertainty.
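
You can observe this variability directly with the helpers from this chapter. The sketch below assumes the bigram_model, sentence_perplexity, and test_sentences objects defined earlier are still in scope; it simply groups per-sentence perplexities by sentence length and reports the spread within each group.

# Sketch: per-sentence perplexity grouped by sentence length,
# using the bigram_model and sentence_perplexity helper defined earlier.
by_length = {}
for tokens in test_sentences:
    pp = sentence_perplexity(bigram_model, tokens)
    by_length.setdefault(len(tokens), []).append(pp)

for length in sorted(by_length):
    pps = by_length[length]
    print(f"length {length:2d}: {len(pps)} sentence(s), "
          f"PP range {min(pps):.2f}-{max(pps):.2f}")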

Out[34]:
Visualization
Scatter plot showing perplexity values for sentences of different lengths, with a trend line showing how perplexity stabilizes for longer sentences.
Relationship between sentence length and perplexity. Shorter sentences tend to have more variable perplexity, while longer sentences converge toward the model's average performance. This variability in short sentences can make comparisons unreliable.

Domain Mismatch

A model trained on news text will have high perplexity on social media text, even if both are "English." This domain mismatch makes cross-domain perplexity comparisons problematic.
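
A quick way to see the effect with the tools from this chapter is to train an NgramModel on one domain and evaluate it both in-domain and on a different domain. The two tiny corpora below are made up purely for illustration; with realistic data the gap is typically much larger.

# Sketch: in-domain vs. cross-domain perplexity with the NgramModel class above.
news_like = [s.split() for s in [
    "the minister announced the new policy",
    "the government approved the budget",
    "the policy takes effect next year",
]]
chat_like = [s.split() for s in [
    "lol that movie was so good",
    "omg did you see that",
    "that was so good lol",
]]

news_model = NgramModel(n=2, k=0.1)
news_model.train(news_like)

# In-domain here means the model's own training sentences (for brevity);
# a proper comparison would use held-out sentences from each domain.
print("In-domain PP:   ", round(news_model.perplexity(news_like), 2))
print("Cross-domain PP:", round(news_model.perplexity(chat_like), 2))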

The Perplexity Trap

Optimizing solely for perplexity can lead to models that are good at predicting common patterns but poor at handling rare but important cases. A model can earn most of its perplexity gains by modeling frequent function words and boilerplate phrasing well, while still fumbling the rare, content-bearing words that matter most in real applications.

Historical Context and Modern Usage

Perplexity has been the standard language model metric since the 1980s, when it was used to evaluate early speech recognition systems. Its continued relevance speaks to its utility, but modern usage has evolved.

Classical Era (1980s-2000s)

N-gram models were evaluated primarily by perplexity. A trigram model with Kneser-Ney smoothing might achieve perplexity around 100-200 on news text. Improvements of even 5-10% in perplexity were considered significant.

Neural Era (2010s)

Recurrent neural networks (RNNs) and LSTMs dramatically reduced perplexity. Models achieved perplexity below 100, then below 50. Perplexity remained the primary metric for comparing architectures.

Transformer Era (2017-present)

Transformer models pushed perplexity even lower. GPT-2 achieved perplexity around 20 on certain benchmarks. However, researchers increasingly recognize that perplexity alone doesn't capture model capabilities like reasoning, factual accuracy, or safety.

Historical progression of language model perplexity on the Penn Treebank benchmark. Each architectural innovation brought significant perplexity reductions, from ~150 for n-gram models to ~20 for large transformers. Note that modern evaluation increasingly emphasizes task-specific metrics alongside perplexity.

Era   | Architecture          | Perplexity (PTB) | Improvement
1990s | N-gram (Kneser-Ney)   | ~150             | Baseline
2010  | Neural Language Model | ~90              | 40% reduction
2015  | LSTM                  | ~60              | 33% reduction
2017  | Transformer           | ~40              | 33% reduction
2019  | GPT-2                 | ~25              | 38% reduction
2020  | GPT-3                 | ~20              | 20% reduction

Summary

Perplexity measures how well a language model predicts held-out text. It's derived from cross-entropy and can be interpreted as the effective branching factor: the average number of equally likely choices the model faces at each prediction step.

Key takeaways:

  • Cross-entropy measures the average bits needed to encode test data using the model's probability distribution
  • Perplexity equals 2^{\text{cross-entropy}}, converting bits to an interpretable scale
  • Lower perplexity means the model assigns higher probability to actual text, indicating better predictions
  • Branching factor interpretation: perplexity of 100 means the model is as uncertain as choosing among 100 equally likely words
  • Held-out evaluation prevents overfitting by testing on unseen data
  • Same vocabulary and test set are required for fair model comparison
  • Perplexity doesn't always predict task performance, especially for tasks beyond text prediction
  • OOV handling is essential when test data contains words not in training vocabulary

Perplexity remains the standard intrinsic evaluation metric for language models. While it doesn't capture everything important about model quality, it provides a principled, comparable measure of a model's core capability: predicting what comes next in natural language.

Key Parameters

When computing and interpreting perplexity, these factors have the most impact:

Key parameters affecting perplexity computation and interpretation. Vocabulary size has the largest impact on raw perplexity values.

Parameter           | Typical Values | Effect
Vocabulary size     | 10K-100K words | Larger vocabularies increase perplexity because the model has more choices. Always compare models with the same vocabulary.
N-gram order        | 2-5            | Higher orders typically reduce perplexity but require more data. Diminishing returns beyond trigrams for most corpora.
Smoothing parameter | 0.001-0.5      | Affects perplexity through probability estimates. Tune on development data, not test data.
Test set size       | 10K+ words     | Larger test sets give more stable perplexity estimates. Small test sets may have high variance.
OOV rate            | < 5% ideal     | High OOV rates make perplexity less meaningful. Consider subword tokenization for open-vocabulary evaluation.
Log base            | 2 or e         | Using \log_2 gives bits; using \ln gives nats. Both are valid but not directly comparable.

