Master the Viterbi algorithm for finding optimal tag sequences in HMMs. Learn dynamic programming, backpointer tracking, log-space computation, and constrained decoding.

Viterbi Algorithm
When labeling a sequence of tokens, you face a fundamental decision: should you choose each label independently, or should you consider the entire sequence at once? Greedy decoding picks the best label at each position, ignoring how earlier choices affect later ones. But sequence labeling isn't about finding locally optimal tags; it's about finding the globally optimal tag sequence.
The Viterbi algorithm solves this problem elegantly. Given a sequence of observations and a model that defines transition and emission probabilities, Viterbi finds the single best sequence of hidden states. For part-of-speech tagging, it finds the best tag sequence. For named entity recognition, it finds the best BIO label sequence. The algorithm runs in time linear in the sequence length, making it practical for real applications.
This chapter develops the Viterbi algorithm from first principles. You'll understand why exhaustive search is intractable, how dynamic programming makes the problem solvable, and how to implement efficient decoding in practice. We also explore log-space computation to avoid numerical underflow and show how Viterbi forms the foundation for beam search decoding in neural sequence models.
The Optimal Path Problem
Consider a hidden Markov model for part-of-speech tagging. You observe a sequence of words and want to find the sequence of tags that maximizes the joint probability of seeing both the tags and the words together. The HMM factors this joint probability into a product of simpler terms:

$$P(t_1, \dots, t_n, w_1, \dots, w_n) = P(t_1) \prod_{i=2}^{n} P(t_i \mid t_{i-1}) \prod_{i=1}^{n} P(w_i \mid t_i)$$

where:
- $t_1, \dots, t_n$: the sequence of hidden states (POS tags) we want to infer
- $w_1, \dots, w_n$: the observed sequence of words
- $P(t_1)$: the initial state probability, capturing how likely the sentence starts with tag $t_1$
- $P(t_i \mid t_{i-1})$: the transition probability from tag $t_{i-1}$ to tag $t_i$, encoding grammatical patterns like "determiners often precede nouns"
- $P(w_i \mid t_i)$: the emission probability of word $w_i$ given tag $t_i$, capturing which words are typical for each tag
The first product captures transition probabilities: how likely is each tag given the previous one? The second product captures emission probabilities: how likely is each word given its tag? This factorization embodies the Markov assumption: each tag depends only on its immediate predecessor, not on the full history.
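To see the factorization concretely, a two-word sentence expands into one initial-state term, one transition term, and one emission term per word:

$$P(t_1, t_2, w_1, w_2) = P(t_1) \, P(w_1 \mid t_1) \, P(t_2 \mid t_1) \, P(w_2 \mid t_2)$$

This is exactly the structure the Viterbi algorithm will exploit.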
Given a sequence of observations and a model with transition and emission probabilities, find the sequence of hidden states that maximizes the probability of generating those observations.
Why Greedy Decoding Fails
A greedy approach picks the best tag at each position independently, considering only the previous tag. At each position $i$, greedy decoding selects:

$$\hat{t}_i = \arg\max_{t} \; P(w_i \mid t) \, P(t \mid \hat{t}_{i-1})$$

where:
- $\hat{t}_i$: the greedily chosen tag at position $i$
- $\arg\max_t$: selects the tag $t$ that maximizes the following expression
- $P(w_i \mid t)$: the emission probability of the current word given candidate tag $t$
- $P(t \mid \hat{t}_{i-1})$: the transition probability from the previously chosen tag to candidate tag $t$
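For contrast with what follows, here is a minimal sketch of such a greedy decoder. The dictionary-based interface (start_prob, trans_prob, emit_prob) is an assumption for illustration, not a listing from the handbook:

```python
def greedy_decode(observations, states, start_prob, trans_prob, emit_prob, eps=1e-10):
    """Pick the locally best tag at each position, conditioning only on the previous choice."""
    # Position 1: combine the start probability with the first emission.
    prev = max(states, key=lambda t: start_prob.get(t, eps) * emit_prob[t].get(observations[0], eps))
    tags = [prev]
    # Later positions: best tag given only the previously chosen tag.
    for word in observations[1:]:
        prev = max(states, key=lambda t: trans_prob[prev].get(t, eps) * emit_prob[t].get(word, eps))
        tags.append(prev)
    return tags
```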
This seems reasonable but can produce suboptimal sequences. The problem is that a locally good choice might force bad choices later. Consider tagging "The can can hold water." A greedy tagger might assign MODAL to the first "can" because modals are common after determiners. But this makes the second "can" difficult: modals rarely follow modals. The globally optimal solution assigns NOUN to the first "can" and VERB to the second, even though NOUN isn't the highest-probability tag for "can" in isolation.
This example demonstrates why greedy decoding fails. The locally optimal choice of MD for the first "can" yields score 0.180, beating NN's 0.125. But this decision creates a problem: the second "can" must follow a modal, and modals rarely follow modals (P(MD|MD) = 0.05). The alternative path through NN enables a much higher score at the next position (0.060 vs 0.030), potentially yielding a better overall sequence probability.
Let's trace the cumulative scores for both paths:
| Position | Greedy Path (DT→MD→VB) | Alternative Path (DT→NN→VB) |
|---|---|---|
| Position 2 (first "can") | 0.180 | 0.125 |
| Position 3 (second "can") | 0.0054 | 0.0075 |
The greedy path starts stronger (0.180 vs 0.125) because MD has higher emission probability for "can." But that local advantage backfires at the next step: the MD→VB continuation yields a cumulative score of only 0.0054 (0.180 × 0.030), while the NN→VB continuation reaches 0.0075 (0.125 × 0.060). The globally optimal path requires looking ahead, something greedy decoding cannot do.
The Exponential Search Space
If greedy decoding doesn't work, why not try all possible tag sequences? With $N$ possible tags and a sequence of length $T$, there are $N^T$ possible sequences. For a modest tag set of 45 tags and a 20-word sentence, that's $45^{20} \approx 10^{33}$ sequences.
Checking $10^{33}$ sequences one at a time would take longer than the age of the universe on any computer. The key insight is that we don't need to enumerate every sequence. Many sequences share common prefixes, and we can avoid recomputing scores for these shared subsequences. This is where dynamic programming enters.
The Viterbi Recursion
We've established that greedy decoding fails because local choices can force globally suboptimal sequences, and that exhaustive search is computationally impossible. How can we find the optimal path without examining every possibility? The answer lies in a beautiful insight from dynamic programming: we can build the optimal solution incrementally from smaller optimal solutions.
The Key Insight: Optimal Substructure
Imagine you're at position 3 in a sequence, and you want to know the best path that ends at the tag NOUN. Here's the crucial observation: regardless of how that path reached position 2, the best continuation to NOUN at position 3 depends only on:
- What state we're in at position 2 (not how we got there)
- The transition probability to NOUN
- The emission probability of the observed word from NOUN
This property is called optimal substructure. If the best path through the entire sequence passes through state $s$ at position 2 and then reaches NOUN at position 3, then the portion of that path from the start to $s$ must itself be optimal. Why? If there were a better path to $s$, we could substitute it and get a better overall path, contradicting our assumption that we had the best path.
The best path ending at state $s$ at position $i$ consists of the best path to some state $s'$ at position $i-1$, followed by a transition from $s'$ to $s$. We can find the globally optimal path by computing locally optimal subpaths.
This principle transforms our approach entirely. Instead of tracking complete paths (exponentially many), we only need to track the best score for reaching each state at each position. At any position, we have just $N$ values to maintain, one per possible state, rather than an exponentially growing set of paths.
Building the Recurrence: From Intuition to Formula
Let's formalize this insight. We define $v_i(s)$ as the probability of the most likely path that:
- Ends at state $s$ at position $i$
- Generates the observations $w_1, \dots, w_i$ seen so far
How do we compute $v_i(s)$? Consider any path ending at state $s$ at position $i$. That path must have come from some predecessor state $s'$ at position $i-1$. The probability of this path factors into three terms:
- The probability of the best path to $s'$: this is $v_{i-1}(s')$, which we've already computed
- The transition probability from $s'$ to $s$: this is $P(s \mid s')$
- The probability of emitting the current observation from $s$: this is $P(w_i \mid s)$
Since we want the best path to $s$, we take the maximum over all possible predecessor states. This gives us the Viterbi recurrence:

$$v_i(s) = \max_{s'} \Big[ v_{i-1}(s') \; P(s \mid s') \; P(w_i \mid s) \Big]$$

where:
- $v_i(s)$: the probability of the best path ending at state $s$ at position $i$
- $\max_{s'}$: takes the maximum over all possible predecessor states $s'$
- $v_{i-1}(s')$: the best score for reaching state $s'$ at the previous position (already computed)
- $P(s \mid s')$: the transition probability from predecessor state $s'$ to current state $s$
- $P(w_i \mid s)$: the emission probability of observing word $w_i$ from state $s$
Notice that $P(w_i \mid s)$ appears inside the maximization but doesn't depend on the predecessor $s'$. This is because once we decide to be in state $s$ at position $i$, the emission probability is fixed regardless of how we arrived. We could factor it out:

$$v_i(s) = P(w_i \mid s) \, \max_{s'} \Big[ v_{i-1}(s') \; P(s \mid s') \Big]$$
This form is computationally equivalent but makes clear that we're choosing the best predecessor based solely on the accumulated path score and transition, then multiplying by the emission.
The Base Case: Starting the Sequence
The recurrence needs a starting point. At position 1, there is no predecessor to consider. The probability of the best path ending at state $s$ after seeing just the first observation is simply:

$$v_1(s) = \pi(s) \, P(w_1 \mid s)$$

where:
- $\pi(s)$: the initial state probability (how likely the sequence starts in state $s$)
- $P(w_1 \mid s)$: the emission probability of the first observation $w_1$ from state $s$
This captures both how likely we are to start in state $s$ and how well state $s$ explains the first observed word.
Complexity: Why This Works
Let's count the operations. At each of the $T-1$ positions after the first, we compute $v_i(s)$ for each of the $N$ states. Computing each value requires examining $N$ predecessors. Total: $O(TN^2)$ operations.
Compare this to the $N^T$ paths we would examine exhaustively:
| Approach | Complexity | 45 tags, 20 words |
|---|---|---|
| Exhaustive | ~$N^T$ | $45^{20} \approx 10^{33}$ |
| Viterbi | $T \cdot N^2$ | $20 \times 45^2 = 40{,}500$ |
The exponential has become polynomial. This is the power of dynamic programming: by recognizing that we can solve the problem incrementally and that we only need to remember the best score per state (not every path), we collapse an intractable problem into an efficient algorithm.
Visualizing the Trellis
It helps to visualize the Viterbi computation as filling a grid called a trellis. The horizontal axis represents positions in the sequence (1 through $T$), and the vertical axis represents possible states (the tags or labels). Each cell holds the score $v_i(s)$, the probability of the best path ending at state $s$ at position $i$.
Arrows connect cells at adjacent positions. Each arrow from $s'$ to $s$ represents one term in the maximization: "if I came from state $s'$, what would my score be?" The algorithm fills this grid column by column, left to right, computing each cell from the cells in the previous column.
In the figure above, gray arrows show all possible transitions between adjacent positions. The red path highlights the optimal sequence discovered by Viterbi: DT → NN → VB → NN for the sentence "The can hold water." Notice that:
- At position 1, DT (determiner) wins because "The" is overwhelmingly a determiner
- At position 2, NN (noun) wins, not because "can" is most often a noun, but because this choice enables better continuations
- The algorithm found this globally optimal path without explicitly enumerating all $3^4 = 81$ possible sequences (3 states over 4 positions)
Now we understand how Viterbi computes scores efficiently. But there's a missing piece: at the end of the algorithm, we know the score of the best path, but we haven't recorded the path itself. How do we reconstruct which states we actually passed through?
Backpointer Tracking
The Viterbi recurrence gives us the score of the best path to each state, but we've lost something in the process: which path achieved that score? At each step, we took the maximum over predecessors, keeping only the winning score. To reconstruct the actual optimal sequence, we need to remember which predecessor won at each cell.
The Need for Backpointers
Think of it this way: when we compute $v_3(\text{NOUN})$, we're answering "what's the best path probability ending at NOUN at position 3?" But we're not recording "and that path came through VERB at position 2." Without this information, when we finish computing the entire trellis, we'll know that the best final state is (say) NOUN with score 0.0073, but we won't know the sequence of states that led there.
The solution is elegantly simple: alongside each score, store the predecessor that achieved it. We call this a backpointer because it points backward to the previous state in the optimal path.
Formalizing Backpointers
The backpointer $b_i(s)$ records which predecessor state maximized the score when computing $v_i(s)$:

$$b_i(s) = \arg\max_{s'} \Big[ v_{i-1}(s') \; P(s \mid s') \; P(w_i \mid s) \Big]$$

where:
- $b_i(s)$: the backpointer for state $s$ at position $i$, storing which predecessor led to the best score
- $\arg\max_{s'}$: returns the state (not the value) that maximizes the expression
- The expression inside is identical to the Viterbi recurrence
Notice the difference between $\max$ and $\arg\max$:
- $\max$ returns the value: "the best score is 0.042"
- $\arg\max$ returns the argument: "the best predecessor is VERB"
We compute both simultaneously: the same computation that finds the maximum score also identifies which predecessor achieved it.
Reconstructing the Path: The Backward Pass
Once we've filled the entire trellis, we have scores and backpointers for every (position, state) pair. Reconstruction proceeds in three steps:
Step 1: Find the best final state. The optimal path ends at whichever state has the highest score at the final position:

$$\hat{s}_T = \arg\max_{s} v_T(s)$$

where:
- $\hat{s}_T$: the optimal state at the final position $T$
- $v_T(s)$: the Viterbi score at the final position for each state $s$
Step 2: Follow backpointers backward. Starting from $\hat{s}_T$, we trace back through the stored pointers:

$$\hat{s}_i = b_{i+1}(\hat{s}_{i+1}), \quad i = T-1, T-2, \dots, 1$$

where:
- $\hat{s}_i$: the optimal state at position $i$
- $b_{i+1}(\hat{s}_{i+1})$: the backpointer stored at position $i+1$ for state $\hat{s}_{i+1}$, which tells us which state at position $i$ led to the best path through $\hat{s}_{i+1}$
Each backpointer tells us: "to reach this state optimally, I came from that state." By following this chain, we recover the complete path.
Step 3: Reverse the sequence. Since we traced from position $T$ back to position 1, the recovered sequence is in reverse order. A simple reversal gives us the forward path.
Why Backpointers Work
The beauty of this approach is that backpointers require only $O(TN)$ additional storage, one pointer per cell in the trellis, and add no complexity to the forward pass (we're already examining all predecessors to compute the maximum). The backward pass is a simple $O(T)$ traversal.
Reading the Implementation
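The original code listing isn't reproduced on this page, so the following is a minimal dictionary-based sketch consistent with the walkthrough below. The names V and backpointer and the 1e-10 epsilon follow that walkthrough; the exact function signature is an assumption.

```python
def viterbi(observations, states, start_prob, trans_prob, emit_prob, eps=1e-10):
    """Probability-space Viterbi decoding with backpointer tracking.

    start_prob[s], trans_prob[s_prev][s], and emit_prob[s][obs] are dictionaries;
    missing entries fall back to a small epsilon to avoid exact zeros.
    """
    V = [{} for _ in observations]            # V[i][s]: best score ending in s at position i
    backpointer = [{} for _ in observations]  # backpointer[i][s]: best predecessor of s

    # Base case: v_1(s) = pi(s) * P(w_1 | s)
    for s in states:
        V[0][s] = start_prob.get(s, eps) * emit_prob[s].get(observations[0], eps)
        backpointer[0][s] = None

    # Recursion: v_i(s) = max_{s'} v_{i-1}(s') * P(s | s') * P(w_i | s)
    for i in range(1, len(observations)):
        for s in states:
            best_prev, best_score = None, -1.0
            for s_prev in states:
                score = (V[i - 1][s_prev]
                         * trans_prob[s_prev].get(s, eps)
                         * emit_prob[s].get(observations[i], eps))
                if score > best_score:
                    best_prev, best_score = s_prev, score
            V[i][s] = best_score
            backpointer[i][s] = best_prev

    # Termination: pick the best final state, then follow backpointers.
    best_last = max(states, key=lambda s: V[-1][s])
    path = [best_last]
    for i in range(len(observations) - 1, 0, -1):
        path.append(backpointer[i][path[-1]])
    path.reverse()
    return path, V[-1][best_last]
```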
Let's walk through how this code implements the formulas we developed:
- Initialization: We create two data structures: `V` for Viterbi scores and `backpointer` for tracking predecessors. Each is a list of dictionaries, one per position.
- Base case: For the first observation, we apply $v_1(s) = \pi(s)\,P(w_1 \mid s)$. There's no predecessor, so backpointers are `None`.
- Recursion: The nested loops implement the recurrence. For each position `i` and each state `s`, we examine all predecessors `s_prev`, compute the score $v_{i-1}(s')\,P(s \mid s')\,P(w_i \mid s)$, and keep the maximum. We store both the score (`V[i][s]`) and which predecessor achieved it (`backpointer[i][s]`).
- Termination and backtracking: We find the state with the highest final score (this is $\hat{s}_T$), then follow the backpointers backward to reconstruct the path.
The implementation uses a small epsilon (1e-10) instead of zero for missing probabilities to avoid numerical issues with exact zeros.
A Worked Example
Now let's trace through this algorithm on a concrete problem. We'll use a simple HMM for weather inference: given a sequence of observed activities (walking, shopping, cleaning), infer the most likely sequence of weather states (sunny or rainy).
The HMM parameters encode two types of knowledge about the weather-activity relationship:
Transition Probabilities $P(s_i \mid s_{i-1})$: How likely is each weather state given the previous day's weather?
| Previous State | → Sunny | → Rainy |
|---|---|---|
| Sunny | 0.7 | 0.3 |
| Rainy | 0.4 | 0.6 |
Emission Probabilities $P(w_i \mid s_i)$: How likely is each activity given the current weather?
| Weather State | walk | shop | clean |
|---|---|---|---|
| Sunny | 0.6 | 0.3 | 0.1 |
| Rainy | 0.1 | 0.4 | 0.5 |
These tables reveal the model's structure. Weather tends to persist: sunny days have 70% chance of staying sunny, rainy days 60% chance of staying rainy. For activities, walking is a strong indicator of sunny weather, while cleaning suggests rain. Shopping is the most ambiguous activity, slightly favoring rainy days.
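To make the example runnable, the parameters can be written out for the dictionary-based sketch above. The variable names are illustrative, and the 0.6/0.4 start distribution follows the step-by-step discussion below:

```python
states = ("Sunny", "Rainy")
observations = ("walk", "shop", "clean")

start_prob = {"Sunny": 0.6, "Rainy": 0.4}
trans_prob = {
    "Sunny": {"Sunny": 0.7, "Rainy": 0.3},
    "Rainy": {"Sunny": 0.4, "Rainy": 0.6},
}
emit_prob = {
    "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1},
    "Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
}

path, prob = viterbi(observations, states, start_prob, trans_prob, emit_prob)
print(path, prob)  # hand-checked: ['Sunny', 'Rainy', 'Rainy'] with probability 0.01296
```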
Let's interpret this step-by-step computation:
Position 0 (observing "walk"): We apply the base case formula $v(s) = \pi(s) \, P(\text{walk} \mid s)$. Sunny scores higher (0.360 vs 0.040) for two reinforcing reasons: the prior favors sunny days (0.6 vs 0.4), and walking is much more likely on sunny days (0.6 vs 0.1).
Position 1 (observing "shop"): Now the recurrence kicks in. For each current state, we consider both possible predecessors:
- To reach Sunny: the best predecessor is Sunny (score 0.0756), since we benefit from both the high previous score and a favorable transition (sunny days tend to stay sunny)
- To reach Rainy: again the best predecessor is Sunny (score 0.0432), despite the transition penalty, because the previous Sunny score was so much higher than Rainy
Position 2 (observing "clean"): Cleaning heavily favors rain (emission 0.5 vs 0.1). The best way to reach Rainy is now from Rainy (0.0432 × 0.6 × 0.5 = 0.01296), which beats continuing from Sunny (0.0756 × 0.3 × 0.5 = 0.01134).
The result: the optimal path Sunny → Rainy → Rainy aligns with intuition: walking strongly suggests good weather, shopping is ambiguous (slightly favoring rain), and cleaning strongly suggests rain. Note that the final path passes through Rainy at position 1 even though Sunny had the higher score there: the backpointers, not the per-position winners, determine the path. The algorithm found this without examining all 8 possible paths. It computed only 12 values (2 states × 3 positions × 2 predecessor checks).
Log-Space Computation
Multiplying many small probabilities causes numerical underflow. For a sequence of 100 tokens with per-step probabilities around 0.01, the path probability is approximately $0.01^{100} = 10^{-200}$; a few hundred tokens more and it falls below the smallest positive value representable in 64-bit floating point (roughly $10^{-308}$), at which point it rounds to zero.
The solution is to work in log space. The logarithm function has a useful property: it converts products into sums. Since $\log(ab) = \log a + \log b$, we can rewrite the Viterbi recurrence entirely in terms of log probabilities:

$$\log v_i(s) = \max_{s'} \Big[ \log v_{i-1}(s') + \log P(s \mid s') + \log P(w_i \mid s) \Big]$$

where:
- $\log v_i(s)$: the log probability of the best path ending at state $s$ at position $i$
- $\log v_{i-1}(s')$: the log probability of the best path to predecessor state $s'$ (already computed)
- $\log P(s \mid s')$: the log transition probability from $s'$ to $s$
- $\log P(w_i \mid s)$: the log emission probability of word $w_i$ from state $s$
Log probabilities are negative since probabilities are between 0 and 1: $\log p < 0$ when $p < 1$. The best path has the highest (least negative) log probability. The $\max$ operation is unchanged because log is monotonic: if $a > b$, then $\log a > \log b$. This means finding the maximum log probability gives the same answer as finding the maximum probability.
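Here is a log-space variant of the earlier sketch, under the same assumed dictionary interface, with products replaced by sums:

```python
import math

def viterbi_log(observations, states, start_prob, trans_prob, emit_prob, eps=1e-10):
    """Log-space Viterbi: products become sums, the max is unchanged."""
    def log(p):
        return math.log(p) if p > 0 else math.log(eps)

    V = [{} for _ in observations]
    backpointer = [{} for _ in observations]

    # Base case in log space: log v_1(s) = log pi(s) + log P(w_1 | s)
    for s in states:
        V[0][s] = log(start_prob.get(s, eps)) + log(emit_prob[s].get(observations[0], eps))
        backpointer[0][s] = None

    # Recursion: log v_i(s) = max_{s'} [ log v_{i-1}(s') + log P(s | s') + log P(w_i | s) ]
    for i in range(1, len(observations)):
        for s in states:
            best_prev, best_score = None, float("-inf")
            for s_prev in states:
                score = (V[i - 1][s_prev]
                         + log(trans_prob[s_prev].get(s, eps))
                         + log(emit_prob[s].get(observations[i], eps)))
                if score > best_score:
                    best_prev, best_score = s_prev, score
            V[i][s] = best_score
            backpointer[i][s] = best_prev

    # Termination and backtracking, exactly as before.
    best_last = max(states, key=lambda s: V[-1][s])
    path = [best_last]
    for i in range(len(observations) - 1, 0, -1):
        path.append(backpointer[i][path[-1]])
    path.reverse()
    return path, V[-1][best_last]  # log probability of the best path
```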
Both methods find the identical optimal path, confirming that log-space computation preserves correctness. The probability-space result (best_prob) and the exponentiated log-space result match to numerical precision. For the 100-token demonstration, probability space underflows to exactly zero while log space produces a meaningful value of -230.26. This is why all production Viterbi implementations work in log space.
The visualization makes the underflow problem concrete. With a typical per-step probability of 0.1, probability-space values hit the float64 minimum around sequence length 308 and become indistinguishable from zero. Log-space values, in contrast, decrease linearly and remain perfectly representable regardless of sequence length. Even a 1000-word document produces a log probability around -2300, large but entirely computable.
Log-space computation is essential for practical implementations. The algorithm produces identical paths but with stable numerics.
Complexity Analysis
The Viterbi algorithm's efficiency comes from its dynamic programming structure. Let's analyze the time and space requirements.
Time Complexity
For a sequence of length $T$ with $N$ possible states:
- Initialization: $O(N)$ operations to set up the first column
- Recursion: For each of the $T-1$ remaining positions, we compute $N$ cells. Each cell considers $N$ predecessors. Total: $O(TN^2)$
- Backtracking: $O(T)$ to trace back through the path
The overall time complexity is $O(TN^2)$, which is linear in sequence length and quadratic in the number of states.
The benchmarks confirm the theoretical complexity. Doubling sequence length roughly doubles runtime (linear scaling), while doubling the number of states roughly quadruples runtime (quadratic scaling). Even with 20 states and 200-word sequences, Viterbi completes in tens of milliseconds, enabling real-time tagging in production systems.
Space Complexity
The algorithm stores:
- Viterbi table: $T \times N$ scores
- Backpointers: $T \times N$ references
Total space is $O(TN)$. However, if we only need the final score and not the full table, we can reduce space to $O(N)$ by keeping only two columns at a time, though we'd need to run a second pass to recover the path.
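As a sketch of the two-column idea (same assumed dictionary interface as before), computing only the best path's log score in $O(N)$ memory might look like this:

```python
import math

def viterbi_best_score(observations, states, start_prob, trans_prob, emit_prob, eps=1e-10):
    """Best-path log score only: keeps two score columns, no backpointers."""
    def log(p):
        return math.log(p) if p > 0 else math.log(eps)

    # Column for position 0.
    prev = {s: log(start_prob.get(s, eps)) + log(emit_prob[s].get(observations[0], eps))
            for s in states}
    # Each later column overwrites the previous one.
    for word in observations[1:]:
        prev = {s: max(prev[sp] + log(trans_prob[sp].get(s, eps)) for sp in states)
                   + log(emit_prob[s].get(word, eps))
                for s in states}
    return max(prev.values())
```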
Implementing Viterbi Efficiently
A production implementation incorporates several optimizations beyond the basic algorithm. Let's build a more efficient version.
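The handbook's listing isn't shown here; a vectorized log-space implementation might look roughly like the following. The array-based interface and the name viterbi_numpy are assumptions:

```python
import numpy as np

def viterbi_numpy(obs_ids, log_start, log_trans, log_emit):
    """Vectorized log-space Viterbi.

    obs_ids:   (T,) int array of observation indices
    log_start: (N,) log initial state probabilities
    log_trans: (N, N) log transition matrix, log_trans[s_prev, s]
    log_emit:  (N, V) log emission matrix, log_emit[s, obs]
    Returns the best state sequence and its log probability.
    """
    T, N = len(obs_ids), log_start.shape[0]
    V = np.empty((T, N))
    backpointer = np.empty((T, N), dtype=int)

    V[0] = log_start + log_emit[:, obs_ids[0]]
    backpointer[0] = 0  # unused at position 0

    for i in range(1, T):
        # scores[s_prev, s] = V[i-1, s_prev] + log_trans[s_prev, s]
        scores = V[i - 1][:, None] + log_trans
        backpointer[i] = np.argmax(scores, axis=0)
        V[i] = scores[backpointer[i], np.arange(N)] + log_emit[:, obs_ids[i]]

    # Backtrack from the best final state.
    path = np.empty(T, dtype=int)
    path[-1] = np.argmax(V[-1])
    for i in range(T - 1, 0, -1):
        path[i - 1] = backpointer[i, path[i]]
    return path, V[-1, path[-1]]
```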
This implementation uses NumPy's vectorized operations to compute scores for all states simultaneously, avoiding Python loops over states. The key insight is that the inner loop over predecessors becomes a matrix operation.
The NumPy implementation produces the same results as the dictionary-based version but runs significantly faster for large state spaces. The log probability is negative (as expected for log probabilities) and represents the joint probability of the path and observations. The vectorized approach eliminates Python loops over states, leveraging NumPy's optimized C implementations.
Sparse Transitions
Many HMMs have sparse transition matrices where most transitions have zero probability. We can exploit this by only considering valid transitions:
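One way to exploit sparsity, sketched under the same assumed dictionary interface, is to precompute the valid predecessors of each state and loop over only those in the recursion:

```python
from collections import defaultdict

def build_predecessor_lists(trans_prob, states):
    """For each state, keep only predecessors with nonzero transition probability."""
    predecessors = defaultdict(list)
    for s_prev in states:
        for s, p in trans_prob[s_prev].items():
            if p > 0:
                predecessors[s].append(s_prev)
    return predecessors

# Inside the Viterbi recursion, iterate over predecessors[s] instead of all states,
# reducing the per-cell work from N to the number of valid incoming transitions.
```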
For BIO tagging, where I-tags can only follow B-tags or I-tags of the same type, sparse transitions significantly reduce computation. With 9 entity types, you have 19 tags: one O tag plus B and I tags for each entity type (9 × 2 + 1 = 19). A dense transition matrix has $19^2 = 361$ entries, but valid transitions are far fewer since, for example, I-LOC cannot follow B-PER.
Viterbi as a Foundation for Beam Search
While Viterbi finds the single best path, many applications need the $k$ best paths. Neural sequence models often use beam search, which maintains the top $k$ partial hypotheses at each step. Beam search is essentially a pruned version of Viterbi.
Beam search is a heuristic search algorithm that explores the $k$ most promising paths in parallel. At each step, it keeps only the top $k$ (the beam width) partial sequences, pruning less promising alternatives.
The connection to Viterbi is direct: if we set the beam width to equal the number of states, beam search becomes equivalent to Viterbi. Smaller beam widths trade optimality for speed.
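A minimal beam-search sketch over the same assumed HMM interface looks like this. Note that this simple variant keeps the top-k partial paths overall; keeping the best hypothesis per state instead is what makes the k = N case coincide exactly with Viterbi:

```python
import math

def beam_search(observations, states, start_prob, trans_prob, emit_prob, beam_width=3, eps=1e-10):
    """Beam search over an HMM: keep only the k best partial paths at each step."""
    def log(p):
        return math.log(p) if p > 0 else math.log(eps)

    # Each hypothesis is (log_score, path).
    beam = [(log(start_prob.get(s, eps)) + log(emit_prob[s].get(observations[0], eps)), [s])
            for s in states]
    beam = sorted(beam, reverse=True)[:beam_width]

    for obs in observations[1:]:
        candidates = []
        for score, path in beam:
            for s in states:
                new_score = (score
                             + log(trans_prob[path[-1]].get(s, eps))
                             + log(emit_prob[s].get(obs, eps)))
                candidates.append((new_score, path + [s]))
        beam = sorted(candidates, reverse=True)[:beam_width]  # prune to the top k

    return beam[0]  # (log probability, best path found)
```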
The comparison reveals the tradeoff between beam width and optimality. Greedy search (k=1) may find a suboptimal path because it prunes too aggressively. As beam width increases, we retain more hypotheses and are more likely to find the optimal solution. With beam width equal to the number of states, beam search becomes exact Viterbi. In practice, small beam widths (3-10) often find near-optimal solutions while running much faster than exact Viterbi.
Beam search with $k = 1$ is greedy decoding. As $k$ increases, we approach Viterbi's optimal solution. The tradeoff is computation time: beam search is $O(TkN)$ where $k$ is the beam width, compared to Viterbi's $O(TN^2)$.
Constrained Decoding
In sequence labeling, certain tag transitions are invalid. An I-LOC tag cannot follow a B-PER tag. Rather than learning these constraints from data, we can enforce them during decoding by setting invalid transition probabilities to zero (or negative infinity in log space).
The constraints follow BIO semantics: I-tags can only continue entities of the same type, while B-tags and O can follow any tag. Invalid transitions like O → I-PER (starting an entity with a continuation tag) or B-PER → I-LOC (continuing a PER entity as LOC) are explicitly forbidden. By setting these transition probabilities to negative infinity during Viterbi decoding, we guarantee the output is always a well-formed BIO sequence.
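A sketch of how such a constraint mask might be built for BIO tags (the helper name and tag ordering are illustrative, not from the handbook):

```python
import numpy as np

def bio_transition_mask(entity_types):
    """Boolean mask of allowed BIO transitions: mask[prev, curr] is True if valid.

    Tag order: O, then B-X and I-X for each entity type X.
    """
    tags = ["O"] + [f"{p}-{t}" for t in entity_types for p in ("B", "I")]
    n = len(tags)
    mask = np.ones((n, n), dtype=bool)
    for j, curr in enumerate(tags):
        if curr.startswith("I-"):
            etype = curr[2:]
            for i, prev in enumerate(tags):
                # I-X may only follow B-X or I-X of the same type.
                mask[i, j] = prev in (f"B-{etype}", f"I-{etype}")
    return tags, mask

# Usage: add np.where(mask, 0.0, -np.inf) to the log-transition matrix before decoding.
tags, mask = bio_transition_mask(["PER", "LOC", "ORG"])
print(int(mask.sum()), "valid transitions out of", mask.size)
```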
The heatmap reveals the structure of BIO constraints at a glance. The first column (O tag) shows that any tag can precede O. The B-tag columns (2, 4, 6) are fully green because B-tags start new entities and can follow any tag. The distinctive pattern appears in the I-tag columns: each I-tag has exactly two green cells, corresponding to its matching B and I tags. This sparsity (34 valid out of 49 possible transitions) enables significant optimization during Viterbi decoding.
Constrained decoding ensures the output is always a valid BIO sequence, even if the model's raw scores would suggest otherwise. This is particularly important for neural models that don't explicitly model transition constraints.
Limitations and Practical Considerations
The Viterbi algorithm is optimal for first-order Markov models where the current state depends only on the previous state. Several limitations arise in practice:
- Higher-order dependencies exist in natural language. The optimal tag for a word might depend on tags two or three positions back. Extending Viterbi to second-order Markov models is possible but increases complexity to $O(TN^3)$, since we must track pairs of previous states.
- Non-local features are common in neural sequence models. A BiLSTM encodes the entire sentence before making predictions, so the score at each position depends on the full context, not just the previous tag. CRF layers on top of neural models use Viterbi for decoding, but the emission scores come from the neural network rather than simple probability tables.
- Approximate inference becomes necessary with large state spaces. In machine translation, the "states" are partial translations with exponentially many possibilities. Beam search provides a practical approximation, trading optimality for tractability.
- Label dependencies beyond transitions sometimes matter. In some tagging schemes, the entity type of a word might depend on entity types earlier in the sentence, not just the immediately preceding tag. Such long-range dependencies require more sophisticated inference methods.
Despite these limitations, Viterbi remains the workhorse algorithm for sequence labeling. Its efficiency, optimality guarantees for Markov models, and straightforward implementation make it the right choice for most POS tagging and NER systems. When combined with neural feature extractors and CRF output layers, it achieves state-of-the-art results on many benchmarks.
Summary
The Viterbi algorithm solves the optimal path problem for sequence labeling. Given a sequence of observations and a model with transition and emission probabilities, it finds the single best sequence of hidden states in time linear in sequence length.
The key ideas covered in this chapter:
Dynamic programming makes the exponential search space tractable. By storing the best score for each state at each position, we avoid recomputing shared subsequences. The $O(TN^2)$ complexity enables real-time tagging of long documents.
Backpointer tracking enables path reconstruction. At each cell in the Viterbi table, we record which predecessor achieved the maximum score. After filling the table, we trace back from the best final state to recover the optimal path.
Log-space computation prevents numerical underflow. Working with log probabilities converts products to sums, keeping values in a representable range even for long sequences. This is essential for any practical implementation.
Constrained decoding enforces structural requirements. For BIO tagging, we set invalid transition probabilities to zero, ensuring the output is always a valid tag sequence regardless of what the model's raw scores suggest.
Beam search connects to Viterbi as an approximation. By keeping only the top $k$ hypotheses at each step, beam search trades optimality for speed. With beam width equal to the number of states, it recovers Viterbi exactly.
Key Parameters and Functions
When implementing or using Viterbi decoding, these parameters control behavior:
- states: The set of possible hidden states (tags). Larger state sets increase computation quadratically
- transition_prob: Matrix or dictionary of $P(s \mid s')$ values. Sparse matrices enable optimization
- emission_prob: Matrix or dictionary of $P(w \mid s)$ values. In neural models, these come from network outputs
- start_prob: Initial state distribution $\pi(s)$. Often uniform or learned from data
- beam_width: For approximate decoding, controls the number of hypotheses retained. Larger values approach exact Viterbi
The next chapter introduces Conditional Random Fields, which generalize HMMs by allowing arbitrary feature functions while still using Viterbi for efficient decoding.