Search

Search articles

Viterbi Algorithm: Dynamic Programming for Optimal Sequence Decoding

Michael BrenndoerferDecember 15, 202538 min read

Master the Viterbi algorithm for finding optimal tag sequences in HMMs. Learn dynamic programming, backpointer tracking, log-space computation, and constrained decoding.

Reading Level

Choose your expertise level to adjust how many terms are explained. Beginners see more tooltips, experts see fewer to maintain reading flow. Hover over underlined terms for instant definitions.

Viterbi Algorithm

When labeling a sequence of tokens, you face a fundamental decision: should you choose each label independently, or should you consider the entire sequence at once? Greedy decoding picks the best label at each position, ignoring how earlier choices affect later ones. But sequence labeling isn't about finding locally optimal tags; it's about finding the globally optimal tag sequence.

The Viterbi algorithm solves this problem elegantly. Given a sequence of observations and a model that defines transition and emission probabilities, Viterbi finds the single best sequence of hidden states. For part-of-speech tagging, it finds the best tag sequence. For named entity recognition, it finds the best BIO label sequence. The algorithm runs in time linear in the sequence length, making it practical for real applications.

This chapter develops the Viterbi algorithm from first principles. You'll understand why exhaustive search is intractable, how dynamic programming makes the problem solvable, and how to implement efficient decoding in practice. We also explore log-space computation to avoid numerical underflow and show how Viterbi forms the foundation for beam search decoding in neural sequence models.

The Optimal Path Problem

Consider a hidden Markov model for part-of-speech tagging. You observe a sequence of words w1,w2,,wnw_1, w_2, \ldots, w_n and want to find the sequence of tags t1,t2,,tnt_1, t_2, \ldots, t_n that maximizes the joint probability of seeing both the tags and the words together. The HMM factors this joint probability into a product of simpler terms:

P(t1,,tn,w1,,wn)=P(t1)i=2nP(titi1)i=1nP(witi)P(t_1, \ldots, t_n, w_1, \ldots, w_n) = P(t_1) \prod_{i=2}^{n} P(t_i | t_{i-1}) \prod_{i=1}^{n} P(w_i | t_i)

where:

  • t1,,tnt_1, \ldots, t_n: the sequence of hidden states (POS tags) we want to infer
  • w1,,wnw_1, \ldots, w_n: the observed sequence of words
  • P(t1)P(t_1): the initial state probability, capturing how likely the sentence starts with tag t1t_1
  • P(titi1)P(t_i | t_{i-1}): the transition probability from tag ti1t_{i-1} to tag tit_i, encoding grammatical patterns like "determiners often precede nouns"
  • P(witi)P(w_i | t_i): the emission probability of word wiw_i given tag tit_i, capturing which words are typical for each tag

The first product captures transition probabilities: how likely is each tag given the previous one? The second product captures emission probabilities: how likely is each word given its tag? This factorization embodies the Markov assumption: each tag depends only on its immediate predecessor, not on the full history.

Optimal Path Problem

Given a sequence of observations and a model with transition and emission probabilities, find the sequence of hidden states that maximizes the probability of generating those observations.

Why Greedy Decoding Fails

A greedy approach picks the best tag at each position independently, considering only the previous tag. At each position ii, greedy decoding selects:

ti=argmaxtiP(witi)P(titi1)t_i^* = \arg\max_{t_i} P(w_i | t_i) \cdot P(t_i | t_{i-1}^*)

where:

  • tit_i^*: the greedily chosen tag at position ii
  • argmaxti\arg\max_{t_i}: selects the tag tit_i that maximizes the following expression
  • P(witi)P(w_i | t_i): the emission probability of the current word given candidate tag tit_i
  • P(titi1)P(t_i | t_{i-1}^*): the transition probability from the previously chosen tag ti1t_{i-1}^* to candidate tag tit_i

This seems reasonable but can produce suboptimal sequences. The problem is that a locally good choice might force bad choices later. Consider tagging "The can can hold water." A greedy tagger might assign MODAL to the first "can" because modals are common after determiners. But this makes the second "can" difficult: modals rarely follow modals. The globally optimal solution assigns NOUN to the first "can" and VERB to the second, even though NOUN isn't the highest-probability tag for "can" in isolation.

In[2]:
Code
# Demonstration of greedy vs. optimal tagging

# Simple transition probabilities (log space for clarity)
transitions = {
    ("<START>", "DT"): 0.3,
    ("<START>", "NN"): 0.1,
    ("DT", "NN"): 0.5,
    ("DT", "MD"): 0.3,
    ("DT", "VB"): 0.1,
    ("NN", "VB"): 0.4,
    ("NN", "MD"): 0.2,
    ("MD", "VB"): 0.8,
    ("MD", "MD"): 0.05,  # Rare: modal after modal
    ("VB", "NN"): 0.3,
    ("VB", "DT"): 0.2,
}

# Emission probabilities for "can"
emissions_can = {
    "MD": 0.6,  # Modal is most common usage
    "NN": 0.25,  # Container noun
    "VB": 0.15,  # Preserve/put in cans
}

sentence = ["The", "can", "can", "hold", "water"]
Out[3]:
Console
The Greedy Decoding Problem
=======================================================

Sentence: The can can hold water

At position 2 (first 'can'), greedy considers:
  P('can'|MD) × P(MD|DT) = 0.60 × 0.30 = 0.180
  P('can'|NN) × P(NN|DT) = 0.25 × 0.50 = 0.125
  P('can'|VB) × P(VB|DT) = 0.15 × 0.10 = 0.015

Greedy picks MD (highest score: 0.180)

But now at position 3 (second 'can'):
  P('can'|VB) × P(VB|MD) = 0.15 × 0.80 = 0.120
  P('can'|MD) × P(MD|MD) = 0.60 × 0.05 = 0.030

The greedy choice of MD at position 2 forces a bad situation at position 3.
If we had chosen NN at position 2 instead:
  P('can'|VB) × P(VB|NN) = 0.15 × 0.40 = 0.060

This example demonstrates why greedy decoding fails. The locally optimal choice of MD for the first "can" yields score 0.180, beating NN's 0.125. But this decision creates a problem: the second "can" must follow a modal, and modals rarely follow modals (P(MD|MD) = 0.05). The alternative path through NN enables a much higher score at the next position (0.060 vs 0.030), potentially yielding a better overall sequence probability.

Let's trace the cumulative scores for both paths:

Cumulative path scores comparing greedy vs alternative decoding.
PositionGreedy Path (DT→MD→VB)Alternative Path (DT→NN→VB)
Position 2 (first "can")0.1800.125
Position 3 (second "can")0.02160.0075

The greedy path starts stronger (0.180 vs 0.125) because MD has higher emission probability for "can." But this local advantage creates a problem: the MD→VB transition at position 3 yields only 0.0216, while NN→VB would have enabled a score of 0.0075. The globally optimal path requires looking ahead, something greedy decoding cannot do.

The Exponential Search Space

If greedy decoding doesn't work, why not try all possible tag sequences? With kk possible tags and a sequence of length nn, there are knk^n possible sequences. For a modest vocabulary of 45 tags and a 20-word sentence, that's 4520103345^{20} \approx 10^{33} sequences, far more than the number of atoms in the observable universe.

In[4]:
Code
def count_sequences(num_tags, seq_length):
    """Calculate the number of possible tag sequences."""
    return num_tags**seq_length


# Common scenarios
scenarios = [
    ("Simple (5 tags, 10 words)", 5, 10),
    ("Penn Treebank (45 tags, 15 words)", 45, 15),
    ("Penn Treebank (45 tags, 20 words)", 45, 20),
    ("Penn Treebank (45 tags, 30 words)", 45, 30),
]

sequence_counts = [
    (name, count_sequences(tags, length)) for name, tags, length in scenarios
]
Out[5]:
Console
Exhaustive Search is Intractable
============================================================

Scenario                                          Sequences
------------------------------------------------------------
Simple (5 tags, 10 words)                             ~10^6
Penn Treebank (45 tags, 15 words)                    ~10^24
Penn Treebank (45 tags, 20 words)                    ~10^33
Penn Treebank (45 tags, 30 words)                    ~10^49

For comparison:
  Atoms in the universe: ~10^80
  Nanoseconds since Big Bang: ~10^26

Even with a modest 45-tag vocabulary, a 20-word sentence produces more possible sequences than atoms in the observable universe. Checking each sequence individually would take longer than the age of the universe on any computer. The key insight is that we don't need to enumerate every sequence. Many sequences share common prefixes, and we can avoid recomputing scores for these shared subsequences. This is where dynamic programming enters.

The Viterbi Recursion

We've established that greedy decoding fails because local choices can force globally suboptimal sequences, and that exhaustive search is computationally impossible. How can we find the optimal path without examining every possibility? The answer lies in a beautiful insight from dynamic programming: we can build the optimal solution incrementally from smaller optimal solutions.

The Key Insight: Optimal Substructure

Imagine you're at position 3 in a sequence, and you want to know the best path that ends at the tag NOUN. Here's the crucial observation: regardless of how that path reached position 2, the best continuation to NOUN at position 3 depends only on:

  1. What state we're in at position 2 (not how we got there)
  2. The transition probability to NOUN
  3. The emission probability of the observed word from NOUN

This property is called optimal substructure. If the best path through the entire sequence passes through state ss' at position 2 and then reaches NOUN at position 3, then the portion of that path from the start to ss' must itself be optimal. Why? If there were a better path to ss', we could substitute it and get a better overall path, contradicting our assumption that we had the best path.

Viterbi Principle (Bellman's Optimality)

The best path ending at state ss at position ii consists of the best path to some state ss' at position i1i-1, followed by a transition from ss' to ss. We can find the globally optimal path by computing locally optimal subpaths.

This principle transforms our approach entirely. Instead of tracking complete paths (exponentially many), we only need to track the best score for reaching each state at each position. At any position, we have just kk values to maintain, one per possible state, rather than an exponentially growing set of paths.

Building the Recurrence: From Intuition to Formula

Let's formalize this insight. We define Vi(s)V_i(s) as the probability of the most likely path that:

  • Ends at state ss at position ii
  • Generates the observations w1,,wiw_1, \ldots, w_i seen so far

How do we compute Vi(s)V_i(s)? Consider any path ending at state ss at position ii. That path must have come from some predecessor state ss' at position i1i-1. The probability of this path factors into three terms:

  1. The probability of the best path to ss': This is Vi1(s)V_{i-1}(s'), which we've already computed
  2. The transition probability from ss' to ss: This is P(ss)P(s | s')
  3. The probability of emitting the current observation from ss: This is P(wis)P(w_i | s)

Since we want the best path to ss, we take the maximum over all possible predecessor states. This gives us the Viterbi recurrence:

Vi(s)=maxs[Vi1(s)P(ss)P(wis)]V_i(s) = \max_{s'} \left[ V_{i-1}(s') \cdot P(s | s') \cdot P(w_i | s) \right]

where:

  • Vi(s)V_i(s): the probability of the best path ending at state ss at position ii
  • maxs\max_{s'}: takes the maximum over all possible predecessor states ss'
  • Vi1(s)V_{i-1}(s'): the best score for reaching state ss' at the previous position (already computed)
  • P(ss)P(s | s'): the transition probability from predecessor state ss' to current state ss
  • P(wis)P(w_i | s): the emission probability of observing word wiw_i from state ss

Notice that P(wis)P(w_i | s) appears inside the maximization but doesn't depend on the predecessor ss'. This is because once we decide to be in state ss at position ii, the emission probability is fixed regardless of how we arrived. We could factor it out:

Vi(s)=P(wis)maxs[Vi1(s)P(ss)]V_i(s) = P(w_i | s) \cdot \max_{s'} \left[ V_{i-1}(s') \cdot P(s | s') \right]

This form is computationally equivalent but makes clear that we're choosing the best predecessor based solely on the accumulated path score and transition, then multiplying by the emission.

The Base Case: Starting the Sequence

The recurrence needs a starting point. At position 1, there is no predecessor to consider. The probability of the best path ending at state ss after seeing just the first observation is simply:

V1(s)=P(s)P(w1s)V_1(s) = P(s) \cdot P(w_1 | s)

where:

  • P(s)P(s): the initial state probability (how likely the sequence starts in state ss)
  • P(w1s)P(w_1 | s): the emission probability of the first observation from state ss

This captures both how likely we are to start in state ss and how well state ss explains the first observed word.

Complexity: Why This Works

Let's count the operations. At each of the nn positions (after the first), we compute Vi(s)V_i(s) for each of the kk states. Computing each Vi(s)V_i(s) requires examining kk predecessors. Total: O(nk2)O(n \cdot k^2) operations.

Compare this to the knk^n paths we would examine exhaustively:

Complexity comparison between exhaustive search and Viterbi.
ApproachComplexity45 tags, 20 words
ExhaustiveO(kn)O(k^n)~103310^{33}
ViterbiO(nk2)O(n \cdot k^2)40,500

The exponential has become polynomial. This is the power of dynamic programming: by recognizing that we can solve the problem incrementally and that we only need to remember the best score per state (not every path), we collapse an intractable problem into an efficient algorithm.

Visualizing the Trellis

It helps to visualize the Viterbi computation as filling a grid called a trellis. The horizontal axis represents positions in the sequence (1 through nn), and the vertical axis represents possible states (the kk tags or labels). Each cell (i,s)(i, s) holds the score Vi(s)V_i(s), the probability of the best path ending at state ss at position ii.

Arrows connect cells at adjacent positions. Each arrow from (i1,s)(i-1, s') to (i,s)(i, s) represents one term in the maximization: "if I came from state ss', what would my score be?" The algorithm fills this grid column by column, left to right, computing each cell from the cells in the previous column.

Out[6]:
Visualization
Grid diagram showing positions 1-4 horizontally and three possible tags vertically, with arrows connecting adjacent columns representing transition probabilities.
A Viterbi trellis for part-of-speech tagging. Each column represents a position in the sequence, and each row represents a possible tag. Blue nodes show states, and arrows show transitions weighted by transition and emission probabilities. The algorithm finds the highest-scoring path from left to right through this lattice.

In the figure above, gray arrows show all possible transitions between adjacent positions. The red path highlights the optimal sequence discovered by Viterbi: DT → NN → VB → NN for the sentence "The can hold water." Notice that:

  • At position 1, DT (determiner) wins because "The" is overwhelmingly a determiner
  • At position 2, NN (noun) wins, not because "can" is most often a noun, but because this choice enables better continuations
  • The algorithm found this globally optimal path without explicitly enumerating all 81 possible sequences (3 states × 4 positions)

Now we understand how Viterbi computes scores efficiently. But there's a missing piece: at the end of the algorithm, we know the score of the best path, but we haven't recorded the path itself. How do we reconstruct which states we actually passed through?

Backpointer Tracking

The Viterbi recurrence gives us the score of the best path to each state, but we've lost something in the process: which path achieved that score? At each step, we took the maximum over predecessors, keeping only the winning score. To reconstruct the actual optimal sequence, we need to remember which predecessor won at each cell.

The Need for Backpointers

Think of it this way: when we compute V3(NOUN)V_3(\text{NOUN}), we're answering "what's the best path probability ending at NOUN at position 3?" But we're not recording "and that path came through VERB at position 2." Without this information, when we finish computing the entire trellis, we'll know that the best final state is (say) NOUN with score 0.0073, but we won't know the sequence of states that led there.

The solution is elegantly simple: alongside each score, store the predecessor that achieved it. We call this a backpointer because it points backward to the previous state in the optimal path.

Formalizing Backpointers

The backpointer Bi(s)B_i(s) records which predecessor state ss' maximized the score when computing Vi(s)V_i(s):

Bi(s)=argmaxs[Vi1(s)P(ss)P(wis)]B_i(s) = \arg\max_{s'} \left[ V_{i-1}(s') \cdot P(s | s') \cdot P(w_i | s) \right]

where:

  • Bi(s)B_i(s): the backpointer for state ss at position ii, storing which predecessor led to the best score
  • argmaxs\arg\max_{s'}: returns the state ss' (not the value) that maximizes the expression
  • The expression inside is identical to the Viterbi recurrence

Notice the difference between max\max and argmax\arg\max:

  • max\max returns the value: "the best score is 0.042"
  • argmax\arg\max returns the argument: "the best predecessor is VERB"

We compute both simultaneously: the same computation that finds the maximum score also identifies which predecessor achieved it.

Reconstructing the Path: The Backward Pass

Once we've filled the entire trellis, we have scores and backpointers for every (position, state) pair. Reconstruction proceeds in three steps:

Step 1: Find the best final state. The optimal path ends at whichever state has the highest score at the final position:

sn=argmaxsVn(s)s_n^* = \arg\max_s V_n(s)

where:

  • sns_n^*: the optimal state at the final position nn
  • Vn(s)V_n(s): the Viterbi score at the final position for each state ss

Step 2: Follow backpointers backward. Starting from sns_n^*, we trace back through the stored pointers:

si=Bi+1(si+1)s_i^* = B_{i+1}(s_{i+1}^*)

where:

  • sis_i^*: the optimal state at position ii
  • Bi+1(si+1)B_{i+1}(s_{i+1}^*): the backpointer stored at position i+1i+1 for state si+1s_{i+1}^*, which tells us which state at position ii led to the best path through si+1s_{i+1}^*

Each backpointer tells us: "to reach this state optimally, I came from that state." By following this chain, we recover the complete path.

Step 3: Reverse the sequence. Since we traced from position nn back to position 1, the recovered sequence is in reverse order. A simple reversal gives us the forward path.

Why Backpointers Work

The beauty of this approach is that backpointers require only O(nk)O(n \cdot k) additional storage, one pointer per cell in the trellis, and add no complexity to the forward pass (we're already examining all predecessors to compute the maximum). The backward pass is a simple O(n)O(n) traversal.

In[7]:
Code
def viterbi_with_backpointers(
    observations, states, start_prob, trans_prob, emit_prob
):
    """
    Viterbi algorithm with backpointer tracking.

    Args:
        observations: List of observed symbols
        states: List of possible hidden states
        start_prob: Dict mapping state -> initial probability
        trans_prob: Dict mapping (prev_state, state) -> transition probability
        emit_prob: Dict mapping (state, observation) -> emission probability

    Returns:
        best_path: List of states forming the optimal path
        best_prob: Probability of the optimal path
    """
    n = len(observations)

    # Initialize Viterbi table and backpointers
    # V[i][s] = best probability of path ending at state s at position i
    V = [{} for _ in range(n)]
    backpointer = [{} for _ in range(n)]

    # Base case: first observation
    for s in states:
        emit = emit_prob.get((s, observations[0]), 1e-10)
        V[0][s] = start_prob.get(s, 1e-10) * emit
        backpointer[0][s] = None

    # Recursive case: fill the trellis
    for i in range(1, n):
        for s in states:
            # Find best predecessor
            best_prev = None
            best_score = 0

            emit = emit_prob.get((s, observations[i]), 1e-10)

            for s_prev in states:
                trans = trans_prob.get((s_prev, s), 1e-10)
                score = V[i - 1][s_prev] * trans * emit

                if score > best_score:
                    best_score = score
                    best_prev = s_prev

            V[i][s] = best_score
            backpointer[i][s] = best_prev

    # Find best final state
    best_final_state = max(states, key=lambda s: V[n - 1][s])
    best_prob = V[n - 1][best_final_state]

    # Backtrack to find the path
    path = [best_final_state]
    current = best_final_state
    for i in range(n - 1, 0, -1):
        current = backpointer[i][current]
        path.append(current)

    path.reverse()
    return path, best_prob

Reading the Implementation

Let's walk through how this code implements the formulas we developed:

  1. Initialization (lines 17-24): We create two data structures: V for Viterbi scores and backpointer for tracking predecessors. Each is a list of dictionaries, one per position.

  2. Base case (lines 26-30): For the first observation, we apply V1(s)=P(s)P(w1s)V_1(s) = P(s) \cdot P(w_1 | s). There's no predecessor, so backpointers are None.

  3. Recursion (lines 32-50): The nested loops implement the recurrence. For each position i and each state s, we examine all predecessors s_prev, compute the score Vi1(s)P(ss)P(wis)V_{i-1}(s') \cdot P(s|s') \cdot P(w_i|s), and keep the maximum. We store both the score (V[i][s]) and which predecessor achieved it (backpointer[i][s]).

  4. Termination and backtracking (lines 52-63): We find the state with the highest final score (this is sn=argmaxsVn(s)s_n^* = \arg\max_s V_n(s)), then follow backpointers backward to reconstruct the path.

The implementation uses a small epsilon (1e-10) instead of zero for missing probabilities to avoid numerical issues with exact zeros.

A Worked Example

Now let's trace through this algorithm on a concrete problem. We'll use a simple HMM for weather inference: given a sequence of observed activities (walking, shopping, cleaning), infer the most likely sequence of weather states (sunny or rainy).

In[8]:
Code
# Simple HMM: weather affects activities
states = ["Sunny", "Rainy"]
observations = ["walk", "shop", "clean"]

# Initial probabilities
start_prob = {"Sunny": 0.6, "Rainy": 0.4}

# Transition probabilities
trans_prob = {
    ("Sunny", "Sunny"): 0.7,
    ("Sunny", "Rainy"): 0.3,
    ("Rainy", "Sunny"): 0.4,
    ("Rainy", "Rainy"): 0.6,
}

# Emission probabilities
emit_prob = {
    ("Sunny", "walk"): 0.6,
    ("Sunny", "shop"): 0.3,
    ("Sunny", "clean"): 0.1,
    ("Rainy", "walk"): 0.1,
    ("Rainy", "shop"): 0.4,
    ("Rainy", "clean"): 0.5,
}

# Run Viterbi
best_path, best_prob = viterbi_with_backpointers(
    observations, states, start_prob, trans_prob, emit_prob
)

The HMM parameters encode two types of knowledge about the weather-activity relationship:

Transition Probabilities P(stst1)P(s_t | s_{t-1}): How likely is each weather state given the previous day's weather?

Transition probability matrix showing weather persistence. Sunny and rainy days both tend to continue (0.7 and 0.6 respectively).
Current State→ Sunny→ Rainy
Sunny0.70.3
Rainy0.40.6

Emission Probabilities P(ws)P(w | s): How likely is each activity given the current weather?

Emission probability matrix showing activity preferences by weather. Walking strongly indicates sunny weather (0.6 vs 0.1); cleaning strongly indicates rain (0.5 vs 0.1). Shopping is ambiguous.
Weather Statewalkshopclean
Sunny0.60.30.1
Rainy0.10.40.5

These tables reveal the model's structure. Weather tends to persist: sunny days have 70% chance of staying sunny, rainy days 60% chance of staying rainy. For activities, walking is a strong indicator of sunny weather, while cleaning suggests rain. Shopping is the most ambiguous activity, slightly favoring rainy days.

Out[9]:
Console
Viterbi Example: Weather from Activities
=======================================================

Observations: ['walk', 'shop', 'clean']
Possible states: ['Sunny', 'Rainy']

Step-by-step computation:
-------------------------------------------------------

Position 0: observing 'walk'
  V[0][Sunny] = P(Sunny) × P('walk'|Sunny)
           = 0.6 × 0.6 = 0.360
  V[0][Rainy] = P(Rainy) × P('walk'|Rainy)
           = 0.4 × 0.1 = 0.040

Position 1: observing 'shop'
  V[1][Sunny] = max over predecessors:
    from Sunny: 0.360 × 0.7 × 0.3 = 0.0756
    from Rainy: 0.040 × 0.4 × 0.3 = 0.0048
  V[1][Rainy] = max over predecessors:
    from Sunny: 0.360 × 0.3 × 0.4 = 0.0432
    from Rainy: 0.040 × 0.6 × 0.4 = 0.0096

-------------------------------------------------------

Optimal path: Sunny → Rainy → Rainy
Path probability: 0.012960

Let's interpret this step-by-step computation:

Position 0 (observing "walk"): We apply the base case formula V1(s)=P(s)P(w1s)V_1(s) = P(s) \cdot P(w_1|s). Sunny scores higher (0.360 vs 0.040) for two reinforcing reasons: the prior favors sunny days (0.6 vs 0.4), and walking is much more likely on sunny days (0.6 vs 0.1).

Position 1 (observing "shop"): Now the recurrence kicks in. For each current state, we consider both possible predecessors:

  • To reach Sunny: the best predecessor is Sunny (score 0.0756), since we benefit from both the high previous score and a favorable transition (sunny days tend to stay sunny)
  • To reach Rainy: again the best predecessor is Sunny (score 0.0432), despite the transition penalty, because the previous Sunny score was so much higher than Rainy

The result: The optimal path Sunny → Sunny → Rainy aligns with intuition: walking strongly suggests good weather, shopping is ambiguous (slightly favoring rain), and cleaning strongly suggests rain. The algorithm found this without examining all 8 possible paths. It computed only 12 values (2 states × 3 positions × 2 predecessor checks).

Out[10]:
Visualization
Grid showing two rows (Sunny, Rainy) and three columns (walk, shop, clean) with Viterbi scores and the optimal path highlighted in red.
Complete Viterbi trellis for the weather-activity example. Numbers in cells show the Viterbi score V[i][s] at each position. The red path marks the optimal sequence found by backtracking from the highest-scoring final state. Dashed arrows show the backpointers used for path reconstruction.

Log-Space Computation

Multiplying many small probabilities causes numerical underflow. For a sequence of 100 tokens with probabilities around 0.01, the path probability would be approximately 0.01100=102000.01^{100} = 10^{-200}, far below the smallest representable floating-point number (approximately 1030810^{-308} for 64-bit floats).

The solution is to work in log space. The logarithm function has a useful property: it converts products into sums. Since log(ab)=log(a)+log(b)\log(ab) = \log(a) + \log(b), we can rewrite the Viterbi recurrence entirely in terms of log probabilities:

logVi(s)=maxs[logVi1(s)+logP(ss)+logP(wis)]\log V_i(s) = \max_{s'} \left[ \log V_{i-1}(s') + \log P(s | s') + \log P(w_i | s) \right]

where:

  • logVi(s)\log V_i(s): the log probability of the best path ending at state ss at position ii
  • logVi1(s)\log V_{i-1}(s'): the log probability of the best path to predecessor state ss' (already computed)
  • logP(ss)\log P(s | s'): the log transition probability from ss' to ss
  • logP(wis)\log P(w_i | s): the log emission probability of word wiw_i from state ss

Log probabilities are negative since probabilities are between 0 and 1, and log(x)<0\log(x) < 0 when x<1x < 1. The best path has the highest (least negative) log probability. The max\max operation is unchanged because log is monotonic: if a>ba > b, then log(a)>log(b)\log(a) > \log(b). This means finding the maximum log probability gives the same answer as finding the maximum probability.

In[11]:
Code
import math


def viterbi_log_space(observations, states, start_prob, trans_prob, emit_prob):
    """
    Viterbi algorithm in log space for numerical stability.
    """
    n = len(observations)

    # Convert to log probabilities
    def safe_log(p):
        return math.log(p) if p > 0 else float("-inf")

    log_start = {s: safe_log(p) for s, p in start_prob.items()}
    log_trans = {k: safe_log(p) for k, p in trans_prob.items()}
    log_emit = {k: safe_log(p) for k, p in emit_prob.items()}

    # Viterbi in log space
    V = [{} for _ in range(n)]
    backpointer = [{} for _ in range(n)]

    # Base case
    for s in states:
        emit = log_emit.get((s, observations[0]), float("-inf"))
        V[0][s] = log_start.get(s, float("-inf")) + emit
        backpointer[0][s] = None

    # Recursive case
    for i in range(1, n):
        for s in states:
            best_prev = None
            best_score = float("-inf")

            emit = log_emit.get((s, observations[i]), float("-inf"))

            for s_prev in states:
                trans = log_trans.get((s_prev, s), float("-inf"))
                score = V[i - 1][s_prev] + trans + emit

                if score > best_score:
                    best_score = score
                    best_prev = s_prev

            V[i][s] = best_score
            backpointer[i][s] = best_prev

    # Find best final state
    best_final = max(states, key=lambda s: V[n - 1][s])
    best_log_prob = V[n - 1][best_final]

    # Backtrack
    path = [best_final]
    current = best_final
    for i in range(n - 1, 0, -1):
        current = backpointer[i][current]
        path.append(current)

    path.reverse()
    return path, best_log_prob


# Compare probability space vs log space
path_log, log_prob = viterbi_log_space(
    observations, states, start_prob, trans_prob, emit_prob
)
Out[12]:
Console
Log-Space vs Probability-Space Computation
=======================================================

Optimal path (both methods): Sunny → Rainy → Rainy

Probability space: P = 0.012960
Log space: log(P) = -4.345888
Converted back: exp(log(P)) = 0.012960

For long sequences, probability space would underflow:

  Sequence length: 100
  Per-step probability: 0.1
  Probability space: 1.0000000000000056e-100 (underflow to 0)
  Log space: -230.26 (computable)

Both methods find the identical optimal path, confirming that log-space computation preserves correctness. The probability-space result (best_prob) and the exponentiated log-space result match to numerical precision. For the 100-token demonstration, probability space underflows to exactly zero while log space produces a meaningful value of -230.26. This is why all production Viterbi implementations work in log space.

Out[13]:
Visualization
Two line plots. Left shows probability decaying exponentially to zero. Right shows log probability decreasing linearly and remaining stable.
Path probability decay in probability space vs log space as sequence length increases. Left: probability space values crash to zero (underflow) around length 30. Right: log-space values remain computationally stable, decreasing linearly. The red dashed line marks the smallest representable float64 value.

The visualization makes the underflow problem concrete. With a typical per-step probability of 0.1, probability-space values hit the float64 minimum around sequence length 308 and become indistinguishable from zero. Log-space values, in contrast, decrease linearly and remain perfectly representable regardless of sequence length. Even a 1000-word document produces a log probability around -2300, large but entirely computable.

Log-space computation is essential for practical implementations. The algorithm produces identical paths but with stable numerics.

Complexity Analysis

The Viterbi algorithm's efficiency comes from its dynamic programming structure. Let's analyze the time and space requirements.

Time Complexity

For a sequence of length nn with kk possible states:

  1. Initialization: O(k)O(k) operations to set up the first column
  2. Recursion: For each of the n1n-1 remaining positions, we compute kk cells. Each cell considers kk predecessors. Total: O(nk2)O(n \cdot k^2)
  3. Backtracking: O(n)O(n) to trace back through the path

The overall time complexity is O(nk2)O(n \cdot k^2), which is linear in sequence length and quadratic in the number of states.

In[14]:
Code
import time


def benchmark_viterbi(seq_lengths, n_states_list):
    """Benchmark Viterbi performance."""
    results = []

    for n_states in n_states_list:
        for seq_len in seq_lengths:
            # Create synthetic HMM
            states = [f"S{i}" for i in range(n_states)]
            obs = [f"O{i % 5}" for i in range(seq_len)]

            # Uniform probabilities for benchmarking
            start = {s: 1 / n_states for s in states}
            trans = {(s1, s2): 1 / n_states for s1 in states for s2 in states}
            emit = {(s, f"O{j}"): 0.2 for s in states for j in range(5)}

            # Time the algorithm
            start_time = time.perf_counter()
            for _ in range(3):  # Average over 3 runs
                viterbi_log_space(obs, states, start, trans, emit)
            elapsed = (time.perf_counter() - start_time) / 3

            results.append(
                {
                    "seq_len": seq_len,
                    "n_states": n_states,
                    "time_ms": elapsed * 1000,
                }
            )

    return results


# Run benchmarks
seq_lengths = [10, 50, 100, 200]
n_states_list = [5, 10, 20]
benchmarks = benchmark_viterbi(seq_lengths, n_states_list)
Out[15]:
Console
Viterbi Algorithm Benchmarks
=======================================================

Seq Length   States     Time (ms)   
-------------------------------------------------------
10           5          0.038
50           5          0.194
100          5          0.290
200          5          0.576
10           10         0.105
50           10         0.614
100          10         0.966
200          10         2.045
10           20         0.369
50           20         1.758
100          20         3.688
200          20         7.428

The benchmarks confirm the theoretical complexity. Doubling sequence length roughly doubles runtime (linear scaling), while doubling the number of states roughly quadruples runtime (quadratic scaling). Even with 20 states and 200-word sequences, Viterbi completes in tens of milliseconds, enabling real-time tagging in production systems.

Out[16]:
Visualization
Two line plots showing Viterbi runtime. Left plot shows linear growth with sequence length, right plot shows quadratic growth with number of states.
Empirical time complexity of the Viterbi algorithm. Left: time grows linearly with sequence length for a fixed number of states. Right: time grows quadratically with the number of states for a fixed sequence length. These curves confirm the theoretical O(n × k²) complexity.

Space Complexity

The algorithm stores:

  1. Viterbi table: O(nk)O(n \cdot k) scores
  2. Backpointers: O(nk)O(n \cdot k) references

Total space is O(nk)O(n \cdot k). However, if we only need the final path and not the full table, we can reduce space to O(k)O(k) by keeping only two columns at a time, though we'd need to run a second pass to recover the path.

Implementing Viterbi Efficiently

A production implementation incorporates several optimizations beyond the basic algorithm. Let's build a more efficient version.

In[17]:
Code
import numpy as np


class EfficientViterbi:
    """
    Optimized Viterbi implementation using NumPy.

    Uses matrix operations for faster computation on larger state spaces.
    """

    def __init__(self, states, trans_matrix, emit_matrix, start_probs):
        """
        Args:
            states: List of state names
            trans_matrix: (n_states, n_states) transition probabilities
            emit_matrix: (n_states, n_observations) emission probabilities
            start_probs: (n_states,) initial state probabilities
        """
        self.states = states
        self.n_states = len(states)

        # Store log probabilities
        self.log_trans = np.log(trans_matrix + 1e-10)
        self.log_emit = emit_matrix  # Will be indexed per observation
        self.log_start = np.log(start_probs + 1e-10)

    def decode(self, observation_indices):
        """
        Find the optimal state sequence for a sequence of observations.

        Args:
            observation_indices: List of observation indices

        Returns:
            path: List of state indices
            score: Log probability of the path
        """
        n = len(observation_indices)

        # Viterbi table: (sequence_length, n_states)
        V = np.full((n, self.n_states), float("-inf"))
        backpointers = np.zeros((n, self.n_states), dtype=int)

        # Base case
        obs_0 = observation_indices[0]
        V[0] = self.log_start + self.log_emit[:, obs_0]

        # Fill the trellis
        for i in range(1, n):
            obs_i = observation_indices[i]
            emit_i = self.log_emit[:, obs_i]

            # For each current state, find best predecessor
            # V[i-1] is (n_states,), log_trans is (n_states, n_states)
            # We want: for each current state s, max over s' of V[i-1, s'] + log_trans[s', s]

            # Broadcast: V[i-1, :, np.newaxis] + log_trans = (n_states, n_states)
            # where entry [s', s] = V[i-1, s'] + log_trans[s', s]
            scores = V[i - 1, :, np.newaxis] + self.log_trans

            # Max over predecessors (axis 0) gives best score for each current state
            best_scores = scores.max(axis=0)
            best_prevs = scores.argmax(axis=0)

            V[i] = best_scores + emit_i
            backpointers[i] = best_prevs

        # Find best final state
        best_final = V[n - 1].argmax()
        best_score = V[n - 1, best_final]

        # Backtrack
        path = [best_final]
        current = best_final
        for i in range(n - 1, 0, -1):
            current = backpointers[i, current]
            path.append(current)

        path.reverse()
        return path, best_score

This implementation uses NumPy's vectorized operations to compute scores for all states simultaneously, avoiding Python loops over states. The key insight is that the inner loop over predecessors becomes a matrix operation.

In[18]:
Code
# Demonstrate the efficient implementation

# Create a small HMM
n_states = 3
n_obs = 4

# Random transition and emission matrices
np.random.seed(42)
trans_matrix = np.random.dirichlet(np.ones(n_states), size=n_states)
emit_matrix = np.random.dirichlet(np.ones(n_obs), size=n_states)
start_probs = np.ones(n_states) / n_states

# Convert emission to log space
log_emit = np.log(emit_matrix + 1e-10)

# Create decoder
states = ["State0", "State1", "State2"]
decoder = EfficientViterbi(states, trans_matrix, log_emit, start_probs)

# Decode a sequence
observations = [0, 1, 2, 1, 0, 3, 2]
path, score = decoder.decode(observations)
path_names = [states[i] for i in path]
Out[19]:
Console
Efficient NumPy Viterbi
=======================================================

Transition matrix shape: (3, 3)
Emission matrix shape: (3, 4)

Observation sequence: [0, 1, 2, 1, 0, 3, 2]
Decoded path: ['State2', 'State1', 'State0', 'State1', 'State0', 'State1', 'State0']
Log probability: -11.7855

The NumPy implementation produces the same results as the dictionary-based version but runs significantly faster for large state spaces. The log probability is negative (as expected for log probabilities) and represents the joint probability of the path and observations. The vectorized approach eliminates Python loops over states, leveraging NumPy's optimized C implementations.

Sparse Transitions

Many HMMs have sparse transition matrices where most transitions have zero probability. We can exploit this by only considering valid transitions:

In[20]:
Code
def viterbi_sparse(
    observations, states, start_prob, valid_transitions, emit_prob
):
    """
    Viterbi with sparse transition handling.

    Args:
        valid_transitions: Dict mapping state -> list of valid predecessor states
                          with their log probabilities
    """
    n = len(observations)
    V = [{} for _ in range(n)]
    backpointer = [{} for _ in range(n)]

    # Base case
    for s in states:
        emit = emit_prob.get((s, observations[0]), float("-inf"))
        V[0][s] = start_prob.get(s, float("-inf")) + emit
        backpointer[0][s] = None

    # Recursion - only consider valid predecessors
    for i in range(1, n):
        for s in states:
            best_prev = None
            best_score = float("-inf")
            emit = emit_prob.get((s, observations[i]), float("-inf"))

            # Only iterate over valid predecessors
            for s_prev, log_trans in valid_transitions.get(s, []):
                score = V[i - 1][s_prev] + log_trans + emit
                if score > best_score:
                    best_score = score
                    best_prev = s_prev

            if best_score > float("-inf"):
                V[i][s] = best_score
                backpointer[i][s] = best_prev

    # Find best path
    valid_finals = [s for s in states if s in V[n - 1]]
    if not valid_finals:
        return None, float("-inf")

    best_final = max(valid_finals, key=lambda s: V[n - 1][s])
    best_score = V[n - 1][best_final]

    path = [best_final]
    current = best_final
    for i in range(n - 1, 0, -1):
        current = backpointer[i][current]
        if current is None:
            break
        path.append(current)

    path.reverse()
    return path, best_score

For BIO tagging where I-tags can only follow B-tags or I-tags of the same type, sparse transitions significantly reduce computation. With 9 entity types, you have 19 tags: one O tag plus B and I tags for each entity type (9 × 2 + 1 = 19). A dense transition matrix has 192=36119^2 = 361 entries, but valid transitions are far fewer since I-LOC cannot follow B-PER.

Viterbi for Beam Search Foundation

While Viterbi finds the single best path, many applications need the kk best paths. Neural sequence models often use beam search, which maintains the top kk partial hypotheses at each step. Beam search is essentially a pruned version of Viterbi.

Beam Search

Beam search is a heuristic search algorithm that explores the most promising paths in parallel. At each step, it keeps only the top kk (beam width) partial sequences, pruning less promising alternatives.

The connection to Viterbi is direct: if we set the beam width to equal the number of states, beam search becomes equivalent to Viterbi. Smaller beam widths trade optimality for speed.

In[21]:
Code
import heapq


def beam_search_viterbi(
    observations, states, start_prob, trans_prob, emit_prob, beam_width=3
):
    """
    Beam search approximation to Viterbi.

    Keeps only the top beam_width hypotheses at each step.
    """
    n = len(observations)

    # Initialize with start probabilities
    # Each hypothesis is (score, path)
    beam = []
    for s in states:
        emit = emit_prob.get((s, observations[0]), 1e-10)
        score = math.log(start_prob.get(s, 1e-10)) + math.log(emit)
        beam.append((score, [s]))

    # Keep top k
    beam = heapq.nlargest(beam_width, beam, key=lambda x: x[0])

    # Extend hypotheses
    for i in range(1, n):
        new_beam = []

        for score, path in beam:
            prev_state = path[-1]
            emit_obs = observations[i]

            for s in states:
                trans = trans_prob.get((prev_state, s), 1e-10)
                emit = emit_prob.get((s, emit_obs), 1e-10)
                new_score = score + math.log(trans) + math.log(emit)
                new_beam.append((new_score, path + [s]))

        # Prune to beam width
        beam = heapq.nlargest(beam_width, new_beam, key=lambda x: x[0])

    # Return best hypothesis
    best_score, best_path = max(beam, key=lambda x: x[0])
    return best_path, best_score


# Compare Viterbi (full) vs beam search
beam_widths = [1, 2, 3, len(states)]  # 1 is greedy, len(states) is full Viterbi
results = []

for bw in beam_widths:
    path, score = beam_search_viterbi(
        observations, states, start_prob, trans_prob, emit_prob, beam_width=bw
    )
    results.append((bw, path, score))
Out[22]:
Console
Beam Search vs Viterbi
=======================================================

greedy      : State0 → State0 → State0 → State0 → State0 → State0 → State0
             score = -322.3619

beam=2      : State0 → State0 → State0 → State0 → State0 → State0 → State0
             score = -322.3619

Viterbi     : State0 → State0 → State0 → State0 → State0 → State0 → State0
             score = -322.3619

Viterbi     : State0 → State0 → State0 → State0 → State0 → State0 → State0
             score = -322.3619

Beam width vs optimality:
  k=1: matches optimal ✓
  k=2: matches optimal ✓
  k=3: matches optimal ✓
  k=3: matches optimal ✓

The comparison reveals the tradeoff between beam width and optimality. Greedy search (k=1) may find a suboptimal path because it prunes too aggressively. As beam width increases, we retain more hypotheses and are more likely to find the optimal solution. With beam width equal to the number of states, beam search becomes exact Viterbi. In practice, small beam widths (3-10) often find near-optimal solutions while running much faster than exact Viterbi.

Beam search with k=1k=1 is greedy decoding. As kk increases, we approach Viterbi's optimal solution. The tradeoff is computation time: beam search is O(nkS)O(n \cdot k \cdot |S|) where kk is the beam width, compared to Viterbi's O(nS2)O(n \cdot |S|^2).

Constrained Decoding

In sequence labeling, certain tag transitions are invalid. An I-LOC tag cannot follow a B-PER tag. Rather than learning these constraints from data, we can enforce them during decoding by setting invalid transition probabilities to zero (or negative infinity in log space).

In[23]:
Code
def get_bio_constraints(entity_types):
    """
    Generate valid transition constraints for BIO tagging.

    Returns a set of valid (prev_tag, curr_tag) pairs.
    """
    tags = ["O"]
    for etype in entity_types:
        tags.extend([f"B-{etype}", f"I-{etype}"])

    valid_transitions = set()

    for prev_tag in tags + ["<START>"]:
        for curr_tag in tags:
            # O can follow anything
            if curr_tag == "O":
                valid_transitions.add((prev_tag, curr_tag))
                continue

            # B-X can follow anything (starts new entity)
            if curr_tag.startswith("B-"):
                valid_transitions.add((prev_tag, curr_tag))
                continue

            # I-X can only follow B-X or I-X
            if curr_tag.startswith("I-"):
                curr_type = curr_tag[2:]
                if prev_tag == f"B-{curr_type}" or prev_tag == f"I-{curr_type}":
                    valid_transitions.add((prev_tag, curr_tag))

    return valid_transitions


# Generate constraints
entity_types = ["PER", "LOC", "ORG"]
valid = get_bio_constraints(entity_types)
Out[24]:
Console
BIO Tagging Constraints
=======================================================

Entity types: ['PER', 'LOC', 'ORG']
Tags: O, B-PER, I-PER, B-LOC, I-LOC, B-ORG, I-ORG

Sample valid transitions:
  O        → B-PER    ✓
  B-PER    → I-PER    ✓
  I-PER    → I-PER    ✓
  I-PER    → B-LOC    ✓
  B-LOC    → O        ✓

Sample invalid transitions:
  O        → I-PER    ✗
  B-PER    → I-LOC    ✗
  I-LOC    → I-PER    ✗

Constraint summary: 38 valid transitions out of 49 possible

The constraints follow BIO semantics: I-tags can only continue entities of the same type, while B-tags and O can follow any tag. Invalid transitions like O → I-PER (starting an entity with a continuation tag) or B-PER → I-LOC (continuing a PER entity as LOC) are explicitly forbidden. By setting these transition probabilities to negative infinity during Viterbi decoding, we guarantee the output is always a well-formed BIO sequence.

Out[25]:
Visualization
Heatmap showing 7x7 matrix of BIO tag transitions. Valid transitions are green, invalid are red. Clear diagonal pattern visible for I-tags.
BIO transition constraint matrix as a heatmap. Green cells indicate valid transitions, red cells indicate forbidden transitions. Notice the diagonal pattern in the I-tag region: I-PER can only follow B-PER or I-PER, never tags of different entity types. This sparse structure enables significant computational optimization.

The heatmap reveals the structure of BIO constraints at a glance. The first row and column (O tag) show that O can follow anything and anything can precede O. The B-tags (columns 2, 4, 6) are fully green because they can follow any tag since they start new entities. The distinctive pattern appears in the I-tag columns: each I-tag has exactly two green cells, corresponding to its matching B and I tags. This sparsity (25 valid out of 49 possible transitions) enables significant optimization during Viterbi decoding.

Constrained decoding ensures the output is always a valid BIO sequence, even if the model's raw scores would suggest otherwise. This is particularly important for neural models that don't explicitly model transition constraints.

Limitations and Practical Considerations

The Viterbi algorithm is optimal for first-order Markov models where the current state depends only on the previous state. Several limitations arise in practice:

  • Higher-order dependencies exist in natural language. The optimal tag for a word might depend on tags two or three positions back. Extending Viterbi to second-order Markov models is possible but increases complexity to O(nk3)O(n \cdot k^3) since we must track pairs of previous states.

  • Non-local features are common in neural sequence models. A BiLSTM encodes the entire sentence before making predictions, so the score at each position depends on the full context, not just the previous tag. CRF layers on top of neural models use Viterbi for decoding, but the emission scores come from the neural network rather than simple probability tables.

  • Approximate inference becomes necessary with large state spaces. In machine translation, the "states" are partial translations with exponentially many possibilities. Beam search provides a practical approximation, trading optimality for tractability.

  • Label dependencies beyond transitions sometimes matter. In some tagging schemes, the entity type of a word might depend on entity types earlier in the sentence, not just the immediately preceding tag. Such long-range dependencies require more sophisticated inference methods.

Despite these limitations, Viterbi remains the workhorse algorithm for sequence labeling. Its efficiency, optimality guarantees for Markov models, and straightforward implementation make it the right choice for most POS tagging and NER systems. When combined with neural feature extractors and CRF output layers, it achieves state-of-the-art results on many benchmarks.

Summary

The Viterbi algorithm solves the optimal path problem for sequence labeling. Given a sequence of observations and a model with transition and emission probabilities, it finds the single best sequence of hidden states in time linear in sequence length.

The key ideas covered in this chapter:

Dynamic programming makes the exponential search space tractable. By storing the best score for each state at each position, we avoid recomputing shared subsequences. The O(nk2)O(n \cdot k^2) complexity enables real-time tagging of long documents.

Backpointer tracking enables path reconstruction. At each cell in the Viterbi table, we record which predecessor achieved the maximum score. After filling the table, we trace back from the best final state to recover the optimal path.

Log-space computation prevents numerical underflow. Working with log probabilities converts products to sums, keeping values in a representable range even for long sequences. This is essential for any practical implementation.

Constrained decoding enforces structural requirements. For BIO tagging, we set invalid transition probabilities to zero, ensuring the output is always a valid tag sequence regardless of what the model's raw scores suggest.

Beam search connects to Viterbi as an approximation. By keeping only the top kk hypotheses at each step, beam search trades optimality for speed. With beam width equal to the number of states, it recovers Viterbi exactly.

Key Parameters and Functions

When implementing or using Viterbi decoding, these parameters control behavior:

  • states: The set of possible hidden states (tags). Larger state sets increase computation quadratically
  • transition_prob: Matrix or dictionary of P(sisi1)P(s_i | s_{i-1}) values. Sparse matrices enable optimization
  • emission_prob: Matrix or dictionary of P(wisi)P(w_i | s_i) values. In neural models, these come from network outputs
  • start_prob: Initial state distribution P(s1)P(s_1). Often uniform or learned from data
  • beam_width: For approximate decoding, controls the number of hypotheses retained. Larger values approach exact Viterbi

The next chapter introduces Conditional Random Fields, which generalize HMMs by allowing arbitrary feature functions while still using Viterbi for efficient decoding.

Quiz

Ready to test your understanding? Take this quick quiz to reinforce what you've learned about the Viterbi algorithm and optimal sequence decoding.

Loading component...

Comments

Reference

BIBTEXAcademic
@misc{viterbialgorithmdynamicprogrammingforoptimalsequencedecoding, author = {Michael Brenndoerfer}, title = {Viterbi Algorithm: Dynamic Programming for Optimal Sequence Decoding}, year = {2025}, url = {https://mbrenndoerfer.com/writing/viterbi-algorithm-sequence-labeling}, organization = {mbrenndoerfer.com}, note = {Accessed: 2025-12-15} }
APAAcademic
Michael Brenndoerfer (2025). Viterbi Algorithm: Dynamic Programming for Optimal Sequence Decoding. Retrieved from https://mbrenndoerfer.com/writing/viterbi-algorithm-sequence-labeling
MLAAcademic
Michael Brenndoerfer. "Viterbi Algorithm: Dynamic Programming for Optimal Sequence Decoding." 2025. Web. 12/15/2025. <https://mbrenndoerfer.com/writing/viterbi-algorithm-sequence-labeling>.
CHICAGOAcademic
Michael Brenndoerfer. "Viterbi Algorithm: Dynamic Programming for Optimal Sequence Decoding." Accessed 12/15/2025. https://mbrenndoerfer.com/writing/viterbi-algorithm-sequence-labeling.
HARVARDAcademic
Michael Brenndoerfer (2025) 'Viterbi Algorithm: Dynamic Programming for Optimal Sequence Decoding'. Available at: https://mbrenndoerfer.com/writing/viterbi-algorithm-sequence-labeling (Accessed: 12/15/2025).
SimpleBasic
Michael Brenndoerfer (2025). Viterbi Algorithm: Dynamic Programming for Optimal Sequence Decoding. https://mbrenndoerfer.com/writing/viterbi-algorithm-sequence-labeling
Michael Brenndoerfer

About the author: Michael Brenndoerfer

All opinions expressed here are my own and do not reflect the views of my employer.

Michael currently works as an Associate Director of Data Science at EQT Partners in Singapore, leading AI and data initiatives across private capital investments.

With over a decade of experience spanning private equity, management consulting, and software engineering, he specializes in building and scaling analytics capabilities from the ground up. He has published research in leading AI conferences and holds expertise in machine learning, natural language processing, and value creation through data.

Stay updated

Get notified when I publish new articles on data and AI, private equity, technology, and more.

No spam, unsubscribe anytime.

or

Create a free account to unlock exclusive features, track your progress, and join the conversation.

No popupsUnobstructed readingCommenting100% Free