A comprehensive exploration of Andrew Viterbi's groundbreaking 1967 algorithm that revolutionized sequence decoding. Learn how dynamic programming made optimal inference in Hidden Markov Models computationally feasible, transforming speech recognition, part-of-speech tagging, and sequence labeling tasks in natural language processing.

1967: Viterbi Algorithm
By the mid-1960s, researchers working on communication systems faced a fundamental computational challenge. They were developing decoding algorithms for convolutional codes, a type of error-correcting code used in digital communication. The problem seemed straightforward: given a noisy received signal, determine the most likely sequence of bits that the transmitter intended to send. But the straightforward approach—examining every possible bit sequence and computing its probability—was computationally intractable. With sequences of even modest length, the number of possibilities exploded exponentially, making exhaustive search completely impractical.
Andrew Viterbi, then at the University of California, Los Angeles and later at the University of Southern California, recognized that this combinatorial explosion could be tamed. The key insight was that not all possible sequences needed to be considered independently. The structure of the problem allowed for a dynamic programming solution that could find the optimal sequence efficiently by breaking it into smaller, overlapping subproblems. In 1967, Viterbi published the algorithm that would bear his name, transforming what had been an exponentially hard problem into one solvable in polynomial time.
The Viterbi algorithm introduced a powerful technique that extended far beyond its original application in error-correcting codes. Within a few years, researchers in speech recognition realized that the same mathematical structure—finding the most likely sequence of hidden states given observations—appeared throughout natural language processing. When processing spoken language, the acoustic signal doesn't directly reveal the words being spoken. Instead, you observe acoustic features and must infer the underlying sequence of phonemes or words. This was exactly the kind of decoding problem the Viterbi algorithm was designed to solve.
By the 1980s, the Viterbi algorithm had become central to Hidden Markov Models (HMMs) in speech recognition. Every major speech recognition system used it to decode acoustic observations into word sequences. But its influence extended further: part-of-speech taggers used Viterbi decoding to assign grammatical categories to words in sentences. Named entity recognition systems used it to identify sequences of words representing people, places, or organizations. The algorithm's efficiency and optimality made it the standard solution whenever the task involved finding the best sequence alignment between observations and hidden states.
The impact of the Viterbi algorithm on language AI was profound. It demonstrated that dynamic programming could solve seemingly intractable problems in sequence modeling, establishing a pattern that would persist through decades of research. More fundamentally, it showed that optimal inference in sequential models was computationally feasible, making probabilistic sequence models practical tools rather than theoretical curiosities. The algorithm remains in use today, both as a direct tool in HMM-based systems and as a conceptual foundation for understanding sequence decoding in neural architectures.
The Problem: Finding Needles in Exponential Haystacks
The core challenge that motivated the Viterbi algorithm appears whenever you need to determine the most likely hidden sequence from observations. In error-correcting codes, the hidden sequence is the original message bits, and the observations are the noisy received signal. In speech recognition, the hidden sequence is the words or phonemes being spoken, and the observations are acoustic features extracted from audio. In part-of-speech tagging, the hidden sequence is the grammatical categories of words, and the observations are the words themselves.
In each case, the challenge is the same: given an observation sequence of length $T$, how do you find the hidden state sequence that most likely produced those observations? If you have $N$ possible states at each time step, there are $N^T$ possible sequences to consider. For even modest values—say, $N = 50$ states and $T = 20$ observations—you're dealing with $50^{20}$ (roughly $10^{34}$) possible sequences, a number so large that enumerating all possibilities is computationally impossible.
The naive approach would be to compute the probability of each possible sequence given the observations and select the one with the highest probability. But this requires evaluating an exponential number of sequences. Even if each evaluation took only a microsecond, examining all sequences would take longer than the age of the universe. Clearly, a different approach was needed.
The key insight that makes the Viterbi algorithm possible is the Markov property: the probability of being in a particular state at time $t$ depends only on the state at time $t-1$, not on the entire history of states. This means that optimal sequences have optimal prefixes. If the optimal sequence ending at time $T$ passes through state $s_i$ at time $t$, then the path from the start to state $s_i$ at time $t$ must also be optimal. Otherwise, you could improve the overall sequence by replacing that suboptimal prefix with an optimal one.
The problem becomes even more challenging when you consider that observations can be ambiguous. In speech recognition, the same acoustic feature might be consistent with multiple different phonemes. In part-of-speech tagging, the word "bank" could be a noun (financial institution) or a verb (to deposit). The system must use context—both the transition probabilities between states and the observation probabilities for each state—to resolve these ambiguities. But doing so requires reasoning about sequences, not just individual decisions.
Before the Viterbi algorithm, researchers attempted various approximations. Some systems made greedy decisions, choosing the most likely state at each time step independently without considering how these choices affected later decisions. This local optimization often led to globally suboptimal sequences. Other systems explored only a subset of possible sequences using heuristics, but couldn't guarantee finding the true optimal solution. What was needed was an algorithm that could efficiently find the globally optimal sequence without exhaustive enumeration.
The Solution: Dynamic Programming Through Time
The Viterbi algorithm solves the sequence decoding problem using dynamic programming, a technique that builds up solutions to complex problems by combining solutions to simpler subproblems. Instead of considering all possible sequences simultaneously, the algorithm processes the observations sequentially, maintaining for each state at each time step the probability of the most likely path ending in that state, along with a pointer to the previous state in that optimal path.
Formally, if we let $\delta_t(j)$ represent the probability of the most likely sequence of states ending in state $j$ at time $t$, the algorithm computes these values recursively. To compute $\delta_t(j)$, we consider all possible previous states $i$ and find which transition maximizes the combined probability:

$$\delta_t(j) = \max_i \left[ \delta_{t-1}(i) \, a_{ij} \right] b_j(o_t)$$

where $a_{ij}$ is the transition probability from state $i$ to state $j$, and $b_j(o_t)$ is the probability of observing $o_t$ given that we're in state $j$. The algorithm also maintains backpointers that record which previous state achieved this maximum, allowing reconstruction of the optimal sequence once all observations have been processed.
The algorithm begins by initializing $\delta_1(j) = \pi_j \, b_j(o_1)$ for all states $j$, using the initial state probabilities $\pi_j$ and the first observation's emission probabilities. Then it iterates forward through time, computing $\delta_t(j)$ for all states at each time step. After processing all observations, it finds the state that maximizes $\delta_T(j)$ and uses the backpointers to trace backwards through time, reconstructing the optimal state sequence.
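To make the recursion concrete, here is a minimal sketch in Python. The function name, argument layout, and use of NumPy arrays are choices made for this illustration rather than any particular library's API; the model parameters are assumed to be given as arrays indexed by state and observation.

```python
import numpy as np

def viterbi(obs, start_p, trans_p, emit_p):
    """Most likely state sequence for an observation sequence (illustrative sketch).

    obs     : list of observation indices, length T
    start_p : start_p[j]    = P(first state is j),              shape (N,)
    trans_p : trans_p[i, j] = P(state j at t | state i at t-1), shape (N, N)
    emit_p  : emit_p[j, o]  = P(observation o | state j),       shape (N, V)
    """
    T, N = len(obs), len(start_p)
    delta = np.zeros((T, N))              # delta[t, j]: probability of the best path ending in state j at time t
    backptr = np.zeros((T, N), dtype=int)

    # Initialization with the first observation
    delta[0] = start_p * emit_p[:, obs[0]]

    # Recursion: extend the best path into each state at each time step
    for t in range(1, T):
        for j in range(N):
            scores = delta[t - 1] * trans_p[:, j]
            backptr[t, j] = int(np.argmax(scores))
            delta[t, j] = scores[backptr[t, j]] * emit_p[j, obs[t]]

    # Termination and backtrace
    best_last = int(np.argmax(delta[T - 1]))
    path = [best_last]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1], float(delta[T - 1, best_last])
```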
The key efficiency of the Viterbi algorithm comes from pruning. At each time step, for each state, we keep only the single best path leading to that state. All other paths are discarded because they cannot become optimal later. This pruning reduces the computational complexity from $O(N^T)$ (exponential) to $O(N^2 T)$ (polynomial). For $N = 50$ states and $T = 20$ observations, this reduces the computation from roughly $10^{34}$ path evaluations to approximately $50{,}000$ operations—a reduction by a factor of roughly $10^{29}$.
The algorithm's correctness relies on the optimal substructure property: if the optimal sequence passes through state $s_i$ at time $t$, then the optimal path from the start to state $s_i$ at time $t$ is part of the globally optimal sequence. This means we can safely discard all non-optimal paths to state $s_i$ at time $t$, confident that they cannot contribute to an optimal solution. This property holds because of the Markov assumption—future states depend only on the current state, not on the path taken to reach it.
Implementation Details: Managing Memory and Precision
Implementing the Viterbi algorithm efficiently requires careful attention to numerical stability and memory management. When computing probabilities, multiplying many small numbers together can lead to numerical underflow—probabilities become so small that they round to zero in floating-point arithmetic. To avoid this, implementations typically work in the log domain, converting multiplications to additions:

$$\log \delta_t(j) = \max_i \left[ \log \delta_{t-1}(i) + \log a_{ij} \right] + \log b_j(o_t)$$
Working with log-probabilities prevents underflow at essentially no extra cost, since the repeated multiplications simply become additions. The algorithm computes the same optimal sequence whether using probabilities or log-probabilities, since the logarithm is a monotonic function that preserves ordering.
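A log-domain variant of the same sketch, assuming the parameters are already supplied as NumPy arrays of log-probabilities (names and layout again invented for the example):

```python
import numpy as np

def viterbi_log(obs, log_start, log_trans, log_emit):
    """Same recursion as above, carried out in the log domain to avoid underflow."""
    T, N = len(obs), len(log_start)
    delta = np.full((T, N), -np.inf)      # log-probability of the best path ending in each state
    backptr = np.zeros((T, N), dtype=int)

    delta[0] = log_start + log_emit[:, obs[0]]
    for t in range(1, T):
        for j in range(N):
            scores = delta[t - 1] + log_trans[:, j]   # multiplications become additions
            backptr[t, j] = int(np.argmax(scores))
            delta[t, j] = scores[backptr[t, j]] + log_emit[j, obs[t]]

    best_last = int(np.argmax(delta[T - 1]))
    path = [best_last]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]
```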
Memory requirements are modest. The algorithm needs to store $\delta_t(j)$ for all states at the current time step, plus the backpointers for all states at all time steps. For a sequence of length $T$ with $N$ states, this requires $O(NT)$ memory, which is typically manageable. In streaming settings, once all surviving paths agree on a common prefix, the backpointers for that prefix can be discarded and the corresponding output emitted early, though this requires more complex bookkeeping.
The algorithm can be parallelized to some extent. Within each time step, computing $\delta_t(j)$ for different states $j$ can be done in parallel, since these computations are independent. However, the sequential nature of the algorithm—processing time steps in order—limits the overall parallelization. Still, for systems with many states, this intra-timestep parallelism provides significant speedups on modern hardware.
Applications: From Communication to Language Processing
The earliest application of the Viterbi algorithm was in decoding convolutional codes for error correction in digital communication. When transmitting data over noisy channels (radio links, phone lines, storage media), convolutional codes add redundant information that allows the receiver to detect and correct errors. The decoder must determine the most likely transmitted sequence given the noisy received sequence—exactly the problem the Viterbi algorithm solves. By the 1970s, Viterbi decoding had become standard in satellite communication, modems, and digital cellular systems.
But it was in speech recognition that the algorithm found its most transformative application. In the 1970s and 1980s, as Hidden Markov Models became the dominant framework for speech recognition, the Viterbi algorithm became the standard method for decoding acoustic observations into word sequences. Every major speech recognition system came to rely on it: IBM's Tangora, Dragon Systems' DragonDictate and later NaturallySpeaking, and the research systems that preceded them all used Viterbi decoding.
The algorithm worked as follows: acoustic features extracted from speech audio were treated as observations. The HMM had states representing phonemes or sub-phonetic units, with transitions modeling how sounds follow each other in the language and emissions modeling the acoustic features associated with each sound. Given a sequence of acoustic features, the Viterbi algorithm found the most likely sequence of phonemes, which could then be mapped to words using a pronunciation dictionary and language model.
Part-of-speech tagging emerged as another major application. Given a sequence of words, a part-of-speech tagger assigns each word its grammatical category (noun, verb, adjective, etc.). HMM taggers modeled words as observations and grammatical categories as hidden states. Transition probabilities captured how grammatical categories follow each other in valid sentences (verbs often follow nouns, determiners precede nouns, etc.), while emission probabilities captured how likely each word is given each grammatical category. The Viterbi algorithm found the most likely sequence of tags, resolving ambiguous cases like whether "bank" is a noun or verb based on surrounding context.
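As a toy illustration of an HMM tagger, the snippet below reuses the viterbi_log sketch from earlier with a tiny three-tag model. Every number here is invented for the example rather than estimated from a corpus.

```python
import numpy as np

# Tiny illustrative HMM tagger: three tags, four words.
tags  = ["DET", "NOUN", "VERB"]
vocab = {"the": 0, "old": 1, "bank": 2, "closes": 3}

log_start = np.log(np.array([0.7, 0.2, 0.1]))        # P(first tag)
log_trans = np.log(np.array([                        # P(next tag | current tag)
    [0.05, 0.80, 0.15],   # after DET
    [0.10, 0.30, 0.60],   # after NOUN
    [0.50, 0.40, 0.10],   # after VERB
]))
log_emit = np.log(np.array([                         # P(word | tag)
    [0.90, 0.05, 0.03, 0.02],   # DET
    [0.02, 0.18, 0.60, 0.20],   # NOUN
    [0.02, 0.08, 0.30, 0.60],   # VERB
]))

sentence = ["the", "bank", "closes"]
obs = [vocab[w] for w in sentence]
path = viterbi_log(obs, log_start, log_trans, log_emit)    # sketch defined above
print(list(zip(sentence, [tags[t] for t in path])))
# [('the', 'DET'), ('bank', 'NOUN'), ('closes', 'VERB')]
```

Here the context supplied by "the" and "closes" pushes the decoder toward tagging "bank" as a noun, even though its emission probabilities alone do not settle the choice.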
Named entity recognition systems used similar approaches. Given a sentence, the task is to identify which words form named entities (people, organizations, locations) and classify them. Systems modeled this using sequence labeling, where each word is assigned a label like B-PERSON (beginning of person name), I-PERSON (inside person name), or O (outside any entity). HMMs with Viterbi decoding provided early solutions to this problem, though they were later largely replaced by more sophisticated sequence labeling methods.
Machine translation systems in the 1990s and early 2000s used the Viterbi algorithm in various ways. Some phrase-based translation systems used it to find the best sequence of phrase translations. Statistical alignment models used it to find the most likely word alignment between source and target sentences. Even when neural translation systems later replaced these approaches, the conceptual framework of finding optimal sequences persisted.
Limitations: When Optimal Isn't Optimal Enough
Despite its power, the Viterbi algorithm has limitations that motivated the development of alternative approaches. The most significant is its focus on finding a single optimal sequence. In many applications, you want not just the best sequence but a probability distribution over possible sequences, or the $k$ best sequences for further processing. The algorithm's pruning strategy discards all but the single best path to each state, losing information about alternative hypotheses.
This became problematic in applications like speech recognition, where you might want to generate multiple candidate transcriptions ranked by probability, or in machine translation where producing multiple translation hypotheses improves quality. The Viterbi algorithm gives you one answer, but sometimes you need more. Alternative algorithms like the $k$-best Viterbi or forward-backward algorithms address this by maintaining multiple hypotheses or computing full probability distributions.
The algorithm assumes that observations are conditionally independent given the hidden states—that the probability of observing $o_t$ depends only on the current state $s_t$, not on previous observations or states. This assumption is reasonable for some applications (like part-of-speech tagging) but problematic for others. In speech recognition, acoustic features can depend on broader context, and the independence assumption can lead to suboptimal decodings.
The Markov assumption itself—that future states depend only on the current state—is another limitation. This works well for short-range dependencies but fails to capture longer-range patterns in language. A word's grammatical category might depend on words several positions earlier in the sentence, not just the immediately preceding word. Viterbi decoding in HMMs can miss these long-range dependencies, leading to errors that more sophisticated models avoid.
The algorithm also assumes that all possible state transitions and observation emissions are known in advance. In practice, HMM parameters must be learned from training data, typically using the Baum-Welch algorithm (an instance of expectation-maximization). If these parameters are poorly estimated, the Viterbi algorithm will produce suboptimal results even though it finds the optimal sequence according to the model. The algorithm is only as good as the model it's decoding.
Computational efficiency, while much better than exhaustive search, can still be limiting for very large models. When the number of states is very large (tens of thousands or more, as in some modern language models), the $O(N^2 T)$ complexity becomes costly. Pruning techniques—discarding states with very low probabilities—can help, but they introduce approximations that might miss the true optimal sequence.
Finally, the algorithm's focus on finding the single most likely sequence can be misleading when probabilities are close. If two sequences have nearly identical probabilities, choosing one over the other might seem arbitrary, yet the algorithm must make a definitive choice. In such cases, the "optimal" sequence might not be meaningfully better than alternatives, and other criteria (like diversity, length, or domain-specific constraints) might be more relevant.
Legacy: Dynamic Programming as a Foundation
The Viterbi algorithm established dynamic programming as a fundamental technique in sequence modeling and language processing. Its influence extends far beyond the specific applications where it's directly used. The core insight—that optimal solutions to sequential problems can be found by maintaining the best solution to each subproblem and building up recursively—appears throughout language AI.
Modern neural sequence models, despite using very different architectures, still face the same fundamental challenge: finding good sequences. When transformer models generate text, they use beam search—a technique closely related in spirit to Viterbi's pruning strategy. Instead of exploring all possible sequences, beam search maintains the $k$ best partial sequences at each step, pruning away less promising candidates. This is conceptually similar to Viterbi's approach of maintaining the best path to each state, just adapted for autoregressive generation where sequences are built incrementally.
The forward-backward algorithm, used for computing probability distributions over sequences rather than finding optimal sequences, shares the same dynamic programming structure as Viterbi. Both algorithms process sequences forward through time, maintaining partial computations that are combined to produce final results. The mathematical framework that makes Viterbi efficient is the same framework that makes forward-backward efficient.
The progression from HMMs decoded with Viterbi to Conditional Random Fields (CRFs) to modern transformer models shows how the fundamental challenge persists even as solutions evolve. CRFs addressed HMM limitations by allowing arbitrary feature functions and avoiding the independence assumptions, but still used Viterbi-like dynamic programming for decoding. Transformers replaced sequence-to-sequence decoding with parallel processing, but still face the challenge of generating coherent sequences, now addressed through attention mechanisms and beam search rather than explicit dynamic programming.
The algorithm's impact on speech recognition was transformative. It made HMM-based speech recognition computationally feasible, enabling the development of practical systems that could run in real-time. Without efficient decoding, speech recognition would have remained a research curiosity rather than becoming the widely deployed technology it is today. The algorithm's efficiency made it possible to build systems that could handle large vocabularies and complex language models.
In computational linguistics, the Viterbi algorithm helped establish probabilistic approaches as practical alternatives to purely rule-based systems. Early NLP systems relied heavily on hand-crafted grammars and rules. Viterbi decoding in HMMs demonstrated that statistical models trained from data could achieve competitive performance, setting the stage for the data-driven revolution that would transform the field in the 1990s and 2000s.
The algorithm also influenced how researchers think about sequence problems. It established the pattern of formulating sequence tasks as finding optimal paths through graphs or lattices—a framework that appears in parsing (finding optimal parse trees), alignment (finding optimal word alignments), and segmentation (finding optimal boundaries). The conceptual shift from "process each element independently" to "find the optimal sequence as a whole" became fundamental to sequence modeling.
Modern Applications and Extensions
Even as neural networks and deep learning transformed language AI, the Viterbi algorithm and its extensions remain in active use. HMM-based systems still power many production applications, particularly where interpretability, speed, or training data limitations make neural approaches less attractive. Part-of-speech taggers, named entity recognizers, and some speech recognition systems continue to use Viterbi decoding.
Modern extensions address the algorithm's limitations. The $k$-best Viterbi algorithm finds not just the single optimal sequence but the $k$ most likely sequences, maintaining multiple hypotheses at each step. This is valuable for applications like machine translation or speech recognition where generating multiple candidates improves quality. Lattice-based decoding generalizes this further, maintaining a graph structure that compactly represents many possible sequences.
Pruning techniques make the algorithm practical for very large state spaces. Instead of considering all states at each step, systems maintain only the states with probabilities above a threshold or within the top $k$ by probability. While this introduces approximation, the speedup can be dramatic, and carefully tuned pruning often produces results very close to the true optimum.
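The sketch below shows what one such strategy might look like on top of the earlier log-domain function: simple histogram pruning that keeps only the `beam` highest-scoring states at each time step. The pruning scheme, function name, and parameters are illustrative choices, not a standard implementation.

```python
import numpy as np

def viterbi_beam(obs, log_start, log_trans, log_emit, beam=5):
    """Viterbi with histogram pruning: only the `beam` best states at each time
    step survive as possible predecessors. Much cheaper for large state spaces,
    but no longer guaranteed to find the true optimum."""
    T, N = len(obs), len(log_start)
    delta = np.full((T, N), -np.inf)
    backptr = np.zeros((T, N), dtype=int)

    delta[0] = log_start + log_emit[:, obs[0]]
    active = np.argsort(delta[0])[-beam:]            # surviving states at t = 0

    for t in range(1, T):
        for j in range(N):
            scores = delta[t - 1, active] + log_trans[active, j]
            k = int(np.argmax(scores))
            backptr[t, j] = int(active[k])
            delta[t, j] = scores[k] + log_emit[j, obs[t]]
        active = np.argsort(delta[t])[-beam:]        # prune again before the next step

    best_last = int(np.argmax(delta[T - 1]))
    path = [best_last]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]
```

This reduces the per-step work from $N^2$ transitions to roughly $N \times \mathrm{beam}$, at the cost of exactness.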
Neural-HMM hybrid systems combine the representational power of neural networks with the efficient decoding of Viterbi. Neural networks learn emission and transition probabilities from data, potentially capturing complex patterns that simple HMMs miss, while Viterbi decoding provides efficient inference. This combines the best of both worlds: neural learning of probabilities with dynamic programming for optimal sequence finding.
In computational biology, the Viterbi algorithm finds applications in sequence alignment, gene finding, and protein structure prediction. The same mathematical structure—finding optimal sequences given observations—appears throughout bioinformatics, and Viterbi decoding remains a standard tool despite the development of more sophisticated methods.
Connections to Modern Sequence Models
The relationship between the Viterbi algorithm and modern language models reveals how fundamental insights persist across paradigm shifts. Transformer models don't use explicit dynamic programming, but they face the same core challenge: generating or decoding sequences optimally. The attention mechanism that allows transformers to model dependencies across entire sequences can be viewed as a learned alternative to the Markov assumption—instead of assuming dependencies only to the previous state, attention learns which past positions are most relevant.
Beam search in neural language models is a direct conceptual descendant of Viterbi's pruning strategy. Instead of exploring all possible continuations, beam search maintains the $k$ most promising partial sequences, expanding only from those. This closely parallels what Viterbi does for each state—maintain the single best path and expand only from that. Beam search generalizes the idea to maintaining multiple hypotheses, but the core move of pruning unlikely paths remains the same.
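To make the analogy concrete, here is a toy beam search over an arbitrary next-token scorer. The scorer is a stand-in for a neural language model's output distribution, and all names and numbers are hypothetical.

```python
import math

def beam_search(next_logprobs, vocab_size, beam=3, length=4):
    """Toy beam search: keep only the `beam` highest-scoring partial sequences
    at every step, expanding each by all possible next tokens.

    next_logprobs(prefix) should return a list of `vocab_size` log-probabilities
    for the next token given the prefix (a stand-in for a language model)."""
    hypotheses = [([], 0.0)]                   # (token prefix, cumulative log-probability)
    for _ in range(length):
        candidates = []
        for prefix, score in hypotheses:
            logp = next_logprobs(prefix)
            for tok in range(vocab_size):
                candidates.append((prefix + [tok], score + logp[tok]))
        candidates.sort(key=lambda c: c[1], reverse=True)
        hypotheses = candidates[:beam]         # the Viterbi-style pruning step
    return hypotheses

def toy_scorer(prefix):
    # Uniform over 5 tokens at the start; afterwards slightly prefer repeating the last token.
    if not prefix:
        return [math.log(0.2)] * 5
    logp = [math.log(0.15)] * 5
    logp[prefix[-1]] = math.log(0.4)
    return logp

print(beam_search(toy_scorer, vocab_size=5, beam=2, length=3))
```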
The contrast between HMMs decoded with Viterbi and modern autoregressive models highlights an important trade-off. Viterbi decoding finds the globally optimal sequence according to the model, but requires processing the entire sequence before producing any output. Autoregressive models generate sequences left-to-right, making local decisions at each step. They're more flexible (can generate indefinitely long sequences) but don't guarantee global optimality—each decision is made based on what's been generated so far, without knowledge of future tokens.
Some modern systems attempt to bridge this gap. Non-autoregressive translation models generate all tokens in parallel, then use iterative refinement to improve the sequence—a process that shares some conceptual similarity with Viterbi's global optimization approach. These models trade the efficiency of autoregressive generation for the potential of better global coherence, though in practice they haven't yet surpassed autoregressive models in most applications.
The Viterbi algorithm also influenced how researchers think about training sequence models. The forward-backward algorithm, closely related to Viterbi, is used in the expectation step of Baum-Welch training for HMMs. This pattern—using efficient dynamic programming for inference during training—appears in other contexts, like the forward and backward passes in neural sequence models. The efficient computation of gradients through sequences relies on similar dynamic programming principles.
Conclusion: An Algorithm That Endured
When Andrew Viterbi published his algorithm in 1967, he solved a specific problem in digital communication: efficiently decoding convolutional codes. He probably couldn't have anticipated that the same algorithm would become central to speech recognition, part-of-speech tagging, machine translation, and countless other applications in language processing. Yet that's exactly what happened, and the algorithm remains in active use more than half a century later.
The Viterbi algorithm's longevity comes from its combination of optimality and efficiency. It solves the sequence decoding problem exactly—finding the truly optimal solution according to the model—while doing so in polynomial time rather than exponential time. That combination is rare for combinatorial search problems: many efficient algorithms are approximations, and many exact methods are intractable. Viterbi gives you both: optimality and efficiency.
More fundamentally, the algorithm established a way of thinking about sequential problems that has proven remarkably durable. The idea of maintaining optimal solutions to subproblems and building up recursively appears throughout sequence modeling, even when the specific algorithm differs. The conceptual framework—formulate as a graph search, maintain best paths, prune unlikely candidates—persists in modern systems even when the mathematical details change.
As language AI evolved from HMMs to CRFs to neural networks to transformers, each new paradigm faced the same fundamental challenge: processing sequences efficiently while capturing relevant patterns. The solutions evolved—from explicit dynamic programming to learned attention mechanisms—but the core problem remained. The Viterbi algorithm provided the first computationally feasible solution to sequence decoding, demonstrating that optimal inference in sequential models was possible.
Today, while many applications have moved beyond HMMs and Viterbi decoding, the algorithm remains relevant both as a practical tool and as a conceptual foundation. It continues to power production systems in speech recognition, part-of-speech tagging, and other sequence labeling tasks. More importantly, it established the pattern that subsequent advances would follow: find efficient ways to reason about sequences as wholes rather than processing elements independently. In the history of language AI, the Viterbi algorithm represents a moment when sequence modeling became computationally practical, setting the stage for everything that followed.