Hidden Markov Models: Probabilistic Sequence Labeling for NLP

Michael Brenndoerfer · December 16, 2025 · 26 min read

Learn how Hidden Markov Models use transition and emission probabilities to solve sequence labeling tasks like POS tagging, with a Python implementation.

Hidden Markov Models

Sequence labeling tasks like POS tagging require more than just looking up each word in a dictionary. The word "bank" could be a noun (a financial institution) or a verb (to bank on something). The word "flies" could be a noun (the insects) or a verb (the action of flying). Context determines the correct label, and that context often spans multiple words in a sequence.

Hidden Markov Models (HMMs) provide a principled probabilistic framework for sequence labeling. An HMM models two interacting processes: a hidden sequence of states (the tags you want to predict) and an observable sequence of outputs (the words you can see). The model learns statistical patterns about how states transition to each other and how each state generates observations. Given a new sentence, the HMM uses these learned patterns to infer the most likely hidden sequence.

This chapter covers HMM components and their probabilistic foundations. You'll learn how to estimate HMM parameters from annotated data, understand the key assumptions that make HMMs tractable, and implement a working POS tagger. We'll also examine where HMMs succeed and where they fall short, setting up the motivation for more advanced models like Conditional Random Fields.

The HMM Framework

An HMM defines a joint probability distribution over two sequences: a hidden state sequence and an observable output sequence. In POS tagging, the hidden states are the part-of-speech tags, and the observations are the words.

Hidden Markov Model

A Hidden Markov Model is a probabilistic sequence model with two components: an unobservable (hidden) state sequence that evolves according to Markov dynamics, and an observable output sequence in which each output depends only on the current hidden state. "Hidden" refers to the fact that you cannot observe the states directly; you can only infer them from the observations.

The model has five components:

  • States ($S$): A finite set of hidden states. For POS tagging, these are the tags like NN, VB, DT, JJ.
  • Observations ($V$): A finite vocabulary of observable symbols. These are the words in your corpus.
  • Initial probabilities ($\pi$): The probability of starting in each state. For sentences, this captures which tags typically begin sentences.
  • Transition probabilities ($A$): The probability of moving from one state to another. This captures tag-to-tag patterns like "determiners are often followed by nouns."
  • Emission probabilities ($B$): The probability of generating each observation from each state. This captures which words are associated with which tags.

Let's define these components mathematically. For a state sequence $s_1, s_2, \ldots, s_n$ and observation sequence $o_1, o_2, \ldots, o_n$, we define three probability distributions.

The initial probability captures how likely each state is to appear at the start of a sequence:

$$\pi_i = P(s_1 = i)$$

where $\pi_i$ is the probability that the first state in the sequence is state $i$.

The transition probability captures how states follow one another:

$$a_{ij} = P(s_t = j \,|\, s_{t-1} = i)$$

where:

  • $a_{ij}$: the probability of transitioning from state $i$ to state $j$
  • $s_t$: the state at position $t$ in the sequence
  • $s_{t-1}$: the state at the previous position

The emission probability captures how each state generates observations:

$$b_i(o) = P(o_t = o \,|\, s_t = i)$$

where:

  • $b_i(o)$: the probability of observing symbol $o$ when in state $i$
  • $o_t$: the observation at position $t$
  • $s_t$: the hidden state at position $t$

These three distributions fully characterize an HMM's behavior. The initial probabilities determine how sequences start, transitions determine how states evolve, and emissions determine what we observe.

To make these concepts concrete, let's build a toy HMM for POS tagging. We'll use just three states (determiner, noun, verb) and five words, which is enough to see the key patterns while keeping the numbers tractable.

In[2]:
Code
# A toy HMM for POS tagging with 3 states and 5 words
states = ["DT", "NN", "VB"]  # Determiner, Noun, Verb
vocab = ["the", "dog", "cat", "runs", "sleeps"]

# Initial probabilities: sentences often start with determiners
pi = {"DT": 0.6, "NN": 0.3, "VB": 0.1}

# Transition probabilities
# P(next_state | current_state)
A = {
    "DT": {"DT": 0.1, "NN": 0.8, "VB": 0.1},  # DT usually followed by NN
    "NN": {"DT": 0.1, "NN": 0.3, "VB": 0.6},  # NN often followed by VB
    "VB": {"DT": 0.3, "NN": 0.4, "VB": 0.3},  # VB can lead to various
}

# Emission probabilities
# P(word | state)
B = {
    "DT": {"the": 0.9, "dog": 0.02, "cat": 0.02, "runs": 0.03, "sleeps": 0.03},
    "NN": {"the": 0.01, "dog": 0.45, "cat": 0.45, "runs": 0.05, "sleeps": 0.04},
    "VB": {"the": 0.01, "dog": 0.02, "cat": 0.02, "runs": 0.50, "sleeps": 0.45},
}
Out[3]:
Console
HMM Components for POS Tagging
==================================================

States (Tags): ['DT', 'NN', 'VB']
Vocabulary: ['the', 'dog', 'cat', 'runs', 'sleeps']

Initial Probabilities π:
  P(start with DT) = 0.60
  P(start with NN) = 0.30
  P(start with VB) = 0.10

Sample Transition Probabilities A:
  P(NN | DT) = 0.8  (determiners often precede nouns)
  P(VB | NN) = 0.6  (nouns often precede verbs)

Sample Emission Probabilities B:
  P('the' | DT) = 0.9  (determiners emit 'the' frequently)
  P('dog' | NN) = 0.45  (nouns emit 'dog')
  P('runs' | VB) = 0.5  (verbs emit 'runs')

The transition matrix encodes grammatical patterns. Determiners (DT) are followed by nouns (NN) with probability 0.8, capturing the common English pattern "the dog" or "a cat." The emission matrix encodes lexical associations. The word "the" is emitted by DT with probability 0.9, while "runs" is emitted by VB with probability 0.5.

The Two HMM Assumptions

HMMs make two key independence assumptions that simplify computation. These assumptions trade off modeling power for tractability.

The Markov Assumption

The first assumption is the Markov property: the probability of transitioning to a state depends only on the current state, not on the full history.

$$P(s_t \,|\, s_1, s_2, \ldots, s_{t-1}) = P(s_t \,|\, s_{t-1})$$

where:

  • $s_t$: the state at position $t$ (what we want to predict)
  • $s_1, s_2, \ldots, s_{t-1}$: the full history of previous states
  • $s_{t-1}$: only the immediately preceding state

The equation states that knowing the entire history provides no additional information beyond knowing just the previous state. For POS tagging, this means the probability of a tag depends only on the previous tag, not on tags from earlier in the sentence. If the previous tag is DT (determiner), the probability of NN (noun) is the same regardless of what came before the determiner.

This is a strong assumption. Consider "the old old man" versus "the very old man." After one adjective has already appeared, the true probability of another adjective may differ from the probability of the first one, but a first-order HMM cannot capture this distinction.

The Output Independence Assumption

The second assumption is output independence: the probability of an observation depends only on the current state, not on other observations or states.

$$P(o_t \,|\, s_1, \ldots, s_n, o_1, \ldots, o_{t-1}, o_{t+1}, \ldots, o_n) = P(o_t \,|\, s_t)$$

where:

  • $o_t$: the observation at position $t$ (the word we see)
  • $s_1, \ldots, s_n$: all hidden states in the sequence
  • $o_1, \ldots, o_{t-1}, o_{t+1}, \ldots, o_n$: all other observations (past and future words)
  • $s_t$: only the current hidden state

The equation states that the probability of emitting a particular word depends solely on the current tag, not on any surrounding context. For POS tagging, this means the word "bank" has the same emission probability from the NN state regardless of surrounding words. The model cannot use the context "river bank" to increase the probability of the noun interpretation.

In[4]:
Code
# Demonstrating the Markov assumption
# P(NN | DT) is the same regardless of history

# History 1: Started with DT
history_1 = ["DT"]
# History 2: Started with VB, then DT
history_2 = ["VB", "DT"]

# Under Markov assumption, both give same P(NN | DT)
prob_nn_given_dt = A["DT"]["NN"]
Out[5]:
Console
The Markov Assumption in Action
----------------------------------------
History 1: ['DT']
History 2: ['VB', 'DT']

P(next=NN | current=DT) = 0.8

Both histories give the same probability because
HMM only conditions on the immediately previous state.
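
The output independence assumption can be illustrated the same way. Here is a minimal sketch using the toy emission table `B` defined earlier: the emission probability is a single stored value per word-tag pair, so it cannot change with the surrounding words.

```python
# Demonstrating the output independence assumption
# P('dog' | NN) is a single stored number, independent of context
contexts = [
    ["the", "dog", "runs"],     # 'dog' after a determiner
    ["runs", "dog", "sleeps"],  # 'dog' in an unusual context
]

for sentence in contexts:
    # The lookup ignores the neighboring words entirely
    print(f"Context {sentence}: P('dog' | NN) = {B['NN']['dog']}")
```

Both contexts print the same value (0.45 in the toy model), which is exactly the limitation described above.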

These assumptions enable efficient algorithms for inference and learning. The Viterbi algorithm (covered in the next chapter) finds the most likely state sequence in $O(n \cdot |S|^2)$ time, where $n$ is the sequence length and $|S|$ is the number of states. Without these assumptions, exact inference would require exponential time.

Computing Sequence Probabilities

Given an HMM, how do we compute the probability of a particular observation sequence and state sequence? The joint probability factors according to the HMM structure, exploiting the independence assumptions to decompose a complex probability into simple terms.

For an observation sequence $O = o_1, o_2, \ldots, o_n$ and state sequence $S = s_1, s_2, \ldots, s_n$, the joint probability is:

$$P(O, S) = P(s_1) \cdot P(o_1 \,|\, s_1) \cdot \prod_{t=2}^{n} P(s_t \,|\, s_{t-1}) \cdot P(o_t \,|\, s_t)$$

where:

  • $P(O, S)$: the joint probability of observing sequence $O$ while the hidden states follow sequence $S$
  • $P(s_1)$: the initial probability of starting in state $s_1$, denoted $\pi_{s_1}$
  • $P(o_1 \,|\, s_1)$: the emission probability of the first observation given the first state, denoted $b_{s_1}(o_1)$
  • $\prod_{t=2}^{n}$: the product over all positions from 2 to $n$
  • $P(s_t \,|\, s_{t-1})$: the transition probability from state $s_{t-1}$ to state $s_t$, denoted $a_{s_{t-1}, s_t}$
  • $P(o_t \,|\, s_t)$: the emission probability of observation $o_t$ from state $s_t$, denoted $b_{s_t}(o_t)$

The formula captures the generative story of an HMM: start in some state (with probability $\pi$), emit the first word (with probability $b$), then repeatedly transition to a new state (with probability $a$) and emit the next word (with probability $b$).

Let's compute this for a concrete example: the sentence "the dog runs" with tags "DT NN VB." We'll trace through each step of the computation, showing how the probabilities multiply together to give the joint probability of this particular word-tag pairing.

In[6]:
Code
def compute_joint_probability(observations, states, pi, A, B):
    """Compute P(observations, states) for an HMM."""
    n = len(observations)

    # Start with initial probability and first emission
    prob = pi[states[0]] * B[states[0]][observations[0]]

    # Multiply by transition and emission for each subsequent position
    for t in range(1, n):
        transition_prob = A[states[t - 1]][states[t]]
        emission_prob = B[states[t]][observations[t]]
        prob *= transition_prob * emission_prob

    return prob


# Our example sentence
observations = ["the", "dog", "runs"]
tag_sequence = ["DT", "NN", "VB"]

joint_prob = compute_joint_probability(observations, tag_sequence, pi, A, B)
Out[7]:
Console
Computing P('the dog runs', 'DT NN VB')
==================================================

1. Initial: P(DT) = 0.6
2. Emission: P('the'|DT) = 0.9
3. Transition: P(NN|DT) = 0.8
4. Emission: P('dog'|NN) = 0.45
5. Transition: P(VB|NN) = 0.6
6. Emission: P('runs'|VB) = 0.5

Joint probability: 0.6 × 0.9 × 0.8 × 0.45 × 0.6 × 0.5
                 = 0.058320

The joint probability is quite small because we're multiplying many probabilities together. For longer sequences, these products become vanishingly small, leading to numerical underflow. In practice, we work in log space, converting multiplications to additions.
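
As a minimal sketch of the log-space trick (the helper name here is ours; a fuller log-space computation appears in the worked example later in the chapter), the same joint probability can be computed by summing logarithms:

```python
import math


def compute_log_joint_probability(observations, states, pi, A, B):
    """Compute log P(observations, states) by summing log probabilities."""
    log_prob = math.log(pi[states[0]]) + math.log(B[states[0]][observations[0]])
    for t in range(1, len(observations)):
        log_prob += math.log(A[states[t - 1]][states[t]])
        log_prob += math.log(B[states[t]][observations[t]])
    return log_prob


log_p = compute_log_joint_probability(
    ["the", "dog", "runs"], ["DT", "NN", "VB"], pi, A, B
)
print(log_p)            # about -2.84
print(math.exp(log_p))  # exponentiating recovers 0.058320
```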

Parameter Estimation from Data

An HMM is only useful if we can learn its parameters from data. For POS tagging, we have access to annotated corpora where each word is labeled with its correct tag. This is supervised learning, and parameter estimation is straightforward: we count occurrences and normalize.

Maximum Likelihood Estimation

The maximum likelihood estimates for HMM parameters are simple frequency ratios derived from counting occurrences in the training data.

For initial probabilities, we count how often each state starts a sequence:

$$\hat{\pi}_i = \frac{\text{count}(s_1 = i)}{\text{count}(\text{sequences})}$$

where:

  • $\hat{\pi}_i$: the estimated probability of starting in state $i$
  • $\text{count}(s_1 = i)$: the number of sequences that begin with state $i$
  • $\text{count}(\text{sequences})$: the total number of sequences in the training data

For transition probabilities, we count how often state $j$ follows state $i$:

$$\hat{a}_{ij} = \frac{\text{count}(s_{t-1} = i, s_t = j)}{\text{count}(s_{t-1} = i)}$$

where:

  • $\hat{a}_{ij}$: the estimated probability of transitioning from state $i$ to state $j$
  • $\text{count}(s_{t-1} = i, s_t = j)$: the number of times state $i$ is immediately followed by state $j$
  • $\text{count}(s_{t-1} = i)$: the total number of times state $i$ appears, excluding the final position of each sequence

For emission probabilities, we count how often state $i$ emits observation $o$:

$$\hat{b}_i(o) = \frac{\text{count}(s_t = i, o_t = o)}{\text{count}(s_t = i)}$$

where:

  • $\hat{b}_i(o)$: the estimated probability of emitting observation $o$ from state $i$
  • $\text{count}(s_t = i, o_t = o)$: the number of times state $i$ emits observation $o$
  • $\text{count}(s_t = i)$: the total number of times state $i$ appears

In words: the initial probability of state $i$ is the fraction of sequences that start with $i$. The transition probability from $i$ to $j$ is the fraction of times $i$ is followed by $j$. The emission probability of $o$ from $i$ is the fraction of times state $i$ emits $o$.

Let's implement this with NLTK's tagged corpora. The Brown corpus contains over 57,000 sentences with POS tags, giving us enough data to estimate reliable probabilities. We'll use the Universal tagset, which maps the fine-grained Brown tags to 12 coarse categories for cleaner patterns.

In[8]:
Code
import nltk
from nltk.corpus import brown

# Download required data (quick no-ops if already downloaded)
nltk.download("brown", quiet=True)
nltk.download("universal_tagset", quiet=True)

# Load Brown corpus with Universal POS tags
tagged_sents = brown.tagged_sents(tagset="universal")
In[9]:
Code
from collections import defaultdict


def estimate_hmm_parameters(tagged_sentences):
    """Estimate HMM parameters from tagged sentences using MLE."""

    # Counts for estimation
    initial_counts = defaultdict(int)
    transition_counts = defaultdict(lambda: defaultdict(int))
    emission_counts = defaultdict(lambda: defaultdict(int))
    tag_counts = defaultdict(int)

    for sentence in tagged_sentences:
        if len(sentence) == 0:
            continue

        # Count initial state
        first_tag = sentence[0][1]
        initial_counts[first_tag] += 1

        # Count emissions and transitions
        prev_tag = None
        for word, tag in sentence:
            word = word.lower()  # Normalize case

            emission_counts[tag][word] += 1
            tag_counts[tag] += 1

            if prev_tag is not None:
                transition_counts[prev_tag][tag] += 1

            prev_tag = tag

    # Convert counts to probabilities
    num_sentences = len(tagged_sentences)

    # Initial probabilities
    pi = {tag: count / num_sentences for tag, count in initial_counts.items()}

    # Transition probabilities
    A = {}
    for tag1, next_tags in transition_counts.items():
        total = sum(next_tags.values())
        A[tag1] = {tag2: count / total for tag2, count in next_tags.items()}

    # Emission probabilities
    B = {}
    for tag, words in emission_counts.items():
        total = tag_counts[tag]
        B[tag] = {word: count / total for word, count in words.items()}

    return pi, A, B, tag_counts


# Estimate parameters from Brown corpus
pi_est, A_est, B_est, tag_counts = estimate_hmm_parameters(tagged_sents)
Out[10]:
Console
HMM Parameters Estimated from Brown Corpus
==================================================
Training sentences: 57,340
Unique tags: 12

Transition Probabilities (examples):
  P(NOUN | DET)  = 0.6266
  P(VERB | NOUN) = 0.1593
  P(ADV | VERB)  = 0.1033

Emission Probabilities (examples):
  P('the' | DET)   = 0.5106
  P('is' | VERB)   = 0.0553
  P('good' | ADJ)  = 0.0087

The estimated parameters reveal clear patterns. The initial probability distribution shows which tags typically begin sentences in English:

Out[11]:
Console
| Tag | Initial Probability | Common Examples |
|:---:|-------------------:|:----------------|
| DET | 0.2134 | "The dog...", "A cat..." |
| PRON | 0.1597 | "He said...", "It was..." |
| NOUN | 0.1411 | "Dogs are...", "Time flies..." |
| ADP | 0.1228 | "In 1990,...", "For example,..." |
| ADV | 0.0913 | "However,...", "Then..." |
| . | 0.0889 | Sentence fragments |
| CONJ | 0.0491 | "And then...", "But..." |
| VERB | 0.0451 | "Run!", "Consider..." |
| PRT | 0.0367 | Rare as sentence start |
| ADJ | 0.0343 | "Good morning", "Many people..." |
| NUM | 0.0168 | "1984 was...", "Two men..." |
| X | 0.0005 | Rare |

: Initial state probabilities estimated from the Brown corpus. Determiners (DET), nouns (NOUN), and pronouns (PRON) dominate sentence beginnings, reflecting common English patterns. {#tbl-initial-probs}

Determiners, nouns, and pronouns dominate sentence beginnings, together accounting for over half of all sentence starts.

Another revealing view is how emission probabilities vary across tags for ambiguous words. The word "run" can be a noun ("a morning run") or a verb ("I run daily"). Let's see how the learned emission probabilities capture this ambiguity:

Out[12]:
Console
| Word | P(word \| NOUN) | P(word \| VERB) | P(word \| ADJ) | Primary Tag |
|:----:|---------------:|---------------:|---------------:|:-----------:|
| time | 0.005796 | 0.000005 | 0.000000 | NOUN |
| run | 0.000200 | 0.000859 | 0.000000 | VERB |
| set | 0.000323 | 0.001778 | 0.000000 | VERB |

: Emission probabilities for ambiguous words across POS tags. "Time" is overwhelmingly a noun but retains a small verb probability. "Run" and "set" lean verbal, with "set" also keeping substantial noun probability, reflecting its high ambiguity in English. {#tbl-emission-ambiguity}

The table shows that "time" has a much higher emission probability from NOUN than from VERB, but both are non-zero, reflecting genuine ambiguity. These patterns form the statistical backbone of HMM-based tagging.
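
Comparisons like the ones in the table come straight out of the estimated emission dictionary. Here is a small sketch using `B_est` from above (exact values depend on the corpus version and the lowercasing applied during estimation):

```python
# Looking up emission probabilities for ambiguous words
# (exact values depend on corpus version and preprocessing)
for word in ["time", "run", "set"]:
    noun_p = B_est["NOUN"].get(word, 0.0)
    verb_p = B_est["VERB"].get(word, 0.0)
    leaning = "NOUN" if noun_p > verb_p else "VERB"
    print(f"{word:>5}: P(word|NOUN)={noun_p:.6f}  P(word|VERB)={verb_p:.6f}  -> {leaning}")
```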

Handling Unknown Words

A critical challenge emerges when we encounter words not seen during training. If a word $o$ never appeared with tag $i$, then $\hat{b}_i(o) = 0$. When we multiply by zero, the entire sequence probability becomes zero, making it impossible to tag sentences containing unknown words.

Smoothing techniques address this problem. The simplest approach is add-one (Laplace) smoothing: add a small count to every word-tag combination before normalizing.

$$\hat{b}_i(o) = \frac{\text{count}(s_t = i, o_t = o) + 1}{\text{count}(s_t = i) + |V|}$$

where:

  • $\hat{b}_i(o)$: the smoothed emission probability of observation $o$ from state $i$
  • $\text{count}(s_t = i, o_t = o)$: the number of times state $i$ emits observation $o$ (zero for unseen combinations)
  • $\text{count}(s_t = i)$: the total number of times state $i$ appears
  • $|V|$: the vocabulary size (the number of unique observations)

The numerator adds 1 to every count, ensuring no probability is ever zero. The denominator adds $|V|$ to compensate, keeping the probabilities properly normalized (summing to 1). This ensures every word has a non-zero probability from every tag, though the probability is very small for unseen combinations.

In[13]:
Code
def estimate_smoothed_emissions(tagged_sentences, alpha=1.0):
    """Estimate emission probabilities with add-alpha smoothing."""

    emission_counts = defaultdict(lambda: defaultdict(int))
    tag_counts = defaultdict(int)
    vocab = set()

    for sentence in tagged_sentences:
        for word, tag in sentence:
            word = word.lower()
            emission_counts[tag][word] += 1
            tag_counts[tag] += 1
            vocab.add(word)

    vocab_size = len(vocab)

    # Smoothed emission probabilities
    B_smoothed = {}
    for tag in tag_counts:
        B_smoothed[tag] = {}
        denominator = tag_counts[tag] + alpha * vocab_size

        for word in vocab:
            count = emission_counts[tag].get(word, 0)
            B_smoothed[tag][word] = (count + alpha) / denominator

        # Store smoothed probability for unknown words
        B_smoothed[tag]["<UNK>"] = alpha / denominator

    return B_smoothed, vocab


B_smoothed, vocab = estimate_smoothed_emissions(tagged_sents)
Out[14]:
Console
Smoothed Emission Probabilities
----------------------------------------
Vocabulary size: 49,815

Known word 'the':
  Unsmoothed P('the' | DET) = 0.510645
  Smoothed   P('the' | DET) = 0.374498

Unknown word '<UNK>':
  Unsmoothed P('<UNK>' | DET) = 0.000000
  Smoothed   P('<UNK>' | DET) = 0.00000535
  Smoothed   P('<UNK>' | NOUN) = 0.00000307

With smoothing, unseen words get a small but non-zero probability from each tag. This allows the HMM to tag sentences containing new words by relying on transition probabilities to disambiguate.

HMM for POS Tagging

Now let's build a complete HMM POS tagger. Given a sentence, we want to find the most likely tag sequence. This is the decoding problem:

$$\hat{S} = \arg\max_S P(S \,|\, O) = \arg\max_S P(O, S)$$

where:

  • $\hat{S}$: the optimal (most likely) state sequence
  • $\arg\max_S$: "the value of $S$ that maximizes" the following expression
  • $P(S \,|\, O)$: the posterior probability of state sequence $S$ given observations $O$
  • $P(O, S)$: the joint probability of observations and states

The second equality follows from Bayes' rule: $P(S \,|\, O) = P(O, S) / P(O)$. Since $P(O)$ is constant for a given observation sequence (it doesn't depend on which state sequence we're evaluating), maximizing the posterior is equivalent to maximizing the joint probability.

A naive approach would enumerate all possible tag sequences, compute their joint probabilities, and return the maximum. But with $|S|$ tags and $n$ words, there are $|S|^n$ possible sequences. For 12 tags and a 20-word sentence, that's roughly $3.8 \times 10^{21}$ possibilities.
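
For the toy three-state model, though, the search space is tiny ($3^3 = 27$ sequences for a three-word sentence), so we can sketch the naive approach directly, reusing `compute_joint_probability` from earlier:

```python
from itertools import product

# Brute-force decoding: score every possible tag sequence and keep the best.
# Only feasible here because 3 states and 3 words give 3**3 = 27 candidates.
sentence = ["the", "dog", "runs"]

best_tags, best_prob = None, 0.0
for candidate in product(states, repeat=len(sentence)):
    p = compute_joint_probability(sentence, list(candidate), pi, A, B)
    if p > best_prob:
        best_tags, best_prob = candidate, p

print(best_tags, best_prob)  # expected: ('DT', 'NN', 'VB'), ~0.0583
```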

The Viterbi algorithm solves this efficiently in $O(n \cdot |S|^2)$ time using dynamic programming. We'll cover Viterbi in detail in the next chapter. For now, let's implement a simplified greedy decoder that selects the locally best tag at each position.

Greedy decoding works as follows: for the first word, we pick the tag that maximizes $\pi_i \cdot b_i(o_1)$. For each subsequent word, we pick the tag that maximizes $a_{\text{prev}, i} \cdot b_i(o_t)$ given the previous tag we chose. This is fast but can make mistakes when a locally optimal choice leads to a globally suboptimal sequence.

In[15]:
Code
class SimpleHMMTagger:
    """A simple HMM POS tagger with greedy decoding."""

    def __init__(self, pi, A, B, vocab):
        self.pi = pi
        self.A = A
        self.B = B
        self.vocab = vocab
        self.tags = list(A.keys())

    def get_emission_prob(self, tag, word):
        """Get emission probability, handling unknown words."""
        word = word.lower()
        if word in self.B[tag]:
            return self.B[tag][word]
        return self.B[tag]["<UNK>"]

    def tag_greedy(self, sentence):
        """Tag sentence using greedy (locally optimal) decoding."""
        tags = []

        for i, word in enumerate(sentence):
            if i == 0:
                # First word: use initial and emission probabilities
                best_tag = None
                best_prob = -1
                for tag in self.tags:
                    prob = self.pi.get(tag, 1e-10) * self.get_emission_prob(
                        tag, word
                    )
                    if prob > best_prob:
                        best_prob = prob
                        best_tag = tag
            else:
                # Subsequent words: use transition and emission
                prev_tag = tags[-1]
                best_tag = None
                best_prob = -1
                for tag in self.tags:
                    trans_prob = self.A.get(prev_tag, {}).get(tag, 1e-10)
                    emit_prob = self.get_emission_prob(tag, word)
                    prob = trans_prob * emit_prob
                    if prob > best_prob:
                        best_prob = prob
                        best_tag = tag

            tags.append(best_tag)

        return tags


# Create tagger from estimated parameters
tagger = SimpleHMMTagger(pi_est, A_est, B_smoothed, vocab)
In[16]:
Code
# Test on some sentences
test_sentences = [
    ["The", "dog", "runs", "quickly"],
    ["She", "reads", "books"],
    ["The", "old", "man", "the", "boat"],  # Classic garden path sentence
]

predictions = []
for sent in test_sentences:
    tags = tagger.tag_greedy(sent)
    predictions.append(list(zip(sent, tags)))
Out[17]:
Console
Greedy HMM Tagger Results
==================================================

Sentence 1: The dog runs quickly
Tags: The/DET dog/NOUN runs/NOUN quickly/ADV 

Sentence 2: She reads books
Tags: She/PRON reads/VERB books/NOUN 

Sentence 3: The old man the boat
Tags: The/DET old/ADJ man/NOUN the/DET boat/NOUN 

The greedy tagger works reasonably well for straightforward sentences but can make errors when local decisions conflict with global coherence. The sentence "The old man the boat" is a famous garden path sentence in which "man" is actually a verb (meaning "to operate or crew") and "the old" is the subject noun phrase (the elderly people). Greedy decoding often misses such non-local dependencies because each decision only considers the immediately preceding tag.

Visualizing HMM Structure

The transition matrix is the heart of an HMM's syntactic knowledge. By visualizing it as a heatmap, we can see which tag sequences the model considers likely and which it considers rare. This provides insight into both the structure of English grammar and the quality of our learned model.

Out[18]:
Visualization
Heatmap showing transition probabilities between POS tags with color intensity indicating probability values.
Transition probability matrix learned from the Brown corpus. Darker cells indicate higher probabilities. Clear patterns emerge: determiners strongly predict nouns, nouns often precede verbs, and punctuation frequently follows nouns and verbs. These transitions encode the statistical regularities of English syntax.

The transition matrix reveals clear syntactic patterns. Determiners (DET) have high transition probability to nouns (NOUN) and adjectives (ADJ). Nouns frequently transition to verbs (VERB) or punctuation. Prepositions (ADP) strongly predict nouns, reflecting prepositional phrase structure. These patterns are exactly what we'd expect from English grammar.

Evaluating the Tagger

How well does our HMM tagger actually perform? To answer this rigorously, we need to evaluate on data the model hasn't seen during training. We'll split the Brown corpus into 90% training and 10% test, re-estimate all parameters on the training portion, and measure accuracy on the held-out test sentences.

We'll also track performance separately for known words (those seen during training) versus unknown words. This distinction matters because the model handles these cases very differently: known words have learned emission probabilities, while unknown words must rely entirely on transition patterns and smoothed probabilities.

In[19]:
Code
from sklearn.model_selection import train_test_split

# Split data into train and test
train_sents, test_sents = train_test_split(
    tagged_sents, test_size=0.1, random_state=42
)

# Re-estimate parameters on training data only
pi_train, A_train, B_train, _ = estimate_hmm_parameters(train_sents)
B_train_smoothed, vocab_train = estimate_smoothed_emissions(train_sents)

# Create tagger with training parameters
test_tagger = SimpleHMMTagger(pi_train, A_train, B_train_smoothed, vocab_train)
In[20]:
Code
def evaluate_tagger(tagger, test_sentences):
    """Evaluate tagger accuracy on test sentences."""
    correct = 0
    total = 0
    known_correct = 0
    known_total = 0
    unknown_correct = 0
    unknown_total = 0

    for sentence in test_sentences:
        words = [word for word, tag in sentence]
        gold_tags = [tag for word, tag in sentence]
        pred_tags = tagger.tag_greedy(words)

        for word, gold, pred in zip(words, gold_tags, pred_tags):
            total += 1
            is_known = word.lower() in tagger.vocab

            if gold == pred:
                correct += 1
                if is_known:
                    known_correct += 1
                else:
                    unknown_correct += 1

            if is_known:
                known_total += 1
            else:
                unknown_total += 1

    return {
        "overall": correct / total if total > 0 else 0,
        "known": known_correct / known_total if known_total > 0 else 0,
        "unknown": unknown_correct / unknown_total if unknown_total > 0 else 0,
        "unknown_rate": unknown_total / total if total > 0 else 0,
    }


results = evaluate_tagger(test_tagger, test_sents)
Out[21]:
Console
Training sentences: 51,606
Test sentences: 5,734
Unknown word rate: 2.00%

| Metric | Accuracy |
|:-------|--------:|
| Overall | 92.5% |
| Known words | 93.8% |
| Unknown words | 25.7% |

: HMM tagger accuracy on the Brown corpus test set. The model achieves high accuracy on known words where emission probabilities provide strong signals. Unknown word accuracy drops substantially because the model must rely solely on transition probabilities and smoothed emissions. {#tbl-accuracy}

The accuracy gap between known and unknown words is striking. For known words, the model performs reasonably well because emission probabilities provide strong signals. For unknown words, the model must rely entirely on transition probabilities, which provide weaker disambiguation.

This evaluation uses greedy decoding, which makes locally optimal choices. The Viterbi algorithm (next chapter) finds the globally optimal sequence and typically improves accuracy by 2-5 percentage points, especially for difficult cases where local and global optima differ.

Worked Example: Tagging "Time flies"

To understand how an HMM resolves ambiguity, let's trace through the complete probability computation for the classic ambiguous sentence "Time flies." This two-word sentence has two plausible interpretations, and both words can function as either nouns or verbs:

  • "Time flies" could mean "Time passes quickly" (NOUN VERB)
  • "Time flies" could mean "Measure how fast flies move" (VERB NOUN)

We'll compute the joint probability for each interpretation and see which one the HMM prefers. Since we're multiplying many small probabilities, we'll work in log space to avoid numerical underflow.

In[22]:
Code
# Use parameters estimated from training data
import math


def compute_log_joint(words, tags, pi, A, B):
    """Compute log P(words, tags) to avoid underflow."""
    log_prob = 0.0

    # Initial probability
    init_p = pi.get(tags[0], 1e-10)
    log_prob += math.log(init_p)

    # First emission
    word = words[0].lower()
    emit_p = B.get(tags[0], {}).get(
        word, B.get(tags[0], {}).get("<UNK>", 1e-10)
    )
    log_prob += math.log(emit_p)

    # Subsequent transitions and emissions
    for i in range(1, len(words)):
        trans_p = A.get(tags[i - 1], {}).get(tags[i], 1e-10)
        word = words[i].lower()
        emit_p = B.get(tags[i], {}).get(
            word, B.get(tags[i], {}).get("<UNK>", 1e-10)
        )

        log_prob += math.log(trans_p) + math.log(emit_p)

    return log_prob


words = ["Time", "flies"]

interpretations = [
    (["NOUN", "VERB"], "Time (subject) flies (action)"),
    (["VERB", "NOUN"], "Time (action) flies (object)"),
]

log_probs = []
for tags, description in interpretations:
    log_p = compute_log_joint(words, tags, pi_train, A_train, B_train_smoothed)
    log_probs.append((tags, log_p, description))
Out[23]:
Console
Analyzing 'Time flies'
==================================================

Interpretation: NOUN VERB
  Meaning: Time (subject) flies (action)
  Log probability: -19.7839

Interpretation: VERB NOUN
  Meaning: Time (action) flies (object)
  Log probability: -27.3867

==================================================
Most likely interpretation: NOUN VERB
  (Time (subject) flies (action))

NOUN VERB is 2003.79x more likely than VERB NOUN

To understand why the HMM prefers NOUN VERB, let's decompose the log probability into its components: initial probability, emissions, and transitions.

Out[24]:
Visualization
Stacked horizontal bar chart showing log probability components for NOUN VERB vs VERB NOUN interpretations.
Log probability breakdown for 'Time flies' under two interpretations. Each bar segment shows the contribution from initial probability (π), emissions (B), and transitions (A). NOUN VERB wins primarily due to higher initial probability for NOUN and stronger emission of 'flies' from VERB.

The HMM prefers "Time flies" as NOUN VERB. This makes sense: in the training data, sentences starting with nouns are more common than sentences starting with verbs, and noun-verb transitions are frequent. The emission probability of "time" from NOUN is also higher than from VERB, since "time" appears as a noun more often in general text.

This example illustrates how HMMs combine multiple sources of evidence: initial probabilities favor starting with nouns, transitions favor noun-verb patterns, and emissions favor "time" as a noun and "flies" as a verb.

Limitations and Impact

HMMs were the dominant approach to sequence labeling from the 1980s through the early 2000s. They remain instructive for understanding probabilistic sequence models and provide a foundation for more advanced techniques.

Where HMMs Fall Short

The independence assumptions that make HMMs tractable also limit their power. The emission independence assumption is particularly restrictive: an HMM cannot condition on neighboring words when deciding what tag to emit. Consider the word "set." In "the set of numbers," it's a noun. In "set the table," it's a verb. An HMM can only use transition probabilities to disambiguate, ignoring the direct evidence from surrounding words like "the" before "set" or "table" after "set."

The Markov assumption also causes problems for long-distance dependencies. In "The dog that the cat chased runs," the verb "runs" must agree with "dog," but an HMM only sees the immediately preceding "chased." Higher-order HMMs (conditioning on two or three previous tags) help, but the number of transition parameters grows exponentially with the order: with 12 tags, a first-order model has $12^2 = 144$ transition probabilities, a second-order model $12^3 = 1728$, and a third-order model $12^4 = 20736$.

Unknown words present another challenge. With smoothing, unknown words get similar emission probabilities from all tags. The model must rely on transitions, but transition probabilities alone often can't disambiguate. Modern approaches use character-level features (suffixes like "-tion" suggest nouns, "-ly" suggests adverbs) or pretrained embeddings that generalize to unseen words.
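
As a minimal sketch of the character-level idea (the suffix rules below are illustrative guesses, not estimated from data), an unknown-word guesser might map a few common suffixes to likely Universal tags before falling back to the most frequent open-class tag:

```python
# Illustrative suffix heuristics for unknown words.
# A real system would estimate P(tag | suffix) from training data
# rather than hard-coding these rules.
SUFFIX_TAG_GUESSES = [
    ("tion", "NOUN"),
    ("ness", "NOUN"),
    ("ment", "NOUN"),
    ("ly", "ADV"),
    ("ing", "VERB"),
    ("ed", "VERB"),
    ("ous", "ADJ"),
    ("able", "ADJ"),
]


def guess_unknown_tag(word, fallback="NOUN"):
    """Guess a tag for an out-of-vocabulary word from its suffix."""
    lowered = word.lower()
    for suffix, tag in SUFFIX_TAG_GUESSES:
        if lowered.endswith(suffix):
            return tag
    return fallback


print(guess_unknown_tag("quantization"))    # NOUN
print(guess_unknown_tag("asymptotically"))  # ADV
print(guess_unknown_tag("blorp"))           # NOUN (fallback)
```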

What HMMs Made Possible

Despite these limitations, HMMs achieved remarkable success and enabled practical NLP systems for the first time.

The Viterbi algorithm made efficient exact inference possible, allowing HMM taggers to process text at thousands of words per second. This efficiency was crucial for applying NLP to large document collections.

The forward-backward algorithm enabled unsupervised learning from raw text. You could train an HMM without any labeled data, discovering latent structure in the tag sequences. While unsupervised HMMs never matched supervised accuracy, they opened up possibilities for low-resource languages.

The probabilistic framework integrated naturally with other statistical models. HMM outputs could feed into named entity recognizers, parsers, and information extraction systems, with probabilities propagating through the pipeline.

HMMs also established the paradigm of sequence labeling as a structured prediction problem. The insight that labels depend on each other, not just on inputs, led to CRFs, neural sequence models, and eventually transformers with attention over the full sequence.

Key Parameters

When implementing HMM-based taggers, several parameters affect model performance:

  • Smoothing constant ($\alpha$): Controls how much probability mass is reserved for unseen word-tag combinations. Higher values (e.g., 1.0 for Laplace smoothing) give more weight to unseen events but dilute the learned emission probabilities. Lower values (e.g., 0.01) preserve learned distributions better but may underestimate rare events. Start with $\alpha = 1.0$ and tune based on unknown word accuracy; the sketch after this list shows the effect.

  • Tag set granularity: Finer tag sets (like Penn Treebank's 36 tags) capture more distinctions but require more training data and increase computational cost. Coarser sets (like Universal Dependencies' 17 tags) are easier to learn but lose information. Choose based on your downstream task requirements.

  • Case normalization: Converting words to lowercase reduces vocabulary size and improves emission estimates for rare words, but loses capitalization signals that help identify proper nouns and sentence boundaries. The implementation above uses lowercase normalization.

  • Unknown word handling: Beyond smoothing, you can improve unknown word accuracy by using morphological features (suffixes, prefixes, capitalization patterns) or by backing off to character-level models. The simple <UNK> token approach shown here provides a baseline.
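
As a quick sketch of the first point, re-running `estimate_smoothed_emissions` with different values of $\alpha$ shows the trade-off between known-word probabilities and the mass reserved for unknown words:

```python
# Effect of the smoothing constant alpha on emission probabilities.
# Larger alpha reserves more mass for <UNK>; smaller alpha keeps the
# learned distribution for known words closer to the unsmoothed MLE.
for alpha in (1.0, 0.1, 0.01):
    B_alpha, _ = estimate_smoothed_emissions(tagged_sents, alpha=alpha)
    print(
        f"alpha={alpha:>4}: "
        f"P('the'|DET) = {B_alpha['DET']['the']:.4f}, "
        f"P(<UNK>|DET) = {B_alpha['DET']['<UNK>']:.2e}"
    )
```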

Summary

Hidden Markov Models provide a probabilistic framework for sequence labeling. The model jointly models hidden state sequences (like POS tags) and observable output sequences (like words), learning statistical patterns for how states transition and what observations each state generates.

Key takeaways:

  • HMM components include states, observations, initial probabilities, transition probabilities, and emission probabilities. These five elements fully specify the model
  • The Markov assumption says each state depends only on the previous state, not the full history. This enables efficient algorithms but limits the model's ability to capture long-range dependencies
  • The output independence assumption says each observation depends only on its current state. This prevents the model from using surrounding context to disambiguate emissions
  • Maximum likelihood estimation learns parameters by counting occurrences in annotated data. Smoothing handles unseen word-tag pairs that would otherwise have zero probability
  • Decoding finds the most likely state sequence given observations. Greedy decoding is fast but suboptimal; the Viterbi algorithm (next chapter) finds the globally optimal sequence efficiently
  • HMM limitations include inability to use rich features, difficulty with long-distance dependencies, and weak handling of unknown words. These motivated the development of Conditional Random Fields and neural sequence models

HMMs exemplify a recurring theme in NLP: strong independence assumptions enable efficient algorithms, but real language violates these assumptions in important ways. The next chapter covers the Viterbi algorithm, which finds optimal sequences in $O(n \cdot |S|^2)$ time, and subsequent chapters introduce CRFs, which relax HMM's independence assumptions while preserving computational tractability.

