Conditional Random Fields: Discriminative Sequence Labeling with Rich Features

Michael Brenndoerfer · December 16, 2025 · 46 min read

Master CRFs for sequence labeling, from log-linear models to feature functions and the forward algorithm. Learn how CRFs overcome HMM limitations for NER and POS tagging.

Conditional Random Fields

Hidden Markov Models brought probabilistic reasoning to sequence labeling, but their generative formulation imposes strong constraints. Each observation depends only on its corresponding hidden state, forcing independence assumptions that conflict with the overlapping, correlated features that make natural language rich and complex. When you see the word "bank," its context matters enormously: preceding words, suffixes, capitalization patterns, surrounding punctuation. HMMs struggle to incorporate such features without violating their independence assumptions.

Conditional Random Fields (CRFs) emerged as a direct response to these limitations. By modeling the conditional probability of labels given observations rather than their joint distribution, CRFs can incorporate arbitrary overlapping features without worrying about dependencies between them. This discriminative approach has become the foundation of modern sequence labeling, powering named entity recognition, POS tagging, and countless other NLP tasks.

This chapter develops CRFs from first principles. We'll examine why HMMs fall short, derive the CRF formulation mathematically, implement feature functions, and build working CRF taggers. By the end, you'll understand not just how to use CRFs but why they work so well for sequence labeling.

From Generative to Discriminative Models

Before diving into CRFs, let's understand the fundamental distinction between generative and discriminative approaches to classification. This distinction explains why CRFs exist and what problems they solve.

Generative vs. Discriminative Models

Generative models learn the joint probability $P(X, Y)$ of inputs and outputs, then use Bayes' rule to compute $P(Y | X)$ for prediction. Discriminative models directly learn the conditional probability $P(Y | X)$ without modeling the input distribution.

The HMM Approach

Hidden Markov Models are generative. They model how sequences are produced: first, a hidden state is generated according to transition probabilities, then an observation is emitted according to emission probabilities. For a sequence of observations $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ and labels $\mathbf{y} = (y_1, y_2, \ldots, y_n)$, an HMM models the joint probability of observations and labels:

$$P(\mathbf{x}, \mathbf{y}) = P(y_1) \prod_{t=2}^{n} P(y_t | y_{t-1}) \prod_{t=1}^{n} P(x_t | y_t)$$

where:

  • $\mathbf{x} = (x_1, x_2, \ldots, x_n)$: the sequence of $n$ observations (e.g., words in a sentence)
  • $\mathbf{y} = (y_1, y_2, \ldots, y_n)$: the sequence of $n$ labels (e.g., POS tags or entity types)
  • $P(y_1)$: the initial state probability (probability of starting with label $y_1$)
  • $P(y_t | y_{t-1})$: the transition probability from label $y_{t-1}$ to label $y_t$
  • $P(x_t | y_t)$: the emission probability of observing $x_t$ given label $y_t$

The first product captures transition probabilities (how states follow each other), and the second captures emission probabilities (how each state generates its observation). To predict labels for new observations, we apply Bayes' rule:

$$P(\mathbf{y} | \mathbf{x}) = \frac{P(\mathbf{x}, \mathbf{y})}{P(\mathbf{x})}$$

where:

  • $P(\mathbf{y} | \mathbf{x})$: the posterior probability of label sequence $\mathbf{y}$ given observations $\mathbf{x}$
  • $P(\mathbf{x}, \mathbf{y})$: the joint probability computed by the HMM formula above
  • $P(\mathbf{x})$: the marginal probability of the observations, obtained by summing $P(\mathbf{x}, \mathbf{y})$ over all possible label sequences

This generative formulation imposes a critical constraint: the observation $x_t$ can only depend on its own label $y_t$, not on other observations or labels. This independence assumption enables tractable inference but limits expressiveness.
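To make this factorization concrete, here is a minimal sketch that multiplies the initial, transition, and emission terms for a two-word sequence. The probability tables are toy values chosen for illustration, not estimated from any corpus.

```python
# Toy HMM: P(x, y) = P(y_1) * prod_t P(y_t | y_{t-1}) * prod_t P(x_t | y_t)
# All probabilities below are illustrative placeholders.
initial = {"DET": 0.6, "NOUN": 0.4}
transition = {("DET", "NOUN"): 0.7, ("DET", "DET"): 0.3,
              ("NOUN", "NOUN"): 0.4, ("NOUN", "DET"): 0.6}
emission = {("DET", "the"): 0.5, ("NOUN", "bank"): 0.01}


def hmm_joint_probability(words, tags):
    """Multiply initial, transition, and emission probabilities."""
    prob = initial[tags[0]] * emission[(tags[0], words[0])]
    for t in range(1, len(words)):
        prob *= transition[(tags[t - 1], tags[t])]  # P(y_t | y_{t-1})
        prob *= emission[(tags[t], words[t])]       # P(x_t | y_t)
    return prob


print(hmm_joint_probability(["the", "bank"], ["DET", "NOUN"]))
# 0.6 * 0.5 * 0.7 * 0.01 = 0.0021
```

Notice that every observation factor conditions only on its own tag: nothing in this computation can look at a neighboring word, which is precisely the limitation discussed next.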

The Independence Problem

Consider tagging the word "bank" in context. A good tagger should consider:

  • The word itself: "bank"
  • The previous word: "river" suggests location, "investment" suggests organization
  • The next word: "account" suggests financial meaning
  • Capitalization: "Bank" at sentence start vs. "bank" mid-sentence
  • Word suffixes: "-ing," "-tion," "-ly" carry grammatical information
  • Surrounding punctuation: quotation marks might indicate names

These features overlap and interact. The suffix "-ing" combined with a preceding auxiliary verb strongly suggests a verb tag. Capitalization plus a preceding determiner suggests a proper noun. HMMs cannot naturally express such feature combinations because each observation must be generated independently from its label alone.

In[2]:
Code
# Features that help disambiguate "bank"
def extract_context_features(words, position):
    """Extract overlapping contextual features for a word."""
    word = words[position]
    features = {
        "word": word.lower(),
        "is_capitalized": word[0].isupper(),
        "suffix_3": word[-3:] if len(word) >= 3 else word,
        "prefix_3": word[:3] if len(word) >= 3 else word,
    }

    if position > 0:
        features["prev_word"] = words[position - 1].lower()
        features["prev_capitalized"] = words[position - 1][0].isupper()
    else:
        features["prev_word"] = "<START>"
        features["prev_capitalized"] = False

    if position < len(words) - 1:
        features["next_word"] = words[position + 1].lower()
        features["next_capitalized"] = words[position + 1][0].isupper()
    else:
        features["next_word"] = "<END>"
        features["next_capitalized"] = False

    return features


# Two sentences with "bank"
sentence1 = ["The", "river", "bank", "was", "muddy"]
sentence2 = ["The", "investment", "bank", "collapsed"]

features1 = extract_context_features(sentence1, 2)
features2 = extract_context_features(sentence2, 2)
Out[3]:
Console
Context Features for 'bank' in Different Contexts:
=======================================================

Sentence 1: 'The river bank was muddy'
-------------------------------------------------------
  word                : bank
  is_capitalized      : False
  suffix_3            : ank
  prefix_3            : ban
  prev_word           : river
  prev_capitalized    : False
  next_word           : was
  next_capitalized    : False

Sentence 2: 'The investment bank collapsed'
-------------------------------------------------------
  word                : bank
  is_capitalized      : False
  suffix_3            : ank
  prefix_3            : ban
  prev_word           : investment
  prev_capitalized    : False
  next_word           : collapsed
  next_capitalized    : False

→ The previous word changes dramatically, but HMMs cannot
  directly use 'prev_word' as a feature for predicting the tag.

In an HMM, we could only condition on the previous tag, not the previous word. The rich contextual features visible in the data cannot be exploited directly.

The Discriminative Solution

CRFs take a fundamentally different approach. Instead of modeling how observations are generated, they directly model the probability of a label sequence given the observations:

$$P(\mathbf{y} | \mathbf{x})$$

This conditional formulation has a crucial advantage: since we condition on $\mathbf{x}$, we don't need to model the distribution of observations. Features of $\mathbf{x}$ can overlap arbitrarily, interact in complex ways, and depend on any part of the input sequence. The model only needs to learn which label sequences are likely given the observed features.

Out[4]:
Visualization
Diagram showing HMM generating observations from hidden states versus CRF conditioning labels on observations.
Comparison of generative (HMM) and discriminative (CRF) models. HMMs model the joint probability P(X,Y) with arrows from hidden states to observations. CRFs directly model P(Y|X), with arrows from observations to labels, enabling arbitrary features across positions.

The figure illustrates the fundamental difference. In HMMs, arrows point from hidden states to observations, reflecting the generative process. In CRFs, observations condition the labels, and features can connect any observation to any label. This reversal of direction is not merely notational; it fundamentally changes what the model can represent.

The CRF Formulation

We've established that HMMs struggle with overlapping features because their generative structure demands independence between observations. Now we need a mathematical framework that preserves the sequential structure of HMMs while escaping their independence constraints. The solution lies in a family of models called log-linear models, which CRFs extend to sequences.

This section builds the CRF formulation step by step. We'll start with log-linear models for single classification decisions, then extend them to sequences, and finally examine the feature functions that give CRFs their expressive power. By the end, you'll see how CRFs elegantly solve the problems we identified with HMMs.

Log-Linear Models for Classification

Before tackling sequences, let's understand how log-linear models work for a single classification decision. The core idea is beautifully simple: instead of modeling how observations are generated, we directly score how well each possible label fits the observed input.

Consider classifying a single word. We want to compute $P(y | \mathbf{x})$, the probability of label $y$ given input features $\mathbf{x}$. A log-linear model builds this probability in three steps:

  1. Extract features: Define functions $f_k(\mathbf{x}, y)$ that measure relevant properties of the input-label pair
  2. Compute a score: Sum the weighted features to get an overall "compatibility score" for each label
  3. Normalize to probabilities: Convert scores to probabilities using the exponential function and normalization

This yields the fundamental log-linear equation:

$$P(y | \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\left( \sum_k \lambda_k f_k(\mathbf{x}, y) \right)$$

Let's unpack each component:

  • $f_k(\mathbf{x}, y)$: the $k$-th feature function, which extracts a property of the input-output pair. These are the building blocks of the model, returning real numbers (often 0 or 1 for binary features).
  • $\lambda_k$: the weight for feature $f_k$, learned from training data. Positive weights encourage labels where the feature fires; negative weights discourage them.
  • $\sum_k \lambda_k f_k(\mathbf{x}, y)$: the total score for assigning label $y$ to input $\mathbf{x}$. This is just a weighted sum of features.
  • $\exp(\cdot)$: the exponential function transforms scores into positive values, ensuring we can interpret results as probabilities.
  • $Z(\mathbf{x}) = \sum_{y'} \exp\left( \sum_k \lambda_k f_k(\mathbf{x}, y') \right)$: the partition function, which sums over all possible labels to normalize the distribution.

Why the Exponential?

The exponential function serves two purposes. First, it maps any real-valued score to a positive number, which is necessary for probabilities. Second, it has a useful property: if one label's score is much higher than others, the exponential amplifies this difference, making the model confident. If scores are similar, probabilities are more uniform. This "soft-max" behavior emerges naturally from the exponential.
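A quick numerical illustration of this behavior, using arbitrary score vectors:

```python
import numpy as np


def softmax(scores):
    """Exponentiate scores and normalize so they sum to 1."""
    exp_scores = np.exp(scores - np.max(scores))  # subtract max for stability
    return exp_scores / exp_scores.sum()


print(softmax(np.array([1.0, 2.0, 3.0])))   # similar scores -> roughly [0.09, 0.24, 0.67]
print(softmax(np.array([1.0, 2.0, 10.0])))  # one dominant score -> nearly all mass on the last label
```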

The partition function $Z(\mathbf{x})$ deserves special attention. It ensures that probabilities sum to 1 by computing the total "mass" across all possible labels. For single classification with a small label set, computing $Z$ is trivial. But as we'll see, computing $Z$ for sequences becomes the central computational challenge of CRFs.

Why is this formulation powerful? Because feature functions can capture any aspect of the input-output relationship. Unlike HMMs, where each observation depends only on its label, log-linear models can incorporate overlapping, interacting features without any independence assumptions.

In[5]:
Code
import numpy as np


def simple_log_linear_classifier(x_features, y, weights, all_labels):
    """
    Compute P(y|x) for a simple log-linear classifier.

    Args:
        x_features: dict of feature values for input x
        y: the label we're computing probability for
        weights: dict mapping (feature_name, label) to weight
        all_labels: list of all possible labels

    Returns:
        a tuple (probability P(y|x), dict of raw scores for each label)
    """

    def compute_score(label):
        """Compute sum of weighted features for a label."""
        score = 0.0
        for feat_name, feat_value in x_features.items():
            key = (feat_name, label)
            if key in weights:
                score += weights[key] * feat_value
        return score

    # Compute scores for all labels
    scores = {label: compute_score(label) for label in all_labels}

    # Compute partition function (sum of exp scores)
    max_score = max(scores.values())  # For numerical stability
    Z = sum(np.exp(s - max_score) for s in scores.values())

    # Compute probability
    prob = np.exp(scores[y] - max_score) / Z
    return prob, scores


# Example: classifying a word as NOUN, VERB, or ADJ
word_features = {
    "word=running": 1.0,
    "suffix=-ing": 1.0,
    "prev_word=is": 1.0,
}

# Learned weights (simplified)
example_weights = {
    ("word=running", "VERB"): 2.0,
    ("word=running", "NOUN"): 0.5,
    ("suffix=-ing", "VERB"): 3.0,
    ("suffix=-ing", "NOUN"): 1.0,
    ("prev_word=is", "VERB"): 1.5,
    ("prev_word=is", "ADJ"): 0.5,
}

labels = ["NOUN", "VERB", "ADJ"]
Out[6]:
Console
Log-Linear Classification Example:
=======================================================

Features for word 'running':
  word=running: 1.0
  suffix=-ing: 1.0
  prev_word=is: 1.0

Probabilities for each label:
----------------------------------------
  P(NOUN | x) = 0.0067  (score: 1.50)
  P(VERB | x) = 0.9909  (score: 6.50)
  P(ADJ  | x) = 0.0025  (score: 0.50)

  Total probability: 1.0000
Log-linear classification for the word "running". The raw scores sum feature weights, then exponentiation and normalization by Z convert them to probabilities. VERB dominates because its features (suffix '-ing', prev_word 'is') accumulate the highest score.
Label   Raw Score   exp(Score)   Probability
NOUN    1.5         4.5          0.007
VERB    6.5         665.1        0.991
ADJ     0.5         1.6          0.002

Notice what happened here. Three features fired for the word "running": the word itself, its suffix, and the preceding word. Each feature contributed to the scores for different labels, and the model combined them through weighted summation. The suffix -ing combined with prev_word=is provides strong evidence for VERB, and the learned weights reflect this intuition. The model assigned VERB a probability of over 99% by combining multiple overlapping features, something an HMM could never do.

Extending to Sequences: Linear-Chain CRFs

The log-linear model handles single classification decisions elegantly. But sequence labeling requires something more: we need to model entire sequences of labels, where each label might depend on its neighbors. How do we extend the log-linear framework to sequences while preserving its ability to use arbitrary features?

The answer is the linear-chain CRF. The key insight is to treat the entire label sequence as a single structured output, then score it using features that can look at adjacent labels and any part of the input. Instead of independently classifying each position, we score complete label sequences and find the one with the highest probability.

For a sequence of observations $\mathbf{x} = (x_1, \ldots, x_n)$ and labels $\mathbf{y} = (y_1, \ldots, y_n)$, the linear-chain CRF defines:

$$P(\mathbf{y} | \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\left( \sum_{t=1}^{n} \sum_k \lambda_k f_k(y_{t-1}, y_t, \mathbf{x}, t) \right)$$

This formula looks similar to the single-classification case, but with a crucial difference: we now sum over positions in the sequence, and feature functions examine pairs of adjacent labels. Let's break it down:

  • $\sum_{t=1}^{n}$: We accumulate scores across all positions in the sequence
  • $f_k(y_{t-1}, y_t, \mathbf{x}, t)$: Each feature function now takes four arguments

The four arguments to feature functions are the heart of CRF expressiveness:

The four arguments to CRF feature functions enable rich dependencies.

Argument       What it provides         Example use
$y_{t-1}$      Previous label           Capture transition patterns (DET → NOUN)
$y_t$          Current label            Capture emission patterns (word → label)
$\mathbf{x}$   Entire input sequence    Look at any word, not just current position
$t$            Current position         Position-specific features (first word often capitalized)

This design is deliberate. By including both $y_{t-1}$ and $y_t$, features can model label-label dependencies like HMM transitions. By including the entire sequence $\mathbf{x}$ rather than just $x_t$, features can examine context: the previous word, the next word, or even words further away. The position $t$ enables features that behave differently at sequence boundaries.

Why 'Linear-Chain'?

The term "linear-chain" refers to the dependency structure: each label yty_t directly depends only on its immediate neighbor yt1y_{t-1}. This isn't a limitation of CRFs in general, just a practical choice that enables efficient inference. Higher-order CRFs exist but are exponentially more expensive.

The partition function for sequences presents a computational challenge:

$$Z(\mathbf{x}) = \sum_{\mathbf{y}'} \exp\left( \sum_{t=1}^{n} \sum_k \lambda_k f_k(y'_{t-1}, y'_t, \mathbf{x}, t) \right)$$

This sum ranges over all possible label sequences $\mathbf{y}'$. With $K$ labels and sequence length $n$, there are $K^n$ possible sequences. For a modest vocabulary of 10 labels and a 20-word sentence, that's $10^{20}$ sequences, far too many to enumerate. Fortunately, the linear-chain structure enables efficient computation through dynamic programming, which we'll explore in the next section.
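To see why enumeration is hopeless, here is a brute-force sketch that computes $Z(\mathbf{x})$ by scoring every possible label sequence for a tiny sentence. The scoring function and weights are illustrative stand-ins for the weighted feature sums defined above; even so, `itertools.product` already generates $K^n$ candidates, which is only feasible for toy inputs.

```python
import itertools
import math


def sequence_score(words, label_seq, weights):
    """Toy stand-in for the weighted feature sum over a whole sequence."""
    score = 0.0
    prev = "<START>"
    for word, label in zip(words, label_seq):
        score += weights.get((prev, label), 0.0)   # transition-style term
        score += weights.get((word, label), 0.0)   # state-style term
        prev = label
    return score


def brute_force_partition(words, label_set, weights):
    """Sum exp(score) over all K^n label sequences."""
    return sum(
        math.exp(sequence_score(words, candidate, weights))
        for candidate in itertools.product(label_set, repeat=len(words))
    )


toy_words = ["The", "quick", "fox"]
toy_labels = ["DET", "ADJ", "NOUN"]
toy_weights = {("The", "DET"): 2.0, ("quick", "ADJ"): 1.5, ("fox", "NOUN"): 2.0,
               ("DET", "ADJ"): 1.0, ("ADJ", "NOUN"): 1.0}
print(brute_force_partition(toy_words, toy_labels, toy_weights))  # 3^3 = 27 sequences here
```

With 10 labels and a 20-word sentence, the same loop would need $10^{20}$ iterations, which is why the dynamic-programming forward algorithm introduced later in the chapter is essential.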

Feature Functions in Detail

We've seen that CRFs score sequences using weighted sums of feature functions. But what exactly are these features, and how do we design them? This is where CRFs truly shine: feature functions can capture virtually any pattern you can imagine, from simple word-label associations to complex contextual dependencies.

CRF Feature Function

A feature function $f_k(y_{t-1}, y_t, \mathbf{x}, t)$ maps a (previous label, current label, observation sequence, position) tuple to a real value. Binary features return 1 when a pattern is present and 0 otherwise. The weight $\lambda_k$ determines how much this pattern influences predictions.

Feature functions fall into two main categories, each serving a distinct purpose in the model.

State Features: Linking Observations to Labels

State features capture the relationship between observations and labels. They answer the question: "Given what I observe at this position, which label is most appropriate?"

Consider a feature that fires when the word "bank" appears with the label NOUN:

$$f_k(y_{t-1}, y_t, \mathbf{x}, t) = \mathbf{1}[y_t = \text{NOUN}] \cdot \mathbf{1}[x_t = \text{"bank"}]$$

The notation $\mathbf{1}[\cdot]$ denotes the indicator function, which returns 1 if the condition is true and 0 otherwise. The product of two indicators returns 1 only when both conditions hold. So this feature "fires" (returns 1) precisely when:

  • The current word is "bank", AND
  • The current label is NOUN

If the learned weight $\lambda_k$ is positive, this feature increases the score for labeling "bank" as NOUN. If negative, it discourages that labeling. The training process learns these weights from data.

But state features aren't limited to the current word. Because feature functions receive the entire observation sequence $\mathbf{x}$, they can examine context:

  • Previous word: Does "river" precede this word? (suggests location)
  • Next word: Does "account" follow? (suggests financial meaning)
  • Suffixes: Does the word end in "-ing"? (suggests verb)
  • Capitalization: Is the word capitalized mid-sentence? (suggests proper noun)

This flexibility is what HMMs lack. In an HMM, the observation "bank" can only depend on its own label. In a CRF, the feature for "bank" can simultaneously consider the previous word, the next word, and any other contextual signal.

Transition Features: Modeling Label Dependencies

Transition features capture patterns in how labels follow each other. They answer: "Given the previous label, which labels are likely to come next?"

Consider a feature for the common grammatical pattern where nouns follow determiners:

$$f_k(y_{t-1}, y_t, \mathbf{x}, t) = \mathbf{1}[y_{t-1} = \text{DET}] \cdot \mathbf{1}[y_t = \text{NOUN}]$$

This feature fires when a NOUN follows a DET, capturing patterns like "the cat" or "a dog." A positive weight encourages this transition; a negative weight would discourage it.

Transition features serve a similar role to HMM transition probabilities, but with an important difference: they can also incorporate observations. A feature like "NOUN follows DET when the current word is capitalized" combines transition and observation information in ways HMMs cannot express.
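As a minimal sketch of how these indicator definitions translate to code, state and transition features are simply boolean tests on the $(y_{t-1}, y_t, \mathbf{x}, t)$ tuple. The feature names and weights below are illustrative, not taken from the templates defined next.

```python
# Indicator-style feature functions: return 1.0 when the pattern holds, else 0.0.
def f_bank_is_noun(y_prev, y_cur, x, t):
    """State feature: fires when the current word is 'bank' and the label is NOUN."""
    return 1.0 if x[t] == "bank" and y_cur == "NOUN" else 0.0


def f_det_then_noun(y_prev, y_cur, x, t):
    """Transition feature: fires when a NOUN follows a DET."""
    return 1.0 if y_prev == "DET" and y_cur == "NOUN" else 0.0


# The per-position score is a weighted sum over all feature functions.
feature_functions = [f_bank_is_noun, f_det_then_noun]
weights = [1.8, 2.3]  # illustrative weights

x = ["the", "bank"]
score = sum(w * f("DET", "NOUN", x, 1) for w, f in zip(weights, feature_functions))
print(score)  # 1.8 + 2.3 = 4.1 — both features fire at position 1
```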

In[7]:
Code
def define_feature_templates():
    """
    Define feature templates for a CRF tagger.

    Templates describe patterns; actual features are instantiated
    by applying templates to training data.
    """
    templates = [
        # State features (depend on current label and observations)
        ("current_word", lambda x, t, y_prev, y_cur: (x[t], y_cur)),
        ("word_lower", lambda x, t, y_prev, y_cur: (x[t].lower(), y_cur)),
        (
            "is_capitalized",
            lambda x, t, y_prev, y_cur: (x[t][0].isupper(), y_cur),
        ),
        (
            "suffix_2",
            lambda x, t, y_prev, y_cur: (
                x[t][-2:] if len(x[t]) >= 2 else x[t],
                y_cur,
            ),
        ),
        (
            "suffix_3",
            lambda x, t, y_prev, y_cur: (
                x[t][-3:] if len(x[t]) >= 3 else x[t],
                y_cur,
            ),
        ),
        (
            "prefix_2",
            lambda x, t, y_prev, y_cur: (
                x[t][:2] if len(x[t]) >= 2 else x[t],
                y_cur,
            ),
        ),
        # Context features (depend on surrounding words)
        (
            "prev_word",
            lambda x, t, y_prev, y_cur: (
                x[t - 1] if t > 0 else "<START>",
                y_cur,
            ),
        ),
        (
            "next_word",
            lambda x, t, y_prev, y_cur: (
                x[t + 1] if t < len(x) - 1 else "<END>",
                y_cur,
            ),
        ),
        (
            "prev_word_suffix",
            lambda x, t, y_prev, y_cur: (
                x[t - 1][-2:] if t > 0 and len(x[t - 1]) >= 2 else "<START>",
                y_cur,
            ),
        ),
        # Transition features (depend on previous label)
        ("transition", lambda x, t, y_prev, y_cur: (y_prev, y_cur)),
        (
            "transition_word",
            lambda x, t, y_prev, y_cur: (y_prev, y_cur, x[t].lower()),
        ),
    ]
    return templates


def extract_features(words, position, prev_label, current_label, templates):
    """
    Extract features for a specific position in the sequence.

    Returns a dict of feature names to values (all 1.0 for binary features).
    """
    features = {}
    for template_name, template_func in templates:
        try:
            feature_value = template_func(
                words, position, prev_label, current_label
            )
            feature_key = f"{template_name}={feature_value}"
            features[feature_key] = 1.0
        except (IndexError, TypeError):
            pass
    return features


# Demonstrate feature extraction
sample_sentence = ["The", "quick", "fox", "jumps"]
templates = define_feature_templates()

# Extract features for position 2 ("fox") with prev_label=ADJ, current_label=NOUN
features_pos2 = extract_features(sample_sentence, 2, "ADJ", "NOUN", templates)
Out[8]:
Console
Feature Extraction Example:
============================================================
Sentence: ['The', 'quick', 'fox', 'jumps']
Position: 2 (word='fox')
Previous label: ADJ, Current label: NOUN

Extracted Features:
------------------------------------------------------------
  current_word=('fox', 'NOUN')
  is_capitalized=(False, 'NOUN')
  next_word=('jumps', 'NOUN')
  prefix_2=('fo', 'NOUN')
  prev_word=('quick', 'NOUN')
  prev_word_suffix=('ck', 'NOUN')
  suffix_2=('ox', 'NOUN')
  suffix_3=('fox', 'NOUN')
  transition=('ADJ', 'NOUN')
  transition_word=('ADJ', 'NOUN', 'fox')
  word_lower=('fox', 'NOUN')

Look at the variety of information captured by these features. For a single word "fox" at position 2, we extract:

  • Word identity: The word itself and its lowercase form
  • Morphological properties: Two-character and three-character suffixes and prefixes
  • Capitalization: Whether the word starts with an uppercase letter
  • Context: The previous word ("quick") and next word ("jumps")
  • Label transition: The pattern of moving from ADJ to NOUN

Each of these features will have a learned weight. During training, the model discovers which features are predictive: perhaps words ending in "-ly" strongly indicate adverbs, or capitalized words following "the" often indicate proper nouns. The power lies in combining many weak signals into a strong prediction.

Connecting CRFs to HMMs

Having built up the CRF formulation, it's illuminating to see how it relates to HMMs. This comparison reveals that HMMs are actually a special case of CRFs with a restricted feature set.

An HMM has two types of parameters:

  1. Transition probabilities $P(y_t | y_{t-1})$: How likely is each label given the previous label?
  2. Emission probabilities $P(x_t | y_t)$: How likely is each observation given its label?

We can express these as CRF features:

In[9]:
Code
def hmm_as_crf_features(words, position, prev_label, current_label):
    """
    Features that make a CRF equivalent to an HMM.

    HMM has two types of parameters:
    - Transition: P(current_label | prev_label)
    - Emission: P(word | current_label)

    These correspond to two feature templates.
    """
    features = {}

    # Transition feature: fires for each (prev_label, current_label) pair
    # Equivalent to log P(current_label | prev_label) in HMM
    features[f"transition_{prev_label}_{current_label}"] = 1.0

    # Emission feature: fires for each (word, current_label) pair
    # Equivalent to log P(word | current_label) in HMM
    word = words[position]
    features[f"emission_{word}_{current_label}"] = 1.0

    return features


# Show HMM-equivalent features
hmm_features = hmm_as_crf_features(sample_sentence, 2, "ADJ", "NOUN")
Out[10]:
Console
HMM as CRF (Minimal Feature Set):
==================================================
Word at position 2: 'fox'
Previous label: ADJ, Current label: NOUN

HMM-equivalent features:
  emission_fox_NOUN
  transition_ADJ_NOUN

In an HMM:
  - transition_ADJ_NOUN weight ≈ log P(NOUN | ADJ)
  - emission_fox_NOUN weight ≈ log P(fox | NOUN)

CRFs generalize by allowing many more features!

The correspondence is precise: if we set the CRF weight for transition_ADJ_NOUN to $\log P(\text{NOUN} | \text{ADJ})$ and the weight for emission_fox_NOUN to $\log P(\text{fox} | \text{NOUN})$, the CRF assigns the same relative probabilities to label sequences as the HMM.
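A small sketch of that correspondence, using made-up HMM probabilities, simply takes logs of the transition and emission tables to obtain equivalent CRF feature weights:

```python
import math

# Hypothetical HMM parameters (not learned from data)
hmm_transitions = {("ADJ", "NOUN"): 0.6, ("ADJ", "VERB"): 0.1}
hmm_emissions = {("NOUN", "fox"): 0.002, ("VERB", "fox"): 0.0001}

# Equivalent CRF weights: one feature per HMM parameter, weight = log probability
crf_weights = {}
for (prev_label, label), p in hmm_transitions.items():
    crf_weights[f"transition_{prev_label}_{label}"] = math.log(p)
for (label, word), p in hmm_emissions.items():
    crf_weights[f"emission_{word}_{label}"] = math.log(p)

print(crf_weights["transition_ADJ_NOUN"])  # log 0.6 ≈ -0.51
print(crf_weights["emission_fox_NOUN"])    # log 0.002 ≈ -6.21
```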

But here's the crucial insight: CRFs can do everything HMMs can do, and more. While an HMM is limited to these two feature types, a CRF can add:

  • Features that examine the previous word (not just the previous label)
  • Features that look at the next word
  • Features based on word suffixes, prefixes, or character patterns
  • Features that combine multiple signals (e.g., "capitalized word after a determiner")
  • Features that span multiple positions

The HMM is a CRF with its hands tied. By removing the generative constraints, CRFs unlock the full power of discriminative modeling for sequences. This is why CRFs consistently outperform HMMs on sequence labeling tasks: they can leverage the same information HMMs use, plus arbitrarily more.

Computing the Partition Function

The partition function $Z(\mathbf{x})$ is the computational challenge of CRFs. It sums over all possible label sequences, and with $K$ labels and sequence length $n$, there are $K^n$ such sequences. For $K=10$ labels and $n=20$ words, that's $10^{20}$ sequences, far too many to enumerate.

Out[11]:
Visualization
Line plot showing exponential growth of possible sequences as sequence length increases, with separate lines for different numbers of labels.
The exponential explosion of possible label sequences. With just 5 labels, a 20-word sentence has over 95 trillion possible labelings. This makes naive enumeration impossible and motivates the need for dynamic programming algorithms like the forward algorithm.

The Forward Algorithm

Fortunately, the linear-chain structure enables efficient computation via dynamic programming. The forward algorithm computes $Z(\mathbf{x})$ in $O(nK^2)$ time, similar to the forward algorithm for HMMs.

The key idea is to define a forward variable $\alpha_t(y)$ that accumulates scores for all partial label sequences ending in label $y$ at position $t$:

$$\alpha_t(y) = \sum_{\mathbf{y}_{1:t-1}} \exp\left( \sum_{t'=1}^{t} \sum_k \lambda_k f_k(y_{t'-1}, y_{t'}, \mathbf{x}, t') \right) \quad \text{where } y_t = y$$

where:

  • $\alpha_t(y)$: the forward variable at position $t$ for label $y$
  • $\mathbf{y}_{1:t-1}$: all possible label assignments for positions 1 through $t-1$
  • The sum aggregates scores over all ways to reach label $y$ at position $t$
  • The constraint $y_t = y$ fixes the label at position $t$

Intuitively, $\alpha_t(y)$ represents the total "mass" of all paths that end in label $y$ at position $t$. The crucial insight is that we can compute $\alpha_t(y)$ recursively from the previous position:

$$\alpha_t(y) = \sum_{y'} \alpha_{t-1}(y') \cdot \exp\left( \sum_k \lambda_k f_k(y', y, \mathbf{x}, t) \right)$$

where:

  • $y'$: iterates over all possible labels at position $t-1$
  • $\alpha_{t-1}(y')$: the forward variable from the previous position
  • $\exp\left( \sum_k \lambda_k f_k(y', y, \mathbf{x}, t) \right)$: the score for transitioning from $y'$ to $y$ at position $t$

This recurrence says: to compute the total score of paths ending in $y$ at position $t$, sum over all possible previous labels $y'$, multiplying the score of reaching $y'$ by the score of the transition from $y'$ to $y$.

Once we've computed forward variables for all positions, the partition function is simply the sum over all labels at the final position:

$$Z(\mathbf{x}) = \sum_y \alpha_n(y)$$

where the sum is over all possible labels $y$ at the final position $n$. This works because every complete label sequence must end with some label at position $n$.

In[12]:
Code
def forward_algorithm(words, labels, feature_weights, templates):
    """
    Compute forward variables and partition function for a CRF.

    Args:
        words: list of words in the sequence
        labels: list of possible labels
        feature_weights: dict mapping feature names to weights
        templates: list of feature templates

    Returns:
        alpha: forward variables, shape (n, K)
        log_Z: log partition function
    """
    n = len(words)
    K = len(labels)
    label_to_idx = {label: i for i, label in enumerate(labels)}

    # Work in log space for numerical stability
    # log_alpha[t, y] = log(alpha_t(y))
    log_alpha = np.full((n, K), -np.inf)

    # Initialize: t=0, no previous label (use special START)
    for y_idx, y in enumerate(labels):
        features = extract_features(words, 0, "<START>", y, templates)
        score = sum(
            feature_weights.get(f, 0.0) * v for f, v in features.items()
        )
        log_alpha[0, y_idx] = score

    # Recurrence
    for t in range(1, n):
        for y_idx, y in enumerate(labels):
            # Sum over all previous labels
            log_scores = []
            for y_prev_idx, y_prev in enumerate(labels):
                features = extract_features(words, t, y_prev, y, templates)
                score = sum(
                    feature_weights.get(f, 0.0) * v for f, v in features.items()
                )
                log_scores.append(log_alpha[t - 1, y_prev_idx] + score)

            # Log-sum-exp for numerical stability
            max_score = max(log_scores)
            log_alpha[t, y_idx] = max_score + np.log(
                sum(np.exp(s - max_score) for s in log_scores)
            )

    # Partition function: sum over final labels
    max_final = np.max(log_alpha[n - 1, :])
    log_Z = max_final + np.log(np.sum(np.exp(log_alpha[n - 1, :] - max_final)))

    return log_alpha, log_Z


# Example with simple weights
example_labels = ["NOUN", "VERB", "DET", "ADJ"]
example_weights = {
    "transition=('<START>', 'DET')": 2.0,
    "transition=('DET', 'NOUN')": 2.5,
    "transition=('DET', 'ADJ')": 1.5,
    "transition=('ADJ', 'NOUN')": 2.0,
    "current_word=('The', 'DET')": 3.0,
    "current_word=('quick', 'ADJ')": 2.0,
    "current_word=('fox', 'NOUN')": 2.5,
    "current_word=('jumps', 'VERB')": 2.0,
}

simple_templates = [
    ("transition", lambda x, t, y_prev, y_cur: (y_prev, y_cur)),
    ("current_word", lambda x, t, y_prev, y_cur: (x[t], y_cur)),
]

log_alpha, log_Z = forward_algorithm(
    sample_sentence, example_labels, example_weights, simple_templates
)
Out[13]:
Console
Forward Algorithm Results:
============================================================
Sentence: ['The', 'quick', 'fox', 'jumps']
Labels: ['NOUN', 'VERB', 'DET', 'ADJ']

Log Forward Variables (log α_t(y)):
------------------------------------------------------------
Position   Word          NOUN      VERB       DET       ADJ
------------------------------------------------------------
0          The           0.00      0.00      5.00      0.00
1          quick         7.51      5.02      5.02      8.50
2          fox          13.10      8.86      8.86      8.93
3          jumps        13.37     15.15     13.15     13.19

Log Partition Function: log Z(x) = 15.5161
Partition Function: Z(x) = 5477008.5076

The forward variables show how probability mass accumulates through the sequence. At position 0, "The" has high scores for DET due to the matching feature weight. As we move through the sequence, the algorithm combines emission scores (how well each word matches each label) with transition scores (how likely each label sequence is). The log partition function aggregates all possible paths, enabling probability computation. Higher values indicate more probable label assignments at each position.

Out[14]:
Visualization
Heatmap with positions on x-axis, labels on y-axis, showing log-alpha values with color intensity.
Heatmap of log forward variables showing how probability mass accumulates through the sequence. Brighter cells indicate higher scores (more probable label assignments). Notice how DET dominates at position 0 for 'The', then ADJ becomes likely for 'quick', and NOUN for 'fox'. The pattern reveals how the CRF learns word-label associations.
Out[15]:
Visualization
Trellis diagram showing forward variable computation with arrows indicating transitions between label states.
Trellis diagram of the forward algorithm. Each node shows the log forward variable α for that label at that position. Arrows indicate transitions from t=0 to t=1. The partition function Z(x) sums exp(α) over all final labels.

The forward algorithm exploits the linear-chain structure: at each position, we only need to consider transitions from labels at the previous position. This reduces the exponential enumeration to polynomial computation.

Log-Space Computation

Throughout CRF computations, we work in log space. Multiplying many probabilities (or exponentials of scores) quickly leads to numerical underflow. By using the log-sum-exp operation, we maintain numerical stability:

$$\log\left( \sum_i e^{x_i} \right) = \max_i(x_i) + \log\left( \sum_i e^{x_i - \max_i(x_i)} \right)$$

where:

  • $x_i$: the $i$-th value we want to sum (in our case, log-scores for different label sequences)
  • $\max_i(x_i)$: the maximum value among all $x_i$
  • $e^{x_i - \max_i(x_i)}$: the exponentiated value after subtracting the maximum

This identity is mathematically exact but numerically superior. The trick works because subtracting the maximum ensures that at least one term in the sum equals $e^0 = 1$, while all other terms are less than 1. This prevents both overflow (from exponentiating large positive numbers) and underflow (from exponentiating large negative numbers).
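A small numerical demonstration of why the stabilized form matters, with arbitrary but typical log-space scores:

```python
import numpy as np

scores = np.array([-1000.0, -1001.0, -1002.0])

# Naive computation underflows: exp(-1000) is 0.0 in float64, so log(0) = -inf
naive = np.log(np.sum(np.exp(scores)))

# Log-sum-exp subtracts the maximum first, keeping every exponent in a safe range
m = np.max(scores)
stable = m + np.log(np.sum(np.exp(scores - m)))

print(naive)   # -inf (with a runtime warning)
print(stable)  # ≈ -999.59
```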

CRF Inference with the Viterbi Algorithm

Training gives us feature weights. Inference finds the most likely label sequence for new observations:

$$\mathbf{y}^* = \arg\max_{\mathbf{y}} P(\mathbf{y} | \mathbf{x}) = \arg\max_{\mathbf{y}} \sum_{t=1}^{n} \sum_k \lambda_k f_k(y_{t-1}, y_t, \mathbf{x}, t)$$

where:

  • $\mathbf{y}^*$: the optimal label sequence we want to find
  • $\arg\max_{\mathbf{y}}$: "the value of $\mathbf{y}$ that maximizes" the following expression
  • $P(\mathbf{y} | \mathbf{x})$: the conditional probability of label sequence $\mathbf{y}$ given observations $\mathbf{x}$
  • The double sum computes the total score for a candidate label sequence

The second equality holds because the partition function $Z(\mathbf{x})$ is constant for a given input. Since $P(\mathbf{y} | \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp(\text{score})$, and the exponential function is monotonically increasing, maximizing the probability is equivalent to maximizing the unnormalized score. The Viterbi algorithm, which you may know from HMMs, efficiently finds this maximum-scoring path.

In[16]:
Code
def viterbi_decode(words, labels, feature_weights, templates):
    """
    Find the most likely label sequence using the Viterbi algorithm.

    Args:
        words: list of words
        labels: list of possible labels
        feature_weights: dict of feature name to weight
        templates: list of feature templates

    Returns:
        best_path: list of labels for each position
        best_score: score of the best path
    """
    n = len(words)
    K = len(labels)
    label_to_idx = {label: i for i, label in enumerate(labels)}

    # viterbi[t, y] = max score of paths ending in y at position t
    viterbi = np.full((n, K), -np.inf)
    # backpointer[t, y] = label at t-1 on best path to y at t
    backpointer = np.zeros((n, K), dtype=int)

    # Initialize
    for y_idx, y in enumerate(labels):
        features = extract_features(words, 0, "<START>", y, templates)
        score = sum(
            feature_weights.get(f, 0.0) * v for f, v in features.items()
        )
        viterbi[0, y_idx] = score

    # Recurrence
    for t in range(1, n):
        for y_idx, y in enumerate(labels):
            best_score = -np.inf
            best_prev = 0

            for y_prev_idx, y_prev in enumerate(labels):
                features = extract_features(words, t, y_prev, y, templates)
                score = sum(
                    feature_weights.get(f, 0.0) * v for f, v in features.items()
                )
                candidate_score = viterbi[t - 1, y_prev_idx] + score

                if candidate_score > best_score:
                    best_score = candidate_score
                    best_prev = y_prev_idx

            viterbi[t, y_idx] = best_score
            backpointer[t, y_idx] = best_prev

    # Find best final label
    best_final_idx = np.argmax(viterbi[n - 1, :])
    best_score = viterbi[n - 1, best_final_idx]

    # Backtrack to find path
    path_indices = [best_final_idx]
    for t in range(n - 1, 0, -1):
        path_indices.append(backpointer[t, path_indices[-1]])
    path_indices.reverse()

    best_path = [labels[i] for i in path_indices]
    return best_path, best_score


# Decode our example sentence
predicted_labels, score = viterbi_decode(
    sample_sentence, example_labels, example_weights, simple_templates
)
Out[17]:
Console
Viterbi Decoding Results:
==================================================
Sentence: ['The', 'quick', 'fox', 'jumps']

Best label sequence:
--------------------------------------------------
  The        → DET
  quick      → ADJ
  fox        → NOUN
  jumps      → VERB

Path score: 15.0000

The Viterbi algorithm has the same $O(nK^2)$ complexity as the forward algorithm. At each position, it tracks the best path to each label rather than summing all paths. Backpointers record which previous label led to the best score, enabling reconstruction of the optimal sequence.

Building a CRF Tagger with sklearn-crfsuite

Now let's build a practical CRF tagger using the sklearn-crfsuite library, which provides an efficient implementation with all the features we've discussed.

In[18]:
Code
# Feature extraction function for NER
def word_to_features(sent, i):
    """
    Extract features for a word at position i in a sentence.

    sent: list of (word, tag) tuples
    i: position in sentence
    """
    word = sent[i][0]

    features = {
        "bias": 1.0,
        "word.lower": word.lower(),
        "word[-3:]": word[-3:],
        "word[-2:]": word[-2:],
        "word[:3]": word[:3],
        "word[:2]": word[:2],
        "word.isupper": word.isupper(),
        "word.istitle": word.istitle(),
        "word.isdigit": word.isdigit(),
        "word.length": len(word),
    }

    # Previous word features
    if i > 0:
        word_prev = sent[i - 1][0]
        features.update(
            {
                "-1:word.lower": word_prev.lower(),
                "-1:word.istitle": word_prev.istitle(),
                "-1:word.isupper": word_prev.isupper(),
                "-1:word[-3:]": word_prev[-3:],
            }
        )
    else:
        features["BOS"] = True  # Beginning of sentence

    # Next word features
    if i < len(sent) - 1:
        word_next = sent[i + 1][0]
        features.update(
            {
                "+1:word.lower": word_next.lower(),
                "+1:word.istitle": word_next.istitle(),
                "+1:word.isupper": word_next.isupper(),
                "+1:word[-3:]": word_next[-3:],
            }
        )
    else:
        features["EOS"] = True  # End of sentence

    return features


def sent_to_features(sent):
    """Extract features for all words in a sentence."""
    return [word_to_features(sent, i) for i in range(len(sent))]


def sent_to_labels(sent):
    """Extract labels for all words in a sentence."""
    return [label for token, label in sent]


# Create sample NER training data
train_sentences = [
    [
        ("John", "B-PER"),
        ("Smith", "I-PER"),
        ("works", "O"),
        ("at", "O"),
        ("Google", "B-ORG"),
        ("in", "O"),
        ("London", "B-LOC"),
        (".", "O"),
    ],
    [
        ("Apple", "B-ORG"),
        ("announced", "O"),
        ("new", "O"),
        ("products", "O"),
        ("yesterday", "O"),
        (".", "O"),
    ],
    [
        ("The", "O"),
        ("president", "O"),
        ("visited", "O"),
        ("Paris", "B-LOC"),
        ("and", "O"),
        ("Berlin", "B-LOC"),
        (".", "O"),
    ],
    [
        ("Microsoft", "B-ORG"),
        ("CEO", "O"),
        ("Satya", "B-PER"),
        ("Nadella", "I-PER"),
        ("spoke", "O"),
        ("today", "O"),
        (".", "O"),
    ],
    [
        ("Barack", "B-PER"),
        ("Obama", "I-PER"),
        ("met", "O"),
        ("with", "O"),
        ("Angela", "B-PER"),
        ("Merkel", "I-PER"),
        (".", "O"),
    ],
]

# Extract features and labels
X_train = [sent_to_features(s) for s in train_sentences]
y_train = [sent_to_labels(s) for s in train_sentences]
Out[19]:
Console
Sample Training Data:
============================================================
Number of sentences: 5

Example sentence with features:
------------------------------------------------------------

John (B-PER):
    word.lower: john
    word[-3:]: ohn
    word.istitle: True
    BOS: True

Smith (I-PER):
    word.lower: smith
    word[-3:]: ith
    word.istitle: True
    -1:word.lower: john

works (O):
    word.lower: works
    word[-3:]: rks
    word.istitle: False
    -1:word.lower: smith

at (O):
    word.lower: at
    word[-3:]: at
    word.istitle: False
    -1:word.lower: works

Now let's train the CRF model:

In[20]:
Code
import sklearn_crfsuite

# Initialize and train CRF
crf = sklearn_crfsuite.CRF(
    algorithm="lbfgs",  # L-BFGS optimization
    c1=0.1,  # L1 regularization
    c2=0.1,  # L2 regularization
    max_iterations=100,
    all_possible_transitions=True,  # Include all label transitions
)

crf.fit(X_train, y_train)

# Test on new sentences
test_sentences = [
    [("Tim", "?"), ("Cook", "?"), ("visited", "?"), ("Tokyo", "?"), (".", "?")],
    [
        ("Amazon", "?"),
        ("opened", "?"),
        ("offices", "?"),
        ("in", "?"),
        ("Seattle", "?"),
        (".", "?"),
    ],
]

X_test = [sent_to_features(s) for s in test_sentences]
y_pred = crf.predict(X_test)
Out[21]:
Console
CRF Predictions on Test Data:
============================================================

Sentence 1:
----------------------------------------
  Tim          → B-PER
  Cook         → I-PER
  visited      → O
  Tokyo        → B-LOC
  .            → O

Sentence 2:
----------------------------------------
  Amazon       → B-ORG
  opened       → O
  offices      → O
  in           → O
  Seattle      → B-LOC
  .            → O

Even with minimal training data, the CRF correctly identifies person names (Tim Cook, with B-PER and I-PER tags) and organizations (Amazon as B-ORG). The model generalizes from the training examples by learning that capitalized words following patterns similar to training entities are likely named entities. Let's examine what features it learned to value:

In[22]:
Code
from collections import Counter


def get_top_features(crf, label, n=10):
    """Get top weighted features for a label."""
    state_features = Counter(crf.state_features_)

    # Filter features for this label
    label_features = {}
    for (attr, lbl), weight in state_features.items():
        if lbl == label:
            label_features[attr] = weight

    # Sort by absolute weight
    sorted_features = sorted(
        label_features.items(), key=lambda x: abs(x[1]), reverse=True
    )
    return sorted_features[:n]


# Get top features for each entity type
labels_to_examine = ["B-PER", "B-ORG", "B-LOC"]
Out[23]:
Console
Top Features by Label:
============================================================

B-PER:
----------------------------------------
  +1:word.istitle                weight: +0.792
  word.istitle                   weight: +0.561
  -1:word.istitle                weight: -0.367
  BOS                            weight: +0.135
  word.lower:angela              weight: +0.108

B-ORG:
----------------------------------------
  word[-2:]:le                   weight: +1.028
  BOS                            weight: +0.389
  -1:word.istitle                weight: -0.364
  +1:word.istitle                weight: -0.331
  +1:word.isupper                weight: +0.290

B-LOC:
----------------------------------------
  +1:word.lower:.                weight: +0.492
  +1:word[-3:]:.                 weight: +0.492
  word.istitle                   weight: +0.342
  word.lower:paris               weight: +0.310
  word[-3:]:ris                  weight: +0.310

The feature weights reveal what patterns the CRF learned to associate with each entity type. High positive weights indicate strong evidence for that label. Features like word.istitle (capitalization) and specific word forms from the training data carry the most weight, which aligns with our intuition that named entities are typically capitalized and follow predictable patterns.

Out[24]:
Visualization
Horizontal bar chart showing top positive and negative feature weights for B-PER, B-ORG, and B-LOC entity types.
B-PER (Person): Top features for person entity recognition. Positive weights encourage the label.
B-ORG (Organization): Top features for organization entity recognition.
B-LOC (Location): Top features for location entity recognition.

Learned Transition Weights

CRFs also learn transition patterns between labels. Let's examine them:

In[25]:
Code
def get_transition_weights(crf):
    """Extract transition weights from trained CRF."""
    transitions = {}
    for (from_label, to_label), weight in crf.transition_features_.items():
        transitions[(from_label, to_label)] = weight
    return transitions


transition_weights = get_transition_weights(crf)
Out[26]:
Console
Learned Transition Weights:
============================================================
From Label   To Label         Weight
------------------------------------------------------------
B-PER        I-PER            +2.174
O            B-LOC            +0.711
O            O                +0.650
I-PER        O                +0.533
B-ORG        O                +0.303
B-LOC        O                +0.284
O            B-PER            +0.042
B-ORG        I-PER            -0.006
B-PER        B-ORG            -0.032
I-PER        I-PER            -0.060

The transition weights show strong preferences for valid BIO sequences. Transitions from B-tags to their corresponding I-tags have high positive weights, encouraging entity continuation. Transitions between different entity types (like B-PER to I-ORG) are penalized with negative weights, enforcing the constraint that I-tags must follow their matching B-tag.

Out[27]:
Visualization
Heatmap showing CRF transition weights between BIO labels with color scale from red to blue.
Learned transition weights from the trained CRF model. Positive weights (blue) indicate encouraged transitions, while negative weights (red) indicate discouraged transitions. Notice how B-PER to I-PER has high weight, enforcing entity continuation, while B-PER to I-LOC is penalized.

The transition weights reveal what the model learned about valid sequences. High positive weights between B-TYPE and I-TYPE labels encourage entity continuation. Negative weights for invalid transitions (like B-PER to I-LOC) discourage label sequences that violate BIO constraints.
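Because these transition weights are soft preferences rather than hard constraints, a trained CRF can still occasionally emit an invalid sequence. A simple post-hoc check like the hypothetical helper below (not part of sklearn-crfsuite) makes the BIO constraint explicit:

```python
def is_valid_bio(labels):
    """Check that every I-X tag follows a B-X or I-X of the same entity type."""
    prev = "O"
    for label in labels:
        if label.startswith("I-"):
            entity_type = label[2:]
            if prev not in (f"B-{entity_type}", f"I-{entity_type}"):
                return False
        prev = label
    return True


print(is_valid_bio(["B-PER", "I-PER", "O", "B-LOC"]))  # True
print(is_valid_bio(["O", "I-LOC", "O"]))               # False: I-LOC without a preceding B-LOC
```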

CRF for Named Entity Recognition

Let's build a more complete NER system using a larger dataset. We'll use the Spanish portion of the CoNLL-2002 shared task corpus, which follows the same IOB annotation format as the well-known CoNLL-2003 benchmark for NER evaluation.

In[28]:
Code
from nltk.corpus import conll2002

# Download the corpus if needed (run once locally, then comment out)
# nltk.download("conll2002", quiet=True)

# Load Spanish NER data (smaller than English, good for demonstration)
train_sents = list(conll2002.iob_sents("esp.train"))[
    :500
]  # Use subset for speed
test_sents = list(conll2002.iob_sents("esp.testa"))[:100]
Out[29]:
Console
CoNLL-2002 Spanish NER Dataset:
==================================================
Training sentences: 500
Test sentences: 100

Example sentence:
--------------------------------------------------
  Izquierda       NC       B-ORG
  Unida           AQ       I-ORG
  de              SP       I-ORG
  Santander       NC       I-ORG
  presentó        VMI      O
  hoy             RG       O
  su              DP       O
  nuevo           AQ       O
  ... (38 tokens total)

Now let's define richer features for NER and train a CRF:

In[30]:
Code
def word_to_ner_features(sent, i):
    """
    Extract comprehensive features for NER.

    sent: list of (word, pos, ner) tuples
    i: position
    """
    word = sent[i][0]
    pos = sent[i][1]

    features = {
        "bias": 1.0,
        "word.lower": word.lower(),
        "word[-3:]": word[-3:],
        "word[-2:]": word[-2:],
        "word[:3]": word[:3],
        "word[:2]": word[:2],
        "word.isupper": word.isupper(),
        "word.istitle": word.istitle(),
        "word.isdigit": word.isdigit(),
        "word.length": len(word),
        "postag": pos,
        "postag[:2]": pos[:2],
    }

    # Previous word features
    if i > 0:
        word_prev = sent[i - 1][0]
        pos_prev = sent[i - 1][1]
        features.update(
            {
                "-1:word.lower": word_prev.lower(),
                "-1:word.istitle": word_prev.istitle(),
                "-1:word.isupper": word_prev.isupper(),
                "-1:postag": pos_prev,
                "-1:postag[:2]": pos_prev[:2],
            }
        )
    else:
        features["BOS"] = True

    # Next word features
    if i < len(sent) - 1:
        word_next = sent[i + 1][0]
        pos_next = sent[i + 1][1]
        features.update(
            {
                "+1:word.lower": word_next.lower(),
                "+1:word.istitle": word_next.istitle(),
                "+1:word.isupper": word_next.isupper(),
                "+1:postag": pos_next,
                "+1:postag[:2]": pos_next[:2],
            }
        )
    else:
        features["EOS"] = True

    return features


def sent_to_ner_features(sent):
    return [word_to_ner_features(sent, i) for i in range(len(sent))]


def sent_to_ner_labels(sent):
    return [ner for word, pos, ner in sent]


# Prepare data
X_train_ner = [sent_to_ner_features(s) for s in train_sents]
y_train_ner = [sent_to_ner_labels(s) for s in train_sents]
X_test_ner = [sent_to_ner_features(s) for s in test_sents]
y_test_ner = [sent_to_ner_labels(s) for s in test_sents]

# Train CRF
crf_ner = sklearn_crfsuite.CRF(
    algorithm="lbfgs",
    c1=0.1,
    c2=0.1,
    max_iterations=100,
    all_possible_transitions=True,
)

crf_ner.fit(X_train_ner, y_train_ner)

# Predict
y_pred_ner = crf_ner.predict(X_test_ner)

# Get labels (excluding O for focused evaluation)
labels = list(crf_ner.classes_)
labels.remove("O")
Out[31]:
Console
NER Performance on Test Set:
============================================================
Overall F1 (weighted): 0.5292

Per-entity metrics:
------------------------------------------------------------
              precision    recall  f1-score   support

       B-LOC      0.675     0.500     0.574        54
       B-ORG      0.625     0.738     0.677        61
       B-PER      0.460     0.657     0.541        35
       I-PER      0.519     0.667     0.583        42
      B-MISC      0.667     0.200     0.308        20
       I-ORG      0.527     0.617     0.569        47
       I-LOC      0.333     0.207     0.255        29
      I-MISC      0.833     0.286     0.426        35

   micro avg      0.560     0.533     0.546       323
   macro avg      0.580     0.484     0.492       323
weighted avg      0.586     0.533     0.529       323

The per-entity metrics show how well the CRF recognizes each entity type. Precision measures how many predicted entities are correct, while recall measures how many actual entities were found. The F1 score balances both. Performance varies by entity type, with some entities being easier to recognize due to more distinctive features or more training examples.
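A report like the one above can be produced with sklearn-crfsuite's metrics module; here is a minimal sketch, assuming `y_test_ner`, `y_pred_ner`, and `labels` from the cells above are in scope:

```python
from sklearn_crfsuite import metrics

# Weighted F1 over the entity labels (O was removed from `labels` above)
f1 = metrics.flat_f1_score(y_test_ner, y_pred_ner, average="weighted", labels=labels)
print(f"Overall F1 (weighted): {f1:.4f}")

# Per-label precision, recall, and F1
print(metrics.flat_classification_report(y_test_ner, y_pred_ner, labels=labels, digits=3))
```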

Out[32]:
Console

Sample Predictions:
============================================================

Sentence 1:
  Sao             gold: B-LOC      pred: B-LOC      ✓
  Paulo           gold: I-LOC      pred: I-LOC      ✓
  Brasil          gold: B-LOC      pred: B-ORG      ✗
  EFECOM          gold: B-ORG      pred: B-ORG      ✓
  ... (12 tokens)

Sentence 2:

Sentence 3:
  Telefónica      gold: B-ORG      pred: B-ORG      ✓
  ... (51 tokens)

The sample predictions show the CRF in action on real text. Checkmarks indicate correct predictions, while crosses highlight errors. Common error patterns include confusing similar entity types or missing entity boundaries, especially for entities not well-represented in training data.

Neural CRF Layers

Modern sequence labeling combines neural networks with CRF layers. The neural network learns feature representations, while the CRF layer models label dependencies and ensures valid output sequences.

BiLSTM-CRF Architecture

The BiLSTM-CRF architecture uses a bidirectional LSTM to encode contextual representations of each token, then feeds these representations to a CRF layer that models transition constraints between labels. The LSTM learns features; the CRF learns structure.

In this architecture:

  1. Word embeddings represent each token as a dense vector
  2. A BiLSTM processes the sequence, producing hidden states that capture bidirectional context
  3. A linear layer projects hidden states to label scores (emission scores)
  4. The CRF layer combines emission scores with learned transition scores

The CRF layer adds a transition matrix T where T_{ij} is the score for transitioning from label i to label j (the implementation below stores the transposed convention, with transitions[i, j] scoring the move from j to i). During training, we maximize the log-likelihood of the correct label sequence. During inference, Viterbi decoding finds the best-scoring sequence.
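Concretely, writing E_{t, y_t} for the emission score of label y_t at position t and using the T_{ij} = score of i \to j convention from the paragraph above (with y_0 = START), the score of a label sequence and its conditional probability are:

\mathrm{score}(\mathbf{x}, \mathbf{y}) = \sum_{t=1}^{n} E_{t, y_t} + \sum_{t=1}^{n} T_{y_{t-1}, y_t} \qquad P(\mathbf{y} \mid \mathbf{x}) = \frac{\exp\left(\mathrm{score}(\mathbf{x}, \mathbf{y})\right)}{Z(\mathbf{x})}

Training minimizes \log Z(\mathbf{x}) - \mathrm{score}(\mathbf{x}, \mathbf{y}), which is exactly the neg_log_likelihood computed in the implementation below.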

In[33]:
Code
import torch
import torch.nn as nn


class CRFLayer(nn.Module):
    """
    CRF layer for sequence labeling.

    Implements the forward algorithm for training (computing log-likelihood)
    and Viterbi algorithm for inference (finding best path).
    """

    def __init__(self, num_tags, start_tag_idx, end_tag_idx):
        super().__init__()
        self.num_tags = num_tags
        self.start_tag_idx = start_tag_idx
        self.end_tag_idx = end_tag_idx

        # Transition scores: transitions[i, j] = score of j -> i
        self.transitions = nn.Parameter(torch.randn(num_tags, num_tags))

        # No transitions to START, no transitions from END
        self.transitions.data[start_tag_idx, :] = -10000
        self.transitions.data[:, end_tag_idx] = -10000

    def forward_algorithm(self, emissions):
        """
        Compute log partition function using forward algorithm.

        emissions: (seq_len, num_tags) emission scores
        Returns: log Z(x)
        """
        seq_len = emissions.size(0)

        # Initialize: score of starting in each tag
        # Start from START tag, so first real tag scores from START transition
        init_alphas = torch.full((self.num_tags,), -10000.0)
        init_alphas[self.start_tag_idx] = 0.0

        forward_var = init_alphas

        for emission in emissions:
            alphas_t = []
            for next_tag in range(self.num_tags):
                emit_score = emission[next_tag].view(1).expand(self.num_tags)
                trans_score = self.transitions[next_tag]
                next_tag_var = forward_var + trans_score + emit_score
                alphas_t.append(torch.logsumexp(next_tag_var, dim=0).view(1))
            forward_var = torch.cat(alphas_t)

        # Transition to END tag
        terminal_var = forward_var + self.transitions[self.end_tag_idx]
        log_Z = torch.logsumexp(terminal_var, dim=0)
        return log_Z

    def score_sequence(self, emissions, tags):
        """
        Compute score of a specific tag sequence.

        emissions: (seq_len, num_tags)
        tags: (seq_len,) ground truth tags
        Returns: score of this sequence
        """
        score = torch.zeros(1)
        tags = torch.cat([torch.tensor([self.start_tag_idx]), tags])

        for i, emission in enumerate(emissions):
            score += self.transitions[tags[i + 1], tags[i]]
            score += emission[tags[i + 1]]

        score += self.transitions[self.end_tag_idx, tags[-1]]
        return score

    def neg_log_likelihood(self, emissions, tags):
        """
        Compute negative log-likelihood for training.

        NLL = log Z(x) - score(y)
        """
        log_Z = self.forward_algorithm(emissions)
        gold_score = self.score_sequence(emissions, tags)
        return log_Z - gold_score

    def viterbi_decode(self, emissions):
        """
        Find best tag sequence using Viterbi algorithm.
        """
        seq_len = emissions.size(0)

        init_vvars = torch.full((self.num_tags,), -10000.0)
        init_vvars[self.start_tag_idx] = 0.0

        forward_var = init_vvars
        backpointers = []

        for emission in emissions:
            bptrs_t = []
            vvars_t = []

            for next_tag in range(self.num_tags):
                next_tag_var = forward_var + self.transitions[next_tag]
                best_tag_id = torch.argmax(next_tag_var)
                bptrs_t.append(best_tag_id.item())
                vvars_t.append(next_tag_var[best_tag_id].view(1))

            forward_var = torch.cat(vvars_t) + emission
            backpointers.append(bptrs_t)

        # Transition to END
        terminal_var = forward_var + self.transitions[self.end_tag_idx]
        best_tag_id = torch.argmax(terminal_var).item()
        best_score = terminal_var[best_tag_id]

        # Backtrack
        best_path = [best_tag_id]
        for bptrs_t in reversed(backpointers):
            best_tag_id = bptrs_t[best_tag_id]
            best_path.append(best_tag_id)

        # Remove START tag
        best_path.pop()
        best_path.reverse()

        return best_score, best_path


# Demonstrate CRF layer
num_tags = 5  # O, B-PER, I-PER, B-LOC, I-LOC
# For simplicity this demo reuses index 0 (the O tag) as both START and END,
# so transitions into and out of tag 0 carry the -10000 penalty; a real model
# would reserve two dedicated indices for START and END.
crf_layer = CRFLayer(num_tags, start_tag_idx=0, end_tag_idx=0)

# Fake emissions from a neural network
emissions = torch.randn(6, num_tags)  # 6 tokens, 5 tags
fake_tags = torch.tensor([1, 2, 0, 3, 4, 0])  # Example tag sequence
Out[34]:
Console
CRF Layer Demonstration:
==================================================
Log partition function: -19986.0879
Score of tag sequence: -50003.2891
Negative log-likelihood: 30017.2012

Viterbi decoding:
  Best path: [2, 3, 2, 3, 2, 3]
  Best score: -19988.6484

The CRF layer computes three key quantities. The log partition function sums over all possible tag sequences (exponentially many), computed efficiently via the forward algorithm. The sequence score measures how well a specific tag sequence fits the emission and transition scores. The negative log-likelihood (the training loss) is the difference between these: minimizing it encourages the model to assign high scores to correct sequences relative to all alternatives. Viterbi decoding finds the highest-scoring path for inference. The large negative magnitudes in this toy run come from the -10000 START/END penalties: because index 0 doubles as START, END, and the O tag, every path pays about -20000 just for leaving START and entering END, and the gold sequence's visits to tag 0 add three more such penalties, which is why its score lands near -50000. The Viterbi path accordingly avoids tag 0 entirely.

The neural CRF combines the representation learning power of neural networks with the structured prediction capabilities of CRFs. The neural network extracts features automatically from the input, while the CRF layer ensures that the output sequence satisfies structural constraints like valid BIO transitions.

Out[35]:
Visualization
Neural network architecture diagram showing embedding, BiLSTM, linear, and CRF layers stacked vertically.
BiLSTM-CRF architecture for sequence labeling. Word embeddings are fed through a bidirectional LSTM to produce contextual representations. A linear layer projects these to emission scores for each label. The CRF layer combines emission scores with learned transition scores to find the globally optimal label sequence.
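To make the architecture concrete, here is a minimal sketch of a BiLSTM-CRF model built on top of the CRFLayer defined above. The class name, dimensions, and vocabulary size are illustrative assumptions, and the toy usage mirrors the earlier demonstration, including the simplification of reusing index 0 as START and END.

import torch
import torch.nn as nn


class BiLSTMCRF(nn.Module):
    """Minimal BiLSTM-CRF sketch: embeddings -> BiLSTM -> linear -> CRFLayer."""

    def __init__(self, vocab_size, embed_dim, hidden_dim, num_tags,
                 start_tag_idx, end_tag_idx):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # Bidirectional LSTM whose concatenated hidden states have size hidden_dim
        self.lstm = nn.LSTM(embed_dim, hidden_dim // 2, bidirectional=True)
        self.hidden2tag = nn.Linear(hidden_dim, num_tags)  # emission scores
        self.crf = CRFLayer(num_tags, start_tag_idx, end_tag_idx)

    def _emissions(self, token_ids):
        # token_ids: (seq_len,) word indices -> (seq_len, num_tags) emission scores
        embeds = self.embedding(token_ids).unsqueeze(1)   # (seq_len, 1, embed_dim)
        lstm_out, _ = self.lstm(embeds)                   # (seq_len, 1, hidden_dim)
        return self.hidden2tag(lstm_out.squeeze(1))       # (seq_len, num_tags)

    def loss(self, token_ids, tags):
        # Negative log-likelihood of the gold tag sequence (training objective)
        return self.crf.neg_log_likelihood(self._emissions(token_ids), tags)

    def predict(self, token_ids):
        # Best tag sequence via Viterbi decoding over the emission scores
        return self.crf.viterbi_decode(self._emissions(token_ids))


# Toy usage with arbitrary dimensions; as in the demo above, index 0 doubles
# as START/END for simplicity.
model = BiLSTMCRF(vocab_size=1000, embed_dim=32, hidden_dim=64,
                  num_tags=5, start_tag_idx=0, end_tag_idx=0)
tokens = torch.randint(0, 1000, (6,))
tags = torch.tensor([1, 2, 0, 3, 4, 0])
nll = model.loss(tokens, tags)        # training loss for one sentence
score, path = model.predict(tokens)   # decoded tag indices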

Inference Complexity and Practical Considerations

CRF inference has specific computational characteristics worth understanding for practical applications.

Time Complexity

The key operations scale as follows:

  • Forward/Backward Algorithm: O(nK^2), where n is the sequence length and K is the number of labels
  • Viterbi Decoding: O(nK^2)
  • Gradient Computation: O(nK^2) per training example

The quadratic dependence on K means that CRFs become expensive with many labels. For tasks with hundreds of possible labels (like fine-grained NER with many entity types), this can be a bottleneck.
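As a rough worked example, decoding a 50-token sentence with 10 labels touches on the order of

n K^2 = 50 \times 10^2 = 5{,}000

transition scores, while the same sentence with 100 labels requires 50 \times 100^2 = 500{,}000: a tenfold larger label set means a hundredfold more work.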

Out[36]:
Visualization
Two-panel plot showing CRF complexity scaling: left panel shows linear growth with sequence length, right panel shows quadratic growth with number of labels.
Linear scaling with sequence length (K=10 labels fixed). Operations grow proportionally with n.
Quadratic scaling with label count (n=50 tokens fixed). Operations grow with K², becoming prohibitive for large label sets.

Beam Search Alternative

When exact Viterbi decoding is too slow, beam search provides an approximate alternative. Instead of tracking all K labels at each position, beam search keeps only the top B partial sequences:

In[37]:
Code
def beam_search_decode(emissions, transitions, beam_width=3, num_tags=5):
    """
    Approximate decoding using beam search.

    Keeps only top-k partial sequences at each step.
    """
    seq_len = emissions.shape[0]

    # Each beam entry: (score, path)
    beams = [(0.0, [])]

    for t in range(seq_len):
        candidates = []

        for score, path in beams:
            prev_tag = path[-1] if path else 0  # 0 stands in for START

            for next_tag in range(num_tags):
                if t == 0:
                    # Simplification: no START transition score at the first step
                    trans_score = 0.0
                else:
                    # Assumes transitions[i, j] scores the move i -> j; the
                    # CRFLayer above stores the transposed convention (j -> i),
                    # so these scores are not directly comparable to Viterbi's
                    trans_score = transitions[prev_tag, next_tag]

                emit_score = emissions[t, next_tag]
                new_score = score + trans_score + emit_score
                candidates.append((new_score, path + [next_tag]))

        # Keep top beam_width
        candidates.sort(reverse=True, key=lambda x: x[0])
        beams = candidates[:beam_width]

    return beams[0]  # Best sequence


# Compare beam search with different widths
emissions_np = emissions.detach().numpy()
transitions_np = crf_layer.transitions.detach().numpy()
Out[38]:
Console
Beam Search vs. Viterbi:
==================================================
Viterbi (exact):  path=[2, 3, 2, 3, 2, 3], score=-19988.6484
Beam width 1:     path=[2, 3, 2, 3, 2, 1], score=10.3197 ✗
Beam width 2:     path=[2, 3, 2, 3, 2, 1], score=10.3197 ✗
Beam width 3:     path=[2, 3, 2, 3, 2, 1], score=10.3197 ✗
Beam width 5:     path=[2, 3, 2, 3, 2, 1], score=10.3197 ✗

Beam search with width 1 is greedy decoding. As the beam widens, the result approaches the exact Viterbi solution, and the per-step cost grows only linearly in the beam width B rather than quadratically in the number of labels K, which pays off for very long sequences or large label sets. In this toy comparison the beam scores are not directly comparable to the Viterbi score: the helper omits the START/END transitions and indexes the transition matrix with the opposite convention from the CRF layer, which is why every beam width disagrees with the exact path here.

Limitations and When CRFs Fall Short

Despite their power, CRFs have limitations that have driven the development of alternative approaches:

  • Label Set Size: The O(K^2) complexity becomes prohibitive with large label sets. Tasks with hundreds of labels, such as fine-grained entity typing or word sense disambiguation, may require approximations or alternative architectures. Hierarchical CRFs and sparse transition matrices can help, but fundamentally, CRFs work best with modest numbers of labels.

  • Long-Range Dependencies: Linear-chain CRFs model only adjacent label dependencies. If a label at position 10 directly depends on the label at position 1, the CRF cannot capture this except through a chain of intermediate influences. Higher-order CRFs exist but are exponentially more expensive. For tasks requiring long-range dependencies, attention mechanisms in transformers often work better.

  • Feature Engineering: Traditional CRFs require manual feature engineering. While neural CRFs learn features automatically, they add the complexity of neural network training: hyperparameter tuning, GPU requirements, and longer training times. For some applications, a well-engineered feature-based CRF may outperform a neural CRF with less effort.

  • Training Data Requirements: CRFs need labeled sequences for training. Unlike language models that can pretrain on unlabeled text, CRF features and transitions are learned entirely from labeled data. This makes them data-hungry for complex tasks. The combination of pretrained transformers with CRF layers addresses this by providing rich features from unlabeled pretraining.

These limitations don't diminish CRFs' importance. They remain the go-to structured prediction layer for sequence labeling, especially in production systems where interpretable features and reliable inference matter. Understanding both their strengths and limitations helps you choose the right tool for each problem.

Key Parameters

When working with CRFs using sklearn-crfsuite, the following parameters most significantly affect model performance:

  • algorithm: The optimization algorithm for training. Options include "lbfgs" (Limited-memory BFGS, default), "l2sgd" (SGD with L2 regularization), "ap" (Averaged Perceptron), and "pa" (Passive Aggressive). L-BFGS typically converges faster and works well for most tasks. SGD variants may be preferred for very large datasets that don't fit in memory.

  • c1 (L1 regularization): Controls feature sparsity. Higher values push more feature weights to exactly zero, producing simpler models with fewer active features. Start with values between 0.0 and 0.5. Useful when you have many features and want automatic feature selection.

  • c2 (L2 regularization): Controls weight magnitude. Higher values prevent any single feature from dominating predictions. Start with values between 0.01 and 0.1. Helps prevent overfitting, especially with limited training data.

  • max_iterations: Maximum number of optimization iterations. Default is 100, which suffices for most tasks. Increase if the model hasn't converged (check training logs). Very large values rarely help and slow training.

  • all_possible_transitions: When True, includes transition weights for all label pairs, even those not observed in training. This allows the model to learn that unseen transitions should have negative weights. Generally recommended for BIO tagging where you want to explicitly penalize invalid transitions like B-PER to I-LOC. A sketch for inspecting the learned transition weights follows this list.

  • all_possible_states: When True, includes state features for all labels, even for features not observed with that label in training. Useful when test data may contain feature-label combinations not seen during training.
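Because all_possible_transitions=True makes the model learn a weight for every label pair, the learned transition weights can be inspected directly. The following is a minimal sketch, assuming the crf_ner model trained above and the transition_features_ attribute exposed by sklearn-crfsuite.

from collections import Counter

# Learned transition weights: (from_label, to_label) -> weight.
# Invalid BIO moves such as O -> I-PER should receive strongly negative weights.
transition_weights = Counter(crf_ner.transition_features_)

print("Most likely transitions:")
for (src, dst), weight in transition_weights.most_common(5):
    print(f"  {src:8s} -> {dst:8s} {weight:+.3f}")

print("Least likely transitions:")
for (src, dst), weight in transition_weights.most_common()[-5:]:
    print(f"  {src:8s} -> {dst:8s} {weight:+.3f}")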

For neural CRFs, key parameters include the hidden dimension of the LSTM (typically 64-256), dropout rate (0.1-0.5), and learning rate (1e-4 to 1e-3 with Adam optimizer).
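In practice, c1 and c2 are tuned together. Here is a minimal randomized-search sketch, assuming the X_train_ner, y_train_ner, and labels variables from the NER example and a scikit-learn version that still accepts sklearn-crfsuite's estimator interface; the distributions and n_iter are illustrative choices.

import scipy.stats
import sklearn_crfsuite
from sklearn_crfsuite import metrics
from sklearn.metrics import make_scorer
from sklearn.model_selection import RandomizedSearchCV

# Sample regularization strengths and score candidates with token-level weighted F1
crf = sklearn_crfsuite.CRF(
    algorithm="lbfgs", max_iterations=100, all_possible_transitions=True
)
param_space = {
    "c1": scipy.stats.expon(scale=0.5),
    "c2": scipy.stats.expon(scale=0.05),
}
f1_scorer = make_scorer(metrics.flat_f1_score, average="weighted", labels=labels)

search = RandomizedSearchCV(crf, param_space, cv=3, n_iter=20, scoring=f1_scorer)
search.fit(X_train_ner, y_train_ner)
print("Best params:", search.best_params_)
print("Best CV F1:", search.best_score_)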

Summary

Conditional Random Fields represent a fundamental advance in sequence labeling, shifting from generative to discriminative modeling to unlock richer feature representations.

Key concepts from this chapter:

Generative vs. Discriminative: HMMs model P(X, Y) and derive predictions through Bayes' rule. CRFs directly model P(Y|X), avoiding independence assumptions about observations. This allows arbitrary overlapping features of the input sequence.

Log-Linear Formulation: CRFs define probabilities through weighted feature functions:

P(\mathbf{y} | \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\left( \sum_{t=1}^{n} \sum_k \lambda_k f_k(y_{t-1}, y_t, \mathbf{x}, t) \right)

where \mathbf{y} is the label sequence, \mathbf{x} is the observation sequence, \lambda_k are learned weights, f_k are feature functions, and Z(\mathbf{x}) normalizes the distribution. Feature functions capture both label-observation relationships (like HMM emissions) and label-label relationships (like HMM transitions), plus arbitrary combinations.

Partition Function: The normalization constant Z(\mathbf{x}) sums over all possible label sequences. The forward algorithm computes it efficiently in O(nK^2) time using dynamic programming.

Inference: The Viterbi algorithm finds the most likely label sequence in O(nK^2) time. Beam search provides an approximate alternative for very large label sets.

Feature Functions: CRF power comes from rich features:

  • State features link observations to labels (word suffixes, capitalization, POS tags)
  • Transition features capture label-label patterns (valid BIO transitions)
  • Context features use surrounding words and their properties

Neural CRFs: Modern architectures combine neural networks (BiLSTMs, transformers) for feature learning with CRF layers for structured prediction. The neural network learns representations; the CRF enforces output constraints.

Implementation: Libraries like sklearn-crfsuite provide efficient CRF training and inference. Feature engineering remains important for traditional CRFs, while neural CRFs learn features automatically but require more training infrastructure.

CRFs transformed sequence labeling by providing a principled framework for combining rich, overlapping features with structured output constraints. They remain the foundation of production NER, POS tagging, and chunking systems, either as traditional feature-based models or as structured output layers in neural architectures. The next chapter explores CRF training in detail: the forward-backward algorithm, gradient computation, and optimization techniques that make learning possible.

Quiz

Ready to test your understanding? Take this quick quiz to reinforce what you've learned about Conditional Random Fields and discriminative sequence labeling.




About the author: Michael Brenndoerfer

All opinions expressed here are my own and do not reflect the views of my employer.

Michael currently works as an Associate Director of Data Science at EQT Partners in Singapore, leading AI and data initiatives across private capital investments.

With over a decade of experience spanning private equity, management consulting, and software engineering, he specializes in building and scaling analytics capabilities from the ground up. He has published research in leading AI conferences and holds expertise in machine learning, natural language processing, and value creation through data.
