CRF Training: Forward-Backward Algorithm, Gradients & L-BFGS Optimization

Michael Brenndoerfer · December 15, 2025 · 26 min read

Master Conditional Random Field training with the forward-backward algorithm, gradient computation, and L-BFGS optimization for sequence labeling tasks.


CRF Training

Conditional Random Fields are powerful models for sequence labeling, but their power comes with computational challenges. Unlike simple classifiers that make independent predictions for each token, CRFs model dependencies between labels across the entire sequence. Training a CRF means finding weights that maximize the probability of correct label sequences given their input features, while accounting for all possible alternative labelings.

This chapter explores how to train CRFs effectively. You'll learn the mathematical foundation of the log-likelihood objective, understand why the forward-backward algorithm is essential for efficient gradient computation, see how modern optimizers like L-BFGS converge faster than simple gradient descent, and discover how feature templates and regularization shape what a CRF can learn. By the end, you'll be able to train CRFs on real sequence labeling tasks and understand what's happening under the hood.

The Log-Likelihood Objective

Training a CRF means finding feature weights that make correct label sequences probable. Given training data consisting of observation sequences $\mathbf{x}^{(i)}$ and their correct label sequences $\mathbf{y}^{(i)}$, we want to maximize the conditional probability of seeing these correct labels.

Log-Likelihood

The log-likelihood is the sum of log probabilities of the correct label sequences across all training examples. Taking the log transforms products into sums and makes optimization numerically stable.

For a single training example with observation sequence $\mathbf{x}$ and label sequence $\mathbf{y}$, the CRF defines the conditional probability as:

$$P(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\left(\sum_{t=1}^{T} \sum_{k} \lambda_k f_k(y_{t-1}, y_t, \mathbf{x}, t)\right)$$

where:

  • $\mathbf{y} = (y_1, y_2, \ldots, y_T)$: the label sequence of length $T$
  • $\mathbf{x}$: the observation sequence (features at each position)
  • $f_k(y_{t-1}, y_t, \mathbf{x}, t)$: the $k$-th feature function, which examines the previous label $y_{t-1}$, the current label $y_t$, the full observation sequence $\mathbf{x}$, and the position $t$
  • $\lambda_k$: the weight for feature $k$ (what we're learning)
  • $Z(\mathbf{x})$: the partition function, which sums over all possible label sequences to normalize probabilities

The partition function deserves special attention:

$$Z(\mathbf{x}) = \sum_{\mathbf{y}'} \exp\left(\sum_{t=1}^{T} \sum_{k} \lambda_k f_k(y'_{t-1}, y'_t, \mathbf{x}, t)\right)$$

where:

  • $\mathbf{y}'$: ranges over all possible label sequences
  • The sum includes $|L|^T$ sequences, where $|L|$ is the number of possible labels

This summation over exponentially many sequences is what makes CRF training computationally interesting. With 9 labels and 20 tokens, we'd have $9^{20} \approx 10^{19}$ possible sequences. Enumerating them directly is impossible.

The Objective Function

Taking the log of the conditional probability gives us the log-likelihood for a single example:

$$\log P(\mathbf{y} \mid \mathbf{x}) = \sum_{t=1}^{T} \sum_{k} \lambda_k f_k(y_{t-1}, y_t, \mathbf{x}, t) - \log Z(\mathbf{x})$$

The first term is the "score" of the correct sequence, which is easy to compute. The second term, $\log Z(\mathbf{x})$, requires summing over all possible sequences, which is the computational bottleneck.

For the full training set with NN examples, we sum the log-likelihoods:

$$\mathcal{L}(\boldsymbol{\lambda}) = \sum_{i=1}^{N} \log P(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)})$$

where:

  • $\mathcal{L}(\boldsymbol{\lambda})$: the total log-likelihood as a function of all weights
  • $\boldsymbol{\lambda} = (\lambda_1, \lambda_2, \ldots, \lambda_K)$: the vector of all $K$ feature weights

Our goal is to find $\boldsymbol{\lambda}^* = \arg\max_{\boldsymbol{\lambda}} \mathcal{L}(\boldsymbol{\lambda})$.

To get a feel for this objective before deriving the training algorithm, we can evaluate it directly on a problem small enough to solve by brute force:
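
The sketch below is a minimal illustration with made-up feature values: a two-label, three-position model with a single shared weight $\lambda$, small enough that $Z(\mathbf{x})$ can be computed by enumerating all $2^3 = 8$ label sequences. Sweeping $\lambda$ over a grid and printing $\mathcal{L}(\lambda)$ shows the objective directly; because it is concave in $\lambda$, the printed values never show more than one peak.

Code
import itertools

import numpy as np

# Toy problem: 3 positions, 2 labels, one feature with shared weight lam.
# feature_counts[t, i, j] = value of the feature for transition i -> j at position t.
rng = np.random.default_rng(0)
feature_counts = rng.integers(0, 2, size=(3, 2, 2)).astype(float)
true_seq = (0, 1, 1)


def toy_log_likelihood(lam):
    """Brute-force log P(y | x) for the toy model with weight lam."""

    def score(seq):
        total, prev = 0.0, 0  # treat the start symbol as label 0
        for t, label in enumerate(seq):
            total += lam * feature_counts[t, prev, label]
            prev = label
        return total

    # Partition function: enumerate all 2^3 = 8 label sequences
    log_Z = np.logaddexp.reduce(
        [score(seq) for seq in itertools.product(range(2), repeat=3)]
    )
    return score(true_seq) - log_Z


for lam in [-4.0, -2.0, -1.0, 0.0, 1.0, 2.0, 4.0]:
    print(f"lambda = {lam:+.1f}   log-likelihood = {toy_log_likelihood(lam):.4f}")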

The log-likelihood function for CRFs has a crucial property: it is concave. This means there are no local maxima to get trapped in. Any hill-climbing algorithm will find the global optimum given enough iterations. This mathematical guarantee makes CRF training well-behaved compared to neural network training, where the loss landscape is riddled with local minima and saddle points.

The Forward-Backward Algorithm

Computing the partition function $Z(\mathbf{x})$ by enumerating all sequences is intractable. The forward-backward algorithm exploits the sequential structure of the model to compute it in polynomial time.

The key insight is that the CRF score decomposes into local terms. At each position $t$, the contribution depends only on the label at $t$ and the label at $t-1$. This Markov property enables dynamic programming: we can build up the sum over all sequences incrementally, position by position.

Forward Variables

The forward variable $\alpha_t(j)$ represents the sum of scores for all partial label sequences that end with label $j$ at position $t$:

$$\alpha_t(j) = \sum_{\mathbf{y}_{1:t}:\, y_t = j} \exp\left(\sum_{t'=1}^{t} \sum_{k} \lambda_k f_k(y_{t'-1}, y_{t'}, \mathbf{x}, t')\right)$$

where:

  • $\mathbf{y}_{1:t}$: all possible label sequences from position 1 to $t$
  • $y_t = j$: constrained to end with label $j$
  • The summation accumulates the scores of all paths reaching state $j$ at time $t$

We compute forward variables recursively:

Base case (t = 1):

$$\alpha_1(j) = \exp\left(\sum_{k} \lambda_k f_k(y_0, j, \mathbf{x}, 1)\right)$$

where $y_0$ is a special start symbol.

Recursive case:

$$\alpha_t(j) = \sum_{i \in L} \alpha_{t-1}(i) \cdot \exp\left(\sum_{k} \lambda_k f_k(i, j, \mathbf{x}, t)\right)$$

This recursion says: to get the total score of paths ending in $j$ at position $t$, sum over all possible previous labels $i$, multiplying the forward score at $t-1$ (all paths ending in $i$) by the transition score from $i$ to $j$.

The partition function is simply the sum of all forward variables at the final position:

$$Z(\mathbf{x}) = \sum_{j \in L} \alpha_T(j)$$

Backward Variables

The backward variable $\beta_t(i)$ represents the sum of scores for all partial sequences starting from position $t+1$, given that position $t$ has label $i$:

$$\beta_t(i) = \sum_{\mathbf{y}_{t+1:T}} \exp\left(\sum_{t'=t+1}^{T} \sum_{k} \lambda_k f_k(y_{t'-1}, y_{t'}, \mathbf{x}, t')\right)$$

where $y_t = i$ is fixed.

Base case (t = T):

$$\beta_T(i) = 1 \quad \text{for all } i$$

Recursive case:

$$\beta_t(i) = \sum_{j \in L} \beta_{t+1}(j) \cdot \exp\left(\sum_{k} \lambda_k f_k(i, j, \mathbf{x}, t+1)\right)$$

Computing Marginal Probabilities

The forward and backward variables together give us the probability of any label at any position. The posterior marginal probability of label $j$ at position $t$ is:

$$P(y_t = j \mid \mathbf{x}) = \frac{\alpha_t(j) \cdot \beta_t(j)}{Z(\mathbf{x})}$$

And the probability of a specific transition from $i$ to $j$ between positions $t-1$ and $t$:

$$P(y_{t-1} = i, y_t = j \mid \mathbf{x}) = \frac{\alpha_{t-1}(i) \cdot M_t(i, j) \cdot \beta_t(j)}{Z(\mathbf{x})}$$

where $M_t(i, j) = \exp\left(\sum_k \lambda_k f_k(i, j, \mathbf{x}, t)\right)$ is the transition score matrix at position $t$.

Let's implement the forward-backward algorithm:

In[5]:
Code
import numpy as np


def forward_backward(log_potentials, transition_scores):
    """
    Compute forward and backward variables for a CRF.

    Args:
        log_potentials: Array of shape (T, num_labels) with emission scores
        transition_scores: Array of shape (num_labels, num_labels) with transition scores

    Returns:
        alpha: Forward variables (T, num_labels)
        beta: Backward variables (T, num_labels)
        log_Z: Log partition function
    """
    T, num_labels = log_potentials.shape

    # Forward pass (in log space for numerical stability)
    log_alpha = np.zeros((T, num_labels))
    log_alpha[0] = log_potentials[0]

    for t in range(1, T):
        for j in range(num_labels):
            # Sum over all previous labels
            scores = (
                log_alpha[t - 1]
                + transition_scores[:, j]
                + log_potentials[t, j]
            )
            log_alpha[t, j] = np.logaddexp.reduce(scores)

    # Backward pass
    log_beta = np.zeros((T, num_labels))
    log_beta[T - 1] = 0  # log(1) = 0

    for t in range(T - 2, -1, -1):
        for i in range(num_labels):
            # Sum over all next labels
            scores = (
                log_beta[t + 1]
                + transition_scores[i, :]
                + log_potentials[t + 1]
            )
            log_beta[t, i] = np.logaddexp.reduce(scores)

    # Partition function
    log_Z = np.logaddexp.reduce(log_alpha[T - 1])

    return log_alpha, log_beta, log_Z
In[6]:
Code
# Example: 5-position sequence with 3 labels (O, B-PER, I-PER)
np.random.seed(42)
T, num_labels = 5, 3
label_names = ["O", "B-PER", "I-PER"]

# Emission scores (from features)
log_potentials = np.random.randn(T, num_labels)

# Transition scores (learned)
transition_scores = np.random.randn(num_labels, num_labels)
# Make I-PER after O unlikely (BIO constraint approximation)
transition_scores[0, 2] = -5  # O -> I-PER

log_alpha, log_beta, log_Z = forward_backward(log_potentials, transition_scores)

# Compute marginal probabilities
log_marginals = log_alpha + log_beta - log_Z
marginals = np.exp(log_marginals)
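
As a quick sanity check (a small addition to the listing above), we can confirm that the marginals at each position sum to 1 and that the dynamic-programming value of $\log Z(\mathbf{x})$ matches brute-force enumeration, which is still feasible here with only $3^5 = 243$ label sequences.

Code
import itertools

# Marginals should form a probability distribution over labels at every position
print("Per-position marginal sums:", marginals.sum(axis=1))


def sequence_score(seq):
    """Total score of one complete label sequence under the current parameters."""
    score = log_potentials[0, seq[0]]
    for t in range(1, T):
        score += transition_scores[seq[t - 1], seq[t]] + log_potentials[t, seq[t]]
    return score


# Brute-force log Z over all |L|^T label sequences
brute_log_Z = np.logaddexp.reduce(
    [sequence_score(seq) for seq in itertools.product(range(num_labels), repeat=T)]
)
print("forward-backward log Z:", log_Z)
print("brute-force log Z:     ", brute_log_Z)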

Notice how the marginals sum to 1.0 at each position. This is because they represent the probability distribution over labels at that position, marginalized over all possible label sequences. The forward-backward algorithm computes these marginals in $O(T \cdot |L|^2)$ time, compared to $O(|L|^T)$ for brute-force enumeration.

Viewed as a trellis, the forward and backward passes meet at each node: a node corresponds to a label at a position, and its marginal probability $\alpha_t(j) \cdot \beta_t(j) / Z(\mathbf{x})$ is high exactly when many high-scoring paths pass through it. States with high marginal probability are the ones most likely to appear in the predicted sequence.

Gradient Computation

To optimize the log-likelihood, we need its gradient with respect to each weight $\lambda_k$. The gradient has an elegant form that emerges from the structure of the CRF.

Taking the derivative of the log-likelihood for a single example:

$$\frac{\partial \log P(\mathbf{y} \mid \mathbf{x})}{\partial \lambda_k} = \sum_{t=1}^{T} f_k(y_{t-1}, y_t, \mathbf{x}, t) - \sum_{t=1}^{T} \sum_{i, j} P(y_{t-1} = i, y_t = j \mid \mathbf{x}) \cdot f_k(i, j, \mathbf{x}, t)$$

where:

  • The first term is the observed feature count: how often feature $k$ fires on the correct label sequence
  • The second term is the expected feature count: the expected number of times feature $k$ would fire under the model's current probability distribution

This gradient has an intuitive interpretation:

  • If a feature fires more often in the training data than the model expects, increase its weight (positive gradient)
  • If a feature fires less often than expected, decrease its weight (negative gradient)
  • When observed counts match expected counts, the gradient is zero (we've reached a stationary point)

Computing Expected Feature Counts

The expected feature count requires summing over all positions and all label pairs, weighted by their posterior probabilities:

$$\mathbb{E}[f_k] = \sum_{t=1}^{T} \sum_{i, j} P(y_{t-1} = i, y_t = j \mid \mathbf{x}) \cdot f_k(i, j, \mathbf{x}, t)$$

The forward-backward algorithm gives us exactly what we need: the pairwise marginals $P(y_{t-1} = i, y_t = j \mid \mathbf{x})$.

In[11]:
Code
def compute_gradient(
    log_potentials, transition_scores, true_labels, log_alpha, log_beta, log_Z
):
    """
    Compute gradient of log-likelihood for one training example.

    Returns gradients for emission potentials and transition scores.
    """
    T, num_labels = log_potentials.shape

    # Gradient for emission scores
    emission_grad = np.zeros_like(log_potentials)

    # Observed counts: feature fires at correct positions
    for t, label in enumerate(true_labels):
        emission_grad[t, label] += 1.0

    # Expected counts: subtract marginal probabilities
    marginals = np.exp(log_alpha + log_beta - log_Z)
    emission_grad -= marginals

    # Gradient for transition scores
    transition_grad = np.zeros_like(transition_scores)

    # Observed transition counts
    for t in range(1, len(true_labels)):
        prev_label, curr_label = true_labels[t - 1], true_labels[t]
        transition_grad[prev_label, curr_label] += 1.0

    # Expected transition counts
    for t in range(1, T):
        for i in range(num_labels):
            for j in range(num_labels):
                log_prob = (
                    log_alpha[t - 1, i]
                    + transition_scores[i, j]
                    + log_potentials[t, j]
                    + log_beta[t, j]
                    - log_Z
                )
                transition_grad[i, j] -= np.exp(log_prob)

    return emission_grad, transition_grad
In[12]:
Code
# Example: gradient computation
true_labels = [0, 1, 2, 2, 0]  # O, B-PER, I-PER, I-PER, O

emission_grad, transition_grad = compute_gradient(
    log_potentials, transition_scores, true_labels, log_alpha, log_beta, log_Z
)

The gradient reveals which adjustments would improve the model. Positive gradients at the true labels mean the model should increase those scores. Negative gradients at incorrect labels mean those should be suppressed. The transition gradient shows similar patterns: transitions that occurred in the true sequence but have low model probability will have positive gradients.
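
A quick way to build trust in compute_gradient is a finite-difference check. The sketch below (the helper log_likelihood_of is added here purely for illustration) perturbs a single transition score and compares the resulting numerical slope of the log-likelihood with the analytic gradient computed above; the two should agree to several decimal places.

Code
def log_likelihood_of(log_potentials, transitions, labels):
    """Log P(y | x) for one sequence: score of the labeled path minus log Z."""
    _, _, lz = forward_backward(log_potentials, transitions)
    score = log_potentials[0, labels[0]]
    for t in range(1, len(labels)):
        score += transitions[labels[t - 1], labels[t]] + log_potentials[t, labels[t]]
    return score - lz


eps = 1e-6
i, j = 1, 2  # check the B-PER -> I-PER transition weight
perturbed = transition_scores.copy()
perturbed[i, j] += eps

numeric = (
    log_likelihood_of(log_potentials, perturbed, true_labels)
    - log_likelihood_of(log_potentials, transition_scores, true_labels)
) / eps

print("numeric gradient: ", numeric)
print("analytic gradient:", transition_grad[i, j])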

L-BFGS Optimization

Simple gradient descent can optimize the CRF objective, but it converges slowly. Each step moves in the direction of the gradient, but the step size is tricky to tune: too large and you overshoot, too small and you crawl toward the optimum.

L-BFGS (Limited-memory Broyden-Fletcher-Goldfarb-Shanno) is a quasi-Newton method that approximates the curvature of the objective function. Instead of using only the gradient direction, it uses gradient history to estimate how the objective curves in different directions, enabling larger steps in flat directions and smaller steps in steep directions.

L-BFGS

L-BFGS is a quasi-Newton optimization algorithm that uses a limited memory approximation to the inverse Hessian matrix. It converges much faster than gradient descent for convex problems like CRF training, typically requiring 50-200 iterations instead of thousands.

The key advantages of L-BFGS for CRF training are:

  • Faster convergence: Uses curvature information to take better steps
  • No learning rate tuning: Line search automatically finds good step sizes
  • Memory efficient: Stores only the last $m$ gradient pairs (typically $m = 10$), not the full Hessian
  • Robust: Works well on the convex CRF objective without careful hyperparameter tuning
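
To make the curvature idea concrete, here is a sketch of the classic two-loop recursion at the heart of L-BFGS (an illustrative, textbook-style implementation, not the code scipy runs internally). Given the stored parameter differences $s$ and gradient differences $y$ from the last $m$ iterations, it converts the current gradient into a quasi-Newton search direction:

Code
def lbfgs_direction(grad, s_history, y_history):
    """Two-loop recursion: approximate -H^{-1} g from stored (s, y) pairs."""
    q = grad.copy()
    rhos = [1.0 / np.dot(y, s) for s, y in zip(s_history, y_history)]
    alphas = []

    # First loop: newest pair to oldest
    for s, y, rho in reversed(list(zip(s_history, y_history, rhos))):
        alpha = rho * np.dot(s, q)
        q -= alpha * y
        alphas.append(alpha)

    # Scale by an estimate of the initial inverse Hessian
    if s_history:
        s, y = s_history[-1], y_history[-1]
        q *= np.dot(s, y) / np.dot(y, y)

    # Second loop: oldest pair to newest
    for (s, y, rho), alpha in zip(
        zip(s_history, y_history, rhos), reversed(alphas)
    ):
        beta = rho * np.dot(y, q)
        q += (alpha - beta) * s

    return -q  # descent direction; a line search then chooses the step length

In practice you rarely call this yourself: scipy.optimize.minimize with method="L-BFGS-B" maintains the (s, y) history and the line search internally, as the next cells show.
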
In[17]:
Code
def crf_objective(params, X_features, y_sequences, num_labels):
    """
    Compute negative log-likelihood and gradient for CRF.

    Args:
        params: Flattened parameter vector (emissions + transitions)
        X_features: List of feature matrices, one per sequence
        y_sequences: List of label sequences
        num_labels: Number of possible labels

    Returns:
        neg_ll: Negative log-likelihood (to minimize)
        neg_grad: Negative gradient
    """
    # Unpack parameters
    # For simplicity, we use emission weights per label and transition matrix
    emission_weights = params[:num_labels]
    transition_matrix = params[num_labels:].reshape(num_labels, num_labels)

    total_ll = 0.0
    total_emission_grad = np.zeros(num_labels)
    total_transition_grad = np.zeros((num_labels, num_labels))

    for X, y in zip(X_features, y_sequences):
        T = len(y)

        # Compute emission potentials
        log_potentials = np.outer(np.ones(T), emission_weights) + X

        # Forward-backward
        log_alpha, log_beta, log_Z = forward_backward(
            log_potentials, transition_matrix
        )

        # Log-likelihood contribution
        sequence_score = sum(log_potentials[t, y[t]] for t in range(T))
        sequence_score += sum(
            transition_matrix[y[t - 1], y[t]] for t in range(1, T)
        )
        total_ll += sequence_score - log_Z

        # Gradient contribution
        emission_grad, transition_grad = compute_gradient(
            log_potentials, transition_matrix, y, log_alpha, log_beta, log_Z
        )

        total_emission_grad += emission_grad.sum(axis=0)
        total_transition_grad += transition_grad

    # Return negative (for minimization)
    neg_grad = np.concatenate(
        [-total_emission_grad, -total_transition_grad.flatten()]
    )

    return -total_ll, neg_grad
In[18]:
Code
# Generate synthetic training data
np.random.seed(42)
num_sequences = 50
seq_length = 10
num_labels = 3

# Create synthetic sequences with some structure
X_features = []
y_sequences = []

for _ in range(num_sequences):
    # Random features
    X = np.random.randn(seq_length, num_labels) * 0.5

    # Generate labels with realistic BIO patterns
    y = []
    in_entity = False
    for t in range(seq_length):
        if in_entity:
            if np.random.random() < 0.7:  # Continue entity
                y.append(2)  # I-PER
            else:  # End entity
                y.append(0)  # O
                in_entity = False
        else:
            if np.random.random() < 0.2:  # Start entity
                y.append(1)  # B-PER
                in_entity = True
            else:
                y.append(0)  # O

    X_features.append(X)
    y_sequences.append(y)

# Initial parameters
init_params = np.zeros(num_labels + num_labels * num_labels)

# Track optimization progress
history = {"ll": [], "grad_norm": []}


def callback(params):
    ll, grad = crf_objective(params, X_features, y_sequences, num_labels)
    history["ll"].append(-ll)  # Convert back to positive LL
    history["grad_norm"].append(np.linalg.norm(grad))
In[19]:
Code
from scipy.optimize import minimize

# Run L-BFGS optimization
result = minimize(
    crf_objective,
    init_params,
    args=(X_features, y_sequences, num_labels),
    method="L-BFGS-B",
    jac=True,  # Function returns both value and gradient
    callback=callback,
    options={"maxiter": 100, "disp": False},
)

optimal_params = result.x
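
Printing a few quantities from the optimizer result and the callback history (a small inspection step added here) summarizes the run:

Code
print("Converged:", result.success, "after", result.nit, "iterations")
print("Final log-likelihood:", -result.fun)
if history["ll"]:
    print("Log-likelihood at first recorded iteration:", history["ll"][0])
    print("Log-likelihood at last recorded iteration: ", history["ll"][-1])
    print("Final gradient norm:", history["grad_norm"][-1])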

The convergence behavior is characteristic of L-BFGS on a convex problem. The log-likelihood rises steeply at first, then levels off as we approach the optimum, while the gradient norm drops sharply, indicating that we are closing in on a point where the gradient is zero.

Feature Template Design

CRF features are the heart of the model. Well-designed features capture the patterns that distinguish correct label sequences from incorrect ones. Feature templates define how to extract features from the input, and the CRF learns which features matter most.

Feature Template

A feature template is a pattern that generates binary features from observations. For example, the template "current word = X" generates one feature for each unique word X in the vocabulary. Templates let you define feature types; the actual features are instantiated from training data.

Common Feature Types

Features for sequence labeling typically fall into several categories:

  • Lexical features: The word itself, lowercase form, word shape (capitalization pattern)
  • Contextual features: Surrounding words in a window
  • Morphological features: Prefixes, suffixes, presence of digits or punctuation
  • Part-of-speech features: POS tags from a tagger (if available)
  • Transition features: Combinations of previous and current labels
In[23]:
Code
def extract_features(tokens, position):
    """
    Extract features for a token at a given position.

    Returns a dictionary of feature_name: value pairs.
    Features are designed for NER tagging.
    """
    features = {}
    word = tokens[position]

    # Current word features
    features["word.lower"] = word.lower()
    features["word.isupper"] = word.isupper()
    features["word.istitle"] = word.istitle()
    features["word.isdigit"] = word.isdigit()

    # Word shape
    shape = ""
    for char in word:
        if char.isupper():
            shape += "X"
        elif char.islower():
            shape += "x"
        elif char.isdigit():
            shape += "d"
        else:
            shape += char
    features["word.shape"] = shape

    # Prefix/suffix
    features["word.prefix2"] = (
        word[:2].lower() if len(word) >= 2 else word.lower()
    )
    features["word.suffix2"] = (
        word[-2:].lower() if len(word) >= 2 else word.lower()
    )
    features["word.prefix3"] = (
        word[:3].lower() if len(word) >= 3 else word.lower()
    )
    features["word.suffix3"] = (
        word[-3:].lower() if len(word) >= 3 else word.lower()
    )

    # Context features
    if position > 0:
        prev_word = tokens[position - 1]
        features["prev.word.lower"] = prev_word.lower()
        features["prev.word.istitle"] = prev_word.istitle()
    else:
        features["BOS"] = True  # Beginning of sentence

    if position < len(tokens) - 1:
        next_word = tokens[position + 1]
        features["next.word.lower"] = next_word.lower()
        features["next.word.istitle"] = next_word.istitle()
    else:
        features["EOS"] = True  # End of sentence

    return features
In[24]:
Code
# Example: extract features for a sample sentence
sample_sentence = [
    "John",
    "Smith",
    "works",
    "at",
    "Google",
    "in",
    "California",
    ".",
]
sample_labels = ["B-PER", "I-PER", "O", "O", "B-ORG", "O", "B-LOC", "O"]

all_features = []
for i in range(len(sample_sentence)):
    features = extract_features(sample_sentence, i)
    all_features.append(features)
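
Printing the feature dictionary for a single token, say "Google" at position 4, shows exactly what the template produces:

Code
# Inspect the features extracted for "Google" (position 4)
for name, value in sorted(all_features[4].items()):
    print(f"{name:>20} = {value}")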

The feature extractor captures patterns that are predictive of entity labels. Title-cased words are often names. Words ending in "-ia" might be locations. Words preceded by "at" might be organizations. The CRF learns which of these patterns are actually reliable predictors.

Feature Templates in Practice

Real NER systems use extensive feature templates. The sklearn-crfsuite library provides a convenient way to train CRFs with custom features:

In[27]:
Code
# Install sklearn-crfsuite for CRF training
# uv pip install sklearn-crfsuite
In[28]:
Code
def prepare_sentence_features(tokens, labels):
    """Convert a sentence to CRF training format."""
    sentence_features = []
    for i in range(len(tokens)):
        features = extract_features(tokens, i)
        # Convert to string features for sklearn-crfsuite
        str_features = {k: str(v) for k, v in features.items()}
        sentence_features.append(str_features)
    return sentence_features, labels


# Create a small dataset for demonstration
sentences = [
    (
        ["John", "Smith", "works", "at", "Google", "."],
        ["B-PER", "I-PER", "O", "O", "B-ORG", "O"],
    ),
    (
        ["Mary", "visited", "Paris", "last", "summer", "."],
        ["B-PER", "O", "B-LOC", "O", "O", "O"],
    ),
    (
        ["Apple", "announced", "a", "new", "iPhone", "."],
        ["B-ORG", "O", "O", "O", "O", "O"],
    ),
    (
        ["Dr.", "Jones", "from", "MIT", "gave", "a", "talk", "."],
        ["O", "B-PER", "O", "B-ORG", "O", "O", "O", "O"],
    ),
    (
        ["The", "Eiffel", "Tower", "is", "in", "France", "."],
        ["O", "B-LOC", "I-LOC", "O", "O", "B-LOC", "O"],
    ),
]

# Prepare training data
X_train = []
y_train = []
for tokens, labels in sentences:
    features, lbls = prepare_sentence_features(tokens, labels)
    X_train.append(features)
    y_train.append(lbls)
In[29]:
Code
import sklearn_crfsuite

# Train CRF model
crf = sklearn_crfsuite.CRF(
    algorithm="lbfgs",
    c1=0.1,  # L1 regularization
    c2=0.1,  # L2 regularization
    max_iterations=100,
    all_possible_transitions=True,
)

crf.fit(X_train, y_train)
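
sklearn-crfsuite exposes the learned weights through the state_features_ and transition_features_ attributes. A quick look at the largest ones (an inspection snippet added here; with only five training sentences the exact values are not meaningful) shows which patterns the model latched onto:

Code
from collections import Counter

# Strongest (feature, label) weights
print("Top state features:")
for (attr, label), weight in Counter(crf.state_features_).most_common(5):
    print(f"  {weight:+.3f}  {label:6s} {attr}")

# Strongest label-to-label transition weights
print("Top transitions:")
for (prev, curr), weight in Counter(crf.transition_features_).most_common(5):
    print(f"  {weight:+.3f}  {prev} -> {curr}")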

The learned feature weights reveal what the CRF considers important. High positive weights for "word.istitle=True" on B-PER indicate that title-cased words are strong signals for person names. The CRF automatically discovers these patterns from the training data.

Regularization

Without regularization, CRF training can overfit to the training data. If a feature appears only once in training and perfectly predicts a label, the optimal weight is infinity. Regularization prevents extreme weights and improves generalization.

Regularization

Regularization adds a penalty term to the objective function that discourages large parameter values. This prevents overfitting by trading off training set fit against model complexity.

L1 and L2 Regularization

The regularized log-likelihood objective is:

$$\mathcal{L}_{\text{reg}}(\boldsymbol{\lambda}) = \mathcal{L}(\boldsymbol{\lambda}) - c_1 \sum_k |\lambda_k| - c_2 \sum_k \lambda_k^2$$

where:

  • $\mathcal{L}(\boldsymbol{\lambda})$: the original log-likelihood
  • $c_1$: the L1 regularization coefficient (larger = stronger regularization), corresponding to the c1 parameter in sklearn-crfsuite
  • $c_2$: the L2 regularization coefficient (larger = stronger regularization), corresponding to the c2 parameter
  • $|\lambda_k|$: absolute value of weight $k$ (L1 penalty)
  • $\lambda_k^2$: squared weight (L2 penalty)

L1 and L2 regularization have different effects:

Comparison of L1 and L2 regularization for CRF training:

  • L1 (penalty $|\lambda|$): drives weights to exactly zero; useful for feature selection and sparse models
  • L2 (penalty $\lambda^2$): shrinks all weights toward zero; general-purpose regularization
  • L1 + L2 (both penalties): combines sparsity with shrinkage; often works best in practice
In[32]:
Code
# Compare different regularization strengths
regularization_strengths = [0.01, 0.1, 1.0, 10.0]
results = []

for c in regularization_strengths:
    crf_reg = sklearn_crfsuite.CRF(
        algorithm="lbfgs",
        c1=c,
        c2=c,
        max_iterations=100,
        all_possible_transitions=True,
    )
    crf_reg.fit(X_train, y_train)

    # Count non-zero features
    nonzero_features = sum(
        1 for _, weight in crf_reg.state_features_.items() if abs(weight) > 1e-6
    )

    # Get max weight magnitude
    max_weight = max(
        abs(weight) for _, weight in crf_reg.state_features_.items()
    )

    results.append(
        {"c": c, "nonzero": nonzero_features, "max_weight": max_weight}
    )
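
Printing the collected numbers makes the trend easy to read:

Code
print(f"{'c1 = c2':>8}  {'non-zero features':>18}  {'max |weight|':>13}")
for r in results:
    print(f"{r['c']:>8.2f}  {r['nonzero']:>18d}  {r['max_weight']:>13.3f}")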

The results show a clear trend. With strong regularization (large c1 and c2), many features get zeroed out, creating a sparse model that uses only the most informative features. Maximum weight magnitudes also decrease, preventing the model from over-relying on any single feature.

Training a Complete NER Model

Let's put everything together and train a CRF on a real NER dataset:

In[36]:
Code
import nltk

nltk.download("conll2002", quiet=True)
from nltk.corpus import conll2002

# Load Spanish NER data (smaller than English, good for demonstration)
train_sents = list(conll2002.iob_sents("esp.train"))[
    :200
]  # Use subset for speed
test_sents = list(conll2002.iob_sents("esp.testa"))[:50]


def word2features(sent, i):
    """Extract features for word at position i in sentence."""
    word = sent[i][0]
    postag = sent[i][1]

    features = {
        "bias": 1.0,
        "word.lower": word.lower(),
        "word[-3:]": word[-3:],
        "word[-2:]": word[-2:],
        "word.isupper": word.isupper(),
        "word.istitle": word.istitle(),
        "word.isdigit": word.isdigit(),
        "postag": postag,
        "postag[:2]": postag[:2],
    }

    if i > 0:
        word1 = sent[i - 1][0]
        postag1 = sent[i - 1][1]
        features.update(
            {
                "-1:word.lower": word1.lower(),
                "-1:word.istitle": word1.istitle(),
                "-1:postag": postag1,
            }
        )
    else:
        features["BOS"] = True

    if i < len(sent) - 1:
        word1 = sent[i + 1][0]
        postag1 = sent[i + 1][1]
        features.update(
            {
                "+1:word.lower": word1.lower(),
                "+1:word.istitle": word1.istitle(),
                "+1:postag": postag1,
            }
        )
    else:
        features["EOS"] = True

    return features


def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]


def sent2labels(sent):
    return [label for token, postag, label in sent]


# Prepare data
X_train_ner = [sent2features(s) for s in train_sents]
y_train_ner = [sent2labels(s) for s in train_sents]
X_test_ner = [sent2features(s) for s in test_sents]
y_test_ner = [sent2labels(s) for s in test_sents]
In[37]:
Code
# Train the CRF
crf_ner = sklearn_crfsuite.CRF(
    algorithm="lbfgs",
    c1=0.1,
    c2=0.1,
    max_iterations=100,
    all_possible_transitions=True,
)

crf_ner.fit(X_train_ner, y_train_ner)

# Make predictions
y_pred_ner = crf_ner.predict(X_test_ner)
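
To back up these observations, we can score the predictions with sklearn-crfsuite's metrics helpers and list the strongest and weakest learned transitions (an evaluation snippet added for illustration; the exact numbers depend on the 200-sentence subset used above):

Code
from collections import Counter

from sklearn_crfsuite import metrics

# Per-label precision/recall/F1, ignoring the dominant O class
labels = [label for label in crf_ner.classes_ if label != "O"]
print(metrics.flat_classification_report(y_test_ner, y_pred_ner, labels=labels))

# Strongest and weakest learned label-to-label transitions
transitions = Counter(crf_ner.transition_features_)
print("Most likely transitions:")
for (prev, curr), weight in transitions.most_common(5):
    print(f"  {prev:6s} -> {curr:6s} {weight:+.3f}")
print("Least likely transitions:")
for (prev, curr), weight in transitions.most_common()[-5:]:
    print(f"  {prev:6s} -> {curr:6s} {weight:+.3f}")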

The trained CRF achieves reasonable performance even on this limited training set. The model correctly identifies many entities by learning patterns from the feature templates. Errors often occur at entity boundaries or with rare entity types that don't have enough training examples.

The transition matrix reveals that the CRF has learned BIO constraints. High weights appear for valid transitions like O→B-LOC (starting a location entity) and B-LOC→I-LOC (continuing a location). Low or negative weights appear for invalid transitions like O→I-LOC (continuation without a beginning).

Limitations and Impact

CRF training has both strengths and limitations that shape when and how to use these models effectively.

The forward-backward algorithm makes CRF training tractable, but it still requires $O(T \cdot |L|^2)$ time per sequence. For tasks with many labels like fine-grained NER or semantic role labeling, this quadratic scaling becomes expensive. A sequence with 50 labels and 100 tokens requires computing 250,000 transition scores per training example. This cost compounds across thousands of training sentences and hundreds of optimization iterations.

Feature engineering remains both a strength and weakness. Hand-crafted features like word shapes, prefixes, and context windows capture linguistic knowledge effectively. But designing features requires domain expertise and extensive experimentation. Features that work well for English NER may perform poorly on German or Japanese. This manual effort contrasts sharply with neural approaches that learn representations automatically from raw text.

CRFs also struggle with long-range dependencies. The Markov assumption means each label depends only on the immediately preceding label. If a token's correct label depends on context five words away, the CRF must propagate that information through intervening transitions. In practice, this limits how much global context the model can exploit.

Despite these limitations, CRFs made sequence labeling practical before deep learning. Their contribution extends beyond the models themselves to the training paradigms they established. The idea of discriminatively training structured models, using dynamic programming for efficient inference, and designing features to capture domain knowledge influenced subsequent developments in neural NLP. Modern neural CRF layers, which place a CRF on top of BiLSTM or transformer encoders, combine learned representations with the structured prediction framework that CRFs pioneered.

For practitioners today, CRFs remain valuable in several scenarios: when training data is limited (hundreds rather than thousands of examples), when interpretability matters (examining feature weights reveals model behavior), when computational resources are constrained (CRFs train on CPUs in minutes), or as components in hybrid systems (neural encoders + CRF output layers).

Summary

Training Conditional Random Fields requires balancing computational efficiency with model expressiveness. The key ideas covered in this chapter are:

  • Log-likelihood objective: CRF training maximizes the probability of correct label sequences. The objective is concave, so any optimum a hill-climbing method reaches is a global optimum.

  • Forward-backward algorithm: Dynamic programming computes the partition function and marginal probabilities in $O(T \cdot |L|^2)$ time, making training tractable despite the exponential number of possible label sequences.

  • Gradient computation: The gradient equals observed feature counts minus expected feature counts. When these match, the model has converged.

  • L-BFGS optimization: Quasi-Newton methods converge much faster than gradient descent by approximating curvature information. Most CRF implementations use L-BFGS by default.

  • Feature templates: Hand-designed features capture patterns like word shapes, context windows, and morphological properties. Good features are essential for CRF performance.

  • Regularization: L1 and L2 penalties prevent overfitting by discouraging extreme weights. L1 produces sparse models; L2 shrinks all weights toward zero.

CRFs established the foundation for structured prediction in NLP. While neural approaches now dominate, the principles of discriminative training, dynamic programming, and structured output remain central to modern sequence labeling systems.
