Master Conditional Random Field training with the forward-backward algorithm, gradient computation, and L-BFGS optimization for sequence labeling tasks.

This article is part of the free-to-read Language AI Handbook
CRF Training
Conditional Random Fields are powerful models for sequence labeling, but their power comes with computational challenges. Unlike simple classifiers that make independent predictions for each token, CRFs model dependencies between labels across the entire sequence. Training a CRF means finding weights that maximize the probability of correct label sequences given their input features, while accounting for all possible alternative labelings.
This chapter explores how to train CRFs effectively. You'll learn the mathematical foundation of the log-likelihood objective, understand why the forward-backward algorithm is essential for efficient gradient computation, see how modern optimizers like L-BFGS converge faster than simple gradient descent, and discover how feature templates and regularization shape what a CRF can learn. By the end, you'll be able to train CRFs on real sequence labeling tasks and understand what's happening under the hood.
The Log-Likelihood Objective
Training a CRF means finding feature weights that make correct label sequences probable. Given training data consisting of observation sequences $\mathbf{x}^{(i)}$ and their correct label sequences $\mathbf{y}^{(i)}$, we want to maximize the conditional probability of seeing these correct labels.
The log-likelihood is the sum of log probabilities of the correct label sequences across all training examples. Taking the log transforms products into sums and makes optimization numerically stable.
For a single training example with observation sequence $\mathbf{x}$ and label sequence $\mathbf{y}$, the CRF defines the conditional probability as:

$$P(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, \mathbf{x}, t) \right)$$

where:
- $\mathbf{y} = (y_1, \dots, y_T)$: the label sequence of length $T$
- $\mathbf{x}$: the observation sequence (features at each position)
- $f_k$: the $k$-th feature function, which examines the previous label $y_{t-1}$, current label $y_t$, the full observation sequence $\mathbf{x}$, and position $t$
- $\lambda_k$: the weight for feature $f_k$ (what we're learning)
- $Z(\mathbf{x})$: the partition function, which sums over all possible label sequences to normalize probabilities
The partition function deserves special attention:

$$Z(\mathbf{x}) = \sum_{\mathbf{y}'} \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, \mathbf{x}, t) \right)$$

where:
- $\mathbf{y}'$: ranges over all possible label sequences
- The sum includes $|\mathcal{Y}|^T$ sequences, where $|\mathcal{Y}|$ is the number of possible labels and $T$ is the sequence length

This summation over exponentially many sequences is what makes CRF training computationally interesting. With 9 labels and 20 tokens, we'd have $9^{20} \approx 1.2 \times 10^{19}$ possible sequences. Enumerating them directly is impossible.
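For a quick sanity check of that number:

```python
num_labels, seq_len = 9, 20
print(f"{num_labels ** seq_len:.2e} possible label sequences")  # ≈ 1.22e+19
```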
The Objective Function
Taking the log of the conditional probability gives us the log-likelihood for a single example:

$$\log P(\mathbf{y} \mid \mathbf{x}) = \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, \mathbf{x}, t) - \log Z(\mathbf{x})$$

The first term is the "score" of the correct sequence, which is easy to compute. The second term, $\log Z(\mathbf{x})$, requires summing over all possible sequences, which is the computational bottleneck.

For the full training set with $N$ examples, we sum the log-likelihoods:

$$\mathcal{L}(\boldsymbol{\lambda}) = \sum_{i=1}^{N} \log P(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)})$$

where:
- $\mathcal{L}(\boldsymbol{\lambda})$: the total log-likelihood as a function of all weights
- $\boldsymbol{\lambda}$: the vector of all feature weights

Our goal is to find $\boldsymbol{\lambda}^* = \arg\max_{\boldsymbol{\lambda}} \mathcal{L}(\boldsymbol{\lambda})$.
Let's visualize what this objective looks like for a simple case:
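The original interactive plot isn't reproduced here, but a minimal numeric sketch makes the point. Assume a toy model with two labels, a length-two sequence, and a single transition weight $w$ that rewards the correct transition; every other score is zero. The log-likelihood of the correct labeling is then $w - \log(3 + e^w)$:

```python
import numpy as np

def log_likelihood(w):
    """Log-likelihood of the correct labeling in the single-weight toy model."""
    scores = np.array([0.0, 0.0, 0.0, w])     # scores of the 4 possible labelings
    return w - np.log(np.sum(np.exp(scores)))  # correct score minus log Z

for w in [-2.0, -1.0, 0.0, 1.0, 2.0, 4.0, 8.0]:
    print(f"w = {w:+.1f}  ->  log-likelihood = {log_likelihood(w):.4f}")
```

The values increase monotonically and level off toward zero, with no bumps or plateaus to get stuck on, which previews the concavity discussed next.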
The log-likelihood function for CRFs has a crucial property: it is concave. This means there are no local maxima to get trapped in. Any hill-climbing algorithm will find the global optimum given enough iterations. This mathematical guarantee makes CRF training well-behaved compared to neural network training, where the loss landscape is riddled with local minima and saddle points.
The Forward-Backward Algorithm
Computing the partition function by enumerating all sequences is intractable. The forward-backward algorithm exploits the sequential structure of the model to compute it in polynomial time.
The key insight is that the CRF score decomposes into local terms. At each position $t$, the contribution depends only on the label at position $t$ and the label at position $t-1$. This Markov property enables dynamic programming: we can build up the sum over all sequences incrementally, position by position.
Forward Variables
The forward variable $\alpha_t(y)$ represents the sum of exponentiated scores of all partial label sequences that end with label $y$ at position $t$:

$$\alpha_t(y) = \sum_{y_1, \dots, y_{t-1}} \exp\left( \sum_{s=1}^{t} \sum_k \lambda_k \, f_k(y_{s-1}, y_s, \mathbf{x}, s) \right) \quad \text{with } y_t = y$$

where:
- $y_1, \dots, y_{t-1}$: all possible label sequences from position 1 to $t-1$
- $y_t = y$: the sequence is constrained to end with label $y$
- The summation accumulates scores of all paths reaching state $y$ at time $t$

We compute forward variables recursively:

Base case ($t = 1$):

$$\alpha_1(y) = \exp\left( \sum_k \lambda_k \, f_k(\text{START}, y, \mathbf{x}, 1) \right)$$

where $\text{START}$ is a special start symbol.

Recursive case:

$$\alpha_t(y) = \sum_{y'} \alpha_{t-1}(y') \cdot \exp\left( \sum_k \lambda_k \, f_k(y', y, \mathbf{x}, t) \right)$$

This recursion says: to get the total score of paths ending in $y$ at position $t$, sum over all possible previous labels $y'$, multiplying the forward score at $y'$ (all paths ending in $y'$) by the transition score from $y'$ to $y$.

The partition function is simply the sum of all forward variables at the final position:

$$Z(\mathbf{x}) = \sum_{y} \alpha_T(y)$$
Backward Variables
The backward variable $\beta_t(y)$ represents the sum of scores for all partial sequences covering positions $t+1$ through $T$, given that position $t$ has label $y$:

$$\beta_t(y) = \sum_{y_{t+1}, \dots, y_T} \exp\left( \sum_{s=t+1}^{T} \sum_k \lambda_k \, f_k(y_{s-1}, y_s, \mathbf{x}, s) \right)$$

where $y_t = y$ is fixed.

Base case ($t = T$):

$$\beta_T(y) = 1$$

Recursive case:

$$\beta_t(y) = \sum_{y'} \exp\left( \sum_k \lambda_k \, f_k(y, y', \mathbf{x}, t+1) \right) \cdot \beta_{t+1}(y')$$
Computing Marginal Probabilities
The forward and backward variables together give us the probability of any label at any position. The posterior marginal probability of label $y$ at position $t$ is:

$$P(y_t = y \mid \mathbf{x}) = \frac{\alpha_t(y) \, \beta_t(y)}{Z(\mathbf{x})}$$

And the probability of a specific transition from $y'$ to $y$ at positions $t-1$ to $t$:

$$P(y_{t-1} = y', \, y_t = y \mid \mathbf{x}) = \frac{\alpha_{t-1}(y') \, M_t(y', y) \, \beta_t(y)}{Z(\mathbf{x})}$$

where $M_t(y', y) = \exp\left( \sum_k \lambda_k \, f_k(y', y, \mathbf{x}, t) \right)$ is the transition score matrix at position $t$.
Let's implement the forward-backward algorithm:
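The chapter's original interactive implementation isn't reproduced here; the following is a minimal NumPy sketch that assumes the local log-potentials are precomputed into a single `scores[t, prev, curr]` array, with `scores[0, 0, :]` holding the start scores:

```python
import numpy as np
from scipy.special import logsumexp

def forward_backward(scores):
    """Forward-backward for a linear-chain CRF, computed in log space.

    scores[t, i, j]: log-potential for label i at position t-1 followed by
    label j at position t. By convention scores[0, 0, :] holds the start
    scores for position 0.
    """
    T, L, _ = scores.shape

    # Forward pass: log_alpha[t, j] = log of the summed scores of all paths ending in j at t
    log_alpha = np.full((T, L), -np.inf)
    log_alpha[0] = scores[0, 0]
    for t in range(1, T):
        log_alpha[t] = logsumexp(log_alpha[t - 1][:, None] + scores[t], axis=0)

    # Backward pass: log_beta[t, i] = log of the summed scores of all completions given label i at t
    log_beta = np.zeros((T, L))
    for t in range(T - 2, -1, -1):
        log_beta[t] = logsumexp(scores[t + 1] + log_beta[t + 1][None, :], axis=1)

    log_Z = logsumexp(log_alpha[-1])                  # partition function
    marginals = np.exp(log_alpha + log_beta - log_Z)  # P(y_t = j | x)
    return log_alpha, log_beta, log_Z, marginals

# Example: 4 positions, 3 labels, random local scores
rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 3, 3))
log_alpha, log_beta, log_Z, marginals = forward_backward(scores)
print(marginals.sum(axis=1))   # each position sums to 1.0
```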
Notice how the marginals sum to 1.0 at each position. This is because they represent the probability distribution over labels at that position, marginalized over all possible label sequences. The forward-backward algorithm computes these marginals in $O(T \cdot |\mathcal{Y}|^2)$ time, compared to $O(|\mathcal{Y}|^T)$ for brute-force enumeration.
The trellis visualization shows how forward and backward passes combine. Each node's color intensity represents its marginal probability. States with high marginal probability (darker blue) are more likely to be part of the optimal sequence.
Gradient Computation
To optimize the log-likelihood, we need its gradient with respect to each weight $\lambda_k$. The gradient has an elegant form that emerges from the structure of the CRF.

Taking the derivative of the log-likelihood for a single example:

$$\frac{\partial \log P(\mathbf{y} \mid \mathbf{x})}{\partial \lambda_k} = \sum_{t=1}^{T} f_k(y_{t-1}, y_t, \mathbf{x}, t) \;-\; \sum_{t=1}^{T} \sum_{y', y} P(y_{t-1} = y', y_t = y \mid \mathbf{x}) \, f_k(y', y, \mathbf{x}, t)$$

where:
- The first term is the observed feature count: how often feature $f_k$ fires on the correct label sequence
- The second term is the expected feature count: the expected number of times feature $f_k$ would fire under the model's current probability distribution
This gradient has an intuitive interpretation:
- If a feature fires more often in the training data than the model expects, increase its weight (positive gradient)
- If a feature fires less often than expected, decrease its weight (negative gradient)
- When observed counts match expected counts, the gradient is zero (we've reached a stationary point)
Computing Expected Feature Counts
The expected feature count requires summing over all positions and all label pairs, weighted by their posterior probabilities:

$$\mathbb{E}[f_k] = \sum_{t=1}^{T} \sum_{y', y} P(y_{t-1} = y', y_t = y \mid \mathbf{x}) \, f_k(y', y, \mathbf{x}, t)$$

The forward-backward algorithm gives us exactly what we need: the pairwise marginals $P(y_{t-1} = y', y_t = y \mid \mathbf{x})$.
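Building on the forward-backward sketch above, a minimal sketch of the pairwise marginals and the observed-minus-expected gradient for a single transition feature might look like this (the helper names are illustrative):

```python
import numpy as np

def pairwise_marginals(scores, log_alpha, log_beta, log_Z):
    """P(y_{t-1} = i, y_t = j | x) for each position t >= 1."""
    T, L, _ = scores.shape
    pair = np.zeros((T, L, L))
    for t in range(1, T):
        pair[t] = np.exp(log_alpha[t - 1][:, None]   # paths reaching label i at t-1
                         + scores[t]                  # local score for i -> j at t
                         + log_beta[t][None, :]       # completions given label j at t
                         - log_Z)
    return pair

def transition_gradient(gold_labels, pair, i, j):
    """Gradient for the feature 'label i followed by label j'."""
    observed = sum(1 for t in range(1, len(gold_labels))
                   if gold_labels[t - 1] == i and gold_labels[t] == j)
    expected = pair[1:, i, j].sum()
    return observed - expected

# Reusing scores, log_alpha, log_beta, log_Z from the forward-backward example above
pair = pairwise_marginals(scores, log_alpha, log_beta, log_Z)
print(transition_gradient([0, 1, 1, 2], pair, 1, 1))
```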
The gradient reveals which adjustments would improve the model. Positive gradients at the true labels mean the model should increase those scores. Negative gradients at incorrect labels mean those should be suppressed. The transition gradient shows similar patterns: transitions that occurred in the true sequence but have low model probability will have positive gradients.
L-BFGS Optimization
Simple gradient descent can optimize the CRF objective, but it converges slowly. Each step moves in the direction of the gradient, but the step size is tricky to tune: too large and you overshoot, too small and you crawl toward the optimum.
L-BFGS (Limited-memory Broyden-Fletcher-Goldfarb-Shanno) is a quasi-Newton method that approximates the curvature of the objective function. Instead of using only the gradient direction, it uses gradient history to estimate how the objective curves in different directions, enabling larger steps in flat directions and smaller steps in steep directions.
L-BFGS is a quasi-Newton optimization algorithm that uses a limited memory approximation to the inverse Hessian matrix. It converges much faster than gradient descent for convex problems like CRF training, typically requiring 50-200 iterations instead of thousands.
The key advantages of L-BFGS for CRF training are:
- Faster convergence: Uses curvature information to take better steps
- No learning rate tuning: Line search automatically finds good step sizes
- Memory efficient: Stores only the last $m$ pairs of parameter and gradient differences (typically $m \approx 10$), not the full Hessian
- Robust: Works well on the convex CRF objective without careful hyperparameter tuning
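To make the optimizer loop concrete, here is a minimal sketch that feeds an observed-minus-expected gradient to SciPy's L-BFGS implementation. The tiny objective below is a hypothetical stand-in for the full CRF log-likelihood: four candidate labelings, each described by three binary features.

```python
import numpy as np
from scipy.optimize import minimize

# Feature vectors of the four competing labelings (invented for illustration)
configs = np.array([[1, 0, 1],
                    [0, 1, 1],
                    [1, 1, 0],
                    [0, 0, 0]], dtype=float)
observed = np.array([3.0, 2.0, 2.0])   # feature counts over N = 4 gold labelings
N = 4

def objective(w):
    """Negative log-likelihood and its gradient (observed minus expected counts)."""
    scores = configs @ w
    log_Z = np.logaddexp.reduce(scores)
    probs = np.exp(scores - log_Z)
    expected = probs @ configs          # expected feature counts under the model
    nll = -(observed @ w - N * log_Z)
    grad = -(observed - N * expected)
    return nll, grad

result = minimize(objective, x0=np.zeros(3), method="L-BFGS-B",
                  jac=True, options={"maxiter": 200})
print("converged:", result.success, "in", result.nit, "iterations")
print("weights:", np.round(result.x, 3))
```

SciPy minimizes, so both the objective and the gradient are negated; `jac=True` tells the optimizer that the function returns the gradient alongside the objective value, and the built-in line search removes the need to tune a learning rate.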
The convergence plots show the characteristic behavior of L-BFGS on a convex problem. The log-likelihood rises steeply at first, then levels off as we approach the optimum. The gradient norm drops exponentially, indicating we're getting closer to a point where the gradient is zero.
Feature Template Design
CRF features are the heart of the model. Well-designed features capture the patterns that distinguish correct label sequences from incorrect ones. Feature templates define how to extract features from the input, and the CRF learns which features matter most.
A feature template is a pattern that generates binary features from observations. For example, the template "current word = X" generates one feature for each unique word X in the vocabulary. Templates let you define feature types; the actual features are instantiated from training data.
Common Feature Types
Features for sequence labeling typically fall into several categories:
- Lexical features: The word itself, lowercase form, word shape (capitalization pattern)
- Contextual features: Surrounding words in a window
- Morphological features: Prefixes, suffixes, presence of digits or punctuation
- Part-of-speech features: POS tags from a tagger (if available)
- Transition features: Combinations of previous and current labels
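A minimal sketch of a feature extractor along these lines (the function and feature names here are illustrative, not taken from a particular library):

```python
def word_features(tokens, t):
    """Extract a small dictionary of features for the token at position t."""
    word = tokens[t]
    return {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),    # title-cased words are often names
        "word.isdigit": word.isdigit(),
        "word.suffix3": word[-3:],         # endings like "-ia" are common in place names
        "prev.word": tokens[t - 1].lower() if t > 0 else "<BOS>",
        "next.word": tokens[t + 1].lower() if t < len(tokens) - 1 else "<EOS>",
    }

tokens = ["She", "works", "at", "Acme", "in", "Slovenia"]
print(word_features(tokens, 3))   # features for "Acme", including prev.word = "at"
```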
The feature extractor captures patterns that are predictive of entity labels. Title-cased words are often names. Words ending in "-ia" might be locations. Words preceded by "at" might be organizations. The CRF learns which of these patterns are actually reliable predictors.
Feature Templates in Practice
Real NER systems use extensive feature templates. The sklearn-crfsuite library provides a convenient way to train CRFs with custom features:
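The chapter's full template set isn't reproduced here; a condensed sketch of the sklearn-crfsuite workflow, reusing the `word_features` extractor from the sketch above and a couple of invented sentences, might look like this:

```python
import sklearn_crfsuite

# Tiny invented training set: token sequences paired with BIO labels
train_sents = [
    (["Maria", "Schmidt", "visited", "Berlin"], ["B-PER", "I-PER", "O", "B-LOC"]),
    (["Acme", "Corp", "opened", "in", "Austria"], ["B-ORG", "I-ORG", "O", "O", "B-LOC"]),
]
X_train = [[word_features(toks, t) for t in range(len(toks))] for toks, _ in train_sents]
y_train = [labels for _, labels in train_sents]

crf = sklearn_crfsuite.CRF(
    algorithm="lbfgs",      # L-BFGS optimization, as discussed above
    c1=0.1,                 # L1 regularization coefficient
    c2=0.1,                 # L2 regularization coefficient
    max_iterations=100,
)
crf.fit(X_train, y_train)

# Inspect the largest learned state-feature weights (feature/label pairs)
top = sorted(crf.state_features_.items(), key=lambda kv: -abs(kv[1]))[:5]
for (attr, label), weight in top:
    print(f"{weight:+.3f}  {attr:20s} -> {label}")
```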
The learned feature weights reveal what the CRF considers important. High positive weights for "word.istitle=True" on B-PER indicate that title-cased words are strong signals for person names. The CRF automatically discovers these patterns from the training data.
Regularization
Without regularization, CRF training can overfit to the training data. If a feature appears only once in training and perfectly predicts a label, the optimal weight is infinity. Regularization prevents extreme weights and improves generalization.
Regularization adds a penalty term to the objective function that discourages large parameter values. This prevents overfitting by trading off training set fit against model complexity.
L1 and L2 Regularization
The regularized log-likelihood objective is:

$$\mathcal{L}_{\text{reg}}(\boldsymbol{\lambda}) = \mathcal{L}(\boldsymbol{\lambda}) - c_1 \sum_k |\lambda_k| - \frac{c_2}{2} \sum_k \lambda_k^2$$

where:
- $\mathcal{L}(\boldsymbol{\lambda})$: the original log-likelihood
- $c_1$: the L1 regularization coefficient (larger values mean stronger regularization)
- $c_2$: the L2 regularization coefficient (larger values mean stronger regularization)
- $|\lambda_k|$: the absolute value of weight $\lambda_k$ (L1 penalty)
- $\lambda_k^2$: the squared weight $\lambda_k$ (L2 penalty)
L1 and L2 regularization have different effects:
| Regularization | Penalty | Effect | Use Case |
|---|---|---|---|
| L1 | $c_1 \sum_k \lvert \lambda_k \rvert$ | Drives weights to exactly zero | Feature selection, sparse models |
| L2 | $\frac{c_2}{2} \sum_k \lambda_k^2$ | Shrinks all weights toward zero | General regularization |
| L1 + L2 | Both penalties | Combines sparsity with shrinkage | Often works best in practice |
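To make the contrast concrete, here is a minimal sketch with a single weight whose unregularized optimum is $w = 1$ and a quadratic loss standing in for the CRF objective. L1 snaps the weight to exactly zero once the penalty is strong enough; L2 only shrinks it:

```python
from scipy.optimize import minimize_scalar

def l1_solution(lam):
    return minimize_scalar(lambda w: 0.5 * (w - 1) ** 2 + lam * abs(w)).x

def l2_solution(lam):
    return minimize_scalar(lambda w: 0.5 * (w - 1) ** 2 + lam * w ** 2).x

for lam in [0.1, 0.5, 1.0, 2.0]:
    print(f"lam = {lam:3.1f}:  L1 -> {l1_solution(lam):+.3f}   L2 -> {l2_solution(lam):+.3f}")
```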
Sweeping the regularization strength shows the same trends in a full CRF. With strong L1 regularization, many features get zeroed out, creating a sparse model that uses only the most informative features. Maximum weight magnitudes also decrease, preventing the model from over-relying on any single feature.
Training a Complete NER Model
Let's put everything together and train a CRF on a real NER dataset:
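The chapter's original end-to-end example isn't reproduced here; the condensed sketch below follows the same pipeline, substituting a few invented sentences for the original NER corpus:

```python
import sklearn_crfsuite

def sent2features(tokens):
    """Feature dictionaries for every token in a sentence."""
    feats = []
    for t, word in enumerate(tokens):
        feats.append({
            "word.lower": word.lower(),
            "word.istitle": word.istitle(),
            "word.suffix3": word[-3:],
            "prev.word": tokens[t - 1].lower() if t > 0 else "<BOS>",
            "next.word": tokens[t + 1].lower() if t < len(tokens) - 1 else "<EOS>",
        })
    return feats

# Invented stand-in data; the original chapter trains on a real NER corpus
train = [
    (["John", "Smith", "lives", "in", "Paris"], ["B-PER", "I-PER", "O", "O", "B-LOC"]),
    (["Mary", "Jones", "works", "at", "Google"], ["B-PER", "I-PER", "O", "O", "B-ORG"]),
    (["Berlin", "is", "in", "Germany"], ["B-LOC", "O", "O", "B-LOC"]),
]
test = [(["Anna", "Lee", "visited", "London"], ["B-PER", "I-PER", "O", "B-LOC"])]

X_train = [sent2features(toks) for toks, _ in train]
y_train = [labels for _, labels in train]
X_test = [sent2features(toks) for toks, _ in test]
y_test = [labels for _, labels in test]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1,
                           max_iterations=100, all_possible_transitions=True)
crf.fit(X_train, y_train)

# Token-level accuracy on the held-out sentence
y_pred = crf.predict(X_test)
correct = sum(p == g for pred, gold in zip(y_pred, y_test) for p, g in zip(pred, gold))
total = sum(len(gold) for gold in y_test)
print(f"token accuracy: {correct / total:.2f}")

# Learned transition weights reveal which label sequences the model prefers
for (prev, curr), weight in sorted(crf.transition_features_.items(),
                                   key=lambda kv: -kv[1])[:5]:
    print(f"{weight:+.3f}  {prev} -> {curr}")
```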
The trained CRF achieves reasonable performance even on this limited training set. The model correctly identifies many entities by learning patterns from the feature templates. Errors often occur at entity boundaries or with rare entity types that don't have enough training examples.
The transition matrix reveals that the CRF has learned BIO constraints. High weights appear for valid transitions like O→B-LOC (starting a location entity) and B-LOC→I-LOC (continuing a location). Low or negative weights appear for invalid transitions like O→I-LOC (continuation without a beginning).
Limitations and Impact
CRF training has both strengths and limitations that shape when and how to use these models effectively.
The forward-backward algorithm makes CRF training tractable, but it still requires $O(T \cdot |\mathcal{Y}|^2)$ time per sequence. For tasks with many labels, like fine-grained NER or semantic role labeling, this quadratic scaling becomes expensive. A sequence with 50 labels and 100 tokens requires computing $100 \times 50^2 = 250{,}000$ transition scores per training example. This cost compounds across thousands of training sentences and hundreds of optimization iterations.
Feature engineering remains both a strength and weakness. Hand-crafted features like word shapes, prefixes, and context windows capture linguistic knowledge effectively. But designing features requires domain expertise and extensive experimentation. Features that work well for English NER may perform poorly on German or Japanese. This manual effort contrasts sharply with neural approaches that learn representations automatically from raw text.
CRFs also struggle with long-range dependencies. The Markov assumption means each label depends only on the immediately preceding label. If a token's correct label depends on context five words away, the CRF must propagate that information through intervening transitions. In practice, this limits how much global context the model can exploit.
Despite these limitations, CRFs made sequence labeling practical before deep learning. Their contribution extends beyond the models themselves to the training paradigms they established. The idea of discriminatively training structured models, using dynamic programming for efficient inference, and designing features to capture domain knowledge influenced subsequent developments in neural NLP. Modern neural CRF layers, which place a CRF on top of BiLSTM or transformer encoders, combine learned representations with the structured prediction framework that CRFs pioneered.
For practitioners today, CRFs remain valuable in several scenarios: when training data is limited (hundreds rather than thousands of examples), when interpretability matters (examining feature weights reveals model behavior), when computational resources are constrained (CRFs train on CPUs in minutes), or as components in hybrid systems (neural encoders + CRF output layers).
Summary
Training Conditional Random Fields requires balancing computational efficiency with model expressiveness. The key ideas covered in this chapter are:
- Log-likelihood objective: CRF training maximizes the probability of correct label sequences. The objective is concave, so any local optimum is a global optimum.
- Forward-backward algorithm: Dynamic programming computes the partition function and marginal probabilities in $O(T \cdot |\mathcal{Y}|^2)$ time, making training tractable despite the exponential number of possible label sequences.
- Gradient computation: The gradient equals observed feature counts minus expected feature counts. When these match, the model has converged.
- L-BFGS optimization: Quasi-Newton methods converge much faster than gradient descent by approximating curvature information. Most CRF implementations use L-BFGS by default.
- Feature templates: Hand-designed features capture patterns like word shapes, context windows, and morphological properties. Good features are essential for CRF performance.
- Regularization: L1 and L2 penalties prevent overfitting by discouraging extreme weights. L1 produces sparse models; L2 shrinks all weights toward zero.
CRFs established the foundation for structured prediction in NLP. While neural approaches now dominate, the principles of discriminative training, dynamic programming, and structured output remain central to modern sequence labeling systems.
About the author: Michael Brenndoerfer
All opinions expressed here are my own and do not reflect the views of my employer.
Michael currently works as an Associate Director of Data Science at EQT Partners in Singapore, leading AI and data initiatives across private capital investments.
With over a decade of experience spanning private equity, management consulting, and software engineering, he specializes in building and scaling analytics capabilities from the ground up. He has published research in leading AI conferences and holds expertise in machine learning, natural language processing, and value creation through data.
Related Content

Viterbi Algorithm: Dynamic Programming for Optimal Sequence Decoding
Master the Viterbi algorithm for finding optimal tag sequences in HMMs. Learn dynamic programming, backpointer tracking, log-space computation, and constrained decoding.

Part-of-Speech Tagging: Tag Sets, Algorithms & Implementation
Learn POS tagging from tag sets to statistical taggers. Covers Penn Treebank, Universal Dependencies, emission and transition probabilities, and practical implementation with NLTK and spaCy.

Named Entity Recognition: Extracting People, Places & Organizations
Learn how NER identifies and classifies entities in text using BIO tagging, evaluation metrics, and spaCy implementation.