In 2001, Lafferty and colleagues introduced CRFs, a powerful probabilistic framework that revolutionized structured prediction by modeling entire sequences jointly rather than making independent predictions. By capturing dependencies between adjacent elements through conditional probability and feature functions, CRFs became essential for part-of-speech tagging and named entity recognition, and they established principles that would influence later sequence models.

This article is part of the free-to-read History of Language AI book
2001: Conditional Random Fields
By the early 2000s, researchers working on natural language processing had encountered a recurring challenge. Many of the most important tasks in the field, from tagging words with their grammatical roles to identifying names of people and places in text, shared a common characteristic. They all required making predictions that formed sequences, and these predictions were not independent. The label assigned to one word often constrained or influenced the labels that made sense for neighboring words.
In 2001, John Lafferty, Andrew McCallum, and Fernando Pereira introduced Conditional Random Fields, a probabilistic framework that would transform how researchers approached these structured prediction problems. Their insight was deceptively simple but profoundly important. Rather than predicting each element in a sequence independently and hoping that the pieces would fit together coherently, why not model the entire sequence as a single unified prediction problem?
This shift in perspective addressed a fundamental limitation in earlier approaches. Consider the task of determining the grammatical role of each word in a sentence. Traditional methods would look at "cat" in isolation and decide whether it was more likely to be a noun or a verb. But human language follows patterns. When we see the word "the" before "cat," we know with high confidence that "cat" must be functioning as a noun, because determiners precede nouns in English. Previous models largely ignored these structural regularities, treating each word as an independent classification problem.
CRFs changed this by explicitly modeling the dependencies between adjacent predictions. They provided a principled probabilistic framework that could capture how the label at one position in a sequence should influence the labels at neighboring positions. This approach would prove essential not just for grammatical tagging, but for named entity recognition, information extraction, and many other tasks where the structure of the output mattered as much as the individual predictions themselves.
The Structured Prediction Problem
To understand why CRFs represented such an important advance, it helps to examine the nature of structured prediction problems more closely. Imagine we are building a system to analyze the grammatical structure of English sentences. Given the sentence "The cat sat on the mat," we want our system to recognize that "The" is a determiner, "cat" is a noun, "sat" is a verb, "on" is a preposition, and so on. In linguistic notation, we might represent this as the sequence DT NN VB IN DT NN.
The challenge is that the grammatical role of each word depends significantly on context. The word "sat," for instance, could theoretically be a noun in some contexts, but when it appears after a noun like "cat," it is almost certainly functioning as a verb. Similarly, after we predict that "on" is a preposition, we should expect that the following article "the" will be followed by a noun, because that is how English prepositional phrases work.
Earlier approaches to this problem treated each decision independently. They would look at "cat," consider its features in isolation, perhaps note that it is a common noun form, and assign it the label "noun." Then they would move to the next word and repeat the process. This independence assumption made the mathematics tractable and the computation efficient, but it threw away crucial information. The system had no way to prefer sequences of labels that formed grammatically coherent patterns over those that did not.
CRFs addressed this limitation head-on. Instead of making separate independent predictions for each word, a CRF models the probability of the entire label sequence given the entire input sequence. This allows the model to capture regularities like "determiners are usually followed by nouns" or "verbs rarely appear before determiners" directly in its structure. The model learns not just which labels fit which words, but which sequences of labels fit together coherently.
The Conditional Probability Framework
The mathematical foundation of CRFs rests on a principle called conditional modeling. Rather than trying to model the joint probability of both inputs and outputs together, which would require learning about the distribution of input sentences themselves, CRFs focus directly on what matters for prediction. They model the probability of a particular label sequence given an observed input sequence. This conditional approach makes the learning problem more tractable and focuses the model's capacity on the prediction task itself.
At the heart of a CRF lies a deceptively elegant formula. For an input sequence represented as $x = (x_1, \ldots, x_n)$ and a candidate label sequence $y = (y_1, \ldots, y_n)$, the model assigns a probability according to this expression:

$$P(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{t=1}^{n} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, x, t) \right)$$

This equation encodes several key ideas. The summation in the numerator represents a score for how well a particular label sequence fits the input. This score is computed by evaluating a collection of feature functions, each denoted $f_k$, that examine different aspects of the input-output pair. Each feature function has an associated weight $\lambda_k$ that the model learns during training, determining how much that particular feature should influence the final prediction.
The exponential function transforms these scores into unnormalized probabilities, ensuring that higher-scoring sequences receive higher probability. The denominator, denoted $Z(x)$ and called the partition function, serves as a normalizing constant. It sums over all possible label sequences to ensure that the probabilities across all potential labelings add up to one, as any valid probability distribution must.
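To make the formula concrete, here is a minimal sketch in Python that evaluates it directly for a tiny example. The feature functions, weights, and label set are invented for illustration, and the partition function is computed by brute-force enumeration of every candidate labeling, which is only feasible because the example is so small; real systems replace the enumeration with dynamic programming, discussed later in this article.

```python
import math
from itertools import product

# Toy linear-chain CRF for illustration: hypothetical features and hand-picked
# weights, not a trained model. "START" is a dummy previous label at position 0.
LABELS = ["DT", "NN", "VB"]

def features(prev_label, label, words, t):
    """Return the names of the feature functions f_k that fire at position t."""
    fired = []
    word = words[t].lower()
    # State features: how well does this label fit this word?
    if word == "the" and label == "DT":
        fired.append("the->DT")
    if word == "cat" and label == "NN":
        fired.append("cat->NN")
    if word == "sat" and label == "VB":
        fired.append("sat->VB")
    # Transition features: how well do adjacent labels fit together?
    if prev_label == "DT" and label == "NN":
        fired.append("DT->NN")
    if prev_label == "NN" and label == "VB":
        fired.append("NN->VB")
    return fired

WEIGHTS = {"the->DT": 2.0, "cat->NN": 1.5, "sat->VB": 1.5,
           "DT->NN": 1.0, "NN->VB": 1.0}

def score(labels, words):
    """Weighted sum of fired features over all positions (the exponent in the formula)."""
    total, prev = 0.0, "START"
    for t, label in enumerate(labels):
        total += sum(WEIGHTS.get(f, 0.0) for f in features(prev, label, words, t))
        prev = label
    return total

def probability(labels, words):
    """P(y | x) = exp(score) / Z(x), with Z(x) computed by brute-force enumeration."""
    z = sum(math.exp(score(cand, words)) for cand in product(LABELS, repeat=len(words)))
    return math.exp(score(labels, words)) / z

words = ["The", "cat", "sat"]
print(probability(["DT", "NN", "VB"], words))   # the plausible labeling scores highest
print(probability(["VB", "DT", "DT"], words))   # an implausible labeling gets near-zero probability
```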
What makes this framework powerful is its flexibility. The feature functions can capture any aspect of the relationship between inputs and outputs that a researcher can imagine and encode. This generality allows CRFs to incorporate domain knowledge and linguistic intuitions directly into the model structure.
The Role of Feature Functions
The true power of CRFs emerges from how they use feature functions to capture different types of regularities in language. These functions serve as the building blocks through which the model understands the relationships between words and their labels. Researchers working with CRFs typically design several categories of features, each capturing a different aspect of the structured prediction problem.
State features examine how well a particular label fits a specific word given its characteristics. For instance, a state feature might fire when the word is "cat" and the proposed label is "noun," encoding the intuition that this word commonly functions as a noun in English. These features can consider various properties of the word itself, such as its spelling, capitalization, suffixes, or position in the sentence. State features essentially answer the question, "Does this label make sense for this particular word in isolation?"
Transition features capture patterns in how labels tend to follow one another. A transition feature might fire when a determiner label is followed by a noun label, reflecting the grammatical regularity that determiners typically precede nouns in English. These features encode the sequential dependencies that make CRFs effective for structured prediction. They answer the question, "Given that we assigned this label to the previous word, how plausible is this label for the current word?"
Beyond these two fundamental categories, researchers often design context features that look at a broader window of surrounding words to inform decisions. A context feature might consider whether the word two positions ahead looks like a verb when deciding how to label the current word. Some systems even incorporate global features that capture properties of the entire sentence that might influence local labeling decisions.
The beauty of this feature-based approach is its flexibility. Researchers can encode linguistic knowledge, intuitions from examining data, and domain-specific insights directly into the feature set. The CRF learning algorithm then determines which features are truly predictive and how much weight each should receive in making predictions.
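In practice, these features are usually written as small functions that emit named indicators for each token. The sketch below shows one common style, the per-token feature dictionaries accepted by toolkits such as sklearn-crfsuite; every feature name here is illustrative, and transition features between adjacent labels are typically added by the CRF toolkit itself rather than written by hand.

```python
def token_features(sentence, t):
    """Build an illustrative feature dictionary for the word at position t."""
    word = sentence[t]
    feats = {
        "word.lower": word.lower(),       # identity of the word itself
        "word.istitle": word.istitle(),   # capitalization pattern
        "word.isdigit": word.isdigit(),
        "suffix3": word[-3:],             # crude morphology (e.g. "-ing", "-ion")
    }
    # Context features: a small window around the current word
    if t > 0:
        feats["prev.lower"] = sentence[t - 1].lower()
    else:
        feats["BOS"] = True               # beginning of sentence
    if t < len(sentence) - 1:
        feats["next.lower"] = sentence[t + 1].lower()
    else:
        feats["EOS"] = True               # end of sentence
    return feats

sentence = ["The", "cat", "sat", "on", "the", "mat"]
print(token_features(sentence, 1))
```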
A Concrete Example
To see how these pieces fit together in practice, let us walk through a concrete example of part-of-speech tagging. Consider again the sentence "The cat sat on the mat." A CRF trained for this task would evaluate many possible label sequences and assign each a probability based on how well the features match the input.
For the correct labeling, DT NN VB IN DT NN, many state features would fire strongly. The model would note that "The" exhibits characteristics typical of determiners, being a common function word that appears at the start of phrases. The word "cat" has the spelling pattern and usage frequency of a noun. The word "sat" is an irregular past-tense form that appears in training data almost exclusively as a verb. The word "on" appears in the model's learned list of prepositions. Each of these observations contributes evidence that the proposed labels fit the observed words.
Simultaneously, the transition features would provide strong support for this labeling. The sequence begins with a determiner followed by a noun, one of the most common patterns in English noun phrases. Then a noun is followed by a verb, reflecting the typical subject-verb structure of simple sentences. The verb is followed by a preposition, a common pattern when verbs take prepositional phrase complements. Finally, the preposition is followed by another determiner-noun sequence, the standard structure of a prepositional phrase.
The CRF combines all this evidence through the weighted sum in its probability formula. Features that strongly indicate the correctness of this labeling contribute positive scores. The exponential and normalization then convert these scores into a probability. Alternative labelings would receive lower scores because fewer features would fire, or features with negative weights would activate, pulling down the overall probability.
Through training on labeled examples, the model learns which features are reliable indicators and how much weight each should receive. Features that consistently help identify correct labelings gain positive weights, while features that fire for incorrect labelings acquire negative or near-zero weights. The result is a system that can balance multiple sources of evidence to find the most probable label sequence for any input sentence.
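The following sketch shows what training such a tagger might look like with the third-party sklearn-crfsuite package. The two-sentence corpus, the feature extractor, and the hyperparameter values are placeholders; a real tagger would be trained on a large annotated treebank with a much richer feature set.

```python
# A minimal training sketch, assuming the sklearn-crfsuite package is installed
# (pip install sklearn-crfsuite). The corpus and hyperparameters are placeholders.
import sklearn_crfsuite

def simple_features(words, t):
    """A stripped-down version of the feature extractor sketched earlier."""
    return {"word.lower": words[t].lower(),
            "word.istitle": words[t].istitle(),
            "prev.lower": words[t - 1].lower() if t > 0 else "<BOS>"}

train_sents = [
    (["The", "cat", "sat", "on", "the", "mat"], ["DT", "NN", "VB", "IN", "DT", "NN"]),
    (["A", "dog", "slept", "under", "a", "rug"], ["DT", "NN", "VB", "IN", "DT", "NN"]),
]
X_train = [[simple_features(w, t) for t in range(len(w))] for w, _ in train_sents]
y_train = [tags for _, tags in train_sents]

# L-BFGS training with L1/L2 regularization; the model learns weights for the
# supplied state features and for label-to-label transitions jointly.
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X_train, y_train)

test = ["The", "dog", "sat", "on", "a", "mat"]
print(crf.predict([[simple_features(test, t) for t in range(len(test))]]))
```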
Applications Across Natural Language Processing
The introduction of CRFs had immediate and widespread impact across natural language processing. Their ability to capture sequential dependencies made them the method of choice for numerous tasks that had previously been tackled with less principled approaches.
Named entity recognition emerged as one of the most successful applications. The task of identifying spans of text that refer to people, organizations, locations, dates, and other entity types fits naturally into the CRF framework. The model can learn that words with capital letters appearing mid-sentence are more likely to be part of entity names, that person names often follow titles like "Dr." or "President," and that once a model starts labeling a multi-word entity like "New York City," it should continue the entity label for the remaining words. The transition features prove particularly valuable here, as they help the model maintain consistency when labeling multi-word entities.
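A common way to cast entity recognition as sequence labeling is the BIO scheme, in which B- marks the first word of an entity, I- marks a continuation, and O marks any word outside an entity. The snippet below is a small illustration of why the transitions matter; the validity check is purely illustrative, standing in for a constraint that a trained CRF effectively learns through its transition weights.

```python
# Illustrative BIO encoding for named entity recognition.
tokens = ["President", "Obama", "visited", "New", "York", "City", "yesterday"]
labels = ["O", "B-PER", "O", "B-LOC", "I-LOC", "I-LOC", "O"]

def is_valid_bio(labels):
    """An I- tag is only valid after a B- or I- tag of the same entity type.
    A CRF can learn (or be constrained) to give invalid transitions very low weight."""
    prev = "O"
    for label in labels:
        if label.startswith("I-") and prev[2:] != label[2:]:
            return False
        prev = label
    return True

print(is_valid_bio(labels))                   # True
print(is_valid_bio(["O", "I-LOC", "O"]))      # False: I-LOC cannot follow O
```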
Part-of-speech tagging, the task we used for illustration, saw substantial improvements with CRFs. While earlier statistical taggers achieved reasonable accuracy, CRFs pushed performance higher by better capturing the grammatical constraints that govern how tags can follow one another. The gains were particularly noticeable on ambiguous words whose correct tag depends heavily on context.
Syntactic chunking, the task of identifying phrases like noun phrases and verb phrases without building complete parse trees, also benefited greatly from the CRF framework. The model could learn to recognize phrase boundaries by noting patterns like "a noun phrase typically continues until we encounter a verb" or "prepositional phrases follow a predictable determiner-noun pattern."
More broadly, CRFs became the foundation for information extraction systems that needed to identify and extract structured data from unstructured text. Whether extracting speaker-statement pairs from interview transcripts, pulling product names and prices from advertisements, or identifying event types and participants from news articles, the CRF's ability to model sequential structure proved invaluable. Any task where outputs formed a structured sequence, and where the elements of that structure exhibited dependencies, became a candidate for CRF-based solutions.
Theoretical Foundations and Inference
Part of what made CRFs so influential was the solid theoretical foundation they provided for structured prediction. They brought together ideas from probability theory, optimization, and dynamic programming into an elegant and principled framework.
The conditional probability formulation meant that CRFs normalized over the entire space of possible label sequences for a given input. This global normalization ensured that the model produced valid probability distributions, where all the probabilities for different possible labelings summed to exactly one. This contrasted with some earlier approaches that made local normalization decisions and could produce inconsistent global probability assignments.
The feature-based representation provided remarkable flexibility. In principle, any computable function that examines the input sequence and proposed output sequence can serve as a feature. This means researchers can incorporate insights from linguistics, domain knowledge about their specific task, or patterns observed in data analysis. The model structure itself does not constrain what features can be used, only requiring that each feature can be evaluated for any input-output pair.
Efficiently finding the most likely label sequence for a given input, despite the exponentially large space of possible sequences, is crucial for practical applications. Fortunately, the sequential structure that CRFs assume allows for efficient inference using dynamic programming algorithms. The Viterbi algorithm, which had been used for decades in speech recognition and other sequence modeling problems, can find the optimal labeling in time that grows linearly with the length of the sequence (and quadratically with the number of labels) rather than exponentially with the length. This computational tractability makes CRFs practical even for long sequences.
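A compact NumPy sketch of Viterbi decoding for a linear-chain model is shown below. The emission and transition scores are toy numbers standing in for the summed weighted state and transition features; the recursion and backtrace follow the standard algorithm.

```python
import numpy as np

def viterbi(emission, transition):
    """Find the highest-scoring label sequence for a linear-chain CRF.

    emission:   (n_words, n_labels) per-position label scores
                (the summed weighted state features for each word/label pair)
    transition: (n_labels, n_labels) label-to-label scores
                (the summed weighted transition features)

    Runs in O(n_words * n_labels^2) time instead of enumerating all
    n_labels ** n_words candidate sequences.
    """
    n, k = emission.shape
    best = np.zeros((n, k))               # best score of any path ending in label j at position t
    back = np.zeros((n, k), dtype=int)    # which previous label achieved that score
    best[0] = emission[0]
    for t in range(1, n):
        # candidate[i, j] = score of extending a path ending in label i with label j
        candidate = best[t - 1][:, None] + transition + emission[t][None, :]
        back[t] = candidate.argmax(axis=0)
        best[t] = candidate.max(axis=0)
    # Trace back the best path from the final position
    path = [int(best[-1].argmax())]
    for t in range(n - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Toy example with 3 words and 2 labels (0 = "DT", 1 = "NN")
emission = np.array([[2.0, 0.1], [0.2, 1.5], [0.3, 1.0]])
transition = np.array([[0.0, 1.0],    # DT -> NN is favored
                       [0.2, 0.0]])
print(viterbi(emission, transition))   # [0, 1, 1]
```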
Training a CRF involves finding the feature weights that maximize the conditional likelihood of the correct labelings in the training data. This optimization problem is convex, meaning it has a single global optimum without local optima that might trap training algorithms. Standard optimization methods can reliably find good solutions, though the computation can be expensive for large datasets due to the need to compute the partition function during training.
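The partition function itself is computed with the closely related forward algorithm, sketched below with the same toy scores as the Viterbi example. It replaces Viterbi's maximization with a log-sum-exp, so the result is the log of a sum over all labelings rather than the score of the single best one; the sketch assumes SciPy is available for a numerically stable logsumexp.

```python
import numpy as np
from scipy.special import logsumexp

def log_partition(emission, transition):
    """Compute log Z(x) for a linear-chain CRF with the forward algorithm.

    Instead of summing over all n_labels ** n_words label sequences explicitly,
    the forward recursion shares work across sequences and runs in
    O(n_words * n_labels^2) time. Working in log space avoids overflow.
    """
    n, k = emission.shape
    alpha = emission[0].copy()   # log-sum of scores of all partial paths ending at position 0
    for t in range(1, n):
        # alpha_new[j] = log sum_i exp(alpha[i] + transition[i, j]) + emission[t, j]
        alpha = logsumexp(alpha[:, None] + transition, axis=0) + emission[t]
    return logsumexp(alpha)

# The same toy scores as in the Viterbi sketch above
emission = np.array([[2.0, 0.1], [0.2, 1.5], [0.3, 1.0]])
transition = np.array([[0.0, 1.0], [0.2, 0.0]])

log_z = log_partition(emission, transition)
# The log-probability of any labeling is then score(labeling) - log_z
print(log_z)
```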
Advantages Over Earlier Approaches
When CRFs appeared in 2001, they represented a significant advance over the methods that had previously dominated sequence labeling tasks. Understanding these advantages helps explain why CRFs were adopted so rapidly and widely across the field.
The conditional modeling approach meant that the CRF did not need to learn anything about the distribution of input sentences themselves. Earlier generative models, which tried to model the joint probability of inputs and outputs, devoted modeling capacity to capturing patterns in the input data even though this was not necessary for making predictions. By focusing exclusively on the conditional probability of outputs given inputs, CRFs could concentrate all their learning on the prediction task itself. This focus typically led to better performance with the same amount of training data.
The feature-based architecture provided flexibility that earlier approaches lacked. A researcher could design and add new features encoding linguistic insights or domain knowledge without modifying the fundamental model structure or learning algorithm. Want to add a feature capturing whether the current word is hyphenated, or whether it appears in a gazetteer of place names? Simply add the feature and retrain. The model would automatically learn the appropriate weight for this feature. This flexibility accelerated research by making it easy to experiment with different feature sets.
The global optimization performed during inference represented another key advantage. Rather than tagging each word greedily based on local information and hoping the sequence would be coherent, CRFs found the globally optimal sequence considering all positions simultaneously. This meant that information about words later in the sentence could influence labeling decisions for earlier words, and vice versa. The result was more coherent output sequences that better respected linguistic constraints.
CRFs also provided calibrated probability estimates for their predictions. Unlike systems that simply output labels, a CRF could indicate its confidence in a particular labeling. This probabilistic output proved valuable for downstream applications that needed to reason about uncertainty, and for human review processes where high-confidence predictions might be accepted automatically while low-confidence cases received additional scrutiny.
Perhaps most fundamentally, CRFs explicitly modeled the dependencies between adjacent labels through their transition features. This direct representation of sequential structure contrasted sharply with approaches that treated each position independently and had no way to capture how labels constrain one another.
Limitations and Challenges
For all their strengths, CRFs were not without limitations. Understanding these constraints helps explain both the careful engineering required to use them effectively and the motivation for later approaches that would address some of these issues.
The reliance on hand-crafted features represented perhaps the most significant practical challenge. While the flexibility to incorporate arbitrary features was a strength, someone still needed to design those features. This required deep expertise in both the application domain and the characteristics of the data. A researcher building a named entity recognizer needed to think carefully about what patterns might distinguish entity names from common words, design features to capture those patterns, implement those features correctly, and iteratively refine the feature set based on performance. This feature engineering process was time-consuming and required skills that were as much art as science. Different tasks required different features, making it difficult to transfer systems across domains.
The computational cost of training could be substantial, especially for large datasets. Computing the partition function during training required summing over all possible label sequences; dynamic programming made each such sum tractable, but it had to be repeated for every training sequence at every optimization step, and its cost grew quadratically with the number of possible labels. Researchers working with CRFs on large-scale problems often needed significant computational resources and careful optimization of their implementations.
The structure of CRFs typically captured only pairwise dependencies between adjacent labels. While this first-order Markov assumption covered many important cases, some linguistic phenomena involve longer-range dependencies. A decision about the label for the current word might ideally depend on the labels of words several positions away, not just the immediately preceding word. Higher-order CRFs that captured these longer dependencies were possible in principle, but the computational costs grew rapidly, making them impractical for many applications.
The linear relationship between features and the log-probability score was another constraint. The model assumed that the contribution of each feature could be captured by a single weight that applied regardless of context. This meant that complex interactions between features were difficult to model. If the importance of one feature depended on the values of other features in non-linear ways, the CRF framework could not naturally capture this without explicitly designing interaction features.
Finally, while the convexity of the optimization problem eliminated local optima in terms of the model parameters, researchers still faced choices about feature design and model structure that could significantly impact performance. Finding the right feature set remained an open-ended search problem without guaranteed optimal solutions.
The Lasting Impact
The introduction of Conditional Random Fields in 2001 left an enduring mark on natural language processing that extends far beyond their specific technical contributions. The principles that CRFs embodied and the insights they provided would shape research directions for years to come, even as newer methods eventually superseded them for many applications.
CRFs established structured prediction as a central paradigm in NLP. The recognition that many important tasks involve predicting outputs with internal dependencies, and that these dependencies should be modeled explicitly rather than ignored, became foundational to how researchers think about sequence labeling and related problems. Even modern neural approaches that have largely replaced CRFs in production systems still grapple with the same fundamental challenge of capturing dependencies in structured outputs.
The conditional modeling framework demonstrated convincingly that focusing directly on the prediction task could outperform more general generative approaches. This lesson influenced the design of many subsequent models. Rather than expending modeling capacity on aspects of the data distribution that were not relevant to the prediction goal, later systems would increasingly adopt discriminative approaches that learned only what was necessary for making accurate predictions.
The feature-based paradigm that CRFs exemplified represented the state of the art in a particular era of machine learning. While neural networks would eventually learn to discover features automatically from raw data, the careful engineering of features for CRFs taught researchers a great deal about what patterns and regularities in language were actually predictive for different tasks. Many insights from this era of intensive feature engineering would later inform the design of neural architectures and the interpretation of what neural networks learned.
Perhaps most importantly, CRFs demonstrated the value of principled probabilistic approaches grounded in solid theoretical foundations. The clean mathematical framework, the convex optimization, the efficient inference algorithms, and the calibrated probability estimates all exemplified how bringing theoretical rigor to practical problems could yield both better understanding and better performance. This emphasis on principled approaches, even as methods grew more complex, would remain influential throughout the field's subsequent evolution.
The Transition to Neural Approaches
As neural networks began their resurgence in the 2010s, they did not simply replace CRFs but in many ways built upon the foundations that CRFs had established. The relationship between these two paradigms illustrates how progress in machine learning often involves synthesizing insights from different approaches rather than wholesale abandonment of earlier methods.
One of the first ways researchers combined CRFs with neural networks was to use neural networks for feature learning while retaining the CRF's structured prediction framework. Instead of hand-crafting features based on linguistic intuition, a neural network could learn useful representations from the raw input data. These learned features would then feed into a CRF layer that captured the sequential dependencies between labels. This hybrid architecture, often called a neural CRF or, in its best-known form, a BiLSTM-CRF, attempted to get the best of both worlds: automatic feature learning from neural networks and principled structured prediction from CRFs.
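As a rough illustration of the neural half of such a hybrid, the PyTorch sketch below maps token indices to per-position emission scores with a bidirectional LSTM. The architecture sizes and inputs are invented, and a complete tagger would place a CRF layer, a learned transition matrix together with the Viterbi and forward computations shown earlier, on top of these scores.

```python
# A compact sketch of the neural half of a BiLSTM-CRF tagger, assuming PyTorch.
# The network replaces hand-crafted state features with learned emission scores;
# a CRF layer would consume these scores to model label dependencies.
import torch
import torch.nn as nn

class EmissionEncoder(nn.Module):
    def __init__(self, vocab_size, num_labels, emb_dim=64, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.to_labels = nn.Linear(2 * hidden_dim, num_labels)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer word indices
        states, _ = self.lstm(self.embed(token_ids))
        return self.to_labels(states)   # (batch, seq_len, num_labels) emission scores

# Toy usage with made-up sizes and random inputs
encoder = EmissionEncoder(vocab_size=1000, num_labels=5)
emissions = encoder(torch.randint(0, 1000, (1, 6)))   # one sentence of 6 tokens
print(emissions.shape)                                 # torch.Size([1, 6, 5])
```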
The fundamental insight that outputs in sequence labeling tasks have dependencies that should be modeled explicitly carried forward even as purely neural approaches became dominant. Recurrent neural networks and later attention-based architectures like transformers still needed mechanisms to ensure that their predictions formed coherent sequences. Some modern architectures include explicit modules inspired by CRF dynamics, while others learn to capture these dependencies implicitly through their training objectives.
What changed most dramatically was the shift from manual feature engineering to automatic feature learning. Neural networks, given sufficient data and computational resources, could discover hierarchies of features automatically. The lower layers might learn about character patterns and word shapes, middle layers might learn about syntactic structures, and higher layers might learn about semantic relationships. This eliminated the need for the painstaking feature design that had been necessary with CRFs.
Despite these advances, CRFs have not disappeared. In settings where training data is limited, where interpretability is crucial, or where the task structure fits naturally into the CRF framework, they remain competitive. Many commercial NLP systems still use CRFs or hybrid approaches, particularly in domains like biomedical text processing where the costs of errors are high and practitioners value the interpretability that comes with explicit feature-based models.
Visualizing the Structure
To build intuition about how CRFs capture sequential dependencies, it helps to visualize the structure of the model graphically. The diagram below illustrates the relationships that a CRF models for a simple three-word sequence.
In this representation, the green nodes represent the observed input words, while the blue nodes represent the hidden labels that the model must predict. The red edges connecting input words to their corresponding labels represent the state features, which capture how well each label fits each word based on its characteristics. The gray edges connecting adjacent labels represent the transition features, which capture how well pairs of labels work together in sequence.
This graphical structure makes explicit the dependencies that the CRF models. Notice that each label node is connected both to its corresponding input word and to the adjacent label nodes. This means that when the model assigns a probability to a particular labeling, it considers both how well each label fits its word and how well the sequence of labels fits together. Information flows through these connections during both training and inference, allowing the model to find labelings that are globally coherent rather than just locally optimal.
The advent of Conditional Random Fields marked a pivotal moment in the evolution of sequence modeling for natural language processing. By providing a principled framework for structured prediction that explicitly captured dependencies between labels, CRFs addressed limitations that had constrained earlier approaches and established patterns that would persist even as the field moved toward neural methods.
The shift from treating each prediction independently to modeling entire sequences jointly represented more than just a technical improvement. It reflected a deeper understanding of the nature of language itself. Words do not function in isolation. Their meanings, grammatical roles, and relationships are fundamentally contextual, shaped by what comes before and after them. CRFs provided a way to embed this insight directly into the mathematical structure of prediction models.
This emphasis on capturing structure, on modeling dependencies explicitly rather than hoping they would emerge implicitly, would echo through subsequent developments in NLP. When researchers later designed neural architectures for sequence labeling, they carried forward the lessons learned from CRFs about the importance of considering the full sequence context when making predictions. The specific mechanisms changed, but the underlying principle remained central to how the field approached these problems.
Looking Forward
Conditional Random Fields demonstrated that bringing principled probabilistic thinking to bear on structured prediction problems could yield both theoretical elegance and practical effectiveness. The framework they provided, combining flexible feature-based representation with rigorous probabilistic inference, would influence NLP research for years to come.
As the field moved increasingly toward neural approaches in the 2010s and beyond, the specific techniques that made CRFs successful would be superseded by methods that learned features automatically and captured even more complex dependencies. Yet the fundamental insights persisted. The recognition that structure matters, that dependencies between predictions should be modeled rather than ignored, and that conditional modeling can be more effective than joint modeling would all carry forward into the neural era.
In this sense, CRFs represent a crucial step in the gradual progression from simple independent models toward systems capable of capturing the rich, interdependent structure of natural language. They showed that explicitly modeling the problem structure, rather than hoping that simple models would somehow capture it implicitly, could lead to substantial improvements in performance. This lesson would remain relevant even as the methods for capturing that structure evolved, reminding researchers that understanding the nature of the problem being solved is as important as the sophistication of the tools being applied.
The story of CRFs is thus not just about a particular mathematical framework or set of algorithms. It is about the enduring importance of structured prediction, the value of principled approaches grounded in probability theory, and the insight that language is fundamentally sequential and contextual. These themes would continue to shape the field's evolution long after CRFs themselves had largely been replaced by newer methods.
Reference
Lafferty, J., McCallum, A., & Pereira, F. C. N. (2001). Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001), 282–289.