In 2002, IBM researchers introduced BLEU (Bilingual Evaluation Understudy), revolutionizing machine translation evaluation by providing the first widely adopted automatic metric that correlated well with human judgments. By comparing n-gram overlap with reference translations and adding a brevity penalty, BLEU enabled rapid iteration and development, establishing automatic evaluation as a fundamental principle across all language AI.

2002: BLEU Metric
In 2002, a team of researchers at IBM published a paper that would fundamentally change how we evaluate machine translation systems. Their contribution was not a new translation algorithm or neural architecture, but rather a deceptively simple metric called BLEU (Bilingual Evaluation Understudy). This metric would become so influential that it remains the standard evaluation method for machine translation more than two decades later, despite its well-known limitations.
The introduction of BLEU marked a turning point in natural language processing research. Before BLEU, evaluating translation quality required human experts to manually assess each translation, a process that was not only expensive and time-consuming but also subjective and inconsistent. Different evaluators would often disagree on the quality of the same translation, making it difficult to compare systems or measure progress reliably. This bottleneck in evaluation was holding back the entire field, limiting how quickly researchers could iterate on new ideas.
BLEU addressed a fundamental challenge: how do you automatically measure the quality of a machine translation when there are many valid ways to translate the same sentence? A good translation might use different words, different sentence structures, or different stylistic choices while still conveying the same meaning accurately. BLEU's elegant solution was to compare the machine's output against multiple human reference translations, focusing on n-gram overlap rather than requiring exact matches. By examining whether the machine translation used similar phrases and word combinations as human translators, BLEU could provide a reasonable proxy for translation quality without requiring human judgment for each evaluation.
The Evaluation Crisis in Machine Translation
To understand why BLEU was so revolutionary, we need to appreciate the depth of the evaluation problem that existed before 2002. Machine translation research had been progressing since the 1950s, but by the early 2000s, the field faced a paradox: while researchers were developing increasingly sophisticated translation systems, they lacked a reliable way to measure whether their improvements actually made translations better.
Human evaluation, the traditional gold standard, required bilingual experts to read both the source text and the machine translation, then rate the translation's quality on various dimensions such as adequacy (does it preserve the meaning?) and fluency (does it read naturally?). This process was not only expensive and time-consuming, but also fundamentally inconsistent. Two evaluators might disagree substantially on the quality of the same translation, particularly for subtle differences in style or when both translations were mediocre. Even the same evaluator might rate identical translations differently on different days, influenced by fatigue, context, or changing standards.
The inconsistency problem was compounded by the lack of standardization across research groups. Some teams used five-point scales, others used binary judgments. Some focused on adequacy, others on fluency. Some evaluated sentence by sentence, others looked at entire documents. This made it nearly impossible to compare results across different papers or reproduce previous findings. A system that scored well in one group's evaluation might perform poorly under another group's criteria, not because of any real difference in quality, but simply due to evaluation methodology.
Perhaps most critically, the slow pace of human evaluation created a bottleneck in the research process. Testing a single modification to a translation system might require weeks of human evaluation, making rapid experimentation impractical. Researchers needed a way to quickly test ideas, compare alternatives, and iterate on their systems. Without fast, reliable evaluation, the pace of progress in machine translation was artificially constrained by the evaluation process itself.
The Core Insight Behind BLEU
BLEU rests on a simple but powerful observation: good translations tend to use the same words and phrases as human translators would use. If a machine translation contains many of the same word sequences (n-grams) as human reference translations, it is likely to be a good translation. Conversely, if a machine translation uses completely different words and phrases, it is probably a poor translation. This insight provides a way to evaluate translation quality by measuring overlap between the machine output and human references, without requiring human judgment for each individual translation.
The metric works by examining n-grams of different lengths. An n-gram is simply a contiguous sequence of n words from the text. Unigrams are individual words, bigrams are two-word sequences, trigrams are three-word sequences, and so on. By looking at multiple n-gram lengths simultaneously, BLEU can capture both word-level accuracy (through unigrams) and phrase-level fluency (through longer n-grams). A translation might get individual words right but arrange them awkwardly, or it might have natural-sounding phrases but miss key content words. Examining multiple n-gram lengths helps BLEU distinguish between these different types of translation errors.
The mathematical formulation combines these components into a single score. BLEU calculates precision for each n-gram length, measuring what fraction of the n-grams in the machine translation also appear in the reference translations. These precision scores are then combined using a geometric mean, which has the important property that if precision is zero for any n-gram length, the entire score becomes zero. This ensures that translations must perform reasonably well at all n-gram lengths to achieve a high score.
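To make this concrete, here is a minimal Python sketch of n-gram extraction and clipped ("modified") precision, the counting rule the original BLEU paper uses so that a candidate cannot inflate its score by repeating a matching word. The function names are ours, chosen for illustration rather than taken from any reference implementation.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram counts only up to
    the maximum number of times it appears in any single reference."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

ref = "the cat sat on the mat".split()
print(modified_precision("the cat the cat".split(), [ref], 1))  # 0.75: the second "cat" is clipped
```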
The complete formula is:

$$\text{BLEU} = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$$

In this formula, $BP$ represents the brevity penalty (which we'll discuss shortly), $w_n$ are weights for each n-gram length (typically set to $1/N$ for uniform weighting), and $p_n$ is the precision for n-grams of length $n$. Standard BLEU uses four n-gram lengths ($N = 4$), examining unigrams, bigrams, trigrams, and 4-grams with equal weight.
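The combination step itself is a few lines of code. The sketch below takes a list of per-length precisions and a brevity penalty value and applies the weighted geometric mean from the formula above; the names and structure are illustrative, not a reference implementation.

```python
import math

def combine(precisions, bp, weights=None):
    """Weighted geometric mean of n-gram precisions, scaled by the brevity penalty."""
    weights = weights or [1.0 / len(precisions)] * len(precisions)
    if min(precisions) == 0.0:
        return 0.0  # log(0) is undefined; by convention the score is zero
    return bp * math.exp(sum(w * math.log(p) for w, p in zip(weights, precisions)))

print(combine([1.0, 1.0, 1.0, 1.0], bp=1.0))  # 1.0 for a perfect candidate
print(combine([0.8, 0.5, 0.3, 0.0], bp=1.0))  # 0.0: one zero precision zeroes the score
```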
The Brevity Penalty: Preventing Gaming
One of BLEU's most clever design choices addresses a potential exploit in the precision-based metric. Without any constraint on length, a system could achieve artificially high precision scores by producing very short translations. Consider a machine that simply outputs "the" for every translation. Every unigram would match reference translations (since "the" appears in most English sentences), giving perfect unigram precision. This would be useless as a translation but could score well on a naive precision metric.
The brevity penalty solves this problem by multiplying the score by a factor that penalizes translations shorter than the references. When the candidate translation is at least as long as the reference, no penalty is applied. When the candidate is shorter, an exponential penalty reduces the score, with the penalty growing more severe as the translation becomes shorter relative to the reference.
The formula for the brevity penalty is:

$$BP = \begin{cases} 1 & \text{if } c > r \\ e^{\,1 - r/c} & \text{if } c \leq r \end{cases}$$

Here, $c$ represents the length of the candidate translation (in words) and $r$ represents the effective reference length. When multiple reference translations are available, $r$ is chosen as the length of the reference closest to the candidate length, which provides a fairer comparison and avoids penalizing reasonable variations in translation length.
This penalty has an elegant mathematical property: it approaches zero as the translation becomes arbitrarily short, and it equals one when the translation matches the reference length. For example, a translation half the length of the reference would receive a penalty of approximately 0.37, reducing the final score substantially even if the precision were perfect.
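A short sketch of the penalty follows, again with illustrative names; the tie-breaking rule for choosing the effective reference length varies slightly across implementations, so treat this as one reasonable choice.

```python
import math

def brevity_penalty(candidate, references):
    """Exponential penalty for candidates shorter than the closest reference."""
    c = len(candidate)
    # Effective reference length: the reference length closest to the candidate's.
    r = min((len(ref) for ref in references),
            key=lambda length: (abs(length - c), length))
    return 1.0 if c >= r else math.exp(1.0 - r / c)

# A candidate half the reference length: exp(1 - 2) ≈ 0.37, as in the example above.
print(brevity_penalty(["word"] * 3, [["word"] * 6]))
```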
Working Through Examples
To build intuition for how BLEU operates in practice, let's walk through the calculation for several example translations, starting with a perfect match and then examining various types of translation errors.
Consider a reference translation:
"The cat sat on the mat"
When the candidate translation is identical: "The cat sat on the mat"
We first extract and compare n-grams at different levels. The unigrams are "The", "cat", "sat", "on", "the", "mat", and all six appear in the reference. The bigrams are "The cat", "cat sat", "sat on", "on the", and "the mat", and all five appear in the reference. Similarly, the trigrams and 4-grams would all match perfectly.
The precision calculations become straightforward. For unigrams, we have 6 matches out of 6 total, giving 1.0. For bigrams, we have 5 matches out of 5 total, also giving 1.0. The same pattern holds for longer n-grams. The brevity penalty is 1.0 because the candidate length matches the reference length (both have 6 words). Combining these with equal weights in the geometric mean gives $\text{BLEU} = 1.0 \times \exp\left(\tfrac{1}{4}(\log 1 + \log 1 + \log 1 + \log 1)\right) = 1.0$. A perfect translation receives a perfect score.
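Putting the earlier sketches together (modified_precision, combine, and brevity_penalty are the hypothetical helpers defined above), the perfect match indeed comes out at 1.0:

```python
def bleu(candidate, references, max_n=4):
    """Clipped n-gram precisions combined by geometric mean and brevity penalty."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    return combine(precisions, brevity_penalty(candidate, references))

reference = "the cat sat on the mat".split()
print(bleu("the cat sat on the mat".split(), [reference]))  # 1.0 for an exact match
```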
Now consider a poor translation: "A dog ran"
This translation has a completely different meaning and is much shorter. Looking at unigrams, we have "A", "dog", and "ran", none of which appears in the reference, so unigram precision is 0 out of 3, or 0.0. For bigrams, we have "A dog" and "dog ran", neither of which appears in the reference, giving us 0 out of 2 matches, or precision of 0.0. When any n-gram precision is zero, the geometric mean becomes zero regardless of the other precisions.
The brevity penalty also penalizes this translation. With a candidate length of 3 and a reference length of 6, we calculate $BP = e^{1 - 6/3} = e^{-1} \approx 0.37$. Even if the precision weren't zero, this penalty would substantially reduce the score. The final BLEU score is 0.0, correctly identifying this as a very poor translation.
Let's examine a more interesting case where the translation is reasonable but imperfect: "The small cat sat on a mat"
Here we have 7 words instead of 6: we've added "small" and changed "the mat" to "a mat". Of the unigrams, "The", "cat", "sat", "on", and "mat" appear in the reference, but "small" and "a" do not, giving a precision of 5 out of 7, or about 0.71. For bigrams, only "cat sat" and "sat on" match reference bigrams; "The small", "small cat", "on a", and "a mat" do not, so bigram precision falls to 2 out of 6, or about 0.33. Only one trigram ("cat sat on") matches, and no 4-gram matches at all, so a strict sentence-level BLEU-4 for this candidate is actually zero. This is why sentence-level BLEU is usually computed with smoothing, or why BLEU is reported at the corpus level, where n-gram matches aggregated over many sentences keep the higher-order precisions above zero. The brevity penalty is 1.0 because the translation is not shorter than the reference. The overall picture is what we want: a translation that captures the main meaning but includes changes not present in the reference scores well below a perfect match.
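Running both flawed candidates through the same hypothetical sketch reproduces these numbers:

```python
reference = "the cat sat on the mat".split()

for text in ["a dog ran", "the small cat sat on a mat"]:
    candidate = text.split()
    precisions = [round(modified_precision(candidate, [reference], n), 2)
                  for n in range(1, 5)]
    bp = round(brevity_penalty(candidate, [reference]), 2)
    print(text, precisions, bp, bleu(candidate, [reference]))

# a dog ran                  -> [0.0, 0.0, 0.0, 0.0]    BP 0.37  BLEU 0.0
# the small cat sat on a mat -> [0.71, 0.33, 0.2, 0.0]  BP 1.0   BLEU 0.0
```

A smoothed sentence-level BLEU, or a corpus-level computation over many sentences, would give the second candidate a moderate score rather than zero, which is how the metric is used in practice.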
Why BLEU Succeeded
BLEU's adoption across the machine translation community was remarkably rapid, and this success stemmed from several key advantages that addressed the field's most pressing needs.
The most immediate benefit was speed. Computing a BLEU score required only simple string matching operations, which could be performed in milliseconds even on modest hardware. A researcher could evaluate thousands of translations in the time it would take a human evaluator to assess a single sentence. This dramatic acceleration in evaluation speed transformed the research workflow, enabling researchers to test variations of their systems multiple times per day rather than waiting weeks for evaluation results.
Consistency was equally important. BLEU always produced exactly the same score for the same input, eliminating the variability that plagued human evaluation. This made it possible to detect small improvements in translation quality with confidence. If a modification to a translation system increased the BLEU score by even a fraction of a point on a large test set, researchers could be certain this represented a real change rather than noise from inconsistent human judgments.
The metric's correlation with human judgments provided crucial validation. The original BLEU paper demonstrated that BLEU scores correlated strongly with human quality assessments when measured across different systems. While BLEU might disagree with human judges on individual sentences, the ranking of systems by BLEU score tended to match the ranking by human preference. This meant researchers could use BLEU for rapid iteration while still trusting that improvements in BLEU score would likely correspond to improvements that humans would appreciate.
BLEU's language independence made it universally applicable. The same algorithm worked for translating between any pair of languages, from English to French, Chinese to Arabic, or Japanese to German. Researchers didn't need to develop language-specific evaluation metrics or hire evaluators for each language pair. As long as they had reference translations in the target language, BLEU could assess translation quality.
Perhaps most importantly, BLEU provided a common standard that made research comparable across groups and reproducible over time. When different papers reported BLEU scores on the same test sets, readers could directly compare the systems' performance. This standardization accelerated progress by making it clear which approaches were most promising and enabling researchers to build on each other's work with confidence.
Beyond Translation: BLEU's Wider Impact
The success of BLEU in machine translation inspired researchers to adapt its principles to other natural language processing tasks. Whenever a problem involved generating text that could be compared against reference outputs, BLEU offered a potential evaluation solution.
Text summarization became one of the first domains to adopt BLEU-style metrics. Researchers developing automatic summarization systems faced evaluation challenges similar to those in machine translation: how do you automatically assess whether a generated summary captures the key information from a document? The ROUGE (Recall-Oriented Understudy for Gisting Evaluation) metric, introduced shortly after BLEU, applied similar n-gram overlap principles to summarization, though with a focus on recall rather than precision to ensure summaries captured important content.
Dialogue systems and conversational AI also borrowed from BLEU's approach. When training systems to generate natural responses in conversation, researchers needed automatic metrics to measure whether generated responses were appropriate. While BLEU had significant limitations for this task (since many different responses might be appropriate in conversation), it provided a starting point for rapid evaluation during system development.
The rise of code generation systems created another application domain. When AI systems generate programming code, their output can be compared against reference implementations using BLEU-like metrics. The same principles apply: correct code tends to use similar programming constructs and patterns as human-written reference code. While exact matching is more important in code than in natural language (since a single character difference can break a program), n-gram overlap still provides useful signal about code quality.
More broadly, BLEU established a template that influenced evaluation across sequence generation tasks. The core idea, that generated outputs should be compared against multiple human references using overlap-based metrics, became a standard approach throughout natural language processing. Even when BLEU itself wasn't the appropriate metric, its principles guided the development of task-specific alternatives.
The Limitations Become Apparent
As BLEU gained widespread adoption, researchers also became increasingly aware of its limitations. These shortcomings didn't diminish BLEU's practical value for rapid system comparison, but they highlighted important gaps between automatic metrics and true translation quality.
The most fundamental limitation stemmed from BLEU's focus on surface-level n-gram matching. BLEU treats words as atomic symbols without understanding their meaning, so it cannot recognize synonyms or semantically equivalent phrases. A translation using "big" instead of "large" would be penalized even though the meaning is nearly identical. More problematically, BLEU cannot distinguish between translations that preserve meaning and those that subtly distort it. A translation might use exactly the right words in almost the right order, achieving a high BLEU score, while completely reversing the meaning of a negation or misidentifying the subject and object of an action.
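A quick illustration using the hypothetical bleu() sketch from earlier (the sentences are invented for the example): a single synonym swap breaks matches at every n-gram length, even though the meaning is essentially unchanged.

```python
reference = "the large cat sat on the mat".split()
paraphrase = "the big cat sat on the mat".split()

# Swapping "big" for "large" costs matches at every n-gram length.
print(round(bleu(paraphrase, [reference]), 2))  # ≈ 0.64
print(round(bleu(reference, [reference]), 2))   # 1.0 for the exact match
```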
The quality of BLEU scores depends critically on the reference translations. With only a single reference, BLEU severely penalizes any legitimate variation in wording or phrasing. Multiple references help, but even with several references, BLEU cannot account for all valid translation possibilities. Two human translators working independently might produce quite different translations that are both excellent, yet a machine translation matching one would score poorly when compared against the other. This reference dependence means that BLEU measures not absolute translation quality but rather similarity to a specific set of reference translations.
BLEU exhibits various biases that can advantage certain types of systems over others. The brevity penalty addresses length bias in one direction (preventing too-short translations) but does nothing to penalize translations that are too long and wordy. The equal weighting of different n-gram lengths means that systems can potentially game the metric by optimizing for specific n-gram lengths. More subtly, BLEU tends to favor translations that are structurally similar to the references, potentially penalizing valid translations that use different but equally correct grammatical structures.
The metric also struggles with interpretability. Unlike human evaluation with clear quality scales (like "excellent" or "poor"), BLEU produces scores between 0 and 1 that lack intuitive meaning. Is a BLEU score of 0.35 acceptable? What about 0.42? The answer depends on the language pair, domain, difficulty of the source text, and number of references. Two systems with BLEU scores of 0.38 and 0.40 might produce noticeably different translation quality, or the difference might be imperceptible. Without extensive experience with BLEU scores in a particular setting, the numbers themselves provide limited insight.
Perhaps most importantly, BLEU's limitations mean it should not be the sole metric for evaluating translation quality. While invaluable for rapid iteration during development, final evaluation of systems should include human assessment, particularly for high-stakes applications where translation errors could have serious consequences. BLEU guides researchers toward better systems but cannot replace human judgment about whether a translation is truly fit for purpose.
Principles That Endured
Looking beyond BLEU's specific formula, the metric established several principles that would shape evaluation methodology throughout natural language processing. These principles proved more durable than the metric itself, influencing how researchers think about evaluation even in contexts where BLEU is inappropriate.
The principle of automatic evaluation fundamentally changed research practice. BLEU demonstrated that carefully designed automatic metrics could provide reliable signal about system quality, even if imperfect. This encouraged researchers to invest in developing automatic metrics for other tasks, accelerating progress across the field. The cycle of development became: use automatic metrics for rapid iteration, validate with human evaluation periodically, and refine automatic metrics to better align with human judgments. This workflow, pioneered for machine translation, became standard across natural language processing.
Reference-based comparison established a paradigm for evaluation. Rather than trying to define abstract rules for what makes good output, BLEU compared system outputs against what humans actually produce. This approach acknowledged that quality is best defined by example rather than by formal criteria. The same principle would later be applied to evaluating text generation, summarization, question answering, and eventually even large language models, where outputs are compared against human-written references or human preferences.
The focus on correlation with human judgments set a standard for validating automatic metrics. BLEU wasn't adopted because it perfectly captured every aspect of translation quality, but because it correlated well enough with human assessments to be useful. This established that automatic metrics should be judged by their ability to approximate human judgment rather than by theoretical elegance or mathematical sophistication. A simple metric that correlates well with humans is more valuable than a complex metric that doesn't.
Perhaps most subtly, BLEU introduced the idea that evaluation could itself be an object of research. The development of metrics like METEOR, ROUGE, and eventually BERTScore represented attempts to address BLEU's limitations while preserving its advantages. This created an ongoing research program around evaluation methodology, with papers proposing new metrics, analyzing their properties, and validating them against human judgments. Evaluation became recognized as a critical research challenge in its own right, not merely a tool for measuring progress on other problems.
The Evolution of Evaluation Metrics
BLEU's limitations motivated researchers to develop more sophisticated evaluation metrics that could address its weaknesses while maintaining its practical advantages. Each subsequent metric attempted to capture aspects of quality that BLEU missed, creating an increasingly nuanced picture of what makes generated text good.
METEOR (Metric for Evaluation of Translation with Explicit ORdering), introduced in 2005, made several improvements to BLEU's approach. Most importantly, METEOR included synonym matching through WordNet, allowing it to recognize that "big" and "large" convey similar meanings. It also incorporated stemming to match different forms of the same word ("running" and "runs") and considered word order through alignment penalties. By balancing precision and recall rather than focusing solely on precision, METEOR provided a more complete picture of translation quality. These enhancements came at the cost of some additional complexity and language-specific resources, but METEOR often correlated better with human judgments than BLEU.
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) adapted BLEU's principles specifically for summarization tasks. While BLEU focuses on precision (what fraction of the generated n-grams appear in references), ROUGE emphasizes recall (what fraction of the reference n-grams appear in the generated text). This makes sense for summarization, where the goal is to ensure the summary captures key information from the references rather than to avoid extraneous content. ROUGE became the standard metric for summarization research, demonstrating that task-specific adaptation of evaluation principles often works better than applying a general-purpose metric.
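For contrast, here is a minimal recall-oriented counterpart in the spirit of ROUGE-N, reusing the hypothetical ngrams() helper from earlier; real ROUGE implementations add stemming, longest-common-subsequence variants (ROUGE-L), and other refinements.

```python
from collections import Counter

def rouge_n_recall(candidate, reference, n=1):
    """ROUGE-style recall: the fraction of reference n-grams found in the candidate."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    if not ref:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / sum(ref.values())

# A terse summary can have perfect precision yet low recall against a longer reference.
print(rouge_n_recall("the economy grew".split(),
                     "the economy grew faster than expected last quarter".split()))  # 0.375
```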
The rise of neural language models enabled a qualitative leap in evaluation methodology. BERTScore, introduced in 2019, uses contextual embeddings from BERT to measure semantic similarity between generated and reference text. Instead of requiring exact n-gram matches, BERTScore computes similarity between contextualized word representations, allowing it to recognize paraphrases and semantic equivalence that BLEU would miss. This neural approach to evaluation aligned much better with human judgments than traditional n-gram metrics, though it required significantly more computation.
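The core idea behind this family of metrics can be sketched in a few lines: score precision and recall by greedily matching each token to its most similar counterpart in embedding space. The sketch below uses random vectors as stand-ins for BERT's contextual embeddings and omits the IDF weighting and baseline rescaling used by the actual metric, so it shows the mechanism rather than reproducing BERTScore itself.

```python
import numpy as np

def bertscore_like_f1(cand_emb, ref_emb):
    """Greedy soft matching over token embedding matrices (rows = tokens).
    Precision matches each candidate token to its most similar reference token;
    recall does the reverse; F1 combines the two."""
    c = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    r = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = c @ r.T                      # cosine similarity for every token pair
    precision = sim.max(axis=1).mean()
    recall = sim.max(axis=0).mean()
    return 2 * precision * recall / (precision + recall)

# Random vectors stand in for contextual embeddings purely for illustration.
rng = np.random.default_rng(0)
print(bertscore_like_f1(rng.normal(size=(7, 16)), rng.normal(size=(6, 16))))
```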
BLEURT took this further by fine-tuning BERT specifically for the evaluation task, training it to predict human quality ratings. By learning from large datasets of human judgments, BLEURT could capture subtle aspects of quality that hand-designed metrics miss. This represented a shift from rule-based evaluation to learned evaluation, where the metric itself is a neural model trained to approximate human judgment.
Despite these advances, human evaluation remains the ultimate standard. Automatic metrics serve as proxies for human judgment, useful for rapid development but not substitutes for careful human assessment. The most rigorous evaluations combine multiple automatic metrics with structured human evaluation, recognizing that each perspective captures different aspects of quality.
The Broader Impact on Natural Language Processing
BLEU's introduction catalyzed a transformation in how natural language processing research is conducted, with effects extending far beyond machine translation itself. The metric enabled a shift in research methodology that would prove crucial for the field's subsequent rapid progress.
Before BLEU, the cycle time for testing new ideas in machine translation was measured in weeks or months. Researchers would develop a modification to their system, wait for human evaluators to assess the results, analyze the feedback, and then start the cycle again. This slow iteration made it impractical to try many variations or to perform systematic experiments comparing different approaches. BLEU compressed this cycle from weeks to minutes, enabling researchers to test dozens of variations in a single day. This acceleration in experimentation was essential for the rapid progress that followed, particularly as deep learning methods began to dominate the field in the 2010s.
The standardization that BLEU provided made research cumulative in a way it hadn't been before. When papers reported BLEU scores on common test sets like those from the Workshop on Machine Translation, readers could directly compare results across different systems and approaches. A researcher in one country could read about a technique developed by a team in another country, implement it, and immediately verify whether it produced the claimed improvement. This reproducibility and comparability accelerated the diffusion of successful ideas and helped the community converge on the most promising directions.
Perhaps most importantly, BLEU made it feasible to optimize machine translation systems against a concrete, automatically computable measure of quality. BLEU itself is not differentiable, so neural systems are typically trained on surrogate objectives such as cross-entropy, but BLEU provided the fast, repeatable signal used to select models, tune hyperparameters, and decide when training had converged; some approaches, such as minimum risk training and reinforcement learning, went further and optimized BLEU-based objectives directly. When neural machine translation emerged in the 2010s, BLEU scores guided nearly every design decision, and the dramatic improvements in translation quality from neural systems would not have been possible without automatic evaluation metrics to steer that development.
The impact extended beyond machine translation to influence evaluation philosophy across NLP. Tasks like question answering, dialogue generation, and text summarization all adopted the pattern of developing automatic metrics validated against human judgment. This created a research culture that balanced the need for rapid experimentation with the requirement for human validation, accelerating progress while maintaining quality standards.
A Note on the Name
The acronym BLEU (Bilingual Evaluation Understudy) contains a subtle wordplay that reflects the metric's purpose. The letters spell the French word for "blue," while "understudy" evokes the theater, where an understudy is an actor who learns a role to substitute for the main performer when necessary. The name cleverly suggests that BLEU serves as a stand-in for human evaluators, playing their role when they cannot be present but ready to step aside when the primary evaluators are available. This linguistic flourish hints at the metric's intended position: a useful proxy for human judgment, not a replacement for it.
The Lasting Significance
More than two decades after its introduction, BLEU remains relevant as both a practical tool and a conceptual milestone. While researchers now have access to more sophisticated evaluation metrics, BLEU continues to be widely reported in machine translation papers, partly due to its historical role in enabling comparison with earlier work, and partly because its simplicity makes it easy to compute and understand. The metric's longevity testifies to the value of practical solutions that address real research needs, even when theoretically imperfect.
BLEU's deeper significance lies in what it revealed about the nature of progress in artificial intelligence. The metric demonstrated that breakthrough innovations don't always involve novel algorithms or architectures. Sometimes the most critical advances come from developing better ways to measure progress, enabling researchers to iterate faster and compare approaches more reliably. By solving the evaluation bottleneck, BLEU accelerated progress on the underlying translation task, illustrating how infrastructure and methodology can be as important as core technical capabilities.
The principles BLEU established have proven remarkably durable. The idea that automatic metrics should correlate with human judgments, that evaluation should be fast enough to enable rapid iteration, and that standardized metrics facilitate reproducible research, all continue to guide how we evaluate language systems today. Even as we develop neural metrics like BERTScore and BLEURT, we validate them by showing they correlate better with human judgments than BLEU does, using BLEU itself as the baseline to improve upon.
Looking at the history of natural language processing, BLEU marks a transition point. Before BLEU, progress in machine translation was slow and difficult to measure reliably. After BLEU, the pace of improvement accelerated, culminating in the neural machine translation revolution of the 2010s and the multilingual capabilities of today's large language models. While BLEU alone didn't cause this progress, it removed a critical barrier that had been constraining the field. In retrospect, BLEU exemplifies how solutions to seemingly mundane problems, like evaluation methodology, can have profound impacts on the trajectory of an entire research area. The metric showed that sometimes the most transformative contributions come not from building better systems, but from creating better ways to tell whether systems are actually getting better.