Learn to evaluate instruction-tuned LLMs using benchmarks like Alpaca Eval and MT-Bench, human evaluation protocols, and LLM-as-Judge automatic methods.

Instruction Following Evaluation
Training an instruction-tuned model is only half the battle. The other half, equally important, is determining whether your model actually follows instructions well. Unlike traditional NLP tasks where we can compute precision, recall, or BLEU scores against reference outputs, instruction following presents a fundamental evaluation challenge: for most instructions, there is no single correct answer.
Consider the instruction "Write a poem about autumn." A good response could take countless forms, from haiku to sonnet, melancholic to celebratory. Traditional metrics like exact match or BLEU score, which we might use for tasks like machine translation, become nearly meaningless here. The same challenge applies to instructions like "Explain quantum entanglement to a five-year-old" or "List pros and cons of remote work." Each admits many valid, high-quality responses.
This chapter explores how we evaluate instruction-following capabilities. We'll examine purpose-built benchmarks, contrast human evaluation approaches with automatic metrics, and investigate what makes some instructions harder than others. These evaluation methods directly inform the training decisions we discussed in the previous chapter on instruction tuning training, and they become even more critical when we move to alignment with human preferences in the upcoming Part on RLHF.
Benchmarks for Instruction Following
Evaluating instruction-tuned models requires benchmarks that capture the breadth of tasks you actually care about. The challenge is creating evaluation frameworks that reflect real-world usage and produce actionable measurements. Early evaluation relied heavily on existing NLP benchmarks, but the field has since developed specialized benchmarks aimed squarely at instruction following. Understanding the landscape of available benchmarks, and the strengths and limitations of each, is essential when developing or deploying instruction-tuned models.
Standard NLP Benchmarks
Before instruction-specific benchmarks emerged, we evaluated instruction-tuned models on established benchmarks. While these don't directly measure instruction following, they provide useful baselines for comparing model capabilities. The reasoning behind using these traditional benchmarks is straightforward: if a model cannot demonstrate strong performance on well-defined tasks with clear correct answers, we have little reason to expect it will handle the more ambiguous challenge of open-ended instruction following.
MMLU (Massive Multitask Language Understanding) tests knowledge across 57 subjects, from elementary mathematics to professional law. Questions are multiple-choice, making evaluation straightforward. The breadth of MMLU's subject coverage makes it particularly valuable for assessing whether a model has acquired broad factual knowledge during training, which forms a foundation for responding helpfully to diverse user requests:
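To make this concrete, here is a minimal sketch of how an MMLU-style question can be rendered as a prompt and scored. The question, the prompt wording, and the query_model stub are illustrative placeholders rather than the actual benchmark harness.

```python
# Hypothetical MMLU-style question, formatted as a zero-shot prompt.
question = "Which planet has the strongest surface gravity?"
choices = ["Mercury", "Venus", "Jupiter", "Mars"]
answer = "C"  # gold label

prompt = (
    "The following is a multiple-choice question. Answer with a single letter.\n\n"
    f"Question: {question}\n"
    + "\n".join(f"{letter}. {text}" for letter, text in zip("ABCD", choices))
    + "\nAnswer:"
)

def query_model(prompt: str) -> str:
    """Placeholder for the actual model call (e.g., an inference API request)."""
    return " C"  # stubbed completion so the sketch runs end to end

# Parse the first non-whitespace character of the completion as the prediction.
prediction = query_model(prompt).strip()[:1].upper()
is_correct = prediction == answer
print(f"predicted {prediction}, correct: {is_correct}")
```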
This formatting is critical for evaluation. By structuring the task as a prompt, we can parse the model's next token (e.g., "C") to determine accuracy. Evaluating instruction-tuned models requires translating structured tasks into natural language prompts that match how users interact with the model. This translation step itself introduces variability: different prompt phrasings can yield different accuracy scores, highlighting the sensitivity of instruction-tuned models to precise wording.
HellaSwag tests commonsense reasoning through sentence completion, while TruthfulQA evaluates whether models generate truthful answers rather than plausible-sounding falsehoods. These benchmarks tell us about model capabilities but don't directly measure whether a model follows arbitrary user instructions. The distinction matters because a model might possess extensive knowledge (scoring well on MMLU) and strong reasoning abilities (performing well on HellaSwag) yet still struggle to understand what a user actually wants when a request is phrased in natural, conversational language.
Instruction-Specific Benchmarks
The field has developed benchmarks specifically designed to evaluate instruction following. These focus on the quality of responses to diverse, open-ended instructions. We develop specialized benchmarks because instruction following requires more than knowledge retrieval or logical reasoning: it requires understanding user intent, adapting tone and format appropriately, and maintaining coherence across complex, multi-faceted requests.
Alpaca Eval contains 805 instructions covering a range of tasks. The key innovation is using an automatic evaluator (typically GPT-4) to compare model outputs against a reference model's outputs. This approach addresses the scalability problem inherent in human evaluation while still producing preference-based rankings that correlate reasonably well with human judgments:
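The instructions span task types such as the following. These examples are written for illustration and are not verbatim items from the Alpaca Eval set.

```python
# Illustrative instructions grouped by the kind of capability they probe.
# These are invented examples, not quoted from Alpaca Eval.
example_instructions = {
    "creative writing": "Write a short story that begins with a thunderstorm.",
    "brainstorming": "Suggest five ways to make a long commute more productive.",
    "logic": "If all bloops are razzies and some razzies are lazzies, "
             "can we conclude that some bloops are lazzies?",
    "extraction": "List every date mentioned in the following paragraph: ...",
    "how-to": "Explain how to repot a houseplant without damaging the roots.",
}
```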
These categories illustrate the shift from narrow tasks to broad capabilities. The model must handle creative writing, logic, and extraction within a single interface. What makes this benchmark particularly valuable is its recognition that real users do not restrict themselves to a single task type. A model deployed as a general assistant must seamlessly transition between generating creative content, performing analytical tasks, and answering factual questions, often within a single conversation.
MT-Bench (Multi-Turn Benchmark) evaluates models on 80 multi-turn conversations across 8 categories: writing, roleplay, reasoning, math, coding, extraction, STEM, and humanities. The multi-turn aspect is crucial because real conversations require maintaining context. This benchmark addresses a critical gap in single-turn evaluations: the ability to build coherently on previous exchanges, remember earlier constraints, and integrate new information without contradicting prior responses:
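A two-turn prompt in the style of MT-Bench's writing category might look like the following sketch (illustrative, not necessarily a verbatim benchmark item):

```python
# Illustrative two-turn conversation in the style of MT-Bench's writing category.
conversation = [
    # Turn 1: an open-ended request.
    "Compose an engaging travel blog post about a recent trip to Hawaii, "
    "highlighting cultural experiences and must-see attractions.",
    # Turn 2: the follow-up adds a constraint that forces the model to rework
    # its own previous answer rather than start from scratch.
    "Rewrite your previous response. Start every sentence with the letter A.",
]
```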
The second turn tests whether the model can build on its previous response while integrating new constraints. This design reflects how people actually use conversational AI systems: they start with an initial request, then refine or extend it based on the model's response. A model that excels at isolated single-turn responses but fails to maintain coherence across multiple turns will frustrate users who expect the kind of continuous understanding that characterizes human conversation.
LMSYS Chatbot Arena takes a different approach: users compare model outputs head-to-head without knowing which model produced which response. This crowdsourced evaluation provides authentic preference data but is slow and expensive to collect. The strength of this approach is that it reflects real-world usage. Rather than relying on predetermined instructions that may not reflect actual user needs, Arena captures genuine user queries and preferences in a natural setting.
Benchmark Limitations
No benchmark perfectly captures instruction-following ability. Standard benchmarks like MMLU test knowledge but not the ability to follow arbitrary formatting requests. Instruction benchmarks like Alpaca Eval depend heavily on the automatic evaluator's biases. Arena-style benchmarks reflect real user preferences but may favor verbosity or stylistic flourishes over correctness.
A robust evaluation strategy uses multiple benchmarks, combining automatic metrics for rapid iteration with periodic human evaluation for ground truth.
Human Evaluation
Human evaluation remains the gold standard for assessing instruction following. When we want to know whether a model's response is helpful, accurate, and appropriate, asking humans provides the most direct answer. If we want to build models that satisfy users, human judgment is the ultimate measure of success. However, human evaluation introduces its own complexities that must be understood and navigated carefully.
Pairwise Comparison
The most common human evaluation protocol presents evaluators with two responses to the same instruction and asks them to select the better one. This approach leverages a fundamental insight from psychology: humans are significantly better at making relative comparisons than absolute judgments. When asked to rate a response on a 1-10 scale, different evaluators may have wildly different internal calibrations. But when asked which of two responses is better, they tend to agree more consistently:
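For instance, an evaluator might be shown a pair like the following. The instruction and both responses are invented for illustration.

```python
instruction = "Explain why the sky is blue."

# Response A: a scientific explanation.
response_a = (
    "Sunlight is scattered by molecules in the atmosphere. Shorter wavelengths "
    "such as blue are scattered far more strongly than longer ones (Rayleigh "
    "scattering), so blue light reaches our eyes from every direction in the sky."
)

# Response B: an intuitive analogy.
response_b = (
    "Think of sunlight as a crowd of colored balls bouncing through the air. "
    "The blue balls are the bounciest, so they ricochet all over the sky, "
    "and that is the color we end up seeing."
)

# The evaluator is asked only: which response better answers the instruction?
```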
In this example, Response A provides a scientific explanation, while Response B uses an intuitive analogy. The "better" response depends on the user's intent, highlighting the subjectivity of the task. If the instruction had specified "explain to a child," most evaluators would prefer Response B. Without that specification, different evaluators may legitimately reach different conclusions based on their assumptions about the target audience.
Pairwise comparison has several advantages. It's cognitively easier than assigning absolute scores because humans are better at relative judgments. It also directly measures what we care about: which model produces more preferred outputs. Additionally, pairwise comparisons naturally aggregate into preference rankings that can inform training through methods like RLHF, creating a direct connection between evaluation methodology and model improvement.
Rating Scales
Alternative approaches use rating scales where evaluators score individual responses. This method offers different tradeoffs: while potentially less reliable at the individual judgment level, it provides richer information about specific dimensions of quality:
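A sketch of such a rubric is shown below. The dimensions and scale anchors are illustrative rather than taken from any specific published protocol.

```python
# Illustrative 1-5 rubric: each response is scored on several dimensions
# rather than compared directly against another response.
rating_rubric = {
    "helpfulness": "Does the response address what was actually asked? (1-5)",
    "accuracy": "Are the factual claims correct? (1-5)",
    "clarity": "Is the response well organized and easy to follow? (1-5)",
    "harmlessness": "Does the response avoid unsafe or inappropriate content? (1-5)",
}

# One annotator's hypothetical scores for a single response.
scores = {"helpfulness": 4, "accuracy": 3, "clarity": 5, "harmlessness": 5}
```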
Rating scales enable more granular feedback and can identify specific weaknesses. For instance, a model might consistently score high on helpfulness but low on accuracy, revealing that it generates plausible-sounding but incorrect information. This diagnostic capability makes rating scales particularly valuable during model development, where understanding the nature of failures is as important as measuring overall quality. However, they suffer from calibration issues: different evaluators may interpret "4 out of 5" differently, and even individual evaluators may shift their standards over the course of a long evaluation session.
Inter-Annotator Agreement
A critical concern in human evaluation is whether different evaluators agree, which we measure using inter-annotator agreement metrics. Agreement matters because it tells us how much to trust the aggregated judgments. High agreement suggests that quality differences between responses are clear and relatively objective. Low agreement indicates that the responses are genuinely comparable in quality, that the evaluation criteria are ambiguous, or that the task itself admits multiple valid interpretations:
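The sketch below computes Cohen's Kappa for two hypothetical annotators whose judgments agree on 80% of items; the data is invented to match the discussion that follows.

```python
from collections import Counter

# Hypothetical pairwise judgments ("A" wins, "B" wins, or "tie") from two
# annotators over 20 comparisons; 16 of the 20 judgments agree (80%).
rater_1 = ["A"] * 8 + ["B"] * 7 + ["tie"] + ["A", "B", "tie", "tie"]
rater_2 = ["A"] * 8 + ["B"] * 7 + ["tie"] + ["tie", "tie", "A", "B"]

def cohen_kappa(labels_1, labels_2):
    n = len(labels_1)
    # Observed agreement: fraction of items where the raters gave the same label.
    p_o = sum(a == b for a, b in zip(labels_1, labels_2)) / n
    # Chance agreement: probability of agreeing if each rater labelled items
    # independently according to their own marginal label distribution.
    counts_1, counts_2 = Counter(labels_1), Counter(labels_2)
    labels = set(labels_1) | set(labels_2)
    p_e = sum((counts_1[c] / n) * (counts_2[c] / n) for c in labels)
    return (p_o - p_e) / (1 - p_e)

print(f"kappa = {cohen_kappa(rater_1, rater_2):.2f}")  # ~0.67: substantial agreement
```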
Cohen's Kappa accounts for agreement that would occur by chance. This correction is essential because raw agreement can be misleading: if two evaluators simply chose randomly between two responses, they would still agree 50% of the time, and expected agreement rises even higher when both raters lean toward the same label. Kappa provides a more honest assessment by asking: how much better is the observed agreement than what we would expect from chance? In the example above, the two raters agreed on 80% of cases, and Kappa shows this represents substantial agreement after adjusting for chance.
Challenges in Human Evaluation
Human evaluation faces several practical challenges:
Cost and scale. Evaluating thousands of examples across multiple models quickly becomes expensive. A single evaluation comparing two models on 1000 instructions with 3 annotators per comparison requires 3000 human judgments.
Evaluator expertise. For technical instructions (code, math, science), evaluators need domain expertise to judge correctness. A non-programmer might prefer a plausible-looking but buggy code response over a correct but terse one.
Evaluation bias. Humans tend to prefer longer, more detailed responses even when brevity is appropriate. They may also favor responses that match their pre-existing beliefs, regardless of factual accuracy.
Cognitive load. Comparing long, complex responses is mentally taxing. Evaluator fatigue leads to inconsistent judgments, especially in later evaluation sessions.
These challenges motivate the development of automatic evaluation methods that can scale while approximating human judgment.
Automatic Evaluation
Automatic evaluation methods enable rapid iteration during model development. Automatic evaluation is useful because it removes human annotation bottlenecks and provides fast, reproducible measurements. The most successful approach treats evaluation itself as a language modeling task, using powerful LLMs to judge response quality. This reflects a trend where models become tools for training and evaluating other models.
LLM-as-a-Judge
The LLM-as-a-Judge paradigm uses a capable language model (typically GPT-4 or a similar model) to evaluate responses. This approach assumes that if a model can generate high-quality responses, it can also recognize quality in the responses of other models. This mirrors how human expertise works: skilled writers can identify good writing, experienced programmers can spot elegant code, and domain experts can assess the accuracy of technical explanations:
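A judge prompt might look roughly like the template below. The exact wording varies across evaluation frameworks, so treat this as a sketch rather than a canonical prompt.

```python
# Sketch of a pairwise judge prompt; the wording is illustrative.
JUDGE_PROMPT_TEMPLATE = """\
You are an impartial judge. Compare the two responses to the instruction below.

[Instruction]
{instruction}

[Response A]
{response_a}

[Response B]
{response_b}

First, assess each response for helpfulness, accuracy, relevance, and level of
detail, explaining your reasoning for each criterion. Then output your verdict
on the final line as exactly one of: "A", "B", or "tie".
"""

def build_judge_prompt(instruction: str, response_a: str, response_b: str) -> str:
    return JUDGE_PROMPT_TEMPLATE.format(
        instruction=instruction, response_a=response_a, response_b=response_b
    )
```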
The structured prompt forces the judge to break down the evaluation into specific criteria before declaring a winner, which improves consistency compared to asking for a simple score. This structure serves multiple purposes: it guides the judge toward comprehensive evaluation rather than snap judgments, it provides interpretable reasoning that can be audited for systematic errors, and it aligns the evaluation process with the multi-faceted nature of response quality.
Research shows that GPT-4 as a judge achieves 80%+ agreement with human preferences on many instruction-following tasks. However, this approach has known biases that you must account for in your evaluation designs.
Position Bias and Mitigation
LLM judges exhibit position bias: they tend to favor the response presented first (or sometimes second, depending on the model). This bias likely emerges from patterns in the training data, where examples presented earlier in a list or conversation may have been systematically different from later examples. Regardless of its origin, position bias represents a significant confound that can distort evaluation results. We can mitigate this by evaluating each pair twice with swapped positions:
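Here is a minimal sketch of that mitigation, assuming a judge callable that returns "A", "B", or "tie" for whatever ordering it is shown.

```python
from typing import Callable

# judge(instruction, first_response, second_response) -> "A", "B", or "tie",
# where "A" means the response shown first won. The judge itself is assumed
# to be an LLM call and is passed in as a callable.
Judge = Callable[[str, str, str], str]

def debiased_verdict(judge: Judge, instruction: str, resp_a: str, resp_b: str) -> str:
    verdict_1 = judge(instruction, resp_a, resp_b)  # resp_a shown first
    verdict_2 = judge(instruction, resp_b, resp_a)  # resp_b shown first
    # In the second pass the judge's "A" slot held resp_b, so map it back.
    verdict_2 = {"A": "B", "B": "A", "tie": "tie"}[verdict_2]
    # Only accept a winner when both orderings agree; otherwise call it a tie.
    return verdict_1 if verdict_1 == verdict_2 else "tie"
```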
When results conflict between position orderings, declaring a tie is conservative but honest. The inconsistency itself reveals uncertainty in the evaluation. This approach doubles the computational cost of evaluation but provides a crucial quality check. The rate of inconsistent judgments also serves as a diagnostic: if position swapping frequently changes the outcome, it suggests either that the responses are genuinely similar in quality or that the judge model is unreliable for this type of instruction.
Verbosity Bias
LLM judges also exhibit verbosity bias, preferring longer responses even when they contain unnecessary repetition or padding. This bias reflects a tendency to conflate quantity with quality, a pattern that likely exists in both human preferences and training data. Understanding this bias is crucial because it can systematically favor models that generate verbose outputs over those that provide concise, direct answers:
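As an invented example, consider two factually correct answers to the same question.

```python
question = "At what temperature does water boil at sea level?"

concise = "Water boils at 100 °C (212 °F) at sea level."

verbose = (
    "Water is a fascinating substance with many remarkable properties. Its "
    "boiling point depends on atmospheric pressure, which is why cooking "
    "instructions sometimes differ at high altitude. At standard sea-level "
    "pressure, water boils at 100 degrees Celsius, which is 212 degrees "
    "Fahrenheit. Interestingly, adding salt raises the boiling point slightly, "
    "though not enough to matter much in everyday cooking."
)

# The verbose answer is several times longer without answering any better.
print(len(concise.split()), len(verbose.split()))  # roughly 10 words vs. nearly 60
```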
Both responses are factually correct, but LLM judges (and humans) often prefer the verbose version even though the concise one is more direct. This "length bias" mistakes verbosity for quality: the verbose response includes tangentially related information that, while accurate, does not answer the question any more effectively. For a user who simply needs a quick factual answer, the extra material wastes time and may obscure the key information they were looking for.
To mitigate verbosity bias, some evaluation protocols explicitly instruct the judge to prefer concise responses when both are equally correct. Others use length-controlled comparisons or normalize scores by response length. A more sophisticated approach asks judges to evaluate whether each piece of information in a response directly contributes to answering the user's question, penalizing padding and tangents explicitly.
Win Rate Calculation
After collecting pairwise judgments, we calculate win rates to compare models. Win rates summarize performance as the fraction of comparisons a model wins against its opponents. This metric directly answers the question: how often would a user prefer this model's output over the alternatives?
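A minimal sketch of the computation over hypothetical judgment records:

```python
from collections import defaultdict

# Hypothetical pairwise judgments: (model_1, model_2, winner).
# A tie counts as half a win for each side, one common convention.
judgments = [
    ("model-a", "model-b", "model-a"),
    ("model-a", "model-b", "model-a"),
    ("model-a", "model-c", "tie"),
    ("model-b", "model-c", "model-c"),
    ("model-b", "model-c", "model-b"),
    ("model-a", "model-c", "model-a"),
]

wins, games = defaultdict(float), defaultdict(int)
for left, right, winner in judgments:
    games[left] += 1
    games[right] += 1
    if winner == "tie":
        wins[left] += 0.5
        wins[right] += 0.5
    else:
        wins[winner] += 1.0

# Rank models by the fraction of their comparisons they won.
for model in sorted(games, key=lambda m: wins[m] / games[m], reverse=True):
    print(f"{model}: win rate = {wins[model] / games[model]:.2f}")
# model-a: 0.88, model-c: 0.38, model-b: 0.25
```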
The results show a clear ranking based on head-to-head performance. Win rates provide a simple summary, but they don't account for which opponents each model faced. A model that only competed against weak opponents would have an inflated win rate compared to one that faced stronger competition. More sophisticated ranking systems like Elo or Bradley-Terry (which we'll explore in the upcoming chapter on preference modeling) provide better rankings when not all pairs are equally compared. These systems model each comparison as evidence about underlying model strength, accounting for the difficulty of each opponent faced.
Reference-Based Metrics
For some instruction types, reference-based metrics remain useful. Code generation tasks can be evaluated by running tests. This represents a fundamentally different evaluation paradigm: rather than asking whether a response seems good, we verify whether it actually works. This functional evaluation provides an objective ground truth that neither human evaluators nor LLM judges can achieve for tasks with verifiable outputs:
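The sketch below runs a stand-in for model-generated code against a small test suite and reports the pass rate; the generated solution and test cases are invented for illustration.

```python
# Stand-in for code returned by the model for the instruction
# "Write a function that checks whether a string is a palindrome."
generated_code = """
def is_palindrome(s: str) -> bool:
    cleaned = "".join(ch.lower() for ch in s if ch.isalnum())
    return cleaned == cleaned[::-1]
"""

test_cases = [
    ("racecar", True),
    ("A man, a plan, a canal: Panama", True),
    ("hello", False),
    ("", True),
]

# Execute the generated code in an isolated namespace, then run the tests.
namespace: dict = {}
exec(generated_code, namespace)
solution = namespace["is_palindrome"]

passed = sum(solution(inp) == expected for inp, expected in test_cases)
print(f"pass rate: {passed}/{len(test_cases)} = {passed / len(test_cases):.0%}")
```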
With a 100% pass rate, we can confirm the model's solution is functionally correct. For tasks with verifiable outputs (math problems, factual questions, code), combining functional tests with qualitative LLM-as-judge evaluation provides comprehensive assessment. The functional tests verify correctness while qualitative evaluation assesses aspects like code readability, efficiency, and adherence to best practices that are not captured by pass/fail test results alone.
Instruction Difficulty
Not all instructions are equally challenging. Understanding what makes instructions difficult helps us build better training sets and evaluate models more thoroughly. A comprehensive evaluation should include instructions spanning the full range of difficulty to identify where models excel and where they struggle. This understanding also informs curriculum design during training, as exposing models to appropriately challenging examples improves learning efficiency.
Dimensions of Difficulty
Instruction difficulty emerges from several independent dimensions. These dimensions are largely orthogonal, meaning an instruction can be easy along one dimension while being extremely challenging along another. A simple factual question might require specialized domain knowledge, while a complex multi-step task might involve only common knowledge. Understanding these dimensions helps us construct balanced evaluation sets that probe different aspects of model capability:
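One way to picture this is as a set of independent axes along which any instruction can be placed. The dimension names below follow this chapter; the example instructions are invented, each hard mainly along a single axis.

```python
# Each difficulty dimension paired with an invented instruction that is
# challenging mainly along that one axis.
difficulty_dimensions = {
    "knowledge requirements": "What are the contraindications of combining "
                              "MAO inhibitors with SSRIs?",
    "reasoning depth": "Three friends split a bill with a 20% tip; Alice paid "
                       "twice what Bob did, and Carol paid the remaining $18. "
                       "Reconstruct each person's share step by step.",
    "constraint complexity": "Summarize this article in exactly three bullet "
                             "points, each under 12 words, without using the "
                             "word 'important'.",
    "ambiguity": "Make this email sound better.",
    "context length": "Given the 40-page contract below, list every clause "
                      "that mentions termination fees.",
}
```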
A comprehensive evaluation should sample across all difficulty dimensions, not just one. A model might handle high-knowledge-requirement questions well because it memorized relevant facts during pretraining, but fail on multi-step reasoning despite the individual steps being simple. Conversely, a model with strong reasoning capabilities might produce excellent responses to complex logical puzzles while making basic factual errors on domain-specific questions.
The IFEval Benchmark
IFEval (Instruction Following Evaluation) specifically measures whether models follow explicit constraints. Unlike open-ended benchmarks, IFEval instructions contain verifiable requirements. This design philosophy reflects an important insight: while overall response quality is subjective and difficult to measure, constraint compliance is objective and automatically verifiable. A response either contains exactly 100 words or it does not; it either includes the required keyword or it does not. This objectivity enables large-scale automatic evaluation without the biases inherent in LLM-as-judge approaches:
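An instruction in this style, with its machine-checkable constraints spelled out, might look like the following (an illustrative example, not a verbatim IFEval item):

```python
# Illustrative IFEval-style instruction: the response quality is open-ended,
# but each constraint can be checked mechanically.
instruction = (
    "Write a product description for a reusable water bottle. "
    "Your response must be exactly 100 words, must mention the keyword "
    "'sustainable', and must not use any commas."
)

constraints = [
    {"type": "word_count_exact", "value": 100},
    {"type": "keyword_required", "value": "sustainable"},
    {"type": "forbidden_character", "value": ","},
]
```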
IFEval's constraints are automatically verifiable, enabling fully automatic evaluation. The verification process requires no subjective judgment: we simply check whether each constraint is satisfied according to its precise definition:
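Checking those constraints is then plain string processing, as in this sketch (the constraint vocabulary matches the illustrative example above):

```python
def check_constraint(response: str, constraint: dict) -> bool:
    """Return True if the response satisfies a single verifiable constraint."""
    kind, value = constraint["type"], constraint["value"]
    if kind == "word_count_exact":
        return len(response.split()) == value
    if kind == "keyword_required":
        return value.lower() in response.lower()
    if kind == "forbidden_character":
        return value not in response
    raise ValueError(f"unknown constraint type: {kind}")

def constraint_accuracy(response: str, constraints: list) -> float:
    """Fraction of constraints the response satisfies."""
    return sum(check_constraint(response, c) for c in constraints) / len(constraints)
```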
IFEval enables measuring instruction-following ability independent of response quality. A model might generate excellent prose but fail basic formatting requirements, revealing a gap in instruction compliance. This separation is valuable diagnostically: it distinguishes between models that understand what to do but generate poor content versus models that generate good content but fail to follow explicit directions. Both failure modes exist in practice, and they require different interventions to address.
Difficulty Scoring
We can estimate instruction difficulty before evaluation by analyzing instruction characteristics. This heuristic approach enables automatic categorization of instructions, which is useful for ensuring evaluation sets include appropriate coverage across difficulty levels. While no heuristic can perfectly predict how challenging an instruction will be for a given model, analyzing structural features provides a reasonable approximation:
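A rough scorer might count structural features such as explicit constraints, reasoning cues, and instruction length. The thresholds and keyword lists below are arbitrary choices for illustration, not a published method.

```python
import re

def estimate_difficulty(instruction: str) -> str:
    """Crude structural heuristic: returns 'easy', 'medium', or 'hard'."""
    score = 0

    # Longer instructions tend to carry more requirements and context.
    word_count = len(instruction.split())
    score += 1 if word_count > 30 else 0
    score += 1 if word_count > 100 else 0

    # Explicit constraints ("exactly", "at most", "without using", ...).
    constraint_cues = r"\b(exactly|at least|at most|no more than|without using|must)\b"
    score += len(re.findall(constraint_cues, instruction.lower()))

    # Cues that multi-step reasoning is expected.
    reasoning_cues = r"\b(step by step|prove|derive|compare and contrast|explain why)\b"
    score += 2 * len(re.findall(reasoning_cues, instruction.lower()))

    if score <= 1:
        return "easy"
    return "medium" if score <= 3 else "hard"

print(estimate_difficulty("What is the capital of France?"))            # easy
print(estimate_difficulty("Explain why, step by step, the algorithm "
                          "must terminate, using at most 200 words."))  # hard
```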
While heuristic, difficulty estimation helps balance evaluation sets and identify where models struggle. By tracking performance across difficulty levels, we can characterize model capabilities more precisely. A model that excels on easy instructions but fails on difficult ones differs meaningfully from a model with consistent performance across difficulty levels, even if their average scores are similar.
Limitations and Practical Considerations
Evaluating instruction following remains an open problem with no perfect solution. Understanding the limitations of each approach helps you make better evaluation decisions.
Human evaluation, despite being the gold standard, faces fundamental challenges beyond cost. Evaluators disagree on what constitutes a "good" response, and this disagreement isn't noise; it reflects genuine differences in preferences. Some users prefer concise answers while others want comprehensive explanations. Some prioritize strict factual accuracy while others value engaging presentation. A model that scores highly with one evaluator population may score poorly with another. This makes it difficult to declare any single model universally "best" at instruction following.
Automatic evaluation using LLM-as-judge introduces its own systematic biases. Beyond position and verbosity bias, these systems tend to favor responses that match their own training distribution. A GPT-4 judge may prefer GPT-4-style responses, creating evaluation circularity when developing models trained to match GPT-4 outputs. Additionally, LLM judges struggle with certain evaluations: they often cannot reliably verify factual claims, execute code mentally, or assess whether a creative writing piece is genuinely original versus derivative. For high-stakes evaluations, automatic metrics should complement, not replace, targeted human review.
Benchmark saturation presents an emerging challenge. As models improve and benchmarks become well-known, performance gains may reflect benchmark-specific optimization rather than genuine capability improvements. Models trained with MMLU-style questions in their data will naturally score higher on MMLU, even if they don't have stronger reasoning abilities overall. This motivates continuous development of new evaluation paradigms and held-out test sets that cannot be gamed through data contamination.
The gap between benchmark performance and real-world usefulness is perhaps the most significant limitation. A model might achieve high scores on instruction-following benchmarks while still frustrating users in deployment. Benchmarks test specific, curated instructions, but real users issue ambiguous, poorly formed, or context-dependent requests. Evaluation should ultimately connect to user satisfaction metrics when possible, treating benchmarks as proxies rather than ground truth.
Summary
Evaluating instruction-following models requires multiple complementary approaches because no single metric captures all aspects of quality.
Benchmarks like Alpaca Eval, MT-Bench, and IFEval provide standardized comparisons across models. Standard NLP benchmarks test underlying capabilities, while instruction-specific benchmarks measure actual response quality and constraint compliance.
Human evaluation remains the gold standard but faces challenges of cost, scale, and inter-annotator disagreement. Pairwise comparison tends to be more reliable than absolute rating scales. Measuring agreement using metrics like Cohen's Kappa helps assess evaluation reliability.
Automatic evaluation using LLM-as-judge scales effectively but introduces systematic biases including position bias and verbosity bias. Position swapping and explicit instruction to judges can partially mitigate these issues. For verifiable tasks like code generation, functional testing provides ground truth that complements qualitative evaluation.
Instruction difficulty varies across multiple dimensions: knowledge requirements, reasoning depth, constraint complexity, ambiguity, and context length. IFEval specifically measures constraint compliance through automatically verifiable requirements, separating instruction-following ability from response quality.
The evaluation methods covered here directly inform the next part on RLHF, where preference data collected through these evaluation approaches becomes training signal for aligning models with human values. Understanding both the power and limitations of instruction-following evaluation helps you make informed decisions about model selection and deployment.