Instruction Following: Teaching LLMs to Execute Your Requests

Michael Brenndoerfer · December 15, 2025 · 37 min read

Learn how instruction tuning transforms base language models into helpful assistants. Explore format design, data diversity, and quality principles.


Instruction Following

Imagine you've just finished pre-training a large language model on trillions of tokens from the internet. The model can complete sentences, generate coherent text, and even demonstrate surprising capabilities like answering questions when prompted correctly. But when you ask it to "Summarize this article in three bullet points," it continues writing more of the article instead. When you request "Translate the following sentence to French," it sometimes translates and sometimes just keeps generating English. The model has learned language, but it hasn't learned to follow instructions.

This gap between language modeling capability and practical utility motivated one of the most important developments in modern LLMs: instruction tuning. As we explored in Part XVIII with GPT-3's in-context learning, pre-trained models can perform tasks when given the right examples in their context. But this requires careful prompt engineering, wastes context space on demonstrations, and fails unpredictably on novel phrasings. Instruction tuning offers a different approach: teaching models to understand and execute your requests directly, without requiring examples at inference time.

This chapter introduces the motivation behind instruction tuning, explores how to design effective instruction formats, examines why diversity in training data matters for generalization, and discusses what distinguishes high-quality instruction data from noise. Understanding these foundations prepares you for the subsequent chapters on data creation, training procedures, and evaluation methods.

The Pre-training Gap

Pre-trained language models like those we studied in Parts XVI through XX excel at one specific task: predicting the next token given a context. During causal language modeling, the model learns to assign high probability to continuations that match the statistical patterns in its training data. This training paradigm produces models with remarkable capabilities, but it creates a fundamental mismatch with how humans want to use these systems. The core issue is that predicting likely continuations and following your instructions are fundamentally different objectives, even though they both involve generating text.

Consider what happens when you prompt a base GPT-style model with "Write a poem about autumn leaves." The model doesn't interpret this as a request requiring action. Instead, it treats the text as a prefix to be continued in the most likely way given its training distribution. The model asks itself, in effect, what text would most plausibly follow these words based on everything it learned during pre-training. Depending on what similar text appeared in pre-training, it might:

  • Complete the sentence: "Write a poem about autumn leaves falling gently..."
  • Generate a meta-discussion: "Write a poem about autumn leaves. This is a common creative writing prompt..."
  • Produce the poem directly if similar instructional content appeared frequently in training

This unpredictability stems from the fact that the pre-training objective optimizes for continuation probability, not instruction compliance. The model has no explicit training signal that says "when text looks like a request, generate a response that fulfills that request." From the model's perspective, there is no meaningful distinction between your instruction and any other piece of text. Everything is simply context to be continued. This fundamental design characteristic explains why base models, despite their impressive language understanding, often fail to behave as useful assistants.
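You can observe this behavior directly by sampling continuations from a base model. The following sketch assumes the Hugging Face transformers library, with GPT-2 standing in as a small base model; the sampled continuations will vary from run to run, but they typically extend or discuss the prompt rather than obey it.

# A minimal sketch: sampling continuations from a base (non-instruction-tuned)
# model. Assumes `pip install torch transformers`; GPT-2 is used purely as an
# example of a base model.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Write a poem about autumn leaves."
inputs = tokenizer(prompt, return_tensors="pt")

# The base model treats the prompt as a prefix to continue, not a request
# to fulfill, so we sample several continuations to see the spread.
outputs = model.generate(
    **inputs,
    max_new_tokens=30,
    do_sample=True,
    top_p=0.95,
    num_return_sequences=3,
    pad_token_id=tokenizer.eos_token_id,
)
for seq in outputs:
    print(tokenizer.decode(seq, skip_special_tokens=True))
    print("-" * 40)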

Out[2]:
Visualization
Response probability distributions for base and instruction-tuned models given the prompt 'Write a poem about autumn leaves'. Base models distribute probability across continuation and meta-discussion categories, reflecting their pre-training on document completion, whereas instruction-tuned models concentrate probability mass on executing the instruction.
Base Model vs. Instruction-Tuned Model

A base model (or foundation model) is trained only on the language modeling objective, learning to predict the next token. An instruction-tuned model receives additional training specifically on instruction-response pairs, teaching it to interpret and follow your requests. The base model knows how to generate plausible text; the instruction-tuned model knows how to generate text that specifically addresses what you asked for.

The Format Problem

Beyond the objective mismatch, base models don't understand conversational conventions. When you interact with an assistant, there's an implicit structure: you provide a request, and the assistant responds. This turn-taking pattern requires the model to know when to stop generating (it has produced a complete response) and what role it should adopt (helpful assistant rather than document author). These conventions, which humans learn through years of social interaction, must be explicitly taught to language models.

Base models trained on web text have seen conversations, but they've also seen novels, code, advertisements, forum threads, and countless other formats. Without additional training, the model cannot reliably distinguish "this is an instruction I should execute" from "this is text I should continue in its style." A base model might generate a response to your question, or it might generate more questions, or it might continue as if writing a FAQ document, or it might shift into a completely different register. The model simply doesn't have a consistent understanding of its role in the interaction.

The Instruction Tuning Insight

The key insight behind instruction tuning is deceptively simple: if you want models to follow instructions, train them on examples of instructions being followed. By fine-tuning a pre-trained model on a dataset of (instruction, response) pairs, you explicitly teach it the behavior you want. Rather than hoping the model infers the correct behavior from ambiguous pre-training data, you directly demonstrate what good instruction-following looks like.

This approach was pioneered by several research efforts in 2021-2022, including Google's FLAN (Fine-tuned LAnguage Net), OpenAI's InstructGPT, and various academic projects. Despite using relatively small amounts of instruction data compared to pre-training corpora (thousands to millions of examples versus trillions of tokens), instruction tuning produces dramatic improvements in model usability. Models go from being powerful but unpredictable text generators to being cooperative assistants that understand and respond to your needs.

The effectiveness of instruction tuning reveals something important about large pre-trained models: they already possess the underlying capabilities needed to follow instructions. Pre-training gives models broad knowledge about language, facts, and reasoning patterns. Instruction tuning doesn't teach new capabilities so much as it teaches models to activate existing capabilities in response to your requests. Think of it like this: a pre-trained model is like a highly skilled professional who doesn't know they're supposed to be helping you. Instruction tuning teaches them to recognize when they're being asked to do something and to respond accordingly.

Why Small Datasets Work

Given the scale of pre-training (hundreds of billions to trillions of tokens), it might seem surprising that instruction tuning works with datasets that are orders of magnitude smaller. Several factors explain this efficiency, and understanding them illuminates why instruction tuning is so effective:

Capability already exists. The model has already learned to summarize, translate, answer questions, and perform other tasks during pre-training. It encountered examples of summaries, translations, and question-answer pairs throughout the web data it was trained on. What it lacks is the understanding that when you present an instruction, it should deploy these existing capabilities. Instruction tuning provides this missing link, teaching the model to recognize and respond to requests.

Format is learnable. The instruction-following format (receive request, generate response, stop) is a relatively simple pattern compared to the full complexity of language. Unlike learning grammar, world knowledge, or reasoning, which requires exposure to vast amounts of text, learning to recognize and respond to instructions is a much more constrained problem. Models can learn this convention quickly because it's fundamentally a formatting task rather than a capability-building task.

Transfer across instructions. Training on "summarize this article" helps the model understand "condense this text" and "give me the main points" even without explicit examples. The model generalizes from the specific examples it sees to the broader concept of summarization requests. This transfer effect multiplies the value of each training example, as one example can inform the model's behavior across many related phrasings and variations.

Alignment with generation. Following instructions aligns with the model's core capability: generating text. The model simply learns to generate the kind of text that represents a helpful response to the given instruction. Unlike training objectives that require the model to perform fundamentally new operations, instruction tuning asks the model to do what it already does (generate text) but with a different intent (fulfill your requests rather than continue documents).

Out[3]:
Visualization
Data scale contrast between pre-training corpora and instruction tuning datasets. Pre-training leverages approximately 1 trillion tokens of web text, while instruction tuning uses a dataset five orders of magnitude smaller (10 million tokens) to activate existing capabilities rather than teach new ones.

Instruction Format Design

Creating effective instruction-tuning data requires careful attention to format. The structure of training examples teaches models what inputs to expect and what outputs to produce. A well-designed format makes the distinction between instruction, input, and response crystal clear, helping the model learn the appropriate boundaries and behaviors. Several components define an instruction format, and each plays a crucial role in the training signal.

Instruction Component

The instruction component tells the model what to do. This is the core directive that specifies the task the model should perform. Effective instructions are clear, specific, and actionable:

  • Clear: "Translate the following English sentence to Spanish" leaves no ambiguity about the task. The model knows exactly what language pair to work with and what direction the translation should go.
  • Specific: "Summarize in exactly three sentences" provides concrete constraints that guide the model's output. Without such specificity, the model must guess at the appropriate length and format.
  • Actionable: The instruction describes something the model can actually generate as text output. Instructions like "feel happy about this text" aren't actionable because they don't translate into specific textual output.

Instructions can range from single-word commands ("Translate:") to detailed multi-sentence specifications explaining exactly what output is expected. The level of detail often depends on the task complexity and the desired output format. Simple tasks may need only brief instructions, while complex tasks benefit from more elaborate specifications that reduce ambiguity and set clear expectations.

Input Component

Many instructions operate on provided content. The input component contains the material the model should process, serving as the raw data that the instruction acts upon:

Instruction: Summarize the following article.
Input: [article text here]

Some instructions don't require separate input (e.g., "Write a haiku about spring") while others are meaningless without it (e.g., "Translate the following sentence"). The format must clearly distinguish the instruction from the input to prevent confusion. Without clear separation, the model might incorporate parts of the instruction into its understanding of the input, or vice versa, leading to incorrect responses.

Output Component

The output (or response) component contains the desired model behavior. During training, this is the target text the model learns to generate. The output should directly address the instruction and, where applicable, properly process the input. This component serves as the ground truth that the model optimizes toward during training.

Quality outputs exhibit several properties:

  • Responsiveness: The output addresses exactly what the instruction asks. A summarization instruction should produce a summary, not additional commentary or tangential information.
  • Completeness: All parts of the instruction are fulfilled. If the instruction asks for three points, the output should contain three points, not two or four.
  • Conciseness: No unnecessary elaboration beyond what's requested. Verbose responses that pad the output with irrelevant information teach the model to be similarly unfocused.
  • Correctness: Factually accurate and logically sound. Training on incorrect outputs teaches the model to produce errors confidently.

Format Templates

Instruction-tuning datasets typically use consistent templates to structure examples. A common template might look like:

### Instruction:
{instruction_text}

### Input:
{input_text}

### Response:
{output_text}

The specific delimiters and formatting vary across datasets, but consistency within a dataset helps models learn the pattern reliably. When the model sees consistent markers like "### Instruction:" and "### Response:", it learns to associate these patterns with the instruction-following behavior. This consistency reduces ambiguity and accelerates learning. We'll explore format templates in depth in the upcoming chapter on instruction format.

Instruction Diversity

The diversity of instructions in training data critically affects how well the tuned model generalizes. A model trained only on translation examples won't learn to summarize, because it has never seen the pattern of summarization requests and responses. A model trained only on formal phrasings won't understand casual requests, because it has learned to expect a specific register of language. Diversity operates along multiple dimensions, and each dimension contributes to the model's overall flexibility and robustness.

Task Diversity

Task diversity means covering many different types of instructions. Rather than specializing in one particular capability, a truly instruction-following model should be able to handle the full range of tasks you might request. Comprehensive instruction datasets include examples across categories like:

  • Text generation: Writing stories, poems, emails, code, essays. These tasks require the model to create content from scratch based on specifications.
  • Text transformation: Summarization, paraphrasing, style transfer, translation. These tasks require the model to take existing text and modify it in specific ways.
  • Information extraction: Named entity recognition, relation extraction, key point identification. These tasks require the model to identify and isolate specific information from provided text.
  • Question answering: Factual questions, reasoning questions, reading comprehension. These tasks require the model to provide information in response to queries.
  • Analysis: Sentiment classification, topic identification, text comparison. These tasks require the model to evaluate and categorize text according to various criteria.
  • Conversation: Dialogue responses, follow-up handling, context maintenance. These tasks require the model to participate in multi-turn exchanges while tracking context.

Training across diverse tasks teaches models the general skill of instruction following rather than specific task patterns. A model that has learned to follow summarization instructions and classification instructions and generation instructions develops a meta-understanding of what it means to receive and execute an instruction. It learns the abstract pattern: "receive a directive, understand what's being asked, generate appropriate output." This meta-learning is far more valuable than mastering any single task.

Out[4]:
Visualization
Task type distribution in a balanced instruction dataset. Text generation, answering, and transformation tasks comprise the majority of examples, ensuring the model learns to handle diverse instruction categories beyond simple patterns.

Phrasing Diversity

Even within a single task, instructions can be phrased in countless ways. Human language is remarkably flexible, and the same intent can be expressed with vastly different words and structures. Consider these equivalent summarization requests:

  • "Summarize this text"
  • "What are the main points?"
  • "Give me a brief overview"
  • "TL;DR"
  • "Can you condense this?"
  • "I need a quick summary"
  • "Break this down for me"

If training data contains only one phrasing, the model may fail on the others because it has learned to pattern-match on specific words rather than to recognize the underlying intent. Phrasing diversity counteracts this: the model learns that all of these phrasings point to the same request for a summary, even though they use completely different vocabulary.
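A common way to build this variation into a dataset is to maintain several phrasing templates per task and sample among them when constructing examples. The sketch below is a minimal illustration of that idea; the phrasing list and function name are hypothetical, not taken from any published dataset.

import random

# Hypothetical phrasing variants for one underlying task (summarization).
SUMMARIZE_PHRASINGS = [
    "Summarize this text.",
    "What are the main points?",
    "Give me a brief overview.",
    "TL;DR",
    "Can you condense this?",
    "I need a quick summary.",
]


def make_summarization_examples(documents, summaries, seed=0):
    """Pair each (document, summary) with a randomly chosen phrasing so the
    model sees many surface forms of the same intent."""
    rng = random.Random(seed)
    return [
        {
            "instruction": rng.choice(SUMMARIZE_PHRASINGS),
            "input": doc,
            "output": summary,
        }
        for doc, summary in zip(documents, summaries)
    ]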

Difficulty Diversity

Instructions vary in complexity from simple single-step requests to intricate multi-part specifications. A robust instruction-following model must handle everything from straightforward commands to nuanced, multi-faceted directives, so training data should span this spectrum:

Simple instructions:

  • "Translate 'hello' to German"
  • "What is 2 + 2?"

Moderate instructions:

  • "Summarize this article in 3 bullet points, focusing on the economic implications"
  • "Rewrite this paragraph in a more formal tone while preserving the key information"

Complex instructions:

  • "Compare and contrast these two research papers, identifying their methodological differences and evaluating the strength of their conclusions"
  • "Write a response to this customer complaint that acknowledges their frustration, explains our policy, offers a reasonable solution, and maintains a professional tone"

Exposure to varying difficulty levels helps models handle the full range of requests they'll encounter in deployment. If training data contains only simple instructions, the model may struggle to decompose and address complex multi-part requests. Conversely, training only on complex instructions may not help the model learn the fundamental patterns of instruction following that simpler examples demonstrate clearly.
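To keep the difficulty mix visible while assembling a dataset, a rough bucketing heuristic can help. The sketch below is deliberately crude, using word and clause counts as a proxy for the number of constraints; real pipelines might rely on human labels or model-based scoring instead.

def estimate_difficulty(instruction: str) -> str:
    """Rough heuristic: longer instructions with more coordinated clauses
    tend to encode more constraints. An illustrative proxy, not a
    validated difficulty measure."""
    n_words = len(instruction.split())
    n_clauses = instruction.count(",") + instruction.lower().count(" and ")
    if n_words <= 8 and n_clauses == 0:
        return "simple"
    if n_words <= 25 and n_clauses <= 2:
        return "moderate"
    return "complex"


# The example instructions above land in the expected buckets, e.g.:
# estimate_difficulty("Translate 'hello' to German")  -> "simple"
# estimate_difficulty("Summarize this article in 3 bullet points, "
#                     "focusing on the economic implications")  -> "moderate"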

Domain Diversity

Instructions span many knowledge domains: science, history, technology, arts, law, medicine, and more. Domain diversity ensures the model doesn't overfit to particular subject areas and can handle requests across the full breadth of human knowledge. A model trained predominantly on technology-related instructions might struggle with history questions, not because it lacks historical knowledge (that comes from pre-training) but because it hasn't learned how historical inquiries are typically phrased.

This doesn't mean the model needs domain expertise (that comes from pre-training knowledge), but it needs to understand how instructions are phrased in different domains. A medical question might use terminology differently than a legal question, even if both are asking for clarification. Scientific instructions might favor precise, technical language, while creative writing instructions might use more evocative, open-ended phrasing. Domain diversity teaches the model to navigate these variations in communication style.

Instruction Quality

Not all instruction examples are equally valuable. The quality of training data directly impacts model behavior, often in ways that may not be immediately apparent but compound over time. High-quality instruction data exhibits several characteristics that distinguish it from noise or low-value examples.

Response Accuracy

The response must correctly fulfill the instruction. An instruction asking for the capital of France paired with a response saying "London" teaches the wrong behavior. The model learns to produce confident but incorrect answers because that's what it sees in training. Quality control processes must verify that responses accurately address their instructions, especially for factual or deterministic tasks where correctness can be objectively assessed.

For subjective tasks like creative writing, accuracy means appropriateness: does the response reasonably satisfy what a human would expect from the instruction? A poem about autumn should actually be about autumn, not about spring. A formal letter should sound formal, not casual. These judgments require human evaluation to ensure the training data teaches appropriate behavior.

Response Completeness

Partial or truncated responses teach models to stop prematurely. If an instruction asks for "five examples" but the response only provides three, the model learns incomplete behavior. It learns that three examples constitute an acceptable response to a request for five. Over many training examples, such patterns accumulate, teaching the model that incomplete responses are acceptable. Quality instruction data includes complete responses that fully address all aspects of the instruction.
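Some completeness failures can be caught automatically. When the instruction names a count ("five examples", "3 bullet points"), a simple check can compare the requested count against the number of list items in the response. The sketch below is an illustrative heuristic; the regular expressions and function name are made up for this example, and it only covers list-shaped outputs.

import re

NUMBER_WORDS = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
}


def check_requested_count(instruction: str, output: str):
    """If the instruction asks for N list-like items, compare N against the
    number of bullet or numbered lines in the output. Returns None when no
    count is requested."""
    match = re.search(
        r"\b(\d+|one|two|three|four|five|six|seven|eight|nine|ten)"
        r"\s+(?:bullet\s+)?(?:examples?|points?|items?|reasons?)\b",
        instruction.lower(),
    )
    if not match:
        return None
    token = match.group(1)
    requested = int(token) if token.isdigit() else NUMBER_WORDS[token]
    # Count lines that look like list items (bullets or numbering).
    provided = len(re.findall(r"^\s*(?:[-*•]|\d+[.)])\s+", output, re.MULTILINE))
    return {
        "requested": requested,
        "provided": provided,
        "complete": provided == requested,
    }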

Instruction Clarity

Ambiguous or confusing instructions produce training noise. If an instruction could reasonably be interpreted multiple ways, the response may align with one interpretation while you have another in mind. When the model encounters similar ambiguous instructions during deployment, it may produce responses that don't match your expectations because it learned a different interpretation during training. High-quality instructions minimize ambiguity.

Consider the instruction "Make this better." Better in what way? More formal? More concise? More engaging? Such vague instructions should either be clarified or paired with responses that interpret the intent reasonably and explicitly state their interpretation. A high-quality response might begin, "I'll improve this text by making it more concise and formal:" before providing the revised version, thereby demonstrating how to handle ambiguity constructively.

Naturalness

Instructions should reflect how you actually make requests. Overly formal or artificial instructions create a distribution mismatch between training and deployment. If all training instructions follow a rigid template like "Please perform the following action: [task description]" but you write "hey can you help me with something," the model may struggle with natural language inputs. It has learned to expect a particular format that you don't employ.

Effective instruction datasets include a mix of formal and informal phrasings, complete sentences and fragments, explicit requests and implied needs. This variety prepares the model to handle the messy, inconsistent ways humans actually communicate.

Safety and Appropriateness

Training data shapes model behavior, including potentially harmful behaviors. If the training data includes examples of generating harmful content, the model learns to produce such content when asked. Quality instruction data excludes examples that would teach models to:

  • Generate harmful or dangerous content
  • Reveal private information
  • Exhibit biases or discriminatory patterns
  • Bypass safety considerations

This concern becomes particularly important in the alignment work we'll explore in Part XXVII, where human preferences guide models toward helpful, harmless, and honest behavior.

Worked Example: Creating Instruction Examples

Let's create a few instruction examples to illustrate format and quality considerations. We'll develop examples for a summarization task, showing how different format choices affect the training signal. By examining these examples closely, you can see how small variations in instruction phrasing lead to meaningfully different expected outputs.

Basic format:

Instruction: Summarize the following paragraph.

Input: The James Webb Space Telescope, launched in December 2021, represents 
the largest and most powerful space telescope ever built. Its primary mirror, 
composed of 18 hexagonal gold-plated beryllium segments, spans 6.5 meters in 
diameter. Operating primarily in the infrared spectrum, JWST can observe 
objects too old, distant, or faint for its predecessor, the Hubble Space 
Telescope. The telescope orbits the Sun at the second Lagrange point, 
approximately 1.5 million kilometers from Earth.

Response: The James Webb Space Telescope is the largest space telescope ever 
built, featuring a 6.5-meter segmented mirror. Launched in 2021, it observes 
in infrared from a position 1.5 million kilometers from Earth, enabling 
detection of objects beyond Hubble's capabilities.

More constrained format:

Instruction: Summarize the following paragraph in exactly two sentences. 
Focus on the telescope's key capabilities.

Input: [same paragraph]

Response: The James Webb Space Telescope features a 6.5-meter gold-plated 
mirror that enables infrared observations of objects too distant or faint 
for previous telescopes. Operating from the L2 Lagrange point, it represents 
humanity's most powerful tool for deep space observation.

Informal phrasing:

Instruction: What's the gist of this?

Input: [same paragraph]

Response: It's about the James Webb Space Telescope, which is basically the 
biggest and most powerful space telescope we've ever made. It launched in 
late 2021 and can see much farther into space than Hubble could, using 
infrared detection from its position way out past Earth's orbit.

Notice how the response style matches the instruction style. Formal instructions receive formal responses; casual questions get conversational answers. This stylistic alignment teaches models to adapt their output to your preferences. The model learns not just what to say but how to say it, matching the register and tone of your request.

Code Implementation

Let's implement a basic instruction data processor that validates and formats instruction examples. This prepares raw data for instruction tuning.

In[5]:
Code
from dataclasses import dataclass
from typing import Optional


@dataclass
class InstructionExample:
    """Represents a single instruction-tuning example."""

    instruction: str
    input_text: Optional[str]
    output: str

    def format_prompt(self, template: str = "default") -> str:
        """Format the example according to a template."""
        if template == "default":
            if self.input_text:
                return f"### Instruction:\n{self.instruction}\n\n### Input:\n{self.input_text}\n\n### Response:\n"
            else:
                return (
                    f"### Instruction:\n{self.instruction}\n\n### Response:\n"
                )
        elif template == "alpaca":
            if self.input_text:
                return f"Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\n{self.instruction}\n\n### Input:\n{self.input_text}\n\n### Response:\n"
            else:
                return f"Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\n{self.instruction}\n\n### Response:\n"
        else:
            raise ValueError(f"Unknown template: {template}")

    def format_full(self, template: str = "default") -> str:
        """Format the complete example including the response."""
        return self.format_prompt(template) + self.output

Now let's create a collection of sample instruction examples spanning different task types.

In[6]:
Code
# Create diverse instruction examples
examples = [
    # Translation task
    InstructionExample(
        instruction="Translate the following English text to French.",
        input_text="The weather is beautiful today.",
        output="Le temps est magnifique aujourd'hui.",
    ),
    # Summarization task
    InstructionExample(
        instruction="Summarize this paragraph in one sentence.",
        input_text="Machine learning is a subset of artificial intelligence that enables systems to learn and improve from experience without being explicitly programmed. It focuses on developing computer programs that can access data and use it to learn for themselves. The process begins with observations or data, such as examples, direct experience, or instruction, to look for patterns in data and make better decisions in the future.",
        output="Machine learning enables computer systems to automatically learn and improve from data and experience rather than requiring explicit programming.",
    ),
    # Generation task (no input)
    InstructionExample(
        instruction="Write a haiku about programming.",
        input_text=None,
        output="Code flows like water\nBugs emerge from hidden depths\nDebug, compile, run",
    ),
    # Classification task
    InstructionExample(
        instruction="Classify the sentiment of the following review as positive, negative, or neutral.",
        input_text="The product arrived on time but the quality was disappointing. I expected better for the price.",
        output="Negative",
    ),
    # Question answering task
    InstructionExample(
        instruction="Answer the following question based on the given context.",
        input_text="Context: The Great Wall of China was built over many centuries, with construction beginning in the 7th century BC. The most well-known sections were built during the Ming Dynasty (1368-1644 AD).\n\nQuestion: When were the most famous parts of the Great Wall built?",
        output="The most famous sections of the Great Wall of China were built during the Ming Dynasty, which lasted from 1368 to 1644 AD.",
    ),
    # Code generation task
    InstructionExample(
        instruction="Write a Python function that calculates the factorial of a number.",
        input_text=None,
        output="def factorial(n):\n    if n < 0:\n        raise ValueError('Factorial not defined for negative numbers')\n    if n == 0 or n == 1:\n        return 1\n    return n * factorial(n - 1)",
    ),
]

Let's examine how these examples look when formatted with different templates.

In[7]:
Code
# Generate formatted outputs for the first example
example = examples[0]
default_format = example.format_full("default")
alpaca_format = example.format_full("alpaca")
Out[8]:
Console
============================================================
TEMPLATE: default
============================================================
### Instruction:
Translate the following English text to French.

### Input:
The weather is beautiful today.

### Response:
Le temps est magnifique aujourd'hui.

============================================================
TEMPLATE: alpaca
============================================================
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Translate the following English text to French.

### Input:
The weather is beautiful today.

### Response:
Le temps est magnifique aujourd'hui.

The default template provides a minimal structure, while the Alpaca template adds a context-setting preamble that helps the model understand its role. Both formats clearly separate the instruction from the input content using distinct headers.

Now let's implement quality checks for instruction data.

In[9]:
Code
class InstructionValidator:
    """Validates instruction examples for quality issues."""

    def __init__(
        self,
        min_instruction_length: int = 5,
        min_output_length: int = 1,
        max_instruction_length: int = 1000,
    ):
        self.min_instruction_length = min_instruction_length
        self.min_output_length = min_output_length
        self.max_instruction_length = max_instruction_length

    def validate(self, example: InstructionExample) -> dict:
        """
        Validate an instruction example.
        Returns dict with 'valid' bool and 'issues' list.
        """
        issues = []

        # Check instruction length
        if len(example.instruction) < self.min_instruction_length:
            issues.append(
                f"Instruction too short ({len(example.instruction)} chars)"
            )
        if len(example.instruction) > self.max_instruction_length:
            issues.append(
                f"Instruction too long ({len(example.instruction)} chars)"
            )

        # Check output exists and has content
        if (
            not example.output
            or len(example.output.strip()) < self.min_output_length
        ):
            issues.append("Output is empty or too short")

        # Check for placeholder text
        placeholder_patterns = [
            "[your response here]",
            "[insert",
            "TODO",
            "...",
        ]
        for pattern in placeholder_patterns:
            if pattern.lower() in example.output.lower():
                issues.append(f"Output contains placeholder text: {pattern}")

        # Check instruction ends appropriately (not mid-sentence)
        if example.instruction and example.instruction[-1] not in ".?!:":
            # This is a soft warning, not necessarily an error
            issues.append("Instruction doesn't end with punctuation (minor)")

        return {
            "valid": len([i for i in issues if "minor" not in i]) == 0,
            "issues": issues,
        }
In[10]:
Code
# Test the validator
validator = InstructionValidator()

# Create some examples with quality issues
test_examples = [
    InstructionExample(
        instruction="Hi",  # Too short
        input_text=None,
        output="Hello!",
    ),
    InstructionExample(
        instruction="Complete this task",  # Missing punctuation
        input_text="Some input",
        output="",  # Empty output
    ),
    InstructionExample(
        instruction="Summarize this article.",
        input_text="Article content here.",
        output="[Your response here]",  # Placeholder
    ),
    InstructionExample(
        instruction="Translate to Spanish.",
        input_text="Good morning",
        output="Buenos días",  # Valid example
    ),
]

validation_results = [validator.validate(ex) for ex in test_examples]
Out[11]:
Console
Example 1: INVALID
  - Instruction too short (2 chars)
  - Instruction doesn't end with punctuation (minor)

Example 2: INVALID
  - Output is empty or too short
  - Instruction doesn't end with punctuation (minor)

Example 3: INVALID
  - Output contains placeholder text: [your response here]

Example 4: VALID

The validator successfully identifies common quality issues. Example 1 fails because its instruction is too short, Example 2 has an empty output (plus a minor punctuation warning), and Example 3 contains placeholder text, while Example 4 passes all checks.

Out[12]:
Visualization
Validation results for four sample instruction examples. The validator flags critical issues such as insufficient length (Example 1) and missing content (Examples 2 and 3), while correctly passing valid examples (Example 4), demonstrating the importance of automated quality checks.

Let's also implement a function to analyze diversity in an instruction dataset.

In[13]:
Code
import re
from collections import Counter


def analyze_diversity(examples: list[InstructionExample]) -> dict:
    """
    Analyze the diversity of an instruction dataset.
    Returns statistics about task types, instruction patterns, and lengths.
    """

    # Extract first verb from instructions as a rough task indicator
    def extract_task_verb(instruction: str) -> str:
        words = instruction.lower().split()
        task_verbs = [
            "write",
            "translate",
            "summarize",
            "classify",
            "answer",
            "explain",
            "list",
            "describe",
            "compare",
            "generate",
            "create",
            "analyze",
            "extract",
            "rewrite",
            "convert",
        ]
        for word in words[:5]:  # Check first 5 words
            clean_word = re.sub(r"[^a-z]", "", word)
            if clean_word in task_verbs:
                return clean_word
        return "other"

    task_verbs = [extract_task_verb(ex.instruction) for ex in examples]

    # Analyze instruction lengths
    instruction_lengths = [len(ex.instruction) for ex in examples]
    output_lengths = [len(ex.output) for ex in examples]

    # Check for input presence
    has_input = sum(1 for ex in examples if ex.input_text)

    return {
        "total_examples": len(examples),
        "task_distribution": dict(Counter(task_verbs)),
        "instruction_length": {
            "min": min(instruction_lengths),
            "max": max(instruction_lengths),
            "mean": sum(instruction_lengths) / len(instruction_lengths),
        },
        "output_length": {
            "min": min(output_lengths),
            "max": max(output_lengths),
            "mean": sum(output_lengths) / len(output_lengths),
        },
        "examples_with_input": has_input,
        "examples_without_input": len(examples) - has_input,
    }
In[14]:
Code
diversity_stats = analyze_diversity(examples)
Out[15]:
Console
Dataset Diversity Analysis
========================================
Total examples: 6

Task distribution:
  translate: 1
  summarize: 1
  write: 2
  classify: 1
  answer: 1

Instruction length (chars):
  Min: 32
  Max: 81
  Mean: 54.0

Output length (chars):
  Min: 8
  Max: 176
  Mean: 93.2

Input presence:
  With input: 4
  Without input: 2

The analysis reveals the structural properties of our dataset. We see a distribution of different task verbs and a mix of examples with and without input text. Monitoring these statistics helps ensure the training data covers the necessary variety of instruction types and lengths required for robust generalization.

Out[16]:
Visualization
Task type distribution in the sample dataset categorized by leading verbs. The frequency analysis shows that generation-focused requests occur most often, demonstrating how leading verbs serve as effective proxies for identifying instruction intent.
Out[17]:
Visualization
Proportion of examples containing distinct input fields. Two-thirds of the sample instructions require an associated input field, reflecting the dominance of text-processing tasks like summarization and translation over standalone generation.
Out[18]:
Visualization
Instruction length versus output length for different task types. The scatter plot reveals distinct clustering by task, with generation tasks typically yielding longer outputs compared to classification or extraction tasks.

Finally, let's implement a function to prepare instruction data for training by tokenizing and creating the appropriate format for a language model.

In[19]:
Code
def prepare_for_training(
    examples: list[InstructionExample],
    template: str = "default",
    max_length: int = 512,
) -> list[dict]:
    """
    Prepare instruction examples for training.
    Returns list of dicts with 'prompt' and 'completion' keys.
    """
    prepared = []

    for example in examples:
        prompt = example.format_prompt(template)
        completion = example.output
        full_text = prompt + completion

        # Check length (rough estimate using character count)
        # In practice, you'd use actual tokenization
        if len(full_text) > max_length * 4:  # Rough char-to-token ratio
            continue  # Skip examples that are too long

        prepared.append(
            {"prompt": prompt, "completion": completion, "full_text": full_text}
        )

    return prepared
In[20]:
Code
training_data = prepare_for_training(examples, template="alpaca")
Out[21]:
Console
Prepared 6 examples for training

Sample prepared example:
----------------------------------------
PROMPT:
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Write a haiku about programming.

### Response:
...

COMPLETION:
Code flows like water
Bugs emerge from hidden depths
Debug, compile, run
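As the comment in prepare_for_training notes, the character-count filter is only a rough proxy for sequence length. When a tokenizer is available, the filter can count tokens directly. Below is a sketch using the Hugging Face transformers tokenizer, with GPT-2's tokenizer as an arbitrary example; the function name is illustrative.

from transformers import AutoTokenizer


def filter_by_token_length(prepared, max_length=512, tokenizer_name="gpt2"):
    """Re-filter prepared examples using actual token counts rather than
    the character-count heuristic."""
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    return [
        item
        for item in prepared
        if len(tokenizer.encode(item["full_text"])) <= max_length
    ]


training_data = filter_by_token_length(training_data, max_length=512)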

Key Parameters

The key parameters for the instruction processing pipeline are:

  • template: The formatting style for instructions (e.g., "default", "alpaca"). Different templates provide different context cues to the model.
  • max_length: The maximum sequence length allowed for training examples. Filtering long sequences ensures data fits within the model's context window.
  • min_instruction_length: The minimum character count for instructions. This threshold helps filter out noise or trivial inputs.
  • min_output_length: The minimum character count for responses to ensure meaningful training signals.

Limitations and Practical Considerations

Instruction tuning represents a significant advance in making language models practically useful, but it comes with important limitations that you must understand.

The most fundamental limitation is that instruction tuning cannot teach capabilities the model doesn't already possess. If a base model lacks knowledge about a particular domain or cannot perform certain reasoning, no amount of instruction tuning will create that capability from nothing. Instruction tuning is more about eliciting and formatting existing capabilities than creating new ones. This means the quality of the pre-trained base model fundamentally limits what instruction tuning can achieve.

Data quality presents persistent challenges. Creating high-quality instruction data requires significant human effort, and scaling this effort is expensive. Crowdsourced data may contain errors, inconsistencies, or biases. Model-generated instruction data (which we'll explore in the chapter on Self-Instruct) can propagate and amplify the generating model's mistakes. The old principle of "garbage in, garbage out" applies with full force: models trained on low-quality instruction data learn low-quality behaviors.

Instruction tuning also introduces the risk of capability degradation. As discussed in Part XXIV on catastrophic forgetting, fine-tuning on instruction data can cause the model to forget some pre-training knowledge. This is particularly concerning when instruction datasets don't cover the full breadth of capabilities the base model possessed. Careful dataset design and training procedures can mitigate but not eliminate this risk.

The generalization boundaries of instruction tuning remain unclear. While instruction-tuned models often generalize impressively to instruction phrasings they've never seen, they can also fail unexpectedly on seemingly simple variations. The exact factors determining when generalization succeeds versus fails are active areas of research. You should not assume that instruction tuning creates robust, failure-free instruction following.

Finally, instruction tuning alone doesn't solve the alignment problem. Teaching a model to follow instructions efficiently also means it will follow harmful instructions efficiently. The subsequent work on preference learning and RLHF, which we'll cover in Part XXVII, addresses this limitation by teaching models to follow instructions while also avoiding harmful outputs.

Summary

Instruction tuning bridges the gap between language models that can generate text and assistants that respond helpfully to your requests. Pre-trained models optimize for next-token prediction, not instruction compliance, creating a fundamental mismatch with your expectations. Instruction tuning addresses this by fine-tuning on datasets of instruction-response pairs, teaching models to recognize and execute requests.

Effective instruction data requires careful attention to format, with clear structure distinguishing instructions, inputs, and expected outputs. Diversity across task types, phrasings, difficulty levels, and domains enables models to generalize beyond their training examples. Quality considerations including response accuracy, completeness, and naturalness determine whether training produces reliable or erratic behavior.

The surprising efficiency of instruction tuning, achieving dramatic behavioral changes with relatively small datasets, reveals that pre-trained models already possess the necessary capabilities. Instruction tuning teaches them when and how to deploy these capabilities in response to your needs.

The next chapter examines how instruction data is created at scale, exploring both human annotation approaches and the increasingly important role of model-generated synthetic data. Understanding data creation methods will help you evaluate and create instruction datasets for your own applications.

