Instruction Data Creation: Building Quality Training Datasets

Michael Brenndoerfer · December 16, 2025 · 41 min read

Learn practical techniques for creating instruction-tuning datasets. Covers human annotation, template-based generation, seed expansion, and quality filtering.

Instruction Data Creation

As we discussed in the previous chapter on instruction following, the key innovation that transforms a language model into an instruction-following assistant is not architectural. It's the training data. A model learns to follow instructions by seeing thousands of examples that demonstrate the desired behavior: clear tasks paired with helpful, accurate responses. This chapter explores the practical challenge of creating such data at scale, examining the techniques that have enabled researchers to build the datasets that power modern AI assistants.

Creating high-quality instruction data is both an art and an engineering challenge. You need sufficient volume to cover diverse tasks, enough quality to teach proper behavior, and adequate diversity to generalize beyond memorized patterns. The field has developed several complementary approaches to meet these requirements: human annotation for quality, template-based generation for scale, seed task expansion for diversity, and quality filtering to maintain standards across all sources. Each approach has its own strengths and limitations, and understanding these tradeoffs is essential for choosing the right mix for a given project.

Human Annotation

Human annotation remains the gold standard for instruction data quality. When a skilled annotator writes an instruction and its response, they bring world knowledge, common sense, and an intuitive understanding of what makes a response helpful. The earliest instruction-tuned models, including InstructGPT from OpenAI, relied heavily on human-written demonstrations. This reliance on human expertise highlights that instruction following depends on human communication patterns and preferences that automated methods struggle to capture.

Annotation Guidelines

Creating effective annotation guidelines requires balancing specificity with flexibility. Guidelines that are too rigid produce formulaic responses that lack naturalness. Guidelines that are too loose produce inconsistent data where annotators interpret tasks differently. Finding the right balance is itself a design challenge that requires iteration and feedback from annotators who work with the guidelines in practice.

Effective guidelines typically specify:

  • Response format expectations: Should responses include explanations, or just direct answers? Should they acknowledge uncertainty? These choices shape the personality and communication style of the resulting model.
  • Tone and style: Formal or conversational? Verbose or concise? The tone established in training data carries through to the model's behavior with you.
  • Handling edge cases: What should annotators do with ambiguous requests, harmful queries, or questions outside their expertise? Clear guidance prevents annotators from improvising inconsistently.
  • Quality thresholds: What makes a response "good enough" versus requiring revision? Concrete examples of acceptable and unacceptable responses help calibrate annotator judgment.

The Dolly dataset from Databricks provides a useful example. The guidelines instructed annotators to write tasks spanning creative writing, brainstorming, question answering, classification, summarization, and information extraction, covering a spectrum of instruction types while giving annotators freedom within each category. This approach recognized that creativity flourishes within constraints, and that annotators need enough structure to be consistent while retaining enough freedom to produce natural, varied examples.

Quality Control Mechanisms

Human annotation introduces human variability. Different annotators bring different writing styles, knowledge levels, and interpretations of guidelines. Quality control mechanisms help maintain consistency across a diverse pool of contributors and ensure that the resulting dataset reflects the intended quality standards rather than the idiosyncrasies of individual annotators.

  • Inter-annotator agreement: Having multiple annotators label the same examples reveals disagreements that indicate unclear guidelines or genuinely difficult cases. When two annotators produce very different responses to the same instruction, it signals either an ambiguous task or a need for guideline refinement (a small sketch for measuring agreement follows this list).
  • Expert review: Senior annotators review a sample of outputs, providing feedback and identifying systematic issues. This creates a feedback loop where common problems are caught early and addressed through guideline updates or additional training.
  • Iterative refinement: Guidelines evolve based on observed problems, with regular calibration sessions to realign annotator understanding. This ongoing process acknowledges that annotation guidelines are living documents that improve through use.
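
Inter-annotator agreement can be quantified with simple statistics. The sketch below computes raw pairwise agreement and Cohen's kappa for two hypothetical annotators who categorized the same ten instructions. The category labels and annotations are invented for illustration, not drawn from any real annotation project.

Code
from collections import Counter


def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    n = len(labels_a)

    # Observed agreement: fraction of items both annotators labeled identically
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement, computed from each annotator's marginal label distribution
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        (counts_a[label] / n) * (counts_b[label] / n)
        for label in set(labels_a) | set(labels_b)
    )

    return (observed - expected) / (1 - expected)


# Hypothetical category labels from two annotators for the same 10 instructions
annotator_1 = ["qa", "creative", "qa", "coding", "qa",
               "summarization", "qa", "creative", "coding", "qa"]
annotator_2 = ["qa", "creative", "qa", "coding", "summarization",
               "summarization", "qa", "qa", "coding", "qa"]

raw_agreement = sum(a == b for a, b in zip(annotator_1, annotator_2)) / len(annotator_1)
kappa = cohens_kappa(annotator_1, annotator_2)

Low kappa on such spot checks is usually a signal to revisit the guidelines rather than to fault individual annotators.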

Cost and Scale Tradeoffs

Human annotation is expensive. Professional annotators cost $15-50+ per hour depending on task complexity and required expertise. A single high-quality instruction-response pair might take 5-15 minutes to create, putting the cost per example roughly between $1 and $15. These costs accumulate quickly when building datasets of any meaningful size.

This cost creates a fundamental tradeoff: you can have either a small dataset of exceptional quality or a larger dataset of variable quality. The InstructGPT paper used only about 13,000 human demonstrations, a tiny dataset by pre-training standards, but each example was carefully crafted by selected contractors who underwent extensive training. This choice reflected a bet that quality would matter more than quantity for teaching instruction-following behavior.

In[2]:
Code
# Simulating annotation cost calculations
def estimate_annotation_cost(
    target_examples: int,
    minutes_per_example: float,
    hourly_rate: float,
    review_overhead: float = 0.2,  # 20% overhead for quality review
) -> dict:
    """Estimate the cost and time for human annotation."""

    # Direct annotation time
    annotation_hours = (target_examples * minutes_per_example) / 60

    # Add review overhead
    total_hours = annotation_hours * (1 + review_overhead)

    # Calculate cost
    total_cost = total_hours * hourly_rate
    cost_per_example = total_cost / target_examples

    return {
        "total_hours": total_hours,
        "total_cost": total_cost,
        "cost_per_example": cost_per_example,
        "annotator_days": total_hours / 8,  # Assuming 8-hour workdays
    }


# Example: Creating a dataset like Dolly (~15k examples)
dolly_estimate = estimate_annotation_cost(
    target_examples=15000, minutes_per_example=10, hourly_rate=25
)

# Example: Creating a dataset like InstructGPT demonstrations (~13k examples)
instruct_gpt_estimate = estimate_annotation_cost(
    target_examples=13000,
    minutes_per_example=15,  # Higher quality, more time
    hourly_rate=40,  # More specialized annotators
)
Out[3]:
Console
Dolly-scale dataset (15k examples, $25/hr, 10 min/example):
  Total cost: $75,000
  Cost per example: $5.00
  Annotator-days required: 375

InstructGPT-scale dataset (13k examples, $40/hr, 15 min/example):
  Total cost: $156,000
  Cost per example: $12.00
  Annotator-days required: 488
Out[4]:
Visualization
Total annotation costs for Dolly-scale versus InstructGPT-scale projects. The significantly higher total cost for the InstructGPT-scale dataset ($156,000) compared to the Dolly-scale dataset ($75,000) demonstrates how stricter quality requirements amplify expenses.
Per-example annotation costs for Dolly-scale versus InstructGPT-scale projects. The InstructGPT approach costs $12.00 per example compared to $5.00 for Dolly, illustrating the fundamental tradeoff between dataset scale and individual example quality.

These costs explain why purely human-annotated instruction datasets remain relatively small. Even well-funded projects typically cap at tens of thousands of examples, not the millions common in pre-training. This scale limitation motivates the alternative approaches described in the following sections, each of which trades some aspect of human quality for the ability to generate data more efficiently.

Template-Based Generation

Template-based generation addresses the scale limitation of human annotation by programmatically converting existing NLP datasets into instruction format. The insight is that tasks like sentiment classification, question answering, and summarization already exist in large annotated datasets. They just need to be reframed as natural language instructions. This reframing transforms supervised learning labels into the kind of instruction-response pairs that teach a model how to follow your requests.

Converting Existing Datasets

Consider the Stanford Sentiment Treebank, a dataset with sentences labeled as positive or negative. In its original form, an example looks like:

text: "This movie is a triumph of style over substance." label: negative

With templates, we can transform this into an instruction-following example:

instruction: "Classify the sentiment of the following movie review as 'positive' or 'negative'." input: "This movie is a triumph of style over substance." output: "negative"

This transformation preserves the supervised signal while presenting it in a format that teaches instruction following. The model learns not just to classify sentiment, but to respond appropriately when you ask it to perform classification. The distinction matters: instruction following requires understanding what is being asked, not just producing the correct label.

In[5]:
Code
import random

# Template-based conversion for sentiment classification
sentiment_templates = [
    {
        "instruction": "Classify the sentiment of the following text as 'positive' or 'negative'.",
        "output_format": lambda label: label,
    },
    {
        "instruction": "What is the sentiment expressed in this review? Answer with 'positive' or 'negative'.",
        "output_format": lambda label: label,
    },
    {
        "instruction": "Read the following text and determine if the author's sentiment is positive or negative.",
        "output_format": lambda label: f"The sentiment is {label}.",
    },
    {
        "instruction": "Is this review expressing a positive or negative opinion?",
        "output_format": lambda label: f"This review expresses a {label} opinion.",
    },
    {
        "instruction": "Analyze the sentiment in the text below.",
        "output_format": lambda label: f"The text conveys a {label} sentiment.",
    },
]


def convert_sentiment_example(text: str, label: str) -> dict:
    """Convert a sentiment classification example to instruction format."""
    template = random.choice(sentiment_templates)
    return {
        "instruction": template["instruction"],
        "input": text,
        "output": template["output_format"](label),
    }


# Example conversions
example_text = "The acting was superb and the plot kept me engaged throughout."
example_label = "positive"

converted_examples = [
    convert_sentiment_example(example_text, example_label) for _ in range(3)
]
Out[6]:
Console
Original example:
  text: "The acting was superb and the plot kept me engaged throughout."
  label: positive

Converted to instruction format (3 random templates):

  Example 1:
    instruction: "Analyze the sentiment in the text below."
    input: "The acting was superb and the plot kept me engaged..."
    output: "The text conveys a positive sentiment."

  Example 2:
    instruction: "Analyze the sentiment in the text below."
    input: "The acting was superb and the plot kept me engaged..."
    output: "The text conveys a positive sentiment."

  Example 3:
    instruction: "Classify the sentiment of the following text as 'positive' or 'negative'."
    input: "The acting was superb and the plot kept me engaged..."
    output: "positive"

The output shows how a single input "The acting was superb..." generates diverse training examples. By seeing the same content formatted differently, the model learns that the task of sentiment analysis is independent of the specific prompt wording. This robustness to phrasing variations is essential for real-world deployment, where you express the same intent in countless different ways.

Template Design Principles

Effective templates share several characteristics that make them valuable for instruction tuning. Understanding these principles helps you create better templates and recognize quality issues in existing template collections.

Templates should vary in phrasing to prevent the model from overfitting to specific wordings. If every sentiment classification instruction starts with "Classify the sentiment," the model may fail when encountering "What's the tone of this text?" Variation teaches the model to recognize the underlying task regardless of surface-level differences in how it is expressed.

Templates should sound natural, as if a real user wrote them. Stilted or overly formal phrasing creates a mismatch between training data and the queries users actually send. The goal is to expose the model to the kinds of requests it will receive in practice, not to construct perfectly grammatical but unnaturally precise instructions.

Templates should include diverse output formats, sometimes a single word, sometimes a full sentence, to teach flexible response styles. Users have different preferences for response verbosity, and a well-trained model should be able to adapt: sometimes a quick answer is best, and sometimes added context and explanation are appreciated.

The FLAN collection demonstrates these principles at scale. It includes templates for over 60 datasets spanning question answering, sentiment analysis, natural language inference, coreference resolution, and many other tasks. Each dataset has multiple templates, creating millions of instruction examples from existing supervised data. This massive scale would be impossible to achieve through human annotation alone.

In[7]:
Code
# Example: Templates for natural language inference (NLI)
# NLI datasets contain premise-hypothesis pairs with labels: entailment, neutral, contradiction

nli_templates = [
    {
        "instruction": "Given the premise, determine if the hypothesis is true (entailment), false (contradiction), or unknown (neutral).",
        "format_input": lambda p, h: f"Premise: {p}\nHypothesis: {h}",
        "format_output": lambda l: {
            "entailment": "true",
            "neutral": "unknown",
            "contradiction": "false",
        }[l],
    },
    {
        "instruction": "Does the premise support, contradict, or neither support nor contradict the hypothesis?",
        "format_input": lambda p, h: f'Premise: "{p}"\nHypothesis: "{h}"',
        "format_output": lambda l: {
            "entailment": "support",
            "neutral": "neither",
            "contradiction": "contradict",
        }[l],
    },
    {
        "instruction": "Based on the first sentence, is the second sentence definitely true, definitely false, or possibly true?",
        "format_input": lambda p, h: f"Sentence 1: {p}\nSentence 2: {h}",
        "format_output": lambda l: {
            "entailment": "definitely true",
            "neutral": "possibly true",
            "contradiction": "definitely false",
        }[l],
    },
    {
        "instruction": "Can we conclude the hypothesis from the premise? Answer 'yes', 'no', or 'maybe'.",
        "format_input": lambda p, h: f"Premise: {p}\nHypothesis: {h}",
        "format_output": lambda l: {
            "entailment": "yes",
            "neutral": "maybe",
            "contradiction": "no",
        }[l],
    },
]


def convert_nli_example(premise: str, hypothesis: str, label: str) -> dict:
    """Convert an NLI example to instruction format."""
    template = random.choice(nli_templates)
    return {
        "instruction": template["instruction"],
        "input": template["format_input"](premise, hypothesis),
        "output": template["format_output"](label),
    }


# Example NLI data
nli_examples = [
    {
        "premise": "A man is playing a guitar on stage.",
        "hypothesis": "A person is performing music.",
        "label": "entailment",
    },
    {
        "premise": "The children are playing in the park.",
        "hypothesis": "The children are at school.",
        "label": "contradiction",
    },
    {
        "premise": "A woman is reading a book.",
        "hypothesis": "The woman is enjoying the book.",
        "label": "neutral",
    },
]

converted_nli = [convert_nli_example(**ex) for ex in nli_examples]
Out[8]:
Console
NLI examples converted to instruction format:

Example 1 (original label: entailment):
  Instruction: "Given the premise, determine if the hypothesis is true (entailment), false (contradiction), or unknown (neutral)."
  Input: "Premise: A man is playing a guitar on stage.
Hypothesis: A p..."
  Output: "true"

Example 2 (original label: contradiction):
  Instruction: "Based on the first sentence, is the second sentence definitely true, definitely false, or possibly true?"
  Input: "Sentence 1: The children are playing in the park.
Sentence 2..."
  Output: "definitely false"

Example 3 (original label: neutral):
  Instruction: "Can we conclude the hypothesis from the premise? Answer 'yes', 'no', or 'maybe'."
  Input: "Premise: A woman is reading a book.
Hypothesis: The woman is..."
  Output: "maybe"

These examples illustrate how the same underlying logical relationship can be cast as different instructional tasks, ranging from simple classification to more complex reasoning queries. By varying the prompt structure, the model learns to identify the core task regardless of how it is phrased. The logical relationship between premise and hypothesis remains constant, but the way we ask about that relationship changes. This teaches the model that natural language inference is a concept, not a specific prompt format.

Limitations of Template-Based Generation

While template-based generation provides scale, it has significant limitations that you must understand and address. Recognizing these limitations helps you make informed decisions about when template-based data is appropriate and when other approaches are needed.

The resulting instructions tend to be formulaic, lacking the variety of real user queries. Templates follow patterns, and no matter how many templates you create, they cannot capture the full messiness and creativity of how people actually phrase requests. Real users make typos, use slang, provide unclear context, and phrase things in unexpected ways. Template-based data, by contrast, is clean and predictable.

Templates also inherit the task distribution of existing datasets, which skew heavily toward classification and extraction tasks rather than open-ended generation. The NLP community has historically focused on tasks with clear right answers that enable straightforward evaluation. This means template-converted data overrepresents tasks like sentiment classification, named entity recognition, and question answering, while underrepresenting creative writing, advice-giving, and open-ended exploration.

Perhaps most importantly, template-based data doesn't teach creative or conversational abilities. A model trained only on converted classification datasets won't learn to write poetry, explain concepts in simple terms, or engage in multi-turn dialogue. These capabilities require training data that demonstrates them explicitly. Template-based generation works best as a complement to other data sources, not a replacement for them.

Seed Task Expansion

Seed task expansion bridges the gap between expensive human annotation and mechanical template conversion. The approach starts with a small set of carefully crafted seed examples, then uses various techniques to expand this seed into a larger, more diverse dataset. This method preserves the quality and naturalness of human-written examples while achieving scale through systematic augmentation.

The Seed Set Philosophy

A good seed set is small but carefully curated. The philosophy behind seed expansion recognizes that a few excellent examples, chosen to cover the space of possible instructions, can serve as anchors for generating many more examples. The key is ensuring that the seeds represent the diversity you want in the final dataset.

A well-designed seed set should demonstrate:

  • Task diversity: Including examples from many task categories (creative writing, question answering, coding, analysis, etc.) ensures that expansion can generate examples across the full range of desired capabilities.
  • Format diversity: Showing different input/output structures (single-turn, multi-turn, with/without context) teaches the model that instructions can take many forms.
  • Complexity range: Spanning from simple lookups to multi-step reasoning ensures the model learns to handle both easy and difficult requests.
  • Style variety: Demonstrating different appropriate tones and verbosity levels teaches the model to adapt its communication style to different contexts.

The Self-Instruct paper, which the next chapter explores in detail, started with just 175 seed tasks. Each seed was written to be distinct, covering a different aspect of instruction-following capability. This small but diverse seed set enabled the generation of tens of thousands of additional examples while maintaining quality and coverage.

In[9]:
Code
# Example seed tasks demonstrating diversity
seed_tasks = [
    # Creative writing
    {
        "category": "creative_writing",
        "instruction": "Write a haiku about artificial intelligence.",
        "input": "",
        "output": "Silicon neurons\nDreaming in binary code\nLearning to be real",
    },
    # Question answering with context
    {
        "category": "reading_comprehension",
        "instruction": "Answer the question based on the given passage.",
        "input": "Passage: The Great Wall of China was built over many centuries, with the most well-known sections constructed during the Ming Dynasty (1368-1644). It stretches approximately 13,171 miles.\n\nQuestion: During which dynasty were the most famous parts of the Great Wall built?",
        "output": "The most well-known sections of the Great Wall were built during the Ming Dynasty (1368-1644).",
    },
    # Code generation
    {
        "category": "coding",
        "instruction": "Write a Python function that checks if a number is prime.",
        "input": "",
        "output": """def is_prime(n):
    if n < 2:
        return False
    for i in range(2, int(n**0.5) + 1):
        if n % i == 0:
            return False
    return True""",
    },
    # Analysis
    {
        "category": "analysis",
        "instruction": "Identify the logical fallacy in the following argument.",
        "input": "Everyone I know loves this restaurant, so it must be the best restaurant in the city.",
        "output": "This argument commits the hasty generalization fallacy. The speaker assumes that because everyone they personally know loves the restaurant, it must be the best in the entire city. However, their personal acquaintances represent a small, potentially unrepresentative sample of all restaurant-goers in the city.",
    },
    # Classification
    {
        "category": "classification",
        "instruction": "Determine whether the following email is spam or legitimate.",
        "input": "Subject: Meeting Tomorrow\n\nHi John,\n\nJust a reminder about our 3 PM meeting tomorrow to discuss the Q3 budget. Please bring the latest projections.\n\nBest,\nSarah",
        "output": "Legitimate. This email appears to be a genuine business communication about an internal meeting, with specific details about timing, purpose, and attendees.",
    },
    # Summarization
    {
        "category": "summarization",
        "instruction": "Summarize the following paragraph in one sentence.",
        "input": "Machine learning models require large amounts of data to learn patterns effectively. This data must be properly cleaned and preprocessed before training. Additionally, the choice of model architecture depends heavily on the specific task and available computational resources. Hyperparameter tuning is often necessary to achieve optimal performance.",
        "output": "Effective machine learning requires substantial clean data, appropriate architecture selection based on task requirements and resources, and careful hyperparameter optimization.",
    },
]
Out[10]:
Console
Seed set contains 6 diverse examples across categories:

  creative_writing: 1 example(s)
  reading_comprehension: 1 example(s)
  coding: 1 example(s)
  analysis: 1 example(s)
  classification: 1 example(s)
  summarization: 1 example(s)

Example seed task (coding):
  Instruction: "Write a Python function that checks if a number is prime."
  Output preview: def is_prime(n):
    if n < 2:
        return Fals...
Out[11]:
Visualization
Distribution of seed tasks across categories. The balanced count across six different task types ensures that the expanded dataset will maintain broad functional coverage. This even representation helps prevent the model from specializing too narrowly in any single domain.

Augmentation Techniques

Once you have a seed set, several techniques can expand it without requiring additional human annotation. These techniques leverage the structure and content of existing examples to generate new variations that preserve the essential characteristics while adding diversity.

Paraphrasing generates alternative wordings for existing instructions. The goal is to teach the model that the same task can be expressed in many different ways. You might say "Write a poem" or "Compose a poem" or "Create a poem," and the model should recognize these as equivalent requests.

In[12]:
Code
# Simple rule-based paraphrasing for demonstration
# In practice, you might use a language model for more natural paraphrases


def paraphrase_instruction(instruction: str) -> list:
    """Generate paraphrased versions of an instruction."""
    paraphrases = []

    # Pattern: "Write X" -> "Compose X" / "Create X" / "Draft X"
    if instruction.startswith("Write"):
        for verb in ["Compose", "Create", "Draft", "Produce"]:
            paraphrases.append(instruction.replace("Write", verb, 1))

    # Pattern: "Determine X" -> "Figure out X" / "Identify X"
    if instruction.startswith("Determine"):
        for verb in ["Figure out", "Identify", "Assess"]:
            paraphrases.append(instruction.replace("Determine", verb, 1))

    # Pattern: "Answer X" -> "Respond to X" / "Address X"
    if instruction.lower().startswith("answer"):
        for verb in ["Respond to", "Address"]:
            paraphrases.append(
                instruction.replace("Answer", verb, 1).replace(
                    "answer", verb, 1
                )
            )

    # Add question form variations
    if not instruction.endswith("?"):
        question_form = (
            f"Can you {instruction[0].lower()}{instruction[1:].rstrip('.')}?"
        )
        paraphrases.append(question_form)

    return paraphrases


# Example paraphrasing
original = "Write a haiku about artificial intelligence."
paraphrased = paraphrase_instruction(original)
Out[13]:
Console
Original: "Write a haiku about artificial intelligence."

Paraphrased versions:
  - "Compose a haiku about artificial intelligence."
  - "Create a haiku about artificial intelligence."
  - "Draft a haiku about artificial intelligence."
  - "Produce a haiku about artificial intelligence."
  - "Can you write a haiku about artificial intelligence?"

The rule-based approach generates several valid syntactic variations of the original instruction, expanding the dataset without changing the semantic intent. This teaches the model that different queries you write can map to the same underlying task. While rule-based paraphrasing is limited in its flexibility, it provides a foundation that can be enhanced with more sophisticated language model-based paraphrasing techniques.

Input variation creates new examples by substituting different inputs into the same instruction template. This technique recognizes that many instructions follow reusable patterns where the specific content can change while the task structure remains constant. A request to "write a Python function" can apply to countless different programming problems.

In[14]:
Code
# Input variation for the prime number coding task
coding_variations = [
    {
        "instruction": "Write a Python function that checks if a string is a palindrome.",
        "input": "",
        "base_concept": "string validation",
    },
    {
        "instruction": "Write a Python function that calculates the factorial of a number.",
        "input": "",
        "base_concept": "mathematical computation",
    },
    {
        "instruction": "Write a Python function that finds the largest element in a list.",
        "input": "",
        "base_concept": "list processing",
    },
    {
        "instruction": "Write a Python function that reverses a string.",
        "input": "",
        "base_concept": "string manipulation",
    },
]

# Topic variation for creative writing
creative_variations = [
    "Write a haiku about the ocean.",
    "Write a haiku about autumn leaves.",
    "Write a haiku about morning coffee.",
    "Write a haiku about city life.",
    "Write a haiku about solitude.",
]
Out[15]:
Console
Input variations for coding task pattern:
  - Write a Python function that checks if a string is a palindrome. (string validation)
  - Write a Python function that calculates the factorial of a number. (mathematical computation)
  - Write a Python function that finds the largest element in a list. (list processing)
  - Write a Python function that reverses a string. (string manipulation)

Topic variations for creative writing pattern:
  - Write a haiku about the ocean.
  - Write a haiku about autumn leaves.
  - Write a haiku about morning coffee.
  - Write a haiku about city life.
  - Write a haiku about solitude.

Diversity Sampling

When expanding from seeds, maintaining diversity is crucial. Without careful sampling, expansion tends to produce many similar examples clustered around popular seed patterns while neglecting rarer but important task types. This clustering reduces the effective size of the dataset and can lead to models that perform well on common tasks but poorly on less frequent ones.

Diversity sampling techniques help ensure balanced coverage:

  • Category balancing: Ensuring expanded data maintains representation across all task categories prevents any single category from dominating the training signal.
  • Embedding-based filtering: Using text embeddings to detect and remove examples too similar to existing ones keeps the dataset diverse at a semantic level (see the sketch after this list).
  • Verb/topic tracking: Monitoring the distribution of instruction verbs and topics to prevent overconcentration helps identify when expansion is gravitating toward particular patterns.
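
As a concrete example of the embedding-based idea, the sketch below greedily accepts each candidate instruction only if its similarity to every already-accepted instruction stays below a threshold. TF-IDF vectors stand in for real sentence embeddings here, and both the candidate strings and the 0.6 threshold are illustrative assumptions (the cell assumes scikit-learn is available).

Code
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def greedy_diversity_filter(candidates: list, threshold: float = 0.6) -> list:
    """Keep a candidate only if it is not too similar to any already-kept instruction."""
    # TF-IDF over the candidate pool serves as a cheap stand-in for sentence embeddings
    vectorizer = TfidfVectorizer(ngram_range=(1, 2))
    vectors = vectorizer.fit_transform(candidates)

    kept_indices = []
    for i in range(len(candidates)):
        if kept_indices:
            # Maximum similarity to anything accepted so far
            max_sim = cosine_similarity(vectors[i], vectors[kept_indices]).max()
            if max_sim >= threshold:
                continue  # Too close to an existing instruction, skip it
        kept_indices.append(i)

    return [candidates[i] for i in kept_indices]


# Illustrative expansion output containing a near-redundant candidate
candidate_instructions = [
    "Write a haiku about the ocean.",
    "Write a haiku about the sea.",
    "Explain how photosynthesis works.",
    "Summarize the following article.",
    "Write a haiku about autumn leaves.",
]

diverse_subset = greedy_diversity_filter(candidate_instructions, threshold=0.6)

The same loop works with any embedding model; only the vectorization step changes. The verb and topic tracking idea can be quantified just as easily, as the next cell shows.
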
In[16]:
Code
import numpy as np
from collections import Counter


def calculate_instruction_diversity(instructions: list) -> dict:
    """Analyze the diversity of a set of instructions."""

    # Extract first verbs (simplified)
    first_words = [inst.split()[0].lower() for inst in instructions]
    verb_counts = Counter(first_words)

    # Calculate entropy as a diversity measure
    total = len(first_words)
    probabilities = [count / total for count in verb_counts.values()]
    entropy = -sum(p * np.log2(p) for p in probabilities if p > 0)
    max_entropy = np.log2(len(verb_counts))  # Maximum possible entropy

    # Normalized diversity score (0-1)
    diversity_score = entropy / max_entropy if max_entropy > 0 else 0

    return {
        "num_instructions": total,
        "unique_first_words": len(verb_counts),
        "most_common": verb_counts.most_common(5),
        "diversity_score": diversity_score,
    }


# Compare diversity of two hypothetical expanded datasets
diverse_instructions = [
    "Write a poem about nature.",
    "Explain quantum computing.",
    "Summarize the main points.",
    "Calculate the area of a circle.",
    "Translate this to Spanish.",
    "Debug the following code.",
    "Classify this sentiment.",
    "Generate a story beginning.",
    "Analyze the argument's logic.",
    "Compare these two approaches.",
]

repetitive_instructions = [
    "Write a poem about love.",
    "Write a story about adventure.",
    "Write an essay about climate.",
    "Write a haiku about seasons.",
    "Write a limerick about cats.",
    "Explain machine learning.",
    "Explain neural networks.",
    "Explain deep learning.",
    "Write a sonnet about time.",
    "Write a paragraph about history.",
]

diverse_stats = calculate_instruction_diversity(diverse_instructions)
repetitive_stats = calculate_instruction_diversity(repetitive_instructions)
Out[17]:
Console
Diverse instruction set:
  Unique starting words: 10
  Diversity score: 1.000
  Most common: [('write', 1), ('explain', 1), ('summarize', 1), ('calculate', 1), ('translate', 1)]

Repetitive instruction set:
  Unique starting words: 2
  Diversity score: 0.881
  Most common: [('write', 7), ('explain', 3)]
Out[18]:
Visualization
Diversity score comparison between diverse and repetitive datasets. The diverse dataset achieves a score of 1.0, indicating high entropy and balanced task representation, while the repetitive dataset scores lower due to its heavy reuse of a few starting verbs.
Starting verb distribution for diverse versus repetitive datasets. Repetitive datasets heavily overuse specific verbs like 'Write', whereas diverse datasets demonstrate a broader lexical spread and more varied instruction structures.

The diversity score captures how evenly distributed the instruction patterns are. A dataset dominated by "Write..." instructions scores lower than one with balanced representation across different task types. This metric provides a quantitative way to track diversity during expansion and flag potential problems before they affect model training.

Quality Filtering

Whether data comes from human annotation, template conversion, or seed expansion, quality filtering is essential. Raw data inevitably contains errors, duplicates, and low-quality examples that can harm model training. A systematic filtering pipeline removes these problems while preserving valuable examples. The goal is to maximize the signal-to-noise ratio in the training data, ensuring that every example the model sees contributes positively to its learning.

Length and Format Filters

Basic heuristics catch many obvious problems. These filters are fast to compute and can be applied to millions of examples without significant computational cost. While simple, they remove a substantial portion of problematic data before more expensive filtering stages.

In[19]:
Code
def apply_basic_filters(example: dict) -> tuple:
    """Apply basic quality filters to an instruction example.

    Returns (passed: bool, rejection_reason: str or None)
    """
    instruction = example.get("instruction", "")
    output = example.get("output", "")

    # Filter 1: Instruction too short
    if len(instruction.split()) < 3:
        return False, "instruction_too_short"

    # Filter 2: Output too short (might indicate incomplete response)
    if len(output.split()) < 2:
        return False, "output_too_short"

    # Filter 3: Output too long (might indicate runaway generation)
    if len(output.split()) > 2000:
        return False, "output_too_long"

    # Filter 4: Instruction is just a single word repeated
    words = instruction.lower().split()
    if len(set(words)) == 1:
        return False, "repetitive_instruction"

    # Filter 5: Output contains error markers
    error_markers = [
        "I cannot",
        "I'm not able to",
        "Error:",
        "undefined",
        "NaN",
    ]
    if any(marker.lower() in output.lower() for marker in error_markers):
        return False, "error_in_output"

    # Filter 6: Instruction-output mismatch (output repeats instruction verbatim)
    if instruction.lower().strip() == output.lower().strip():
        return False, "output_copies_instruction"

    return True, None


# Test examples
test_examples = [
    {
        "instruction": "Hi",
        "output": "Hello! How can I help you today?",
    },  # Too short instruction
    {"instruction": "Write a poem", "output": ""},  # Empty output
    {
        "instruction": "Explain quantum computing",
        "output": "Quantum computing uses quantum mechanical phenomena...",
    },  # Good
    {
        "instruction": "What is 2+2?",
        "output": "I cannot answer mathematical questions.",
    },  # Error marker
    {
        "instruction": "Say hello",
        "output": "Say hello",
    },  # Output copies instruction (fails the instruction-length filter first)
]

filter_results = [(ex, *apply_basic_filters(ex)) for ex in test_examples]
Out[20]:
Console
Basic filter results:

  Instruction: "Hi..."
  Status: ✗ FAIL (instruction_too_short)

  Instruction: "Write a poem..."
  Status: ✗ FAIL (output_too_short)

  Instruction: "Explain quantum computing..."
  Status: ✓ PASS

  Instruction: "What is 2+2?..."
  Status: ✗ FAIL (error_in_output)

  Instruction: "Say hello..."
  Status: ✗ FAIL (instruction_too_short)

Deduplication

Duplicate or near-duplicate examples waste training compute and can cause the model to memorize specific examples rather than learn general patterns. Deduplication operates at multiple levels, from exact string matching to semantic similarity detection. Each level catches different types of redundancy.

Exact deduplication removes identical instruction-output pairs. This is the simplest form of deduplication but catches surprisingly many duplicates, especially in data generated through automated expansion or scraped from multiple sources.

In[21]:
Code
def exact_deduplicate(examples: list) -> list:
    """Remove exact duplicate examples."""
    seen = set()
    unique = []

    for ex in examples:
        # Create a hashable key from instruction and output
        key = (
            ex.get("instruction", "").strip().lower(),
            ex.get("output", "").strip().lower(),
        )

        if key not in seen:
            seen.add(key)
            unique.append(ex)

    return unique


# Example with duplicates
examples_with_dupes = [
    {
        "instruction": "What is Python?",
        "output": "Python is a programming language.",
    },
    {
        "instruction": "Explain machine learning",
        "output": "Machine learning is a subset of AI.",
    },
    {
        "instruction": "What is Python?",
        "output": "Python is a programming language.",
    },  # Exact dupe
    {
        "instruction": "WHAT IS PYTHON?",
        "output": "PYTHON IS A PROGRAMMING LANGUAGE.",
    },  # Case variant
    {
        "instruction": "Explain machine learning",
        "output": "ML is a field of artificial intelligence.",
    },  # Same instruction, different output
]

deduped = exact_deduplicate(examples_with_dupes)
Out[22]:
Console
Before deduplication: 5 examples
After deduplication: 3 examples

Remaining examples:
  - "What is Python?" -> "Python is a programming langua..."
  - "Explain machine learning" -> "Machine learning is a subset o..."
  - "Explain machine learning" -> "ML is a field of artificial in..."

Removing exact duplicates reduces the dataset size but ensures that the model encounters unique examples, preventing overfitting to repeated tokens. In this case, identical and case-variant duplicates were successfully consolidated. Note that examples with the same instruction but different outputs are preserved, as they represent genuinely different training signals.

Near-duplicate detection catches examples that are semantically identical but superficially different. Two instructions might ask the same thing in slightly different words, and training on both provides little additional value. This typically uses embedding similarity to identify examples that are close in semantic space.

In[23]:
Code
!uv pip install scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def find_near_duplicates(instructions: list, threshold: float = 0.85) -> list:
    """Find pairs of instructions that are near-duplicates using TF-IDF similarity."""
    
    # Vectorize instructions
    vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=1000)
    tfidf_matrix = vectorizer.fit_transform(instructions)
    
    # Compute pairwise similarities
    similarities = cosine_similarity(tfidf_matrix)
    
    # Find pairs above threshold (excluding self-similarity)
    near_dupes = []
    n = len(instructions)
    for i in range(n):
        for j in range(i + 1, n):
            if similarities[i, j] >= threshold:
                near_dupes.append((i, j, similarities[i, j]))
    
    return near_dupes

# Test with similar instructions
instructions = [
    "Write a Python function to check if a number is prime.",
    "Create a Python function that determines if a number is prime.",
    "Explain the concept of machine learning in simple terms.",
    "Explain machine learning to a beginner.",
    "What is the capital of France?",
    "Translate 'Hello' to Spanish."
]

near_dupes = find_near_duplicates(instructions, threshold=0.5)
Out[24]:
Console
Near-duplicate detection results:

Out[25]:
Visualization
Pairwise cosine similarity matrix of instruction embeddings. Darker cells indicate higher similarity, identifying potential near-duplicates like 'Prime check' variations. The block-diagonal structure (excluding the main diagonal) highlights clusters of semantically redundant examples that exact string matching would miss.

The high similarity scores identify pairs that differ only slightly in wording, allowing us to flag potential redundancies that exact matching misses. Filtering these near-duplicates ensures the training compute is spent on genuinely distinct examples. The threshold for near-duplicate detection requires careful tuning: too low and you remove legitimate variations; too high and redundant examples slip through.
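
One practical way to pick that threshold is to sweep several candidate values and see how many pairs each one flags. The short sketch below reuses the find_near_duplicates function and the instructions list defined above; the specific threshold values are arbitrary choices for illustration.

Code
# Sweep candidate thresholds and count how many pairs each one flags.
# Reuses find_near_duplicates() and the instructions list from the cell above.
candidate_thresholds = [0.3, 0.5, 0.7, 0.9]
pairs_flagged = {
    t: len(find_near_duplicates(instructions, threshold=t))
    for t in candidate_thresholds
}

Inspecting a few of the flagged pairs at each level, rather than relying on counts alone, shows where legitimate variation starts being removed.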

Model-Based Quality Scoring

Human-written heuristics can only catch obvious problems. For subtler quality issues, including unclear instructions, factually incorrect outputs, or stylistically poor responses, model-based filtering provides more nuanced assessment. These approaches leverage the judgment capabilities of language models to evaluate aspects of quality that are difficult to capture in simple rules.

The approach uses a language model to score or classify the quality of each example. The model can be prompted to evaluate various dimensions of quality: clarity of the instruction, helpfulness of the response, correctness of any factual claims, and appropriateness of the tone.
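
To make that concrete, the sketch below shows one way such a scoring prompt might be assembled and the model's reply parsed into a numeric score. The prompt wording, the 1-5 scale, and the commented-out call_language_model placeholder are illustrative assumptions, not a specific API; a real pipeline would substitute an actual model call. The cell that follows then simulates the resulting scores rather than calling a model.

Code
# Illustrative prompt for an LLM-as-judge quality scorer (wording is an assumption)
QUALITY_PROMPT = """Rate the quality of this instruction-response pair on a scale from 1 to 5.
Consider the clarity of the instruction, the helpfulness and factual correctness
of the response, and the appropriateness of its tone.
Reply with a single number.

Instruction: {instruction}
Response: {output}

Rating:"""


def build_quality_prompt(example: dict) -> str:
    """Fill the scoring prompt with a single instruction-response pair."""
    return QUALITY_PROMPT.format(
        instruction=example.get("instruction", ""),
        output=example.get("output", ""),
    )


def parse_quality_score(model_reply: str, default: float = 3.0) -> float:
    """Extract the first number from the judge's reply, clamped to the 1-5 range."""
    for token in model_reply.split():
        try:
            return max(1.0, min(5.0, float(token.strip(".,"))))
        except ValueError:
            continue
    return default  # Neutral fallback if the reply contains no number


# Hypothetical usage with a placeholder model call (not implemented here):
# reply = call_language_model(build_quality_prompt(example))
# score = parse_quality_score(reply)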

In[26]:
Code
# Simulating model-based quality scoring
# In practice, this would use an actual language model


def simulate_quality_scores(examples: list, seed: int = 42) -> list:
    """Simulate quality scores that a model might assign.

    Real implementations would use prompts like:
    "Rate the quality of this instruction-response pair from 1-5..."
    """
    np.random.seed(seed)

    scored_examples = []
    for ex in examples:
        # Simulate scores based on heuristic proxies
        instruction_length = len(ex.get("instruction", "").split())
        output_length = len(ex.get("output", "").split())

        # Base score with some randomness
        base_score = 3.0

        # Longer, more detailed outputs tend to score higher
        if output_length > 20:
            base_score += 0.5
        if output_length > 50:
            base_score += 0.3

        # Very short outputs score lower
        if output_length < 5:
            base_score -= 1.0

        # Add noise to simulate model uncertainty
        score = base_score + np.random.normal(0, 0.3)
        score = max(1.0, min(5.0, score))  # Clamp to 1-5

        scored_examples.append({**ex, "quality_score": round(score, 2)})

    return scored_examples


# Example scoring
examples_to_score = [
    {"instruction": "What is 2+2?", "output": "4"},
    {
        "instruction": "Explain the theory of relativity.",
        "output": "Einstein's theory of relativity actually consists of two interrelated theories: special relativity and general relativity. Special relativity, published in 1905, deals with objects moving at constant speeds, particularly those approaching the speed of light. It introduced the famous equation E=mc², showing that mass and energy are equivalent.",
    },
    {
        "instruction": "Write a story.",
        "output": "Once upon a time, there was a kingdom.",
    },
]

scored = simulate_quality_scores(examples_to_score)
Out[27]:
Console
Model-based quality scoring results:

Score: 2.15/5.0
  Instruction: "What is 2+2?"
  Output: "4"

Score: 3.46/5.0
  Instruction: "Explain the theory of relativity."
  Output: "Einstein's theory of relativity actually consists ..."

Score: 3.19/5.0
  Instruction: "Write a story."
  Output: "Once upon a time, there was a kingdom."

The simulated scores reflect the heuristic that detailed, well-structured responses (like the relativity explanation) are more valuable for instruction tuning than brief or generic outputs. By scoring examples, we can filter out low-value data or curriculum-train the model on high-quality examples first. This curriculum approach exposes the model to the best examples early in training, establishing good patterns before introducing noisier data.

In production systems, quality filtering often uses a tiered approach: fast heuristic filters remove obvious problems, then more expensive model-based scoring handles the remaining examples. This balances thoroughness with computational cost. Running a language model over every example is expensive, so reserving model-based filtering for examples that pass initial heuristic checks makes the pipeline practical at scale.

Building a Complete Filtering Pipeline

Combining all filtering stages into a unified pipeline ensures consistent quality across the dataset. A well-designed pipeline applies filters in order of computational cost, with cheap heuristic filters first and expensive model-based scoring last. This design minimizes the total compute required while still achieving thorough filtering.

In[28]:
Code
class InstructionDataFilter:
    """Pipeline for filtering instruction-tuning data."""

    def __init__(
        self, min_quality_score: float = 3.0, similarity_threshold: float = 0.9
    ):
        self.min_quality_score = min_quality_score
        self.similarity_threshold = similarity_threshold
        self.filter_stats = {
            "total_input": 0,
            "failed_basic": 0,
            "failed_duplicate": 0,
            "failed_quality": 0,
            "passed": 0,
        }

    def _apply_basic_filters(self, example: dict) -> bool:
        """Apply length and format filters."""
        instruction = example.get("instruction", "")
        output = example.get("output", "")

        if len(instruction.split()) < 3:
            return False
        if len(output.split()) < 2:
            return False
        if len(output.split()) > 2000:
            return False

        return True

    def filter_dataset(self, examples: list) -> list:
        """Run the complete filtering pipeline."""
        self.filter_stats["total_input"] = len(examples)

        # Stage 1: Basic filters
        stage1_passed = []
        for ex in examples:
            if self._apply_basic_filters(ex):
                stage1_passed.append(ex)
            else:
                self.filter_stats["failed_basic"] += 1

        # Stage 2: Exact deduplication
        seen = set()
        stage2_passed = []
        for ex in stage1_passed:
            key = (
                ex["instruction"].strip().lower(),
                ex["output"].strip().lower(),
            )
            if key not in seen:
                seen.add(key)
                stage2_passed.append(ex)
            else:
                self.filter_stats["failed_duplicate"] += 1

        # Stage 3: Quality scoring (simulated)
        stage3_passed = []
        scored = simulate_quality_scores(stage2_passed)
        for ex in scored:
            if ex["quality_score"] >= self.min_quality_score:
                stage3_passed.append(ex)
            else:
                self.filter_stats["failed_quality"] += 1

        self.filter_stats["passed"] = len(stage3_passed)
        return stage3_passed

    def get_stats(self) -> dict:
        """Return filtering statistics."""
        stats = self.filter_stats.copy()
        if stats["total_input"] > 0:
            stats["pass_rate"] = stats["passed"] / stats["total_input"]
        return stats


# Create a test dataset with various quality issues
test_dataset = [
    {
        "instruction": "Explain photosynthesis",
        "output": "Photosynthesis is the process by which plants convert sunlight into energy. It occurs in the chloroplasts of plant cells.",
    },
    {"instruction": "Hi", "output": "Hello!"},  # Too short
    {
        "instruction": "Write a poem",
        "output": "Roses are red, violets are blue, this is a poem, written for you.",
    },
    {
        "instruction": "What is AI?",
        "output": "AI stands for artificial intelligence, which refers to computer systems that can perform tasks typically requiring human intelligence.",
    },
    {
        "instruction": "Explain photosynthesis",
        "output": "Photosynthesis is the process by which plants convert sunlight into energy. It occurs in the chloroplasts of plant cells.",
    },  # Duplicate (both copies also fail the instruction-length filter)
    {"instruction": "Help", "output": "Sure!"},  # Low quality
]

# Run the pipeline
filter_pipeline = InstructionDataFilter(min_quality_score=2.5)
filtered_dataset = filter_pipeline.filter_dataset(test_dataset)
stats = filter_pipeline.get_stats()
Out[29]:
Console
Filtering Pipeline Results:
  Total input: 6
  Failed basic filters: 4
  Failed deduplication: 0
  Failed quality check: 0
  Passed all stages: 2
  Pass rate: 33.3%

Filtered examples (2):
  - "Write a poem" (score: 3.15)
  - "What is AI?" (score: 2.96)
Out[30]:
Visualization
Data volume retention through the quality filtering pipeline. In this small example the largest drop occurs during basic filtering, with deduplication and quality scoring removing nothing further. Applied at scale, the same funnel structure ensures that only high-quality, unique examples reach the final training set.

Combining Data Sources

Real instruction-tuning datasets rarely use a single creation method. Instead, they combine multiple sources to balance their respective strengths and weaknesses. This combination recognizes that no single approach excels at everything: human annotation provides quality but not scale, templates provide scale but not naturalness, and seed expansion provides a middle ground that still requires careful curation.

Comparison of instruction data creation methods and their tradeoffs.

Source              | Strengths                       | Weaknesses
Human annotation    | High quality, natural phrasing  | Expensive, limited scale
Template conversion | Large scale, task diversity     | Formulaic, limited creativity
Seed expansion      | Balance of quality and scale    | Requires good seeds, can drift

The FLAN-T5 model, for instance, combines template-converted versions of 62 existing datasets with chain-of-thought examples and dialogue data. This mixture teaches both task-specific skills and more general conversational abilities. The template-converted data provides broad coverage of NLP tasks, while the chain-of-thought and dialogue data teach more sophisticated reasoning and interaction patterns.

Out[31]:
Visualization
Composition of a hybrid instruction-tuning dataset. Template-converted data forms the majority (65%) to provide scale and breadth, while human-written and seed-expanded examples (27%) are prioritized for their high quality and creativity, illustrating the strategic balance between volume and excellence.

This visualization highlights the strategic balance between volume (template-converted) and quality (human-written/seed-expanded) required for effective instruction tuning. While automated methods provide the bulk of the data, the smaller high-quality components are essential for steering the model's style and capabilities. The human-written demonstrations, though small in number, have outsized influence on how the model behaves, establishing patterns that the larger automated data then reinforces.
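
To make the mixing step concrete, here is a minimal sketch of sampling a combined dataset according to per-source proportions. The source names, sizes, and mixture weights below are illustrative assumptions, not the published FLAN-T5 recipe or the exact proportions in the chart above.

Code
import random


def mix_data_sources(
    sources: dict, mixture_weights: dict, total_examples: int, seed: int = 0
) -> list:
    """Sample a combined dataset according to per-source mixture weights.

    sources: mapping from source name to a list of examples
    mixture_weights: mapping from source name to its target fraction (should sum to ~1)
    """
    rng = random.Random(seed)
    mixed = []
    for name, weight in mixture_weights.items():
        n_target = int(round(total_examples * weight))
        pool = sources[name]
        if n_target <= len(pool):
            mixed.extend(rng.sample(pool, n_target))
        else:
            # Small sources get oversampled (with replacement) to hit their share
            mixed.extend(rng.choices(pool, k=n_target))
    rng.shuffle(mixed)
    return mixed


# Hypothetical sources and weights, purely for illustration
sources = {
    "template_converted": [{"instruction": f"Template task {i}", "output": "..."} for i in range(1000)],
    "human_written": [{"instruction": f"Human task {i}", "output": "..."} for i in range(100)],
    "seed_expanded": [{"instruction": f"Expanded task {i}", "output": "..."} for i in range(300)],
}
mixture_weights = {"template_converted": 0.65, "human_written": 0.15, "seed_expanded": 0.20}

mixed_dataset = mix_data_sources(sources, mixture_weights, total_examples=500)

Adjusting these weights is one of the main levers for shaping the resulting model's behavior: upweighting the small human-written pool trades breadth for style and quality.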

Limitations and Practical Considerations

Instruction data creation faces several fundamental challenges that you must navigate.

Quality-diversity tradeoffs are unavoidable. Human annotation produces high-quality data but struggles to cover the full space of possible instructions. Template-based methods achieve coverage but produce data that lacks the naturalness of your queries. Seed expansion can balance these concerns but requires careful monitoring to prevent quality degradation as the dataset grows. No single approach solves all problems, which is why production systems invariably combine multiple methods.

Dataset biases propagate to model behavior. If your instruction data overrepresents certain topics, phrasings, or response styles, the fine-tuned model will reflect those biases. This is particularly insidious because biases in training data are often invisible until they manifest as unexpected model behaviors in deployment. For instance, a dataset heavy on formal business writing might produce a model that struggles with casual conversation, while one focused on American English idioms might confuse users in other English-speaking regions.

Scaling data creation remains expensive. Despite advances in template-based and model-assisted generation, creating genuinely diverse, high-quality instruction data still requires significant human effort. The cost estimates we calculated earlier for human annotation represent a floor, not a ceiling. Real projects often spend considerably more on iteration, quality control, and fixing problems discovered during training. This expense motivates ongoing research into more efficient data creation methods, including the Self-Instruct approach described in the next chapter.

Evaluation lags behind creation. While many techniques exist for creating instruction data, evaluating its quality at scale remains challenging. Automated metrics capture surface-level properties but miss deeper quality issues. Human evaluation is expensive and slow. This asymmetry means data problems often remain undiscovered until after a model has been trained and its behavior observed, creating costly iteration cycles.

Summary

Instruction data creation bridges the gap between pre-trained language models and instruction-following assistants. This chapter covered four complementary approaches that form the foundation of modern instruction-tuning pipelines:

Human annotation provides the highest quality data but is constrained by cost. Effective annotation requires clear guidelines, quality control mechanisms, and acceptance that scale will be limited. Human-written demonstrations remain valuable for establishing quality standards and covering tasks that require genuine expertise.

Template-based generation converts existing supervised datasets into instruction format, providing scale and task diversity. Well-designed templates vary in phrasing and output format to teach flexible instruction following. However, templates produce formulaic data that lacks the naturalness of your real interactions.

Seed task expansion starts with a small set of carefully curated examples and uses various techniques to grow the dataset while maintaining diversity. Good seed sets demonstrate variety across task categories, formats, complexity levels, and styles. Expansion techniques include paraphrasing, input variation, and diversity sampling to prevent clustering around popular patterns.

Quality filtering removes problematic examples regardless of their source. A complete pipeline combines fast heuristic filters for obvious problems, deduplication to prevent memorization, and model-based scoring for subtler quality issues. Filtering statistics help identify systematic problems in data creation processes.

The next chapter explores Self-Instruct, a specific seed expansion technique that uses a language model to generate new instruction examples from a small seed set. This approach has enabled the creation of instruction datasets at scale without massive human annotation efforts, democratizing access to instruction-tuning beyond well-funded research labs.

