Self-Instruct: Bootstrap Instruction-Tuning Datasets

Michael Brenndoerfer · December 17, 2025 · 40 min read

Learn how Self-Instruct enables language models to generate their own training data through iterative bootstrapping from minimal human-written seed tasks.


Self-Instruct

In the previous chapter, we explored methods for creating instruction-tuning datasets, from expensive human annotation to various synthetic approaches. Each method involved trade-offs between quality, cost, and scale. What if a language model could generate its own training data? This seemingly circular idea turns out to be remarkably effective, opening up possibilities that were previously out of reach without a substantial annotation budget.

Self-Instruct is a framework that enables large language models to bootstrap their instruction-following capabilities by generating their own training examples. Starting with just a handful of human-written seed tasks, the model iteratively produces thousands of diverse instructions along with their inputs and outputs. The approach addresses a fundamental bottleneck in instruction tuning: the difficulty of creating large-scale, diverse instruction datasets without massive human effort. By leveraging the model's own capabilities, Self-Instruct transforms data creation from a labor-intensive process into an automated pipeline.

The key insight behind Self-Instruct is that language models, even before instruction tuning, already possess substantial knowledge from pre-training. They understand language structure, know facts about the world, and can follow patterns demonstrated in context. This latent capability exists because pre-training on vast text corpora exposes models to countless examples of instructions and their completions, embedded naturally within documents, tutorials, and conversations. Self-Instruct leverages these capabilities, using in-context learning (which we covered in Part XVIII) to guide the model toward generating useful instruction-following examples. Rather than teaching the model something entirely new, the framework activates and structures knowledge the model already possesses.

The Self-Instruct Pipeline

Self-Instruct operates as an iterative bootstrapping process that gradually builds a diverse instruction dataset from minimal human input. The pipeline begins with a small set of human-written seed tasks and progressively expands this collection by having the model generate new tasks, validate them, and add the high-quality ones back to the pool. This cyclical approach means that each iteration enriches the context available for subsequent generations.

The core loop consists of four stages that repeat until the desired dataset size is reached. Each stage serves a specific purpose in ensuring both the quality and diversity of the final dataset.

  1. Instruction Generation: Sample existing tasks from the pool and prompt the model to generate new, diverse instructions. This stage leverages in-context learning to demonstrate the desired format and style.
  2. Classification Identification: Determine whether each generated instruction is a classification task (requiring a fixed set of output labels) or an open-ended generation task. This distinction is crucial because it affects how we generate training instances.
  3. Instance Generation: For each instruction, generate appropriate inputs and outputs using either input-first or output-first approaches. The choice of approach depends on the task type identified in the previous stage.
  4. Filtering: Remove low-quality, duplicate, or problematic examples before adding them to the task pool. This quality control step prevents degradation of the dataset over iterations.
In[2]:
Code
# Conceptual representation of the Self-Instruct pipeline
class SelfInstructPipeline:
    def __init__(self, seed_tasks, model, target_size=50000):
        self.task_pool = list(seed_tasks)
        self.model = model
        self.target_size = target_size

    def run(self):
        while len(self.task_pool) < self.target_size:
            # Step 1: Generate new instructions
            new_instructions = self.generate_instructions()

            # Step 2: Classify each instruction
            classified = self.classify_instructions(new_instructions)

            # Step 3: Generate instances (inputs and outputs)
            instances = self.generate_instances(classified)

            # Step 4: Filter and add to pool
            filtered = self.filter_instances(instances)
            self.task_pool.extend(filtered)

        return self.task_pool

This iterative approach creates a flywheel effect: as the task pool grows with diverse examples, the model has richer context for generating even more varied instructions. The expanding diversity of demonstrations enables the model to explore new directions it might not have discovered with only the original seed tasks. The process is self-reinforcing, though it requires careful filtering to prevent degradation. Without proper quality controls, errors and biases could accumulate across iterations, leading to progressively worse outputs.

Out[4]:
Visualization
The Self-Instruct pipeline iteration process. Starting with seed tasks, the model iteratively generates instructions, classifies them, creates instances, and filters results to expand the training pool. The annotated stages show the rapid initial growth followed by diminishing returns as the pool saturates.

Seed Task Design

The quality of Self-Instruct outputs depends critically on the initial seed tasks. These human-written examples establish the format, diversity, and quality standards that the model will follow when generating new tasks. Think of seed tasks as a template that defines what "good" looks like: the model learns to mimic their structure, variety, and level of detail. Poor seed design leads to limited, repetitive outputs, while thoughtful seed curation enables the generation of rich, diverse datasets.

The original Self-Instruct paper used 175 seed tasks, each containing an instruction, optional input, and expected output. These seeds were designed to cover a broad range of task types, ensuring the model had exposure to many different patterns from the very beginning:

In[5]:
Code
# Example seed tasks spanning different categories
seed_tasks = [
    {
        "instruction": "Classify the sentiment of the given movie review.",
        "input": "This film was absolutely wonderful. The acting was superb!",
        "output": "Positive",
    },
    {
        "instruction": "Write a haiku about autumn.",
        "input": "",
        "output": "Leaves drift slowly down\nCrisp air fills the empty woods\nNature's final bow",
    },
    {
        "instruction": "Convert the following temperature from Celsius to Fahrenheit.",
        "input": "25 degrees Celsius",
        "output": "77 degrees Fahrenheit",
    },
    {
        "instruction": "List three potential causes of the given historical event.",
        "input": "The French Revolution",
        "output": "1. Economic crisis and heavy taxation of the poor\n2. Social inequality between estates\n3. Enlightenment ideas spreading democratic principles",
    },
    {
        "instruction": "Rewrite the sentence to make it more formal.",
        "input": "Hey, can you help me out with this thing?",
        "output": "Would you be able to assist me with this matter?",
    },
]

The seed tasks should exhibit several key properties that together establish a strong foundation for generation.

  • Format diversity: Mix of tasks with and without inputs, varying output lengths and structures. Some tasks require only an instruction, while others need additional context provided as input.
  • Topic coverage: Span different domains like science, arts, daily life, and abstract reasoning. This breadth ensures the model generates instructions across many subject areas rather than clustering around a single theme.
  • Task type variety: Include classification, generation, transformation, and reasoning tasks. Each task type exercises different capabilities and produces different output patterns.
  • Clear patterns: Demonstrate the expected instruction-input-output format unambiguously. The model should have no confusion about what each field contains or how they relate to one another.
Out[6]:
Visualization
Task category distribution. Generation and transformation tasks make up the majority, ensuring the model learns to produce and manipulate content.
Output type distribution. Short text outputs are most common, balanced with structured and long-form content to ensure versatility.
Input requirement distribution. The majority of tasks include input context, teaching the model to condition its generation on provided information.
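
These properties can also be checked programmatically before generation starts. The following sketch summarizes a seed collection along the axes shown above, using the seed_tasks list defined earlier; the specific fields it reports are an illustrative choice, not part of the original recipe.

from collections import Counter


def audit_seed_tasks(tasks):
    """Summarize format diversity of a seed task collection."""
    with_input = sum(1 for t in tasks if t["input"].strip())
    starting_words = Counter(t["instruction"].split()[0].lower() for t in tasks)
    output_lengths = [len(t["output"].split()) for t in tasks]
    return {
        "num_tasks": len(tasks),
        "with_input": with_input,
        "without_input": len(tasks) - with_input,
        "unique_starting_words": len(starting_words),
        "avg_output_words": sum(output_lengths) / len(output_lengths),
    }


seed_audit = audit_seed_tasks(seed_tasks)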

Instruction Generation

With seed tasks established, the model generates new instructions through prompted generation. The approach samples several existing tasks from the pool and asks the model to create novel instructions inspired by, but different from, the examples. This in-context learning approach leverages the model's ability to identify patterns and generate variations that maintain the same structure while introducing new content.

The prompt structure is carefully designed to encourage diversity while maintaining quality. By showing the model a numbered list of existing instructions and asking it to continue the list, we tap into the model's natural tendency to generate coherent continuations:

In[7]:
Code
def create_instruction_generation_prompt(sampled_tasks, num_to_generate=8):
    """
    Create a prompt for generating new instructions.

    Samples existing tasks and formats them as few-shot examples,
    then asks for new, diverse instructions.
    """
    prompt = """Come up with a series of tasks:

"""
    # Add sampled existing tasks as demonstrations
    for i, task in enumerate(sampled_tasks, 1):
        prompt += f"{i}. {task['instruction']}\n"

    # Request new instructions
    next_num = len(sampled_tasks) + 1
    prompt += f"{next_num}."

    return prompt
In[8]:
Code
import random

# Simulate instruction generation with sampled context
sampled = random.sample(seed_tasks, 3)
generation_prompt = create_instruction_generation_prompt(sampled)
Out[9]:
Console
=== Instruction Generation Prompt ===
Come up with a series of tasks:

1. Classify the sentiment of the given movie review.
2. List three potential causes of the given historical event.
3. Convert the following temperature from Celsius to Fahrenheit.
4.

The sampling strategy for selecting demonstration tasks is crucial for diversity. Rather than random uniform sampling, Self-Instruct uses a balanced approach that considers multiple factors; a sketch of such a sampler follows the list.

  • Recency bias: Slightly favor recently added tasks to explore new directions. This helps the generation process branch out from recently discovered instruction patterns rather than always returning to the same seed tasks.
  • Diversity sampling: Ensure sampled tasks span different categories. If all demonstrations are sentiment classification tasks, the model will likely generate more sentiment classification tasks.
  • Human seed inclusion: Always include some original seed tasks to maintain quality anchoring. The human-written examples serve as a quality floor, preventing drift toward lower-quality patterns.
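
A minimal sketch of such a balanced sampler is shown below. The six-to-two split between human seeds and model-generated tasks follows the mix described in the original paper, while the linear recency weights are an illustrative simplification.

import random


def sample_demonstrations(seed_tasks, generated_tasks, num_seed=6, num_generated=2):
    """
    Sample demonstration tasks for the next generation prompt.

    Mixes human-written seeds (the quality anchor) with model-generated
    tasks, weighting recently added generations slightly higher.
    """
    demos = random.sample(seed_tasks, min(num_seed, len(seed_tasks)))

    if generated_tasks and num_generated > 0:
        # Later (more recent) tasks get proportionally higher weight;
        # sampling is with replacement for simplicity
        weights = [i + 1 for i in range(len(generated_tasks))]
        demos += random.choices(
            generated_tasks,
            weights=weights,
            k=min(num_generated, len(generated_tasks)),
        )

    random.shuffle(demos)
    return demos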

When the model generates instructions, it often produces multiple candidates at once. The prompt asks for several new instructions, and the model continues generating until it hits a stop condition or maximum length. This batch generation approach is more efficient than requesting one instruction at a time:

In[10]:
Code
def parse_generated_instructions(model_output):
    """
    Extract individual instructions from model output.

    The model typically generates numbered lists like:
    4. Instruction one
    5. Instruction two
    """
    instructions = []
    lines = model_output.strip().split("\n")

    for line in lines:
        line = line.strip()
        # Match numbered items like "4. Instruction text"
        if line and line[0].isdigit():
            # Remove the number and period
            parts = line.split(".", 1)
            if len(parts) > 1:
                instruction = parts[1].strip()
                if instruction:
                    instructions.append(instruction)

    return instructions
In[11]:
Code
# Example model output
mock_model_output = """4. Summarize the main argument of the given paragraph.
5. Write a product description for a smartwatch.
6. Identify the logical fallacy in the given statement.
7. Translate the English sentence to Spanish.
8. Suggest three alternative titles for the given book."""

parsed_instructions = parse_generated_instructions(mock_model_output)
Out[12]:
Console
Parsed instructions:
  1. Summarize the main argument of the given paragraph.
  2. Write a product description for a smartwatch.
  3. Identify the logical fallacy in the given statement.
  4. Translate the English sentence to Spanish.
  5. Suggest three alternative titles for the given book.

The parsing function successfully extracts the core instruction text from the raw model output, removing numbering and whitespace. This step normalizes the data into a structured list ready for filtering and instance generation. The simplicity of this parsing reflects a key design principle of Self-Instruct: use straightforward, predictable formats that are easy to both generate and parse.

Classification Task Identification

After generating instructions, the pipeline determines whether each represents a classification task or an open-ended generation task. This distinction matters because the two task types require different instance generation strategies. Classification and generation tasks have fundamentally different output structures, and treating them identically would lead to suboptimal training data.

Classification tasks have a fixed, limited set of valid outputs (like sentiment labels or categories). The outputs for these tasks are constrained to a small vocabulary of options, and the model's job is essentially to select the correct label. Open-ended tasks can have diverse, creative outputs (like writing stories or answering questions). For these tasks, there is no single "correct" answer, and the space of valid responses is vast. The model makes this determination through another prompted generation:

In[13]:
Code
def create_classification_prompt(instruction):
    """
    Prompt to determine if an instruction represents a classification task.
    """
    prompt = f"""Determine whether the following task output is a 
fixed set of categories or labels (classification) or free-form text (generation).

Task: {instruction}

If this is a classification task, respond with "Classification".
If this is an open-ended generation task, respond with "Generation".

Answer:"""
    return prompt
In[14]:
Code
# Example classification decisions
test_instructions = [
    "Classify the sentiment of the given review as positive, negative, or neutral.",
    "Write a short story about a robot learning to paint.",
    "Determine if the given statement is a fact or an opinion.",
    "Explain the concept of photosynthesis to a 10-year-old.",
]

# In practice, these would be model outputs
classification_labels = [
    "Classification",
    "Generation",
    "Classification",
    "Generation",
]
Out[15]:
Console
Task type classification:
  [Classification] Classify the sentiment of the given review as positive, nega...
  [Generation    ] Write a short story about a robot learning to paint....
  [Classification] Determine if the given statement is a fact or an opinion....
  [Generation    ] Explain the concept of photosynthesis to a 10-year-old....

The model correctly identifies the nature of each task, distinguishing between those requiring constrained labels (Classification) and those inviting open-ended text (Generation). Notice how the presence of explicit label options in the instruction (like "positive, negative, or neutral") provides a strong signal for classification tasks, while verbs like "write" or "explain" suggest generation tasks.
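
These surface cues can also be encoded as a cheap heuristic, useful as a sanity check on the model's answer or as a fallback when a model call fails. The keyword lists below are illustrative assumptions rather than rules from the original pipeline.

def heuristic_task_type(instruction):
    """
    Keyword-based guess at task type, used alongside the model-based check.
    """
    inst = instruction.lower()

    classification_cues = [
        "classify", "determine if", "determine whether",
        "true or false", "yes or no",
    ]
    generation_cues = ["write", "explain", "summarize", "describe", "compose"]

    if any(cue in inst for cue in classification_cues):
        return "Classification"
    if any(inst.startswith(cue) for cue in generation_cues):
        return "Generation"
    # Explicit label lists like "positive, negative, or neutral"
    # are a strong hint that the output space is fixed
    if "," in inst and " or " in inst:
        return "Classification"
    return "Generation"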

The classification distinction guides the next stage of the pipeline. Classification tasks benefit from output-first generation (where you first generate possible labels, then create inputs that would warrant each label). This approach ensures balanced representation across all label categories. Generation tasks work better with input-first approaches (where you create a plausible input, then generate an appropriate output). This order is more natural for open-ended tasks where the input context shapes what constitutes a good response.

Instance Generation

Once instructions are classified, the pipeline generates complete task instances consisting of inputs and outputs. This stage uses one of two strategies depending on the task type. The choice of strategy directly impacts the quality and balance of the resulting training data.

Input-First Approach

For open-ended generation tasks, the model first creates a plausible input, then generates the corresponding output. This approach works well when the input constrains what kind of output makes sense: given a specific article to summarize, there are many valid summaries, but they should all reflect the article's content. By generating the input first, we establish a concrete context that grounds the output generation.

In[16]:
Code
def create_input_first_prompt(instruction, demonstrations):
    """
    Generate input first, then output.
    Used for open-ended generation tasks.
    """
    prompt = f"""For the following task, first generate an appropriate input, 
then provide the expected output.

Task: {instruction}

"""
    # Add demonstrations
    for demo in demonstrations:
        prompt += f"Input: {demo['input']}\n"
        prompt += f"Output: {demo['output']}\n\n"

    prompt += "Input:"
    return prompt
In[17]:
Code
# Example: generating an instance for a summarization task
summarization_instruction = "Summarize the main points of the given article."

# Mock demonstrations
summarization_demos = [
    {
        "input": "Scientists discovered a new species of deep-sea fish that produces its own light...",
        "output": "Researchers found a bioluminescent fish species in deep ocean waters.",
    }
]

input_first_prompt = create_input_first_prompt(
    summarization_instruction, summarization_demos
)
Out[18]:
Console
=== Input-First Prompt ===
For the following task, first generate an appropriate input, 
then provide the expected output.

Task: Summarize the main points of the given article.

Input: Scientists discovered a new species of deep-sea fish that produces its own light...
Output: Researchers found a bioluminescent fish species in deep ocean waters.

Input:

The prompt explicitly instructs the model to generate the input context first, establishing a scenario that makes the subsequent output generation more natural and coherent. The demonstrations show the model what kind of inputs are appropriate for this task type and how outputs should relate to those inputs. This sequential structure mimics how humans would approach such tasks: first understand the context, then produce a response.

Output-First Approach

For classification tasks, generating labels first and then creating matching inputs produces more balanced datasets. This ordering might seem counterintuitive at first, but it addresses an important practical problem. If you generate inputs first, the model might repeatedly create inputs that warrant the same label, leading to imbalanced class distributions. For example, when generating sentiment examples, a model might produce mostly positive reviews simply because they are easier to write or more common in training data.

In[19]:
Code
def create_output_first_prompt(instruction, possible_labels, demonstrations):
    """
    Generate output (label) first, then create matching input.
    Used for classification tasks to ensure balanced labels.
    """
    prompt = f"""For the following classification task, generate an example 
for each possible output label.

Task: {instruction}
Possible labels: {", ".join(possible_labels)}

"""
    for demo in demonstrations:
        prompt += f"Label: {demo['output']}\n"
        prompt += f"Input: {demo['input']}\n\n"

    # Request a new example
    prompt += "Label:"
    return prompt
In[20]:
Code
# Example: generating instances for sentiment classification
sentiment_instruction = "Classify the sentiment of the given text."
sentiment_labels = ["Positive", "Negative", "Neutral"]

sentiment_demos = [
    {
        "output": "Positive",
        "input": "This product exceeded all my expectations!",
    },
    {"output": "Negative", "input": "Terrible service, would not recommend."},
]

output_first_prompt = create_output_first_prompt(
    sentiment_instruction, sentiment_labels, sentiment_demos
)
Out[21]:
Console
=== Output-First Prompt ===
For the following classification task, generate an example 
for each possible output label.

Task: Classify the sentiment of the given text.
Possible labels: Positive, Negative, Neutral

Label: Positive
Input: This product exceeded all my expectations!

Label: Negative
Input: Terrible service, would not recommend.

Label:

The output-first approach cycles through labels, requesting inputs for each one in turn. This ensures roughly equal representation of each class in the final dataset. By explicitly specifying which label to generate an example for, we force the model to create inputs that genuinely warrant that label rather than defaulting to the most common or easiest case. The result is a more balanced dataset that trains models to distinguish between all categories effectively.
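
The cycling can be made explicit in code. The sketch below builds on create_output_first_prompt and assumes a model object exposing a generate(prompt) method, similar to the mock model introduced later in this chapter; real usage would also parse and validate the returned text.

def generate_balanced_instances(instruction, labels, demonstrations, model):
    """
    Request one input per label in turn so every class is represented.
    """
    instances = []
    for label in labels:
        # Build the output-first prompt, then condition on the target label
        prompt = create_output_first_prompt(instruction, labels, demonstrations)
        prompt += f" {label}\nInput:"
        generated_input = model.generate(prompt)
        instances.append(
            {
                "instruction": instruction,
                "input": generated_input.strip(),
                "output": label,
            }
        )
    return instances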

Out[22]:
Visualization
Input-first generation results. Generating inputs before outputs mirrors natural language production but leads to class imbalance by favoring common outputs.
Output-first generation results. Conditioning on the label first enforces a balanced distribution across all target classes.

Filtering Strategies

Raw generated data contains significant noise: duplicate instructions, low-quality outputs, formatting errors, and examples that are too similar to existing ones. Filtering removes these problematic cases before they contaminate the task pool. Without rigorous filtering, the iterative nature of Self-Instruct would cause errors to compound over time, degrading the quality of each successive generation.

ROUGE-Based Similarity Filtering

The most important filter prevents near-duplicate instructions from entering the pool. New instructions that are too similar to existing ones add no diversity and waste generation effort. If two instructions are nearly identical, including both provides little additional signal during training. Self-Instruct uses ROUGE-L similarity to measure instruction overlap, comparing the longest common subsequence between instruction texts.

In[23]:
Code
def compute_rouge_l(reference, candidate):
    """
    Compute ROUGE-L score between two texts.
    ROUGE-L uses longest common subsequence.
    """
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()

    # Compute LCS length using dynamic programming
    m, n = len(ref_tokens), len(cand_tokens)
    dp = [[0] * (n + 1) for _ in range(m + 1)]

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if ref_tokens[i - 1] == cand_tokens[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])

    lcs_length = dp[m][n]

    if lcs_length == 0:
        return 0.0

    # Compute precision, recall, and F1
    precision = lcs_length / n if n > 0 else 0
    recall = lcs_length / m if m > 0 else 0

    if precision + recall == 0:
        return 0.0

    f1 = 2 * precision * recall / (precision + recall)
    return f1
In[24]:
Code
def filter_by_similarity(new_instruction, existing_instructions, threshold=0.7):
    """
    Check if a new instruction is too similar to existing ones.
    Returns True if the instruction should be kept (is sufficiently novel).
    """
    for existing in existing_instructions:
        similarity = compute_rouge_l(existing, new_instruction)
        if similarity > threshold:
            return False
    return True
In[25]:
Code
# Test similarity filtering
existing = [
    "Summarize the given article in three sentences.",
    "Translate the text from English to French.",
    "Write a poem about nature.",
]

candidates = [
    "Summarize the given paragraph in three sentences.",  # Too similar
    "Write a haiku about the ocean.",  # Different enough
    "Classify the sentiment of the review.",  # Novel
]

filter_results = []
for cand in candidates:
    is_novel = filter_by_similarity(cand, existing)
    # Find the most similar existing instruction
    max_sim = max(compute_rouge_l(e, cand) for e in existing)
    filter_results.append((cand, is_novel, max_sim))
Out[26]:
Console
Similarity filtering results (threshold=0.7):
----------------------------------------------------------------------
[REJECT] (sim=0.86) Summarize the given paragraph in three sentences.
[KEEP  ] (sim=0.55) Write a haiku about the ocean.
[KEEP  ] (sim=0.15) Classify the sentiment of the review.

The threshold of 0.7 ROUGE-L similarity balances diversity against strictness. Lower thresholds enforce more diversity but risk rejecting valid variations that might provide useful training signal. Higher thresholds allow more similar instructions, potentially reducing dataset diversity. The choice of 0.7 reflects empirical testing showing it effectively removes near-duplicates while accepting meaningfully different instructions.
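
To get a feel for how the threshold choice plays out, you can sweep it over the candidate instructions from the cell above and count how many survive, reusing the filter_by_similarity function defined earlier.

# Sweep similarity thresholds over the candidates defined above
for threshold in [0.5, 0.6, 0.7, 0.8, 0.9]:
    kept = [
        cand
        for cand in candidates
        if filter_by_similarity(cand, existing, threshold=threshold)
    ]
    print(f"threshold={threshold:.1f}: kept {len(kept)}/{len(candidates)} candidates")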

Out[27]:
Visualization
Pairwise ROUGE-L similarity heatmap for seed task instructions. The predominantly light yellow off-diagonal regions indicate low similarity scores, confirming that the seed tasks are distinct from one another and provide a diverse foundation for generation.

Length and Format Filtering

Simple heuristics catch many problematic generations without requiring complex analysis. These filters check basic structural properties that valid instructions should satisfy:

In[28]:
Code
def apply_length_filters(instruction, input_text, output_text):
    """
    Apply length-based quality filters.
    Returns (is_valid, reason) tuple.
    """
    # Instruction should be meaningful but not excessively long
    inst_words = len(instruction.split())
    if inst_words < 3:
        return False, "Instruction too short"
    if inst_words > 150:
        return False, "Instruction too long"

    # Output should exist and be reasonable length
    if not output_text.strip():
        return False, "Empty output"

    out_words = len(output_text.split())
    if out_words > 1000:
        return False, "Output too long"

    # Input can be empty (for tasks like "Write a poem about...")
    # but shouldn't be excessively long
    if input_text:
        inp_words = len(input_text.split())
        if inp_words > 500:
            return False, "Input too long"

    return True, "Passed"
In[29]:
Code
def apply_format_filters(instruction, input_text, output_text):
    """
    Filter based on formatting issues.
    """
    # Check for incomplete generations (common failure mode)
    if output_text.rstrip().endswith("..."):
        return False, "Incomplete output"

    # Filter obvious repetitions
    words = output_text.lower().split()
    if len(words) > 10:
        unique_ratio = len(set(words)) / len(words)
        if unique_ratio < 0.3:
            return False, "Too much repetition"

    # Check instruction doesn't start with prohibited phrases
    prohibited_starts = [
        "write a program",  # Often leads to code that can't be verified
        "create a code",
        "as an ai",
        "i cannot",
    ]
    inst_lower = instruction.lower()
    for phrase in prohibited_starts:
        if inst_lower.startswith(phrase):
            return False, f"Prohibited start: {phrase}"

    return True, "Passed"
In[30]:
Code
# Test filtering pipeline
test_instances = [
    {
        "instruction": "Summarize the article.",
        "input": "Long article text here...",
        "output": "This article discusses...",
    },
    {
        "instruction": "Hi",  # Too short
        "input": "",
        "output": "Hello",
    },
    {
        "instruction": "Write a poem about rain.",
        "input": "",
        "output": "the the the the the the the the the the the",  # Repetitive
    },
    {
        "instruction": "Explain quantum computing.",
        "input": "",
        "output": "",  # Empty output
    },
]

filter_outcomes = []
for inst in test_instances:
    length_ok, length_reason = apply_length_filters(
        inst["instruction"], inst["input"], inst["output"]
    )
    format_ok, format_reason = apply_format_filters(
        inst["instruction"], inst["input"], inst["output"]
    )

    if not length_ok:
        filter_outcomes.append(
            (inst["instruction"][:40], "REJECT", length_reason)
        )
    elif not format_ok:
        filter_outcomes.append(
            (inst["instruction"][:40], "REJECT", format_reason)
        )
    else:
        filter_outcomes.append(
            (inst["instruction"][:40], "KEEP", "All filters passed")
        )
Out[31]:
Console
Filter results:
----------------------------------------------------------------------
[REJECT] Summarize the article.                   | Incomplete output
[REJECT] Hi                                       | Instruction too short
[REJECT] Write a poem about rain.                 | Too much repetition
[REJECT] Explain quantum computing.               | Empty output

All four test instances are rejected: the first because its output trails off with an ellipsis and is flagged as incomplete, and the others because the instruction is too short, the output is repetitive, or the output is empty. These simple checks are computationally cheap but catch a surprising number of problematic cases, making them an efficient first line of defense against low-quality data polluting the pool.

Keyword-Based Filtering

Certain keywords indicate problematic instructions that should be excluded. These include self-references, meta-commentary, and requests for capabilities the model lacks. Keyword filtering acts as a semantic safety net, catching issues that structural filters would miss:

In[32]:
Code
BLOCKED_KEYWORDS = [
    "image",
    "images",
    "picture",
    "graph",
    "figure",  # Visual content
    "video",
    "audio",
    "voice",  # Non-text modalities
    "http",
    "www",
    "link",
    "url",  # External resources
    "gpt",
    "chatgpt",
    "openai",
    "anthropic",  # Self-reference
    "previous conversation",
    "our earlier",  # Context dependency
]


def filter_by_keywords(instruction):
    """
    Filter instructions containing problematic keywords.
    """
    inst_lower = instruction.lower()
    for keyword in BLOCKED_KEYWORDS:
        if keyword in inst_lower:
            return False, f"Contains blocked keyword: {keyword}"
    return True, "Passed"
In[33]:
Code
keyword_test_cases = [
    "Describe what you see in the image.",
    "Summarize the main points of the article.",
    "Based on our previous conversation, continue the story.",
    "Explain the theory of relativity.",
]

keyword_results = [filter_by_keywords(inst) for inst in keyword_test_cases]
Out[34]:
Console
Keyword filter results:
[REJECT] Describe what you see in the image.... | Contains blocked keyword: image
[KEEP  ] Summarize the main points of the article.... | Passed
[REJECT] Based on our previous conversation, continue the s... | Contains blocked keyword: previous conversation
[KEEP  ] Explain the theory of relativity.... | Passed

These results demonstrate that the keyword filter effectively catches blocked terms like "image" or context-dependent phrases that would be inappropriate for a standalone instruction dataset. Instructions referencing images cannot be completed by text-only models, and those depending on previous conversation lack the necessary context. By filtering these early, we avoid generating instances that would be useless or misleading during training.

Out[35]:
Visualization
Data retention through the filtering pipeline. The similarity filter rejects the largest number of candidates, preventing redundancy, while length and format filters act as initial quality gates to remove malformed generations.

Complete Self-Instruct Implementation

Let's bring together all components into a working implementation that demonstrates the full pipeline. For demonstration purposes, we'll use a mock language model, but the structure mirrors how you would integrate with real APIs. The modular design makes it straightforward to swap in actual model calls when deploying this approach in practice:

In[36]:
Code
import random


class MockLanguageModel:
    """
    Simulates LLM responses for demonstration.
    In practice, replace with API calls to GPT-3, LLaMA, etc.
    """

    def __init__(self):
        self.instruction_templates = [
            "Rewrite the given sentence in passive voice.",
            "Identify the main theme of the passage.",
            "List the pros and cons of the given topic.",
            "Explain the concept to a five-year-old.",
            "Write a professional email about the topic.",
            "Compare and contrast the two given items.",
            "Provide three examples of the given concept.",
            "Summarize the key takeaways from the text.",
            "Correct the grammatical errors in the sentence.",
            "Generate a creative title for the story.",
            "Classify the text as formal or informal.",
            "Extract the named entities from the sentence.",
            "Paraphrase the following paragraph.",
            "Write a brief description of the product.",
            "Determine the tone of the message.",
        ]

    def generate(self, prompt: str, max_tokens: int = 200) -> str:
        """Generate response based on prompt type."""
        if "Come up with a series of tasks" in prompt:
            return self._generate_instructions()
        elif "Input:" in prompt:
            return self._generate_instance()
        elif "determine whether" in prompt.lower():
            return random.choice(["Classification", "Generation"])
        else:
            return "Sample output for the given input."

    def _generate_instructions(self) -> str:
        selected = random.sample(self.instruction_templates, 5)
        return "\n".join(f"{i + 4}. {inst}" for i, inst in enumerate(selected))

    def _generate_instance(self) -> str:
        inputs = [
            "The cat sat on the mat.",
            "Climate change affects biodiversity.",
            "Machine learning transforms industries.",
        ]
        outputs = [
            "The mat was sat on by the cat.",
            "Climate change has significant impacts on biodiversity.",
            "Industries are being transformed by machine learning.",
        ]
        idx = random.randint(0, len(inputs) - 1)
        # Include Input: prefix so parser can find it
        return f"Input: {inputs[idx]}\nOutput: {outputs[idx]}"
In[37]:
Code
import random
from typing import List, Dict


class SelfInstructGenerator:
    """
    Complete Self-Instruct implementation.

    Generates new instruction-tuning examples by leveraging
    an existing language model to create diverse tasks.
    """

    def __init__(
        self, seed_tasks: List[Dict], model, similarity_threshold: float = 0.7
    ):
        self.task_pool = list(seed_tasks)
        self.model = model
        self.similarity_threshold = similarity_threshold
        self.generated_instructions = set()

        # Track existing instructions for similarity checking
        for task in seed_tasks:
            self.generated_instructions.add(task["instruction"].lower())

    def generate_batch(self, batch_size: int = 10) -> List[Dict]:
        """Generate a batch of new task instances."""
        new_tasks = []
        attempts = 0
        max_attempts = batch_size * 3  # Allow some failures

        while len(new_tasks) < batch_size and attempts < max_attempts:
            attempts += 1

            # Step 1: Generate new instructions
            sampled = random.sample(self.task_pool, min(3, len(self.task_pool)))
            prompt = create_instruction_generation_prompt(sampled)
            raw_output = self.model.generate(prompt)
            instructions = parse_generated_instructions(raw_output)

            for instruction in instructions:
                # Step 2: Apply filters
                if not self._passes_filters(instruction):
                    continue

                # Step 3: Classify task type
                is_classification = self._classify_task(instruction)

                # Step 4: Generate instance
                instance = self._generate_instance(
                    instruction, is_classification
                )

                if instance and self._validate_instance(instance):
                    new_tasks.append(instance)
                    self.generated_instructions.add(instruction.lower())

                    if len(new_tasks) >= batch_size:
                        break

        return new_tasks

    def _passes_filters(self, instruction: str) -> bool:
        """Check if instruction passes all filters."""
        # Similarity filter
        if not filter_by_similarity(
            instruction,
            list(self.generated_instructions),
            self.similarity_threshold,
        ):
            return False

        # Keyword filter
        passed, _ = filter_by_keywords(instruction)
        if not passed:
            return False

        # Length check
        if len(instruction.split()) < 3 or len(instruction.split()) > 150:
            return False

        return True

    def _classify_task(self, instruction: str) -> bool:
        """Determine if task is classification."""
        prompt = create_classification_prompt(instruction)
        response = self.model.generate(prompt, max_tokens=20)
        return "classification" in response.lower()

    def _generate_instance(
        self, instruction: str, is_classification: bool
    ) -> Dict:
        """Generate input-output pair for instruction."""
        # Use demonstrations from pool
        demos = random.sample(self.task_pool, min(2, len(self.task_pool)))

        if is_classification:
            # Output-first for classification
            prompt = create_output_first_prompt(
                instruction,
                ["Yes", "No"],  # Simplified for demo
                demos,
            )
        else:
            # Input-first for generation
            prompt = create_input_first_prompt(instruction, demos)

        response = self.model.generate(prompt)

        # Parse response to extract input and output
        return self._parse_instance(instruction, response)

    def _parse_instance(self, instruction: str, response: str) -> Dict:
        """Parse model response into structured instance."""
        # Simple parsing - production code would be more robust
        input_text = ""
        output_text = ""

        if "Input:" in response and "Output:" in response:
            parts = response.split("Output:")
            input_part = parts[0]
            output_text = parts[1].strip() if len(parts) > 1 else ""

            if "Input:" in input_part:
                input_text = input_part.split("Input:")[1].strip()
        else:
            output_text = response.strip()

        return {
            "instruction": instruction,
            "input": input_text,
            "output": output_text,
        }

    def _validate_instance(self, instance: Dict) -> bool:
        """Final validation of complete instance."""
        length_ok, _ = apply_length_filters(
            instance["instruction"], instance["input"], instance["output"]
        )
        format_ok, _ = apply_format_filters(
            instance["instruction"], instance["input"], instance["output"]
        )
        return length_ok and format_ok
In[38]:
Code
# Run Self-Instruct generation
model = MockLanguageModel()
generator = SelfInstructGenerator(seed_tasks, model, similarity_threshold=0.7)

# Generate a small batch
generated_batch = generator.generate_batch(batch_size=5)
Out[39]:
Console
Generated 5 new task instances:
============================================================

--- Task 1 ---
Instruction: Write a brief description of the product.
Input: Machine learning transforms industries.
Output: Industries are being transformed by machine learning.

--- Task 2 ---
Instruction: Extract the named entities from the sentence.
Input: The cat sat on the mat.
Output: The mat was sat on by the cat.

--- Task 3 ---
Instruction: Explain the concept to a five-year-old.
Input: The cat sat on the mat.
Output: The mat was sat on by the cat.

--- Task 4 ---
Instruction: Classify the text as formal or informal.
Input: Machine learning transforms industries.
Output: Industries are being transformed by machine learning.

--- Task 5 ---
Instruction: Write a professional email about the topic.
Input: The cat sat on the mat.
Output: The mat was sat on by the cat.

Quality and Diversity Metrics

Evaluating Self-Instruct outputs requires measuring both quality and diversity. A large dataset of repetitive instructions is no better than a small, diverse one. In fact, redundant data wastes computational resources during training and may cause the model to overfit to particular patterns. Comprehensive metrics help us understand whether our generation pipeline is producing genuinely useful training data.

Diversity Metrics

Diversity can be measured at multiple levels, from individual word choices to overall structural patterns. Each metric captures a different aspect of what makes a dataset varied and comprehensive:

In[40]:
Code
def compute_diversity_metrics(tasks: List[Dict]) -> Dict:
    """
    Compute diversity metrics for a task collection.
    """
    instructions = [t["instruction"] for t in tasks]

    # Vocabulary diversity
    all_words = []
    for inst in instructions:
        all_words.extend(inst.lower().split())

    vocab_size = len(set(all_words))
    total_words = len(all_words)
    type_token_ratio = vocab_size / total_words if total_words > 0 else 0

    # Root word diversity (approximation via first word)
    verb_starts = [inst.split()[0].lower() for inst in instructions if inst]
    unique_verbs = len(set(verb_starts))
    verb_diversity = unique_verbs / len(verb_starts) if verb_starts else 0

    # Length diversity
    lengths = [len(inst.split()) for inst in instructions]
    length_std = (
        sum((l - sum(lengths) / len(lengths)) ** 2 for l in lengths)
        / len(lengths)
    ) ** 0.5

    return {
        "vocabulary_size": vocab_size,
        "type_token_ratio": type_token_ratio,
        "unique_verb_starts": unique_verbs,
        "verb_diversity": verb_diversity,
        "avg_instruction_length": sum(lengths) / len(lengths),
        "length_std": length_std,
    }
In[41]:
Code
# Combine seed tasks with generated ones for analysis
all_tasks = seed_tasks + generated_batch
diversity_metrics = compute_diversity_metrics(all_tasks)
Out[42]:
Console
Diversity Metrics:
----------------------------------------
  vocabulary_size: 49
  type_token_ratio: 0.681
  unique_verb_starts: 7
  verb_diversity: 0.700
  avg_instruction_length: 7.200
  length_std: 1.077

These metrics provide a quantitative view of the dataset's richness. A type-token ratio above 0.5 and high verb diversity suggest the instructions cover a wide range of actions and topics, rather than repeating the same few patterns. The length standard deviation indicates whether the dataset includes both short, focused instructions and longer, more detailed ones. Together, these measurements give a holistic picture of diversity that helps guide further data generation or filtering decisions.

Out[43]:
Visualization
Frequency of instruction-starting verbs. While common directives like 'Write' and 'Explain' dominate, the long tail of unique verbs indicates a wide variety of task types, from extraction to reasoning, in the generated pool.

Pairwise Similarity Distribution

The distribution of pairwise similarities reveals whether the dataset has clusters of similar instructions or maintains broad diversity. This analysis goes beyond simple metrics to show the actual structure of relationships within the dataset:

In[44]:
Code
def compute_pairwise_similarities(tasks):
    """
    Compute all pairwise ROUGE-L similarities between instructions.
    """
    instructions = [t["instruction"] for t in tasks]
    similarities = []

    for i in range(len(instructions)):
        for j in range(i + 1, len(instructions)):
            sim = compute_rouge_l(instructions[i], instructions[j])
            similarities.append(sim)

    return similarities
In[45]:
Code
pairwise_sims = compute_pairwise_similarities(all_tasks)
Out[46]:
Visualization
Distribution of pairwise ROUGE-L similarity scores between generated instructions. The concentration of scores near zero demonstrates that the pipeline successfully generates distinctive tasks, with only a negligible fraction exceeding the similarity threshold.

A well-functioning Self-Instruct pipeline produces a right-skewed distribution in which most instruction pairs have low similarity. This pattern indicates that the dataset contains many distinct instructions rather than variations on a few themes. The similarity threshold (typically 0.7) prevents the right tail from growing with near-duplicate instructions, maintaining the overall diversity of the pool.
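
A few summary statistics over pairwise_sims make the same point numerically; the 0.7 cutoff here mirrors the filtering threshold used earlier.

# Summarize the pairwise similarity distribution computed above
mean_sim = sum(pairwise_sims) / len(pairwise_sims)
max_sim = max(pairwise_sims)
above_threshold = sum(1 for s in pairwise_sims if s > 0.7)

print(f"Mean pairwise ROUGE-L: {mean_sim:.3f}")
print(f"Max pairwise ROUGE-L:  {max_sim:.3f}")
print(f"Pairs above 0.7:       {above_threshold}/{len(pairwise_sims)}")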

Scaling Self-Instruct

The original Self-Instruct paper generated over 52,000 instructions using GPT-3, starting from just 175 seed tasks. This dramatic expansion demonstrates the power of the bootstrapping approach: minimal human effort yields a large-scale dataset. However, the iterative process shows diminishing returns: early iterations add many novel instructions, while later iterations increasingly generate duplicates or near-duplicates that get filtered out.

In[47]:
Code
from typing import List, Tuple


def simulate_scaling_curve(
    initial_pool_size: int = 175,
    target_size: int = 5000,
    acceptance_rate_decay: float = 0.95,
) -> List[Tuple[int, int, float]]:
    """
    Simulate how acceptance rate changes as pool grows.

    As the pool grows, new instructions are more likely to be
    similar to existing ones, reducing acceptance rate.
    """
    pool_size = initial_pool_size
    history = [(0, pool_size, 1.0)]

    iteration = 0
    acceptance_rate = 1.0

    while pool_size < target_size:
        iteration += 1

        # Acceptance rate decays as pool grows
        acceptance_rate = acceptance_rate_decay ** (
            pool_size / initial_pool_size
        )

        # Generate batch of candidates
        candidates = 100
        accepted = int(candidates * acceptance_rate)

        pool_size += accepted
        history.append((iteration, pool_size, acceptance_rate))

    return history
In[48]:
Code
scaling_history = simulate_scaling_curve()
Out[49]:
Visualization
Instruction pool growth. The total number of accepted tasks grows rapidly at first but slows as the pool becomes saturated.
Acceptance rate decay. The probability of generating a novel instruction drops exponentially as the pool expands.

The acceptance rate decay creates a natural ceiling on dataset size. As more instructions enter the pool, the probability that any new generation is sufficiently different from all existing instructions decreases. This phenomenon reflects a fundamental constraint: the space of "useful, distinct instructions" is large but finite for any given domain and prompt structure.

To push beyond this ceiling, you can employ several strategies that help explore new regions of the instruction space.

  • Expanding seed diversity: Adding more diverse seed tasks opens new instruction spaces. If seeds cover a new domain or task type, the model can generate variations in that direction.
  • Topic-constrained generation: Prompting for instructions in specific underrepresented domains. By explicitly asking for instructions about particular subjects, you can fill gaps in coverage; a prompt sketch for this strategy appears below.
  • Relaxing similarity thresholds: Trading some diversity for volume (with quality trade-offs). This allows more similar instructions through but requires careful evaluation of the resulting dataset.
  • Multiple model sources: Using different models to generate instructions, each with its own biases. Different models may explore different parts of the instruction space, leading to greater overall diversity.
Out[50]:
Visualization
Impact of ROUGE-L similarity threshold on dataset properties. Higher thresholds allow more instructions into the pool (blue) but reduce overall diversity (orange), making 0.7 an optimal balance point between scale and quality.
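
Topic-constrained generation requires only a small change to the prompt builder. The function below is an illustrative variant of create_instruction_generation_prompt, not a component of the original paper's pipeline.

def create_topic_constrained_prompt(sampled_tasks, topic):
    """
    Variant of the instruction-generation prompt that steers the
    model toward an underrepresented topic or domain.
    """
    prompt = f"Come up with a series of tasks about {topic}:\n\n"
    for i, task in enumerate(sampled_tasks, 1):
        prompt += f"{i}. {task['instruction']}\n"
    prompt += f"{len(sampled_tasks) + 1}."
    return prompt


# Example: steer generation toward an underrepresented domain
chemistry_prompt = create_topic_constrained_prompt(seed_tasks[:3], "chemistry")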

Limitations and Impact

Self-Instruct democratized instruction tuning by eliminating the need for large human annotation budgets. Before Self-Instruct, creating instruction-following models required either access to proprietary datasets (like InstructGPT's human demonstrations) or significant manual effort. After Self-Instruct, you could bootstrap instruction data from any sufficiently capable language model. This shift opened up instruction tuning research to a much broader community.

The approach demonstrated that language models contain latent instruction-following knowledge from pre-training that can be "unlocked" through carefully structured prompting and filtering. This insight influenced subsequent work on prompting strategies and synthetic data generation, showing that creative use of existing capabilities can substitute for expensive data collection.

However, Self-Instruct has important limitations that you must understand. The generated data inherits biases and errors from the source model. If the model has misconceptions about certain topics, those misconceptions propagate into the training data. The iterative nature can amplify these issues: errors in early generations influence later generations, creating feedback loops that entrench problematic patterns.

Quality control remains challenging. While heuristic filters catch obvious problems (length issues, duplicates, formatting errors), they cannot verify factual accuracy or catch subtle logical errors. A Self-Instruct dataset might contain confident but incorrect explanations that a human annotator would catch. This limitation means that Self-Instruct data often requires additional human review for high-stakes applications.

The approach also struggles with task types that require genuine creativity or specialized knowledge. Generated instructions tend to cluster around patterns the model has seen frequently during pre-training. Truly novel task formulations, or tasks requiring deep domain expertise, rarely emerge from the self-instruct process. The model essentially remixes what it knows rather than inventing fundamentally new concepts.

Finally, there are concerns about model collapse, where models trained on synthetic data from other models progressively lose capability or diversity. Training exclusively on Self-Instruct data without human-quality checks can lead to models that generate plausible-sounding but degraded outputs. You can mix Self-Instruct data with human-annotated examples to maintain quality anchoring and prevent this degradation over successive training cycles.
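
One simple way to apply this in practice is to cap the synthetic share of the training mix. The sketch below is an illustrative heuristic with a placeholder ratio, not a recommendation from the Self-Instruct paper.

import random


def mix_training_data(human_tasks, synthetic_tasks, max_synthetic_fraction=0.8):
    """
    Combine human-annotated and Self-Instruct data, capping the
    synthetic share to preserve a human quality anchor.
    """
    # Largest synthetic count that keeps its share at or below the cap
    max_synthetic = int(
        len(human_tasks) * max_synthetic_fraction / (1 - max_synthetic_fraction)
    )
    sampled_synthetic = random.sample(
        synthetic_tasks, min(max_synthetic, len(synthetic_tasks))
    )
    mixed = list(human_tasks) + sampled_synthetic
    random.shuffle(mixed)
    return mixed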

Despite these limitations, Self-Instruct remains influential. It established synthetic data generation as a viable paradigm for instruction tuning and paved the way for more sophisticated approaches like Evol-Instruct and WizardLM, which we'll encounter when discussing instruction format in the next chapter.

Summary

Self-Instruct enables language models to generate their own instruction-tuning data through an iterative bootstrapping process. Starting with a small set of human-written seed tasks, the pipeline generates new instructions, classifies them as classification or generation tasks, creates input-output instances, and filters for quality before adding them to the growing task pool.

The key components of the pipeline include the following.

  • Instruction generation using in-context learning with sampled demonstrations
  • Task classification to determine whether output-first or input-first instance generation is appropriate
  • Instance generation strategies tailored to task type
  • Multi-stage filtering using ROUGE similarity, length constraints, and keyword blocking

Diversity emerges from careful prompt design and the iterative nature of the process, though acceptance rates decay as the pool grows and novel instructions become harder to generate. The approach trades some quality for scale, making it suitable for bootstrapping but often requiring human curation for production use.

Self-Instruct demonstrated that large language models contain substantial instruction-following capability that can be activated through the right training data, even when that data is generated by the model itself.

