
Fine-tuning Data Efficiency
One of the most remarkable properties of pre-trained language models is their ability to adapt to new tasks with surprisingly few examples. Where traditional machine learning often requires thousands or millions of labeled samples, a pre-trained transformer can achieve strong performance with just dozens or hundreds. This chapter explores the techniques and strategies that enable this data efficiency, from few-shot fine-tuning methods to data augmentation approaches designed specifically for text.
Understanding data efficiency is crucial for practical applications. Most real-world NLP problems don't come with massive labeled datasets. Medical text classification might have only a few hundred expert-annotated examples. A new product category might have just a handful of customer reviews. Legal document analysis often relies on expensive attorney annotations. The ability to achieve good performance with limited data isn't just a nice property; it's often the difference between a viable project and an impossible one.
The Sample Efficiency Spectrum
Data efficiency in language AI spans a wide spectrum, from tasks requiring no task-specific data at all to those benefiting from millions of examples. Understanding where your problem falls on this spectrum helps you choose the right approach and set realistic expectations for what can be achieved with your available resources.
Zero-shot inference uses the model's pre-trained capabilities without any task-specific fine-tuning. As we saw with GPT-3's in-context learning in Part XVIII, large language models can perform many tasks simply through careful prompting. The model generalizes from its pre-training data to new tasks, drawing on the vast knowledge encoded in its parameters during the initial training phase.
Few-shot prompting provides a handful of examples in the prompt itself, guiding the model's behavior without updating any weights. This approach works well for tasks similar to what the model encountered during pre-training but doesn't create a specialized model. The examples serve as demonstrations that help the model understand the desired input-output relationship.
Few-shot fine-tuning updates model weights using a small labeled dataset, typically 10 to 1,000 examples. This creates a specialized model that often outperforms few-shot prompting, especially for tasks that differ significantly from pre-training. By actually modifying the model's parameters, few-shot fine-tuning creates a more permanent adaptation than prompt-based approaches.
Standard fine-tuning uses moderate amounts of data, typically 1,000 to 100,000 examples. This is the regime where most practical fine-tuning occurs, balancing data collection costs against model performance. At this scale, practitioners can expect reliable, reproducible results with lower variance across training runs.
Large-scale fine-tuning applies when you have hundreds of thousands or millions of labeled examples. At this scale, even smaller pre-trained models can achieve excellent results, and the performance gap between larger and smaller pre-trained models narrows. The abundance of task-specific data can compensate for less sophisticated pre-training.
The key insight is that pre-training fundamentally changes the relationship between data quantity and model performance. A pre-trained BERT fine-tuned on 100 examples often outperforms a randomly initialized model trained on 10,000 examples. The representations learned during pre-training provide such strong priors that far less task-specific data is needed. This shift in the learning curve represents one of the most significant practical advances in modern NLP.
Few-Shot Fine-Tuning
Few-shot fine-tuning adapts a pre-trained model using only a handful of labeled examples per class. This setting is particularly challenging because small datasets are prone to overfitting, and variance across different training runs can be high. The fundamental tension in few-shot learning lies between extracting maximum information from limited examples while avoiding the trap of memorizing noise or spurious patterns.
Pattern-Exploiting Training
Pattern-Exploiting Training (PET) reformulates classification tasks as cloze-style fill-in-the-blank problems that match the pre-training objective. Instead of training a separate classification head, PET leverages the model's masked language modeling capability. This approach is clever because it reuses exactly the skill the model developed during pre-training, rather than asking it to learn an entirely new type of prediction.
Consider sentiment classification. Rather than adding a classification layer that must learn from scratch how to map hidden states to class probabilities, PET transforms the input into a format the model already understands:
Input: "This movie was terrible."
Pattern: "This movie was terrible. It was [MASK]."
Verbalizer: positive → "great", negative → "bad"
The model predicts probabilities for tokens in the mask position. We map specific tokens (verbalizers) to class labels. Since the model was pre-trained on exactly this kind of task, it can make reasonable predictions even with very few examples. The model already knows that after describing something as "terrible," the word "bad" is more likely than "great" in the masked position, giving it a strong prior even before seeing any task-specific training data.
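To make this concrete, here is a minimal sketch of PET-style prediction using Hugging Face transformers. The bert-base-uncased checkpoint, the pattern, and the verbalizer words simply mirror the example above; a real PET setup would also fine-tune on the labeled examples and often ensembles several patterns.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Verbalizer: map class labels to single tokens the MLM can predict.
verbalizer = {"positive": "great", "negative": "bad"}

def pet_predict(text):
    # Wrap the input in a cloze-style pattern ending in a [MASK] token.
    prompt = f"{text} It was {tokenizer.mask_token}."
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # Locate the mask position and compare verbalizer token scores there.
    mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
    scores = {
        label: logits[0, mask_pos, tokenizer.convert_tokens_to_ids(word)].item()
        for label, word in verbalizer.items()
    }
    return max(scores, key=scores.get)

print(pet_predict("This movie was terrible."))  # expected: "negative"
```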
The pattern design significantly impacts performance because it determines how well the classification task aligns with what the model learned during pre-training. Effective patterns share several characteristics:
- They sound natural and fluent, resembling text the model would have encountered during pre-training.
- They match the pre-training distribution, using constructions common in the training corpus.
- They place the mask in a semantically meaningful position where the predicted word genuinely indicates the class.
- They use verbalizers that clearly distinguish classes, selecting words whose meanings unambiguously correspond to each label.
SetFit: Contrastive Few-Shot Learning
SetFit (Sentence Transformer Fine-tuning) takes a different approach, using contrastive learning to maximize data efficiency. Rather than reformulating the task to match pre-training, SetFit amplifies the available training signal through strategic pair generation. The method works in two stages that together transform a small set of labeled examples into a much richer training signal.
Stage 1: Contrastive fine-tuning. Generate pairs of examples from the few-shot dataset. Pairs with the same label are positive pairs; pairs with different labels are negative pairs. Fine-tune a sentence transformer to produce similar embeddings for positive pairs and dissimilar embeddings for negative pairs. This stage teaches the model which examples should be grouped together in the embedding space.
Stage 2: Classification head training. Use the fine-tuned sentence transformer to embed all training examples. Train a simple classifier (logistic regression or small neural network) on these embeddings. Because the embeddings now cluster by class, even a simple classifier can achieve strong performance.
SetFit's power comes from data amplification through pairing. With $n$ examples per class and $k$ classes, you can generate $k\binom{n}{2}$ positive pairs and $\binom{k}{2}n^2$ negative pairs. Even 8 examples per class across two classes yields well over a hundred training pairs for the contrastive objective. This combinatorial explosion transforms a seemingly inadequate training set into one rich enough for meaningful learning.
Stability in Few-Shot Settings
Few-shot fine-tuning exhibits high variance across runs, which represents one of its most challenging aspects. Different random seeds for weight initialization, data ordering, or the specific examples selected can dramatically change results. A model that achieves 90% accuracy in one training run might only reach 70% in another, even with identical hyperparameters. Several techniques improve stability and make results more reliable.
Multiple random restarts train the model multiple times with different seeds and either ensemble the predictions or select the best model based on a small validation set. This reduces the risk of unlucky initialization and provides a more representative picture of achievable performance.
Training set sampling creates multiple few-shot subsets from a larger pool. If you have 100 labeled examples but want to simulate 10-shot learning, create multiple 10-shot subsets and train on each. This helps identify robust patterns that persist across different example selections rather than patterns specific to one particular subset.
Gradual unfreezing starts by training only the classification head, then progressively unfreezes deeper layers. This prevents early training instability from corrupting pre-trained representations. By allowing the classification head to stabilize before introducing changes to earlier layers, the model maintains the valuable features learned during pre-training.
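A minimal sketch of this schedule in PyTorch, assuming a Hugging Face BertForSequenceClassification-style model whose encoder layers are exposed as model.bert.encoder.layer; the stage schedule is an illustrative choice:

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

def unfreeze_top_layers(model, num_layers):
    """Freeze the whole model, then unfreeze the classification head
    plus the top `num_layers` transformer layers."""
    for param in model.parameters():
        param.requires_grad = False
    for param in model.classifier.parameters():
        param.requires_grad = True
    if num_layers > 0:
        for layer in model.bert.encoder.layer[-num_layers:]:
            for param in layer.parameters():
                param.requires_grad = True

# Illustrative schedule: head only first, then progressively deeper layers.
for n in [0, 2, 4, 8]:
    unfreeze_top_layers(model, n)
    # train_one_stage(model, ...)  # hypothetical training step per stage
```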
As we discussed in the previous chapter on fine-tuning learning rates, using lower learning rates is particularly important in few-shot settings. The limited data provides weak gradient signals, so aggressive updates can quickly overwrite useful pre-trained knowledge. A learning rate that works well with 10,000 examples might be catastrophically high with 100 examples.
Data Augmentation for NLP
Data augmentation artificially expands the training set by creating modified versions of existing examples. While augmentation has been transformative for computer vision (random crops, rotations, color changes), text augmentation is more challenging because small changes can alter meaning dramatically. Changing a single word can flip the sentiment of a sentence, and rearranging phrases can make text ungrammatical. The goal is to introduce meaningful variation while preserving the label's validity.
Synonym Replacement
The simplest augmentation replaces words with synonyms:
Original: "The quick brown fox jumps over the lazy dog."
Augmented: "The fast brown fox leaps over the lazy dog."
This preserves the overall meaning while introducing lexical variation. The model learns that both "quick" and "fast," or "jumps" and "leaps," can express similar concepts, reducing its dependence on specific vocabulary. However, synonym replacement has limitations: not all words have good synonyms, context matters (a "bank" isn't always replaceable with "shore"), and some domains have specialized vocabulary where synonyms don't exist.
Back-Translation
Back-translation generates paraphrases by translating text to another language and back:
Original (English): "I love this product"
→ German: "Ich liebe dieses Produkt"
→ English: "I love this product" or "I adore this item"
The translation process naturally introduces variation in word choice and sentence structure while typically preserving meaning. Each translation step makes decisions about how to express concepts, and these decisions may differ from the original phrasing. Using multiple intermediate languages generates more diverse paraphrases, as each language brings its own structural biases and vocabulary choices to the translation process.
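A minimal back-translation sketch using the transformers translation pipeline; the Helsinki-NLP MarianMT checkpoints named below are assumed to be available on the Hugging Face Hub:

```python
from transformers import pipeline

# Round-trip translation: English -> German -> English.
to_de = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")
to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-de-en")

def back_translate(text):
    german = to_de(text)[0]["translation_text"]
    return to_en(german)[0]["translation_text"]

print(back_translate("I love this product"))
```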
Random Operations
Several random operations can augment text data, each providing different types of robustness.
Random insertion adds random words from the vocabulary into sentences. While this might seem likely to introduce noise, it encourages the model to focus on key words rather than word order. The model learns to identify important content words even when surrounded by distractors.
Random swap exchanges the positions of two words in a sentence. Again, this seems potentially harmful, but it helps models become robust to minor ordering variations. Many classification tasks don't depend on exact word order, and this augmentation reflects that reality.
Random deletion removes words with some probability. This simulates incomplete or noisy input and prevents over-reliance on specific words. If a model can still classify correctly with some words missing, it has learned more robust patterns than one requiring all words present.
These random operations work best when applied conservatively. Augmenting too aggressively can create examples that are too different from real data, training the model on a distribution that doesn't match what it will encounter at test time.
Mixup for Text
Mixup, originally developed for images, creates new training examples by interpolating between existing ones. The intuition behind mixup is simple but powerful: by training on blended examples with soft labels, we encourage the model to behave linearly in the space between training points, leading to smoother decision boundaries. For text, this is less straightforward since we can't meaningfully interpolate discrete tokens. You cannot average the word "happy" and "sad" to get a word that is half-positive. Instead, text mixup operates in embedding space, where continuous representations make interpolation meaningful:
Given two examples $(x_i, y_i)$ and $(x_j, y_j)$, we create a synthetic example by interpolating their embeddings and labels:

$$\tilde{x} = \lambda x_i + (1 - \lambda)\, x_j$$
$$\tilde{y} = \lambda y_i + (1 - \lambda)\, y_j$$

where:
- $\tilde{x}$: the embedding vector for the new synthetic example
- $\tilde{y}$: the label for the new example (a "soft" target)
- $\lambda$: the mixing coefficient sampled from a Beta distribution ($\lambda \sim \mathrm{Beta}(\alpha, \alpha)$), controlling the ratio between the two examples
- $x_i, x_j$: the embeddings of the original examples (typically [CLS] token embeddings or pooled representations)
- $y_i, y_j$: the labels of the original examples
The first equation creates a new embedding that lies on the line segment connecting the two original embeddings in the high-dimensional space. When $\lambda = 1$, we recover the first example exactly; when $\lambda = 0$, we recover the second. Intermediate values produce embeddings that blend characteristics of both inputs.
The second equation applies the same interpolation to the labels. This is what makes mixup so distinctive: rather than assigning a hard label to the synthetic example, we assign a soft target that reflects the mixture. This process smooths the decision boundary in the embedding space by teaching the model that intermediate regions should have intermediate predictions. For instance, if a positive example ($y_i = 1$) is mixed with a negative example ($y_j = 0$) with $\lambda = 0.7$, the target label becomes $\tilde{y} = 0.7$, encouraging the model to reflect this ambiguity in its prediction confidence. The model learns that it should be more confident near pure examples and less confident in the interpolated regions between classes.
The Beta distribution for sampling $\lambda$ is chosen because it naturally produces values between 0 and 1, and its shape parameter $\alpha$ controls how often extreme values (near 0 or 1) versus moderate values (near 0.5) are sampled. Lower values of $\alpha$ favor interpolations close to one of the original examples, while higher values encourage more balanced mixing.
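A minimal NumPy sketch of embedding-space mixup; the Beta parameter alpha=0.4 and the toy embeddings are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def mixup_embeddings(x_i, x_j, y_i, y_j, alpha=0.4):
    """Interpolate two embedding/label pairs with a Beta-sampled coefficient."""
    lam = rng.beta(alpha, alpha)
    x_mix = lam * x_i + (1 - lam) * x_j
    y_mix = lam * y_i + (1 - lam) * y_j  # soft target between 0 and 1
    return x_mix, y_mix

# Toy usage: 4-dimensional embeddings, one positive and one negative example.
x_pos, x_neg = rng.normal(size=4), rng.normal(size=4)
x_mix, y_mix = mixup_embeddings(x_pos, x_neg, y_i=1.0, y_j=0.0)
print(y_mix)  # a value between 0 and 1 reflecting the mix
```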
Contextual Augmentation
More sophisticated augmentation uses language models themselves to generate variations. Given a sentence, mask out one or more words and use a masked language model to predict replacements:
Original: "The movie was incredibly boring."
Masked: "The movie was incredibly [MASK]."
Augmented: "The movie was incredibly dull." (from MLM prediction)
This approach generates contextually appropriate substitutions that maintain grammaticality better than random synonym replacement. The language model considers the full context when selecting a replacement word, ensuring that the augmented word fits naturally in its surroundings. A word like "dull" is chosen because the model has learned that it appears in similar contexts to "boring."
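A minimal contextual-augmentation sketch built on the transformers fill-mask pipeline; masking a single known word is a simplification of approaches that mask positions at random:

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

def contextual_augment(text, target_word, top_k=3):
    """Replace `target_word` with contextually plausible MLM alternatives."""
    masked = text.replace(target_word, fill_mask.tokenizer.mask_token, 1)
    predictions = fill_mask(masked, top_k=top_k)
    # Skip the case where the model just predicts the original word back.
    return [p["sequence"] for p in predictions
            if p["token_str"].strip() != target_word]

for variant in contextual_augment("The movie was incredibly boring.", "boring"):
    print(variant)
```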
Generative models can create even more diverse augmentations. Given a few examples of a class, a language model can generate entirely new examples in the same style, potentially multiplying the effective dataset size many times over. This approach can create examples that express the same sentiment or class membership through entirely different phrasings, providing richer training signal than modifications of existing examples.
Sample Efficiency Patterns
Why are pre-trained models so sample-efficient? Several interconnected factors contribute to this remarkable property, each building on the others to create models that can learn from minimal supervision.
Informative Priors from Pre-training
Pre-training encodes massive amounts of linguistic knowledge into model weights. The model learns syntax, semantics, factual knowledge, and reasoning patterns from billions of tokens of text. This knowledge acts as a strong prior that guides learning on new tasks. Rather than starting from a blank slate, the model begins with rich expectations about how language works.
Consider a sentiment classifier. A randomly initialized model must learn from scratch that "excellent" is positive and "terrible" is negative. It has no prior knowledge connecting these words to positive or negative sentiments. A pre-trained model already knows these associations from context. During pre-training, the model observed sentences like "The excellent performance earned applause" and "The terrible weather ruined the picnic," learning that "excellent" typically appears in positive contexts and "terrible" in negative ones. Fine-tuning merely teaches the model how to apply this existing knowledge to the classification format, connecting what it already knows to the specific output structure required.
Feature Reuse
As we discussed in Part XXIV on transfer learning, pre-trained representations are highly reusable across different tasks. Lower layers encode general linguistic patterns like part-of-speech and syntax. These foundational features are useful for virtually any language task. Middle layers capture phrase-level semantics, understanding how words combine to form meaningful units. Upper layers represent more task-relevant features that can be adapted to specific applications.
When fine-tuning on limited data, most of these features remain useful. The model doesn't need to learn basic language understanding from scratch; it only needs to learn task-specific adjustments on top of already-good representations. This is analogous to how a trained musician learning a new instrument doesn't need to relearn music theory or rhythm. They already possess foundational knowledge that transfers, and only need to learn the specifics of the new instrument.
Label Smoothing Effects
Pre-training also provides implicit regularization that helps prevent overfitting in few-shot settings. The representations aren't perfectly aligned with the downstream task, which acts like noise injection. This imperfect alignment prevents the model from immediately memorizing the few training examples, forcing it to find more general patterns.
Additionally, the pre-trained classification token ([CLS] in BERT) isn't specialized for any particular task. This "neutral initialization" means the model must learn genuine class distinctions rather than exploiting spurious correlations that happen to align with the initialization. If the CLS token were initialized to produce outputs that accidentally correlated with the training labels, the model might find shortcuts rather than learning meaningful features. The neutral initialization ensures the model must earn its performance through genuine learning.
Implementation
Let's implement several data efficiency techniques, starting with data augmentation.
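The sketches below assume NLTK's WordNet as the synonym source and define a tiny illustrative sentiment dataset (4 examples per class) that the rest of this section reuses:

```python
import random

import nltk
from nltk.corpus import wordnet

nltk.download("wordnet", quiet=True)  # synonym source for the augmenters below
random.seed(42)

# A tiny illustrative few-shot sentiment dataset: 4 positive, 4 negative.
train_texts = [
    "This movie was absolutely wonderful and touching.",
    "A brilliant film with outstanding performances.",
    "I loved every minute of this heartwarming story.",
    "An excellent, beautifully shot piece of cinema.",
    "This movie was a complete waste of time.",
    "Terrible acting and a boring, predictable plot.",
    "I hated this film; it was dull and lifeless.",
    "An awful mess with no redeeming qualities.",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = positive, 0 = negative
```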
Now let's implement the basic augmentation techniques.
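The following functions loosely follow the EDA (Easy Data Augmentation) recipe of Wei and Zou (2019); they are simplified sketches rather than a faithful reimplementation:

```python
def get_synonyms(word):
    """Collect WordNet synonyms for `word`, excluding the word itself."""
    synonyms = set()
    for syn in wordnet.synsets(word):
        for lemma in syn.lemmas():
            candidate = lemma.name().replace("_", " ").lower()
            if candidate != word.lower():
                synonyms.add(candidate)
    return sorted(synonyms)

def synonym_replacement(words, n):
    """Replace up to `n` distinct words that have synonyms."""
    new_words = list(words)
    candidates = [w for w in set(words) if get_synonyms(w)]
    random.shuffle(candidates)
    for word in candidates[:n]:
        synonym = random.choice(get_synonyms(word))
        new_words = [synonym if w == word else w for w in new_words]
    return new_words

def random_insertion(words, n):
    """Insert synonyms of randomly chosen words at random positions, `n` times."""
    new_words = list(words)
    for _ in range(n):
        for _attempt in range(10):  # retry until we hit a word with synonyms
            synonyms = get_synonyms(random.choice(new_words))
            if synonyms:
                new_words.insert(random.randint(0, len(new_words)),
                                 random.choice(synonyms))
                break
    return new_words

def random_swap(words, n):
    """Swap two randomly chosen positions, `n` times."""
    new_words = list(words)
    for _ in range(n):
        if len(new_words) < 2:
            break
        i, j = random.sample(range(len(new_words)), 2)
        new_words[i], new_words[j] = new_words[j], new_words[i]
    return new_words

def random_deletion(words, p):
    """Delete each word with probability `p`, keeping at least one word."""
    kept = [w for w in words if random.random() > p]
    return kept if kept else [random.choice(words)]
```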
Let's see these augmentation techniques in action.
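A quick demonstration on a familiar sentence (outputs vary with the random seed):

```python
sentence = "The quick brown fox jumps over the lazy dog"
words = sentence.split()

print("Original:           ", sentence)
print("Synonym replacement:", " ".join(synonym_replacement(words, n=2)))
print("Random insertion:   ", " ".join(random_insertion(words, n=2)))
print("Random swap:        ", " ".join(random_swap(words, n=2)))
print("Random deletion:    ", " ".join(random_deletion(words, p=0.2)))
```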
These examples demonstrate how each technique modifies the input. Synonym replacement and random insertion introduce lexical variety, while random swap and deletion force the model to be robust to noise and structural changes. Note that while meaning is largely preserved, some fluency is lost, which is a trade-off in text augmentation.
Now let's implement a complete augmentation pipeline that combines these techniques.
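A minimal pipeline sketch that applies one randomly chosen operation per augmented variant; the alpha_* fractions and p_rd probability match the parameters summarized later in this chapter:

```python
def eda_augment(sentence, num_aug=3,
                alpha_sr=0.1, alpha_ri=0.1, alpha_rs=0.1, p_rd=0.1):
    """Produce `num_aug` variants, each from one randomly chosen operation.
    The alpha_* values are the fraction of words each operation may change."""
    words = sentence.split()
    n_sr = max(1, int(alpha_sr * len(words)))
    n_ri = max(1, int(alpha_ri * len(words)))
    n_rs = max(1, int(alpha_rs * len(words)))
    operations = [
        lambda w: synonym_replacement(w, n_sr),
        lambda w: random_insertion(w, n_ri),
        lambda w: random_swap(w, n_rs),
        lambda w: random_deletion(w, p_rd),
    ]
    return [" ".join(random.choice(operations)(words)) for _ in range(num_aug)]
```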
Let's augment our small training set and see the expansion.
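Keeping each original and adding three variants gives the fourfold expansion discussed below:

```python
augmented_texts, augmented_labels = [], []
for text, label in zip(train_texts, train_labels):
    augmented_texts.append(text)  # keep the original example
    augmented_labels.append(label)
    for variant in eda_augment(text, num_aug=3):  # plus three variants each
        augmented_texts.append(variant)
        augmented_labels.append(label)

print(f"Original size:  {len(train_texts)}")      # 8
print(f"Augmented size: {len(augmented_texts)}")  # 32 (a 4x expansion)
```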
The dataset size has expanded by a factor of four, providing significantly more training signal. The examples show that the augmented versions retain the core sentiment of the original ("positive" or "negative") while introducing variations in phrasing, which helps prevent overfitting to specific words.
Implementing SetFit-Style Training
Now let's implement the core idea behind SetFit: generating contrastive pairs from few-shot data.
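A minimal pair-generation sketch over the toy dataset defined earlier:

```python
from itertools import combinations

def generate_pairs(texts, labels):
    """Build (text_a, text_b, same_class) pairs from a few-shot dataset."""
    return [
        (texts[i], texts[j], float(labels[i] == labels[j]))
        for i, j in combinations(range(len(texts)), 2)
    ]

pairs = generate_pairs(train_texts, train_labels)
num_pos = sum(1 for _, _, same in pairs if same == 1.0)
print(f"{len(pairs)} pairs total: {num_pos} positive, "
      f"{len(pairs) - num_pos} negative")
# 28 pairs total: 12 positive, 16 negative
```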
This demonstrates how few-shot data can be amplified. From just 8 examples (4 per class), we generated 28 training pairs (12 positive, 16 negative) for contrastive learning.
Few-Shot Fine-Tuning with Sentence Transformers
Let's implement few-shot fine-tuning using the SetFit approach with a pre-trained sentence transformer.
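The sketch below uses the sentence-transformers library with the small all-MiniLM-L6-v2 checkpoint; any sentence-embedding model would work in its place:

```python
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

# A small, fast sentence-embedding model; the checkpoint name is assumed
# to be available on the Hugging Face Hub.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
```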
First, let's see how well the pre-trained embeddings work without any fine-tuning.
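A baseline sketch: embed the 8 training examples with the frozen encoder and fit a logistic-regression head. The small test set is illustrative:

```python
# A small held-out test set, illustrative only.
test_texts = [
    "A wonderful, moving film I would happily watch again.",
    "Superb direction and a truly captivating story.",
    "Boring, predictable, and far too long.",
    "One of the worst movies I have ever seen.",
]
test_labels = [1, 1, 0, 0]

train_emb = encoder.encode(train_texts)
test_emb = encoder.encode(test_texts)

clf = LogisticRegression(max_iter=1000).fit(train_emb, train_labels)
print(f"Baseline accuracy with 8 training examples: "
      f"{clf.score(test_emb, test_labels):.2f}")
```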
The baseline performance demonstrates that pre-trained sentence embeddings capture significant semantic information even without task-specific tuning. Achieving accuracy well above random guessing (50%) with just 8 examples confirms the value of transfer learning.
Now let's compare with augmented training data.
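The same classifier trained on the augmented set from earlier:

```python
aug_emb = encoder.encode(augmented_texts)
clf_aug = LogisticRegression(max_iter=1000).fit(aug_emb, augmented_labels)
print(f"Accuracy with augmented training data: "
      f"{clf_aug.score(test_emb, test_labels):.2f}")
```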
Data augmentation typically yields a performance improvement by helping the model generalize from the limited examples. By seeing variations in phrasing and word choice, the classifier becomes less reliant on specific keywords and more focused on the underlying sentiment.
Contrastive Fine-Tuning Implementation
For a more sophisticated approach, let's implement contrastive fine-tuning on the sentence embeddings.
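A sketch of SetFit's stage-one contrastive fine-tuning using sentence-transformers' ContrastiveLoss and the classic .fit training API; the margin and epoch count are illustrative choices:

```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, losses

# Wrap the pairs from generate_pairs() for ContrastiveLoss
# (label 1.0 = same class, 0.0 = different class).
examples = [InputExample(texts=[a, b], label=same) for a, b, same in pairs]
loader = DataLoader(examples, shuffle=True, batch_size=8)
loss = losses.ContrastiveLoss(model=encoder, margin=0.5)

# A short fine-tuning run on the contrastive objective.
encoder.fit(train_objectives=[(loader, loss)], epochs=5,
            show_progress_bar=False)

# Stage two: a simple head on the fine-tuned embeddings.
clf_ft = LogisticRegression(max_iter=1000).fit(
    encoder.encode(train_texts), train_labels)
print(f"Accuracy after contrastive fine-tuning: "
      f"{clf_ft.score(encoder.encode(test_texts), test_labels):.2f}")
```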
Let's visualize how the embeddings cluster before any task-specific training.
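A quick PCA projection of the pre-trained embeddings (train_emb was computed before any fine-tuning, so this shows the untouched embedding space):

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

coords = PCA(n_components=2).fit_transform(train_emb)
colors = ["tab:green" if y == 1 else "tab:red" for y in train_labels]

plt.scatter(coords[:, 0], coords[:, 1], c=colors)
plt.title("Pre-trained sentence embeddings (PCA projection)")
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.show()
```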
The visualization shows that pre-trained sentence embeddings already capture sentiment distinctions. The positive and negative examples form relatively distinct clusters even without any fine-tuning. This is why few-shot learning works: the pre-trained model has already learned meaningful representations that transfer to new tasks.
Now let's visualize how data augmentation affects the embedding space by comparing original and augmented embeddings.
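Projecting original and augmented embeddings together shows how augmentation populates the space around each original example:

```python
import numpy as np

# Reuse the pre-fine-tuning embeddings of originals and augmented variants.
coords = PCA(n_components=2).fit_transform(np.vstack([train_emb, aug_emb]))
n = len(train_texts)

plt.scatter(coords[:n, 0], coords[:n, 1], marker="o", label="original")
plt.scatter(coords[n:, 0], coords[n:, 1], marker="x", alpha=0.5,
            label="augmented")
plt.legend()
plt.title("Original vs. augmented examples in embedding space")
plt.show()
```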
Analyzing Sample Efficiency
Let's examine how performance scales with training data size.
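A minimal sketch that resamples k-shot subsets from our (tiny) pool and measures the mean and variance of accuracy; with only 4 examples per class available, the curve stops at 4-shot:

```python
shot_sizes = [1, 2, 4]  # examples per class
pos_idx = [i for i, y in enumerate(train_labels) if y == 1]
neg_idx = [i for i, y in enumerate(train_labels) if y == 0]

for k in shot_sizes:
    accs = []
    for seed in range(5):  # several resamples to estimate run-to-run variance
        rng = np.random.default_rng(seed)
        idx = (list(rng.choice(pos_idx, k, replace=False))
               + list(rng.choice(neg_idx, k, replace=False)))
        clf = LogisticRegression(max_iter=1000).fit(
            train_emb[idx], [train_labels[i] for i in idx])
        accs.append(clf.score(test_emb, test_labels))
    print(f"{k}-shot: mean acc = {np.mean(accs):.2f}, std = {np.std(accs):.2f}")
```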
The sample efficiency curve reveals several important patterns. First, the steepest improvements come from the first few examples. Going from 1 to 2 examples per class often provides a larger accuracy gain than going from 4 to 8. Second, variance decreases with more samples. Few-shot learning with just 1-2 examples shows high run-to-run variance, making results less reliable. Third, pre-trained models make the curve much steeper than randomly initialized models would show. The model already understands language; it just needs a few examples to calibrate for the specific task.
Key Parameters
The key parameters used in these data efficiency techniques are:
- alpha_sr / alpha_ri / alpha_rs: Percentage of words to change for synonym replacement, random insertion, and random swap.
- p_rd: Probability of deleting a word in random deletion.
- margin: Distance threshold in contrastive loss. Negative pairs are pushed apart until their distance exceeds this margin.
- max_iter: Maximum iterations for the Logistic Regression solver.
Small Data Strategies
When working with limited data, several strategies maximize your chances of success.
Leverage Pre-training Alignment
Choose pre-trained models whose training data aligns with your target domain. A model pre-trained on scientific papers will be more sample-efficient for scientific text classification than a model trained on web text. This domain alignment provides stronger priors for the specific vocabulary and style of your target task.
Use Task Reformulation
When possible, reformulate your task to better match the pre-training objective. Classification as fill-in-the-blank (PET), question answering as span extraction, and other reformulations can dramatically improve few-shot performance by leveraging what the model already knows how to do.
Apply Regularization Aggressively
With limited data, overfitting is the primary enemy. Apply strong regularization: high dropout, weight decay, early stopping based on validation loss (even with few validation examples), and data augmentation. It's better to underfit slightly than to memorize the few training examples.
Ensemble Over Random Seeds
Train multiple models with different random seeds and combine their predictions. This reduces variance and provides more stable results than any single model. With few-shot data, the specific random initialization and training order can significantly impact results.
Consider Label Efficiency vs. Computation
Sometimes it's better to use a larger pre-trained model with fewer labeled examples than a smaller model with more labels. The next part of this book covers parameter-efficient fine-tuning methods like LoRA that make it practical to adapt very large models with limited computational resources.
Limitations and Impact
Data augmentation and few-shot techniques have important limitations that practitioners must understand. Text augmentation, unlike image augmentation, can easily change meaning or introduce grammatical errors. Synonym replacement might substitute words with subtly different connotations ("slender" vs. "skinny"), and random operations can create awkward or nonsensical sentences. The quality of augmented data is often lower than real data, meaning there are diminishing returns to aggressive augmentation.
Few-shot fine-tuning also exhibits concerning instabilities. Results can vary dramatically based on which specific examples are selected for training, the random seed used for initialization, and the order in which examples are presented. A model that appears excellent on one run might perform poorly on another. This variance is especially problematic when the "best" model is selected based on a small validation set, as the selection itself may just be fitting to noise.
Domain shift presents another challenge. Few-shot techniques work best when the target task is similar to what the model encountered during pre-training. For highly specialized domains with unique vocabulary, unusual document structures, or technical content unlike anything in the pre-training data, more labeled examples may be necessary regardless of the techniques applied.
Despite these limitations, data efficiency techniques have transformed what's practically achievable with NLP. Tasks that once required months of annotation can now be tackled with days of labeling. This democratization allows smaller organizations, practitioners with limited budgets, and applications in data-scarce domains to benefit from state-of-the-art language AI. The combination of large pre-trained models with smart adaptation strategies has fundamentally shifted the trade-offs in building NLP systems.
Summary
This chapter explored techniques for maximizing the value of limited labeled data in fine-tuning.
The sample efficiency spectrum ranges from zero-shot inference through few-shot fine-tuning to large-scale adaptation. Pre-trained models fundamentally change the relationship between data quantity and model performance, enabling strong results with far fewer examples than traditional approaches require.
Few-shot fine-tuning methods like Pattern-Exploiting Training (PET) reformulate tasks to match pre-training objectives, while SetFit uses contrastive learning to amplify few-shot data through pair generation. These techniques can achieve strong performance with just 8-32 labeled examples per class.
Data augmentation techniques for text include synonym replacement, back-translation, random operations (deletion, swap, insertion), embedding-space mixup, and contextual augmentation using language models. While less straightforward than image augmentation, these methods can meaningfully expand small datasets.
Sample efficiency patterns emerge from pre-training, which provides strong linguistic priors that reduce the need for task-specific data. Features learned from massive unlabeled corpora transfer effectively to downstream tasks with minimal adaptation.
Small data strategies emphasize domain-aligned model selection, task reformulation, aggressive regularization, ensembling across random seeds, and considering the trade-off between label efficiency and model size.
The next part explores parameter-efficient fine-tuning methods like LoRA, which extend these efficiency principles to model adaptation itself, allowing large models to be customized with minimal computational cost and memory overhead.
Quiz
Ready to test your understanding? Take this quick quiz to reinforce what you've learned about fine-tuning data efficiency and few-shot learning techniques.