Transfer Learning: Pre-training and Fine-tuning for NLP

Michael Brenndoerfer · November 24, 2025 · 34 min read

Learn how transfer learning enables pre-trained models to adapt to new NLP tasks. Covers pre-training, fine-tuning, layer representations, and sample efficiency.

Transfer Learning

Throughout this book, you've learned how language models acquire knowledge through pre-training: BERT learns bidirectional representations through masked language modeling, GPT models capture sequential patterns through causal language modeling, and T5 learns flexible text-to-text mappings through span corruption. But why do we train these massive models in the first place? The answer is transfer learning, which changed how we build NLP systems.

Transfer learning is the practice of applying knowledge gained from one task to improve performance on a different, related task. Rather than training a model from scratch for each new problem, you start with a model that has already learned useful representations from a large corpus, then adapt it to your specific needs. This approach works well: a model pre-trained on general text can learn to classify sentiment, extract entities, answer questions, or summarize documents with remarkably little task-specific data.

This had a large impact. Before transfer learning became standard practice, building an effective NLP system required massive labeled datasets for each task. Sentiment analysis needed tens of thousands of labeled reviews. Named entity recognition required expensive expert annotation. Question answering demanded carefully curated question-answer pairs. Today, these same tasks can be tackled effectively with just hundreds or thousands of examples by leveraging pre-trained models. Transfer learning made advanced NLP accessible to more people by reducing the need for large labeled datasets.

The Pre-training/Fine-tuning Paradigm

Modern transfer learning separates general language understanding from task-specific adaptation. This reflects a key insight: language understanding is general, while task-specific knowledge is specialized. By decoupling these two phases, we can invest enormous computational resources in learning once and then reap the benefits across unlimited applications.

Stage 1: Pre-training

During pre-training, a model learns from vast amounts of unlabeled text using self-supervised objectives. As we covered in Part XVI, these objectives include causal language modeling (predicting the next token), masked language modeling (recovering masked tokens), and span corruption (reconstructing corrupted spans). The key insight is that predicting words in context forces the model to develop rich representations of language at multiple levels: syntax, semantics, pragmatics, and world knowledge.

To understand why this works, consider what it takes to predict a masked word accurately. Given the sentence "The capital of France is [MASK]," a model must know that capitals are cities, that France is a country, and that Paris is its capital. Given "The attorney argued that her [MASK] was innocent," the model must understand legal terminology, recognize the coreference between "her" and "attorney," and know that attorneys represent clients. Each prediction requires integrating multiple types of knowledge, and the cumulative effect of millions of such predictions builds comprehensive linguistic understanding.
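
To make this concrete, we can ask a pre-trained masked language model to fill in the first example above. The snippet below is a small illustrative check (it assumes the transformers library and the bert-base-uncased checkpoint used later in this chapter); exact scores will vary, but a well-trained model should place "paris" near the top.

Code
## Illustrative check: what does a masked language model predict here?
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

## Top three predictions for the masked token; "paris" should rank near the top
for prediction in fill_mask("The capital of France is [MASK].")[:3]:
    print(f"{prediction['token_str']:>10}  score={prediction['score']:.3f}")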

Pre-training is computationally expensive. Training GPT-3 required approximately 3.64 × 10²³ floating-point operations, costing millions of dollars in compute. But this cost is paid once. The resulting model encodes general-purpose knowledge that benefits countless downstream applications. Think of pre-training as constructing a massive library of linguistic knowledge: the construction cost is high, but once built, anyone can use the library to accomplish their specific goals.

Stage 2: Fine-tuning

Fine-tuning adapts a pre-trained model to a specific task using labeled examples. You start with the pre-trained weights, add task-specific layers if needed, and train on your target dataset with a much smaller learning rate than pre-training. The model adjusts its representations to optimize for your task while retaining the general knowledge from pre-training.

Fine-tuning requires a balance. The learning rate must be small enough to preserve pre-trained knowledge, but large enough for the model to learn the new task. Typically, fine-tuning learning rates are 10 to 100 times smaller than pre-training learning rates. The number of training epochs is also much smaller: while pre-training might involve multiple passes over billions of tokens, fine-tuning often converges within 3 to 5 epochs over thousands of examples.
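
As a rough sketch of what this looks like in code, the snippet below sets up a typical fine-tuning configuration with the transformers library. The learning rate and epoch count are common published defaults rather than values from this chapter's experiments, and train_dataloader is a placeholder for a DataLoader over your labeled task data.

Code
## A minimal fine-tuning setup (sketch): small learning rate, few epochs
import torch
from transformers import AutoModelForSequenceClassification

## Pre-trained encoder plus a freshly initialized 2-class classification head
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

## BERT was pre-trained with a peak learning rate around 1e-4; fine-tuning
## typically uses 2e-5 to 5e-5, roughly 10-100x smaller.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
num_epochs = 3  # fine-tuning usually converges within 3-5 epochs

## Training loop sketch; `train_dataloader` is a placeholder yielding batches
## with input_ids, attention_mask, and labels.
# for epoch in range(num_epochs):
#     for batch in train_dataloader:
#         loss = model(**batch).loss
#         loss.backward()
#         optimizer.step()
#         optimizer.zero_grad()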

This difference is important: pre-training uses billions of unlabeled tokens, while fine-tuning uses thousands of labeled examples. You get the best of both worlds: the broad knowledge of massive unsupervised learning combined with the precision of supervised task-specific training.

Why This Works

The pre-training/fine-tuning split works because natural language tasks share common structure. To classify the sentiment of "The movie was absolutely breathtaking," a model needs to understand that "breathtaking" is intensely positive, that "absolutely" amplifies this, and that these words apply to "movie." These linguistic skills, learned during pre-training, transfer directly to sentiment analysis even though the model was never explicitly trained on sentiment labels.

Consider the alternative: training a sentiment classifier from scratch. The model would need to learn from labeled examples that "breathtaking" is positive, that "absolutely" intensifies, and how adjectives modify nouns. With only a few thousand labeled examples, learning all these patterns would be impossible. The model would memorize surface patterns from the training data without understanding the underlying linguistic structure, leading to poor generalization.

More formally, pre-training learns a function that maps text to a rich representation space where semantically similar inputs cluster together. Fine-tuning then learns a relatively simple function from this representation space to task-specific outputs. Because the hard work of representation learning is already done, the task-specific function can be simple and learned from few examples.
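
A minimal sketch of this decomposition: freeze the pre-trained encoder as the fixed representation function and learn only a small linear head on top. Full fine-tuning would also update the encoder; this extreme case just makes the division of labor explicit.

Code
## Sketch: frozen pre-trained encoder + tiny task-specific head
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
encoder.requires_grad_(False)  # freeze the representation function

## The task-specific function: a single linear layer over the [CLS] vector
task_head = nn.Linear(encoder.config.hidden_size, 2)

inputs = tokenizer("The movie was absolutely breathtaking.", return_tensors="pt")
with torch.no_grad():
    cls_vector = encoder(**inputs).last_hidden_state[:, 0, :]

logits = task_head(cls_vector)  # only these few weights need task-specific training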

To visualize this conceptually, imagine pre-training as organizing a vast library of books by topic, genre, and theme. When a new task arrives, such as finding books about Renaissance art, you do not need to re-read every book. Instead, you navigate the already-organized structure to the relevant section. Fine-tuning is like learning to navigate to a specific section; it is much easier than organizing the entire library from scratch.

What Transfers: A Layer-by-Layer Analysis

Not all knowledge transfers equally. Research into what pre-trained models learn reveals a hierarchical organization where different layers capture different linguistic phenomena. Understanding this hierarchy helps explain why transfer learning works and guides decisions about how to fine-tune models effectively. The progression from lower to upper layers mirrors the progression from surface form to deep meaning, a pattern that emerges naturally from the pre-training objective.

Lower Layers: Surface Patterns and Morphology

The early layers of transformer models capture surface-level patterns: character sequences, morphological structure, and local syntactic relationships. These layers learn representations that are highly transferable because they encode fundamental aspects of language that appear across virtually all text.

Why do lower layers specialize in surface patterns? The answer lies in how information flows through the network. The first layer receives token embeddings that encode only local information about each token's identity. Through self-attention, this layer can detect patterns in how tokens co-occur within local contexts, learning that certain character sequences form words and that certain words frequently appear together. These patterns are the building blocks upon which higher-level understanding is constructed.

Probing experiments reveal that lower layers can accurately predict:

  • Part-of-speech tags
  • Morphological features (tense, number, case)
  • Character-level patterns
  • Basic phrase boundaries

These representations are language-specific but task-agnostic. Whether you're doing sentiment analysis, named entity recognition, or question answering, you need to understand that "running" is a verb form and "quickly" is an adverb. The universality of these requirements explains why lower-layer representations transfer so effectively across diverse tasks.

Middle Layers: Syntactic Structure

The middle layers of pre-trained models encode syntactic structure. These layers learn implicit parse trees, dependency relationships, and long-range grammatical agreements. Remarkably, models trained only to predict words develop representations that correlate strongly with traditional linguistic formalisms, even though they were never explicitly taught these concepts.

This shows that the pre-training objective helps models learn syntax. Consider why syntactic understanding helps predict masked words. In the sentence "The dogs that live next door [MASK] loudly every morning," predicting the masked word requires knowing that "dogs" is the subject, not "door." This requires tracking the relative clause structure and maintaining agreement across intervening material. Models that learn to make such predictions accurately must develop internal representations of syntactic structure.

Research using attention probing has found that specific attention heads specialize in tracking syntactic relationships:

  • Subject-verb agreement across intervening clauses
  • Coreference chains linking pronouns to antecedents
  • Constituency boundaries marking phrase structure

This syntactic knowledge transfers because syntax constrains meaning. Understanding that "the cat that chased the mouse ate the cheese" means the cat ate the cheese (not the mouse) requires syntactic parsing, regardless of what downstream task you're performing. A sentiment classifier, a question answering system, and a summarization model all benefit from accurate syntactic analysis, even though their ultimate outputs differ dramatically.

Upper Layers: Semantics and Task Adaptation

The upper layers capture more abstract semantic relationships and are most influenced by fine-tuning. These layers encode meaning compositions, reasoning patterns, and increasingly task-specific representations as you move toward the output.

The semantic representations in upper layers integrate information gathered from lower layers into coherent interpretations. At this level, the model represents not just what words mean in isolation but what they mean in context: the same word "bank" receives different representations depending on whether the surrounding context involves rivers or finance. These contextualized semantic representations are the primary currency of transfer learning, encoding the rich understanding that enables downstream task performance.

During fine-tuning, upper layers change more than lower layers. This makes intuitive sense: the surface-level linguistic knowledge encoded in lower layers remains useful regardless of task, while the higher-level representations need reshaping to produce task-specific outputs. A sentiment classifier and a named entity recognizer both benefit from the same morphological and syntactic analysis, but they need different semantic representations to produce their respective outputs. Fine-tuning specializes the upper layers for each task while largely preserving the shared lower-layer representations.
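
One way to check this claim directly is to compare a pre-trained checkpoint against a fine-tuned one and measure how far each layer's weights moved. The sketch below assumes a publicly available SST-2 fine-tuned BERT (textattack/bert-base-uncased-SST-2, an assumed checkpoint; any fine-tuned BERT works) and reports a relative weight change per layer, which typically grows toward the upper layers.

Code
## Sketch: how much did each transformer layer change during fine-tuning?
from transformers import AutoModel

pretrained = AutoModel.from_pretrained("bert-base-uncased")
## Assumed checkpoint: a BERT fine-tuned on SST-2 sentiment (swap in any fine-tuned BERT)
finetuned = AutoModel.from_pretrained("textattack/bert-base-uncased-SST-2")

pre_state = pretrained.state_dict()
ft_state = finetuned.state_dict()

for layer in range(12):
    prefix = f"encoder.layer.{layer}."
    diff_sq, norm_sq = 0.0, 0.0
    for name, param in pre_state.items():
        if name.startswith(prefix) and name in ft_state:
            diff_sq += (ft_state[name] - param).pow(2).sum().item()
            norm_sq += param.pow(2).sum().item()
    relative_change = (diff_sq / norm_sq) ** 0.5
    print(f"Layer {layer:2d}: relative weight change {relative_change:.4f}")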

Visualizing Layer Representations

Let's examine how representations differ across layers in a pre-trained model:

In[2]:
Code
!uv pip install transformers torch numpy matplotlib scikit-learn
from transformers import AutoModel, AutoTokenizer
import torch
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Patch
from sklearn.decomposition import PCA

## Load a pre-trained BERT model
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_hidden_states=True)
model.eval()
In[3]:
Code
## Example sentences with different semantic content
## First 3: Financial bank, Last 3: River bank
sentences = [
    "The bank approved my loan application.",
    "She deposited money at the bank.",
    "I need to go to the bank to withdraw cash.",
    "The river bank was covered in wildflowers.",
    "We sat on the bank and watched the fish.",
    "The boat was tied to the bank of the river.",
]


## Get hidden states for all layers
def get_layer_representations(sentences, word="bank"):
    representations = {i: [] for i in range(13)}  # 12 layers + embeddings

    for sent in sentences:
        inputs = tokenizer(sent, return_tensors="pt", padding=True)
        with torch.no_grad():
            outputs = model(**inputs)

        # Find the position of our target word
        tokens = tokenizer.tokenize(sent)
        word_idx = None
        for idx, token in enumerate(tokens):
            if word in token:
                word_idx = idx + 1  # +1 for [CLS] token
                break

        if word_idx is not None:
            for layer_idx, hidden_state in enumerate(outputs.hidden_states):
                rep = hidden_state[0, word_idx, :].numpy()
                representations[layer_idx].append(rep)

    return representations


reps = get_layer_representations(sentences)

We extract representations of the word "bank" from sentences where it has different meanings: the financial institution versus the river bank. This experiment shows how pre-trained models disambiguate word senses based on context. Let's see how these representations separate across layers:

In[4]:
Code
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA
from matplotlib.lines import Line2D

plt.rcParams.update(
    {
        "figure.figsize": (2.0, 1.8),  # Adjust for layout
        "figure.dpi": 300,
        "figure.constrained_layout.use": True,
        "font.family": "sans-serif",
        "font.sans-serif": [
            "Noto Sans CJK SC",
            "Apple SD Gothic Neo",
            "DejaVu Sans",
            "Arial",
        ],
        "font.size": 10,
        "axes.titlesize": 11,
        "axes.titleweight": "bold",
        "axes.titlepad": 8,
        "axes.labelsize": 10,
        "axes.labelpad": 4,
        "xtick.labelsize": 9,
        "ytick.labelsize": 9,
        "legend.fontsize": 9,
        "legend.title_fontsize": 10,
        "legend.frameon": True,
        "legend.loc": "best",
        "lines.linewidth": 1.5,
        "lines.markersize": 5,
        "axes.grid": True,
        "grid.alpha": 0.3,
        "grid.linestyle": "--",
        "axes.spines.top": False,
        "axes.spines.right": False,
        "axes.prop_cycle": plt.cycler(
            color=["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#7f7f7f"]
        ),
    }
)

layers_to_plot = [0, 2, 4, 7, 9, 11]
## Financial (orange), River (blue)
colors = ["#ff7f0e"] * 3 + ["#1f77b4"] * 3

for layer_idx in layers_to_plot:
    plt.figure()
    if len(reps[layer_idx]) >= 2:
        X = np.array(reps[layer_idx])
        pca = PCA(n_components=2)
        X_pca = pca.fit_transform(X)

        for i, (x, y) in enumerate(X_pca):
            plt.scatter(x, y, c=colors[i], s=50, alpha=0.7)

        plt.title(f"Layer {layer_idx}")
        plt.xlabel("PC1")
        plt.ylabel("PC2")

        # Add legend to the first plot as a key
        if layer_idx == 0:
            legend_elements = [
                Line2D(
                    [0],
                    [0],
                    marker="o",
                    color="w",
                    markerfacecolor="#ff7f0e",
                    label="Financial",
                ),
                Line2D(
                    [0],
                    [0],
                    marker="o",
                    color="w",
                    markerfacecolor="#1f77b4",
                    label="River",
                ),
            ]
            plt.legend(handles=legend_elements, loc="upper right")

    plt.show()

We can quantify this separation by computing the ratio of between-class to within-class distances at each layer:

In[5]:
Code
## Quantify the separation of "bank" meanings across layers
separation_ratios = []

for layer_idx in range(13):
    layer_reps = reps[layer_idx]
    if len(layer_reps) >= 6:
        X = np.array(layer_reps)
        # Financial bank: sentences 0-2, River bank: sentences 3-5
        financial = X[:3]
        river = X[3:]

        # Compute mean distance between classes
        between_dists = []
        for f in financial:
            for r in river:
                between_dists.append(np.linalg.norm(f - r))
        between_mean = np.mean(between_dists)

        # Compute mean distance within each class
        within_financial = []
        for i, f1 in enumerate(financial):
            for j, f2 in enumerate(financial):
                if i < j:
                    within_financial.append(np.linalg.norm(f1 - f2))

        within_river = []
        for i, r1 in enumerate(river):
            for j, r2 in enumerate(river):
                if i < j:
                    within_river.append(np.linalg.norm(r1 - r2))

        within_mean = (
            np.mean(within_financial + within_river)
            if (within_financial + within_river)
            else 1e-8
        )

        separation_ratios.append(between_mean / (within_mean + 1e-8))
    else:
        separation_ratios.append(np.nan)
In[6]:
Code
import matplotlib.pyplot as plt

plt.rcParams.update(
    {
        "figure.figsize": (6.0, 4.0),  # Adjust for layout
        "figure.dpi": 300,
        "figure.constrained_layout.use": True,
        "font.family": "sans-serif",
        "font.sans-serif": [
            "Noto Sans CJK SC",
            "Apple SD Gothic Neo",
            "DejaVu Sans",
            "Arial",
        ],
        "font.size": 10,
        "axes.titlesize": 11,
        "axes.titleweight": "bold",
        "axes.titlepad": 8,
        "axes.labelsize": 10,
        "axes.labelpad": 4,
        "xtick.labelsize": 9,
        "ytick.labelsize": 9,
        "legend.fontsize": 9,
        "legend.title_fontsize": 10,
        "legend.frameon": True,
        "legend.loc": "best",
        "lines.linewidth": 1.5,
        "lines.markersize": 5,
        "axes.grid": True,
        "grid.alpha": 0.3,
        "grid.linestyle": "--",
        "axes.spines.top": False,
        "axes.spines.right": False,
        "axes.prop_cycle": plt.cycler(
            color=["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#7f7f7f"]
        ),
    }
)

plt.figure()
plt.plot(range(13), separation_ratios, "o-", linewidth=2, markersize=8)
plt.axhline(y=1.0, color="gray", linestyle="--", alpha=0.5)
plt.xlabel("Layer")
plt.ylabel("Between/Within Class Distance Ratio")
plt.title("Word Sense Separation Across BERT Layers")
plt.xticks(range(13), ["Emb"] + [str(i) for i in range(1, 13)])
plt.show()
Out[6]:
Visualization
Ratio of between-class to within-class Euclidean distances for 'bank' representations across BERT layers. A ratio exceeding 1.0 indicates that context-specific clusters (financial vs. river) are more distinct than the internal variation within each sense. The steep rise in middle layers quantifies the network's increasing ability to disambiguate word senses based on syntactic and semantic context.

The visualization reveals how word sense disambiguation emerges across layers. In lower layers, representations of "bank" are relatively similar regardless of context, reflecting the fact that these layers primarily encode the token's identity and local patterns rather than its contextual meaning. As we progress through the middle layers, the representations begin to diverge, reflecting the integration of syntactic context. By upper layers, the financial and geographical senses have separated in representation space, forming distinct clusters that reflect their different meanings.

This contextual disambiguation, which the model learned during pre-training without any explicit word sense labels, directly benefits any downstream task involving ambiguous words. A sentiment classifier analyzing "The bank's customer service was terrible" benefits from knowing that "bank" refers to a financial institution, because financial institutions can have customer service while river banks cannot. This disambiguation happens automatically, as a natural consequence of the rich contextual representations learned during pre-training.

Key Parameters

The key parameters for the visualization code are:

  • output_hidden_states (AutoModel): Set to True to retrieve hidden states from all layers rather than just the final layer.
  • n_components (PCA): The number of principal components to keep (2) for reducing the high-dimensional representations to a plottable 2D space.

Types of Knowledge That Transfer

Transfer learning succeeds because pre-trained models acquire multiple types of knowledge, each useful for different downstream applications. Understanding these different knowledge types helps explain why transfer learning is so broadly effective and guides decisions about which pre-trained model to select for different tasks. The diversity of knowledge encoded in pre-trained models reflects the diversity of information required to predict words accurately in natural text.

Linguistic Knowledge

The most obvious type of transfer involves core linguistic competencies. These competencies form the foundation upon which all language understanding is built, and they transfer because every language task, regardless of its specific objective, requires parsing and interpreting natural language:

  • Syntax: Understanding grammatical structure, agreement patterns, and phrase boundaries. This includes knowing which words can modify which other words, how clauses nest within sentences, and how word order conveys meaning.
  • Morphology: Recognizing word forms, inflections, and derivational patterns. This encompasses understanding that "running," "runs," and "ran" are forms of the same verb, and that "unhappiness" is derived from "happy" through regular morphological processes.
  • Semantics: Encoding word meanings, compositional semantics, and lexical relationships. This involves knowing that "dog" and "canine" are related, that "buy" and "sell" describe the same transaction from different perspectives, and that "not unhappy" has a different meaning than "happy."
  • Pragmatics: Capturing discourse structure, coherence, and communicative intent. This includes understanding that questions expect answers, that pronouns refer to previously mentioned entities, and that certain phrases signal speaker attitude or certainty.

This linguistic knowledge enables models to parse novel sentences, understand complex constructions, and handle the infinite variety of natural language. Every new sentence a model encounters differs from every sentence it saw during training, yet the model can process it because it has learned the underlying rules and patterns of the language.

World Knowledge

Pre-trained models also acquire factual knowledge about the world. Training on internet text exposes models to encyclopedic information: that Paris is the capital of France, that water freezes at 0 °C, and that Einstein developed the theory of relativity. This knowledge transfers to tasks requiring factual understanding, such as question answering or fact verification.

The acquisition of world knowledge through language modeling is remarkable because the model is never explicitly told these facts. Instead, it learns them by observing patterns in how concepts co-occur. A model that sees thousands of sentences mentioning Paris in contexts involving France, government, and capitals learns to associate these concepts. This implicit knowledge acquisition means that pre-trained models function as compressed databases of the information present in their training corpora.

Research has shown that larger models store more factual knowledge, explaining part of why scale improves downstream task performance. However, this knowledge can become outdated, as the model's knowledge reflects its training data cutoff date. A model trained on text from 2022 will not know about events that occurred in 2023, regardless of its size.

Reasoning Patterns

Pre-trained models also appear to learn reasoning patterns that transfer across tasks. These patterns emerge from the regularities in how humans express logical relationships in text:

  • Analogical reasoning: Understanding relationships between concepts, such as knowing that Paris is to France as Berlin is to Germany
  • Causal reasoning: Recognizing cause-effect relationships in text, such as understanding that "because the bridge collapsed, traffic was rerouted" indicates the collapse caused the rerouting
  • Commonsense inference: Drawing everyday conclusions from context, such as inferring that someone who "grabbed an umbrella before leaving" expects rain
  • Numerical reasoning: Basic arithmetic and quantitative comparisons, such as understanding that "more than half" means a majority

These abilities emerge from patterns in training text that implicitly demonstrate reasoning. A model that has seen thousands of examples explaining that "because X happened, Y resulted" learns to recognize causal structure even in novel contexts. This learned reasoning transfers to downstream tasks that require similar inferences, even when those tasks involve different domains or surface forms.

Domain-Specific Knowledge

When pre-training data includes domain-specific text, models acquire specialized knowledge. This is why domain-adapted models like BioBERT (trained on biomedical literature) or FinBERT (trained on financial text) often outperform general-purpose models on domain-specific tasks. The pre-training stage can be thought of as installing a prior over useful representations, and domain-specific pre-training installs a better prior for domain-specific tasks.

The effectiveness of domain-specific pre-training reflects the fact that different domains have different vocabularies, different patterns of expression, and different background knowledge requirements. Medical text uses technical terminology, abbreviations, and writing conventions that differ from everyday language. A model pre-trained on medical literature has already learned these domain-specific patterns, making it better positioned to understand new medical texts than a model trained only on general web text.
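
A small symptom of this mismatch is visible at the tokenizer level: domain-specific terms that a general-purpose vocabulary has rarely seen tend to fragment into many subword pieces. The sketch below simply compares token counts for an everyday sentence and a clinical-style one using the bert-base-uncased tokenizer; the exact splits depend on the vocabulary, but technical terms typically break into noticeably more pieces.

Code
## Sketch: general-purpose vocabularies fragment domain-specific terms
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

everyday = "The patient felt much better after a good night of sleep."
clinical = "The patient was prescribed acetaminophen for pyrexia and tachycardia."

for sentence in [everyday, clinical]:
    tokens = tokenizer.tokenize(sentence)
    print(f"{len(tokens):2d} tokens: {tokens}")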

Transfer Learning Efficiency

Transfer learning is efficient in several ways. These gains explain why transfer learning has become the default approach for virtually all NLP applications: it is not merely convenient but fundamentally changes what is possible with limited resources.

Sample Efficiency

The primary benefit is sample efficiency: the ability to achieve good performance with far fewer labeled examples than training from scratch would require. This efficiency arises because the pre-trained model already understands language; it needs only to learn how to apply that understanding to the specific task at hand.

In[7]:
Code
## Let's simulate the sample efficiency of transfer learning
## by comparing a simple baseline to a pre-trained model approach


## Sample sentiment data
texts = [
    "This movie was absolutely wonderful and amazing!",
    "Terrible film, complete waste of time.",
    "I loved every minute of this masterpiece.",
    "Boring and predictable, not recommended.",
    "Outstanding performances by the entire cast!",
    "Dull, uninspired, and forgettable.",
    "A beautiful story that touched my heart.",
    "Awful acting and terrible dialogue.",
    "Brilliant direction and stunning visuals!",
    "One of the worst movies I've ever seen.",
    "Captivating from start to finish.",
    "Disappointing and poorly executed.",
    "An incredible cinematic experience!",
    "Mediocre at best, skip this one.",
    "Truly exceptional and thought-provoking.",
    "Painfully slow and utterly boring.",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
In[8]:
Code
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from transformers import pipeline


## Approach 1: Traditional TF-IDF + Logistic Regression (from scratch)
def train_tfidf_classifier(train_texts, train_labels, test_texts):
    vectorizer = TfidfVectorizer(max_features=1000)
    X_train = vectorizer.fit_transform(train_texts)
    X_test = vectorizer.transform(test_texts)

    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, train_labels)

    return clf.predict(X_test)


## Approach 2: Pre-trained sentiment classifier
sentiment_classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)


def pretrained_predictions(texts):
    results = sentiment_classifier(texts)
    return [1 if r["label"] == "POSITIVE" else 0 for r in results]
In[9]:
Code
## Test with varying amounts of training data
training_sizes = [4, 8, 12]
test_texts = texts[-4:]  # Hold out last 4 for testing
test_labels = labels[-4:]

from_scratch_acc = []
pretrained_acc = []

for n in training_sizes:
    # Train from scratch with n examples
    train_texts = texts[:n]
    train_labels = labels[:n]

    preds_scratch = train_tfidf_classifier(
        train_texts, train_labels, test_texts
    )
    acc_scratch = sum(p == t for p, t in zip(preds_scratch, test_labels)) / len(
        test_labels
    )
    from_scratch_acc.append(acc_scratch)

    # Pre-trained model (doesn't need our training data - already fine-tuned)
    preds_pretrained = pretrained_predictions(test_texts)
    acc_pretrained = sum(
        p == t for p, t in zip(preds_pretrained, test_labels)
    ) / len(test_labels)
    pretrained_acc.append(acc_pretrained)
In[10]:
Code
import matplotlib.pyplot as plt

plt.rcParams.update(
    {
        "figure.figsize": (6.0, 4.0),  # Adjust for layout
        "figure.dpi": 300,
        "figure.constrained_layout.use": True,
        "font.family": "sans-serif",
        "font.sans-serif": [
            "Noto Sans CJK SC",
            "Apple SD Gothic Neo",
            "DejaVu Sans",
            "Arial",
        ],
        "font.size": 10,
        "axes.titlesize": 11,
        "axes.titleweight": "bold",
        "axes.titlepad": 8,
        "axes.labelsize": 10,
        "axes.labelpad": 4,
        "xtick.labelsize": 9,
        "ytick.labelsize": 9,
        "legend.fontsize": 9,
        "legend.title_fontsize": 10,
        "legend.frameon": True,
        "legend.loc": "best",
        "lines.linewidth": 1.5,
        "lines.markersize": 5,
        "axes.grid": True,
        "grid.alpha": 0.3,
        "grid.linestyle": "--",
        "axes.spines.top": False,
        "axes.spines.right": False,
        "axes.prop_cycle": plt.cycler(
            color=["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#7f7f7f"]
        ),
    }
)

plt.figure()
plt.plot(
    training_sizes,
    from_scratch_acc,
    "o-",
    label="TF-IDF + LogReg (from scratch)",
)
plt.plot(training_sizes, pretrained_acc, "s-", label="Pre-trained DistilBERT")
plt.xlabel("Number of Training Examples")
plt.ylabel("Test Accuracy")
plt.title("Sample Efficiency: Transfer Learning vs. From Scratch")
plt.legend()
plt.ylim(0, 1.1)
plt.xticks(training_sizes)
plt.show()
Out[10]:
Visualization
Comparison of test accuracy between a pre-trained DistilBERT model and a TF-IDF logistic regression baseline across varying training set sizes. The pre-trained model achieves perfect accuracy with as few as 4 examples, leveraging prior knowledge, whereas the model trained from scratch requires significantly more data to learn task-relevant patterns.

The pre-trained model achieves strong performance immediately because it already understands sentiment from its original fine-tuning. The from-scratch approach must learn everything from the few examples provided: what words indicate positive or negative sentiment, how modifiers work, and how to compose these signals into an overall judgment. With only a handful of examples, this comprehensive learning is impossible.

Key Parameters

The key parameters used in this comparison are:

  • max_features (TfidfVectorizer): Limits the vocabulary to the 1,000 most frequent terms, reducing overfitting on small datasets.
  • max_iter (LogisticRegression): The maximum number of iterations for the solver to converge.

Compute Efficiency

Transfer learning also provides compute efficiency. Fine-tuning a pre-trained model typically requires:

  • Fewer training iterations (the model is already close to a good solution)
  • Smaller batch sizes (updates are refinements, not wholesale learning)
  • Less total compute (often 100–1000× less than pre-training)

This means fine-tuning can often be done on a single GPU in hours, while pre-training the same model would require clusters of GPUs running for weeks.

The compute efficiency of fine-tuning stems from the optimization landscape. A pre-trained model has already found a region of parameter space that produces good language representations. Fine-tuning needs only to navigate from this good starting point to a nearby point that is optimal for the specific task. In contrast, training from scratch must navigate from a random initialization through a vast, complex loss landscape to find good representations. The distance to travel is far shorter when starting from pre-trained weights.
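
To put rough numbers on this, a common rule of thumb from the scaling-law literature estimates training compute as about 6 × parameters × tokens. The figures below are illustrative assumptions (a BERT-base-sized model, a single pass over a few billion pre-training tokens, a few million fine-tuning tokens), not measurements, but they show how quickly the gap reaches the range quoted above.

Code
## Back-of-the-envelope comparison using the ~6 * params * tokens rule of thumb.
## All token counts below are illustrative assumptions.

def approx_train_flops(num_params, num_tokens):
    """Very rough training-compute estimate for a dense transformer."""
    return 6 * num_params * num_tokens

params = 110e6  # BERT-base-sized model

finetune_tokens = 10_000 * 128 * 3  # 10k examples, ~128 tokens each, 3 epochs
pretrain_tokens = 3.3e9             # one pass over a ~3.3B-token corpus

finetune_flops = approx_train_flops(params, finetune_tokens)
pretrain_flops = approx_train_flops(params, pretrain_tokens)

print(f"Fine-tuning : ~{finetune_flops:.2e} FLOPs")
print(f"Pre-training: ~{pretrain_flops:.2e} FLOPs")
print(f"Ratio       : ~{pretrain_flops / finetune_flops:.0f}x")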

The Economics of Transfer Learning

Consider the economic implications. Pre-training BERT-base cost roughly $5,000–$10,000 in cloud compute (at 2018 prices). But once trained, this model has been fine-tuned for thousands of different tasks by researchers and practitioners worldwide. Each fine-tuning run costs perhaps $5–$50. The pre-training cost is amortized across countless applications, making sophisticated NLP accessible to organizations that could never afford to train from scratch.

This economic structure has shaped the field. Large organizations with substantial compute budgets pre-train foundation models, while the broader community fine-tunes these models for specific applications. It's a form of specialization that has accelerated progress across NLP. Small startups can build state-of-the-art NLP features by fine-tuning publicly available pre-trained models, competing effectively with much larger organizations. Academic researchers can explore new tasks and domains without requiring the compute budgets of industry labs.

Historical Perspective

A brief look at how transfer learning developed helps explain why the modern pre-training/fine-tuning paradigm takes the form it does today.

Computer Vision: The ImageNet Moment

Transfer learning first demonstrated its power in computer vision. In 2012, AlexNet won the ImageNet challenge and researchers discovered that its learned features transferred remarkably well to other vision tasks. Features learned to detect edges, textures, and shapes in ImageNet could be reused for medical imaging, satellite analysis, or facial recognition.

This "ImageNet moment" created a template: pre-train on a large general dataset, fine-tune for specific applications. NLP researchers sought an analogous approach but faced a challenge: there was no natural analog to ImageNet's supervised image classification dataset.

Word Embeddings: First Steps

Word2Vec and GloVe, which we covered in Part IV, represented early transfer learning in NLP. Pre-trained word embeddings captured semantic relationships that could initialize neural network models for downstream tasks. However, these embeddings were static: each word had a single representation regardless of context.

Contextualized Embeddings: ELMo

ELMo (Embeddings from Language Models), introduced in 2018, changed the game. By pre-training a bidirectional LSTM language model, ELMo produced context-dependent representations. The word "bank" would have different representations in financial and geographical contexts. These contextualized embeddings dramatically improved performance across NLP tasks.

ELMo used a feature-based approach: the pre-trained representations were fixed features fed into task-specific models. This was effective but limited, as it couldn't benefit from joint optimization of representations and task objectives.

The BERT Revolution

BERT, as we discussed in Part XVII, combined the benefits of contextualized representations with end-to-end fine-tuning. Pre-trained using masked language modeling, BERT's parameters could be adapted during fine-tuning, allowing representations to specialize for each task while retaining general linguistic knowledge.

BERT's success established the modern transfer learning paradigm. Subsequent models, including RoBERTa, ALBERT, ELECTRA, and DeBERTa that you've already studied, refined the approach with improved pre-training objectives, more efficient architectures, and better fine-tuning strategies.

GPT and Generative Transfer

While BERT demonstrated transfer learning for discriminative tasks (classification, tagging, extraction), the GPT series showed that autoregressive language modeling could enable transfer to generative tasks. As we covered in Part XVIII, GPT-2 and GPT-3 demonstrated impressive transfer via prompting and in-context learning, expanding the scope of what pre-trained models could accomplish.

Conditions for Successful Transfer

Not all transfer is beneficial. Understanding when transfer works helps you design effective systems. Transfer works when knowledge from the source task is relevant to the target task. High relevance accelerates learning, while low relevance can hurt performance.

Domain Similarity

Transfer works best when source and target domains share structure. A model pre-trained on news text will transfer well to other formal written English but may struggle with informal social media language or highly technical scientific prose. This is why domain-adapted models often outperform general-purpose ones.

The relevant notion of similarity encompasses multiple dimensions. Vocabulary overlap matters: a model that has never seen medical terminology will struggle with medical text. Syntactic conventions matter: scientific writing uses passive voice and complex nominalizations more than casual conversation. Discourse structure matters: legal documents follow different organizational principles than narrative fiction. The more these dimensions align between source and target, the more effective transfer will be.
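
There is no single number that captures domain similarity, but a crude first check is vocabulary overlap between a sample of source-domain text and your target text. The sketch below computes a Jaccard overlap over word types for two tiny illustrative samples; treat it as a heuristic screen, not a measure of syntactic or discourse similarity.

Code
## Crude heuristic: vocabulary (word-type) overlap between two text samples
import re

def word_types(text):
    return set(re.findall(r"[a-z]+", text.lower()))

news_sample = "The central bank raised interest rates to curb inflation this quarter."
social_sample = "ngl this new update is kinda mid, devs pls fix the lag fr"

a, b = word_types(news_sample), word_types(social_sample)
jaccard = len(a & b) / len(a | b)
print(f"Shared words: {sorted(a & b)}")
print(f"Jaccard overlap: {jaccard:.2f}")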

Task Relatedness

Related tasks share representations. Language modeling helps sentiment analysis because both require understanding word meanings and compositions. But language modeling may help less for tasks requiring specialized knowledge not present in pre-training data.

Task relatedness can be understood through the lens of representation requirements. Two tasks are related if they benefit from similar internal representations. Sentiment analysis and emotion detection are highly related because both require understanding affective language. Sentiment analysis and parsing are somewhat related because sentiment often depends on syntactic structure. Sentiment analysis and mathematical reasoning are less related because they require different types of knowledge and different representational properties.

Avoiding Negative Transfer

When source and target domains are too dissimilar, transfer can actually hurt performance. This negative transfer occurs when pre-trained representations encode biases or patterns inappropriate for the target task. We'll explore how to handle this through careful fine-tuning strategies in the upcoming chapters.

Negative transfer is particularly insidious because it is not always obvious. A model pre-trained on formal English might learn that certain grammatical constructions indicate high-quality text. When applied to informal text like social media posts, this bias could lead the model to misclassify informal but substantive content. Detecting and mitigating negative transfer requires careful evaluation on held-out data from the target domain.

Probing What Models Learn

Probing Tasks

Probing tasks are simple classification tasks designed to test whether specific linguistic properties are encoded in model representations. By training a lightweight classifier on top of frozen model representations, you can assess what information is accessible in different layers.

Probing provides a window into the internal representations of pre-trained models. To probe a model, extract representations from a layer and train a simple classifier to predict a linguistic property. Success means the information is present; failure means it is absent or inaccessible.

Let's implement a simple probing experiment to verify that syntactic information is encoded in pre-trained representations:

In[11]:
Code
## Probing for part-of-speech information in BERT representations
from transformers import AutoModel, AutoTokenizer

## Prepare probing data: words with their POS tags
probing_data = [
    # Nouns
    ("The cat sat on the mat.", [("cat", "NOUN"), ("mat", "NOUN")]),
    ("She read a book yesterday.", [("book", "NOUN")]),
    ("The dog chased the ball.", [("dog", "NOUN"), ("ball", "NOUN")]),
    # Verbs
    ("He runs every morning.", [("runs", "VERB")]),
    ("They decided to leave early.", [("decided", "VERB"), ("leave", "VERB")]),
    ("The bird sings beautifully.", [("sings", "VERB")]),
    # Adjectives
    ("The red car is fast.", [("red", "ADJ"), ("fast", "ADJ")]),
    ("She wore a beautiful dress.", [("beautiful", "ADJ")]),
    ("The old house was empty.", [("old", "ADJ"), ("empty", "ADJ")]),
]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained(
    "bert-base-uncased", output_hidden_states=True
)
model.eval()
In[12]:
Code
def extract_word_representations(sentence, target_word, layer=6):
    """Extract representation for a specific word from a specific layer."""
    inputs = tokenizer(sentence, return_tensors="pt")
    tokens = tokenizer.tokenize(sentence)

    # Find target word position
    target_lower = target_word.lower()
    word_idx = None
    for idx, token in enumerate(tokens):
        if token.startswith(target_lower) or target_lower.startswith(
            token.replace("##", "")
        ):
            word_idx = idx + 1  # +1 for [CLS]
            break

    if word_idx is None:
        return None

    with torch.no_grad():
        outputs = model(**inputs)
        hidden_state = outputs.hidden_states[layer]
        return hidden_state[0, word_idx, :].numpy()


## Build probing dataset
X, y = [], []
pos_to_idx = {"NOUN": 0, "VERB": 1, "ADJ": 2}

for sentence, word_tags in probing_data:
    for word, pos in word_tags:
        rep = extract_word_representations(sentence, word, layer=6)
        if rep is not None:
            X.append(rep)
            y.append(pos_to_idx[pos])

X = np.array(X)
y = np.array(y)
In[13]:
Code
## Train a simple probe and evaluate
from sklearn.model_selection import cross_val_score

probe = LogisticRegression(max_iter=1000, random_state=42)
scores = cross_val_score(
    probe, X, y, cv=min(3, len(X) // 3), scoring="accuracy"
)

print(f"POS Probing Accuracy: {scores.mean():.2f} (+/- {scores.std() * 2:.2f})")
print(f"Number of probing examples: {len(X)}")
print(f"Representation dimensionality: {X.shape[1]}")
Out[14]:
Console
POS Probing Accuracy: 0.93 (+/- 0.09)
Number of probing examples: 14
Representation dimensionality: 768

Even with this tiny probing dataset, the classifier achieves reasonable accuracy at distinguishing parts of speech, demonstrating that BERT's layer 6 representations encode syntactic information. The success of this simple experiment reflects the rich linguistic knowledge that BERT acquired during pre-training. Large-scale probing studies use thousands of examples and show that different layers specialize in different linguistic properties, with lower layers encoding morphology, middle layers encoding syntax, and upper layers encoding semantics.

To see how syntactic information is distributed across layers, we can extend our probing experiment to test each layer:

In[15]:
Code
## Probe POS information across all layers
layer_probing_accuracies = []

for layer in range(13):
    X_layer, y_layer = [], []

    for sentence, word_tags in probing_data:
        for word, pos in word_tags:
            rep = extract_word_representations(sentence, word, layer=layer)
            if rep is not None:
                X_layer.append(rep)
                y_layer.append(pos_to_idx[pos])

    if len(X_layer) >= 6:
        X_layer = np.array(X_layer)
        y_layer = np.array(y_layer)
        probe_layer = LogisticRegression(max_iter=1000, random_state=42)
        n_splits = min(3, len(X_layer) // max(len(set(y_layer)), 1))
        if n_splits >= 2:
            layer_scores = cross_val_score(
                probe_layer, X_layer, y_layer, cv=n_splits
            )
            layer_probing_accuracies.append(layer_scores.mean())
        else:
            probe_layer.fit(X_layer, y_layer)
            layer_probing_accuracies.append(probe_layer.score(X_layer, y_layer))
    else:
        layer_probing_accuracies.append(np.nan)
In[16]:
Code
import matplotlib.pyplot as plt

plt.rcParams.update(
    {
        "figure.figsize": (6.0, 4.0),  # Adjust for layout
        "figure.dpi": 300,
        "figure.constrained_layout.use": True,
        "font.family": "sans-serif",
        "font.sans-serif": [
            "Noto Sans CJK SC",
            "Apple SD Gothic Neo",
            "DejaVu Sans",
            "Arial",
        ],
        "font.size": 10,
        "axes.titlesize": 11,
        "axes.titleweight": "bold",
        "axes.titlepad": 8,
        "axes.labelsize": 10,
        "axes.labelpad": 4,
        "xtick.labelsize": 9,
        "ytick.labelsize": 9,
        "legend.fontsize": 9,
        "legend.title_fontsize": 10,
        "legend.frameon": True,
        "legend.loc": "best",
        "lines.linewidth": 1.5,
        "lines.markersize": 5,
        "axes.grid": True,
        "grid.alpha": 0.3,
        "grid.linestyle": "--",
        "axes.spines.top": False,
        "axes.spines.right": False,
        "axes.prop_cycle": plt.cycler(
            color=["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#7f7f7f"]
        ),
    }
)

plt.figure()
plt.plot(range(13), layer_probing_accuracies, "o-", linewidth=2, markersize=8)
plt.xlabel("Layer")
plt.ylabel("POS Probing Accuracy")
plt.title("Syntactic Information Across BERT Layers")
plt.xticks(range(13), ["Emb"] + [str(i) for i in range(1, 13)])
plt.ylim(0, 1.05)
plt.show()
Out[16]:
Visualization
Part-of-Speech (POS) probing accuracy across 13 BERT layers (embedding + 12 transformer layers). Accuracy peaks in the middle layers (layers 3-6), indicating that syntactic information is most accessible in this region of the network, before becoming more abstract in upper layers.

This layer-wise analysis confirms the hierarchical organization of linguistic knowledge in pre-trained models. Syntactic information like part-of-speech tags becomes increasingly accessible as we move from the embedding layer through the early transformer layers, typically reaching peak accessibility in middle layers. The slight decline in upper layers reflects their specialization toward more abstract semantic representations that may not require explicit syntactic encoding.

Key Parameters

The key parameters for the probing experiment are:

  • output_hidden_states (AutoModel): Enables access to internal layer representations required for probing.
  • cv (cross_val_score): The number of folds for cross-validation, ensuring robust performance estimation.
  • max_iter (LogisticRegression): Ensures the probe classifier converges given the high-dimensional input features.

Implications for Practice

Transfer learning has practical implications for how you approach NLP projects.

Start with Pre-trained Models

Unless you have a compelling reason not to, always start with a pre-trained model. The burden of proof should be on training from scratch, not on using transfer learning. Even if your domain is specialized, pre-trained models provide a strong initialization.

Choose the Right Base Model

Different pre-trained models suit different tasks:

  • BERT-style models: Best for classification, token labeling, and extraction tasks
  • GPT-style models: Best for generation and tasks that can be framed as text completion
  • T5-style models: Flexible for tasks that can be framed as text-to-text

As we'll explore in upcoming chapters, the choice of base model interacts with fine-tuning strategy.
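
To make the mapping above concrete, the sketch below shows how each family is typically loaded with the transformers Auto classes. The checkpoints are common public examples chosen for illustration, not recommendations for any particular task.

Code
## Sketch: typical entry points for each pre-trained model family
from transformers import (
    AutoModelForSequenceClassification,  # BERT-style: classification and labeling heads
    AutoModelForCausalLM,                # GPT-style: next-token generation
    AutoModelForSeq2SeqLM,               # T5-style: text-to-text
)

bert_style = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)
gpt_style = AutoModelForCausalLM.from_pretrained("gpt2")
t5_style = AutoModelForSeq2SeqLM.from_pretrained("t5-small")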

Consider Domain Adaptation

If your domain differs substantially from general web text, consider domain-adaptive pre-training. Continue pre-training on domain-specific text before task-specific fine-tuning. This two-stage transfer often outperforms direct fine-tuning.
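
A sketch of what continued (domain-adaptive) pre-training looks like with masked language modeling is shown below. Here domain_dataset is a placeholder for a tokenized corpus of in-domain text, and the hyperparameters are illustrative; after this stage you would fine-tune on your labeled task data as usual.

Code
## Sketch: continue masked-language-model pre-training on in-domain text
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

## Masks 15% of tokens on the fly, mirroring BERT's pre-training objective
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

args = TrainingArguments(
    output_dir="./domain-adapted-bert",  # illustrative path
    learning_rate=5e-5,
    num_train_epochs=1,
    per_device_train_batch_size=16,
)

## `domain_dataset` is a placeholder: a tokenized dataset of in-domain text
## (e.g., built with the datasets library and the tokenizer above).
# trainer = Trainer(
#     model=model,
#     args=args,
#     train_dataset=domain_dataset,
#     data_collator=data_collator,
# )
# trainer.train()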

Monitor for Distribution Shift

Pre-trained models reflect their training data. If your target distribution differs significantly (different time period, different demographics, different register), be aware that transfer may be imperfect. Evaluate carefully and consider strategies to address distribution shift.

Limitations and Challenges

While transfer learning has transformed NLP, it has important limitations.

The most significant challenge is catastrophic forgetting, where fine-tuning causes the model to lose capabilities it had after pre-training. Optimizing for a specific task can overwrite the general knowledge that made transfer learning valuable in the first place. This is particularly problematic when you want a single model to handle multiple tasks. We'll address this in detail in an upcoming chapter.

Transfer learning also inherits biases from pre-training data. Models trained on internet text encode societal biases present in that text. These biases transfer to downstream tasks, sometimes amplifying stereotypes or producing unfair predictions. Addressing these biases requires careful evaluation and mitigation strategies.

Another limitation is the fixed knowledge cutoff. Pre-trained models know about events and facts present in their training data but nothing about what happened afterward. This temporal limitation means models can provide outdated information and cannot reason about recent events without additional mechanisms.

Finally, transfer learning works best for tasks that resemble aspects of the pre-training objective. Tasks requiring specialized reasoning, precise numerical computation, or knowledge not present in pre-training data may see limited benefit from transfer. In such cases, task-specific approaches or specialized pre-training may be necessary.

Summary

Transfer learning revolutionized NLP by enabling powerful models trained on vast unlabeled text to be adapted for specific tasks with minimal labeled data. The pre-training/fine-tuning paradigm separates general language understanding from task-specific adaptation, allowing the expensive work of representation learning to be amortized across countless applications.

Pre-trained models acquire multiple types of knowledge that transfer: linguistic competencies encoded hierarchically across layers, world knowledge absorbed from training text, and reasoning patterns implicit in language use. This rich prior makes fine-tuning extraordinarily sample-efficient, allowing strong performance from hundreds rather than hundreds of thousands of examples.

The history of transfer learning traces from static word embeddings through contextualized representations to the modern transformer-based paradigm. Each step expanded what could transfer and how effectively. Today, transfer learning is the default approach for nearly all NLP tasks.

However, transfer learning introduces challenges: catastrophic forgetting during fine-tuning, inherited biases from pre-training, and limitations from knowledge cutoffs. In the following chapters, we'll explore full fine-tuning techniques, strategies to prevent forgetting, and efficient alternatives like parameter-efficient fine-tuning that address some of these challenges while preserving the benefits that make transfer learning so powerful.
