Master BERT fine-tuning for downstream NLP tasks. Learn task-specific heads, hyperparameter tuning, and strategies to prevent catastrophic forgetting.

BERT Fine-tuning
You have a pre-trained BERT model with 110 million parameters trained on billions of words. Now you need it to classify movie reviews, identify named entities, or answer questions about documents. How do you adapt this general language understanding to your specific task?
Fine-tuning is the answer. Rather than training from scratch, you take BERT's pre-trained weights and continue training on your task-specific data. The model already understands language; fine-tuning teaches it your particular task. This chapter covers the complete fine-tuning process: how to add task-specific heads for classification, sequence labeling, and question answering; how to set hyperparameters that balance learning speed against stability; and how to avoid catastrophic forgetting, where the model loses its pre-trained knowledge.
The Fine-tuning Paradigm
Pre-training teaches BERT general language understanding. Fine-tuning specializes that understanding for downstream tasks. The key insight is that most of BERT's learned representations transfer well across tasks, so only minor adjustments are needed.
Fine-tuning: The process of taking a pre-trained model and continuing training on task-specific data with a task-specific objective. Fine-tuning updates all or most of the model's parameters, adapting general representations to the target task while preserving useful pre-trained knowledge.
The fine-tuning workflow follows a consistent pattern:
- Load a pre-trained BERT model
- Add a task-specific head (classifier, token labeler, or span predictor)
- Train on labeled task data with a much lower learning rate than pre-training
- The entire model updates: both the new head and BERT's existing weights
This differs from feature extraction, where BERT's weights are frozen and only the task head trains. Fine-tuning typically achieves better performance because BERT can adapt its internal representations to the task, not just learn to use fixed features.
Classification Fine-tuning
Sentiment analysis, spam detection, topic classification: these tasks require mapping an entire text to a single label. BERT handles classification through its [CLS] token.
The Classification Architecture
To classify a sentence, we need to reduce BERT's variable-length sequence of token representations into a single fixed-size vector. But which tokens should we use? We could average all token representations, but that treats every word equally, giving "the" as much weight as "terrible" in a movie review. We could use the final token, but that's arbitrary.
BERT's solution is elegant: it reserves a special [CLS] token at the beginning of every input specifically for sequence-level tasks. During pre-training, this token learned to aggregate information from the entire sequence for the Next Sentence Prediction objective. By the time pre-training finishes, the [CLS] representation has become a 768-dimensional summary of the sequence's meaning.
For classification, we add a single linear layer on top of this representation. The layer transforms the 768-dimensional [CLS] vector into a vector of class scores:
$$\mathbf{z} = \mathbf{W}\,\mathbf{h}_{[\text{CLS}]} + \mathbf{b}$$
where:
- $\mathbf{z}$: the output logits vector with dimension $K$ (the number of classes)
- $\mathbf{W}$: the weight matrix of shape $K \times 768$ that projects the hidden state to class space
- $\mathbf{h}_{[\text{CLS}]}$: the 768-dimensional hidden state of the [CLS] token from BERT's final layer
- $\mathbf{b}$: the bias vector of dimension $K$
Each element of $\mathbf{z}$ represents the model's confidence that the input belongs to that class, called a logit or unnormalized score. Higher values mean higher confidence. To convert these logits into proper probabilities that sum to 1, we apply the softmax function. During training, we minimize the cross-entropy loss between these predicted probabilities and the true labels, which pushes the model to assign high probability to the correct class and low probability to incorrect ones.
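To make this concrete, here is a minimal sketch using the Hugging Face transformers library (the library choice, model name, and example review are illustrative assumptions): we load bert-base-uncased with a freshly initialized two-class head and classify a single review before any fine-tuning.

```python
import torch
from transformers import AutoTokenizer, BertForSequenceClassification

# Load BERT with a freshly initialized 2-class classification head
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model.eval()

# Classify a movie review before any fine-tuning has happened
inputs = tokenizer("This movie was absolutely wonderful!", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits          # shape: (1, 2)
    probs = torch.softmax(logits, dim=-1)    # convert logits to probabilities

print(probs)  # roughly [0.5, 0.5]: the untrained head guesses at random
```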
The probabilities are nearly equal, which is expected: the classifier produces random predictions because its weights are randomly initialized. BERT's layers contain useful pre-trained representations, but the classifier hasn't learned to use them yet. After fine-tuning on labeled sentiment data, these probabilities would reflect meaningful predictions.
Training Loop for Classification
Fine-tuning requires careful attention to learning rates, batch sizes, and training duration. Here's a complete training loop:
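The sketch below shows one way to write that loop with PyTorch and the transformers library. The ten toy sentiment sentences, batch size of 4, and two epochs are assumptions chosen so the example runs quickly; a real run would swap in a proper dataset and a validation split.

```python
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader, TensorDataset
from transformers import (AutoTokenizer, BertForSequenceClassification,
                          get_linear_schedule_with_warmup)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Ten toy sentiment examples (1 = positive, 0 = negative)
texts = ["A wonderful film.", "Terrible acting.", "I loved every minute.",
         "A complete waste of time.", "Beautifully shot and acted.", "Dull and predictable.",
         "An instant classic.", "I walked out halfway through.", "Funny and heartwarming.",
         "The plot made no sense."]
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

encodings = tokenizer(texts, padding=True, truncation=True, max_length=64, return_tensors="pt")
dataset = TensorDataset(encodings["input_ids"], encodings["attention_mask"], torch.tensor(labels))
loader = DataLoader(dataset, batch_size=4, shuffle=True)

epochs = 2
optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
total_steps = len(loader) * epochs
scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps=int(0.1 * total_steps),
                                            num_training_steps=total_steps)

model.train()
for epoch in range(epochs):
    for input_ids, attention_mask, batch_labels in loader:
        optimizer.zero_grad()
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=batch_labels)
        outputs.loss.backward()                                   # cross-entropy loss computed internally
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)   # clip gradients to norm 1.0
        optimizer.step()
        scheduler.step()                                          # warmup + linear decay
    print(f"epoch {epoch + 1}: loss = {outputs.loss.item():.4f}") # loss on the final batch of the epoch
```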
The decreasing training loss indicates the model is learning from the examples. Even with only 10 training examples and 2 epochs, the model begins adapting to the sentiment classification task. Real fine-tuning uses thousands of examples and typically runs for 3-4 epochs, achieving much stronger validation accuracy.
Multi-class Classification
The same architecture handles multi-class classification by changing the number of output classes:
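As a sketch, the snippet below assumes a hypothetical four-topic label set and an untrained head; only num_labels changes relative to the binary case.

```python
import torch
from transformers import AutoTokenizer, BertForSequenceClassification

topics = ["world", "sports", "business", "science"]   # assumed 4-class topic set
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=len(topics))
model.eval()

texts = ["The central bank raised interest rates again.",
         "The striker scored twice in the final."]
inputs = tokenizer(texts, padding=True, return_tensors="pt")
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)   # shape: (2, 4)

for text, p in zip(texts, probs):
    print(text, "->", topics[p.argmax().item()])  # predictions are random before fine-tuning
```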
The predictions are random because the model is untrained. The architecture remains identical to binary classification; only the output dimension changes from 2 to 4 classes. The softmax over 4 classes produces a probability distribution over topics. After fine-tuning on topic-labeled data, the model would correctly classify each text into its appropriate category.
Sequence Labeling Fine-tuning
Named Entity Recognition (NER), part-of-speech tagging, and similar tasks require predictions for each token, not just the sequence. Instead of using only the [CLS] token, we classify every token position.
Token Classification Architecture
Classification uses a single representation, the [CLS] token, to make one prediction for the entire sequence. But what if we need a prediction for every token? In Named Entity Recognition, each word gets a label: "John" is a person, "Google" is an organization, "works" is outside any entity. The model must make as many predictions as there are tokens.
The solution extends naturally from classification. Instead of applying our linear layer only to [CLS], we apply the same linear transformation to every token's representation. For each position $i$ in the sequence:
$$\mathbf{z}_i = \mathbf{W}\,\mathbf{h}_i + \mathbf{b}$$
where:
- $\mathbf{z}_i$: the logits for position $i$ with dimension $K$ (the number of labels)
- $\mathbf{h}_i$: the hidden state at position $i$ from BERT's final layer
- $\mathbf{W}$: the shared weight matrix of shape $K \times 768$ applied to all positions
- $\mathbf{b}$: the shared bias vector
An important design choice: we use the same $\mathbf{W}$ and $\mathbf{b}$ for every position. This weight sharing makes sense because the meaning of labels doesn't change across positions: a "person" label means the same thing whether we're classifying the first token or the tenth. Sharing weights also keeps the parameter count manageable; without sharing, we'd need separate parameters for each possible position.
The output is a matrix of logits with shape (sequence_length, num_labels). We apply softmax independently to each row to get per-token probability distributions, then take the argmax to get the predicted label for each position:
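The sketch below runs this per-token head with BertForTokenClassification on an example sentence; the seven-tag IOB label set and the sentence itself are illustrative assumptions.

```python
import torch
from transformers import AutoTokenizer, BertForTokenClassification

labels = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]  # assumed IOB label set
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertForTokenClassification.from_pretrained("bert-base-uncased", num_labels=len(labels))
model.eval()

inputs = tokenizer("John Smith works at Google in California", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits            # shape: (1, seq_len, num_labels)
    predictions = logits.argmax(dim=-1)[0]     # one label index per token position

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, pred in zip(tokens, predictions):
    print(f"{token:12s} {labels[pred.item()]}")  # random tags until the head is fine-tuned
```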
The predicted labels are random because the classifier weights haven't been trained. In a trained NER model, "John" and "Smith" would receive B-PER and I-PER tags, "Google" would receive B-ORG, and "California" would receive B-LOC. The O (Outside) tag would correctly mark non-entity tokens like "works" and "at".
Handling WordPiece Tokenization
A subtle challenge arises from WordPiece tokenization. When a word splits into multiple subwords, which subword's prediction should we use?
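One common answer is to keep only the first subword's prediction. The sketch below assumes a fast tokenizer (whose word_ids() method maps each subword back to its source word) and a hypothetical helper, align_predictions_to_words, that implements this strategy.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def align_predictions_to_words(encoding, token_predictions):
    """Keep only the first subword's prediction for each original word."""
    word_labels = {}
    for idx, word_id in enumerate(encoding.word_ids()):
        if word_id is not None and word_id not in word_labels:   # skip [CLS]/[SEP] and later subwords
            word_labels[word_id] = token_predictions[idx]
    return [word_labels[w] for w in sorted(word_labels)]

words = ["The", "plot", "was", "unbelievable"]
encoding = tokenizer(words, is_split_into_words=True, return_tensors="pt")
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"][0]))  # exact subword split depends on the vocab

# Use token indices as stand-in predictions to show which subword each word's label comes from
token_predictions = list(range(len(encoding.word_ids())))
print(align_predictions_to_words(encoding, token_predictions))    # one entry per original word
```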
The word "unbelievable" tokenizes into multiple subwords (likely "un", "##believ", "##able"). The alignment function takes the first subword's prediction as the label for the entire word. This is necessary because NER labels apply to words, not subwords. Alternative strategies include averaging logits across subwords or using the last subword's prediction.
The IOB Tagging Scheme
NER commonly uses Inside-Outside-Beginning (IOB) tagging. "B-" marks the beginning of an entity, "I-" continues it, and "O" marks non-entity tokens.
The IOB scheme allows multi-word entities: "John Smith" spans two tokens with B-PER and I-PER, while "New York" spans B-LOC and I-LOC. This is why sequence labeling is more complex than simple classification: the model must learn to produce coherent tag sequences.
Question Answering Fine-tuning
Extractive question answering finds answer spans within a context passage. Given a question and context, the model predicts which tokens constitute the answer.
Span Prediction Architecture
Question answering presents a fundamentally different challenge from classification or sequence labeling. The answer to a question isn't a single label but a contiguous span of text within the context. For "Where does John work?", the answer "Google" is a substring of the passage. We need to predict both where this substring starts and where it ends.
Consider the alternatives. We could treat this as token classification, labeling each token as "in answer" or "not in answer." But this approach has a flaw: it doesn't guarantee a contiguous span. The model might label disconnected tokens as part of the answer. We could add constraints, but there's a simpler approach.
Instead of classifying each token, we compute two scores: how likely each position is to be the answer's start, and how likely each position is to be the answer's end. For each token position $i$:
$$s_i = \mathbf{w}_{\text{start}} \cdot \mathbf{h}_i \qquad e_i = \mathbf{w}_{\text{end}} \cdot \mathbf{h}_i$$
where:
- $s_i$: the start score for position $i$ (higher means more likely to be the answer start)
- $e_i$: the end score for position $i$ (higher means more likely to be the answer end)
- $\mathbf{w}_{\text{start}}$: the learned weight vector for start prediction (768 dimensions)
- $\mathbf{w}_{\text{end}}$: the learned weight vector for end prediction (768 dimensions)
- $\mathbf{h}_i$: the hidden state at position $i$ from BERT's final layer
Notice that these are dot products, not full linear layers. Each score is a simple weighted sum of the hidden state dimensions. We use separate weight vectors $\mathbf{w}_{\text{start}}$ and $\mathbf{w}_{\text{end}}$ because the features that indicate "this is where an answer starts" differ from those indicating "this is where an answer ends."
The predicted answer span is the substring from position $\hat{i} = \arg\max_i s_i$ to position $\hat{j} = \arg\max_j e_j$: we find the token with the highest start score and the token with the highest end score, then extract everything between them (inclusive). During inference, we add a constraint: the end position must not precede the start position.
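The sketch below wires this up with BertForQuestionAnswering and the running example; the question, context, and untrained QA heads are illustrative.

```python
import torch
from transformers import AutoTokenizer, BertForQuestionAnswering

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertForQuestionAnswering.from_pretrained("bert-base-uncased")
model.eval()

question = "Where does John work?"
context = "John works at Google in California."
inputs = tokenizer(question, context, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
start = outputs.start_logits.argmax(dim=-1).item()   # most likely start position
end = outputs.end_logits.argmax(dim=-1).item()       # most likely end position
if end < start:                                      # inference constraint: end must not precede start
    end = start

answer_ids = inputs["input_ids"][0][start:end + 1]
print(tokenizer.decode(answer_ids))  # an arbitrary span until the QA heads are fine-tuned
```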
The predicted positions are random because the span prediction heads haven't been trained. In a fine-tuned model, the start position would point to "Google" and the end position would also point to "Google" (or to "California" if the question asked about location). The model learns to identify answer boundaries by training on question-context-answer triplets from datasets like SQuAD.
The SQuAD Format
The Stanford Question Answering Dataset (SQuAD) is the standard benchmark for extractive QA. Each example contains a context paragraph, a question, and the answer's start position in the context.
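The sketch below walks through that alignment on a made-up example (not an actual SQuAD entry), using the tokenizer's offset mapping and sequence IDs to convert character positions into token positions.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# One SQuAD-style example (hypothetical, not taken from the dataset)
context = "The sky is blue because molecules scatter short wavelengths of light."
question = "What color is the sky?"
answer_text = "blue"
answer_start = context.index(answer_text)            # character offset of the answer
answer_end = answer_start + len(answer_text)

encoding = tokenizer(question, context, return_offsets_mapping=True, return_tensors="pt")
offsets = encoding["offset_mapping"][0]
sequence_ids = encoding.sequence_ids(0)              # 0 = question tokens, 1 = context tokens

# Find the token span whose character offsets cover the answer
start_token = end_token = None
for idx, (start_char, end_char) in enumerate(offsets.tolist()):
    if sequence_ids[idx] != 1:
        continue                                     # skip question and special tokens
    if start_char <= answer_start < end_char:
        start_token = idx
    if start_char < answer_end <= end_char:
        end_token = idx

print(start_token, end_token)                        # token-level labels used during training
print(tokenizer.decode(encoding["input_ids"][0][start_token:end_token + 1]))  # "blue"
```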
The answer "blue" was correctly located at token positions within the context. During training, these start and end positions serve as labels: the model learns to maximize the probability of the correct positions. The key challenge is aligning character-level answer positions to token-level positions. The offset_mapping from the tokenizer provides this alignment, mapping each token to its character span in the original text.
Unanswerable Questions
SQuAD 2.0 introduced unanswerable questions: the answer might not exist in the context. To handle this, the model can predict the [CLS] position as both start and end, indicating "no answer."
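A hypothetical post-processing helper might look like the sketch below: if position 0 (the [CLS] token) wins both the start and end argmax, the prediction is treated as "no answer".

```python
def extract_answer(start_logits, end_logits, input_ids, tokenizer):
    """Return the predicted span, or an empty string when [CLS] wins (no answer)."""
    start = start_logits.argmax(dim=-1).item()
    end = end_logits.argmax(dim=-1).item()
    if start == 0 and end == 0:          # [CLS] sits at position 0
        return ""                        # the model predicts the question is unanswerable
    return tokenizer.decode(input_ids[start:end + 1])
```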
Fine-tuning Hyperparameters
Fine-tuning is sensitive to hyperparameter choices. Unlike pre-training, where you have billions of tokens, fine-tuning datasets are often small (thousands of examples), making overfitting a constant concern.
Learning Rate
The learning rate is the most critical hyperparameter. BERT's original paper recommends values in the range 2e-5 to 5e-5, much smaller than typical neural network training.
Pre-trained weights encode valuable knowledge. Large learning rates would rapidly overwrite this knowledge with task-specific but potentially less general patterns. Small learning rates allow gradual adaptation while preserving useful representations.
Batch Size and Training Steps
BERT fine-tuning typically uses batch sizes of 16 or 32. Larger batches can work with learning rate scaling, but smaller datasets may not benefit.
Training duration is measured in epochs. Most tasks converge within 2-4 epochs. More epochs risk overfitting, especially on small datasets.
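The numbers below are illustrative, not prescriptive: they simply show how steps per epoch and total steps fall out of dataset size, batch size, and epoch count.

```python
import math

# Assumed configurations: smaller datasets use smaller batches and more epochs
configs = [
    {"examples": 2_000,   "batch_size": 16, "epochs": 4},
    {"examples": 20_000,  "batch_size": 32, "epochs": 3},
    {"examples": 100_000, "batch_size": 32, "epochs": 3},
]

for cfg in configs:
    steps_per_epoch = math.ceil(cfg["examples"] / cfg["batch_size"])
    total_steps = steps_per_epoch * cfg["epochs"]
    print(f"{cfg['examples']:>7} examples | batch {cfg['batch_size']:>2} | "
          f"{cfg['epochs']} epochs | {total_steps:>6} total steps")
```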
Smaller datasets benefit from smaller batch sizes (16) to see more gradient updates per epoch, and more epochs (4) to learn from limited data. Larger datasets can use bigger batches (32) for stability and fewer epochs (3) since each epoch provides more learning signal. The total steps column shows how training duration scales with dataset size.
Warmup and Learning Rate Scheduling
Why not just use a constant learning rate? Two problems arise. First, in early training, the classifier head's weights are random, producing random gradients that might destabilize BERT's carefully tuned representations. Second, in late training, the model is close to convergence, and large updates can overshoot the optimum.
Learning rate scheduling addresses both problems by varying the learning rate throughout training. The standard BERT schedule has two phases:
- Warmup: The learning rate starts at zero and increases linearly to its peak value. This gives the classifier time to produce meaningful gradients before larger updates hit BERT's layers.
- Linear decay: The learning rate decreases linearly from its peak to zero. This allows fine-grained adjustments as the model converges.
The complete schedule can be expressed mathematically as:
$$\eta_t = \begin{cases} \eta_{\max} \cdot \dfrac{t}{t_{\text{warmup}}} & \text{if } t < t_{\text{warmup}} \\[4pt] \eta_{\max} \cdot \dfrac{T - t}{T - t_{\text{warmup}}} & \text{if } t \geq t_{\text{warmup}} \end{cases}$$
where:
- $\eta_t$: the learning rate at step $t$
- $\eta_{\max}$: the maximum learning rate (typically 2e-5 to 5e-5 for BERT)
- $t_{\text{warmup}}$: the number of warmup steps (typically 10% of total steps)
- $T$: the total number of training steps
Let's trace through the schedule. At step 0, $\eta_0 = 0$: the model doesn't update at all. By step $t_{\text{warmup}}$, the learning rate reaches its peak: $\eta_{\max}$. Then it begins declining, reaching zero exactly at step $T$. The denominator $T - t_{\text{warmup}}$ in the decay phase ensures the linear decrease is calibrated to hit zero at the final step.
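A few lines of code make the shape of this schedule concrete. The peak rate of 2e-5, 100 warmup steps, and 1,000 total steps below are assumptions for illustration; in practice, a library scheduler such as transformers' get_linear_schedule_with_warmup implements the same formula.

```python
def linear_warmup_decay(step, peak_lr=2e-5, warmup_steps=100, total_steps=1000):
    """Learning rate at a given step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps                              # warmup phase
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)  # decay phase

for step in (0, 50, 100, 550, 1000):
    print(f"step {step:>4}: lr = {linear_warmup_decay(step):.2e}")
# step 0 -> 0.00e+00, step 100 -> 2.00e-05 (peak), step 1000 -> 0.00e+00
```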
Layer-wise Learning Rate Decay
So far, we've treated all of BERT's parameters equally: every layer receives the same learning rate. But should they? Research on transfer learning suggests that different layers encode different kinds of information:
- Lower layers (near the input) encode general linguistic features: part-of-speech patterns, syntactic structures, word relationships. These are useful across many tasks.
- Upper layers (near the output) encode more abstract, task-specific features. In pre-training, these became tuned to masked language modeling and next sentence prediction.
For fine-tuning, this layered structure suggests a strategy: preserve the general features in lower layers by updating them slowly, while allowing upper layers to adapt more aggressively to the new task. We can implement this by giving each layer its own learning rate, with lower layers receiving smaller rates.
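One way to implement this is to give the optimizer one parameter group per component. The sketch below assumes a decay factor of 0.9 and omits the pooler for brevity; printing the groups shows the resulting pattern.

```python
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

base_lr = 2e-5
decay = 0.9                                      # layer-wise decay factor (0.9-0.95 is typical)
num_layers = model.config.num_hidden_layers      # 12 for BERT-Base

# One parameter group per component; lower layers receive smaller learning rates
param_groups = [{"params": list(model.classifier.parameters()), "lr": base_lr, "name": "classifier"}]
for i in range(num_layers):
    param_groups.append({
        "params": list(model.bert.encoder.layer[i].parameters()),
        "lr": base_lr * decay ** (num_layers - 1 - i),
        "name": f"layer {i}",
    })
param_groups.append({"params": list(model.bert.embeddings.parameters()),
                     "lr": base_lr * decay ** num_layers, "name": "embeddings"})

for group in sorted(param_groups, key=lambda g: g["lr"]):
    print(f"{group['name']:>10}: lr = {group['lr']:.2e}")
# optimizer = torch.optim.AdamW(param_groups) would then apply these per-group rates
```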
The pattern is clear: the embeddings, which encode the most fundamental token representations, receive the smallest learning rate (about 28% of the base). Each successive layer receives a slightly higher rate, culminating in the classifier head, which gets the full base learning rate since it must learn entirely from scratch.
The layer-wise learning rate for layer $\ell$ (counting from 0 at the bottom) is computed as:
$$\eta_\ell = \eta_{\text{base}} \cdot \lambda^{(L - 1) - \ell}$$
where:
- $\eta_\ell$: the learning rate for layer $\ell$
- $\eta_{\text{base}}$: the base learning rate (used for the classifier)
- $\lambda$: the decay rate (typically 0.9 or 0.95)
- $L$: the total number of transformer layers (12 for BERT-Base)
Let's unpack this formula. The exponent $(L - 1) - \ell$ counts how many layers are above layer $\ell$. For the top layer (layer 11 in BERT-Base), the exponent is 0, so $\eta_{11} = \eta_{\text{base}}$: it gets the full learning rate. For layer 10, $\eta_{10} = 0.9\,\eta_{\text{base}}$: it gets 90% of the full rate. For layer 0 (the bottom transformer layer), the exponent is 11, so $\eta_0 = 0.9^{11}\,\eta_{\text{base}} \approx 0.31\,\eta_{\text{base}}$: it gets about 31% of the full rate.
The embeddings sit below all transformer layers. By convention, they receive $\eta_{\text{emb}} = \eta_{\text{base}} \cdot \lambda^{L} = 0.9^{12}\,\eta_{\text{base}} \approx 0.28\,\eta_{\text{base}}$. This 28% rate reflects how important it is to preserve the token embeddings that encode fundamental word meanings.
Catastrophic Forgetting
Catastrophic forgetting occurs when fine-tuning overwrites the general knowledge BERT learned during pre-training. The model becomes highly specialized for the fine-tuning task but loses its ability to generalize.
Catastrophic forgetting: The phenomenon where a neural network, when trained on a new task, rapidly forgets previously learned information. In the context of BERT fine-tuning, this means losing pre-trained language understanding while adapting to a specific downstream task.
Signs of Catastrophic Forgetting
Several symptoms indicate catastrophic forgetting:
- Validation loss increases after initial decrease
- Out-of-domain performance drops on examples unlike the training data
- Model becomes overconfident on training-like examples but fails on variations
- Pre-training task performance degrades (masked language modeling accuracy drops)
Prevention Strategies
Several techniques mitigate catastrophic forgetting:
1. Use small learning rates: The most important factor. Rates of 2e-5 to 5e-5 allow gradual adaptation.
2. Train for few epochs: 2-4 epochs is typically sufficient. More epochs increase forgetting risk.
3. Early stopping: Monitor validation loss and stop when it starts increasing.
4. Regularization: Weight decay penalizes large weights, keeping them from growing unbounded. The AdamW optimizer applies weight decay directly to the weights:
$$\theta_{t+1} = \theta_t - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \theta_t \right)$$
where:
- $\theta_t$: the model parameters at step $t$
- $\eta$: the learning rate
- $\hat{m}_t$, $\hat{v}_t$: bias-corrected first and second moment estimates from Adam
- $\lambda$: the weight decay coefficient (0.01 is standard for BERT)
- $\epsilon$: a small constant for numerical stability
The update has two parts. The first term, $\hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$, is the standard Adam update, moving parameters in the direction that reduces the loss. The second term, $\lambda \theta_t$, shrinks every weight toward zero by a fraction proportional to its current magnitude. Because fine-tuning runs for only a few epochs, this shrinkage keeps the weights from drifting far from where they started, which for BERT means staying close to the pre-trained solution. The net effect: the model can adapt to the task, but large, unchecked weight changes are penalized.
5. Layer freezing: Freeze lower BERT layers that encode general features, only fine-tuning upper layers.
Freezing the bottom 6 layers reduces trainable parameters by roughly half (see the sketch after this list). This significantly decreases memory requirements and training time while preserving the general language understanding encoded in lower layers. The remaining trainable parameters in layers 6-11 and the classifier can still adapt to the task.
6. Mixout regularization: Randomly replace model weights with pre-trained weights during training, maintaining proximity to the original model.
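The sketch below shows one way to implement the layer freezing from strategy 5, assuming the embedding layer is frozen along with the bottom six encoder layers.

```python
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Freeze the embeddings and the bottom 6 transformer layers
for param in model.bert.embeddings.parameters():
    param.requires_grad = False
for layer in model.bert.encoder.layer[:6]:
    for param in layer.parameters():
        param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable parameters: {trainable:,} of {total:,} ({trainable / total:.0%})")
```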
Gradual Unfreezing
A sophisticated approach progressively unfreezes layers during training:
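A sketch of such a schedule is shown below; the exact pace (classifier-only for the first two epochs, then one additional layer per epoch from the top down) is an assumption, not a fixed recipe.

```python
from transformers import BertForSequenceClassification

def apply_unfreezing_schedule(model, epoch):
    """Freeze all of BERT, then unfreeze transformer layers from the top down as epochs pass."""
    for param in model.bert.parameters():
        param.requires_grad = False              # only BERT is frozen; the classifier always trains

    num_layers = model.config.num_hidden_layers  # 12 for BERT-Base
    layers_to_unfreeze = max(0, epoch - 2)       # epochs 1-2: classifier only (assumed pace)
    for layer in model.bert.encoder.layer[num_layers - layers_to_unfreeze:]:
        for param in layer.parameters():
            param.requires_grad = True

model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
for epoch in range(1, 11):
    apply_unfreezing_schedule(model, epoch)
    unfrozen = sum(layer.output.dense.weight.requires_grad for layer in model.bert.encoder.layer)
    print(f"epoch {epoch:>2}: {unfrozen} of 12 transformer layers trainable")
```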
The schedule starts by training only the classifier head in early epochs, letting it adapt to the task while BERT's weights remain frozen. As training progresses, more transformer layers are unfrozen from the top (layer 11) down. By epoch 10, most layers are trainable. This gradual approach reduces forgetting risk because the classifier establishes useful gradients before deeper layers begin updating.
Practical Recommendations
Successful fine-tuning requires balancing several factors. Here are concrete recommendations based on the original BERT paper and subsequent research.
Standard Recipe
For most classification and sequence labeling tasks:
| Parameter | Recommended Value |
|---|---|
| Learning rate | 2e-5 to 5e-5 |
| Batch size | 16 or 32 |
| Epochs | 2-4 |
| Warmup ratio | 10% of total steps |
| Weight decay | 0.01 |
| Max sequence length | 128-512 (task-dependent) |
| Dropout | 0.1 |
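Expressed as a transformers Trainer configuration, the recipe might look like the sketch below (the output directory is a placeholder; maximum sequence length and dropout are set on the tokenizer and model rather than here).

```python
from transformers import TrainingArguments

# Trainer settings mirroring the standard recipe above (illustrative values)
args = TrainingArguments(
    output_dir="bert-finetuned",        # hypothetical output path
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    num_train_epochs=3,
    warmup_ratio=0.1,                   # 10% of total steps spent on warmup
    weight_decay=0.01,
    max_grad_norm=1.0,                  # gradient clipping
)
```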
Dataset Size Guidelines
The amount of training data affects hyperparameter choices:
- Small datasets (< 1K examples): Use lower learning rates (1e-5), more epochs (4-5), and consider layer freezing
- Medium datasets (1K-10K examples): Standard settings work well
- Large datasets (> 10K examples): Can use larger learning rates (5e-5), fewer epochs (2-3)
Task-Specific Considerations
Different tasks benefit from different approaches:
Classification: Standard recipe works well. Focus on class imbalance if present.
Sequence labeling (NER, POS): Handle subword alignment carefully. Consider CRF layer on top for structured prediction.
Question answering: Use longer max sequence lengths (384-512). Handle impossible questions by allowing null predictions.
Sentence pair tasks: Leverage segment embeddings. Consider whether task requires symmetric (similarity) or asymmetric (entailment) modeling.
When Fine-tuning Fails
If results are poor, try these debugging steps:
- Check data quality: Are labels correct? Is the task well-defined?
- Try different learning rates: Run a sweep from 1e-5 to 5e-5
- Increase training data: Can you augment or synthesize examples?
- Use a different model size: BERT-Large may help for complex tasks
- Consider domain mismatch: Pre-train further on domain-specific data first
Limitations and Practical Considerations
Fine-tuning BERT has constraints that affect real-world deployment.
Compute requirements remain substantial. Fine-tuning BERT-Base requires a GPU with at least 16GB memory for reasonable batch sizes. Training takes hours even on modern hardware. For resource-constrained settings, consider DistilBERT or smaller variants.
Sequence length limits cap input at 512 tokens. Longer documents require chunking strategies, which may lose cross-chunk context. For document-level tasks, consider Longformer or hierarchical approaches.
Domain mismatch between pre-training data (Wikipedia, books) and target domain (medical, legal, code) may require continued pre-training before fine-tuning. Domain-specific BERT variants like BioBERT, LegalBERT, and CodeBERT address this for common domains.
Label imbalance in task data can skew the model toward majority classes. Use weighted loss functions, oversampling, or stratified batching to address imbalance.
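For example, a class-weighted loss is a few lines in PyTorch; the class counts below are assumed for illustration.

```python
import torch

# Class weights inversely proportional to class frequency (assumed counts)
class_counts = torch.tensor([900.0, 100.0])            # e.g. 90% negative, 10% positive
weights = class_counts.sum() / (len(class_counts) * class_counts)

loss_fn = torch.nn.CrossEntropyLoss(weight=weights)    # the rare class contributes more to the loss
logits = torch.randn(4, 2)                             # stand-in for model outputs
labels = torch.tensor([0, 0, 1, 0])
print(loss_fn(logits, labels))
```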
Reproducibility is challenging due to the stochastic nature of fine-tuning. Small changes in random seeds, data order, or hyperparameters can cause significant performance variation. The original BERT paper reported high variance across runs on some tasks, recommending multiple runs with different seeds.
Despite these limitations, fine-tuning remains the most practical approach for adapting pre-trained models to specific tasks. The alternative, training from scratch, requires orders of magnitude more data and compute. Fine-tuning leverages the substantial investment already made in pre-training, allowing you to build effective models with limited task-specific resources.
Key Parameters
The most important hyperparameters for BERT fine-tuning:
- learning_rate (2e-5 to 5e-5): Controls how quickly the model adapts. Too high causes instability and forgetting; too low learns too slowly. Start with 2e-5 for sensitive tasks, 5e-5 for robust ones.
- num_epochs (2-4): Number of passes through the training data. More epochs increase forgetting risk. Use early stopping to find the optimal point.
- batch_size (16-32): Number of examples per gradient update. Larger batches are more stable but require more memory. Gradient accumulation can simulate larger batches.
- warmup_ratio (0.1): Fraction of training spent warming up the learning rate. Prevents early instability from noisy gradients.
- weight_decay (0.01): L2 regularization coefficient. Penalizes large weights, helping preserve pre-trained knowledge.
- max_length (128-512): Maximum sequence length after tokenization. Longer sequences require more memory and compute. Use the shortest length that captures your data.
- dropout (0.1): Applied in the classifier head. Increase for small datasets to reduce overfitting.
- gradient_clip (1.0): Maximum gradient norm. Prevents exploding gradients during training.
Summary
Fine-tuning adapts BERT's pre-trained knowledge to specific tasks through continued training on labeled data. The key insights for effective fine-tuning:
- Task heads map BERT outputs to task-specific predictions: classification uses [CLS], sequence labeling uses all tokens, and QA predicts answer spans
- Hyperparameters matter: Small learning rates (2e-5 to 5e-5), few epochs (2-4), and warmup prevent catastrophic forgetting while enabling task adaptation
- Catastrophic forgetting occurs when fine-tuning overwrites pre-trained knowledge. Prevent it with appropriate learning rates, early stopping, and layer freezing
- Layer-wise strategies like differential learning rates and gradual unfreezing provide fine-grained control over the adaptation process
- Practical constraints include compute requirements, sequence length limits, and domain mismatch, all of which have established solutions
Fine-tuning bridges the gap between general language understanding and specific applications. A single afternoon of fine-tuning can produce state-of-the-art results on tasks that would otherwise require months of data collection and model development. This efficiency is why BERT and its successors have become the default starting point for most NLP applications.