Discover how RoBERTa surpassed BERT using the same architecture by removing Next Sentence Prediction, implementing dynamic masking, training with larger batches, and using 10x more data. Learn the complete RoBERTa training recipe and when to choose RoBERTa over BERT.

RoBERTa
What if BERT was undertrained? That was the central question Facebook AI posed when they introduced RoBERTa in 2019. The original BERT achieved remarkable results, but its training recipe was established through limited experimentation. By systematically investigating each design choice, the RoBERTa team discovered that BERT's architecture was capable of much more. The secret wasn't a new architecture or a novel objective. It was simply training more carefully.
RoBERTa (Robustly Optimized BERT Pretraining Approach) matches and often exceeds BERT's performance without any architectural changes. The improvements come entirely from training decisions: removing the Next Sentence Prediction task, using dynamic masking, training with larger batches, training longer, and using more data. These changes seem incremental, but together they produce a model that outperforms BERT on virtually every benchmark.
In this chapter, we'll dissect each component of the RoBERTa recipe, understand why these changes matter, and implement the key differences between BERT and RoBERTa training.
The Undertrained BERT Hypothesis
BERT established the pretrain-then-fine-tune paradigm that dominates modern NLP. But its training setup was constrained by available compute and the need to explore many design choices quickly. The original paper trained BERT-base for about 1 million steps with a batch size of 256, on a corpus of roughly 3.3 billion words.
The RoBERTa team asked: what if we just trained longer, with more data, and removed potentially harmful design choices? Their experiments revealed that BERT's reported performance was nowhere near the ceiling for its architecture.
The improvements are striking. RoBERTa gains over 10 points on RTE, over 11 points on CoLA, and 5+ points on STS-B. These aren't marginal improvements; they represent meaningful capability differences. And all of this comes from training changes, not architectural innovations.
Removing Next Sentence Prediction
BERT was trained with two objectives: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). The NSP task classifies whether two sentences appear consecutively in the original text or are randomly paired. The intuition was that NSP would help the model understand document-level coherence.
Next Sentence Prediction (NSP): a binary classification task in which the model predicts whether sentence B follows sentence A in the original document. BERT used 50% real consecutive pairs and 50% random pairs, with the [CLS] token's representation used for classification.
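To make the task concrete, here is a minimal sketch of how NSP training pairs might be constructed. It illustrates the 50/50 scheme described above, not BERT's actual preprocessing code, and the toy documents are made up:

```python
import random

def make_nsp_pairs(documents, num_pairs, seed=0):
    """Build NSP examples: 50% real consecutive sentence pairs (label 1),
    50% pairs where sentence B comes from a randomly chosen document (label 0)."""
    rng = random.Random(seed)
    usable = [doc for doc in documents if len(doc) >= 2]  # need a "next" sentence
    pairs = []
    for _ in range(num_pairs):
        doc = rng.choice(usable)
        idx = rng.randrange(len(doc) - 1)
        sent_a = doc[idx]
        if rng.random() < 0.5:
            sent_b, is_next = doc[idx + 1], 1                       # genuinely consecutive
        else:
            sent_b, is_next = rng.choice(rng.choice(documents)), 0  # random sentence
        pairs.append((sent_a, sent_b, is_next))
    return pairs

# Made-up toy documents: note how easily a topic shift gives the answer away.
docs = [
    ["The recipe calls for two eggs.", "Whisk them until fluffy."],
    ["Neutron stars are incredibly dense.", "A teaspoon of one would weigh billions of tons."],
]
for sent_a, sent_b, label in make_nsp_pairs(docs, 4):
    print(label, "|", sent_a, "->", sent_b)
```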
RoBERTa's first major finding was that NSP hurts more than it helps. When the researchers trained BERT with MLM only, removing NSP entirely, performance improved on most downstream tasks.
Why would a seemingly useful objective hurt performance? Several factors contribute:
The NSP task is too easy. Randomly sampled sentences often come from completely different documents and topics. The model can solve NSP by detecting topic shifts rather than learning genuine discourse coherence. A sentence about cooking paired with one about astrophysics is trivially distinguishable without understanding narrative flow.
NSP constrains the input format. BERT's NSP training requires inputs to be sentence pairs, typically short. This prevents the model from seeing longer contiguous passages that might teach it more about extended context and document structure.
NSP dilutes the learning signal. Half the model's training updates come from a task that may not transfer well to downstream applications. Those gradients could instead strengthen the MLM objective.
The RoBERTa paper tested several input formats:

- SEGMENT-PAIR + NSP: BERT's original setup, pairs of multi-sentence segments trained with the NSP loss
- SENTENCE-PAIR + NSP: pairs of single natural sentences trained with the NSP loss
- FULL-SENTENCES: contiguous sentences packed up to 512 tokens, possibly crossing document boundaries, with no NSP loss
- DOC-SENTENCES: like FULL-SENTENCES, but sequences never cross document boundaries, with no NSP loss

The DOC-SENTENCES format performed best, though the released RoBERTa uses FULL-SENTENCES because DOC-SENTENCES requires dynamically varying the batch size. Models trained without NSP and on longer contiguous passages learn better representations. The takeaway: NSP was a red herring. The simple MLM objective on longer contexts works better.
Dynamic Masking
BERT used static masking: the training data was preprocessed once, with masks applied and saved. Each training example always had the same tokens masked, even across multiple epochs. This limited the diversity of training signal.
RoBERTa introduced dynamic masking: masks are generated on-the-fly during training. Each time the model sees a sequence, different tokens are masked. Over multiple epochs, the model sees the same underlying text with many different masking patterns.
Let's visualize the difference over multiple training epochs:
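Here is a small sketch that prints which positions would be masked under each strategy. It is illustrative only, not the actual RoBERTa data pipeline, and the masking rate is exaggerated so the contrast is visible on a short sentence:

```python
import random

tokens = ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
MASK_PROB = 0.3  # exaggerated for illustration; RoBERTa uses 0.15

def sample_mask_positions(n, rng, mask_prob=MASK_PROB):
    k = max(1, round(n * mask_prob))
    return set(rng.sample(range(n), k))

def render(tokens, masked_positions):
    return " ".join("[MASK]" if i in masked_positions else tok
                    for i, tok in enumerate(tokens))

# Static masking: positions are chosen once at preprocessing time and reused.
static_positions = sample_mask_positions(len(tokens), random.Random(0))

# Dynamic masking: positions are re-sampled every time the sequence is seen.
dynamic_rng = random.Random(1)
for epoch in range(1, 4):
    print(f"epoch {epoch}")
    print("  static :", render(tokens, static_positions))
    print("  dynamic:", render(tokens, sample_mask_positions(len(tokens), dynamic_rng)))
```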
With static masking, the model sees identical inputs across all epochs. With dynamic masking, each epoch presents new challenges. Position 3 might be masked in epoch 1, positions 2 and 5 in epoch 2, and so on. This multiplies the effective training diversity without requiring more data.
The empirical benefit of dynamic masking is modest but consistent. The RoBERTa paper found that it matches or slightly outperforms static masking across benchmarks. More importantly, it's simpler: there's no need to preprocess and store multiple masked versions of the data. Masking happens during training, reducing storage requirements and preprocessing complexity.
Larger Batches
BERT-base was trained with a batch size of 256 sequences. RoBERTa found that much larger batches improve both training stability and final performance. The key insight is that larger batches provide better gradient estimates, especially important for MLM where only 15% of tokens contribute to each update.
Why do larger batches help MLM specifically? Consider the gradient signal per update:
- With a batch size of 256, a sequence length of 512, and 15% masking, each update aggregates gradients from roughly 19,700 masked tokens
- With a batch size of 8192, this grows to roughly 629,000 masked tokens
The larger gradient estimates are less noisy, allowing the optimizer to take more confident steps. This translates to faster convergence and better final performance.
RoBERTa used batch sizes up to 8192 sequences. To maintain the same number of parameter updates, larger batches require adjusting the learning rate. The linear scaling rule suggests multiplying the learning rate by the batch size increase factor, though warmup and careful tuning remain important.
| Batch Size | Masked Tokens per Update | Relative Gradient Noise |
|---|---|---|
| 256 | ~19,700 | High |
| 2048 | ~157,000 | Medium |
| 8192 | ~629,000 | Low |
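The numbers in the table follow directly from batch size × sequence length × masking rate. The sketch below works through that arithmetic and the linear learning-rate scaling rule mentioned above; the base learning rate and batch size are illustrative values, not RoBERTa's exact settings:

```python
SEQ_LEN = 512      # tokens per sequence
MASK_PROB = 0.15   # fraction of tokens that receive gradients under MLM

def masked_tokens_per_update(batch_size, seq_len=SEQ_LEN, mask_prob=MASK_PROB):
    """Expected number of tokens contributing a gradient in one MLM update."""
    return int(batch_size * seq_len * mask_prob)

def linearly_scaled_lr(base_lr, base_batch, new_batch):
    """Linear scaling rule: grow the learning rate in proportion to the batch size.
    A starting heuristic only; warmup and careful tuning still matter in practice."""
    return base_lr * new_batch / base_batch

for bs in (256, 2048, 8192):
    print(f"batch {bs:>5}: ~{masked_tokens_per_update(bs):,} masked tokens per update")

# Hypothetical base setting of 1e-4 at batch size 256:
print(f"scaled LR at batch 8192: {linearly_scaled_lr(1e-4, 256, 8192):.1e}")
```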
More Data, Longer Training
BERT was trained on BookCorpus (800M words) and English Wikipedia (2,500M words), totaling about 16GB of uncompressed text. RoBERTa expanded this substantially by adding three more datasets:
- CC-News: 76GB of news articles crawled from Common Crawl
- OpenWebText: 38GB of web content from Reddit-linked URLs
- Stories: 31GB of story-like content from Common Crawl
The combined dataset is roughly 160GB, ten times larger than BERT's original training data. But data alone isn't enough. RoBERTa also trained for significantly more steps.
The paper systematically varied training duration to understand its impact. They found that longer training consistently improved performance, even past the point where loss on the training data plateaued. This suggests that the model continues to learn useful representations even when the pretraining objective stops improving.
The Complete RoBERTa Recipe
Let's consolidate all the changes that transform BERT into RoBERTa:
| Component | BERT | RoBERTa |
|---|---|---|
| Next Sentence Prediction | Yes | No |
| Masking Strategy | Static | Dynamic |
| Batch Size | 256 | 8192 |
| Training Steps | 1M | 500K (but larger batches) |
| Training Corpus Size (tokens) | ~3.3B | ~31B |
| Training Data | 16GB | 160GB |
| Input Format | Sentence pairs | Full sentences |
| Subword Vocabulary | Character-level BPE, ~30K | Byte-level BPE, ~50K |
The RoBERTa paper carefully ablated each change to understand its individual contribution, tracking how performance on MNLI improves as each modification is applied cumulatively.
Each optimization contributes to the final result. Removing NSP provides a small but consistent gain. Dynamic masking adds marginally more. The largest improvements come from scaling: larger batches stabilize gradients, more data provides diverse training signal, and longer training allows the model to fully absorb this information. Together, these changes yield a 3-point improvement, a substantial gain for a model with identical architecture.
Notice that RoBERTa uses fewer training steps than BERT. This might seem to contradict "training longer," but the batch size is 32 times larger, so each step processes far more data: in total, roughly 16 times as many tokens pass through the model despite half as many steps.
Combined with the simplification of removing NSP and the improved signal from dynamic masking, this massive increase in training scale produces substantially better representations.
Implementing RoBERTa-style Training
Let's implement the key differences between BERT and RoBERTa training. We'll focus on the data loading and masking pipeline, which captures the most important changes.
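As a first building block, here is a minimal dynamic-masking function in the spirit of RoBERTa's pipeline, not the released implementation. It assumes roberta-base token ID conventions and applies the standard 80/10/10 corruption scheme fresh on every batch:

```python
import torch

# roberta-base token ID conventions (assumed): <s>=0, <pad>=1, </s>=2, <mask>=50264
BOS_ID, PAD_ID, EOS_ID, MASK_ID = 0, 1, 2, 50264
SPECIAL_IDS = (BOS_ID, PAD_ID, EOS_ID)
VOCAB_SIZE = 50265
MASK_PROB = 0.15

def dynamic_mask(input_ids: torch.Tensor, mask_prob: float = MASK_PROB):
    """Apply MLM corruption to a batch of token IDs. Because this runs on every
    batch, the same text receives a different mask each epoch (dynamic masking)."""
    input_ids = input_ids.clone()
    labels = input_ids.clone()

    # Select positions to predict, never touching special tokens.
    is_special = torch.zeros_like(input_ids, dtype=torch.bool)
    for tok_id in SPECIAL_IDS:
        is_special |= input_ids == tok_id
    probs = torch.full(input_ids.shape, mask_prob)
    probs.masked_fill_(is_special, 0.0)
    selected = torch.bernoulli(probs).bool()
    labels[~selected] = -100  # positions the loss should ignore

    # Of the selected positions: 80% -> <mask>, 10% -> random token, 10% -> unchanged.
    to_mask = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & selected
    input_ids[to_mask] = MASK_ID
    to_randomize = (torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool()
                    & selected & ~to_mask)
    random_tokens = torch.randint(VOCAB_SIZE, input_ids.shape)
    input_ids[to_randomize] = random_tokens[to_randomize]
    return input_ids, labels

# Toy usage with made-up IDs: <s> ... </s> plus padding.
batch = torch.tensor([[0, 3104, 812, 9, 1470, 16, 2, 1, 1, 1]])
corrupted, labels = dynamic_mask(batch)
print(corrupted)
print(labels)
```

The transformers library's DataCollatorForLanguageModeling implements essentially the same logic, so in practice you rarely need to write this yourself.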
Now let's implement the full-sentences input format that RoBERTa uses instead of sentence pairs:
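Here is a sketch of that packing step under simplified assumptions: sentences are already tokenized into ID lists, document boundaries are ignored, and roberta-base's special token IDs (<s> = 0, </s> = 2) are hard-coded:

```python
def pack_full_sentences(tokenized_sentences, max_seq_len=512, bos_id=0, eos_id=2):
    """Greedily pack contiguous sentences into sequences of up to max_seq_len
    tokens, wrapped as <s> ... </s>. No sentence pairs, no NSP labels."""
    examples, current = [], [bos_id]
    for sent_ids in tokenized_sentences:
        if len(current) + len(sent_ids) + 1 > max_seq_len and len(current) > 1:
            examples.append(current + [eos_id])      # close the full sequence
            current = [bos_id]                       # start a new one
        current.extend(sent_ids[: max_seq_len - 2])  # guard against overlong sentences
    if len(current) > 1:
        examples.append(current + [eos_id])
    return examples

# Toy usage with made-up token IDs (a real pipeline would tokenize text first):
sentences = [[100, 101, 102], [200, 201], [300, 301, 302, 303]]
for example in pack_full_sentences(sentences, max_seq_len=8):
    print(example)
# [0, 100, 101, 102, 200, 201, 2]
# [0, 300, 301, 302, 303, 2]
```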
The first example starts with the <s> token (RoBERTa's equivalent of [CLS], ID 0), contains packed text from the documents, and ends with </s> (the [SEP] equivalent, ID 2). The key insight is that RoBERTa's input format is simpler than BERT's: no segment embeddings for NSP, no alternating sentence A and B. Just pack as much contiguous text as possible and apply MLM.
Comparing BERT and RoBERTa Training
Let's put together a minimal training loop that highlights the differences:
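The sketch below contrasts the two steps using Hugging Face model classes; the batch dictionary keys and the optimizer setup are assumptions for illustration, not the original training code. BertForPreTraining needs segment IDs, MLM labels, and an NSP label, while RobertaForMaskedLM only needs MLM labels:

```python
from transformers import BertForPreTraining, RobertaForMaskedLM

def bert_step(model: BertForPreTraining, batch, optimizer):
    """One BERT pretraining step: joint MLM + NSP loss, segment IDs required."""
    out = model(input_ids=batch["input_ids"],
                attention_mask=batch["attention_mask"],
                token_type_ids=batch["token_type_ids"],    # sentence A/B segments
                labels=batch["mlm_labels"],                 # masked-token targets
                next_sentence_label=batch["nsp_labels"])    # is-next classification
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()

def roberta_step(model: RobertaForMaskedLM, batch, optimizer):
    """One RoBERTa pretraining step: MLM only, no segments, no secondary loss."""
    out = model(input_ids=batch["input_ids"],
                attention_mask=batch["attention_mask"],
                labels=batch["mlm_labels"])
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()

# e.g. model = RobertaForMaskedLM.from_pretrained("roberta-base")
#      optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
```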
The RoBERTa training step is cleaner. No NSP labels to prepare, no segment embeddings to track, no secondary loss to balance. This simplicity makes training easier to debug and scale.
Using Pre-trained RoBERTa
In practice, you'll likely use RoBERTa through the Hugging Face transformers library rather than training from scratch. Here's how to load and use it:
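A short example using the transformers fill-mask pipeline; the prompt sentence is ours, but any sentence containing RoBERTa's <mask> token works:

```python
from transformers import pipeline

# Load the pretrained roberta-base checkpoint as a fill-mask pipeline.
fill_mask = pipeline("fill-mask", model="roberta-base")

# RoBERTa's mask token is "<mask>", not BERT's "[MASK]".
predictions = fill_mask("The capital of France is <mask>.")

for pred in predictions[:3]:
    print(f"{pred['token_str']!r:>10}  score={pred['score']:.3f}")
```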
RoBERTa correctly predicts "Paris" with high confidence. The model has learned rich representations of factual knowledge through its MLM pretraining.
Extracting Representations
For downstream tasks, you often want the hidden state representations rather than MLM predictions. Here's how to extract them:
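One common recipe, sketched below: encode sentences with the base model, mean-pool the final hidden states over non-padding tokens, and compare the pooled vectors with cosine similarity. The three example sentences are placeholders chosen to mirror the comparison described next:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base")

sentences = [
    "A cat is sitting on the mat.",          # animal on a surface
    "The dog lay down on the rug.",          # animal on a surface
    "Gradient descent minimizes the loss.",  # machine learning
]

encoded = tokenizer(sentences, padding=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**encoded).last_hidden_state       # (batch, seq_len, 768)

# Mean-pool over real tokens only, using the attention mask to exclude padding.
mask = encoded["attention_mask"].unsqueeze(-1).float()
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

for i in range(len(sentences)):
    for j in range(i + 1, len(sentences)):
        sim = torch.nn.functional.cosine_similarity(embeddings[i], embeddings[j], dim=0)
        print(f"similarity({i}, {j}) = {sim.item():.3f}")
```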
The similarities make sense: the two sentences about animals on surfaces are more similar to each other than to the machine learning sentence. RoBERTa's representations capture semantic relationships effectively.
Limitations and Impact
RoBERTa's contribution is both its strength and its limitation. The paper demonstrated that BERT was undertrained, but it did so through brute force: more data, more compute, more time. This isn't a scalable research methodology. Not every lab can replicate the conditions needed to train a model for days on 1024 V100 GPUs.
The removal of NSP remains somewhat controversial. While RoBERTa showed that NSP hurts on GLUE benchmarks, some researchers argue that sentence-level objectives matter for tasks requiring cross-sentence reasoning. Models like ALBERT reintroduced sentence-level objectives in modified forms, suggesting the story isn't complete.
RoBERTa also doesn't address MLM's fundamental limitations. The model still cannot generate text autoregressively. It still processes fixed-length sequences. It still requires significant compute for inference. These limitations drove the field toward encoder-decoder models like T5 and decoder-only models like GPT.
Yet RoBERTa's impact was substantial. It established that careful training matters as much as architecture design. It provided a stronger baseline that subsequent papers had to beat. Its open release democratized access to high-quality pretrained models. And it showed that simple, principled improvements often outperform complex architectural changes.
The lesson extends beyond NLP: before proposing new architectures or objectives, ensure existing approaches are properly optimized. Many research "improvements" might be artifacts of undertrained baselines. RoBERTa proved this point emphatically, and the field has been more careful about training protocols ever since.
Key Parameters
When implementing or fine-tuning RoBERTa, these parameters most significantly impact performance:
- mask_prob (default: 0.15): Fraction of tokens to mask per sequence. The 15% rate balances training signal strength against context preservation. Higher rates mask more tokens per update but risk destroying too much context for accurate predictions.
- batch_size (RoBERTa default: 8192): Number of sequences per training step. Larger batches provide more stable gradient estimates, particularly important for MLM where only 15% of tokens contribute gradients. Requires proportional learning rate scaling.
- max_seq_len (default: 512): Maximum sequence length. Longer sequences capture more context but require quadratically more memory for attention. RoBERTa packs contiguous text up to this limit.
- vocab_size (RoBERTa: 50265): Size of the BPE vocabulary. RoBERTa uses a 50K vocabulary trained on its larger dataset, slightly larger than BERT's ~30K.
- special_token_ids: Token IDs that should never be masked, including <s> (CLS), </s> (SEP), and <pad>. These tokens serve structural purposes, and masking them would disrupt the input format.
- learning_rate: Typically 1e-4 to 6e-4 for pretraining. When scaling batch size, apply the linear scaling rule: if you double the batch size, double the learning rate (with appropriate warmup).
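These knobs can be collected into a single configuration object. The sketch below uses RoBERTa-style defaults drawn from the list above; the class and field names are our own:

```python
from dataclasses import dataclass

@dataclass
class RobertaPretrainConfig:
    mask_prob: float = 0.15               # fraction of tokens masked per sequence
    batch_size: int = 8192                # sequences per training step
    max_seq_len: int = 512                # maximum packed sequence length
    vocab_size: int = 50265               # byte-level BPE vocabulary
    special_token_ids: tuple = (0, 1, 2)  # <s>, <pad>, </s> -- never masked
    learning_rate: float = 6e-4           # peak pretraining LR (pair with warmup)

print(RobertaPretrainConfig())
```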
Summary
RoBERTa demonstrated that BERT was undertrained by systematically optimizing its pretraining recipe. The key changes include:
- Removing Next Sentence Prediction: NSP proved to be a hindrance rather than a help. Training on full sentences without NSP produces better representations for downstream tasks.
- Dynamic masking: Generating fresh masks each training step instead of using static masks increases training signal diversity and slightly improves results.
- Larger batch sizes: Training with batch sizes up to 8192 provides more stable gradients, especially important given MLM's sparse 15% masking rate.
- More data and longer training: Expanding training data 10x and processing far more total tokens dramatically improves model quality.
- Full-sentence format: Packing contiguous text into sequences without artificial sentence boundaries allows the model to learn from longer coherent contexts.
None of these changes require modifying BERT's architecture. RoBERTa uses the exact same transformer encoder with the exact same hidden dimensions, attention heads, and layer counts. The improvements come entirely from training methodology.
The practical implication is clear: when using pretrained models for NLP tasks, RoBERTa generally outperforms BERT with no additional complexity. Its representations are richer, its downstream performance is higher, and it requires no special handling for sentence pairs. For most applications, RoBERTa is the better starting point for fine-tuning.