BART Pre-training: Denoising Strategies & Text Infilling

Michael Brenndoerfer · October 19, 2025 · 42 min read

Learn BART's denoising pre-training approach including text infilling, token masking, sentence permutation, and how corruption schemes enable generation.


BART Pre-training

In the previous chapter, we explored BART's bidirectional encoder and autoregressive decoder architecture. Now we turn to what makes BART distinctive: its pre-training strategy. While BERT uses masked language modeling and GPT uses causal language modeling, BART takes a fundamentally different approach by framing pre-training as denoising autoencoding. The model learns to reconstruct original text from deliberately corrupted inputs, and the specific corruption schemes determine what linguistic knowledge BART acquires.

What makes BART's approach powerful is its flexibility. Rather than committing to a single corruption strategy, the researchers experimented with five different noising functions and their combinations. Each corruption type forces the model to learn different aspects of language, from local token patterns to document-level structure. This chapter examines each noising scheme in detail, explores why certain combinations work better than others, and implements the key corruption functions.

The Denoising Framework

At its core, BART's pre-training objective is straightforward in principle: given a corrupted document $\tilde{x}$, reconstruct the original document $x$. This framing transforms the abstract challenge of "learning language" into a concrete task: undo the corruption and recover what was originally there. The power of this approach comes from how the simple objective creates rich learning signals, depending on how we choose to corrupt the input.

The model minimizes the negative log-likelihood loss:

$$\mathcal{L} = -\sum_{t=1}^{|x|} \log P(x_t \mid x_{<t}, \tilde{x})$$

where:

  • $\mathcal{L}$: the loss function to minimize (negative log-likelihood)
  • $x$: the original uncorrupted document
  • $|x|$: the number of tokens in the original document
  • $x_t$: the $t$-th token of the original document
  • $x_{<t}$: all tokens before position $t$ (the decoder's previous outputs)
  • $\tilde{x}$: the corrupted version of the document (encoder input)
  • $P(x_t \mid x_{<t}, \tilde{x})$: the probability the model assigns to the correct token $x_t$

To understand this formula intuitively, consider what happens at each decoding step. The model has access to two sources of information: the corrupted input $\tilde{x}$, which has been processed by the encoder and contains whatever context survived the corruption process, and the previously generated tokens $x_{<t}$, which represent the reconstruction built so far. Using these two information streams, the model must predict the probability distribution over all possible next tokens, and we measure how much probability mass it assigns to the correct token $x_t$. Taking the logarithm converts multiplicative probabilities into additive quantities, and the negative sign transforms our goal from maximizing likelihood to minimizing loss, a standard convention in optimization.

This is simply cross-entropy loss over the original tokens, conditioned on both the previously generated tokens and the corrupted input. The encoder processes $\tilde{x}$, and the decoder generates $x$ token by token. The summation over all $|x|$ tokens means we receive a training signal at every position in the sequence, not just at positions that were corrupted. This dense supervision distinguishes BART from approaches like BERT that only compute loss at masked positions.
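To make the objective concrete, here is a minimal numpy sketch of the summed negative log-likelihood for a single sequence. It assumes we already have a matrix of decoder scores, one row per output position; in BART these scores would come from the decoder attending to the encoder's representation of the corrupted input, and the toy shapes and values below are purely illustrative.

import numpy as np


def reconstruction_nll(logits, targets):
    """Summed negative log-likelihood of the original tokens given decoder scores.

    logits:  (seq_len, vocab_size) unnormalized scores, one row per output position
    targets: (seq_len,) integer ids of the original (uncorrupted) tokens
    """
    # Numerically stable log-softmax over the vocabulary dimension
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Sum of -log P(x_t | x_<t, x_tilde) over every position of the original sequence
    return float(-log_probs[np.arange(len(targets)), targets].sum())


# Toy example: 4 output positions, vocabulary of 6 token ids
rng = np.random.default_rng(0)
toy_logits = rng.normal(size=(4, 6))
toy_targets = np.array([2, 0, 5, 1])
print(reconstruction_nll(toy_logits, toy_targets))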

Denoising Autoencoder

A model that learns to reconstruct clean data from corrupted versions. The corruption process is called "noising," and the reconstruction process is "denoising." The model learns robust representations by being forced to fill in missing or corrupted information.

The critical design choice is selecting the noising function $g$ that transforms $x$ into $\tilde{x} = g(x)$. This function determines the entire character of what the model learns during pre-training. Different noising functions create different learning signals. Some emphasize local word relationships, others stress document structure, and still others require the model to infer the length of missing content. If corruption is too easy to reverse, the model learns little because it can trivially recover the original without developing strong linguistic understanding. If corruption is too difficult or destroys too much structure, the model cannot learn meaningful patterns because there's insufficient signal remaining to guide reconstruction. BART's innovation was systematically exploring this design space to identify corruptions that hit the sweet spot, challenging enough to force learning but structured enough to permit it.

Unlike BERT, which only predicts masked tokens, BART's decoder must generate the entire original sequence. This architectural choice significantly affects training. The model receives a training signal for every token position, not just the corrupted ones, which means even "easy" positions where the input was not corrupted contribute to learning. The decoder still must decide to copy these tokens and position them correctly within the output sequence. Unlike T5, which uses a specific span corruption scheme with sentinel tokens, BART explores a broader range of corruption strategies and reconstructs exact original text rather than just the corrupted spans. This full-sequence reconstruction teaches the model to blend generated content with content that can be copied from the input, a skill directly relevant to tasks like summarization where the output combines novel phrases with material drawn from the source.

Token Masking

Token masking is the most familiar corruption scheme, directly inherited from BERT's masked language modeling objective. Random tokens are replaced with a special [MASK] token, and the model must predict the original tokens. This approach has proven remarkably effective for learning contextual word representations, and BART includes it as one of its corruption options to benefit from these established strengths.

How It Works

Given a sequence of tokens, we mask each token independently with probability $p$ (typically 15%, following BERT). The independence means that each token's fate is determined by its own random draw, without consideration of neighboring tokens:

$$\tilde{x}_i = \begin{cases} \text{[MASK]} & \text{with probability } p \\ x_i & \text{otherwise} \end{cases}$$

where:

  • $\tilde{x}_i$: the $i$-th token in the corrupted sequence
  • $x_i$: the $i$-th token in the original sequence
  • $p$: the masking probability (typically 0.15)
  • $\text{[MASK]}$: a special token indicating a masked position

This piecewise definition captures a simple stochastic process: for each position $i$, flip a biased coin that lands on "mask" with probability $p$ and "keep" with probability $1-p$. The resulting corrupted sequence has the same length as the original, with [MASK] tokens scattered throughout. The specific positions vary each time we corrupt the same document, providing varied training examples from identical source text.

For example, the sentence "The quick brown fox jumps over the lazy dog" might become "The [MASK] brown fox [MASK] over the lazy [MASK]". Notice that the structure remains intact. We can still see that something describes "brown," that an action connects "fox" to "over," and that some modifier precedes "dog." These structural clues are precisely what the model must learn to exploit.

What the Model Learns

Token masking teaches the model to use bidirectional context for word prediction. To predict "quick," the encoder must understand that something describing "fox" fits between "The" and "brown." The model learns that adjectives typically precede the nouns they modify, that certain adjectives co-occur (quick and brown both describe animals in motion), and that syntactic patterns constrain what words can appear where. This develops strong contextual word representations, the same capability that makes BERT effective for classification and extraction tasks where understanding word meaning in context matters most.

However, token masking has limitations that become apparent when we consider generation tasks. The corruption is always one-to-one: one mask corresponds to exactly one original token. This rigid correspondence means the model never learns to handle situations where the number of output tokens differs from the number of input tokens, an important skill for generation tasks. When summarizing, the model must compress many input tokens into fewer output tokens. When elaborating or explaining, it must expand. Token masking provides no training signal for these variable-length transformations.

In[2]:
Code
import random
from typing import List, Tuple


def token_masking(
    tokens: List[str], mask_prob: float = 0.15, mask_token: str = "[MASK]"
) -> Tuple[List[str], List[str]]:
    """Apply token masking corruption.

    Returns:
        Tuple of (corrupted_tokens, original_tokens)
    """
    corrupted = []
    for token in tokens:
        if random.random() < mask_prob:
            corrupted.append(mask_token)
        else:
            corrupted.append(token)
    return corrupted, tokens
In[3]:
Code
# Example application
random.seed(42)
original = "The quick brown fox jumps over the lazy dog".split()
corrupted, original_out = token_masking(original, mask_prob=0.3)
Out[4]:
Console
Original:  The quick brown fox jumps over the lazy dog
Corrupted: The [MASK] [MASK] [MASK] jumps over the [MASK] dog
Masked positions: [1, 2, 3, 7]

With a 30% masking rate (higher than typical for visibility), several tokens are replaced with the [MASK] placeholder. The decoder must generate all original tokens, but the key learning signal comes from correctly predicting the masked positions. The output shows which token indices were masked, allowing us to verify the corruption is working as expected.

Token Deletion

Token deletion takes masking one step further: instead of replacing tokens with a placeholder, we remove them entirely. This seemingly small change has significant implications for what the model must learn. Where masking leaves behind a marker saying "something was here," deletion removes all evidence of the token's existence, forcing the model to infer not just what is missing but also where the gaps are.

How It Works

Random tokens are deleted from the sequence with probability $p$:

$$\tilde{x} = \{x_i : r_i > p\}$$

where:

  • $\tilde{x}$: the corrupted sequence (a subset of the original tokens)
  • $x_i$: the $i$-th token in the original sequence
  • $r_i \sim \text{Uniform}(0, 1)$: a random value drawn independently and uniformly from the interval $[0, 1]$ for each token
  • $p$: the deletion probability

The notation $\{x_i : r_i > p\}$ uses set-builder notation to express "keep only those tokens where the random draw exceeds $p$." Since each $r_i$ is uniformly distributed between 0 and 1, the probability that $r_i > p$ is exactly $1-p$, meaning each token survives with probability $1-p$. The corrupted sequence is shorter than the original (its expected length is $(1-p)$ times the original length), and crucially, there's no marker indicating where deletions occurred. The tokens that remain simply adjoin each other, closing the gaps left by deleted content.

For example, "The quick brown fox jumps over the lazy dog" might become "The brown fox over lazy dog" with "quick," "jumps," "the," and possibly other tokens deleted. The resulting sequence reads almost grammatically but contains subtle errors—missing descriptors, absent verbs, dropped articles—that reveal the corruption.

What the Model Learns

Token deletion forces the model to solve two problems simultaneously, which creates a richer learning signal than masking alone:

  1. Detection: Where were tokens deleted? Without mask tokens to mark positions, the model must infer deletion points from grammatical or semantic inconsistency. This requires understanding what should be present in well-formed text.
  2. Prediction: What tokens were deleted? Once the model hypothesizes a gap location, it must determine what content is missing based on surrounding context.

This dual challenge is harder than masked language modeling. Consider "The brown fox over lazy dog": the model must recognize that an adjective is missing before "brown" (parallel structure suggests another descriptor might have been present), that something is missing between "fox" and "over" (likely a verb, since noun phrases do not directly connect to prepositions), and that an article is missing before "lazy" (English requires articles before adjective-noun pairs in most contexts).

The detection problem is especially important for generation tasks. A summarization model might receive a document with implicit information gaps (places where the source text assumes background knowledge that isn't stated) and must decide what content to generate and where. Token deletion pre-training provides practice at exactly this skill: identifying where information is missing and filling those gaps appropriately.

In[5]:
Code
def token_deletion(
    tokens: List[str], delete_prob: float = 0.15
) -> Tuple[List[str], List[str]]:
    """Apply token deletion corruption.

    Returns:
        Tuple of (corrupted_tokens, original_tokens)
    """
    corrupted = []
    for token in tokens:
        if random.random() >= delete_prob:  # Keep token if >= threshold
            corrupted.append(token)
    return corrupted, tokens
In[6]:
Code
# Example showing length difference
random.seed(123)
original = "The quick brown fox jumps over the lazy dog".split()
corrupted, _ = token_deletion(original, delete_prob=0.3)
Out[7]:
Console
Original (9 tokens):  The quick brown fox jumps over the lazy dog
Corrupted (5 tokens): brown jumps the lazy dog
Tokens removed: 4

With a 30% deletion probability, roughly a third of the tokens are removed; in this particular run, 4 of the 9 tokens disappear. Notice the length mismatch: the decoder must generate a longer sequence than the encoder received. This requires the model to learn when to "expand" during generation, inserting tokens that have no corresponding position in the input. This skill is entirely absent from pure masked language modeling, where input and output lengths always match.

Text Infilling

Text infilling is BART's most important corruption scheme and the one that proved most effective in experiments. Instead of corrupting individual tokens, we corrupt contiguous spans of varying lengths. Critically, each span is replaced with a single mask token regardless of how many tokens it originally contained. This creates a challenging reconstruction task that teaches the model key skills for generation: predicting not just what content is missing but also how much content is missing.

How It Works

The infilling process has two steps that together create spans of varying lengths positioned throughout the document:

  1. Sample span lengths: Draw span lengths from a Poisson distribution with parameter $\lambda$ (typically $\lambda = 3$). This creates spans of varying sizes: many single-token spans, some two-token spans, fewer three-token spans, and so on. The Poisson distribution provides a natural way to generate this variety with a single parameter controlling the expected length.

  2. Sample span positions: Randomly select starting positions for spans until approximately a fraction $p$ of the total tokens are covered by spans. The sampling continues until the target corruption level is reached, ensuring consistent corruption intensity across documents of different lengths.

Each span, regardless of length, is replaced with exactly one [MASK] token:

$$\text{span}(x_i, x_{i+1}, \ldots, x_{i+k}) \to \text{[MASK]}$$

where:

  • $x_i, x_{i+1}, \ldots, x_{i+k}$: a contiguous span of $k+1$ tokens starting at position $i$
  • $\text{[MASK]}$: a single mask token that replaces the entire span
  • $\to$: indicates the replacement operation

This many-to-one mapping is the crucial innovation. A single [MASK] in the input might correspond to zero tokens, one token, three tokens, or even more in the output. The model cannot simply count masks and produce the same number of tokens. It must learn to judge, from context alone, how extensive each gap is.

For example, "The quick brown fox jumps over the lazy dog" might become "The [MASK] fox [MASK] the [MASK] dog" where:

  • "[MASK]₁" replaces "quick brown" (2 tokens)
  • "[MASK]₂" replaces "jumps over" (2 tokens)
  • "[MASK]₃" replaces "lazy" (1 token)

The decoder must figure out that the first mask needs two tokens (an adjective pair), the second needs two (a verb phrase), and the third needs one (a single adjective), all from the contextual clues in "The ... fox ... the ... dog."

The Poisson Distribution for Span Lengths

Why use Poisson? We want span lengths that vary naturally, mostly short spans with occasional longer ones, without needing to set a maximum length. The Poisson distribution is ideal because it models counts of independent events and has a single parameter $\lambda$ controlling the average. This is convenient: rather than specifying an entire distribution over possible lengths, we tune one number and get a reasonable distribution automatically.

The Poisson distribution gives the probability of observing exactly $k$ events:

$$P(k; \lambda) = \frac{\lambda^k e^{-\lambda}}{k!}$$

where:

  • $P(k; \lambda)$: the probability of observing exactly $k$ events (here, a span of $k$ tokens)
  • $k$: the span length (number of tokens, can be 0, 1, 2, ...)
  • $\lambda$: the expected (average) span length, set to 3 in BART
  • $e^{-\lambda}$: the exponential decay factor ensuring probabilities sum to 1
  • $k!$: factorial of $k$, which appears because ordering doesn't matter within the span count

The formula works as follows: $\lambda^k$ makes longer spans increasingly likely as $\lambda$ grows; if the average is higher, we should see more long spans. Meanwhile, $k!$ in the denominator penalizes very long spans (since the factorial grows faster than the exponential), preventing the distribution from placing too much mass on extremely long spans that would destroy too much context. The $e^{-\lambda}$ term normalizes everything to sum to 1, ensuring we have a valid probability distribution. Together, these three components create a distribution peaked near $\lambda$ with a long tail toward higher values.

With $\lambda = 3$:

  • $P(0) \approx 0.05$: 5% of spans have length 0 (insertions!)
  • $P(1) \approx 0.15$: 15% single-token spans
  • $P(2) \approx 0.22$: 22% two-token spans
  • $P(3) \approx 0.22$: 22% three-token spans
  • $P(\geq 4) \approx 0.35$: roughly 35% longer spans
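These percentages can be checked directly from the formula; a quick sketch using Python's standard math module:

import math


def poisson_pmf(k: int, lam: float) -> float:
    """P(k; lambda) = lambda^k * exp(-lambda) / k!"""
    return lam**k * math.exp(-lam) / math.factorial(k)


lam = 3.0
for k in range(4):
    print(f"P({k}) = {poisson_pmf(k, lam):.3f}")
print(f"P(>=4) = {1 - sum(poisson_pmf(k, lam) for k in range(4)):.3f}")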
Out[8]:
Visualization
Line plot showing Poisson probability mass functions for lambda values 1, 2, 3, and 5.
Poisson PMF for span lengths with different λ values. BART uses λ=3.
Bar chart with PMF and line for CDF at lambda=3.
BART default λ=3 showing PMF and CDF. Note 5% zero-length spans enable insertions.

The inclusion of zero-length spans is particularly clever and deserves special attention. A zero-length span means inserting a [MASK] token between two existing tokens without removing anything. The decoder must learn to generate empty content for this mask—to recognize that sometimes a mask is a false alarm requiring no output. This teaches the model that not every mask needs to expand into content, a nuance absent from standard masking approaches where every mask always corresponds to exactly one token.

What the Model Learns

Text infilling develops generation capabilities that transfer directly to downstream tasks:

  1. Length prediction: The model must infer how many tokens each mask represents. A single mask might expand to zero, one, five, or more tokens. This requires understanding semantic and syntactic constraints: how much content is needed to complete a phrase, how many words typically appear in certain constructions, and what the surrounding context demands.

  2. Coherent multi-token generation: Generating "quick brown" as a unit requires understanding that these words form a coherent phrase modifying "fox." The model cannot generate each token independently; it must ensure that multi-token outputs hang together grammatically and semantically.

  3. Span boundary detection: The model must determine where masked content ends and original content resumes. When generating tokens for a mask, the model must decide when to stop generating and allow the remaining input to take over.

This directly addresses a limitation of T5's span corruption. T5 uses sentinel tokens that indicate span boundaries and generates only the corrupted spans with their sentinels. BART must reconstruct the full original text, interleaving generated and copied content smoothly. This trains the model for generation tasks where the output blends novel phrases with material from the input. This is exactly what happens in summarization, paraphrasing, and many other applications.

In[9]:
Code
import numpy as np


def text_infilling(
    tokens: List[str],
    mask_ratio: float = 0.15,
    poisson_lambda: float = 3.0,
    mask_token: str = "[MASK]",
) -> Tuple[List[str], List[str]]:
    """Apply text infilling corruption.

    Replaces spans of varying lengths (Poisson-distributed) with single mask tokens.

    Returns:
        Tuple of (corrupted_tokens, original_tokens)
    """
    n = len(tokens)
    if n == 0:
        return [], tokens

    # Calculate target number of tokens to mask
    target_masked = int(n * mask_ratio)

    # Sample spans until we've masked enough tokens
    spans = []  # List of (start, end) tuples
    total_masked = 0

    while total_masked < target_masked:
        # Sample span length from Poisson
        span_length = np.random.poisson(poisson_lambda)

        # Sample random start position
        start = random.randint(0, n - 1)
        end = min(start + span_length, n)

        # Check for overlap with existing spans
        overlaps = False
        for existing_start, existing_end in spans:
            if not (end <= existing_start or start >= existing_end):
                overlaps = True
                break

        if not overlaps:
            spans.append((start, end))
            total_masked += end - start

    # Sort spans by start position
    spans.sort()

    # Build corrupted sequence
    corrupted = []
    prev_end = 0
    for start, end in spans:
        # Add tokens before this span
        corrupted.extend(tokens[prev_end:start])
        # Add single mask for entire span (even if span has 0 length)
        corrupted.append(mask_token)
        prev_end = end

    # Add remaining tokens
    corrupted.extend(tokens[prev_end:])

    return corrupted, tokens
In[10]:
Code
# Demonstrate infilling with visible span corruption
np.random.seed(42)
random.seed(42)

original = "The quick brown fox jumps over the lazy dog".split()
corrupted, _ = text_infilling(original, mask_ratio=0.4, poisson_lambda=2.0)
Out[11]:
Console
Original (9 tokens):  The quick brown fox jumps over the lazy dog
Corrupted (6 tokens): The [MASK] over the lazy dog
Length reduction: 3 tokens

The corrupted sequence is notably shorter than the original. Each [MASK] might represent multiple tokens, so the decoder must learn to expand these placeholders appropriately. This many-to-one corruption followed by one-to-many generation is central to BART's effectiveness on generative tasks.

Sentence Permutation

Moving from token-level to structural corruption, sentence permutation shuffles the order of sentences within a document. This tests the model's understanding of discourse coherence and document structure—aspects of language that token-level corruptions cannot address. Well-written documents have logical flow: ideas build on each other, references point backward to established concepts, and conclusions follow from premises. Sentence permutation disrupts this flow and asks the model to restore it.

How It Works

The document is first segmented into sentences (typically using punctuation heuristics or a sentence tokenizer, as covered in Part I). These sentences are then randomly permuted:

$$\tilde{x} = \operatorname{shuffle}([s_1, s_2, \ldots, s_k])$$

where:

  • $\tilde{x}$: the corrupted document with reordered sentences
  • $[s_1, s_2, \ldots, s_k]$: the original document as a sequence of $k$ sentences
  • $\operatorname{shuffle}(\cdot)$: a random permutation function that reorders the sentences

The shuffle operation creates a random permutation of the $k$ sentences, selecting uniformly from all $k!$ possible orderings. For a three-sentence document, there are $3! = 6$ possible orderings, and each has equal probability of being selected. For example, a three-sentence document might be reordered from $(s_1, s_2, s_3)$ to $(s_2, s_3, s_1)$, placing what was the middle sentence first and the original first sentence last.

What the Model Learns

Sentence permutation teaches document-level coherence in several ways:

  • Discourse relations: How does one sentence logically follow another? Causal relationships ("Because X, therefore Y"), temporal sequences ("First X, then Y"), and elaboration patterns ("X. More specifically, Y") all provide cues for ordering. The model must recognize these implicit connections even when discourse markers are absent.

  • Referential coherence: Pronouns and definite references ("the man," "this idea") typically follow their antecedents. Permuted text disrupts these chains—a sentence might use "he" before the person has been introduced, or mention "this approach" before the approach has been described. The model learns to recognize and restore proper reference chains.

  • Topic flow: Well-organized text develops ideas progressively. Opening sentences introduce topics; middle sentences elaborate; closing sentences conclude or transition. The model learns to recognize these structural roles and restore this flow.

However, sentence permutation alone proved less effective than token-level corruption in BART's experiments. The learning signal is sparse (each training example provides only a few reordering decisions), and many permutations might be equally valid for loosely connected sentences. A document with three independent observations could be reasonably ordered in any sequence, providing ambiguous signal to the model.

In[12]:
Code
import re


def sentence_permutation(
    text: str, sentence_end_pattern: str = r"(?<=[.!?])\s+"
) -> Tuple[str, str]:
    """Apply sentence permutation corruption.

    Returns:
        Tuple of (corrupted_text, original_text)
    """
    # Split into sentences
    sentences = re.split(sentence_end_pattern, text.strip())
    sentences = [s.strip() for s in sentences if s.strip()]

    # Shuffle sentences
    shuffled = sentences.copy()
    random.shuffle(shuffled)

    corrupted = " ".join(shuffled)
    return corrupted, text
In[13]:
Code
random.seed(42)

document = """Natural language processing has advanced rapidly. 
Transformers revolutionized the field in 2017. 
Today, large language models can perform diverse tasks."""

corrupted, original = sentence_permutation(document)
Out[14]:
Console
Original:
Natural language processing has advanced rapidly. 
Transformers revolutionized the field in 2017. 
Today, large language models can perform diverse tasks.

Corrupted (permuted):
Transformers revolutionized the field in 2017. Natural language processing has advanced rapidly. Today, large language models can perform diverse tasks.

The permuted version reads awkwardly—the chronological and logical flow is disrupted. The decoder must restore proper ordering based on discourse cues.

Document Rotation

Document rotation is the most aggressive structural corruption in BART's repertoire. A random token is selected as the new starting point, and the document is rotated so that it begins with that token. Unlike sentence permutation, which at least preserves sentence boundaries, document rotation can split sentences arbitrarily, placing sentence fragments at the beginning or end.

How It Works

Given a document with tokens $[x_1, x_2, \ldots, x_n]$, we uniformly sample a position $k \in \{1, \ldots, n\}$ and rotate:

$$\tilde{x} = [x_k, x_{k+1}, \ldots, x_n, x_1, x_2, \ldots, x_{k-1}]$$

where:

  • $\tilde{x}$: the rotated (corrupted) document
  • $x_1, x_2, \ldots, x_n$: the original document with $n$ tokens
  • $k$: the randomly chosen rotation point (sampled uniformly from 1 to $n$)
  • $[x_k, \ldots, x_n]$: tokens from position $k$ to the end become the new beginning
  • $[x_1, \ldots, x_{k-1}]$: tokens before position $k$ are moved to the end

Think of the document as arranged in a circle. We pick a random position and "cut" the circle there, then lay it out linearly starting from that cut point. All tokens remain in their relative order, but the absolute positions shift. The original beginning might now appear in the middle or at the end.

For example, "The quick brown fox jumps" with k=3k=3 becomes "brown fox jumps The quick." The fragment "brown fox jumps" appears first, followed by "The quick"—a clear violation of sentence structure that signals something is wrong.

What the Model Learns

Document rotation teaches the model to identify document boundaries, specifically where a coherent document actually begins. This seems like a narrow skill, but it has practical uses:

  • Document structure recognition: Real documents have clear beginnings (titles, topic sentences) and endings (conclusions, sign-offs). The model learns to recognize these structural markers (capitalized first words, sentence-initial positions, opening phrases like "In this paper" or "The purpose of") that distinguish genuine document starts from arbitrary mid-document positions.

  • Global coherence: Unlike sentence permutation, which preserves some local structure, rotation can place mid-sentence fragments at document boundaries. The model must recognize incomplete thoughts, such as a sentence missing its subject or a phrase without context, as evidence of artificial boundaries.

In practice, document rotation was the least effective corruption scheme when used alone. The task is difficult (many rotation points produce plausible-looking starts, especially in long documents where sentence-initial positions are common), and the learning signal is limited to identifying a single boundary per document. The model receives one bit of useful information per training example: where the true start is.

In[15]:
Code
def document_rotation(tokens: List[str]) -> Tuple[List[str], List[str]]:
    """Apply document rotation corruption.

    Returns:
        Tuple of (corrupted_tokens, original_tokens)
    """
    if len(tokens) <= 1:
        return tokens, tokens

    # Sample rotation point uniformly
    k = random.randint(1, len(tokens) - 1)

    # Rotate: move tokens before k to the end
    corrupted = tokens[k:] + tokens[:k]

    return corrupted, tokens
In[16]:
Code
random.seed(123)

original = "The quick brown fox jumps over the lazy dog".split()
corrupted, _ = document_rotation(original)
Out[17]:
Console
Original:  The quick brown fox jumps over the lazy dog
Corrupted: quick brown fox jumps over the lazy dog The
Rotation point: position 1 ('quick')

The rotated sequence maintains word order within the two fragments but loses the natural starting point. The decoder must identify "The" as the true document beginning.

Combining Corruption Schemes

BART's key experimental contribution was systematically testing different combinations of these corruption schemes. Not all combinations are equally effective, and some corruptions complement each other while others are redundant. Understanding these interactions helps explain BART's final design choices and provides guidance for future model development.

Experimental Findings

The BART paper evaluated combinations across several downstream tasks. Key findings include:

Text infilling dominates. When used alone, text infilling matched or exceeded other single-scheme corruptions across most tasks. The combination of length prediction (inferring how many tokens each mask represents) and span reconstruction (generating coherent multi-token sequences) provides strong learning signals that transfer effectively to generation tasks.

Sentence permutation adds modest gains. Combining text infilling with sentence permutation slightly improved results on some tasks, particularly those requiring document understanding. However, the gains were small compared to infilling alone.

Token deletion is redundant with infilling. Text infilling subsumes token deletion because zero-length spans in infilling are equivalent to deletions. Combining them provides little benefit.

Document rotation hurts performance. Adding document rotation to other schemes generally degraded results. The task may be too difficult or too different from downstream applications.

Token masking alone is inferior. Pure masking performs worse than infilling because it cannot teach length prediction. Each mask always corresponds to exactly one token.

Out[18]:
Visualization
Horizontal bar chart showing relative effectiveness scores for different corruption schemes and their combinations.
Relative effectiveness of BART corruption schemes based on experimental findings. Text infilling provides the strongest signal for downstream tasks, while document rotation can actually hurt performance. Combinations that include infilling with sentence permutation achieve the best overall results.

The Final BART Recipe

Based on these experiments, the standard BART pre-training uses:

  1. Text infilling with $\lambda = 3$ and approximately 30% of tokens in corrupted spans
  2. Sentence permutation applied to all documents

This combination balances local token prediction (from infilling) with global structure learning (from permutation) while avoiding the problems with rotation and the redundancy of separate deletion.

In[19]:
Code
def bart_corruption(
    text: str,
    mask_ratio: float = 0.30,
    poisson_lambda: float = 3.0,
    sentence_end_pattern: str = r"(?<=[.!?])\s+",
    mask_token: str = "[MASK]",
) -> Tuple[str, str]:
    """Apply BART's combined corruption scheme.

    Combines text infilling and sentence permutation.

    Returns:
        Tuple of (corrupted_text, original_text)
    """
    # Step 1: Sentence permutation
    sentences = re.split(sentence_end_pattern, text.strip())
    sentences = [s.strip() for s in sentences if s.strip()]

    if len(sentences) > 1:
        random.shuffle(sentences)

    permuted_text = " ".join(sentences)

    # Step 2: Text infilling on permuted text
    tokens = permuted_text.split()
    corrupted_tokens, _ = text_infilling(
        tokens,
        mask_ratio=mask_ratio,
        poisson_lambda=poisson_lambda,
        mask_token=mask_token,
    )

    corrupted_text = " ".join(corrupted_tokens)

    return corrupted_text, text
In[20]:
Code
np.random.seed(42)
random.seed(42)

document = """Natural language processing enables computers to understand human text.
Deep learning has driven recent advances in the field.
BART combines bidirectional and autoregressive training."""

corrupted, original = bart_corruption(document, mask_ratio=0.25)
Out[21]:
Console
Original:
Natural language processing enables computers to understand human text.
Deep learning has driven recent advances in the field.
BART combines bidirectional and autoregressive training.

Corrupted (permuted + infilled):
[MASK] recent advances in the [MASK] processing enables computers to understand human text. BART combines bidirectional and autoregressive [MASK]

The combined corruption produces text that is both structurally shuffled and locally corrupted. The decoder faces the full challenge of restoring both sentence order and masked content.

Comparison with Other Pre-training Approaches

Understanding BART's pre-training requires comparing it to related approaches we've covered in earlier chapters.

BART vs. BERT

BERT's masked language modeling (covered in Part XVII) masks 15% of tokens and predicts only those positions. Key differences:

BERT vs BART pre-training comparison.

| Aspect | BERT | BART |
| --- | --- | --- |
| Architecture | Encoder-only | Encoder-decoder |
| Output | Masked positions only | Full sequence |
| Mask-to-token ratio | 1:1 | Variable (one mask may cover many tokens) |
| Position signals | Mask tokens mark positions | No explicit position markers |
| Generation capability | Limited | Native |

BERT performs well at tasks requiring bidirectional understanding, such as classification and extraction, but struggles with generation because it was never trained to produce full sequences. BART maintains BERT's bidirectional encoder benefits while adding generation capability.

BART vs. T5

T5's span corruption (covered in Part XX, Chapter 2) shares BART's span-based approach but differs in key ways:

T5 vs BART pre-training comparison.

| Aspect | T5 | BART |
| --- | --- | --- |
| Corruption output | Replace spans with sentinels | Replace spans with a single mask |
| Decoder output | Only corrupted spans | Full original sequence |
| Span identification | Sentinels mark boundaries | No explicit boundaries |
| Training signal | Only corrupted tokens | All tokens |

T5's sentinel-based approach is more efficient (the decoder generates less text), but BART's full-sequence reconstruction provides better signal for tasks requiring exact output control, such as grammatical error correction or style transfer.
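To make the difference concrete, here is a small illustrative sketch contrasting what each decoder is asked to produce for the same corrupted spans. It uses whitespace tokens rather than a real tokenizer, and the T5-style target is simplified (the real scheme also appends a closing sentinel).

original = "The quick brown fox jumps over the lazy dog".split()
# Suppose two spans were chosen for corruption: "quick brown" and "jumps over".
spans = [(1, 3), (4, 6)]  # (start, end) token indices

# T5-style: each span gets its own sentinel, and the decoder emits only the spans.
t5_source, t5_target, prev = [], [], 0
for i, (start, end) in enumerate(spans):
    sentinel = f"<extra_id_{i}>"
    t5_source += original[prev:start] + [sentinel]
    t5_target += [sentinel] + original[start:end]
    prev = end
t5_source += original[prev:]

# BART-style: every span collapses to the same single mask token,
# and the decoder must reproduce the entire original sequence.
bart_source, prev = [], 0
for start, end in spans:
    bart_source += original[prev:start] + ["<mask>"]
    prev = end
bart_source += original[prev:]
bart_target = original

print("T5 source:  ", " ".join(t5_source))
print("T5 target:  ", " ".join(t5_target))
print("BART source:", " ".join(bart_source))
print("BART target:", " ".join(bart_target))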

BART vs. GPT

GPT's causal language modeling (Part XVIII) trains the model to predict the next token given all previous tokens. Comparing to BART:

GPT vs BART pre-training comparison.

| Aspect | GPT | BART |
| --- | --- | --- |
| Input corruption | None (left context only) | Various noising schemes |
| Attention | Causal (unidirectional) | Encoder bidirectional, decoder causal |
| Learning signal | Next token prediction | Full reconstruction |
| Zero-shot capability | Strong | Moderate |

GPT's unidirectional approach prevents it from using future context during encoding. BART's encoder sees the full corrupted input bidirectionally, which helps for tasks where understanding the complete input matters, such as summarization and translation.

Out[22]:
Visualization
BERT input-output diagram showing masked tokens and sparse predictions.
BERT: Masked LM predicts only at masked positions (sparse output).
GPT input-output diagram showing shifted causal prediction.
GPT: Causal LM predicts next token at each position (dense but unidirectional).
T5 input-output diagram showing sentinel tokens and span generation.
T5: Span corruption generates only corrupted spans with sentinels (efficient).
BART input-output diagram showing full sequence reconstruction.
BART: Denoising reconstructs the full sequence (dense bidirectional signal).

Implementation: Full Pre-training Data Pipeline

Let's implement a complete data pipeline for BART pre-training, showing how to process documents into training batches.

In[23]:
Code
from collections import namedtuple

BARTExample = namedtuple("BARTExample", ["source", "target"])


class BARTCorruptor:
    """Handles BART-style text corruption for pre-training."""

    def __init__(
        self,
        mask_ratio: float = 0.30,
        poisson_lambda: float = 3.0,
        mask_token: str = "<mask>",
        apply_sentence_permutation: bool = True,
    ):
        self.mask_ratio = mask_ratio
        self.poisson_lambda = poisson_lambda
        self.mask_token = mask_token
        self.apply_sentence_permutation = apply_sentence_permutation

    def _split_sentences(self, text: str) -> List[str]:
        """Split text into sentences."""
        sentences = re.split(r"(?<=[.!?])\s+", text.strip())
        return [s.strip() for s in sentences if s.strip()]

    def _permute_sentences(self, sentences: List[str]) -> List[str]:
        """Randomly permute sentence order."""
        shuffled = sentences.copy()
        random.shuffle(shuffled)
        return shuffled

    def _infill_tokens(self, tokens: List[str]) -> List[str]:
        """Apply text infilling to token sequence."""
        n = len(tokens)
        if n == 0:
            return []

        target_masked = max(1, int(n * self.mask_ratio))
        spans = []
        total_masked = 0
        attempts = 0
        max_attempts = n * 10

        while total_masked < target_masked and attempts < max_attempts:
            attempts += 1
            span_length = np.random.poisson(self.poisson_lambda)
            start = random.randint(0, n - 1)
            end = min(start + span_length, n)

            overlaps = any(not (end <= s or start >= e) for s, e in spans)

            if not overlaps:
                spans.append((start, end))
                total_masked += max(1, end - start)

        spans.sort()

        corrupted = []
        prev_end = 0
        for start, end in spans:
            corrupted.extend(tokens[prev_end:start])
            corrupted.append(self.mask_token)
            prev_end = end
        corrupted.extend(tokens[prev_end:])

        return corrupted

    def corrupt(self, text: str) -> BARTExample:
        """Apply full BART corruption pipeline."""
        # Sentence permutation
        if self.apply_sentence_permutation:
            sentences = self._split_sentences(text)
            if len(sentences) > 1:
                sentences = self._permute_sentences(sentences)
            permuted_text = " ".join(sentences)
        else:
            permuted_text = text

        # Text infilling
        tokens = permuted_text.split()
        corrupted_tokens = self._infill_tokens(tokens)

        return BARTExample(source=" ".join(corrupted_tokens), target=text)
In[24]:
Code
# Demonstrate the corruptor
np.random.seed(42)
random.seed(42)

corruptor = BARTCorruptor(mask_ratio=0.30, poisson_lambda=3.0)

sample_document = """Machine learning models learn patterns from data.
Neural networks are particularly effective for complex patterns.
Pre-training helps models learn general representations."""

example = corruptor.corrupt(sample_document)
Out[25]:
Console
Source (corrupted input to encoder):
<mask> <mask> complex patterns. <mask> learning models learn patterns from data. Pre-training helps models learn general representations.

Target (original text for decoder):
Machine learning models learn patterns from data.
Neural networks are particularly effective for complex patterns.
Pre-training helps models learn general representations.

The BARTCorruptor class encapsulates the full corruption pipeline. The source text shows the combined effect of sentence permutation and text infilling: sentences appear in a different order with mask tokens replacing spans of varying lengths. The target remains the original text that the decoder must learn to reconstruct. In a real implementation, you would integrate this with a tokenizer (as covered in Part V) and batch multiple examples together.
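As a sketch of that integration, the following assumes the Hugging Face transformers library (and PyTorch for the tensor output) with the facebook/bart-base checkpoint, whose mask token is <mask> and therefore matches the corruptor's default. It reuses the corruptor and sample_document defined above and is illustrative rather than the original pre-training pipeline.

from transformers import BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")

examples = [corruptor.corrupt(sample_document) for _ in range(4)]

# Encoder inputs come from the corrupted sources; labels come from the originals.
batch = tokenizer(
    [ex.source for ex in examples],
    padding=True, truncation=True, max_length=128, return_tensors="pt",
)
labels = tokenizer(
    [ex.target for ex in examples],
    padding=True, truncation=True, max_length=128, return_tensors="pt",
).input_ids
# In practice, padded label positions are usually set to -100 so the loss ignores them.

print(batch["input_ids"].shape, labels.shape)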

Key Parameters

The key parameters for BART pre-training corruption are:

  • mask_ratio: Fraction of tokens to corrupt (default 0.30). Higher values create more challenging reconstruction tasks but may remove too much signal.
  • poisson_lambda: Mean of the Poisson distribution for span lengths (default 3.0). Controls average span size; higher values create longer spans.
  • mask_token: Special token replacing corrupted spans. Must match the tokenizer's mask token.
  • apply_sentence_permutation: Whether to shuffle sentence order before infilling. Enables document-level structure learning.

Visualizing Corruption Density

Understanding how different corruption schemes distribute their modifications helps build intuition about what the model learns.

Out[26]:
Visualization
Histogram showing distribution of masked token fractions.
Token masking (p=0.15) affects ~15% of tokens.
Histogram showing distribution of deleted token fractions.
Token deletion (p=0.15) removes ~15% of tokens.
Histogram showing distribution of tokens in corrupted spans.
Text infilling (30%, λ=3) corrupts ~30% of tokens with higher variance.

Token masking and deletion show similar distributions centered around their target probability. Text infilling affects more tokens (30% target) and shows higher variance due to the stochastic span lengths—some samples have many short spans while others have fewer long spans.
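The same comparison can be made numerically without plotting. The short sketch below repeatedly applies the three corruption functions defined earlier and reports the fraction of original tokens each one affects; the sequence length and trial count are arbitrary choices for illustration.

# Measure corruption density empirically for the three token-level schemes.
n_tokens, n_trials = 200, 500
tokens = [f"tok{i}" for i in range(n_tokens)]

mask_fracs, del_fracs, infill_fracs = [], [], []
for _ in range(n_trials):
    corrupted, _ = token_masking(tokens, mask_prob=0.15)
    mask_fracs.append(sum(t == "[MASK]" for t in corrupted) / n_tokens)

    corrupted, _ = token_deletion(tokens, delete_prob=0.15)
    del_fracs.append(1 - len(corrupted) / n_tokens)

    corrupted, _ = text_infilling(tokens, mask_ratio=0.30, poisson_lambda=3.0)
    # Surviving original tokens = corrupted length minus the inserted mask tokens,
    # so the covered fraction is everything else.
    n_masks = sum(t == "[MASK]" for t in corrupted)
    infill_fracs.append((n_tokens - (len(corrupted) - n_masks)) / n_tokens)

for name, fracs in [
    ("masking", mask_fracs),
    ("deletion", del_fracs),
    ("infilling", infill_fracs),
]:
    print(f"{name:10s} mean={np.mean(fracs):.3f}  std={np.std(fracs):.3f}")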

Training Dynamics

BART's pre-training uses standard practices for large language models, with some choices informed by its architecture:

Optimizer: AdamW with $\beta_1 = 0.9$, $\beta_2 = 0.999$, and weight decay of 0.01. As we discussed in Part VII, AdamW properly decouples weight decay from the adaptive learning rate.

Learning rate schedule: Linear warmup for the first 10,000 steps, then linear decay. This follows the pattern we saw in BERT pre-training but with a longer warmup suitable for larger models.

Batch size: Large batches (8,000 tokens per batch for BART-Base) improve training stability and allow higher learning rates.

Data: BART was trained on the same data as RoBERTa: 160GB of text including books, Wikipedia, news, and web text. This diverse corpus exposes the model to varied writing styles and domains.

Out[27]:
Visualization
Line plot showing learning rate increasing linearly then decreasing linearly over training steps.
BART learning rate schedule showing linear warmup followed by linear decay. The peak learning rate is reached after warmup steps, then decreases linearly to zero.

The warmup phase prevents early training instability when gradients are noisy and model weights are randomly initialized. The subsequent decay helps the model converge to a stable minimum.
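A minimal sketch of this warmup-then-decay schedule as a function of the training step: the 10,000 warmup steps match the description above, while the peak learning rate and total step count are placeholder values chosen for illustration.

def linear_warmup_linear_decay(
    step: int,
    warmup_steps: int = 10_000,
    total_steps: int = 500_000,  # placeholder, not BART's actual step count
    peak_lr: float = 4e-4,       # placeholder peak learning rate
) -> float:
    """Learning rate at a given step: linear ramp up, then linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = (total_steps - step) / (total_steps - warmup_steps)
    return peak_lr * max(0.0, remaining)


for step in [0, 5_000, 10_000, 100_000, 500_000]:
    print(f"step {step:>7,}: lr = {linear_warmup_linear_decay(step):.2e}")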

Limitations and Impact

BART's pre-training approach, while effective, has notable limitations that shaped subsequent research.

The full-sequence reconstruction objective is computationally expensive. Unlike T5, which generates only corrupted spans, BART's decoder processes every token of the original sequence. For long documents, this increases training cost significantly. And although teacher forcing lets the decoder train over all positions in parallel, generation at inference time remains sequential, token by token, unlike the encoder, which processes all positions simultaneously.

The fixed corruption scheme may not be optimal for all downstream tasks. BART's text infilling with sentence permutation works well for summarization and generation, but other corruptions might better suit specific applications. More recent work has explored task-adaptive pre-training, where the corruption scheme is tuned to match downstream task characteristics.

BART's bidirectional encoder limits its effectiveness for pure language modeling tasks. The model cannot directly perform next-token prediction like GPT because its encoder expects to see corrupted versions of complete sequences, not prefixes of text. This makes BART less suitable for applications requiring left-to-right generation from prompts without conditioning input.

Despite these limitations, BART demonstrated that encoder-decoder models could match or exceed BERT and GPT on their respective strengths (understanding and generation) by using appropriate pre-training corruptions. This finding influenced subsequent work on unified models that handle diverse tasks within a single architecture.

The span-based corruption approach has proven durable. Modern models continue to use variants of span corruption, with innovations focusing on how to select span boundaries, what to replace spans with, and how to balance local and global corruptions. The core idea, that variable-length span corruption teaches useful inductive biases for generation, remains central to pre-training design.

Summary

BART's pre-training strategy frames language model training as denoising autoencoding. By systematically exploring five corruption schemes (token masking, token deletion, text infilling, sentence permutation, and document rotation), the researchers identified effective combinations for learning both local token patterns and global document structure.

Text infilling proved to be the most important individual corruption, teaching the model to predict how many tokens each mask represents and to generate coherent multi-token spans. Sentence permutation adds modest gains for tasks requiring document-level understanding. The final BART recipe combines these two approaches while avoiding less effective corruptions.

The key design choices that distinguish BART from related approaches are:

  • Full-sequence reconstruction: Unlike T5, BART generates the complete original text, providing a training signal for every position.
  • Variable-length spans: Unlike BERT's one-to-one masking, BART's infilling teaches length prediction through many-to-one corruption.
  • Combined structural and local corruption: Sentence permutation addresses discourse structure while infilling addresses token-level prediction.

These pre-training choices directly enable BART's strong performance on generation tasks such as summarization, translation, and dialogue, which we'll explore further in subsequent chapters on fine-tuning and application.

