Denoising Objectives: BART's Corruption Strategies for Language Models

Michael Brenndoerfer · Updated July 14, 2025 · 33 min read

Learn how BART trains language models using diverse text corruptions including token deletion, shuffling, sentence permutation, and text infilling to build versatile encoder-decoder models.

Denoising Objectives

Language models learn by predicting missing or corrupted text. Masked language modeling replaces tokens with [MASK], and span corruption hides contiguous chunks. But what if we corrupted text in more diverse ways? Token deletion, shuffling, sentence permutation, and document rotation all introduce different types of noise that force models to learn different aspects of language structure.

Denoising objectives generalize the idea of text reconstruction. Instead of a single corruption strategy, they apply multiple transformations that break different properties of natural text. The model learns to recover the original by developing robust understanding of word order, sentence boundaries, document structure, and semantic coherence. BART (Bidirectional and Auto-Regressive Transformers) pioneered this approach, showing that combining diverse noise types produces models that excel at both understanding and generation.

In this chapter, we'll explore the major denoising transformations, understand what each one teaches the model, implement them from scratch, and see how combining them creates versatile language models.

The Denoising Autoencoder Framework

Denoising objectives treat pretraining as an autoencoding problem. The model receives corrupted input and must reconstruct the original. This differs from standard autoencoders, which simply copy input to output. By corrupting the input, we force the model to learn meaningful representations rather than trivial identity mappings.

Denoising Autoencoder

A model trained to reconstruct clean data from corrupted inputs. By learning to remove noise, the model develops robust representations that capture the underlying structure of the data rather than surface-level patterns.

Formally, given original text $x$, we apply a corruption function $c(\cdot)$ to obtain noisy input $\tilde{x} = c(x)$. The model learns parameters $\theta$ to maximize the probability of recovering $x$ from $\tilde{x}$. The training objective minimizes the negative log-likelihood of reconstructing the original:

$$\mathcal{L} = -\log P_\theta(x \mid \tilde{x})$$

where:

  • $\mathcal{L}$: the denoising loss we want to minimize
  • $x$: the original, uncorrupted text sequence
  • $\tilde{x}$: the corrupted input produced by applying the corruption function $c(\cdot)$ to $x$
  • $\theta$: the model parameters (encoder and decoder weights)
  • $P_\theta(x \mid \tilde{x})$: the probability the model assigns to the original text $x$ given the corrupted input $\tilde{x}$

The negative log transforms the probability (a value between 0 and 1) into a loss: when the model assigns high probability to the correct reconstruction, the loss is low. When the model is uncertain or wrong, the loss is high. Minimizing this loss trains the model to reliably reconstruct clean text from corrupted inputs.
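
To make the objective concrete, here is a minimal numeric sketch of the loss. The per-token probabilities are invented for illustration (a real model produces them from its softmax over the vocabulary), but the arithmetic is exactly the negative log-likelihood above, accumulated over the tokens of $x$ under teacher forcing.

Code
import numpy as np

# Hypothetical probabilities the decoder assigns to each original token,
# conditioned on the corrupted input and the previously generated tokens.
# These values are made up for illustration.
token_probs = np.array([0.9, 0.7, 0.95, 0.6])

per_token_nll = -np.log(token_probs)   # -log P(x_t | x_<t, x~) for each token
loss = per_token_nll.sum()             # sequence-level negative log-likelihood

print("Per-token NLL:", np.round(per_token_nll, 3))
print("Sequence loss:", round(float(loss), 3))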

The choice of corruption function determines what the model must learn. Simple corruptions like random token replacement teach local dependencies. Complex corruptions like document rotation teach global structure. By combining multiple corruption types, we can train models that understand language at every level.

The encoder-decoder architecture fits naturally with denoising. The encoder processes the corrupted input bidirectionally, building rich representations. The decoder generates the original text autoregressively, learning to produce coherent output. This combination gives BART-style models the best of both worlds: bidirectional encoding for understanding and autoregressive decoding for generation.
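
As a quick illustration of this encoder-decoder setup in action, the sketch below runs a pretrained BART checkpoint on a masked input. It assumes the Hugging Face transformers library and the facebook/bart-base checkpoint are available; treat it as a demonstration of inference with an already-trained denoiser, not of the pretraining procedure itself.

Code
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

# BART's tokenizer uses <mask> as its mask token
corrupted = "The quick brown <mask> jumps over the lazy dog."
input_ids = tokenizer(corrupted, return_tensors="pt").input_ids

# The encoder reads the corrupted input bidirectionally; generate() runs the
# decoder autoregressively to produce a reconstruction.
output_ids = model.generate(input_ids, num_beams=4, max_length=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))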

Out[3]:
Visualization
Flow diagram showing original text, corruption step, encoder, decoder, and reconstructed text.
The denoising autoencoder framework for language models. Corruption transforms the original text into noisy input. The encoder processes this bidirectionally, and the decoder reconstructs the original text autoregressively.

Token Deletion

Token deletion randomly removes tokens from the input sequence. Unlike masking, which replaces tokens with a placeholder, deletion removes them entirely. The model must determine which tokens are missing and where they belonged.

This corruption is more challenging than masking because position information is lost. With masking, the model knows exactly where each missing token should go. With deletion, it must infer both the missing content and its location from context alone.

In[4]:
Code
import numpy as np


def apply_token_deletion(tokens, deletion_prob=0.15):
    """
    Delete tokens randomly from the sequence.

    Args:
        tokens: List of tokens
        deletion_prob: Probability of deleting each token

    Returns:
        corrupted: Tokens with some deleted
        original: Original tokens for reconstruction target
    """
    corrupted = []
    for token in tokens:
        if np.random.random() > deletion_prob:
            corrupted.append(token)
    return corrupted, tokens
In[5]:
Code
# Demonstrate token deletion
np.random.seed(42)
original_sentence = "The quick brown fox jumps over the lazy dog".split()
deleted, target = apply_token_deletion(original_sentence, deletion_prob=0.2)
Out[6]:
Console
Original (9 tokens):
The quick brown fox jumps over the lazy dog

After deletion (6 tokens):
The quick brown fox lazy dog

Deleted 3 tokens (33% of original)

The corrupted sequence is shorter than the original. With a 20% deletion probability, we expect roughly 2 tokens to be removed from a 9-token sentence. The actual number varies due to random sampling. Notice that the deleted tokens could be anywhere in the sequence, and the remaining tokens are simply concatenated without any placeholder marking where deletions occurred. The model must learn to expand the sequence during reconstruction, inserting the missing tokens in the correct positions.

Token deletion teaches several important skills. First, the model learns robust representations that don't depend on specific tokens being present. Second, it learns about syntactic obligatoriness, recognizing when articles, prepositions, or other grammatical elements are missing. Third, it develops sensitivity to semantic completeness, detecting when content words have been removed.

Deletion Rate Selection

The deletion probability controls the difficulty of reconstruction. Too low (under 5%), and most sequences pass through unchanged, providing little training signal. Too high (over 30%), and too much information is lost for reliable reconstruction.
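
The effect on sequence length is easy to quantify. Here is a short back-of-the-envelope sketch (pure arithmetic, no model involved; the 100-token length is an arbitrary example):

Code
import numpy as np

seq_len = 100  # illustrative sequence length
rates = np.array([0.05, 0.15, 0.30, 0.50])

# Each token survives independently with probability (1 - rate),
# so the expected remaining length is seq_len * (1 - rate).
expected_remaining = seq_len * (1 - rates)

for rate, remaining in zip(rates, expected_remaining):
    print(f"deletion rate {rate:.0%}: ~{remaining:.0f} of {seq_len} tokens remain")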

Out[7]:
Visualization
Line plot showing remaining sequence length decreasing as deletion rate increases from 0% to 50%.
Impact of deletion rate on sequence length and reconstruction difficulty. Higher rates produce shorter sequences with more missing information, making reconstruction harder.

A 15% deletion rate, matching BERT's masking rate, provides a reasonable balance. This removes enough tokens for meaningful learning while preserving sufficient context for accurate reconstruction. Note that BART's final configuration uses text infilling (30%) rather than pure token deletion, but 15% remains a common baseline when experimenting with deletion-based corruption.

Token Shuffling

Token shuffling permutes tokens within the sequence, breaking the original word order. The model must learn to reorder tokens into grammatically correct sequences. This corruption targets a model's understanding of syntax and word order constraints.

Unlike deletion, shuffling preserves all tokens. The information is present but scrambled. The model must learn that "dog the lazy" should become "the lazy dog" based on its understanding of English syntax.

In[8]:
Code
def apply_token_shuffling(tokens, shuffle_distance=3):
    """
    Shuffle tokens within a limited distance of their original positions.

    Args:
        tokens: List of tokens
        shuffle_distance: Maximum positions a token can move

    Returns:
        shuffled: Tokens with local shuffling applied
        original: Original tokens for reconstruction target
    """
    n = len(tokens)
    # Add noise to positions, then sort
    noisy_positions = np.arange(n) + np.random.uniform(0, shuffle_distance, n)
    shuffle_order = np.argsort(noisy_positions)
    shuffled = [tokens[i] for i in shuffle_order]
    return shuffled, tokens
In[9]:
Code
# Demonstrate token shuffling
np.random.seed(42)
original_sentence = "The quick brown fox jumps over the lazy dog".split()
shuffled, target = apply_token_shuffling(original_sentence, shuffle_distance=3)
Out[10]:
Console
Original:
The quick brown fox jumps over the lazy dog

Shuffled:
The quick brown jumps fox over the lazy dog

2 of 9 tokens changed position (22%)

With a shuffle distance of 3, tokens can move up to 3 positions from their original location. The algorithm adds random noise to each position, then sorts by the noisy positions, creating local permutations while preserving rough ordering. This approach ensures that tokens tend to stay near their original positions rather than being scattered randomly across the entire sequence.

Out[11]:
Visualization
Line plot showing displacement distributions for shuffle distances 1, 3, and 5, all peaked near zero with exponential decay.
Distribution of actual token displacements for different shuffle distance parameters. Even with distance 5, most tokens move only 1-2 positions, with decreasing probability for larger displacements. This creates a soft constraint that preserves local structure while still testing word order understanding.

Smaller distance values produce local permutations that are easier to correct. Larger values create more severe scrambling that requires understanding longer-range dependencies.

Local vs. Global Shuffling

Local shuffling (small distance) tests whether the model understands adjacent word relationships. Phrases like "the quick" or "brown fox" have strong local coherence that the model should detect.

Global shuffling (large distance) tests whether the model understands sentence-level structure. The subject-verb-object ordering of English, or the tendency for adjectives to precede nouns, becomes important when tokens move far from their original positions.
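
The sketch below makes the contrast visible by reusing apply_token_shuffling from above with distances 1, 3, and 5 on the same sentence. The seed is arbitrary, so the exact outputs will differ from run to run.

Code
np.random.seed(0)  # arbitrary seed; results vary with other seeds
sentence = "The quick brown fox jumps over the lazy dog".split()

for distance in [1, 3, 5]:
    shuffled, _ = apply_token_shuffling(sentence, shuffle_distance=distance)
    moved = sum(a != b for a, b in zip(shuffled, sentence))
    print(f"distance {distance}: {' '.join(shuffled)}  ({moved} tokens displaced)")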

Out[12]:
Visualization
Three example sentences showing progressive scrambling with distance 1, 3, and 5.
Comparison of shuffling with different distance parameters. Distance 1 produces minimal disruption, while distance 5 creates significant scrambling while still keeping some local structure.

Sentence Permutation

Sentence permutation shuffles the order of sentences within a document. While token shuffling breaks word-level order, sentence permutation breaks discourse-level structure. The model must learn how sentences relate to each other and what ordering makes a coherent document.

This corruption helps with tasks that require understanding document structure, such as summarization, document classification, and multi-document reasoning.

In[13]:
Code
def apply_sentence_permutation(text, sentence_delimiter="."):
    """
    Randomly permute sentences within the document.

    Args:
        text: Input text string
        sentence_delimiter: Character that separates sentences

    Returns:
        permuted: Text with sentences reordered
        original: Original text for reconstruction target
    """
    # Split into sentences (keeping delimiter)
    sentences = [
        s.strip() + sentence_delimiter
        for s in text.split(sentence_delimiter)
        if s.strip()
    ]

    # Random permutation
    permuted_order = np.random.permutation(len(sentences))
    permuted_sentences = [sentences[i] for i in permuted_order]

    return " ".join(permuted_sentences), text
In[14]:
Code
# Demonstrate sentence permutation
np.random.seed(42)
document = "The cat sat on the mat. It was a sunny day. The cat fell asleep. Later it woke up hungry."
permuted, original = apply_sentence_permutation(document)
Out[15]:
Console
Original document:
The cat sat on the mat. It was a sunny day. The cat fell asleep. Later it woke up hungry.

Permuted document:
It was a sunny day. Later it woke up hungry. The cat sat on the mat. The cat fell asleep.

4 sentences randomly reordered

The permuted document contains all the original information but in scrambled order. Notice how "Later it woke up hungry" appears without prior context about the cat falling asleep, making the narrative harder to follow. The model must learn to detect these coherence breaks and recognize that certain sentences require prior context to make sense.

Discourse Coherence

Sentence permutation forces the model to learn discourse markers and rhetorical relationships. Words like "however," "therefore," "first," and "finally" provide strong signals about sentence ordering. Pronoun resolution also provides clues: "it" in "It was a sunny day" must refer to something previously mentioned.

Narrative text is harder because events unfold in a specific order. Scientific writing with clear logical progression is easier because explicit markers guide the ordering. Creative writing with complex temporal structure is hardest because the "correct" order may not be unique.

Out[16]:
Visualization
Diagram showing original sentence order and permuted order with arrows indicating coherence breaks.
Sentence permutation disrupts discourse coherence. The model must learn to detect when references like 'it' or 'this' require prior context and when temporal markers indicate ordering.

Document Rotation

Document rotation moves a random portion of the document from the beginning to the end, or vice versa. The text is "rotated" around a randomly chosen pivot point. This corruption breaks the document's beginning and end while preserving internal structure.

In[17]:
Code
def apply_document_rotation(tokens):
    """
    Rotate the document by moving tokens from the beginning to the end.

    Args:
        tokens: List of tokens

    Returns:
        rotated: Tokens with rotation applied
        original: Original tokens for reconstruction target
        rotation_point: Where the rotation occurred
    """
    if len(tokens) < 2:
        return tokens, tokens, 0

    # Choose a random rotation point
    rotation_point = np.random.randint(1, len(tokens))

    # Rotate: move first part to the end
    rotated = tokens[rotation_point:] + tokens[:rotation_point]

    return rotated, tokens, rotation_point
In[18]:
Code
# Demonstrate document rotation
np.random.seed(42)
original_tokens = (
    "Once upon a time there was a brave knight who lived in a castle".split()
)
rotated, original, pivot = apply_document_rotation(original_tokens)
Out[19]:
Console
Original:
Once upon a time there was a brave knight who lived in a castle

Rotated at position 7 (50% into document):
brave knight who lived in a castle Once upon a time there was a

Moved to end: 'Once upon a time there was a'

The rotated text starts mid-sentence with "brave knight who lived..." rather than the natural opening "Once upon a time." The model must recognize this unnatural beginning and identify where the original document started. The classic fairy tale opening provides a strong signal about document structure, teaching the model to recognize common opening patterns like "Once upon a time," "In this paper," or "Chapter 1."

Start Token Identification

A useful variant of document rotation marks where the original document began with a sentinel token in the corrupted input. This simplifies the task slightly but still requires the model to learn document structure.

Without the sentinel, which is the setting BART's rotation objective uses, the model must learn to identify document beginnings from content alone. Phrases like "In this paper," "Chapter 1," or narrative openings like "It was a dark and stormy night" signal document starts. The model develops sensitivity to these patterns through the rotation objective.

In[20]:
Code
def apply_document_rotation_with_sentinel(tokens, sentinel="[START]"):
    """
    Rotate document and mark original start position with sentinel.

    Args:
        tokens: List of tokens
        sentinel: Token to mark original document start

    Returns:
        rotated: Rotated tokens with sentinel marker
        original: Original tokens
    """
    if len(tokens) < 2:
        return tokens, tokens

    rotation_point = np.random.randint(1, len(tokens))
    rotated = tokens[rotation_point:] + [sentinel] + tokens[:rotation_point]

    return rotated, tokens
In[21]:
Code
np.random.seed(42)
rotated_with_sentinel, original = apply_document_rotation_with_sentinel(
    original_tokens
)
Out[22]:
Console
Original:
Once upon a time there was a brave knight who lived in a castle

Rotated with sentinel:
brave knight who lived in a castle [START] Once upon a time there was a

The sentinel token provides an explicit signal about where rotation occurred. The model's job becomes somewhat easier: generate the correct sequence given knowledge of where the original start was.
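
To see why the sentinel helps, note that recovering the original order becomes a purely mechanical operation once the start marker is known. The sketch below reuses rotated_with_sentinel from the previous cell; the model, of course, must learn an approximation of this mapping from data rather than apply it directly.

Code
def undo_rotation_with_sentinel(rotated, sentinel="[START]"):
    """Recover the original token order given the sentinel marking the true start."""
    idx = rotated.index(sentinel)
    return rotated[idx + 1:] + rotated[:idx]

recovered = undo_rotation_with_sentinel(rotated_with_sentinel)
print(" ".join(recovered))  # matches the original token sequence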

Text Infilling

Text infilling combines aspects of span corruption and token deletion. Random spans of text are replaced with a single mask token, and the model must generate the missing content. Unlike span corruption where each span gets a unique sentinel, infilling uses a single mask token for all gaps.

This is more challenging because the model must determine both the content and the length of each missing span. A single [MASK] might represent one word or ten words.

In[23]:
Code
def apply_text_infilling(
    tokens, mask_prob=0.15, mean_span_length=3, mask_token="[MASK]"
):
    """
    Replace spans of tokens with a single mask token.

    Args:
        tokens: List of tokens
        mask_prob: Probability of each token being in a masked span
        mean_span_length: Average length of masked spans
        mask_token: Token to insert for masked spans

    Returns:
        infilled: Tokens with spans replaced by mask tokens
        original: Original tokens
        spans: List of (start, end) tuples for masked spans
    """
    n = len(tokens)
    num_to_mask = int(n * mask_prob)

    # Sample span lengths from geometric distribution
    p = 1 / mean_span_length
    spans = []
    total_masked = 0

    while total_masked < num_to_mask and len(spans) < n:
        span_length = np.random.geometric(p)
        span_length = min(span_length, num_to_mask - total_masked)
        if span_length > 0:
            spans.append(span_length)
            total_masked += span_length

    # Randomly place spans
    mask = np.zeros(n, dtype=bool)
    span_positions = []

    for span_length in spans:
        # Find valid start positions
        valid_starts = []
        for start in range(n - span_length + 1):
            if not any(mask[start : start + span_length]):
                valid_starts.append(start)

        if not valid_starts:
            continue

        start = np.random.choice(valid_starts)
        mask[start : start + span_length] = True
        span_positions.append((start, start + span_length))

    # Sort spans by position
    span_positions.sort()

    # Build infilled sequence
    infilled = []
    last_end = 0

    for start, end in span_positions:
        infilled.extend(tokens[last_end:start])
        infilled.append(mask_token)
        last_end = end

    infilled.extend(tokens[last_end:])

    return infilled, tokens, span_positions
In[24]:
Code
np.random.seed(42)
sentence = "The transformer architecture has revolutionized natural language processing".split()
infilled, original, spans = apply_text_infilling(
    sentence, mask_prob=0.3, mean_span_length=2
)
Out[25]:
Console
Original:
The transformer architecture has revolutionized natural language processing

Infilled:
The transformer [MASK] has revolutionized [MASK] language processing

Masked 2 tokens (25%) in 2 span(s):
  [2:3] 'architecture' -> [MASK]
  [5:6] 'natural' -> [MASK]

Multiple tokens can collapse into a single [MASK] placeholder. In this run both spans happened to cover one token each, so the sequence length is unchanged, but whenever a span covers several tokens the sentence shrinks, because each span becomes a single mask regardless of its length. The model must learn to generate the correct number of tokens for each mask, a more challenging objective than predicting a fixed number of tokens per placeholder.

Span Length Distribution

The geometric distribution controls how span lengths are sampled. With a mean of 3, most spans are short (1-2 tokens), but the heavy tail allows occasional longer spans that test phrase-level understanding.

Out[26]:
Visualization
Bar chart showing span length probabilities decreasing exponentially, with length 1 at 33%, length 2 at 22%, and so on.
Distribution of span lengths sampled from a geometric distribution with mean 3. Short spans (1-2 tokens) dominate, but the heavy tail includes spans up to 10+ tokens, exposing the model to both word-level and phrase-level reconstruction challenges.

The distribution shows that roughly one-third of spans contain just a single token, similar to standard MLM. However, the remaining two-thirds contain multiple tokens, forcing the model to generate coherent phrases rather than isolated words. This balance between token-level and phrase-level prediction is key to text infilling's effectiveness.
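
The numbers in the figure above follow directly from the geometric distribution: with parameter $p = 1/3$ (mean $1/p = 3$), the probability of a span of length $k$ is $p(1-p)^{k-1}$. A short sketch computing the first few values:

Code
import numpy as np

p = 1 / 3  # geometric parameter; mean span length is 1/p = 3
lengths = np.arange(1, 7)
probs = p * (1 - p) ** (lengths - 1)  # P(length = k) = p * (1 - p)**(k - 1)

for k, prob in zip(lengths, probs):
    print(f"P(span length = {k}) = {prob:.1%}")
print("Theoretical mean span length:", 1 / p)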

BART-Style Combined Denoising

BART combines multiple corruption types to create a versatile denoising objective. The original BART paper explored five transformations:

  • Token masking: Replace tokens with [MASK] (similar to BERT)
  • Token deletion: Remove tokens entirely
  • Text infilling: Replace spans with single [MASK] tokens
  • Sentence permutation: Shuffle sentence order
  • Document rotation: Rotate document around random point

The best-performing configuration used text infilling as the primary corruption, combined with sentence permutation. This combination forces the model to learn both local (word-level) and global (document-level) structure.

In[27]:
Code
class BARTCorruptor:
    """
    Apply BART-style denoising corruption to text.

    Combines multiple corruption strategies:
    - Text infilling (spans replaced with single mask)
    - Sentence permutation (reorder sentences)
    """

    def __init__(
        self,
        mask_prob=0.30,
        mean_span_length=3,
        permute_sentences=True,
        mask_token="[MASK]",
    ):
        self.mask_prob = mask_prob
        self.mean_span_length = mean_span_length
        self.permute_sentences = permute_sentences
        self.mask_token = mask_token

    def _split_sentences(self, tokens):
        """Split token list into sentences (simple period-based)."""
        sentences = []
        current = []

        for token in tokens:
            current.append(token)
            if (
                token.endswith(".")
                or token.endswith("!")
                or token.endswith("?")
            ):
                sentences.append(current)
                current = []

        if current:
            sentences.append(current)

        return sentences

    def _infill_tokens(self, tokens):
        """Apply text infilling to token list."""
        n = len(tokens)
        if n == 0:
            return tokens

        num_to_mask = max(1, int(n * self.mask_prob))

        # Sample span lengths
        p = 1 / self.mean_span_length
        mask = np.zeros(n, dtype=bool)
        span_positions = []
        total_masked = 0

        attempts = 0
        while total_masked < num_to_mask and attempts < 100:
            attempts += 1
            span_length = np.random.geometric(p)
            span_length = min(span_length, num_to_mask - total_masked, n)

            valid_starts = [
                i
                for i in range(n - span_length + 1)
                if not any(mask[i : i + span_length])
            ]

            if not valid_starts:
                continue

            start = np.random.choice(valid_starts)
            mask[start : start + span_length] = True
            span_positions.append((start, start + span_length))
            total_masked += span_length

        span_positions.sort()

        # Build result
        result = []
        last_end = 0
        for start, end in span_positions:
            result.extend(tokens[last_end:start])
            result.append(self.mask_token)
            last_end = end
        result.extend(tokens[last_end:])

        return result

    def corrupt(self, tokens):
        """
        Apply BART-style corruption.

        Args:
            tokens: List of string tokens

        Returns:
            dict with 'input' (corrupted) and 'target' (original) token lists
        """
        working_tokens = list(tokens)

        # Step 1: Sentence permutation (if enabled and multiple sentences)
        if self.permute_sentences:
            sentences = self._split_sentences(working_tokens)
            if len(sentences) > 1:
                order = np.random.permutation(len(sentences))
                working_tokens = []
                for i in order:
                    working_tokens.extend(sentences[i])

        # Step 2: Text infilling
        corrupted = self._infill_tokens(working_tokens)

        return {"input": corrupted, "target": list(tokens)}
In[28]:
Code
# Demonstrate BART corruption
np.random.seed(42)
document = """The transformer changed NLP. Attention is all you need. 
Models became larger. Performance improved dramatically."""
tokens = document.split()

corruptor = BARTCorruptor(
    mask_prob=0.30, mean_span_length=3, permute_sentences=True
)
result = corruptor.corrupt(tokens)
Out[29]:
Console
Original:
The transformer changed NLP. Attention is all you need. Models became larger. Performance improved dramatically.

Corrupted (12 tokens):
Attention is all you [MASK] The transformer changed NLP. Models became larger.

Compression: 15 -> 12 tokens (20% reduction)

The corrupted version demonstrates both transformations working together. Sentences appear in a different order, and a multi-token span has been replaced with a single [MASK] token. The sequence is noticeably shorter because each masked span, regardless of how many tokens it contained, becomes a single placeholder. This compression is a key efficiency advantage of text infilling: the encoder processes fewer tokens while the decoder still generates the full original.

Out[30]:
Visualization
Bar chart comparing input/output length ratios for different corruption types, showing infilling has the lowest ratio around 0.75.
Sequence length changes across corruption strategies. Token shuffling and sentence permutation preserve length (ratio = 1.0), while deletion and infilling reduce encoder input length. Text infilling achieves the greatest compression because multiple tokens collapse into single masks.

The compression from text infilling provides computational savings during training. The encoder processes shorter sequences, reducing the quadratic attention cost, while the decoder still learns to generate the full original sequence.
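
As a rough sanity check on the figure above, the sketch below reuses the corruption functions defined earlier on a single example sentence and reports the resulting encoder input lengths. The numbers are illustrative rather than corpus statistics.

Code
np.random.seed(42)
tokens = "The quick brown fox jumps. It leaps over the lazy dog.".split()

deleted, _ = apply_token_deletion(tokens, deletion_prob=0.2)
shuffled, _ = apply_token_shuffling(tokens, shuffle_distance=3)
infilled, _, _ = apply_text_infilling(tokens, mask_prob=0.3, mean_span_length=3)

for name, corrupted in [
    ("deletion", deleted),
    ("shuffling", shuffled),
    ("infilling", infilled),
]:
    ratio = len(corrupted) / len(tokens)
    print(f"{name:10s}: {len(corrupted):2d}/{len(tokens)} tokens (ratio {ratio:.2f})")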

Comparing Corruption Strategies

Different corruptions teach different aspects of language understanding. Let's compare how each strategy transforms the same input.

In[31]:
Code
def compare_corruptions(text, seed=42):
    """Compare all corruption strategies on the same text."""
    tokens = text.split()
    results = {}

    # Token deletion
    np.random.seed(seed)
    deleted, _ = apply_token_deletion(tokens, deletion_prob=0.2)
    results["Token Deletion"] = " ".join(deleted)

    # Token shuffling
    np.random.seed(seed)
    shuffled, _ = apply_token_shuffling(tokens, shuffle_distance=3)
    results["Token Shuffling"] = " ".join(shuffled)

    # Text infilling
    np.random.seed(seed)
    infilled, _, _ = apply_text_infilling(
        tokens, mask_prob=0.25, mean_span_length=2
    )
    results["Text Infilling"] = " ".join(infilled)

    # Sentence permutation
    np.random.seed(seed)
    permuted, _ = apply_sentence_permutation(text)
    results["Sentence Permutation"] = permuted

    return results
In[32]:
Code
sample = "The quick brown fox jumps. It leaps over the lazy dog. The dog wakes up surprised."
comparisons = compare_corruptions(sample)
Out[33]:
Console
Original:
The quick brown fox jumps. It leaps over the lazy dog. The dog wakes up surprised.

============================================================

Token Deletion:
The quick brown fox over the lazy The dog wakes

Token Shuffling:
The quick brown jumps. fox It leaps over the dog. lazy wakes The dog up surprised.

Text Infilling:
The quick brown fox jumps. It leaps [MASK] [MASK] The dog wakes up surprised.

Sentence Permutation:
The quick brown fox jumps. It leaps over the lazy dog. The dog wakes up surprised.

In this run, the random permutation happened to return the three sentences in their original order (with three sentences there is a 1-in-6 chance of drawing the identity permutation), so the last line matches the original. Each corruption breaks different properties:

  • Token deletion: Removes specific words, breaking local completeness
  • Token shuffling: Scrambles word order, breaking syntax
  • Text infilling: Hides contiguous chunks, breaking both content and length
  • Sentence permutation: Reorders sentences, breaking discourse flow

The choice of corruption depends on the target application. For generation tasks, text infilling works best because it trains the decoder to produce variable-length outputs. For understanding tasks, combining multiple corruptions produces more robust representations.

Out[34]:
Visualization
Table showing corruption types and what linguistic properties they break.
Comparison of what each corruption strategy breaks. Token-level corruptions (deletion, shuffling) test local structure, while document-level corruptions (permutation, rotation) test global coherence.

Training with Denoising Objectives

Training a denoising model requires carefully balancing the encoder and decoder. The encoder must build useful representations from corrupted input. The decoder must learn to generate coherent text conditioned on those representations.

In[35]:
Code
import torch


class DenoisingTrainer:
    """
    Training loop for denoising language models.

    Handles batching, corruption, and loss computation.
    """

    def __init__(self, model, tokenizer, corruptor, learning_rate=1e-4):
        self.model = model
        self.tokenizer = tokenizer
        self.corruptor = corruptor
        self.optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

    def prepare_batch(self, texts):
        """
        Prepare a batch for training.

        Args:
            texts: List of text strings

        Returns:
            dict with encoder_input, decoder_input, labels tensors
        """
        batch_encoder = []
        batch_decoder = []
        batch_labels = []

        for text in texts:
            tokens = text.split()
            result = self.corruptor.corrupt(tokens)

            # Corrupted text goes to encoder
            encoder_tokens = result["input"]
            # Original text is the target for decoder
            target_tokens = result["target"]

            batch_encoder.append(" ".join(encoder_tokens))
            batch_decoder.append(" ".join(target_tokens))
            batch_labels.append(" ".join(target_tokens))

        return {
            "encoder_texts": batch_encoder,
            "decoder_texts": batch_decoder,
            "target_texts": batch_labels,
        }

    def compute_loss(self, encoder_input, decoder_input, labels):
        """
        Compute cross-entropy loss on target tokens.

        This is a placeholder showing the structure.
        Real implementation would use model forward pass.
        """
        # In practice:
        # outputs = self.model(encoder_input, decoder_input)
        # loss = F.cross_entropy(outputs.logits, labels)
        pass

The training loop follows a straightforward sequence-to-sequence pattern:

  1. Sample a batch of documents from the training corpus
  2. Apply corruption (text infilling and sentence permutation)
  3. Pass corrupted input through the encoder
  4. Generate the original text with the decoder using teacher forcing
  5. Compute cross-entropy loss on target tokens
  6. Backpropagate gradients and update parameters

The training objective is standard sequence-to-sequence cross-entropy. The difference from other seq2seq tasks lies entirely in how the training pairs are constructed: the corrupted input is the source, and the original text is the target.
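
The placeholder compute_loss above can be filled in with any encoder-decoder implementation. The sketch below shows one way to do it with the Hugging Face transformers library and the facebook/bart-base checkpoint (an assumption for illustration, not the setup of the original BART pretraining run); passing labels makes the model shift them internally for teacher forcing and return the cross-entropy loss.

Code
import torch
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

original = "The transformer changed NLP. Attention is all you need."
corrupted = "Attention is all you need. The transformer changed <mask>"  # <mask> is BART's mask token

inputs = tokenizer(corrupted, return_tensors="pt")
labels = tokenizer(original, return_tensors="pt").input_ids

# Forward pass: the corrupted text feeds the encoder, the original text is the target.
outputs = model(**inputs, labels=labels)
loss = outputs.loss  # cross-entropy over the reconstruction targets

loss.backward()
optimizer.step()
optimizer.zero_grad()
print("Reconstruction loss:", float(loss))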

Hyperparameter Considerations

Key hyperparameters for denoising pretraining include:

  • Mask probability (default: 30% for BART): Higher than BERT's 15% because infilling is more efficient. Each mask token can represent multiple original tokens.

  • Mean span length (default: 3): Balances short spans (token-level learning) with longer spans (phrase-level learning). Span lengths follow a geometric distribution with parameter $p = 1/3$, producing a mean of 3 tokens. Most spans are short (1-2 tokens), with occasional longer spans.

  • Sentence permutation probability: BART applies permutation to all documents. Some variants skip permutation for single-sentence inputs.

  • Learning rate: Standard transformer learning rates (1e-4 to 5e-4) with warmup work well.

  • Batch size: Large batches (2048+ tokens per GPU) help with the noisy gradients from aggressive corruption.

Out[36]:
Visualization
Dual-axis line plot showing training signal increasing and encoder context decreasing as mask probability increases from 10% to 50%.
Trade-off between mask probability and training dynamics. Higher mask probabilities provide more training signal (more tokens to reconstruct) but also reduce encoder context. The 30% rate used by BART provides substantial signal while preserving 70% of context for the encoder.

Limitations and Impact

Denoising objectives have transformed language model pretraining, but they come with tradeoffs that shape their practical applications.

The most fundamental limitation is the mismatch between pretraining and downstream tasks. During pretraining, the model always receives corrupted input and must produce the original. During fine-tuning and inference, the model receives clean input and must produce novel output. This gap means that denoising models may not immediately transfer their reconstruction skills to generation tasks without fine-tuning. BART addresses this partially through its autoregressive decoder, which learns generation patterns even during denoising, but the task distribution shift remains a concern for zero-shot applications.

Computational efficiency presents another tradeoff. Aggressive corruption reduces the input sequence length (especially with text infilling), which speeds up encoding. However, the decoder must still generate the full original sequence, and the encoder-decoder architecture is more expensive than encoder-only or decoder-only models of equivalent capacity. For pure understanding tasks, BERT-style encoders may be more efficient. For pure generation tasks, GPT-style decoders may be preferable. Denoising models like BART occupy a middle ground that excels when both capabilities are needed.

Key practical considerations include:

  • Corruption choice matters: Different corruptions suit different downstream tasks. Text infilling helps summarization but may not help classification. Empirical tuning is often necessary.

  • Sentence boundaries required: Sentence permutation requires reliable sentence segmentation. For domains with unusual punctuation or formatting, this corruption may introduce artifacts.

  • Length changes complicate batching: Token deletion and infilling change sequence lengths, requiring dynamic padding or variable-length batching infrastructure.

Despite these limitations, denoising objectives work well in practice. BART achieved state-of-the-art results on summarization, translation, and question answering when released. The insight that diverse corruptions produce versatile models has influenced subsequent work like T5, mBART, and PEGASUS. Models trained with denoising objectives often transfer better to generation tasks than MLM-only models, while maintaining competitive understanding performance.

Key Parameters

The corruption functions implemented in this chapter share several configurable parameters that control the difficulty and nature of the denoising task:

  • deletion_prob (default: 0.15): Probability of deleting each token in token deletion. Higher values remove more content but may destroy too much context for reliable reconstruction. The 15% rate matches BERT's masking rate, though BART's final model uses text infilling rather than pure deletion.

  • shuffle_distance (default: 3): Maximum positions a token can move during shuffling. Smaller values (1-2) test local word order, while larger values (5+) test sentence-level syntax understanding. The noise-based algorithm ensures tokens typically move less than the maximum.

  • mask_prob (default: 0.30 for BART): Fraction of tokens to include in masked spans during text infilling. Higher than BERT's 15% because each mask can represent multiple tokens, making the objective more efficient.

  • mean_span_length (default: 3): Average number of tokens per masked span. Sampled from a geometric distribution, meaning most spans are short (1-2 tokens) with occasional longer spans. Controls the balance between token-level and phrase-level learning.

  • permute_sentences (default: True): Whether to apply sentence permutation before text infilling. Combining both corruptions teaches both local and global structure. Can be disabled for single-sentence inputs.

  • mask_token (default: "[MASK]"): Placeholder token inserted for masked spans. Unlike span corruption which uses unique sentinels for each span, text infilling uses the same token for all spans, requiring the model to infer span boundaries.
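
Tying these parameters together, a typical configuration of the BARTCorruptor defined earlier might look like the following sketch. The example text and seed are arbitrary, and the values mirror the defaults above rather than BART's exact preprocessing pipeline.

Code
np.random.seed(0)  # arbitrary seed for reproducibility

corruptor = BARTCorruptor(
    mask_prob=0.30,        # fraction of tokens covered by masked spans
    mean_span_length=3,    # mean of the geometric span-length distribution
    permute_sentences=True,
    mask_token="[MASK]",
)

example = "Denoising objectives corrupt text. The model reconstructs it.".split()
pair = corruptor.corrupt(example)
print("Encoder input :", " ".join(pair["input"]))
print("Decoder target:", " ".join(pair["target"]))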

Summary

Denoising objectives train models to reconstruct original text from corrupted inputs. This chapter covered the major corruption strategies and their effects:

  • Token deletion removes tokens entirely, forcing the model to identify missing content and insert it at the correct positions

  • Token shuffling permutes word order within a local window, teaching the model syntax and word order constraints

  • Sentence permutation reorders sentences within documents, developing understanding of discourse structure and coherence

  • Document rotation moves content from the beginning to the end, training sensitivity to document boundaries and openings

  • Text infilling replaces spans with single mask tokens, combining content prediction with length prediction for flexible generation

  • BART-style combination uses text infilling plus sentence permutation to train models that excel at both understanding and generation

The choice of corruption strategy depends on the target application. Text infilling suits generation tasks, while combining multiple corruptions produces more robust general-purpose models. The encoder-decoder architecture of denoising models naturally supports both bidirectional encoding for understanding and autoregressive decoding for generation.

The next part of the book explores BERT and its variants, examining how encoder-only architectures apply masked language modeling to build powerful understanding models.
