Span Corruption: T5's Pretraining Objective for Sequence-to-Sequence Learning

Michael Brenndoerfer · Updated July 11, 2025 · 35 min read

Learn how span corruption works in T5, including span selection strategies, geometric distributions, sentinel tokens, and computational benefits over masked language modeling.

Span Corruption

Masked language modeling (MLM) trains models to predict individual tokens hidden behind [MASK] placeholders. But what if we masked entire phrases or multi-word expressions instead? Span corruption takes this idea further: rather than masking isolated tokens, it corrupts contiguous spans of text and asks the model to reconstruct them. This simple shift fundamentally changes what the model learns during pretraining.

T5 (Text-to-Text Transfer Transformer) popularized span corruption as its primary pretraining objective. By treating the corrupted input as source text and the original spans as target text, T5 frames pretraining as a sequence-to-sequence problem. This approach produces models that excel at generation tasks while remaining competitive on understanding benchmarks.

In this chapter, you'll learn how span corruption works, why it outperforms token-level masking for certain applications, and how to implement it from scratch. We'll cover span selection strategies, the mathematics of span length distributions, sentinel token design, and the computational advantages that make this approach attractive at scale.

From Token Masking to Span Corruption

Standard MLM randomly selects 15% of tokens and replaces them with [MASK]. The model predicts each masked token independently, using the surrounding context. This works well but has limitations.

Consider the phrase "the cat sat on the mat." If MLM masks "sat" and "mat" independently, the model makes two separate predictions. It never learns that these words might be related or that predicting multi-word expressions requires different reasoning than predicting single tokens.

Span corruption addresses this by masking contiguous sequences. Instead of "the cat [MASK] on the [MASK]", we might see "the [X] on [Y]", where [X] replaces "cat sat" and [Y] replaces "the mat." The model must now generate complete phrases, not isolated words.

Span Corruption

A pretraining objective that replaces contiguous spans of tokens with single sentinel tokens. The model learns to reconstruct the original spans from the corrupted input, treating the task as sequence-to-sequence generation.

This shift has several consequences. First, the input sequence becomes shorter because multiple tokens collapse into single sentinels. Second, the target sequence contains multiple tokens per sentinel, requiring the model to generate coherently. Third, the model learns about phrase-level patterns and dependencies that token-level masking might miss.

Span Selection Strategies

Now that we understand what span corruption does conceptually, we need to answer a practical question: how do we decide which parts of a sequence to corrupt? This leads us to two interconnected design choices that significantly affect what the model learns.

Think of span corruption as a controlled demolition. We want to remove enough of the building (the text) to create a meaningful reconstruction challenge, but not so much that the remaining structure provides no clues about what was removed. This balance requires careful decisions about two quantities: how much total material to remove (the corruption rate) and how to divide that material into chunks (the span length distribution).

Corruption Rate: How Much to Remove

The corruption rate $r$ specifies what fraction of the original tokens should end up inside corrupted spans. T5 adopts $r = 0.15$, matching BERT's masking rate, but the mechanics differ substantially from token-level masking.

In standard MLM, a 15% corruption rate means we independently flip a coin for each token, masking it with 15% probability. The result: on average, 15% of tokens become [MASK], scattered throughout the sequence. With span corruption, we instead select contiguous regions totaling approximately 15% of the sequence. The key difference is that these tokens are grouped, not isolated.

This grouping creates an interesting constraint. If we want to corrupt a fixed fraction of tokens, the number of spans we create depends on how long each span is. Suppose we have a sequence of $n$ tokens and want to corrupt fraction $r$ of them. If each span contains an average of $\mu$ tokens, simple arithmetic tells us how many spans we need:

$$\text{number of spans} \approx \frac{r \cdot n}{\mu}$$

where:

  • $n$: the total number of tokens in the input sequence
  • $r$: the corruption rate (fraction of tokens to corrupt, e.g., 0.15 for 15%)
  • $\mu$: the average span length (e.g., 3 tokens per span)

The reasoning is direct: we need to "cover" $r \cdot n$ tokens with our spans. If each span covers $\mu$ tokens on average, we need $(r \cdot n) / \mu$ spans to achieve the target coverage.

Let's make this concrete. For a 512-token sequence with $r = 0.15$ and $\mu = 3$:

  1. Tokens to corrupt: $0.15 \times 512 \approx 77$ tokens
  2. Spans needed: $77 / 3 \approx 26$ spans
  3. Input length after corruption: Each span gets replaced by one sentinel token, so the corrupted input contains $512 - 77 + 26 = 461$ tokens

This calculation reveals an important property: span corruption naturally compresses the input sequence. We remove 77 tokens but add only 26 sentinels, achieving a net reduction of 51 tokens (about 10% shorter). This compression has computational benefits we'll explore later.
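
To keep this arithmetic handy, here is a small sketch that reproduces the numbers above in plain Python (the full corruption pipeline comes later in the chapter):

seq_length = 512
corruption_rate = 0.15
mean_span_length = 3

num_corrupted = round(seq_length * corruption_rate)    # ~77 tokens
num_spans = round(num_corrupted / mean_span_length)    # ~26 spans
input_length = seq_length - num_corrupted + num_spans  # ~461 tokens

Note that the implementation later in the chapter truncates instead of rounding, so its counts come out one lower (76 tokens, 25 spans) while the input length stays at 461.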

Span Length Distribution: The Shape of Uncertainty

The corruption rate tells us how much to remove; the span length distribution tells us how to partition that removal into individual spans. This choice significantly affects what patterns the model learns.

Consider the extremes. If all spans had length 1, span corruption would collapse to token-level masking: we'd have 77 isolated masks, each predicting a single token independently. If all 77 corrupted tokens formed a single span, we'd have one massive gap stretching across 77 consecutive positions, removing so much contiguous context that reconstructing it provides little useful training signal.

The sweet spot lies somewhere between. We want a mix of span lengths that exposes the model to diverse reconstruction challenges: some single-token predictions (maintaining fine-grained language modeling ability), some short phrases (learning local coherence), and occasional longer spans (forcing the model to reason about larger structures).

T5 uses a geometric distribution to achieve this mix. The geometric distribution is the discrete analog of the exponential distribution, modeling a natural process: imagine flipping a biased coin at each position within a span, continuing until we get "heads" (stop) instead of "tails" (continue). The span length equals the total number of positions visited, including the one at which we stop.

Mathematically, if the probability of stopping at each step is $p$, the probability of a span having exactly $\ell$ tokens is:

$$P(\ell) = (1 - p)^{\ell - 1} \cdot p$$

where:

  • $P(\ell)$: the probability of sampling a span of exactly $\ell$ tokens
  • $\ell$: the span length (a positive integer: 1, 2, 3, ...)
  • $p$: the stopping probability at each step, controlling the distribution's shape
  • $(1 - p)^{\ell - 1}$: the probability of "continuing" for $\ell - 1$ steps before finally "stopping"

This formula captures a simple generative process: to get a span of length $\ell$, we need $\ell - 1$ consecutive "continue" decisions (each with probability $1 - p$) followed by one "stop" decision (with probability $p$).

The geometric distribution has a convenient property: its mean equals $1/p$. This gives us a direct way to control the average span length. If we want spans to average $\mu$ tokens, we simply set:

$$p = \frac{1}{\mu}$$

For T5's choice of $\mu = 3$, this gives $p = 1/3$. At each step within a span, there's a 1-in-3 chance of stopping. The resulting distribution strongly favors shorter spans while maintaining a "heavy tail" of longer ones:

  • $P(\ell = 1) = 0.33$ (one-third of spans are single tokens)
  • $P(\ell = 2) = 0.22$ (a bit over one-fifth are two tokens)
  • $P(\ell = 3) = 0.15$ (about one in seven are three tokens)
  • $P(\ell = 4) = 0.10$ (one-tenth are four tokens)
  • $P(\ell \geq 5) = 0.20$ (about one-fifth contain five or more tokens)

This distribution creates a rich training curriculum. The model frequently encounters single-token predictions, maintaining its vocabulary knowledge. It regularly sees two and three-token spans, learning common phrases and local syntax. And it occasionally faces longer spans requiring genuine compositional reasoning.
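
Before sampling anything, we can plug $p = 1/3$ straight into the formula to verify these probabilities (a quick check in plain Python):

p = 1 / 3
pmf = {length: (1 - p) ** (length - 1) * p for length in range(1, 5)}
tail = (1 - p) ** 4  # P(length >= 5): four consecutive "continue" steps
print({k: round(v, 2) for k, v in pmf.items()}, round(tail, 2))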

Let's visualize this distribution by sampling 10,000 span lengths and plotting the empirical frequencies. This will confirm our theoretical calculations and reveal the characteristic exponential decay.

In[2]:
Code
import matplotlib.pyplot as plt  # noqa: F401
import numpy as np

## Set random seed for reproducibility
np.random.seed(42)

## Geometric distribution parameters
mean_span_length = 3
p = 1 / mean_span_length

## Sample 10000 span lengths
span_lengths = np.random.geometric(p, size=10000)

## Calculate distribution statistics
unique, counts = np.unique(span_lengths, return_counts=True)
probabilities = counts / len(span_lengths)
Out[3]:
Visualization
Bar chart showing span length probabilities decreasing exponentially from length 1 to 15.
Span length distribution following a geometric distribution with mean 3. Shorter spans are more common, but the heavy tail ensures the model encounters longer multi-token spans during training.

The empirical distribution matches our theoretical expectations. Notice the characteristic exponential decay: each bar is roughly two-thirds the height of the previous one (since $1 - p = 2/3$). The red dashed line marks the mean at 3, which falls to the right of the mode (1), reflecting the distribution's right skew.

Why does this shape work well for pretraining? The geometric distribution provides a natural curriculum:

  1. High-frequency short spans maintain the model's ability to predict individual tokens accurately, preserving fine-grained vocabulary knowledge
  2. Medium-length spans teach local coherence and common phrases, the bread-and-butter of fluent text generation
  3. The heavy tail occasionally challenges the model with longer reconstructions, forcing it to reason about syntax, entities, and discourse structure

This is more effective than a uniform distribution, which would waste equal capacity on trivially short spans and overwhelmingly long ones. The geometric decay concentrates learning where it's most useful while still providing exposure to diverse span lengths.

Another way to understand the distribution is through the cumulative perspective: what fraction of spans have length at most $\ell$? This cumulative distribution function (CDF) answers practical questions like "what percentage of spans are single tokens?" or "how often do we see spans longer than 5 tokens?"

Out[4]:
Visualization
Step plot showing cumulative probability increasing from 0.33 at length 1 toward 1.0 at higher lengths.
Cumulative distribution of span lengths showing what fraction of spans fall at or below each length. With mean 3, about 55% of spans are length 1 or 2, while only 20% exceed length 4.

The CDF reveals that roughly half of all spans have length 2 or less, and about 80% have length 4 or less. Only the remaining 20% challenge the model with longer reconstructions. This steep rise followed by a gradual tail perfectly balances frequent short-span practice with occasional longer challenges.
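
Both figures follow directly from the closed-form CDF of the geometric distribution, $P(\ell \le k) = 1 - (1 - p)^k$, which takes one line to check:

p = 1 / 3
for k in [1, 2, 4]:
    print(f"P(length <= {k}) = {1 - (1 - p) ** k:.2f}")  # 0.33, 0.56, 0.80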

Sentinel Tokens

Sentinel tokens serve as placeholders for corrupted spans. Unlike BERT's single [MASK] token, span corruption uses multiple distinct sentinels, typically denoted <extra_id_0>, <extra_id_1>, and so on.

Why use different sentinels for each span? Consider reconstructing two spans from the corrupted input. If both used the same [MASK] token, the model couldn't distinguish which output corresponds to which span. Distinct sentinels create a clear mapping between input positions and output targets.

Sentinel Tokens

Special placeholder tokens that replace corrupted spans in the input. Each span receives a unique sentinel (e.g., <extra_id_0>, <extra_id_1>), enabling the model to map reconstructed spans back to their original positions.

The target sequence concatenates all corrupted spans, each preceded by its corresponding sentinel:

Original: "The quick brown fox jumps over the lazy dog"

Corrupted Input: "The <extra_id_0> fox <extra_id_1> lazy dog"

Target: "<extra_id_0> quick brown <extra_id_1> jumps over the"

This format enables autoregressive generation of the target. The model sees <extra_id_0> and generates "quick brown" before the next sentinel signals a new span.

T5's vocabulary reserves 100 sentinel tokens (<extra_id_0> through <extra_id_99>). This typically suffices since even long sequences rarely contain more than 50-60 spans with a 15% corruption rate and average span length of 3.
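
Because each sentinel appears exactly once in the input and once in the target, the pair losslessly encodes the original text. A short sanity check makes this concrete, using hand-built token lists for the example above (the general algorithm is implemented in the next section):

## The corrupted input and target together recover the original sentence.
original = "The quick brown fox jumps over the lazy dog".split()
corrupted = ["The", "<extra_id_0>", "fox", "<extra_id_1>", "lazy", "dog"]
target = ["<extra_id_0>", "quick", "brown",
          "<extra_id_1>", "jumps", "over", "the"]


def reconstruct(corrupted_input, target_tokens, sentinel_prefix="<extra_id_"):
    """Stitch the target spans back into the corrupted input."""
    spans, current = {}, None
    for tok in target_tokens:
        if tok.startswith(sentinel_prefix):
            current = tok
            spans[current] = []
        else:
            spans[current].append(tok)
    restored = []
    for tok in corrupted_input:
        if tok.startswith(sentinel_prefix):
            restored.extend(spans[tok])
        else:
            restored.append(tok)
    return restored


assert reconstruct(corrupted, target) == original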

Implementing Span Corruption

Now that we understand the mathematics behind span selection, let's translate these concepts into working code. We'll build the algorithm incrementally, starting from the core span selection logic and gradually assembling the complete corruption pipeline. By the end, you'll have a clear picture of how each formula we derived connects to the implementation.

Selecting Span Boundaries

Our first task is to decide which token positions fall within corrupted spans. We need an algorithm that:

  1. Determines how many spans to create (using our formula: $\approx r \cdot n / \mu$)
  2. Samples each span's length from a geometric distribution
  3. Places spans at random positions without overlap

We'll represent the result as a corruption mask: a boolean array where True indicates that token should be part of some corrupted span.

In[5]:
Code
def select_corruption_mask(
    seq_length, corruption_rate=0.15, mean_span_length=3
):
    """
    Create a boolean mask indicating which positions to corrupt.
    Spans are selected to achieve approximately the target corruption rate.
    """
    # Calculate expected number of spans
    num_tokens_to_corrupt = int(seq_length * corruption_rate)
    num_spans = max(1, num_tokens_to_corrupt // mean_span_length)

    # Sample span lengths from geometric distribution
    p = 1 / mean_span_length
    span_lengths = np.random.geometric(p, size=num_spans)

    # Clip to ensure we don't exceed the corruption budget
    total_corrupted = span_lengths.sum()
    if total_corrupted > num_tokens_to_corrupt:
        # Scale down span lengths proportionally
        scale = num_tokens_to_corrupt / total_corrupted
        span_lengths = np.maximum(1, (span_lengths * scale).astype(int))

    # Randomly place spans without overlap
    mask = np.zeros(seq_length, dtype=bool)
    available_positions = list(range(seq_length))

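    # Note: this greedy placement is a simplification. A new span drawn next
    # to an already-corrupted region can end up adjacent to it, so runs may
    # merge and the realized span count and lengths only approximate the
    # sampled values.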
    for length in span_lengths:
        if len(available_positions) < length:
            break
        # Choose a starting position
        valid_starts = [i for i in range(len(available_positions) - length + 1)]
        if not valid_starts:
            break
        start_idx = np.random.choice(valid_starts)

        # Mark positions as corrupted
        for offset in range(length):
            pos = available_positions[start_idx]
            mask[pos] = True
            available_positions.remove(pos)

    return mask
In[6]:
Code
## Test the span selection
np.random.seed(42)
test_length = 50
mask = select_corruption_mask(
    test_length, corruption_rate=0.15, mean_span_length=3
)
Out[7]:
Console
Sequence length: 50
Tokens corrupted: 6 (12.0%)

Corruption mask (1 = corrupted):
░░░░░░░█████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░█░░░░░░░

The visualization shows corrupted positions as filled blocks (█) and uncorrupted positions as empty blocks (░). Notice how the corrupted tokens cluster into contiguous groups rather than scattering randomly across the sequence. This is the defining characteristic of span corruption: tokens within each span will be replaced by a single sentinel and reconstructed together.

With roughly the target fraction of tokens corrupted (12% here, close to the 15% goal), we've created distinct "holes" in the sequence that the model must learn to fill. The spans vary in length according to our geometric distribution, some containing just one or two tokens, others stretching across several positions.

Identifying Span Boundaries

The corruption mask tells us which positions are corrupted, but for the next step we need to know where each span begins and ends. This boundary information lets us replace each contiguous span with exactly one sentinel token and extract the corresponding target tokens.

In[8]:
Code
def get_span_boundaries(mask):
    """
    Extract start and end indices for each contiguous span in the mask.
    Returns list of (start, end) tuples where end is exclusive.
    """
    spans = []
    in_span = False
    start = 0

    for i, is_corrupted in enumerate(mask):
        if is_corrupted and not in_span:
            # Starting a new span
            start = i
            in_span = True
        elif not is_corrupted and in_span:
            # Ending current span
            spans.append((start, i))
            in_span = False

    # Handle span at end of sequence
    if in_span:
        spans.append((start, len(mask)))

    return spans
In[9]:
Code
## Find spans in our test mask
spans = get_span_boundaries(mask)
Out[10]:
Console
Found 2 spans:
  Span 0: positions 7-11 (length 5)
  Span 1: positions 42-42 (length 1)

The algorithm finds each contiguous run of True values in the mask. Each span gets a unique index (0, 1, 2, ...) that will serve as its identifier when we assign sentinel tokens. Notice how span lengths vary: here one span contains a single token while the other stretches across five positions, reflecting the spread of the geometric distribution.

Building Input and Target Sequences

With span boundaries identified, we can now construct the two sequences that define the training example:

  1. Corrupted input: The original sequence with each span replaced by its sentinel token (e.g., <extra_id_0>, <extra_id_1>)
  2. Target sequence: A concatenation of all corrupted spans, each preceded by its corresponding sentinel

This format creates a clear mapping: the model sees a sentinel in the input and learns to generate the corresponding tokens in the target.

In[11]:
Code
def corrupt_sequence(tokens, mask, sentinel_prefix="<extra_id_"):
    """
    Apply span corruption to a token sequence.

    Args:
        tokens: List of tokens
        mask: Boolean array indicating corrupted positions
        sentinel_prefix: Prefix for sentinel tokens

    Returns:
        corrupted_input: Input sequence with sentinels replacing spans
        target: Target sequence for reconstruction
    """
    spans = get_span_boundaries(mask)

    # Build corrupted input
    corrupted_input = []
    last_end = 0

    for span_idx, (start, end) in enumerate(spans):
        # Add uncorrupted tokens before this span
        corrupted_input.extend(tokens[last_end:start])
        # Add sentinel for this span
        corrupted_input.append(f"{sentinel_prefix}{span_idx}>")
        last_end = end

    # Add remaining uncorrupted tokens
    corrupted_input.extend(tokens[last_end:])

    # Build target sequence
    target = []
    for span_idx, (start, end) in enumerate(spans):
        # Add sentinel
        target.append(f"{sentinel_prefix}{span_idx}>")
        # Add original span tokens
        target.extend(tokens[start:end])

    return corrupted_input, target

Let's see this in action with a real sentence:

In[12]:
Code
## Example sentence
sentence = "The quick brown fox jumps over the lazy dog near the riverbank"
tokens = sentence.split()

## Create corruption mask for this specific example
np.random.seed(123)
mask = select_corruption_mask(
    len(tokens), corruption_rate=0.25, mean_span_length=2
)

## Apply corruption
corrupted_input, target = corrupt_sequence(tokens, mask)
Out[13]:
Console
Original tokens:
['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', 'near', 'the', 'riverbank']

Corruption mask: ░░██░░░░░░░░

Corrupted input (11 tokens):
['The', 'quick', '<extra_id_0>', 'jumps', 'over', 'the', 'lazy', 'dog', 'near', 'the', 'riverbank']

Target sequence (3 tokens):
['<extra_id_0>', 'brown', 'fox']

Examine the output carefully. The corrupted input is shorter than the original: the multi-token span has collapsed into a single sentinel token. The target sequence contains exactly the tokens that were removed, prefixed by the corresponding sentinel to maintain the mapping.

This structure shows how span corruption works in practice. The model must:

  1. Understand context: Use the uncorrupted tokens surrounding each sentinel to infer what's missing
  2. Generate coherently: Produce complete phrases, not just isolated words, for each sentinel
  3. Maintain boundaries: Know when one span ends and the next begins, guided by the sentinel markers

Complete Span Corruption Pipeline

With all the pieces in place, let's combine them into a reusable class that encapsulates the full span corruption logic:

In[14]:
Code
class SpanCorruptor:
    """Applies T5-style span corruption to text sequences."""

    def __init__(
        self, corruption_rate=0.15, mean_span_length=3, num_sentinels=100
    ):
        self.corruption_rate = corruption_rate
        self.mean_span_length = mean_span_length
        self.num_sentinels = num_sentinels
        self.sentinels = [f"<extra_id_{i}>" for i in range(num_sentinels)]

    def corrupt(self, tokens):
        """
        Corrupt a token sequence using span corruption.

        Args:
            tokens: List of string tokens

        Returns:
            dict with 'input' and 'target' token lists
        """
        if len(tokens) < 2:
            return {"input": tokens, "target": []}

        # Select spans to corrupt
        mask = select_corruption_mask(
            len(tokens), self.corruption_rate, self.mean_span_length
        )

        # Get span boundaries
        spans = get_span_boundaries(mask)

        if len(spans) > self.num_sentinels:
            # More spans than sentinels: keep only the first num_sentinels
            spans = spans[: self.num_sentinels]

        # Build sequences
        corrupted_input = []
        target = []
        last_end = 0

        for span_idx, (start, end) in enumerate(spans):
            corrupted_input.extend(tokens[last_end:start])
            corrupted_input.append(self.sentinels[span_idx])
            target.append(self.sentinels[span_idx])
            target.extend(tokens[start:end])
            last_end = end

        corrupted_input.extend(tokens[last_end:])

        return {"input": corrupted_input, "target": target}
In[15]:
Code
## Test the complete pipeline
corruptor = SpanCorruptor(corruption_rate=0.15, mean_span_length=3)

np.random.seed(42)
text = """Natural language processing enables computers to understand 
and generate human language in meaningful ways"""
tokens = text.split()

result = corruptor.corrupt(tokens)
Out[16]:
Console
Original (14 tokens):
Natural language processing enables computers to understand and generate human language in meaningful ways

Corrupted input (13 tokens):
Natural language processing enables computers to understand and generate human language in <extra_id_0>

Target (3 tokens):
<extra_id_0> meaningful ways

The results confirm our earlier calculations. The original 14-token sequence becomes a 13-token corrupted input (the two corrupted tokens collapse into one sentinel) and a compact target containing only the corrupted material plus its sentinel marker.

This completes our span corruption implementation. Starting from the mathematical foundations, we've built each component:

  1. Span count estimation: $(r \cdot n) / \mu$ spans needed for the target corruption rate
  2. Length sampling: Geometric distribution with $p = 1/\mu$ for natural length variation
  3. Boundary detection: Linear scan to identify contiguous corrupted regions
  4. Sequence construction: Parallel building of input (with sentinels) and target (spans preceded by sentinels)

The algorithm is efficient, running in linear time relative to sequence length, and produces the exact format expected by T5-style encoder-decoder training.

T5-Style Training

T5 uses span corruption within an encoder-decoder framework. The corrupted input feeds into the encoder, while the decoder autoregressively generates the target sequence. This setup naturally handles variable-length span reconstruction.

Encoder-Decoder Architecture

The encoder processes the corrupted input sequence, producing hidden representations that capture the context around each sentinel:

$$H_{\text{enc}} = \text{Encoder}(x_1, \ldots, \text{<extra\_id\_0>}, \ldots, x_n)$$

where:

  • $H_{\text{enc}}$: the encoder's output hidden states, a matrix of shape (sequence length $\times$ hidden dimension)
  • $x_1, \ldots, x_n$: the corrupted input tokens, including uncorrupted tokens and sentinel placeholders
  • $\text{<extra\_id\_0>}$: a sentinel token marking where a span was removed

The decoder then generates the target sequence autoregressively, conditioned on the encoder's hidden states:

$$P(y_t \mid y_{<t}, H_{\text{enc}}) = \text{Decoder}(y_{<t}, H_{\text{enc}})$$

where:

  • $y_t$: the token being predicted at position $t$ in the target sequence
  • $y_{<t}$: all previously generated target tokens $(y_1, y_2, \ldots, y_{t-1})$
  • $P(y_t \mid y_{<t}, H_{\text{enc}})$: the probability distribution over the vocabulary for the next token

During training, we use teacher forcing: the decoder receives the true previous tokens rather than its own predictions. The training objective is standard cross-entropy loss over the target sequence:

$$\mathcal{L} = -\sum_{t=1}^{T} \log P(y_t \mid y_{<t}, H_{\text{enc}})$$

where:

  • $\mathcal{L}$: the total loss for this training example
  • $T$: the length of the target sequence
  • $\log P(y_t \mid \cdot)$: the log probability assigned to the correct token at each position

This loss encourages the model to assign high probability to the actual tokens that were corrupted, learning to reconstruct spans from context.
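
To make the loss concrete, here is a toy calculation for the target from our earlier example. The per-token probabilities are invented purely for illustration; a real model produces them from its softmax output under teacher forcing.

## Toy illustration of the span-corruption training loss.
target_tokens = ["<extra_id_0>", "quick", "brown",
                 "<extra_id_1>", "jumps", "over", "the"]
token_probs = [0.95, 0.40, 0.55, 0.90, 0.30, 0.60, 0.70]  # invented values

loss = -sum(np.log(p) for p in token_probs)
print(f"Total loss: {loss:.3f}, per-token: {loss / len(token_probs):.3f}")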

Why Encoder-Decoder?

Span corruption works well with encoder-decoder models because:

  • Bidirectional encoding: The encoder sees all uncorrupted context simultaneously, enabling rich representations of what surrounds each sentinel.
  • Autoregressive decoding: The decoder generates spans token by token, learning proper phrase structure and coherence.
  • Natural length handling: Spans of different lengths produce targets of different lengths, which encoder-decoder models handle gracefully.

Decoder-only models can also use span corruption, treating the task as infilling. The corrupted input and target concatenate with appropriate separators, and the model learns to continue after each sentinel.
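
As a rough sketch (the separator token and the convention of computing loss only on the reconstruction are illustrative assumptions, not a specific model's recipe), the example from the previous section could be flattened like this:

## Flatten the earlier training example for a decoder-only (infilling) setup.
## "<sep>" is an assumed separator token, not part of T5's vocabulary.
flattened = result["input"] + ["<sep>"] + result["target"]
## A causal LM would train on this single sequence, typically computing the
## loss only on the tokens after "<sep>" (the reconstruction).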

Comparison to MLM

Let's compare the training signals from span corruption versus token-level MLM:

In[17]:
Code
def mlm_mask(tokens, mask_rate=0.15, mask_token="[MASK]"):
    """Apply standard MLM masking (token-level, not spans)."""
    mask = np.random.random(len(tokens)) < mask_rate
    masked_tokens = [mask_token if m else t for t, m in zip(tokens, mask)]
    targets = [(i, t) for i, (t, m) in enumerate(zip(tokens, mask)) if m]
    return masked_tokens, targets


def compare_corruption_methods(text, seed=42):
    """Compare span corruption vs MLM on the same text."""
    tokens = text.split()

    # MLM
    np.random.seed(seed)
    mlm_input, mlm_targets = mlm_mask(tokens)

    # Span corruption
    np.random.seed(seed)
    span_result = SpanCorruptor().corrupt(tokens)

    return {
        "original": tokens,
        "mlm_input": mlm_input,
        "mlm_targets": mlm_targets,
        "span_input": span_result["input"],
        "span_target": span_result["target"],
    }
In[18]:
Code
sample_text = """The transformer architecture revolutionized natural language 
processing by enabling parallel computation and capturing long range dependencies 
through self attention mechanisms"""

comparison = compare_corruption_methods(sample_text)
Out[19]:
Console
=== Original ===
The transformer architecture revolutionized natural language processing by enabling parallel computation and capturing long range dependencies through self attention mechanisms

=== MLM (2 masks) ===
The transformer architecture revolutionized natural language [MASK] by enabling parallel [MASK] and capturing long range dependencies through self attention mechanisms
Targets: [(6, 'processing'), (10, 'computation')]

=== Span Corruption (19 input tokens) ===
The transformer architecture revolutionized natural language processing by enabling parallel computation and capturing long <extra_id_0> through self attention mechanisms
Target: <extra_id_0> range dependencies

With MLM, each mask corresponds to exactly one token. With span corruption, each sentinel may require generating multiple tokens, forcing the model to learn phrase-level coherence.

The difference becomes clearer when we visualize the corruption patterns side by side. Let's create multiple samples and compare how MLM scatters masks across the sequence while span corruption creates contiguous blocks.

In[20]:
Code
def visualize_corruption_patterns(seq_length=60, num_samples=8, seed=42):
    """Generate corruption patterns for visualization."""
    np.random.seed(seed)

    mlm_patterns = []
    span_patterns = []

    for i in range(num_samples):
        # MLM: independent token masking
        mlm_mask = np.random.random(seq_length) < 0.15
        mlm_patterns.append(mlm_mask)

        # Span corruption
        span_mask = select_corruption_mask(
            seq_length, corruption_rate=0.15, mean_span_length=3
        )
        span_patterns.append(span_mask)

    return np.array(mlm_patterns), np.array(span_patterns)


mlm_patterns, span_patterns = visualize_corruption_patterns()
Out[21]:
Visualization
Heatmap showing scattered corruption patterns for MLM across 8 samples.
MLM produces scattered, isolated masks where each dark cell is an independent prediction task.
Heatmap showing contiguous block corruption patterns for span corruption across 8 samples.
Span corruption creates contiguous blocks of corruption that must be reconstructed together.

The visual contrast is clear. MLM produces a scattered, salt-and-pepper pattern where each dark cell is an isolated prediction task. Span corruption creates horizontal streaks, each representing a multi-token reconstruction challenge. This structural difference explains why span-corrupted models develop stronger phrase-level generation capabilities.

Span Corruption Variants

Researchers have explored several variations of the basic span corruption approach.

Uniform vs. Geometric Span Lengths

While T5 uses a geometric distribution, some work explores uniform distributions. With uniform sampling between 1 and $k$, every span length up to $k$ is equally likely:

In[22]:
Code
def sample_span_lengths_uniform(num_spans, min_length=1, max_length=5):
    """Sample span lengths from uniform distribution."""
    return np.random.randint(min_length, max_length + 1, size=num_spans)


def sample_span_lengths_geometric(num_spans, mean_length=3):
    """Sample span lengths from geometric distribution."""
    p = 1 / mean_length
    return np.random.geometric(p, size=num_spans)
In[23]:
Code
np.random.seed(42)
n_samples = 5000

uniform_spans = sample_span_lengths_uniform(n_samples, max_length=5)
geometric_spans = sample_span_lengths_geometric(n_samples, mean_length=3)
Out[24]:
Visualization
Bar chart showing uniform distribution of span lengths from 1 to 5.
Uniform distribution gives equal probability to all lengths in the range 1-5.
Bar chart showing geometric distribution of span lengths with exponential decay.
Geometric distribution favors shorter spans with a heavy tail allowing occasional long ones.

Geometric distributions produce more short spans but allow occasional long ones. Uniform distributions guarantee exposure to longer spans but may waste capacity on trivially short reconstructions.

We can also explore how different mean span lengths affect the distribution shape. Higher means produce flatter distributions with more emphasis on longer spans.

Out[25]:
Visualization
Three overlapping bar charts showing geometric distributions with means 2, 3, and 5, with lower means having steeper decay.
Effect of mean span length on the geometric distribution. Lower means (mean=2) concentrate probability on single tokens, while higher means (mean=5) spread probability more evenly across longer spans.

With $\mu = 2$, nearly half of all spans are single tokens, providing mostly token-level signal. With $\mu = 5$, the distribution flattens significantly, exposing the model to longer spans more frequently but at the cost of fewer distinct spans per sequence. T5's choice of $\mu = 3$ balances these extremes.

Corruption Rate Variations

The original T5 paper tested corruption rates from 10% to 50%. Higher rates create shorter inputs but longer targets:

In[26]:
Code
def analyze_corruption_rate(
    seq_length, corruption_rate, mean_span=3, num_trials=1000
):
    """Analyze the effect of corruption rate on sequence lengths."""
    input_lengths = []
    target_lengths = []

    for _ in range(num_trials):
        mask = select_corruption_mask(seq_length, corruption_rate, mean_span)
        spans = get_span_boundaries(mask)

        # Input length = original - corrupted + num_sentinels
        corrupted_count = mask.sum()
        input_len = seq_length - corrupted_count + len(spans)
        input_lengths.append(input_len)

        # Target length = corrupted + num_sentinels
        target_len = corrupted_count + len(spans)
        target_lengths.append(target_len)

    return np.mean(input_lengths), np.mean(target_lengths)
In[27]:
Code
seq_length = 512
corruption_rates = [0.1, 0.15, 0.2, 0.25, 0.3, 0.4, 0.5]

results = []
for rate in corruption_rates:
    input_len, target_len = analyze_corruption_rate(seq_length, rate)
    results.append(
        {
            "rate": rate,
            "input_length": input_len,
            "target_length": target_len,
            "total": input_len + target_len,
        }
    )
Out[28]:
Visualization
Line plot showing input length decreasing and target length increasing as corruption rate rises from 10% to 50%.
Effect of corruption rate on input and target sequence lengths. Higher corruption rates produce shorter inputs but longer targets, with total sequence length increasing.

At 15% corruption, the input shrinks modestly while the target remains manageable. Higher rates shift more content to the target, which can slow training due to longer autoregressive generation.

Prefix LM Variant

Some models combine span corruption with prefix language modeling. The uncorrupted prefix receives bidirectional attention, while corrupted spans use causal attention:

Input: "The quick brown" + <extra_id_0> + "jumps over" + <extra_id_1>

Attention:

  • "The quick brown": Bidirectional (full visibility)
  • Sentinels and after: Causal (left-to-right only)

This hybrid approach lets the model leverage bidirectional context for understanding while maintaining generative capability.
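
A minimal sketch of such a hybrid attention mask, assuming the first prefix_len positions form the bidirectional prefix:

def prefix_lm_attention_mask(seq_len, prefix_len):
    """
    Boolean mask for a prefix-LM-style hybrid: entry [i, j] is True if
    position i may attend to position j. Every position can see the
    prefix; beyond the prefix, attention is causal (left-to-right).
    """
    causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    prefix_visible = np.zeros((seq_len, seq_len), dtype=bool)
    prefix_visible[:, :prefix_len] = True
    return causal | prefix_visible


## Example: 8 positions, with the first 3 forming the bidirectional prefix
attn_mask = prefix_lm_attention_mask(8, 3)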

Computational Benefits

Span corruption offers surprising computational advantages over token-level masking.

Shorter Sequences

Because multiple tokens collapse into single sentinels, the encoder processes fewer tokens. For a 512-token sequence with 15% corruption and mean span length 3:

  • Original tokens corrupted: ~77 tokens
  • Sentinels added: ~26 tokens
  • Net input length: ~461 tokens (10% reduction)

This reduction compounds across the quadratic attention mechanism. Attention cost scales as $O(n^2)$, where $n$ is the sequence length. A 10% length reduction (from $n$ to $0.9n$) yields roughly 19% savings in attention computation, since $(0.9)^2 = 0.81$.
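
The savings are easy to verify with the lengths from our running example:

full_len, corrupted_len = 512, 461
ratio = corrupted_len**2 / full_len**2  # ~0.81, i.e. roughly 19% savings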

Shorter Targets

The target sequence contains only corrupted spans plus sentinels. With 15% corruption:

  • Target length: ~77 (spans) + ~26 (sentinels) = ~103 tokens

The decoder processes 103 tokens instead of 512, dramatically reducing generation cost during training.

Training Efficiency Comparison

Let's quantify the computational savings:

In[29]:
Code
def compute_training_cost(seq_length, corruption_rate, mean_span):
    """
    Estimate relative computational cost for different pretraining approaches.
    Uses simplified model where cost ~ sequence_length^2 for attention.
    """
    # Calculate expected lengths
    num_corrupted = int(seq_length * corruption_rate)
    num_spans = max(1, num_corrupted // mean_span)

    # Span corruption
    span_input_len = seq_length - num_corrupted + num_spans
    span_target_len = num_corrupted + num_spans
    span_cost = span_input_len**2 + span_target_len**2  # Encoder + Decoder

    # Standard MLM (encoder only, full sequence)
    mlm_cost = seq_length**2

    # Causal LM (decoder only, full sequence)
    clm_cost = (
        seq_length**2
    )  # Simplified; actual is O(n^2/2) due to causal mask

    return {
        "span_corruption": span_cost,
        "mlm": mlm_cost,
        "causal_lm": clm_cost,
        "span_input_len": span_input_len,
        "span_target_len": span_target_len,
    }
In[30]:
Code
costs = compute_training_cost(512, 0.15, 3)
Out[31]:
Console
Sequence length: 512
Corruption rate: 15%, Mean span: 3

Span corruption:
  Input length:  461
  Target length: 101
  Relative cost: 84.96% of MLM

MLM/Causal LM cost: 100% (baseline)

Let's visualize these computational trade-offs across different corruption rates to understand when span corruption provides the greatest efficiency gains.

In[32]:
Code
## Compute costs across corruption rates
cost_comparison = []
for rate in [0.1, 0.15, 0.2, 0.25, 0.3]:
    costs = compute_training_cost(512, rate, 3)
    cost_comparison.append(
        {
            "rate": rate,
            "relative_cost": costs["span_corruption"] / costs["mlm"] * 100,
            "input_len": costs["span_input_len"],
            "target_len": costs["span_target_len"],
        }
    )
Out[33]:
Visualization
Bar chart showing relative computational cost decreasing from roughly 89% of the MLM baseline at 10% corruption to about 80% at 30% corruption.
Relative computational cost of span corruption compared to MLM baseline (100%) at different corruption rates. Lower values indicate greater efficiency. The encoder processes a shortened input while the decoder handles only the target spans.

Span corruption achieves meaningful computational savings while still exposing the model to diverse reconstruction challenges. At the standard 15% corruption rate, we save roughly 15% of computation compared to MLM. The encoder sees nearly the full context (minus corrupted tokens), while the decoder focuses only on what needs reconstruction.

Limitations and Practical Considerations

Span corruption offers compelling benefits, but it comes with trade-offs that affect model behavior and downstream applications.

The most significant limitation is the mismatch between pretraining and generation tasks. During pretraining, the model learns to infill missing spans given surrounding context. During generation, the model must produce text autoregressively without such scaffolding. This gap means span-corrupted models may struggle with open-ended generation compared to models pretrained with causal language modeling. T5 addresses this partially by framing all tasks as text-to-text, but the underlying tension remains. Fine-tuning helps bridge the gap, yet models pretrained exclusively on span corruption often require more adaptation for generation-heavy applications.

Another consideration is span boundary artifacts. The model learns that sentinels mark span boundaries, potentially creating implicit assumptions about phrase structure. If spans happen to align with linguistic units (noun phrases, clauses), the model may learn useful structure. If spans cut through units arbitrarily, the model must learn to handle artificial boundaries. The random nature of span selection means both cases occur, which may introduce noise into the learned representations.

Key practical limitations include:

  • Sentinel token overhead: Each span requires a unique sentinel, consuming vocabulary space and adding tokens to both input and target sequences.
  • Span length sensitivity: Very short spans (length 1) resemble MLM without the multi-prediction benefit. Very long spans remove too much context for reliable reconstruction.
  • Reconstruction ambiguity: When spans contain common phrases, multiple valid reconstructions may exist, yet training penalizes all but the original.

Despite these limitations, span corruption works well in practice. T5 and its variants achieve strong performance across diverse benchmarks, suggesting that the benefits of efficient training and phrase-level learning outweigh the costs.

Key Parameters

When implementing span corruption, the following parameters have the greatest impact on training behavior and model performance:

  • corruption_rate (float, default: 0.15): Fraction of tokens to include in corrupted spans. Higher rates (0.3-0.5) create more challenging reconstruction tasks but produce longer targets that slow training. Lower rates (0.1) may underutilize the model's capacity. The T5 default of 0.15 balances learning signal with computational efficiency.

  • mean_span_length (int, default: 3): Controls the average number of tokens per corrupted span via the geometric distribution parameter $p = 1/\mu$. Shorter spans (mean 1-2) behave like token-level masking. Longer spans (mean 5+) force phrase-level generation but reduce the number of distinct spans. A mean of 3 provides good coverage of both short and medium-length phrases.

  • num_sentinels (int, default: 100): Maximum number of unique sentinel tokens available. This caps the number of spans per sequence. With a 15% corruption rate and mean span length of 3, a 512-token sequence produces ~26 spans, well within the default limit. Increase only for very long sequences or high corruption rates.

  • span_length_distribution (str, options: "geometric", "uniform"): Geometric distributions favor shorter spans with occasional long ones, matching T5's approach. Uniform distributions guarantee equal exposure to all span lengths within a range but may waste capacity on trivially short reconstructions.
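
The first three parameters map directly onto the SpanCorruptor class built earlier; supporting the fourth would mean swapping the length sampler inside select_corruption_mask, which we leave as an exercise. A quick way to see how the realized corruption rate tracks the configured one (a sketch using the class above; the printed fractions depend on the random seed):

## Empirical corruption rate for two SpanCorruptor configurations.
np.random.seed(0)
dummy_tokens = ["tok"] * 512
for rate, span in [(0.15, 3), (0.30, 5)]:
    corruptor = SpanCorruptor(corruption_rate=rate, mean_span_length=span)
    out = corruptor.corrupt(dummy_tokens)
    corrupted = sum(1 for t in out["target"] if not t.startswith("<extra_id_"))
    print(f"rate={rate}, mean_span={span}: realized {corrupted / 512:.1%}")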

Summary

Span corruption extends masked language modeling by corrupting contiguous token spans rather than individual tokens. Key takeaways:

  • Span selection: Geometric distributions with mean 3 balance short and long spans, exposing the model to both token-level and phrase-level reconstruction.
  • Sentinel tokens: Unique sentinels (<extra_id_0>, <extra_id_1>, ...) replace each span, creating a clear mapping between input positions and target outputs.
  • Sequence-to-sequence framing: The corrupted input becomes the encoder source; concatenated spans with sentinels become the decoder target.
  • Computational efficiency: Shorter input and target sequences reduce attention costs, making training more efficient than full-sequence objectives.
  • Trade-offs: The infilling objective may not transfer perfectly to open-ended generation, requiring careful fine-tuning for generation tasks.

Span corruption powers T5 and influenced subsequent models such as UL2 and PaLM 2, which explore mixtures of denoising objectives. Understanding this technique provides insight into how pretraining objectives shape model capabilities and efficiency.
