
T5 Pre-training
T5, which stands for "Text-to-Text Transfer Transformer," introduced a unified approach to NLP that treats every task as converting input text to output text. While this text-to-text framework simplifies how we frame tasks, the model's remarkable capabilities stem largely from its innovative pre-training procedure. At the heart of T5's pre-training lies span corruption, a denoising objective that teaches the model to reconstruct corrupted portions of text.
Unlike BERT's approach of masking individual tokens, T5 corrupts contiguous spans of text and replaces them with single sentinel tokens. The model must then generate the missing content, producing the original spans in sequence. This design choice proves surprisingly effective: by predicting multiple consecutive tokens at once, the model learns richer contextual representations than single-token prediction allows.
The researchers at Google who developed T5 conducted an exhaustive empirical study, systematically testing dozens of architectural choices, pre-training objectives, and training configurations. The resulting paper, "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer," became a landmark reference for understanding what actually matters in training large language models. In this chapter, we'll examine the pre-training procedure that emerged from this exploration.
Span Corruption: The Core Objective
Span corruption transforms input text into a denoising task, fundamentally reshaping how the model learns to understand language. The core idea is straightforward: rather than asking a model to fill in individual blanks scattered throughout a sentence, we ask it to reconstruct entire missing phrases. This seemingly simple change has significant implications for what the model must learn.
Consider the difference between predicting a single missing word versus predicting a missing phrase. When filling in a single blank, a model can often succeed by identifying the grammatical category needed (noun, verb, adjective) and selecting a plausible word from that category. However, when an entire phrase is missing, the model must generate multiple words that not only fit the surrounding context but also cohere with each other. The phrase "large amounts" requires understanding that these two words naturally pair together, that they describe a quantity, and that they grammatically connect "require" to "of training data." This multi-token challenge forces the model to develop richer internal representations of language structure.
The procedure works in three distinct stages: selecting spans to corrupt, replacing them with sentinel tokens, and training the model to reconstruct the original spans. Each stage involves design decisions that emerged from extensive empirical testing.
How Span Selection Works
The algorithm begins by deciding which tokens to corrupt, a process that must balance several competing concerns. We want enough corruption to provide meaningful training signal. If we corrupt too little, each training example teaches the model relatively little. However, we also want the task to remain solvable. If we corrupt too much, the model cannot reasonably infer what was removed. T5 uses a corruption rate of 15%, meaning roughly 15% of tokens in each sequence will be masked. This particular rate emerged from experiments as a good balance that provides strong learning signal while keeping the reconstruction task feasible.
However, rather than selecting tokens independently like BERT does, T5 groups consecutive tokens into spans. This grouping decision is not merely aesthetic. It fundamentally changes the nature of what the model learns. When BERT masks the word "fox" in "The quick brown fox jumps," the model learns that an animal noun fits this context. When T5 masks "brown fox" as a single span, the model must learn not only that an animal noun fits here, but also that adjective-noun pairs describing animals have particular patterns, that "brown" and "fox" co-occur naturally, and that the resulting phrase must connect smoothly with both "quick" before it and "jumps" after it.
The span lengths follow a geometric distribution with a mean of 3 tokens. This means short spans are most common, but occasionally longer spans appear. Why a geometric distribution? This distribution has the useful property that span lengths are memoryless: given that a span has already reached length $\ell$, the probability that it continues for another token remains constant. This memoryless property simplifies the sampling procedure and ensures a natural spread of span lengths without complicated bookkeeping.

For a geometric distribution with mean $\mu = 3$, the probability of a span having exactly length $\ell$ is:

$$P(\text{length} = \ell) = (1 - p)^{\ell - 1}\, p$$

where:
- $\ell$: the span length in tokens (must be at least 1)
- $p$: the probability parameter of the geometric distribution, set to $p = 1/\mu = 1/3$
- $(1 - p)^{\ell - 1}$: the probability of "failing" to stop for the first $\ell - 1$ tokens
- $p$: the probability of "stopping" at exactly the $\ell$-th token

To build intuition for this formula, imagine flipping a biased coin for each token in the span. The coin has probability $p$ of coming up "stop" and probability $1 - p$ of coming up "continue." A span of length 1 occurs when we flip "stop" immediately on the first token, which happens with probability $p$. A span of length 2 occurs when we flip "continue" on the first token (probability $1 - p$) and then "stop" on the second (probability $p$), giving $(1 - p)\,p$. A span of length 3 requires "continue, continue, stop," which is $(1 - p)^2\, p$.

With $p = 1/3$, this gives us probabilities of approximately 33% for length 1, 22% for length 2, 15% for length 3, and so on. The distribution decays exponentially, ensuring that very long spans are rare but not impossible. This mix of span lengths exposes the model to a variety of reconstruction challenges: sometimes it must predict just a single word, sometimes a short phrase, and occasionally an extended expression.
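As a quick sanity check, the snippet below computes these probabilities directly from the formula. It is a minimal illustration of the distribution, not T5's actual sampling code:

```python
# Probabilities of each span length under a geometric distribution with
# stop probability p = 1/3 (which gives a mean span length of 3).
p = 1 / 3
for length in range(1, 6):
    prob = (1 - p) ** (length - 1) * p
    print(f"P(length = {length}) = {prob:.1%}")
# P(length = 1) = 33.3%, P(length = 2) = 22.2%, P(length = 3) = 14.8%, ...
```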
Sentinel Tokens
A sentinel token is a special token that serves as a placeholder for corrupted spans. T5 uses a sequence of sentinel tokens (<extra_id_0>, <extra_id_1>, etc.) where each unique sentinel marks a different corrupted span in the input.
When T5 corrupts a span, it replaces the entire span with a single sentinel token, regardless of how many tokens the span contained. The first corrupted span becomes <extra_id_0>, the second becomes <extra_id_1>, and so forth. This compression is crucial: it means the corrupted input is significantly shorter than the original, making training more efficient.
The naming convention "extra_id" reflects that these tokens exist outside the normal vocabulary. They carry no inherent meaning and exist purely as markers. Think of them as numbered placeholders, similar to how a librarian might place numbered bookmarks where removed pages once sat. Each sentinel uniquely identifies which span was removed from that position, allowing the model to precisely reconstruct what belongs where.
This design choice has an elegant efficiency benefit. If a span contains four tokens and we replace it with a single sentinel, we've reduced the input length by three tokens. Across an entire sequence with multiple corrupted spans, these savings accumulate substantially. A 512-token input might shrink to 460 tokens after corruption, allowing the encoder to process more information per unit of computation. This efficiency gain compounds across millions of training examples.
The target sequence contains the sentinel tokens followed by the spans they replaced, in order. This creates a clear mapping between placeholders and their corresponding content. When the model sees <extra_id_0> in the encoder input, it knows that somewhere in the target sequence, <extra_id_0> will be followed by the exact tokens that belong in that position. The sentinel acts as both a placeholder in the input and a delimiter in the output.
The Input-Target Transformation
Let's trace through the complete transformation with a concrete example, examining each step to build intuition for how span corruption restructures text into a training signal. Consider this input sentence:
"The quick brown fox jumps over the lazy dog and runs away."
Suppose the span corruption procedure selects two spans: "brown fox" (positions 2-3) and "lazy dog" (positions 7-8). The transformation produces:
Corrupted Input:
"The quick <extra_id_0> jumps over the <extra_id_1> and runs away."

Target Output:
"<extra_id_0> brown fox <extra_id_1> lazy dog"
To understand why this format is so effective, let's examine what the model must learn to successfully complete this task. Looking at the corrupted input, the model sees that something is missing between "quick" and "jumps." This something immediately follows "quick" and is modified by it. The model must infer that this is likely an animal (because animals jump) and that "quick" suggests it's being described with adjectives. The surrounding context of a sentence about jumping and running further constrains the possibilities.
Notice several key properties of this transformation:
- The input is shorter than the original (sentinel tokens replace multi-token spans)
- Each sentinel appears exactly once in the input and once in the target
- The target contains only the corrupted spans, not the full original text
- Sentinels in the target act as delimiters separating the reconstructed spans
This design means the model predicts far fewer tokens than the original sequence length, making training computationally efficient while still requiring deep understanding of context. The model doesn't waste computation predicting tokens that weren't corrupted. It focuses its predictive capacity entirely on the challenging reconstruction task.
The one-to-one correspondence between sentinels in input and target creates an elegant bookkeeping system. During generation, when the model outputs <extra_id_0>, it signals that the following tokens should fill the position marked by <extra_id_0> in the input. When it outputs <extra_id_1>, it transitions to filling the next gap. This structure allows the decoder to generate all missing content in a single autoregressive pass, without needing special position-tracking mechanisms.
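To make this bookkeeping concrete, here is a small string-level sketch that splices the target spans back into the corrupted input. It is our own illustration of the correspondence, the kind of post-processing you would only need when using a trained model to fill in blanks at inference time:

```python
import re

def reconstruct(corrupted_input: str, target: str) -> str:
    """Splice the spans from the target back into the corrupted input.

    Assumes sentinels appear in the same order in both sequences and that
    the target alternates sentinel, span, sentinel, span, ...
    """
    # Split the target on sentinel tokens; the piece before <extra_id_0> is empty.
    pieces = re.split(r"<extra_id_\d+>", target)
    spans = [piece.strip() for piece in pieces[1:]]

    # Replace each sentinel in the input with the span it stood in for.
    result = corrupted_input
    for i, span in enumerate(spans):
        result = result.replace(f"<extra_id_{i}>", f" {span} ")
    return re.sub(r"\s+", " ", result).strip()

corrupted = "The quick <extra_id_0> jumps over the <extra_id_1> and runs away."
target = "<extra_id_0> brown fox <extra_id_1> lazy dog"
print(reconstruct(corrupted, target))
# The quick brown fox jumps over the lazy dog and runs away.
```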
Mathematical Formulation
Having developed intuition for span corruption through examples, we can now formalize the objective mathematically. This formalization reveals the precise learning signal the model receives and connects span corruption to the broader family of sequence-to-sequence training objectives.
The span corruption objective can be formalized as follows. Given an input sequence $x = (x_1, x_2, \ldots, x_n)$, we:
- Sample span start positions and lengths according to the corruption parameters
- Replace each span with a unique sentinel token to produce the corrupted input $\tilde{x}$
- Construct the target $y$ by concatenating sentinels with their corresponding spans

The model is trained to maximize:

$$\mathcal{L}(\theta) = \sum_{t=1}^{T} \log P_{\theta}\!\left(y_t \mid y_{<t}, \tilde{x}\right)$$

where:
- $\mathcal{L}(\theta)$: the log-likelihood objective to maximize (equivalently, minimizing negative log-likelihood as the loss)
- $y$: the target sequence containing sentinels and their corresponding original spans
- $T$: the length of the target sequence
- $y_t$: the $t$-th token in the target sequence
- $y_{<t}$: all target tokens before position $t$ (the autoregressive context)
- $\tilde{x}$: the input sequence with spans replaced by sentinel tokens
- $\theta$: the model parameters (encoder and decoder weights)
- $P_{\theta}(y_t \mid y_{<t}, \tilde{x})$: the probability the model assigns to token $y_t$ given the corrupted input and previous target tokens

This formulation deserves careful unpacking. The summation runs over every token in the target sequence, meaning the model receives a learning signal for each prediction it makes. The conditioning on $\tilde{x}$ indicates that the encoder has processed the corrupted input and produced contextual representations that the decoder can attend to. The conditioning on $y_{<t}$ reflects the autoregressive nature of the decoder: when predicting token $y_t$, the model can see all previous target tokens it has generated.
This is the standard sequence-to-sequence cross-entropy loss, where the encoder processes the corrupted input and the decoder generates the target autoregressively. The encoder-decoder architecture is crucial here: the encoder builds rich bidirectional representations of the corrupted input (processing both left and right context simultaneously), while the decoder generates the target one token at a time, attending to both the encoder representations and its own previous outputs.
The strength of this formulation lies in its simplicity. Despite the somewhat complex corruption procedure, the training objective is identical to any other sequence-to-sequence task like machine translation. The model learns to generate the correct output sequence given the input sequence, and the span corruption simply defines what those input-output pairs look like during pre-training. This means the same architecture and training infrastructure developed for translation, summarization, and other seq2seq tasks applies directly to pre-training.
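In code, this is computed exactly like any other sequence-to-sequence loss. Below is a minimal sketch using the Hugging Face transformers library; the t5-small checkpoint and the hand-written example strings are purely illustrative, and a real pre-training run would stream corrupted batches from C4 instead:

```python
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

corrupted_input = "The quick <extra_id_0> jumps over the <extra_id_1> and runs away."
target = "<extra_id_0> brown fox <extra_id_1> lazy dog"

inputs = tokenizer(corrupted_input, return_tensors="pt")
labels = tokenizer(target, return_tensors="pt").input_ids

# Passing labels makes the model shift them right to form the decoder input
# and compute the token-level cross-entropy, i.e. the negative of the
# log-likelihood objective above.
outputs = model(**inputs, labels=labels)
print(outputs.loss)
```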
Why Span Corruption Outperforms Token Masking
The T5 paper compared span corruption against several alternatives, including BERT-style token masking. Span corruption consistently performed better for several reasons, each stemming from fundamental differences in what the model must learn.
When BERT masks tokens, each masked position is predicted independently using bidirectional context. The model produces a probability distribution over the vocabulary for each masked position, but these predictions don't need to cohere with each other. If "The [MASK] [MASK] jumps" has two masked positions, BERT predicts each independently. It might assign high probability to "brown" for the first position and high probability to "cat" for the second, even though "brown cat" would be unusual in context while "brown fox" would be natural.
Span corruption eliminates this incoherence problem entirely. When the decoder generates "brown fox" to fill a single span, it produces "fox" after having already committed to "brown." The autoregressive generation ensures that multi-token predictions are internally consistent. The model learns not just which words fit each position, but which word sequences form natural phrases.
The key advantages of span corruption include:
- Richer predictions: Generating multiple consecutive tokens requires modeling local coherence, not just fill-in-the-blank vocabulary matching
- Computational efficiency: Replacing spans with single sentinels shortens inputs, allowing larger batch sizes or longer sequences
- Better calibration: The model learns to generate varying-length outputs, preparing it for downstream generation tasks
The corruption rate of 15% with mean span length 3 emerged from extensive testing as an effective balance. Higher corruption rates provide more training signal per example but can make the task too difficult. Lower rates are easier but provide less learning signal. The mean span length of 3 ensures a good mix of single-token, short-phrase, and occasional longer-phrase predictions, exposing the model to reconstruction challenges at multiple scales.
The efficiency benefit is particularly important. Consider a 512-token input with 15% corruption (approximately 77 tokens masked). With BERT-style masking, the input remains 512 tokens, and the model predicts 77 tokens. With T5-style span corruption, assuming approximately 26 spans (77 tokens ÷ 3 mean length), the corrupted input shrinks to roughly 461 tokens (512 - 77 + 26). The encoder processes fewer tokens while the decoder still predicts the same 77 tokens of content. This efficiency compounds across the billions of training examples in pre-training.
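The back-of-the-envelope arithmetic behind those numbers looks like this (a quick illustration of the expected values, not a simulation):

```python
seq_len = 512
corruption_rate = 0.15
mean_span_length = 3

corrupted_tokens = seq_len * corruption_rate                 # ~77 tokens masked
num_spans = corrupted_tokens / mean_span_length              # ~26 spans
encoder_input_len = seq_len - corrupted_tokens + num_spans   # ~461 tokens

print(round(corrupted_tokens), round(num_spans), round(encoder_input_len))
# BERT-style masking would leave the encoder input at the full 512 tokens.
```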
Worked Example: Step-by-Step Corruption
Let's walk through the span corruption algorithm in detail, tracing each step to solidify understanding. This worked example will make the abstract procedure concrete, showing exactly how a sentence transforms into a training example. We'll use a simple sentence and trace each step.
Original text: "Machine learning models require large amounts of training data."
Step 1: Tokenization
Before any corruption occurs, the text must be converted to tokens. This step matters because corruption operates at the token level, not the word level. T5 uses SentencePiece tokenization, which might split words into subwords, but for this example we'll use word-level tokenization for clarity.
After word-level tokenization, we get:
| Position | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| Token | Machine | learning | models | require | large | amounts | of | training | data | . |
Each position in this sequence is now a candidate for inclusion in a corrupted span. The algorithm will select some of these positions to mask, group them into spans, and replace those spans with sentinels.
Step 2: Determine corruption budget
With 10 tokens and a 15% corruption rate, we aim to corrupt approximately 1.5 tokens. In practice, the algorithm samples span lengths that sum close to this target. The 15% rate is a target, not an exact requirement. Individual examples may vary slightly while the average across many examples approaches 15%.
In this short example, the low corruption budget means we'll likely corrupt just one short span. Longer sequences would have multiple spans, creating more complex reconstruction tasks.
Step 3: Sample span positions and lengths
The algorithm might sample one span of length 2 starting at position 4. This corrupts "large amounts." The choice of which positions to corrupt is random, ensuring the model sees varied examples. Some training examples might corrupt the beginning of sentences, others the middle or end, and still others might have multiple scattered spans.
Step 4: Apply corruption
Replace positions 4-5 with <extra_id_0>:
| Position | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| Token | Machine | learning | models | require | <extra_id_0> | of | training | data | . |
Notice that the corrupted sequence is now 9 tokens instead of 10. The two-token span "large amounts" has been replaced by the single token <extra_id_0>, achieving the compression that makes span corruption efficient. The sentinel serves as a placeholder that unambiguously marks where content was removed.
Step 5: Construct target
The target becomes: <extra_id_0> large amounts
The target sequence begins with the sentinel token, followed by the content of the span that sentinel replaced. If there were multiple spans, the target would contain multiple sentinel-content pairs in order: <extra_id_0> first span content <extra_id_1> second span content, and so forth.
Final result:
- Input: "Machine learning models require
<extra_id_0>of training data." - Target: "
<extra_id_0>large amounts"
The model learns that given the context of "models require ___ of training data," the missing span should be "large amounts." To succeed at this task, the model must understand that "require" takes a noun phrase object, that "of training data" suggests the object describes a quantity, and that "large amounts" is a natural phrase that satisfies these constraints. This single example teaches vocabulary, grammar, phrase structure, and world knowledge simultaneously.
Code Implementation
Let's implement span corruption from scratch to understand the mechanics, then see how to use the actual T5 tokenizer. Building the algorithm ourselves reveals the design decisions that make span corruption work, while the practical implementation shows how these ideas integrate with real NLP pipelines.
Building Span Corruption
We'll start by implementing the core span selection logic. This function samples span lengths from a geometric distribution, ensuring we corrupt approximately the target fraction of tokens while respecting the random nature of span lengths:
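Here is one way to write that sampler. It is a simplified sketch: the function name sample_span_lengths and its defaults are ours, not taken from the T5 codebase, which organizes the same computation differently.

```python
import numpy as np

def sample_span_lengths(num_tokens, corruption_rate=0.15, mean_span_length=3, seed=None):
    """Sample span lengths from a geometric distribution until their total
    reaches the corruption budget (corruption_rate * num_tokens)."""
    rng = np.random.default_rng(seed)
    budget = max(1, round(num_tokens * corruption_rate))
    p = 1.0 / mean_span_length  # "stop" probability; mean length = 1/p = 3
    span_lengths = []
    total = 0
    while total < budget:
        length = int(rng.geometric(p))          # sample from {1, 2, 3, ...}
        length = min(length, budget - total)    # do not overshoot the budget
        span_lengths.append(length)
        total += length
    return span_lengths

lengths = sample_span_lengths(100, seed=0)
print("span lengths:", lengths, "| total corrupted tokens:", sum(lengths))
```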
With 100 tokens and a 15% corruption budget, the sampler produces around five spans of varying lengths that together cover roughly 15% of the sequence. Notice how the span lengths vary: we get a mix of short and medium-length spans, exactly what the geometric distribution produces. Now let's implement the full corruption procedure, which must not only sample span lengths but also place them in the sequence without overlapping:
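The sketch below builds on sample_span_lengths above; again, it is our own illustration of the idea rather than the original T5 implementation.

```python
import numpy as np

def corrupt_spans(tokens, corruption_rate=0.15, mean_span_length=3, seed=None):
    """Apply T5-style span corruption to a list of tokens.

    Returns (corrupted_input, target) as token lists. Uses the
    sample_span_lengths() function defined above.
    """
    rng = np.random.default_rng(seed)
    span_lengths = sample_span_lengths(len(tokens), corruption_rate,
                                       mean_span_length, seed=seed)

    # Place each span at a random start position that does not overlap
    # positions already claimed by earlier spans.
    claimed = set()
    spans = []  # (start, length) pairs
    for length in span_lengths:
        candidates = [s for s in range(len(tokens) - length + 1)
                      if not any(pos in claimed for pos in range(s, s + length))]
        if not candidates:
            continue  # no room left for this span; skip it
        start = int(rng.choice(candidates))
        spans.append((start, length))
        claimed.update(range(start, start + length))

    # Sort spans by start position so sentinels appear in reading order.
    spans.sort()

    corrupted, target = [], []
    cursor = 0
    for sentinel_id, (start, length) in enumerate(spans):
        sentinel = f"<extra_id_{sentinel_id}>"
        corrupted.extend(tokens[cursor:start])        # keep uncorrupted tokens
        corrupted.append(sentinel)                    # one sentinel per span
        target.append(sentinel)                       # sentinel as delimiter...
        target.extend(tokens[start:start + length])   # ...followed by the span
        cursor = start + length
    corrupted.extend(tokens[cursor:])
    return corrupted, target
```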
The implementation carefully handles span placement to avoid overlaps. It tracks which positions have already been claimed by earlier spans and only places new spans in the remaining available positions. The final sorting step ensures sentinels appear in reading order, matching T5's expected format.
Let's test our implementation on a real sentence:
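Continuing from the functions above; the exact spans chosen depend on the random seed, so your output will vary:

```python
sentence = "Machine learning models require large amounts of training data ."
tokens = sentence.split()

corrupted, target = corrupt_spans(tokens, seed=7)
print("Original: ", " ".join(tokens))
print("Corrupted:", " ".join(corrupted))
print("Target:   ", " ".join(target))
```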
The corrupted input is shorter than the original because multi-token spans are replaced by single sentinel tokens. The target contains only the sentinels and their corresponding spans, making it much shorter than the full reconstruction target used in BERT-style approaches. This efficiency, with shorter inputs and focused targets, compounds across millions of training examples to significantly reduce computational costs.
Using the T5 Tokenizer
In practice, you'll use the Hugging Face transformers library which handles tokenization and provides pre-trained models. The library's tokenizer includes all the sentinel tokens and correctly handles the subword tokenization that T5 expects:
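A short example, assuming the transformers and sentencepiece packages are installed and using the t5-small checkpoint:

```python
from transformers import T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-small")

# The sentinel tokens sit at the very top of the vocabulary
# (32099, 32098, ... for the released checkpoints).
sentinels = [f"<extra_id_{i}>" for i in range(3)]
print(tokenizer.convert_tokens_to_ids(sentinels))
print("vocabulary size:", len(tokenizer))

# Encode a corrupted-input / target pair produced by span corruption.
input_ids = tokenizer("The quick <extra_id_0> jumps over the <extra_id_1> and runs away.").input_ids
label_ids = tokenizer("<extra_id_0> brown fox <extra_id_1> lazy dog").input_ids
print(tokenizer.convert_ids_to_tokens(input_ids))
```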
T5's tokenizer includes 100 sentinel tokens by default, allowing for up to 100 separate spans to be corrupted in a single example. The token IDs are at the end of the vocabulary, keeping them separate from regular text tokens. This separation is intentional: sentinel tokens should never appear in normal text, so placing them at vocabulary boundaries helps prevent accidental collisions.
Examining Real T5 Tokenization
Let's see how T5's SentencePiece tokenizer handles our example text. SentencePiece tokenization can split words into subwords, which affects how span corruption operates at a fine-grained level:
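The exact token pieces depend on the checkpoint's vocabulary, but the snippet below shows the idea:

```python
from transformers import T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-small")

text = "Machine learning models require large amounts of training data."
print(tokenizer.tokenize(text))
# Pieces that begin with the ▁ character start a new word in the original
# text; rare or unusual words are split into several subword pieces.
```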
T5 uses SentencePiece tokenization, which may split words into subword units. The underscore character (▁) indicates the start of a new word in the original text. This subword approach allows T5 to handle rare words and morphological variations gracefully. Even words never seen during training can be represented as combinations of known subwords.
T5 Pre-training Data: The C4 Corpus
T5 was pre-trained on the Colossal Clean Crawled Corpus (C4), a massive dataset created specifically for this research. Understanding C4 helps explain T5's capabilities and limitations.
C4 Construction
C4 was derived from Common Crawl, a publicly available web scrape. The researchers applied aggressive filtering to clean the data:
- Language filtering: Only English text was retained, using a language detection model
- Quality heuristics: Pages were removed if they contained fewer than 5 sentences, had sentences shorter than 3 words, or contained offensive words from a blocklist
- Deduplication: Exact duplicate paragraphs were removed across the entire corpus
- JavaScript/boilerplate removal: Common web artifacts like "Terms of Service" and cookie notices were filtered out
The resulting dataset contains approximately 750 GB of clean text, roughly 156 billion tokens. This scale was unprecedented at the time and enabled training much larger models than previous work.
Quality vs. Quantity Tradeoffs
The T5 paper experimented with different levels of filtering. More aggressive filtering produced higher-quality text but reduced dataset size. The researchers found that:
- Light filtering produced worse downstream performance despite more data
- Heavy filtering (C4) struck a good balance between quality and quantity
- Extremely aggressive filtering reduced the dataset too much, hurting performance
This finding influenced subsequent work on LLM pre-training, emphasizing that data quality often matters more than raw quantity.
T5 Training Scale
The T5 paper introduced models at five different scales, enabling systematic study of how performance varies with model size.
Model Variants
T5 was released in five sizes:
| Model | Parameters | Encoder Layers | Decoder Layers | Hidden Size | Attention Heads |
|---|---|---|---|---|---|
| T5-Small | 60M | 6 | 6 | 512 | 8 |
| T5-Base | 220M | 12 | 12 | 768 | 12 |
| T5-Large | 770M | 24 | 24 | 1024 | 16 |
| T5-3B | 3B | 24 | 24 | 1024 | 32 |
| T5-11B | 11B | 24 | 24 | 1024 | 128 |
As the table above shows, the largest model (T5-11B) achieved state-of-the-art results on numerous benchmarks at the time of release. The size progression allowed researchers to study scaling laws and helped establish that larger models consistently perform better when trained appropriately.
Training Configuration
T5 models were trained with the following key hyperparameters:
- Sequence length: 512 tokens for both encoder and decoder
- Batch size: 128 sequences (65,536 tokens per batch for T5-Base)
- Training steps: 1 million steps for the base configuration
- Optimizer: Adafactor, a memory-efficient variant of Adam
- Learning rate: Inverse square root decay schedule
The total compute for T5-11B was substantial, requiring multiple TPU pods running for weeks. This level of investment established a new baseline for what's possible with sufficient resources.
Effective Training Tokens
With 1 million training steps and a batch size of 128 sequences, the model sees:

$$1{,}000{,}000 \text{ steps} \times 128 \text{ sequences/step} = 128{,}000{,}000 \text{ sequences}$$

With 512 tokens per sequence:

$$128{,}000{,}000 \text{ sequences} \times 512 \text{ tokens/sequence} \approx 65.5 \text{ billion tokens}$$
This means T5 was trained for less than one epoch over C4, seeing each example on average less than once. This is common in large-scale pre-training: the dataset is so large that even partial coverage provides sufficient training signal.
Visualizing Span Corruption Statistics
Let's examine the distribution of span lengths and corruption patterns:
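A quick simulation makes the point; this is our own sketch, sampling span lengths with NumPy rather than reproducing T5's data pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 1 / 3  # stop probability; mean span length of 3

# Compare the empirical span-length distribution against the geometric pmf.
samples = rng.geometric(p, size=100_000)
for k in range(1, 7):
    empirical = np.mean(samples == k)
    theoretical = (1 - p) ** (k - 1) * p
    print(f"length {k}: empirical {empirical:.3f}   theoretical {theoretical:.3f}")

print("fraction of single-token spans:", np.mean(samples == 1))
print("mean span length:", samples.mean())
```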
The empirical distribution closely matches the theoretical geometric distribution. About one-third of spans contain just a single token, making span corruption somewhat similar to token-level masking in many cases. However, the longer spans provide valuable training signal for learning longer-range dependencies.
At T5's default 15% corruption rate with mean span length 3, we expect about 25 spans in a 512-token sequence. The input length shrinks to roughly 460 tokens because each multi-token span is replaced by a single sentinel.
Limitations and Impact
T5's pre-training approach, while innovative and effective, has several important limitations and requirements that shaped its adoption and influence on the field.
Computational Requirements
T5's pre-training demands significant resources. Training T5-11B required TPU pods running for extended periods, placing such models out of reach for most researchers. Even inference with the largest models requires specialized hardware. This computational barrier led to a focus on smaller variants (T5-Small, T5-Base) for practical applications.
The span corruption objective itself is efficient, producing targets shorter than the original input. However, the encoder-decoder architecture runs two full Transformer stacks, an encoder pass over the input and a decoder pass over the output, making T5 less efficient than decoder-only models for pure generation tasks.
Data Limitations
C4's English-only focus means T5 underperforms on non-English tasks without additional multilingual training. The web-crawled data also contains biases present in internet text, and the quality filtering may have inadvertently removed certain types of legitimate content. Later work addressed some of these issues with mT5 (multilingual T5) and updated versions of the C4 corpus.
What T5 Unlocked
Despite these limitations, T5 had substantial impact on the field:
The text-to-text framework demonstrated that diverse NLP tasks could be unified under a single paradigm. Rather than designing task-specific architectures, researchers could cast any problem as text generation. This simplified both training and deployment, as a single model could handle classification, generation, question answering, and translation.
The systematic empirical study in the T5 paper established methodological standards for LLM research. By testing many design choices in a controlled setting, the authors provided evidence-based guidance that influenced subsequent work. Findings about data quality, model scaling, and training objectives became foundational knowledge.
T5 also demonstrated the continued relevance of encoder-decoder architectures. While GPT-style decoder-only models dominated certain applications, T5 showed that encoders provide advantages for tasks requiring deep understanding of input context before generating output.
Comparison with Other Approaches
Span corruption occupies a middle ground between BERT's masked language modeling and GPT's causal language modeling:
| Approach | Input Processing | Target | Best For |
|---|---|---|---|
| BERT MLM | Bidirectional | Single masked tokens | Classification, NLU |
| T5 Span Corruption | Bidirectional | Corrupted spans | Conditional generation |
| GPT CLM | Causal (left-to-right) | Next token | Open-ended generation |
As the comparison above shows, T5's approach excels when the task has clear input-output structure, such as translation, summarization, or question answering. For open-ended generation without explicit conditioning, decoder-only models tend to be more natural fits.
Summary
T5 pre-training centers on span corruption, a denoising objective that teaches the model to reconstruct corrupted portions of text. The key innovations include:
- Span-level masking: Rather than individual tokens, T5 masks contiguous spans of text with an average length of 3 tokens. This provides richer training signal than single-token prediction.
- Sentinel tokens: Each corrupted span is replaced by a unique sentinel token (<extra_id_0>, <extra_id_1>, etc.), creating a compact input representation. The model learns to map these placeholders back to their original content.
- Efficient targets: The target sequence contains only the corrupted spans, not the full original text. This makes training more computationally efficient while still requiring comprehensive understanding of context.
- The C4 corpus: T5 introduced a carefully filtered 750 GB dataset derived from Common Crawl, demonstrating that data quality significantly impacts downstream performance.
- Scale studies: With models ranging from 60 million to 11 billion parameters, T5 established clear scaling relationships and provided the community with models at various resource requirements.
The span corruption objective, combined with the text-to-text framework, produced models that achieved state-of-the-art results across diverse benchmarks. While subsequent decoder-only models have dominated headlines, T5's encoder-decoder architecture and pre-training approach remain highly effective for tasks with clear input-output structure, continuing to influence modern NLP systems.