T5 Task Formatting: Text-to-Text NLP Unification

Michael BrenndoerferOctober 15, 202536 min read

Learn how T5 reformulates all NLP tasks as text-to-text problems. Master task prefixes, classification, NER, and QA formatting for unified language models.

Reading Level

Choose your expertise level to adjust how many terms are explained. Beginners see more tooltips, experts see fewer to maintain reading flow. Hover over underlined terms for instant definitions.

T5 Task Formatting

In the previous chapters, we explored T5's encoder-decoder architecture and its span corruption pre-training objective. But what makes T5 distinctive isn't just its architecture. The key insight is that every NLP task can be reformulated as a text-to-text problem. Classification, translation, summarization, question answering, and even structured prediction tasks like named entity recognition can all be expressed as "given this input text, produce this output text." This unification simplifies natural language processing by replacing many specialized methods with a single, coherent framework. This chapter explores how T5 achieves this unification through clever task formatting and what it means for building versatile language models.

The Text-to-Text Paradigm

Traditional NLP systems treat different tasks as fundamentally different problems. A classifier predicts discrete labels. A tagger assigns one label per token. A generator produces sequences. Each task type requires its own output layer, loss function, and often its own model architecture. This fragmentation meant that advances in one area did not automatically transfer to others. A useful innovation in sequence labeling might require substantial rearchitecting to apply to classification tasks. T5 rejects this fragmentation entirely.

The key insight is that text itself is a universal interface. Any structured output, whether a single label, a sequence of tags, or a complex annotation, can be serialized as a string. Consider what this means. A sentiment label like "positive" is just text. An entity tag sequence like "B-PER I-PER O B-LOC" is just text. A translated sentence is obviously text. If the output is a string, and the input is already a string, then every NLP task becomes sequence-to-sequence generation. This means a single model with a single architecture and training procedure can handle them all.

To see why this matters, consider the alternative. Before unified approaches like T5, a practitioner building a multi-task NLP system might need a BERT-based classifier for sentiment analysis with a classification head. They might need a separate BiLSTM-CRF for named entity recognition. They might also require a transformer encoder-decoder for translation and another architecture for summarization. Each model required its own training pipeline, its own hyperparameter tuning, and its own deployment infrastructure. The text-to-text paradigm collapses this complexity into a single model that learns to perform all tasks through the same mechanism.

Out[2]:
Visualization
Traditional NLP requires specialized architectures for each task type.
Traditional NLP requires specialized architectures for each task type.
The text-to-text paradigm unifies all tasks through a single encoder-decoder model.
The text-to-text paradigm unifies all tasks through a single encoder-decoder model.
Text-to-Text Transfer

The text-to-text paradigm treats every NLP task as translating from one text string to another. The model learns a unified mapping from input sequences to output sequences, regardless of whether the underlying task is classification, extraction, or generation.

This unification brings several benefits:

  • Single architecture: The same encoder-decoder model handles all tasks without modification
  • Shared pre-training: Knowledge learned during pre-training transfers to any downstream task
  • Multitask learning: Multiple tasks can be mixed in a single training batch
  • Zero-shot generalization: The model can attempt new tasks if given appropriate formatting

Beyond these practical benefits, there's something conceptually satisfying about this approach. It suggests that the boundary between "understanding" and "generation" may be more fluid than traditional NLP architectures implied. A model that can generate the correct answer to a question demonstrates understanding of both the question and the relevant context. A model that generates appropriate entity labels has learned to recognize those entities. Generation becomes the unified test of linguistic competence.

Task Prefixes

How does T5 know which task to perform when given an input? The answer is task prefixes, short text strings prepended to the input that signal the desired operation. This mechanism is remarkably simple yet surprisingly powerful, essentially teaching the model to follow instructions embedded in natural language.

Consider these examples:

  • "translate English to German: The house is wonderful."
  • "summarize: The stock market fell sharply today after..."
  • "question: What is the capital of France? context: Paris is the capital..."

The prefix acts as a routing instruction, telling the model how to process the input and what output to generate. When the model sees "translate English to German:", it knows to treat the following text as English source material and produce German output. When it sees "summarize:", it understands that it should produce a condensed version of the following content. This resembles how special tokens function in earlier chapters, but instead of learned embeddings with special meanings, T5 uses natural language instructions that the model learns to interpret through training.

This approach uses the model's core competency (understanding language) to specify its behavior. Rather than introducing specialized tokens or architectural modifications for each task, T5 leverages the same linguistic representations it uses for everything else. The model learns that certain word patterns at the start of the input correlate with certain expected output patterns, much as it learns any other linguistic regularity.

The following examples show how task prefixes work in practice:

In[4]:
Code
from transformers import T5Tokenizer, T5ForConditionalGeneration
import warnings

warnings.filterwarnings("ignore")

## Load T5 model and tokenizer
model_name = "t5-small"
tokenizer = T5Tokenizer.from_pretrained(model_name, legacy=False)
model = T5ForConditionalGeneration.from_pretrained(model_name)
In[5]:
Code
## Different tasks with different prefixes
tasks = [
    "translate English to German: The weather is nice today.",
    "translate English to French: Hello, how are you?",
    "summarize: Machine learning is a subset of artificial intelligence that enables computers to learn from data without being explicitly programmed. It has applications in image recognition, natural language processing, and many other fields.",
]
In[6]:
Code
## Generate outputs for each task
results = []
for task_input in tasks:
    inputs = tokenizer(
        task_input, return_tensors="pt", max_length=512, truncation=True
    )
    outputs = model.generate(**inputs, max_new_tokens=64)
    decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
    results.append((task_input[:50] + "...", decoded))
Out[7]:
Console
Input: translate English to German: The weather is nice t...
Output: Das Wetter ist heute schön.
------------------------------------------------------------
Input: translate English to French: Hello, how are you?...
Output: Bonjour, comment êtes-vous?
------------------------------------------------------------
Input: summarize: Machine learning is a subset of artific...
Output: machine learning is a subset of artificial intelligence that enables computers to learn from data without being explicitly programmed. it has applications in image recognition, natural language processing, and many other fields.
------------------------------------------------------------

The model produces task-appropriate outputs based solely on the prefix—German translation for the first, French for the second, and a summary for the third. The prefix creates an implicit conditioning that shapes the entire generation process. Note that T5-small is a relatively small model, so translation quality may be limited compared to larger variants. The key insight is that the same architecture handles fundamentally different tasks through prefix-based routing.

Prefix Design Principles

The choice of prefix matters more than one might initially expect. T5's authors experimented with different prefix styles and found that clarity and consistency are more important than brevity. The prefix isn't just a tag. It's a specification that the model must parse and act upon. Ambiguous or inconsistent prefixes lead to ambiguous or inconsistent outputs. Effective prefixes share several characteristics:

  • Explicit task naming: "translate", "summarize", "classify" clearly indicate the operation
  • Parameter specification: "English to German" specifies source and target languages
  • Consistent formatting: Using the same pattern ("task: input") across all tasks
  • Natural language: Prefixes read as instructions a human would understand

The natural language aspect is worth examining. Because T5 is trained on vast amounts of natural text, it has strong priors about how language works. A prefix like "translate English to German:" leverages the model's understanding of what translation means, what English and German are, and how to interpret the colon as a delimiter. This is why natural language prefixes often work better than arbitrary codes or symbols. They use knowledge the model already has.

Here's how different prefix choices affect the same underlying task:

In[8]:
Code
## Same task, different prefix formulations
prefix_variations = [
    "cola sentence: The cat sat on the mat.",  # T5's actual format for CoLA
    "Is this sentence grammatically correct? The cat sat on the mat.",
    "grammar check: The cat sat on the mat.",
]

The comments highlight an important point. T5 was trained with specific prefixes like "cola sentence:", so using different formulations at inference time may produce unexpected results unless the model has been fine-tuned on the new format.

The specific prefixes used during training become part of the model's learned behavior. T5's original training used prefixes like "cola sentence:" for the CoLA grammaticality task, "sst2 sentence:" for sentiment analysis, and "translate English to German:" for translation. Using different prefixes at inference time may produce unexpected results unless the model has been fine-tuned on the new format. This highlights an important principle: the text-to-text paradigm is flexible but not magical—the model can only reliably perform tasks it has been trained to recognize through their prefixes.

Classification as Generation

Converting classification to generation might seem wasteful. Why generate text when you only need a label? This question highlights the efficiency versus flexibility trade-off that defines the text-to-text paradigm. But the benefits of unification outweigh the slight overhead, and the approach proves surprisingly powerful in practice.

To understand why, consider what classification is. A classifier takes an input and selects one of several possible outputs. In traditional systems, this selection happens through a softmax over a fixed set of classes. In T5, the selection happens through generation. The "selection" is the model's choice of which label text to produce. The mechanism differs, but the underlying computation is conceptually similar. The model must encode the input, reason about its meaning, and produce an appropriate response.

For binary classification, the model simply generates a label token.

In[9]:
Code
## Sentiment classification as text generation
sentiment_examples = [
    ("sst2 sentence: This movie was absolutely fantastic!", "positive"),
    ("sst2 sentence: What a waste of time and money.", "negative"),
    ("sst2 sentence: The acting was superb but the plot was confusing.", "?"),
]
Out[10]:
Console
Input:    sst2 sentence: This movie was absolutely fantastic!
Expected: positive

Input:    sst2 sentence: What a waste of time and money.
Expected: negative

Input:    sst2 sentence: The acting was superb but the plot was confusing.
Expected: ?

These examples show how sentiment classification maps to text generation. The model learns to produce "positive" or "negative" based on the input text. The third example with "?" as expected output illustrates an ambiguous case—real training data would need a consistent strategy for handling mixed sentiment.

During training, the model learns to generate the appropriate label text. For SST-2 (Stanford Sentiment Treebank), T5 was trained to output "positive" or "negative". For CoLA (grammaticality), it outputs "acceptable" or "unacceptable". The choice of label text is arbitrary in principle—the model could learn to output "1" for positive and "0" for negative—but descriptive labels leverage the model's semantic understanding and tend to work better in practice.

Multi-class and Multi-label Classification

The text-to-text format naturally extends to more complex classification scenarios without requiring any architectural changes. This is where the paradigm's flexibility shines. A traditional multi-class classifier needs its output layer sized to the number of classes. Adding a new class means modifying the architecture. In T5, adding a new class just means introducing a new label string in the training data.

In[11]:
Code
## Multi-class: News topic classification
multiclass_examples = [
    (
        "classify topic: Apple announces new iPhone with revolutionary camera.",
        "technology",
    ),
    ("classify topic: Lakers defeat Warriors in overtime thriller.", "sports"),
    (
        "classify topic: Federal Reserve raises interest rates by 0.25%.",
        "business",
    ),
]

## Multi-label: Multiple applicable tags
multilabel_examples = [
    (
        "tags: A new AI chip powers the latest smartphone.",
        "technology, electronics",
    ),
    (
        "tags: Tech stocks surge after earnings reports.",
        "technology, business, finance",
    ),
]

For multi-label classification, the model generates multiple labels separated by a delimiter. This approach is more flexible than traditional multi-hot encoding because the model can generate any number of labels and even labels it hasn't seen during training (though with uncertain accuracy). The model learns not just which labels to generate, but also the delimiter pattern and the approximate number of labels appropriate for different inputs.

Extracting Probabilities

One apparent limitation of generation-based classification is losing access to prediction probabilities. Traditional classifiers output a probability distribution over labels. This is useful for calibration, thresholding, and uncertainty estimation. If a traditional classifier is 51% confident about "positive" and 49% about "negative", you know the prediction is uncertain. With greedy generation, you just get "positive" with no indication of the model's uncertainty.

T5 addresses this by allowing access to the decoder's output logits. Since generation proceeds token by token, and each token is selected based on a probability distribution over the vocabulary, we can recover probability information by examining these distributions directly:

In[12]:
Code
import torch


def get_label_probability(model, tokenizer, input_text, labels):
    """Get probability of each label for a classification task."""
    # Tokenize input
    inputs = tokenizer(input_text, return_tensors="pt")

    # Get logits for each label
    label_probs = {}
    for label in labels:
        label_ids = tokenizer(label, return_tensors="pt").input_ids

        # Forward pass to get logits
        with torch.no_grad():
            outputs = model(**inputs, decoder_input_ids=label_ids[:, :-1])
            logits = outputs.logits

        # Calculate probability of generating this label
        # (simplified - real implementation handles multiple tokens)
        probs = torch.softmax(logits[0, -1, :], dim=-1)
        # Handle case where label tokenizes to single token (index 0 after BOS)
        target_idx = label_ids[0, 0].item()
        label_probs[label] = probs[target_idx].item()

    return label_probs

This approach scores how likely the model is to generate each candidate label given the input, recovering the probability information that traditional classifiers provide directly. The key insight is that we're not just looking at what the model generates, but at the probability distribution from which it generates. This distribution contains rich information about the model's confidence and the relative likelihood of alternative outputs.

Out[13]:
Visualization
Traditional classifiers output a softmax distribution over fixed classes.
Traditional classifiers output a softmax distribution over fixed classes.
T5 generates label text with probabilities derived from the vocabulary distribution.
T5 generates label text with probabilities derived from the vocabulary distribution.

Named Entity Recognition as Generation

Named entity recognition, as we covered in Part VI, traditionally uses BIO tagging to assign labels to each token. This approach is elegant for sequence labeling architectures that produce one output per input position. However, it doesn't naturally fit the text-to-text paradigm where input and output lengths can differ arbitrarily. Converting NER to text-to-text requires reformulating the task: instead of predicting per-token labels, the model generates the entities directly.

This reformulation changes what the model needs to learn. Instead of learning to classify each token independently, or with CRF dependencies, the model learns to read the input text, identify entity spans, and express those findings in in a structured output format. This is closer to how humans perform entity recognition. We do not mentally label each word as B-PER or I-PER. We read the text and notice that "Marie Curie" is a person's name.

There are two main approaches to formatting NER as text generation:

Approach 1: Entity Listing

The model generates a structured list of entities and their types. This format separates the entity recognition task from positional information. It focuses purely on what entities exist and their types.

In[14]:
Code
## NER as entity extraction
ner_examples = [
    {
        "input": "ner: Barack Obama was born in Honolulu, Hawaii.",
        "output": "person: Barack Obama, location: Honolulu, location: Hawaii",
    },
    {
        "input": "ner: Apple Inc. announced the iPhone 15 in Cupertino.",
        "output": "organization: Apple Inc., product: iPhone 15, location: Cupertino",
    },
    {
        "input": "ner: The patient was prescribed 500mg of ibuprofen.",
        "output": "dosage: 500mg, medication: ibuprofen",
    },
]
Out[15]:
Console
Input:  ner: Barack Obama was born in Honolulu, Hawaii.
Output: person: Barack Obama, location: Honolulu, location: Hawaii

Input:  ner: Apple Inc. announced the iPhone 15 in Cupertino.
Output: organization: Apple Inc., product: iPhone 15, location: Cupertino

Input:  ner: The patient was prescribed 500mg of ibuprofen.
Output: dosage: 500mg, medication: ibuprofen

This format is intuitive and handles overlapping entities naturally, since each entity is listed separately regardless of position. The model learns a consistent pattern: type followed by colon, then entity text, with entities separated by commas. This regularity helps the model generalize to new entity types and new texts.

Approach 2: Inline Markup

Alternatively, entities can be marked inline with special delimiters. This approach preserves the original text structure while adding entity annotations.

In[16]:
Code
## NER with inline markup
inline_ner_examples = [
    {
        "input": "extract entities: Barack Obama was born in Honolulu, Hawaii.",
        "output": "[PER Barack Obama] was born in [LOC Honolulu], [LOC Hawaii].",
    },
    {
        "input": "extract entities: Microsoft acquired GitHub for $7.5 billion.",
        "output": "[ORG Microsoft] acquired [ORG GitHub] for [MONEY $7.5 billion].",
    },
]

This approach preserves entity positions and makes nested entities explicit, but requires careful parsing of the output. The inline format has the advantage of maintaining context around entities, which can help with certain downstream applications. However, it requires the model to regenerate the entire input with markup added. This is more computationally expensive and introduces more opportunities for the model to make mistakes.

Out[17]:
Visualization
Two approaches to NER as text generation. Entity listing extracts entities as type-value pairs, while inline markup annotates entities within the original text structure.
Two approaches to NER as text generation. Entity listing extracts entities as type-value pairs, while inline markup annotates entities within the original text structure.

Handling Edge Cases

The generative approach to NER introduces challenges that traditional sequence labeling doesn't face. Because the model generates entities as discrete items rather than labeling positions, it must handle various edge cases that don't arise in BIO tagging.

In[18]:
Code
## Edge cases in generative NER
edge_cases = [
    # Entity spans multiple mentions of same text
    {
        "input": "ner: New York hosted the New York Marathon.",
        "output": "location: New York, event: New York Marathon",
    },
    # Nested entities
    {
        "input": "ner: University of California, Berkeley researchers published...",
        "output": "organization: University of California, Berkeley, location: California, location: Berkeley",
    },
    # No entities
    {
        "input": "ner: It was a beautiful day.",
        "output": "",  # Or "none" depending on format
    },
]

Training data must cover these edge cases consistently for the model to handle them reliably. The format chosen during training becomes critical. Mixing formats leads to inconsistent outputs. If some training examples list nested entities and others don't, the model won't learn a consistent policy for handling nesting. Practitioners must make deliberate choices about how to handle these edge cases and apply those choices uniformly across training data.

Question Answering as Generation

Question answering fits the text-to-text paradigm naturally, more so than most tasks. The core operation—reading a question, consulting some source of knowledge, and producing an answer—is inherently about transforming one text into another. Given a question and context, generate the answer:

In[19]:
Code
## Extractive QA: answer is a span from the context
extractive_qa = {
    "input": "question: What is the capital of France? context: Paris is the capital and largest city of France. It is located on the Seine River.",
    "output": "Paris",
}

## Multi-hop QA: requires reasoning across facts
multihop_qa = {
    "input": "question: Who founded the company that makes the iPhone? context: Apple Inc. manufactures the iPhone. Apple was founded by Steve Jobs, Steve Wozniak, and Ronald Wayne.",
    "output": "Steve Jobs, Steve Wozniak, and Ronald Wayne",
}
Out[20]:
Console
Extractive QA Input: question: What is the capital of France? context: Paris is t...
Extractive QA Output: Paris

Multi-hop QA Input: question: Who founded the company that makes the iPhone? con...
Multi-hop QA Output: Steve Jobs, Steve Wozniak, and Ronald Wayne

The extractive example shows a simple fact lookup where the answer appears directly in the context. The multi-hop example requires combining information from multiple sentences, identifying that Apple makes the iPhone, then finding who founded Apple. This reasoning capability emerges from T5's pre-training on diverse text. The model learns not just to locate information, but to chain together facts and synthesize coherent answers.

In[21]:
Code
## Let's see T5 handle a question
qa_input = "question: What is the capital of Germany? context: Berlin is the capital and largest city of Germany. Munich is the third largest city."
inputs = tokenizer(
    qa_input, return_tensors="pt", max_length=512, truncation=True
)
outputs = model.generate(**inputs, max_new_tokens=32)
answer = tokenizer.decode(outputs[0], skip_special_tokens=True)
Out[22]:
Console
Question: What is the capital of Germany?
Answer:   Berlin

T5 correctly extracts "Berlin" from the provided context. This demonstrates the open-book QA pattern where the model locates and returns the relevant information from the given passage rather than relying on memorized knowledge.

Open-book vs Closed-book QA

T5's text-to-text format supports a distinction that reveals what large language models actually learn. The format supports both open-book QA, where context is provided, and closed-book QA, where the model must rely entirely on knowledge encoded in its parameters.

In[23]:
Code
## Open-book QA: context provided
open_book = "question: When was Python created? context: Python was created by Guido van Rossum and first released in 1991."

## Closed-book QA: no context, rely on parametric knowledge
closed_book = "question: When was Python created?"

## Both use the same format, but closed-book relies on knowledge stored in model weights

Open-book QA provides the answer source explicitly. The model's job is comprehension and extraction: finding the relevant information in the provided text. Closed-book QA tests whether the model memorized relevant facts during pre-training. T5 demonstrated surprisingly strong closed-book QA performance, suggesting that large-scale pre-training encodes substantial world knowledge. This finding was influential in demonstrating that large language models do not just learn linguistic patterns. They also absorb and can retrieve factual information.

Abstractive vs Extractive Answers

Unlike span-based extractive QA models that can only copy text from the context, T5 can generate abstractive answers. This changes what question answering can accomplish. Traditional extractive models are limited to answers that appear verbatim in the source text. T5 can synthesize, paraphrase, and reason.

In[24]:
Code
## Extractive: answer is verbatim from context
extractive_example = {
    "context": "The Eiffel Tower is 330 meters tall.",
    "question": "How tall is the Eiffel Tower?",
    "extractive_answer": "330 meters tall",
    "abstractive_answer": "It stands at 330 meters.",
}

## T5 might paraphrase or synthesize information
synthesis_example = {
    "context": "Marie Curie won the Nobel Prize in Physics in 1903. She also won the Nobel Prize in Chemistry in 1911.",
    "question": "How many Nobel Prizes did Marie Curie win?",
    "answer": "two",  # Not a direct span!
}

The synthesis example demonstrates this well. The word "two" doesn't appear anywhere in the context, yet it's the correct answer. The model must count the Nobel Prizes mentioned and express that count as a word. This kind of reasoning and synthesis goes beyond pattern matching. It requires comprehension and inference.

This flexibility is powerful but requires careful training data design. If training answers are always extractive, the model will learn to copy; if they are abstractive, it may introduce errors or hallucinations. The nature of the training data shapes the model's behavior in subtle but important ways.

Additional Task Formatting Examples

The text-to-text paradigm extends far beyond the examples we've covered. Its power is its universality. Essentially any NLP task that can be described as "given this text, produce that text" fits naturally into the framework. Let's examine formatting for several more task types:

Semantic Similarity

Sentence similarity tasks ask whether two sentences mean the same thing. This is a fundamental capability underlying many applications, from duplicate detection to paraphrase identification.

In[25]:
Code
similarity_examples = [
    {
        "input": "mrpc sentence1: The company posted revenue of $1.2 billion. sentence2: Revenue reached $1.2 billion.",
        "output": "equivalent",
    },
    {
        "input": "mrpc sentence1: The stock fell 5%. sentence2: The stock rose 5%.",
        "output": "not_equivalent",
    },
]

For graded similarity (STS-B benchmark), the output is a numerical score. This demonstrates how the text-to-text format handles continuous outputs by simply generating the number as text.

In[26]:
Code
graded_similarity = {
    "input": "stsb sentence1: A man is playing guitar. sentence2: A person is playing a musical instrument.",
    "output": "4.2",  # Score from 0-5
}

Natural Language Inference

NLI determines whether a hypothesis follows from a premise. This task tests logical reasoning and semantic understanding, asking the model to evaluate relationships between statements.

In[27]:
Code
nli_examples = [
    {
        "input": "mnli hypothesis: The animal is sleeping. premise: The cat is curled up on the couch with its eyes closed.",
        "output": "entailment",
    },
    {
        "input": "mnli hypothesis: It is raining. premise: People are carrying umbrellas.",
        "output": "neutral",  # Possible but not certain
    },
    {
        "input": "mnli hypothesis: The restaurant is empty. premise: The restaurant is crowded with diners.",
        "output": "contradiction",
    },
]

Text Correction

Grammar and spelling correction naturally fit the text-to-text format. The task is to transform erroneous text into corrected text.

In[28]:
Code
correction_examples = [
    {
        "input": "correct: Their going to the store tommorrow.",
        "output": "They're going to the store tomorrow.",
    },
    {
        "input": "correct: The quick brown fox jump over the lazy dog.",
        "output": "The quick brown fox jumps over the lazy dog.",
    },
]

Summarization with Length Control

T5 can be trained to follow length specifications. By incorporating length requirements into the prefix, the model learns to produce summaries of varying granularity.

In[29]:
Code
## Summarization with different length targets
summarization_variants = [
    {
        "input": "summarize in one sentence: [long article text]",
        "output": "Brief one-line summary.",
    },
    {
        "input": "summarize in 3 sentences: [long article text]",
        "output": "First key point. Second important detail. Final takeaway.",
    },
    {
        "input": "summarize for twitter: [long article text]",
        "output": "Ultra-brief summary under 280 chars.",
    },
]

Structured Data Generation

Even structured outputs can be serialized as text. This shows the text-to-text paradigm's flexibility. Any output that can be expressed as a string becomes a valid target.

In[30]:
Code
structured_outputs = [
    # JSON-like output
    {
        "input": "extract info: John Smith is a 35-year-old software engineer at Google.",
        "output": '{"name": "John Smith", "age": 35, "occupation": "software engineer", "employer": "Google"}',
    },
    # SQL generation
    {
        "input": "translate to SQL: Show me all customers from New York",
        "output": "SELECT * FROM customers WHERE city = 'New York'",
    },
    # Code generation
    {
        "input": "python function: calculate factorial of n",
        "output": "def factorial(n): return 1 if n <= 1 else n * factorial(n-1)",
    },
]

These structured output examples hint at capabilities that would become central to later models. The ability to generate JSON, SQL, and code from natural language descriptions anticipates the code-generating and structured-reasoning capabilities of more recent large language models.

Building a Task Formatter

The following utility class handles task formatting consistently. This class encapsulates the formatting conventions we've discussed, providing a clean interface for preparing inputs across different task types.

In[31]:
Code
class T5TaskFormatter:
    """Format various NLP tasks for T5 text-to-text processing."""

    @staticmethod
    def sentiment(text: str) -> str:
        """Format text for sentiment classification."""
        return f"sst2 sentence: {text}"

    @staticmethod
    def translation(text: str, source_lang: str, target_lang: str) -> str:
        """Format text for translation."""
        return f"translate {source_lang} to {target_lang}: {text}"

    @staticmethod
    def summarization(text: str, max_words: int = None) -> str:
        """Format text for summarization with optional length control."""
        if max_words:
            return f"summarize to {max_words} words: {text}"
        return f"summarize: {text}"

    @staticmethod
    def qa(question: str, context: str = None) -> str:
        """Format for question answering (open or closed book)."""
        if context:
            return f"question: {question} context: {context}"
        return f"question: {question}"

    @staticmethod
    def ner(text: str) -> str:
        """Format text for named entity recognition."""
        return f"extract entities: {text}"

    @staticmethod
    def nli(premise: str, hypothesis: str) -> str:
        """Format for natural language inference."""
        return f"mnli premise: {premise} hypothesis: {hypothesis}"

    @staticmethod
    def grammar_correction(text: str) -> str:
        """Format for grammar and spelling correction."""
        return f"correct: {text}"

    @staticmethod
    def similarity(sentence1: str, sentence2: str) -> str:
        """Format for semantic similarity."""
        return f"stsb sentence1: {sentence1} sentence2: {sentence2}"
In[32]:
Code
## Using the formatter
formatter = T5TaskFormatter()

formatted_examples = [
    formatter.sentiment("This book was incredibly engaging!"),
    formatter.translation("Good morning", "English", "Spanish"),
    formatter.qa(
        "What color is the sky?", "The sky appears blue during the day."
    ),
    formatter.nli("All birds can fly.", "Penguins are birds."),
]
Out[33]:
Console
Formatted task examples:
  sst2 sentence: This book was incredibly engaging!
  translate English to Spanish: Good morning
  question: What color is the sky? context: The sky appears blue during the day.
  mnli premise: All birds can fly. hypothesis: Penguins are birds.

Each formatted string follows a consistent pattern: a task-specific prefix followed by the input content. This consistency is key to T5's ability to route inputs to the appropriate behavior. The formatter class encapsulates these patterns, making it easy to ensure correct formatting across an application. By centralizing the formatting logic, we also make it easier to update formats if needed. Changing the formatter updates all usages automatically.

Parsing Generated Outputs

Generating text is only half the problem. For tasks with structured outputs, you need to parse the model's generations back into usable data. The model produces strings, but your application likely needs Python objects, database entries, or API responses. This parsing step bridges the gap between the model's text-based interface and the structured world of software systems.

In[34]:
Code
import re
from typing import List, Tuple, Dict, Any


class T5OutputParser:
    """Parse T5 generated outputs back into structured data."""

    @staticmethod
    def parse_classification(output: str, valid_labels: List[str]) -> str:
        """Parse classification output, handling slight variations."""
        output = output.strip().lower()
        for label in valid_labels:
            if label.lower() in output:
                return label
        return output  # Return raw if no match

    @staticmethod
    def parse_entities(output: str) -> List[Tuple[str, str]]:
        """Parse entity list format: 'type1: entity1, type2: entity2'"""
        if not output.strip():
            return []

        entities = []
        for part in output.split(", "):
            if ": " in part:
                entity_type, entity_text = part.split(": ", 1)
                entities.append((entity_type.strip(), entity_text.strip()))

        return entities

    @staticmethod
    def parse_inline_entities(output: str) -> List[Dict[str, Any]]:
        """Parse inline markup format: '[TYPE entity text]'"""
        pattern = r"\[(\w+)\s+([^\]]+)\]"
        matches = re.findall(pattern, output)
        return [{"type": m[0], "text": m[1]} for m in matches]

    @staticmethod
    def parse_similarity_score(output: str) -> float:
        """Parse numeric similarity score."""
        try:
            # Handle outputs like "4.2" or "score: 4.2"
            numbers = re.findall(r"\d+\.?\d*", output)
            if numbers:
                return float(numbers[0])
        except ValueError:
            pass
        return 0.0
In[35]:
Code
## Test the parsers
parser = T5OutputParser()

## Classification parsing
sentiment_output = "positive"
parsed_sentiment = parser.parse_classification(
    sentiment_output, ["positive", "negative"]
)

## Entity parsing
ner_output = (
    "person: Marie Curie, location: Paris, organization: Sorbonne University"
)
parsed_entities = parser.parse_entities(ner_output)

## Similarity parsing
similarity_output = "3.8"
parsed_score = parser.parse_similarity_score(similarity_output)
Out[36]:
Console
Sentiment: 'positive' → positive

Entities: 'person: Marie Curie, location: Paris, organization: Sorbonne University'
  - person: Marie Curie
  - location: Paris
  - organization: Sorbonne University

Similarity score: '3.8' → 3.8

The parsers successfully convert raw text outputs back into structured Python data. Classification outputs map to discrete labels, entity strings become lists of (type, text) tuples, and numeric scores parse to floats. Robust parsing is essential in production systems because model outputs may occasionally deviate from expected formats. A well-designed parser handles variations gracefully. It normalizes outputs where possible and fails gracefully when the output is truly unparseable.

Multitask Training

A key consequence of unified task formatting is the ability to train on multiple tasks simultaneously. This is not just a convenience. It's a different training paradigm that enables positive transfer between tasks. A single training batch can contain translation, summarization, classification, and question answering examples.

In[37]:
Code
## Multitask training batch
multitask_batch = [
    # Translation
    {
        "input": "translate English to German: Hello world",
        "output": "Hallo Welt",
    },
    # Classification
    {"input": "sst2 sentence: Great movie!", "output": "positive"},
    # Summarization
    {
        "input": "summarize: Long article about climate change...",
        "output": "Climate change accelerates...",
    },
    # QA
    {
        "input": "question: What year? context: Founded in 1998.",
        "output": "1998",
    },
]

## All examples use the same model, loss function, and update procedure
Out[38]:
Visualization
Task proportions in T5's multitask training data.
Task proportions in T5's multitask training data.
Sample training batch composition showing task mixing.
Sample training batch composition showing task mixing.

T5's original training mixed examples from many tasks, with task proportions tuned to balance learning. This multitask setup enables the model to develop robust representations that transfer across tasks—skills learned from translation may help summarization, and reasoning from QA may improve classification. The shared encoder learns representations that capture linguistic information useful across tasks, while the decoder learns flexible generation strategies.

The intuition behind why multitask learning helps is that different tasks emphasize different aspects of language understanding. Translation forces the model to deeply understand semantics (to preserve meaning across languages). Summarization teaches compression and salience detection. QA develops reasoning and fact retrieval. Classification hones the ability to capture document-level properties. When these tasks share parameters, the model must develop representations that serve all these needs. Such representations tend to be richer and more robust than those learned for any single task.

Limitations and Practical Considerations

While the text-to-text paradigm is elegant, it introduces several challenges that practitioners must navigate. Understanding these limitations helps in deciding when the paradigm is appropriate and how to mitigate its weaknesses:

  • Generation overhead: For simple classification, generating text tokens is more expensive than predicting a single softmax over labels. A classifier might output a 2-dimensional probability vector. T5 must autoregressively generate the label text. For high-volume classification applications, this overhead matters. A traditional classifier might process thousands of examples per second, while a generative approach might be an order of magnitude slower.

  • Output format consistency: The model might generate valid but unexpected outputs. When asked for sentiment, it might output "positive" or "very positive" or "I think positive" depending on training data variations. Robust parsing and validation are essential. In production systems, you'll need to handle malformed outputs gracefully, whether by retry, fallback, or flagging for human review.

  • Error accumulation in structured output: When generating complex structured outputs like entity lists or JSON, early errors can cascade. If the model makes a formatting mistake partway through, the remainder may be unparseable. This differs fundamentally from traditional structured prediction, where each output position is typically independent.

  • Vocabulary constraints: T5's SentencePiece vocabulary affects what outputs are efficient to generate. Uncommon labels or domain-specific terms may require multiple tokens, introducing potential for errors. If your task requires outputting rare technical terms, the model must correctly generate multiple subword tokens in sequence, and each token is an opportunity for error.

  • Length bias: The model may learn spurious correlations between task prefixes and output lengths. If all sentiment training examples have one-word outputs, the model might struggle with more nuanced labels. Training data diversity is important not just in content but in format.

Despite these challenges, the benefits of unification generally outweigh the costs for most applications. The ability to fine-tune a single model on diverse tasks, share representations across domains, and handle new tasks with minimal architectural changes has advanced NLP system design. This text-to-text approach laid groundwork for the instruction-following capabilities we'll explore in later parts of this book.

Key Parameters

The key parameters for T5 task formatting are:

  • max_length: Maximum input sequence length for tokenization. Longer inputs are truncated. T5 variants support different maximum lengths (512 for t5-small/base, 1024 for larger variants).
  • max_new_tokens: Maximum number of tokens to generate in the output. Controls output length and inference time.
  • num_beams: Number of beams for beam search during generation. Higher values explore more candidates but increase computation.
  • skip_special_tokens: Whether to remove special tokens (like </s>) when decoding generated output to text.

Summary

T5's text-to-text paradigm transforms NLP by unifying all tasks into sequence-to-sequence generation. Task prefixes signal which operation to perform, while consistent formatting enables multitask training and transfer learning. Classification becomes generating label text, NER becomes listing or marking entities, and QA becomes generating answers from questions.

The key insights from this chapter:

  • Task prefixes act as routing instructions, conditioning the model on which transformation to apply
  • Classification as generation works by having the model output label text, with probabilities recoverable from decoder logits
  • NER as generation can use either entity listing or inline markup formats, each with different tradeoffs
  • QA naturally fits text-to-text, enabling both extractive and abstractive answers
  • Multitask training becomes trivial since all tasks share the same input/output format
  • Robust output parsing is essential since generated text may vary from expected formats

The next chapter introduces BART, another encoder-decoder model that takes a different approach to pre-training while maintaining similar flexibility for downstream tasks.

Quiz

Ready to test your understanding? Take this quick quiz to reinforce what you've learned about T5's text-to-text task formatting.

Loading component...

Reference

BIBTEXAcademic
@misc{t5taskformattingtexttotextnlpunification, author = {Michael Brenndoerfer}, title = {T5 Task Formatting: Text-to-Text NLP Unification}, year = {2025}, url = {https://mbrenndoerfer.com/writing/t5-task-formatting-text-to-text-nlp}, organization = {mbrenndoerfer.com}, note = {Accessed: 2025-12-25} }
APAAcademic
Michael Brenndoerfer (2025). T5 Task Formatting: Text-to-Text NLP Unification. Retrieved from https://mbrenndoerfer.com/writing/t5-task-formatting-text-to-text-nlp
MLAAcademic
Michael Brenndoerfer. "T5 Task Formatting: Text-to-Text NLP Unification." 2025. Web. 12/25/2025. <https://mbrenndoerfer.com/writing/t5-task-formatting-text-to-text-nlp>.
CHICAGOAcademic
Michael Brenndoerfer. "T5 Task Formatting: Text-to-Text NLP Unification." Accessed 12/25/2025. https://mbrenndoerfer.com/writing/t5-task-formatting-text-to-text-nlp.
HARVARDAcademic
Michael Brenndoerfer (2025) 'T5 Task Formatting: Text-to-Text NLP Unification'. Available at: https://mbrenndoerfer.com/writing/t5-task-formatting-text-to-text-nlp (Accessed: 12/25/2025).
SimpleBasic
Michael Brenndoerfer (2025). T5 Task Formatting: Text-to-Text NLP Unification. https://mbrenndoerfer.com/writing/t5-task-formatting-text-to-text-nlp