Encoder-Decoder Architecture: Cross-Attention & Sequence-to-Sequence Transformers

Michael Brenndoerfer · Updated June 20, 2025 · 41 min read

Master the encoder-decoder transformer architecture that powers T5 and machine translation. Learn cross-attention mechanism, information flow between encoder and decoder, and when to choose encoder-decoder over other architectures.


Encoder-Decoder Architecture

The previous chapters explored encoder-only and decoder-only transformers as specialized tools: encoders for understanding, decoders for generation. But the original transformer from "Attention Is All You Need" was neither. It combined both components into a unified architecture designed for a specific challenge: transforming one sequence into another where input and output can have completely different lengths, vocabularies, and structures.

Machine translation epitomizes this challenge. "The quick brown fox" has four words, and its German translation "Der schnelle braune Fuchs" happens to preserve that length, but a Japanese rendering might use a different number of tokens and a completely different structure. Summarization compresses a 1000-word article into a 50-word abstract. Question answering takes a question and a passage, then produces an answer span. These tasks share a common pattern: process an input sequence, understand it deeply, then generate an output sequence that relates to but differs from the input.

The encoder-decoder architecture handles this by dividing labor. The encoder builds rich, bidirectional representations of the input. The decoder generates the output autoregressively, attending both to its own previous outputs and to the encoder's representations through a mechanism called cross-attention. This chapter explores how these components interact, the mathematics of cross-attention, and when encoder-decoder models outperform their encoder-only or decoder-only alternatives.

The Complete Transformer Architecture

The original transformer architecture consists of three main components that work together to transform input sequences into output sequences:

  1. Encoder stack: A series of encoder blocks that process the source sequence bidirectionally, producing contextualized representations for every input position.

  2. Decoder stack: A series of decoder blocks that generate the target sequence autoregressively. Each decoder block contains not just self-attention (masked for causality) but also cross-attention to the encoder's output.

  3. Cross-attention mechanism: The bridge between encoder and decoder. At each decoder position, cross-attention queries the encoder's representations to gather relevant source information for generation.

This division of labor reflects a fundamental insight: understanding input requires different processing than generating output. The encoder sees everything at once and can use bidirectional context. The decoder must respect temporal ordering since it generates tokens sequentially, but it benefits from full access to the encoder's understanding of the source.

Out[3]:
Visualization
Diagram showing encoder stack on left processing source tokens, connected via cross-attention arrows to decoder stack on right generating target tokens.
High-level view of the transformer encoder-decoder architecture. The encoder processes the source sequence bidirectionally, producing representations that the decoder accesses through cross-attention while generating the target sequence autoregressively.

Cross-Attention: The Bridge Between Encoder and Decoder

Consider the challenge of translation. When generating the French word "chat," how does the decoder know to look at the English word "cat" rather than "the" or "sat"? The encoder has produced rich representations of every source word, but the decoder needs a way to selectively access the right information at each generation step.

This is precisely what cross-attention accomplishes. It creates a dynamic, learned connection between the decoder's current state and the encoder's representations, allowing each generated token to "query" the source sequence and retrieve relevant information.

Cross-Attention vs Self-Attention

In self-attention, queries, keys, and values all come from the same sequence, with tokens attending to other tokens within their own context. Cross-attention breaks this symmetry: queries come from the decoder (what information am I looking for?), while keys and values come from the encoder (what information is available in the source?). This asymmetry is what enables the decoder to search for and retrieve relevant source information at each generation step.

The Cross-Attention Mechanism

To understand cross-attention mathematically, we need to track two distinct sequences flowing through the architecture. The decoder maintains hidden states $\mathbf{H}_{\text{dec}} \in \mathbb{R}^{m \times d}$ for the $m$ target positions generated so far. The encoder has already produced its output $\mathbf{H}_{\text{enc}} \in \mathbb{R}^{n \times d}$ for all $n$ source positions. The key insight is that these sequences can have completely different lengths. We might translate a 10-word English sentence into a 15-word French sentence, or summarize a 1000-token document into 50 tokens.

Cross-attention connects these two sequences through three projections, each serving a distinct purpose:

$$\mathbf{Q} = \mathbf{H}_{\text{dec}} \mathbf{W}^Q, \quad \mathbf{K} = \mathbf{H}_{\text{enc}} \mathbf{W}^K, \quad \mathbf{V} = \mathbf{H}_{\text{enc}} \mathbf{W}^V$$

where:

  • $\mathbf{H}_{\text{dec}} \in \mathbb{R}^{m \times d}$: decoder hidden states, one per target position
  • $\mathbf{H}_{\text{enc}} \in \mathbb{R}^{n \times d}$: encoder output, one per source position
  • $\mathbf{W}^Q, \mathbf{W}^K \in \mathbb{R}^{d \times d_k}$ and $\mathbf{W}^V \in \mathbb{R}^{d \times d_v}$: learned projection matrices
  • $\mathbf{Q} \in \mathbb{R}^{m \times d_k}$: query matrix with $m$ queries from the decoder
  • $\mathbf{K} \in \mathbb{R}^{n \times d_k}$: key matrix with $n$ keys from the encoder
  • $\mathbf{V} \in \mathbb{R}^{n \times d_v}$: value matrix with $n$ values from the encoder

The asymmetry is crucial: queries come from the decoder, while keys and values come from the encoder. Think of it as the decoder asking questions (queries) and the encoder providing a searchable index (keys) along with the actual information to retrieve (values).

With these projections in place, the attention computation proceeds through three steps. First, we measure compatibility between every decoder query and every encoder key. Then we convert these compatibility scores into attention weights. Finally, we use these weights to compute a weighted sum of encoder values:

$$\text{CrossAttention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_k}}\right)\mathbf{V}$$

Let's unpack each component:

  1. Compatibility scores $\mathbf{Q}\mathbf{K}^T \in \mathbb{R}^{m \times n}$: This matrix multiplication compares each decoder query against all encoder keys. Entry $(i, j)$ measures how strongly decoder position $i$ should attend to encoder position $j$. Higher values indicate stronger relevance.

  2. Scaling by $\sqrt{d_k}$: As dimension $d_k$ grows, dot products tend to grow in magnitude, pushing softmax toward extreme values (near 0 or 1). Dividing by $\sqrt{d_k}$ keeps gradients stable during training.

Out[4]:
Visualization
Histogram showing unscaled dot products spreading wider for larger dimensions.
Without scaling, dot products grow with dimension d, concentrating softmax outputs near 0 or 1.
Histogram showing scaled dot products remaining consistent across dimensions.
With square root scaling, scores remain well-distributed regardless of dimension, enabling stable gradients.
  3. Softmax normalization: Converts each row of scores into a probability distribution. After softmax, each decoder position has attention weights over the source that sum to 1. These weights represent "how much" of each source position to incorporate.

  4. Weighted value sum: The final matrix multiplication retrieves information from the encoder. Each decoder position gets a weighted combination of encoder values, where the weights come from the attention distribution.

The key geometric insight is that $\mathbf{Q}\mathbf{K}^T$ produces an $m \times n$ rectangular matrix, not a square matrix. Each of the $m$ decoder positions computes attention weights over all $n$ encoder positions. This rectangular attention matrix is what distinguishes cross-attention from self-attention and enables sequence-to-sequence transformation between sequences of different lengths.

In[5]:
Code
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x_max = np.max(x, axis=axis, keepdims=True)
    exp_x = np.exp(x - x_max)
    return exp_x / np.sum(exp_x, axis=axis, keepdims=True)


def cross_attention(Q, K, V):
    """
    Compute cross-attention between decoder queries and encoder keys/values.

    Args:
        Q: Decoder queries of shape (m, d_k) - one per target position
        K: Encoder keys of shape (n, d_k) - one per source position
        V: Encoder values of shape (n, d_v) - one per source position

    Returns:
        Output of shape (m, d_v) and attention weights of shape (m, n)
    """
    d_k = Q.shape[-1]

    # Compute attention scores: (m, n)
    scores = Q @ K.T / np.sqrt(d_k)

    # No masking needed - decoder can attend to all encoder positions
    weights = softmax(scores, axis=-1)

    # Weighted sum of values: (m, d_v)
    output = weights @ V

    return output, weights


# Example: 4 decoder positions attending to 6 encoder positions
m, n, d_k = 4, 6, 64

# Simulate decoder queries (from current decoder hidden states)
H_dec = np.random.randn(m, d_k) * 0.1
W_q = np.random.randn(d_k, d_k) * 0.1
Q = H_dec @ W_q

# Simulate encoder outputs (already computed, fixed)
H_enc = np.random.randn(n, d_k) * 0.1
W_k = np.random.randn(d_k, d_k) * 0.1
W_v = np.random.randn(d_k, d_k) * 0.1
K = H_enc @ W_k
V = H_enc @ W_v

output, attn_weights = cross_attention(Q, K, V)
Out[6]:
Console
Cross-Attention Dimensions
---------------------------------------------
Decoder hidden states (H_dec): (4, 64)
Encoder hidden states (H_enc): (6, 64)

Queries from decoder (Q): (4, 64)
Keys from encoder (K): (6, 64)
Values from encoder (V): (6, 64)

Attention weights (m × n): (4, 6)
Output (m × d_v): (4, 64)

Each decoder position attends to all encoder positions.
Row sums of attention weights: [1. 1. 1. 1.]

The output confirms cross-attention produces a rectangular $4 \times 6$ attention weight matrix rather than a square one. Each decoder position computes a probability distribution over all 6 encoder positions, which is why the row sums equal exactly 1.0. This demonstrates the key insight: queries from 4 target positions attend to keys from 6 source positions, creating asymmetric attention that lets each generated word "look at" the entire input sequence.

Out[7]:
Visualization
Heatmap showing cross-attention weights between 4 decoder and 6 encoder positions.
Cross-attention weight matrix showing how each decoder position (row) attends to encoder positions (column). Unlike self-attention, this matrix is rectangular when source and target lengths differ. Each row sums to 1, representing where each decoder position looks in the source sequence.

What Cross-Attention Learns

The remarkable property of cross-attention is that it learns meaningful alignments entirely from the end task, without any explicit supervision about which source words correspond to which target words. During training, gradients flow backward from the translation loss, gradually shaping the attention weights to focus on the most relevant source positions.

When translating "The cat sat" to "Le chat était assis," a trained model learns attention patterns that reflect linguistic structure:

  • When generating "chat" (French for cat), attend strongly to "cat" in the source, a direct lexical correspondence
  • When generating "était assis" (was sitting), attend primarily to "sat", capturing a one-to-many mapping where one English word expands to two French words
  • When generating "Le," attend to context that determines grammatical gender, since the model learns that "cat" requires the masculine article

This learned alignment emerges from the training signal alone. The model discovers through thousands of examples that attending to the right source positions helps produce correct translations. In effect, cross-attention replaces the explicit alignment models used in pre-neural statistical machine translation, while being more flexible since it can capture soft, distributed alignments rather than hard one-to-one correspondences.
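
Because the attention weights form a full probability distribution over source positions, they encode a soft alignment, and a hard alignment can be read off simply by taking the argmax of each row. The sketch below applies this to the attn_weights computed in the earlier cross-attention example; since those weights came from random, untrained projections, the resulting links are arbitrary and only illustrate the mechanics that a trained model fills with meaningful correspondences.

# Read a hard alignment off the soft cross-attention weights of shape (m, n).
# With untrained random projections the links are arbitrary; in a trained
# translation model the same recipe recovers word-alignment-like pairs.
hard_alignment = attn_weights.argmax(axis=-1)  # best source position per target position

for tgt_pos, src_pos in enumerate(hard_alignment):
    weight = attn_weights[tgt_pos, src_pos]
    print(f"target position {tgt_pos} -> source position {src_pos} (weight {weight:.2f})")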

Out[8]:
Visualization
Diagram showing alignment arrows from target tokens to source tokens based on attention weights.
Illustration of how cross-attention creates soft alignment between source and target tokens. When generating each target word, the decoder attends to relevant source words. This alignment is learned end-to-end without explicit supervision.

Anatomy of a Decoder Block with Cross-Attention

Understanding the decoder block requires recognizing that it must accomplish two distinct but interrelated tasks: it must generate coherent, grammatical output (like any language model), while also staying faithful to the source sequence (which distinguishes translation from free generation). The architecture addresses these requirements through a carefully ordered sequence of three sublayers.

Sublayer 1: Masked Self-Attention. The decoder first attends to its own previous outputs. When generating the fourth target word, it considers the first three words it has already produced. This builds a representation of the generation context so far. The attention is masked (causal) to prevent the decoder from seeing future tokens, which is essential for autoregressive generation where we produce one token at a time.

Sublayer 2: Cross-Attention. Armed with context about what it has generated, the decoder now queries the encoder. This is where source information enters. The decoder's refined representation serves as queries, and the encoder's output provides keys and values. The cross-attention sublayer answers: "Given what I've generated so far, what source information is relevant for the next token?"

Sublayer 3: Feed-Forward Network. Finally, a position-wise FFN applies nonlinear transformation to each position independently. This adds representational capacity and allows the model to process the combined information from self-attention and cross-attention.

The ordering matters: self-attention establishes generation context, cross-attention enriches it with source information, and the FFN transforms the result. Each sublayer includes residual connections and layer normalization, ensuring stable gradient flow through deep stacks.

In[9]:
Code
def rms_norm(x, gamma, eps=1e-6):
    """RMS Layer Normalization."""
    rms = np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)
    return gamma * (x / rms)


def gelu(x):
    """Gaussian Error Linear Unit activation."""
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))


def create_causal_mask(seq_len):
    """Create causal mask for decoder self-attention."""
    mask = np.triu(np.ones((seq_len, seq_len)) * float("-inf"), k=1)
    return mask


class MultiHeadAttention:
    """Multi-head attention supporting both self and cross attention."""

    def __init__(self, d_model, n_heads):
        self.d_model = d_model
        self.n_heads = n_heads
        self.d_k = d_model // n_heads

        scale = np.sqrt(2.0 / (d_model + self.d_k))
        self.W_q = np.random.randn(d_model, d_model) * scale
        self.W_k = np.random.randn(d_model, d_model) * scale
        self.W_v = np.random.randn(d_model, d_model) * scale
        self.W_o = np.random.randn(d_model, d_model) * scale

    def __call__(self, query_input, key_value_input=None, mask=None):
        """
        Apply multi-head attention.

        For self-attention: key_value_input is None, uses query_input for K, V
        For cross-attention: key_value_input provides K, V from encoder
        """
        if key_value_input is None:
            key_value_input = query_input

        q_len = query_input.shape[0]
        kv_len = key_value_input.shape[0]

        # Project to Q, K, V
        Q = query_input @ self.W_q
        K = key_value_input @ self.W_k
        V = key_value_input @ self.W_v

        # Reshape for multi-head
        Q = Q.reshape(q_len, self.n_heads, self.d_k).transpose(1, 0, 2)
        K = K.reshape(kv_len, self.n_heads, self.d_k).transpose(1, 0, 2)
        V = V.reshape(kv_len, self.n_heads, self.d_k).transpose(1, 0, 2)

        # Compute attention scores
        scores = Q @ K.transpose(0, 2, 1) / np.sqrt(self.d_k)

        if mask is not None:
            scores = scores + mask

        weights = softmax(scores, axis=-1)
        attended = weights @ V

        # Reshape and project
        attended = attended.transpose(1, 0, 2).reshape(q_len, self.d_model)
        output = attended @ self.W_o

        return output, weights


class FeedForward:
    """Position-wise feed-forward network."""

    def __init__(self, d_model, d_ff):
        scale1 = np.sqrt(2.0 / (d_model + d_ff))
        scale2 = np.sqrt(2.0 / (d_ff + d_model))

        self.W1 = np.random.randn(d_model, d_ff) * scale1
        self.b1 = np.zeros(d_ff)
        self.W2 = np.random.randn(d_ff, d_model) * scale2
        self.b2 = np.zeros(d_model)

    def __call__(self, x):
        hidden = gelu(x @ self.W1 + self.b1)
        return hidden @ self.W2 + self.b2


class EncoderDecoderBlock:
    """
    Decoder block with cross-attention for encoder-decoder architecture.

    Contains three sublayers:
    1. Masked self-attention (causal)
    2. Cross-attention to encoder
    3. Feed-forward network
    """

    def __init__(self, d_model, n_heads, d_ff):
        self.d_model = d_model

        # Three attention/FFN sublayers
        self.self_attn = MultiHeadAttention(d_model, n_heads)
        self.cross_attn = MultiHeadAttention(d_model, n_heads)
        self.ffn = FeedForward(d_model, d_ff)

        # Three layer norms (pre-norm configuration)
        self.norm1_gamma = np.ones(d_model)
        self.norm2_gamma = np.ones(d_model)
        self.norm3_gamma = np.ones(d_model)

    def __call__(self, x, encoder_output, self_attn_mask=None):
        """
        Forward pass through decoder block.

        Args:
            x: Decoder input of shape (tgt_len, d_model)
            encoder_output: Encoder output of shape (src_len, d_model)
            self_attn_mask: Causal mask for self-attention

        Returns:
            Output of shape (tgt_len, d_model)
        """
        # Sublayer 1: Masked self-attention
        normed = rms_norm(x, self.norm1_gamma)
        self_attn_out, _ = self.self_attn(normed, mask=self_attn_mask)
        x = x + self_attn_out

        # Sublayer 2: Cross-attention to encoder
        normed = rms_norm(x, self.norm2_gamma)
        cross_attn_out, cross_weights = self.cross_attn(
            normed, key_value_input=encoder_output
        )
        x = x + cross_attn_out

        # Sublayer 3: Feed-forward
        normed = rms_norm(x, self.norm3_gamma)
        ffn_out = self.ffn(normed)
        x = x + ffn_out

        return x, cross_weights
Out[10]:
Console
Encoder-Decoder Block Test
--------------------------------------------------
Encoder output (source): (6, 256)
Decoder input (target): (4, 256)
Decoder output: (4, 256)

Cross-attention weights: (8, 4, 6)
  (n_heads, tgt_len, src_len)

Each target position attends to all source positions.
Cross-attention weight sums (should be 1.0):
  [1. 1. 1. 1.]

The decoder block successfully processes target embeddings while attending to encoder output. The cross-attention weights have shape (n_heads, tgt_len, src_len), showing that each of the 8 attention heads learns independent patterns for how target positions query source positions. Weight sums of 1.0 across each target position confirm proper softmax normalization. The output maintains the same shape as the input (tgt_len, d_model), ready to be passed to subsequent decoder layers.

Out[11]:
Visualization
Heatmap showing head 1 attention weights.
Head 1 cross-attention pattern.
Heatmap showing head 2 attention weights.
Head 2 cross-attention pattern.
Heatmap showing head 3 attention weights.
Head 3 cross-attention pattern.
Heatmap showing head 4 attention weights.
Head 4 cross-attention pattern.

Each attention head learns distinct patterns for how target positions attend to source positions. Some heads may specialize in local alignment while others capture longer-range dependencies.

Building the Complete Encoder-Decoder Transformer

With the individual components understood (cross-attention for source access, masked self-attention for autoregressive generation, and feed-forward networks for transformation), we can now assemble the complete architecture.

The full encoder-decoder transformer stacks multiple blocks on each side. The encoder builds increasingly abstract representations of the source through successive bidirectional self-attention layers. The decoder mirrors this depth, with each layer first refining its generation context (self-attention), then consulting the encoder's final output (cross-attention), and finally applying nonlinear transformation (FFN).

A crucial efficiency emerges from this design: the encoder processes the source sequence exactly once. Its output is then reused by every decoder layer at every generation step. When generating a 50-word summary, the decoder queries the same encoder representations 50 times, once per generated token, without recomputing them.

In[12]:
Code
class EncoderBlock:
    """Standard encoder block with bidirectional self-attention."""

    def __init__(self, d_model, n_heads, d_ff):
        self.d_model = d_model
        self.self_attn = MultiHeadAttention(d_model, n_heads)
        self.ffn = FeedForward(d_model, d_ff)
        self.norm1_gamma = np.ones(d_model)
        self.norm2_gamma = np.ones(d_model)

    def __call__(self, x, mask=None):
        # Self-attention (bidirectional, no causal mask)
        normed = rms_norm(x, self.norm1_gamma)
        attn_out, _ = self.self_attn(normed, mask=mask)
        x = x + attn_out

        # Feed-forward
        normed = rms_norm(x, self.norm2_gamma)
        ffn_out = self.ffn(normed)
        x = x + ffn_out

        return x


class TransformerEncoderDecoder:
    """
    Complete encoder-decoder transformer.

    Follows the original architecture from "Attention Is All You Need."
    """

    def __init__(
        self,
        src_vocab_size,
        tgt_vocab_size,
        d_model,
        n_heads,
        d_ff,
        n_encoder_layers,
        n_decoder_layers,
        max_seq_len,
    ):
        self.d_model = d_model
        self.n_encoder_layers = n_encoder_layers
        self.n_decoder_layers = n_decoder_layers

        # Embeddings
        self.src_embedding = np.random.randn(src_vocab_size, d_model) * 0.02
        self.tgt_embedding = np.random.randn(tgt_vocab_size, d_model) * 0.02
        self.position_embedding = np.random.randn(max_seq_len, d_model) * 0.02

        # Encoder stack
        self.encoder_layers = [
            EncoderBlock(d_model, n_heads, d_ff)
            for _ in range(n_encoder_layers)
        ]
        self.encoder_norm_gamma = np.ones(d_model)

        # Decoder stack
        self.decoder_layers = [
            EncoderDecoderBlock(d_model, n_heads, d_ff)
            for _ in range(n_decoder_layers)
        ]
        self.decoder_norm_gamma = np.ones(d_model)

        # Output projection (to target vocabulary)
        self.output_projection = np.random.randn(d_model, tgt_vocab_size) * 0.02

    def encode(self, src_tokens):
        """
        Encode source sequence.

        Args:
            src_tokens: Source token indices of shape (src_len,)

        Returns:
            Encoder output of shape (src_len, d_model)
        """
        src_len = len(src_tokens)

        # Embed tokens and add positions
        x = self.src_embedding[src_tokens]
        x = x + self.position_embedding[:src_len]

        # Pass through encoder layers
        for layer in self.encoder_layers:
            x = layer(x)

        # Final normalization
        x = rms_norm(x, self.encoder_norm_gamma)

        return x

    def decode(self, tgt_tokens, encoder_output):
        """
        Decode target sequence given encoder output.

        Args:
            tgt_tokens: Target token indices of shape (tgt_len,)
            encoder_output: Encoder output of shape (src_len, d_model)

        Returns:
            Logits of shape (tgt_len, tgt_vocab_size)
        """
        tgt_len = len(tgt_tokens)

        # Embed tokens and add positions
        x = self.tgt_embedding[tgt_tokens]
        x = x + self.position_embedding[:tgt_len]

        # Causal mask for self-attention
        causal_mask = create_causal_mask(tgt_len)

        # Pass through decoder layers
        cross_attention_weights = []
        for layer in self.decoder_layers:
            x, cross_weights = layer(x, encoder_output, causal_mask)
            cross_attention_weights.append(cross_weights)

        # Final normalization
        x = rms_norm(x, self.decoder_norm_gamma)

        # Project to vocabulary
        logits = x @ self.output_projection

        return logits, cross_attention_weights

    def forward(self, src_tokens, tgt_tokens):
        """Complete forward pass."""
        encoder_output = self.encode(src_tokens)
        logits, cross_weights = self.decode(tgt_tokens, encoder_output)
        return logits, cross_weights
Out[13]:
Console
Encoder-Decoder Transformer Test
==================================================
Source sequence length: 6
Target sequence length: 4

Encoder layers: 4
Decoder layers: 4
Model dimension: 256

Logits shape: (4, 12000)
  (tgt_len × tgt_vocab_size)

Cross-attention weights per layer: (8, 4, 6)
  (n_heads × tgt_len × src_len)

The complete encoder-decoder transformer successfully processes source and target sequences. The encoder compresses the 6-token source into 256-dimensional representations, which the decoder queries through cross-attention while generating 4 target tokens. Each decoder layer produces cross-attention weights of shape (8, 4, 6), representing how 8 heads × 4 target positions attend to 6 source positions. The final logits of shape (4, 12000) hold one score per target-vocabulary entry for each generated position; a softmax over each row turns them into next-token probability distributions.

Information Flow in Encoder-Decoder Models

The encoder-decoder architecture's power comes from a carefully designed asymmetry in how information flows. The encoder and decoder process their sequences differently, and cross-attention creates a one-way bridge between them.

Encoder: Bidirectional Processing

The encoder enjoys complete freedom: every source position can attend to every other source position through bidirectional self-attention. When encoding "The cat sat on the mat," the representation for "cat" incorporates context from both directions: "The" before it and "sat on the mat" after it. This bidirectional view is essential for understanding meaning, since words often depend on both their left and right context for disambiguation.

The encoder runs in one pass, processing all source positions simultaneously (in parallel). By the time the decoder begins, the encoder has built rich, contextualized representations that capture the full meaning of each source word.

Decoder: Causal Self-Attention + Full Cross-Attention

The decoder operates under two different attention regimes simultaneously, and understanding this distinction is crucial:

Self-attention is causal (masked). When the decoder generates the third target word, it can only attend to the first and second target words, never to future tokens that haven't been generated yet. This respects the autoregressive nature of generation: we can't condition on outputs we haven't produced.

Cross-attention is unrestricted. When generating that same third target word, the decoder can attend to any source position, including positions near the end of the source sequence. There's no masking here because the source is fully available; it was processed by the encoder before generation began.

This asymmetry enables a powerful capability: the decoder can "look ahead" in the source sequence while generating sequentially in the target. When translating "The cat sat" to "Le chat était assis," the decoder can attend to "cat" (and even to "sat," further along in the source) while generating the very first word "Le," which is how it determines that the article must be masculine.
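
A quick way to see the two regimes side by side is to print the masks involved. Decoder self-attention uses the causal mask built by create_causal_mask, while cross-attention applies no mask at all, which we can represent as an all-zeros additive mask. A minimal sketch using the toy lengths from earlier:

tgt_len, src_len = 4, 6

# Causal mask for decoder self-attention: -inf above the diagonal blocks future tokens
self_mask = create_causal_mask(tgt_len)
print("Self-attention mask (tgt x tgt):")
print(self_mask)

# Cross-attention is unmasked: every target position may attend to every
# source position, so the additive "mask" is effectively all zeros
cross_mask = np.zeros((tgt_len, src_len))
print("Cross-attention mask (tgt x src):")
print(cross_mask)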

Out[14]:
Visualization
Lower triangular matrix showing causal self-attention pattern in decoder.
Decoder self-attention is causal. Each target position can only attend to previous target positions and itself. This prevents information leakage during autoregressive generation.
Full rectangular matrix showing cross-attention from all target to all source positions.
Cross-attention is unrestricted. Each target position can attend to all source positions, enabling the decoder to access relevant source information at any generation step.

Layer-by-Layer Refinement

Both encoder and decoder stack multiple layers, and depth enables increasingly abstract representations. But the pattern of refinement differs between them.

Each encoder layer refines source representations by allowing every position to gather information from all other positions. Early layers might resolve local ambiguities; later layers build higher-level semantic representations. A 6-layer encoder gives each word 6 opportunities to absorb context from the entire source.

Each decoder layer follows the three-step sequence we saw earlier:

  1. Refine target representations based on target context so far (masked self-attention)
  2. Enrich target representations with relevant source information (cross-attention)
  3. Apply nonlinear transformation (FFN)

The critical design choice is that cross-attention happens at every decoder layer, not just once. This means source information can influence target representations repeatedly, at multiple levels of abstraction. Early decoder layers might capture lexical correspondences, such as "chat" attending to "cat." Later decoder layers might capture more abstract semantic relationships, like attending to source context that helps determine verb tense or article agreement.

Out[15]:
Visualization
Heatmap showing cross-attention weights for layer 1.
Cross-attention in Layer 1 (earliest). Shows more diffuse attention as the model gathers broad context.
Heatmap showing cross-attention weights for layer 4.
Cross-attention in Layer 4 (deepest). Shows more focused attention to specific source positions.

To quantify how attention patterns change across layers, we can measure the entropy of each target position's attention distribution. Higher entropy indicates diffuse attention (looking at many source positions), while lower entropy indicates focused attention (concentrating on a few positions).
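
The sketch below shows how such an entropy measurement can be computed from the per-layer cross-attention weights returned by the model's forward method. Because our model is randomly initialized rather than trained, the values will not show the trend in the figure; the code only illustrates the metric. It assumes model and src_tokens are still available from the test above, and the target token ids are arbitrary placeholders.

def attention_entropy(weights, eps=1e-12):
    """Entropy of each row of an (m, n) attention weight matrix."""
    return -np.sum(weights * np.log(weights + eps), axis=-1)

# Arbitrary target ids, just to produce attention weights for the untrained model
tgt_demo = np.array([0, 5, 9, 3])
_, layer_weights = model.forward(src_tokens, tgt_demo)

for layer_idx, weights in enumerate(layer_weights):
    avg = weights.mean(axis=0)  # average over heads -> (tgt_len, src_len)
    entropy = attention_entropy(avg)
    print(f"Layer {layer_idx + 1} entropy per target position: {np.round(entropy, 3)}")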

Out[16]:
Visualization
Line plot showing attention entropy decreasing across layers for different target positions.
Attention entropy across decoder layers for each target position. Early layers tend to have higher entropy (more diffuse attention), while later layers often show lower entropy (more focused on specific source positions). This progression reflects the model building from broad context to precise alignment.

T5: Text-to-Text with Encoder-Decoder

T5 (Text-to-Text Transfer Transformer) exemplifies the encoder-decoder paradigm at scale. Google's T5 treats every NLP task as a text-to-text problem: given some input text, generate some output text. Translation becomes "translate English to German: The cat sat" → "Die Katze saß." Summarization becomes "summarize: [long article]" → "[short summary]." Even classification becomes "sentiment positive or negative: This movie was great" → "positive."
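
This uniform interface means a pretrained checkpoint can be driven with nothing but a task prefix. As a minimal sketch using the Hugging Face transformers library (not used elsewhere in this chapter), with the "t5-small" checkpoint and the translation prefix from T5's published conventions:

from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model_t5 = T5ForConditionalGeneration.from_pretrained("t5-small")

# The task is specified entirely through the text prefix
inputs = tokenizer("translate English to German: The cat sat.", return_tensors="pt")
outputs = model_t5.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))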

T5 Architecture

T5 uses the standard encoder-decoder transformer architecture with relative position encodings. Both encoder and decoder share the same vocabulary and can be scaled independently. T5-Base has 220M parameters, T5-Large has 770M, T5-3B has 3 billion, and T5-11B has 11 billion parameters.

T5 Design Choices

T5 made several key architectural decisions that influenced subsequent encoder-decoder models:

  • Relative position encodings: Instead of absolute position embeddings, T5 uses relative position biases in attention, allowing better generalization to different sequence lengths.

  • Pre-norm architecture: Layer normalization applied before each sublayer rather than after, matching modern practice.

  • Shared embeddings: The same embedding matrix is used for both encoder and decoder, as well as for the output projection (tied embeddings).

  • Simplified layer norm: Removes the bias term and centering from standard layer normalization, keeping only learnable scale parameters (the same RMS-style normalization implemented as rms_norm earlier in this chapter).

Out[17]:
Console
T5 Model Configurations
======================================================================
Model        | Layers  | d_model  | d_ff     | Heads  | Params  
----------------------------------------------------------------------
T5-Small     | 6       | 512      | 2048     | 8      | 60M     
T5-Base      | 12      | 768      | 3072     | 12     | 220M    
T5-Large     | 24      | 1024     | 4096     | 16     | 770M    
T5-3B        | 24      | 1024     | 16384    | 32     | 3B      
T5-11B       | 24      | 1024     | 65536    | 128    | 11B     

Note: Encoder and decoder each have n_layers blocks.
Total layers = 2 × n_layers for encoder-decoder models.

The T5 family spans more than two orders of magnitude in parameter count, from 60M to 11B. The scaling pattern reveals interesting design choices: layer count increases modestly (6 → 12 → 24) while hidden dimensions and especially feed-forward dimensions grow substantially. T5-11B achieves its massive scale primarily through a 65,536-dimensional FFN, more than 20× larger than T5-Base's. The number of attention heads also scales with model size, with T5-11B using 128 heads to maintain fine-grained attention patterns at scale.
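
Those counts can be roughly reproduced from the table. The sketch below estimates T5-Base parameters under simplifying assumptions: it counts only the attention projections, FFN weights, and a single tied embedding matrix (assuming a vocabulary of roughly 32,000 tokens), and ignores relative position biases and layer-norm scales.

def estimate_encoder_decoder_params(n_layers, d_model, d_ff, vocab_size=32_000):
    """Rough parameter estimate for a T5-style encoder-decoder model."""
    attn = 4 * d_model * d_model       # W_q, W_k, W_v, W_o
    ffn = 2 * d_model * d_ff           # up-projection and down-projection
    encoder_block = attn + ffn         # self-attention + FFN
    decoder_block = 2 * attn + ffn     # self-attention + cross-attention + FFN
    embeddings = vocab_size * d_model  # tied input/output embedding
    return n_layers * (encoder_block + decoder_block) + embeddings

# T5-Base: 12 layers per stack, d_model=768, d_ff=3072
print(f"~{estimate_encoder_decoder_params(12, 768, 3072) / 1e6:.0f}M parameters")
# Prints roughly 223M, close to the reported 220M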

Why Encoder-Decoder for T5?

Google explored various architectures in the T5 paper and found encoder-decoder models outperformed decoder-only models on most tasks when controlled for computational budget. The encoder's ability to process input bidirectionally proved especially valuable for:

  • Tasks with long inputs: Summarization, question answering over passages
  • Tasks requiring input understanding: Classification, span extraction
  • Tasks with complex input-output alignment: Translation, structured generation

Decoder-only models excelled primarily at pure generation tasks without structured input. For the broad range of NLP tasks T5 targets, the encoder-decoder architecture provided better overall performance.

When to Use Encoder-Decoder Architecture

The choice between encoder-only, decoder-only, and encoder-decoder architectures depends on the task structure. Here's a framework for deciding:

Use Encoder-Decoder When:

  • Input and output are both sequences but structurally different: Translation, summarization, style transfer
  • The input requires deep understanding before generation begins: Question answering, data-to-text
  • Input-output alignment is complex and non-monotonic: The output may reorder, add, or drop information from the input
  • The input is much longer than the output: Summarization, where encoding the input once and referencing it repeatedly is more efficient than including it in the decoder context

Use Decoder-Only When:

  • Generation is the primary goal: Creative writing, code generation, conversational AI
  • The task is primarily about extending or continuing: Completion, elaboration
  • Prompt engineering is the primary interface: The input is a short prompt, not a structured document to be transformed
  • Simplicity and scale are priorities: Decoder-only models are architecturally simpler and have dominated recent scaling efforts

Use Encoder-Only When:

  • No generation is required: Classification, NER, extractive QA
  • The task is fundamentally about understanding: Semantic similarity, sentiment analysis
  • Efficiency at inference is critical: Encoders require one forward pass, not autoregressive generation
Out[18]:
Visualization
Comparison chart showing which architecture suits different task types.
Task-architecture alignment. Each architecture excels at different task types. Encoder-decoder models occupy the middle ground, handling tasks that require both deep input understanding and complex output generation.

Implementing Encoder-Decoder Generation

At inference time, the encoder processes the source sequence once. The decoder then generates the target sequence autoregressively, querying the cached encoder output at each step.

In[19]:
Code
def generate_with_encoder_decoder(
    model, src_tokens, max_len=50, sos_token=0, eos_token=1, temperature=1.0
):
    """
    Generate target sequence given source sequence.

    Args:
        model: Encoder-decoder transformer
        src_tokens: Source token indices
        max_len: Maximum generation length
        sos_token: Start-of-sequence token index
        eos_token: End-of-sequence token index
        temperature: Sampling temperature

    Returns:
        Generated token indices
    """
    # Encode source (done once)
    encoder_output = model.encode(src_tokens)

    # Initialize target with SOS token
    generated = [sos_token]

    for _ in range(max_len):
        # Decode current sequence
        tgt_tokens = np.array(generated)
        logits, _ = model.decode(tgt_tokens, encoder_output)

        # Get logits for next token (last position)
        next_logits = logits[-1] / temperature
        probs = softmax(next_logits)

        # Sample next token
        next_token = np.random.choice(len(probs), p=probs)
        generated.append(next_token)

        # Stop if EOS generated
        if next_token == eos_token:
            break

    return np.array(generated)


# Generate from our model
np.random.seed(123)
generated = generate_with_encoder_decoder(
    model, src_tokens, max_len=10, temperature=0.8
)
Out[20]:
Console
Encoder-Decoder Generation
---------------------------------------------
Source tokens: [42, 100, 256, 789, 1000, 500]
Generated tokens: [0, 8323, 3415, 2696, 6594, 8594, 5082, 11770, 8186, 5762, 4705]
Source length: 6
Generated length: 11

Key observation: The encoder runs once, then the
decoder generates autoregressively while attending
to the cached encoder output at each step.

The generation demonstrates the encoder-decoder inference pattern: the encoder runs once to produce cached representations, then the decoder generates tokens autoregressively while querying that cache through cross-attention. The generated sequence (10 new tokens after the start token, 11 in total) differs in length from the source (6 tokens), showing how the architecture naturally handles variable-length outputs. This "encode once, decode many times" pattern is particularly efficient for tasks like summarization, where a long input document is encoded once and then referenced repeatedly during output generation.

The generation process highlights the efficiency advantage of encoder-decoder models for tasks with long inputs. The encoder processes a 1000-word document once. The decoder then generates a 50-word summary, accessing the encoded document through cross-attention at each of the 50 generation steps. Compare this to a decoder-only approach that would need to include the full document in context at every step.

Limitations and Practical Considerations

Encoder-decoder models offer powerful capabilities for sequence transformation, but they come with trade-offs that affect when and how to use them.

Computational Considerations

The encoder and decoder are distinct modules that must be trained and deployed together. This roughly doubles the parameter count compared to using just an encoder or a decoder of the same size. For translation between N languages, you need either N(N-1) models (one per direction, so 10 languages already require 90 models) or a multilingual model that handles all directions.

Cross-attention adds computational overhead. Each decoder layer computes attention over the full encoder output, adding $O(m \times n)$ operations per layer, where:

  • $m$: the target sequence length (number of tokens generated so far)
  • $n$: the source sequence length (number of encoder positions)
  • $O(m \times n)$: the complexity of computing attention scores between all decoder-encoder position pairs

For long source sequences, this cross-attention cost can dominate computation, especially as the decoder generates many tokens.
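
To get a feel for the magnitudes, the sketch below counts only cross-attention score computations (one score per decoder-encoder position pair, per decoder layer), assuming a 12-layer decoder and ignoring self-attention, FFNs, and the constant factors a real FLOP count would include:

def cross_attention_scores(src_len, tgt_len, n_decoder_layers=12):
    """Number of query-key score computations spent on cross-attention."""
    return n_decoder_layers * tgt_len * src_len

# A 50-token summary over a 512-token vs. a 4096-token source
for n in (512, 4096):
    scores = cross_attention_scores(src_len=n, tgt_len=50)
    print(f"source length {n:5d}: {scores:,} cross-attention scores")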

Out[21]:
Visualization
Line plot comparing FLOPs for encoder-decoder vs decoder-only across different output lengths.
Attention computation cost comparison for encoder-decoder vs decoder-only models. For a source of length n=512 and varying output lengths, encoder-decoder incurs additional cross-attention cost but avoids reprocessing the source at each step. The crossover point depends on the source/target length ratio.

The Rise of Decoder-Only Models

Despite encoder-decoder models' strong performance on structured tasks, recent trends have favored decoder-only architectures at scale. GPT-4, Claude, and Llama are all decoder-only. Several factors drive this:

  • Simplicity: One architecture handles both understanding and generation
  • Scaling: Easier to scale one component than two
  • Flexibility: The same model handles diverse tasks through prompting
  • Training efficiency: Next-token prediction provides dense supervision

However, encoder-decoder models remain valuable for production systems where task structure is fixed. Translation services, summarization APIs, and document processing pipelines often use encoder-decoder models because the structured input-output relationship benefits from specialized processing.

When Encoder-Decoder Wins

The encoder-decoder architecture offers clear advantages in specific scenarios:

  • Repeated access to input: When the decoder needs to reference the input multiple times during generation, encoding once is more efficient than reprocessing.
  • Length mismatch: When output is much shorter than input (summarization) or much longer (elaboration), the asymmetric architecture is natural.
  • Non-monotonic alignment: When output order differs from input order, cross-attention learns the complex alignment better than a decoder trying to handle both tasks.

Summary

The encoder-decoder transformer architecture combines the strengths of bidirectional encoding and autoregressive decoding through the cross-attention mechanism. Key takeaways:

  • Cross-attention enables the decoder to query encoder representations, creating a bridge between source understanding and target generation. Queries come from the decoder, keys and values from the encoder.

  • Three-sublayer decoder blocks include masked self-attention for autoregressive generation, cross-attention for source access, and a feed-forward network for transformation.

  • Information flow differs between components: the encoder sees bidirectional context, the decoder's self-attention is causal, but cross-attention provides unrestricted access to the source.

  • T5 and similar models demonstrate the architecture's versatility, treating diverse NLP tasks as text-to-text problems.

  • Architecture choice depends on task structure: encoder-decoder for sequence-to-sequence transformation, decoder-only for generation, encoder-only for understanding.

The encoder-decoder paradigm from "Attention Is All You Need" remains the foundation for many production NLP systems, particularly those with structured input-output relationships. While decoder-only models dominate current scaling efforts, encoder-decoder models continue to excel at tasks that benefit from dedicated encoding and controlled generation.

Key Parameters

When configuring encoder-decoder transformers, these parameters most directly impact model capacity and performance:

  • d_model (model dimension): The dimensionality of representations in both encoder and decoder. Must be divisible by n_heads. Typical values: 256 (small), 512 (base), 768 (large), 1024+ (very large).

  • n_heads (attention heads): Number of parallel attention heads in both self-attention and cross-attention. Usually chosen so d_model / n_heads = 64 or 128. More heads enable diverse attention patterns.

  • n_encoder_layers / n_decoder_layers: Depth of encoder and decoder stacks. Often equal (6 layers each in original transformer, 12 in T5-Base). Deeper models learn more complex transformations.

  • d_ff (feed-forward dimension): Hidden dimension in FFN sublayers. Conventionally 4 × d_model. The FFN accounts for most parameters in each layer.

  • src_vocab_size / tgt_vocab_size: Source and target vocabulary sizes. May be equal (shared vocabulary) or different (separate vocabularies). Larger vocabularies reduce sequence length but increase embedding parameters.

  • max_seq_len: Maximum supported sequence length. Affects position embedding size and memory requirements. Original transformer used 512; modern models support 2048-8192+.

  • temperature (generation): Controls randomness during sampling. Lower values (0.7) produce more focused output; higher values (1.2) increase diversity. Affects generation quality and creativity trade-off.
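
As a concrete reference point, the sketch below instantiates the TransformerEncoderDecoder class from this chapter with base-scale values; the specific numbers are illustrative rather than tuned.

config = dict(
    src_vocab_size=12_000,
    tgt_vocab_size=12_000,
    d_model=512,          # must be divisible by n_heads
    n_heads=8,            # 512 / 8 = 64 dims per head
    d_ff=2048,            # 4 x d_model
    n_encoder_layers=6,
    n_decoder_layers=6,
    max_seq_len=512,
)
demo_model = TransformerEncoderDecoder(**config)

src = np.array([42, 100, 256, 789, 1000, 500])  # toy source token ids
tgt = np.array([0, 17, 923, 4511])              # toy target prefix, starting with SOS
logits, cross_weights = demo_model.forward(src, tgt)
print(logits.shape)  # (4, 12000): one row of vocabulary logits per target position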

