Decoder Architecture: Causal Masking & Autoregressive Generation

Michael Brenndoerfer · Updated June 19, 2025 · 39 min read

Master decoder-only transformers powering GPT, Llama, and modern LLMs. Learn causal masking, autoregressive generation, KV caching, and GPT-style architecture from scratch.

Decoder Architecture

The transformer decoder is the engine behind text generation. While encoders excel at understanding input sequences, decoders specialize in producing sequences, one token at a time. Every time you interact with ChatGPT, Claude, or Llama, you're witnessing a decoder at work: taking your prompt and generating a coherent response by predicting one token after another.

What makes the decoder special is its causal nature. Unlike encoders that can look at the entire input simultaneously, decoders must respect temporal order. When generating the fourth word, the model can only consider words one through three. It cannot peek ahead at words five or six because those don't exist yet. This constraint, enforced through causal masking, defines the decoder's fundamental architecture.

In this chapter, we'll explore the decoder-only design that powers modern language models like GPT, the mechanism of causal masking that enforces left-to-right generation, and how multiple decoder layers stack to create increasingly sophisticated language understanding.

The Decoder-Only Design

The original transformer from "Attention Is All You Need" (2017) used both an encoder and a decoder for machine translation. But a pivotal discovery emerged: for language modeling, you don't need the encoder at all. A stack of decoder blocks, trained to predict the next token, learns rich representations of language without any encoder component.

Decoder-Only Architecture

A decoder-only model consists of a stack of transformer blocks that process tokens autoregressively. Each block uses causal (masked) self-attention to ensure that predictions at position $t$ depend only on positions $0, 1, \ldots, t-1$.

The decoder-only approach offers several advantages:

  • Simplicity: One stack of layers instead of two (no encoder-decoder attention needed)
  • Unified architecture: The same model handles both understanding and generation
  • Scalability: Easier to scale parameters when you have a single stack
  • Pretraining efficiency: Next-token prediction provides dense supervision at every position

GPT-1, GPT-2, GPT-3, and their successors all use this decoder-only design. So do Llama, Mistral, and most open-source language models. The architecture has become the default choice for generative language AI.

Understanding Autoregressive Generation

Before diving into the architecture, let's clarify what autoregressive generation means in practice. The decoder generates text one token at a time, where each new token is conditioned on all previously generated tokens.

Given a prompt "The cat sat on the", generation proceeds as follows:

  1. Process the prompt through the decoder to get a representation for each position
  2. Use the representation at the final position to predict the next token: "mat"
  3. Append "mat" to the sequence and process again
  4. Use the new final position to predict the next token: "and"
  5. Continue until generating a stop token or reaching a length limit

Mathematically, we're factorizing the joint probability of a sequence into a product of conditional probabilities. This is known as the chain rule of probability, and it expresses the idea that generating a sequence is equivalent to making a series of next-token predictions:

$$P(x_1, x_2, \ldots, x_T) = \prod_{t=1}^{T} P(x_t | x_1, x_2, \ldots, x_{t-1})$$

where:

  • $P(x_1, x_2, \ldots, x_T)$: the joint probability of the entire sequence of $T$ tokens
  • $x_t$: the token at position $t$ in the sequence
  • $T$: the total sequence length
  • $\prod_{t=1}^{T}$: the product over all positions from 1 to $T$
  • $P(x_t | x_1, \ldots, x_{t-1})$: the conditional probability of token $x_t$ given all previous tokens

Each factor in this product represents one prediction step. The factor $P(x_1)$ predicts the first token with no context. The factor $P(x_2 | x_1)$ predicts the second token given the first. By the time we reach $P(x_T | x_1, \ldots, x_{T-1})$, we're predicting the final token given the entire preceding context.

The decoder computes each of these conditional probabilities. It takes the sequence so far, applies its transformer blocks to build contextual representations, and outputs a probability distribution over the vocabulary for the next token.
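To make the factorization concrete, here is a small illustrative sketch (not part of the chapter's model) showing how a sequence's log-probability decomposes into per-step next-token log-probabilities. The `next_token_probs` function and the toy uniform model are assumptions for illustration only:

Code
import numpy as np


def sequence_log_prob(next_token_probs, tokens):
    """
    Compute log P(x_1, ..., x_T) as the sum of per-step conditional
    log-probabilities, mirroring the chain-rule factorization.

    next_token_probs(prefix) is assumed to return a probability
    distribution over the vocabulary for the next token.
    """
    total = 0.0
    for t in range(len(tokens)):
        probs = next_token_probs(tokens[:t])  # condition on tokens before position t
        total += np.log(probs[tokens[t]])     # log P(x_t | x_1, ..., x_{t-1})
    return total


# Toy example: a uniform "model" over a 5-token vocabulary
vocab_size = 5


def uniform_model(prefix):
    return np.ones(vocab_size) / vocab_size


print(sequence_log_prob(uniform_model, [2, 0, 4]))  # 3 * log(1/5) ≈ -4.83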

The Causal Masking Requirement

To understand why causal masking is essential, consider what happens without it. In standard self-attention, every position can attend to every other position. When processing the word "cat" in "The cat sat on the mat," the model can look at "sat," "on," "the," and "mat" to help represent "cat." This bidirectional view is powerful for understanding, but it creates a fundamental problem for generation.

During generation, when we're about to predict "sat," the words "on," "the," and "mat" don't exist yet. If the model learned to rely on future context during training, it would fail at inference time. Even worse, during training on complete sequences, the model could "cheat" by peeking at the answer. Position 3 could look at position 4 to see what word should come next, trivially solving the prediction task without learning meaningful patterns.

Causal Masking

Causal masking restricts each position to attend only to itself and previous positions. This enforces a left-to-right information flow that matches the autoregressive generation process, where tokens are produced one at a time and future tokens don't exist.

The solution is to enforce a strict temporal ordering: position $t$ can only attend to positions $0, 1, \ldots, t$. This constraint, called causal masking or autoregressive masking, ensures the model learns to predict using only past context.

From Intuition to Formulation

How do we mathematically enforce "only attend to the past"? We need to modify the attention mechanism so that certain query-key pairs produce zero attention weight. The key insight is that attention weights come from a softmax over scores. If we can make certain scores extremely negative before the softmax, they'll become negligible in the final weights.

Standard scaled dot-product attention computes:

$$\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_k}}\right)\mathbf{V}$$

The term $\mathbf{Q}\mathbf{K}^T$ produces an $n \times n$ matrix of scores, where entry $(i, j)$ measures how much query $i$ should attend to key $j$. To block future positions, we add a mask matrix $\mathbf{M}$ to these scores before the softmax:

$$\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_k}} + \mathbf{M}\right)\mathbf{V}$$

where:

  • $\mathbf{Q} \in \mathbb{R}^{n \times d_k}$: query matrix containing $n$ query vectors, each of dimension $d_k$
  • $\mathbf{K} \in \mathbb{R}^{n \times d_k}$: key matrix containing $n$ key vectors, each of dimension $d_k$
  • $\mathbf{V} \in \mathbb{R}^{n \times d_v}$: value matrix containing $n$ value vectors, each of dimension $d_v$
  • $\mathbf{Q}\mathbf{K}^T \in \mathbb{R}^{n \times n}$: the raw attention score matrix, where entry $(i, j)$ measures similarity between query $i$ and key $j$
  • $\sqrt{d_k}$: scaling factor that prevents dot products from growing too large as dimension increases
  • $\mathbf{M} \in \mathbb{R}^{n \times n}$: the causal mask matrix that blocks future positions

Designing the Mask

The mask $\mathbf{M}$ must satisfy a simple requirement: add 0 to scores we want to keep, and add a very large negative number to scores we want to block. Formally:

$$M_{ij} = \begin{cases} 0 & \text{if } j \leq i \\ -\infty & \text{if } j > i \end{cases}$$

where:

  • $i$: the query position (row index), representing the token doing the attending
  • $j$: the key position (column index), representing the token being attended to
  • $M_{ij}$: the mask value added to the attention score at position $(i, j)$

The condition $j \leq i$ captures "key is at or before query," which means the token at position $j$ is in the past (or present) relative to position $i$. When this holds, we add 0, leaving the score unchanged. When $j > i$, the key is in the future, so we add $-\infty$ to block that attention path.

This creates a characteristic triangular pattern:

  • Row 0 (first token): only column 0 is allowed (the token can only see itself)
  • Row 1 (second token): columns 0 and 1 are allowed (can see itself and the first token)
  • Row $n-1$ (last token): all columns 0 through $n-1$ are allowed (can see the entire sequence)

Why $-\infty$ Creates Zero Attention

The mathematical trick is elegant. After adding the mask, we apply softmax to convert scores to attention weights. For a masked position:

$$\text{softmax}(s_{ij} + (-\infty)) = \frac{e^{s_{ij} - \infty}}{\sum_k e^{s_{ik} + M_{ik}}}$$

As the mask value approaches $-\infty$, the numerator $e^{-\infty}$ approaches 0. The denominator remains positive because unmasked positions contribute finite, positive exponentials. The result:

$$\frac{e^{-\infty}}{\text{positive sum}} = \frac{0}{\text{positive sum}} = 0$$

Those positions receive exactly zero attention weight. The token at position $i$ effectively cannot see any token at positions $i+1, i+2, \ldots, n-1$. In practice, we use a large finite value like $-10^9$ instead of true infinity, which achieves the same effect with standard floating-point arithmetic.
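A quick numeric check (an illustrative snippet, not from the original page) makes this concrete: adding a large negative value to two of four scores drives their softmax weights to zero.

Code
import numpy as np

scores = np.array([2.0, 1.0, 0.5, 0.0])    # raw attention scores for one query
mask = np.array([0.0, 0.0, -1e9, -1e9])    # block the last two (future) positions

masked = scores + mask
weights = np.exp(masked - masked.max())    # numerically stable softmax
weights /= weights.sum()
print(weights)  # ≈ [0.731, 0.269, 0.0, 0.0]: masked positions get zero weight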

Implementation

Let's implement causal masking and visualize the resulting attention pattern. The mask is simply an upper triangular matrix of negative infinities:

In[2]:
Code
import numpy as np


def create_causal_mask(seq_len):
    """
    Create a causal attention mask.

    Returns a matrix where:
    - 0 for positions that CAN be attended (j <= i)
    - -inf for positions that CANNOT be attended (j > i)
    """
    # Upper triangular matrix with -inf above the diagonal
    mask = np.triu(np.ones((seq_len, seq_len)) * float("-inf"), k=1)
    return mask


# Example: 5-token sequence
mask = create_causal_mask(5)
Out[3]:
Console
Causal Mask (5 tokens):
Rows = query position, Columns = key position
0 = can attend, -inf = blocked

Position 0: [  0  -inf -inf -inf -inf ]
Position 1: [  0    0  -inf -inf -inf ]
Position 2: [  0    0    0  -inf -inf ]
Position 3: [  0    0    0    0  -inf ]
Position 4: [  0    0    0    0    0  ]

Position 0 can only attend to itself. Position 1 can attend to positions 0 and 1. Position 4 can attend to all five positions. The triangular structure ensures strictly left-to-right information flow.

Out[4]:
Visualization
Triangular heatmap showing causal mask with lower triangle green and upper triangle red.
Causal attention mask for a 6-token sequence. Green cells indicate allowed attention (the query can attend to that key position). Red cells indicate blocked attention (future positions). Each row shows what a given position can see.

Anatomy of a Decoder Block

Now that we understand causal masking, let's see how it fits into a complete decoder block. A decoder block is structurally similar to an encoder block, with one critical difference: its self-attention layer applies causal masking. But the block is more than just masked attention; it's a carefully designed composition of components that work together to transform token representations.

The Two-Sublayer Structure

Each decoder block consists of two sublayers that serve complementary purposes:

  1. Causal Multi-Head Self-Attention: This sublayer enables cross-position communication. Each token gathers information from previous tokens (and itself), building a representation that incorporates relevant context. The causal mask ensures this communication respects temporal order.

  2. Feed-Forward Network (FFN): This sublayer provides position-wise transformation. The same two-layer MLP is applied independently to each position, adding non-linear representational capacity. Unlike attention, the FFN doesn't mix information across positions; it processes each token's representation in isolation.

These sublayers address different aspects of representation learning. Attention handles where to look in the sequence. The FFN handles how to transform what was gathered. Together, they enable the block to both aggregate context and compute complex functions of that context.

Supporting Components

Two additional components wrap each sublayer to enable deep networks:

  • Residual Connections: Skip connections around each sublayer create additive shortcuts. Instead of computing $y = f(x)$, the block computes $y = x + f(x)$. This means the sublayer only needs to learn the delta, or refinement, to add to the representation. Residuals also provide direct gradient paths that bypass potentially problematic nonlinearities.

  • Layer Normalization: Normalizes activations to stabilize training. Modern decoders use Pre-LN architecture, applying normalization before each sublayer rather than after. Pre-LN produces more stable gradients at initialization and enables training without careful learning rate warmup.

The data flow through a Pre-LN decoder block follows this pattern:

  1. Normalize → Attention → Add residual
  2. Normalize → FFN → Add residual

Let's implement each component and assemble them into a complete decoder block.

Utility Functions

First, we need a few utility functions. The softmax function converts raw scores to probabilities. RMS normalization scales activations to have unit root-mean-square. GELU provides a smooth nonlinearity for the feed-forward network:

In[5]:
Code
def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x_max = np.max(x, axis=axis, keepdims=True)
    exp_x = np.exp(x - x_max)
    return exp_x / np.sum(exp_x, axis=axis, keepdims=True)


def rms_norm(x, gamma, eps=1e-6):
    """RMS Layer Normalization."""
    rms = np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)
    return gamma * (x / rms)


def gelu(x):
    """Gaussian Error Linear Unit activation."""
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

GELU (Gaussian Error Linear Unit) is the activation function used in GPT and most modern transformers. Unlike ReLU, which has a hard cutoff at zero, GELU smoothly transitions between suppressing and passing values:

Out[6]:
Visualization
Line plot comparing GELU and ReLU activation functions from -3 to 3, showing GELU's smooth curve versus ReLU's sharp corner at zero.
Comparison of GELU and ReLU activation functions. GELU (blue) provides a smooth, probabilistic transition that allows small negative values to pass through, while ReLU (orange, dashed) has a hard cutoff at zero.

The smooth transition of GELU can improve gradient flow during training compared to ReLU's sharp corner at zero.

Causal Multi-Head Attention

The attention layer is the heart of the decoder block. It projects input tokens into queries, keys, and values, then computes attention with causal masking. Multiple heads allow the model to attend to different aspects of the context simultaneously:

In[7]:
Code
class CausalMultiHeadAttention:
    """
    Multi-head attention with causal masking.

    Each head learns different aspects of token relationships,
    but all heads respect the causal constraint.
    """

    def __init__(self, d_model, n_heads):
        self.d_model = d_model
        self.n_heads = n_heads
        self.d_k = d_model // n_heads

        # Initialize projection matrices
        scale = np.sqrt(2.0 / (d_model + self.d_k))
        self.W_q = np.random.randn(d_model, d_model) * scale
        self.W_k = np.random.randn(d_model, d_model) * scale
        self.W_v = np.random.randn(d_model, d_model) * scale
        self.W_o = np.random.randn(d_model, d_model) * scale

    def __call__(self, x):
        """
        Apply causal multi-head attention.

        Args:
            x: Input tensor of shape (seq_len, d_model)

        Returns:
            Output tensor of shape (seq_len, d_model)
        """
        seq_len = x.shape[0]

        # Project to queries, keys, values
        Q = x @ self.W_q  # (seq_len, d_model)
        K = x @ self.W_k
        V = x @ self.W_v

        # Reshape for multi-head: (seq_len, n_heads, d_k)
        Q = Q.reshape(seq_len, self.n_heads, self.d_k)
        K = K.reshape(seq_len, self.n_heads, self.d_k)
        V = V.reshape(seq_len, self.n_heads, self.d_k)

        # Transpose to (n_heads, seq_len, d_k) for batch matmul
        Q = Q.transpose(1, 0, 2)
        K = K.transpose(1, 0, 2)
        V = V.transpose(1, 0, 2)

        # Compute attention scores: (n_heads, seq_len, seq_len)
        scores = Q @ K.transpose(0, 2, 1) / np.sqrt(self.d_k)

        # Apply causal mask
        causal_mask = create_causal_mask(seq_len)
        scores = scores + causal_mask  # Broadcasting adds mask to each head

        # Softmax to get attention weights
        attn_weights = softmax(scores, axis=-1)

        # Apply attention to values: (n_heads, seq_len, d_k)
        attn_output = attn_weights @ V

        # Transpose back and reshape: (seq_len, d_model)
        attn_output = attn_output.transpose(1, 0, 2).reshape(
            seq_len, self.d_model
        )

        # Output projection
        output = attn_output @ self.W_o

        return output

    def num_parameters(self):
        """Count total parameters in this layer."""
        return 4 * self.d_model * self.d_model

The attention layer computes the same scaled dot-product attention as an encoder, but the causal mask ensures the upper triangle of attention weights becomes zero. Each position's output is a weighted sum of value vectors from itself and all previous positions.
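One way to sanity-check the causal constraint (a quick illustrative test, not part of the original chapter; the small dimensions and seed are arbitrary) is to perturb a future token and confirm that outputs at earlier positions do not change:

Code
np.random.seed(0)
attn = CausalMultiHeadAttention(d_model=64, n_heads=4)

x = np.random.randn(6, 64)
out_before = attn(x)

x_perturbed = x.copy()
x_perturbed[5] += 10.0          # change only the last token
out_after = attn(x_perturbed)

# Positions 0-4 never attend to position 5, so their outputs are unchanged
print(np.allclose(out_before[:5], out_after[:5]))  # True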

Position-Wise Feed-Forward Network

After attention aggregates context, the FFN transforms each position independently. It expands the representation to a larger hidden dimension, applies a nonlinearity, then projects back to the model dimension. This expansion-contraction pattern adds significant representational capacity:

In[8]:
Code
class FeedForward:
    """
    Position-wise feed-forward network.

    Applies the same two-layer MLP independently to each position.
    """

    def __init__(self, d_model, d_ff):
        self.d_model = d_model
        self.d_ff = d_ff

        # Initialize weights
        scale1 = np.sqrt(2.0 / (d_model + d_ff))
        scale2 = np.sqrt(2.0 / (d_ff + d_model))

        self.W1 = np.random.randn(d_model, d_ff) * scale1
        self.b1 = np.zeros(d_ff)
        self.W2 = np.random.randn(d_ff, d_model) * scale2
        self.b2 = np.zeros(d_model)

    def __call__(self, x):
        """
        Apply feed-forward transformation.

        Args:
            x: Input of shape (seq_len, d_model)

        Returns:
            Output of shape (seq_len, d_model)
        """
        # First linear + GELU activation
        hidden = gelu(x @ self.W1 + self.b1)
        # Second linear
        output = hidden @ self.W2 + self.b2
        return output

    def num_parameters(self):
        return (
            self.d_model * self.d_ff
            + self.d_ff
            + self.d_ff * self.d_model
            + self.d_model
        )

Assembling the Decoder Block

Now we combine attention, FFN, normalization, and residual connections into a complete decoder block. The Pre-LN architecture applies normalization before each sublayer, which produces more stable gradients and enables training without warmup:

In[9]:
Code
class DecoderBlock:
    """
    A single transformer decoder block.

    Uses Pre-LN architecture: normalization before each sublayer.
    Causal masking is applied in the self-attention.
    """

    def __init__(self, d_model, n_heads, d_ff):
        self.d_model = d_model

        # Sublayers
        self.attention = CausalMultiHeadAttention(d_model, n_heads)
        self.ffn = FeedForward(d_model, d_ff)

        # Layer norm parameters (for RMS norm, just scale)
        self.norm1_gamma = np.ones(d_model)
        self.norm2_gamma = np.ones(d_model)

    def __call__(self, x):
        """
        Process input through the decoder block.

        Args:
            x: Input of shape (seq_len, d_model)

        Returns:
            Output of shape (seq_len, d_model)
        """
        # Self-attention with residual
        normed = rms_norm(x, self.norm1_gamma)
        attn_out = self.attention(normed)
        x = x + attn_out

        # FFN with residual
        normed = rms_norm(x, self.norm2_gamma)
        ffn_out = self.ffn(normed)
        x = x + ffn_out

        return x

    def num_parameters(self):
        """Count parameters in this block."""
        attn_params = self.attention.num_parameters()
        ffn_params = self.ffn.num_parameters()
        norm_params = 2 * self.d_model  # Two RMS norms
        return attn_params + ffn_params + norm_params

Testing the Decoder Block

Let's instantiate a decoder block and verify it works correctly. We'll use typical hyperparameters: 256-dimensional model, 8 attention heads (32 dimensions each), and a 4x FFN expansion:
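The instantiation cell itself is not shown on this page; a minimal sketch that matches the configuration reported below (the random seed and input scale are assumptions for illustration) might look like this:

Code
np.random.seed(0)

d_model, n_heads, d_ff = 256, 8, 1024
block = DecoderBlock(d_model=d_model, n_heads=n_heads, d_ff=d_ff)
print(f"Parameters: {block.num_parameters():,}")  # 788,224

x = np.random.randn(10, d_model) * 0.02  # 10 tokens, 256-dim each
out = block(x)
print("Input shape: ", x.shape)    # (10, 256)
print("Output shape:", out.shape)  # (10, 256)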

Out[10]:
Console
Decoder Block Configuration
========================================
  Model dimension (d_model): 256
  Number of heads: 8
  Head dimension: 32
  FFN hidden dim (d_ff): 1024
  Parameters: 788,224

Input shape:  (10, 256)
Output shape: (10, 256)

The block contains roughly 790K parameters, dominated by the feed-forward network (which uses a 4x expansion ratio: $256 \to 1024 \to 256$). The input and output shapes match, allowing blocks to be stacked without dimension changes. This "residual-friendly" design is what enables deep networks: each block can refine the representation without fundamentally altering its structure.

The decoder block transforms each position independently while allowing controlled information flow from past positions through causal attention. The FFN adds non-linear transformation capacity without any cross-position interaction.

GPT-Style Decoder Architecture

GPT (Generative Pre-trained Transformer) established the dominant decoder-only paradigm. The full GPT architecture consists of:

  1. Token Embedding: Maps input token IDs to dense vectors
  2. Position Embedding: Adds positional information (learned or sinusoidal)
  3. Decoder Stack: Multiple decoder blocks in sequence
  4. Final Layer Norm: Normalizes the output of the last block
  5. Language Model Head: Projects to vocabulary size for next-token prediction

Let's build a complete GPT-style decoder.

In[11]:
Code
class GPTDecoder:
    """
    Complete GPT-style decoder-only transformer.

    Takes token indices as input and produces logits over the vocabulary.
    """

    def __init__(
        self, vocab_size, max_seq_len, d_model, n_heads, d_ff, n_layers
    ):
        self.vocab_size = vocab_size
        self.max_seq_len = max_seq_len
        self.d_model = d_model
        self.n_layers = n_layers

        # Token and position embeddings
        self.token_embedding = np.random.randn(vocab_size, d_model) * 0.02
        self.position_embedding = np.random.randn(max_seq_len, d_model) * 0.02

        # Stack of decoder blocks
        self.blocks = [
            DecoderBlock(d_model, n_heads, d_ff) for _ in range(n_layers)
        ]

        # Final layer norm
        self.final_norm_gamma = np.ones(d_model)

        # Language model head (projects to vocabulary)
        self.lm_head = np.random.randn(d_model, vocab_size) * 0.02

    def __call__(self, token_ids):
        """
        Forward pass through the decoder.

        Args:
            token_ids: Array of token indices, shape (seq_len,)

        Returns:
            logits: Unnormalized scores over vocabulary, shape (seq_len, vocab_size)
        """
        seq_len = len(token_ids)

        # Get embeddings
        tok_emb = self.token_embedding[token_ids]  # (seq_len, d_model)
        pos_emb = self.position_embedding[:seq_len]  # (seq_len, d_model)

        # Combine embeddings
        x = tok_emb + pos_emb

        # Pass through decoder blocks
        for block in self.blocks:
            x = block(x)

        # Final normalization
        x = rms_norm(x, self.final_norm_gamma)

        # Project to vocabulary
        logits = x @ self.lm_head  # (seq_len, vocab_size)

        return logits

    def num_parameters(self):
        """Count total parameters."""
        embed_params = self.vocab_size * self.d_model  # Token embedding
        embed_params += self.max_seq_len * self.d_model  # Position embedding
        block_params = sum(b.num_parameters() for b in self.blocks)
        norm_params = self.d_model  # Final norm
        head_params = self.d_model * self.vocab_size  # LM head
        return embed_params + block_params + norm_params + head_params
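The cell that builds the model and reports its configuration is not shown; a minimal sketch using the GPTDecoder class above with the hyperparameters listed below (the smoke-test tokens are arbitrary) might be:

Code
model = GPTDecoder(
    vocab_size=10_000,
    max_seq_len=512,
    d_model=256,
    n_heads=8,
    d_ff=1024,
    n_layers=6,
)
print(f"Total parameters: {model.num_parameters():,}")  # 9,980,672

# Quick smoke test: one row of vocabulary logits per input position
token_ids = np.array([42, 100, 256, 789])
logits = model(token_ids)
print(logits.shape)  # (4, 10000)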
Out[12]:
Console
GPT Decoder Configuration
=============================================
  Vocabulary size: 10,000
  Max sequence length: 512
  Model dimension: 256
  Number of heads: 8
  FFN dimension: 1024
  Number of layers: 6
  Total parameters: 9,980,672

The parameter count breaks down into three main components: embeddings (token + position), the decoder stack, and the language model head. With a 10K vocabulary and 256-dimensional embeddings, the token embedding alone accounts for 2.56M parameters. The 6 decoder blocks contribute about 4.7M parameters (roughly 790K each), and the LM head adds another 2.56M for projecting back to vocabulary size.

Out[13]:
Visualization
Horizontal bar chart showing parameter counts for token embedding, position embedding, decoder blocks, and LM head components.
Parameter distribution in our roughly 10M parameter GPT model. The embeddings and language model head together account for just over half of the parameters, while the 6 decoder blocks contribute the rest. This ratio shifts in larger models, where deeper stacks dominate.

This roughly 10M parameter model is tiny by modern standards. GPT-2 Small has 117M parameters, GPT-3 has 175B, and GPT-4 is estimated at over a trillion. But the architectural pattern is identical: stack more layers, increase dimensions, and scale up the training data.

Layer Stacking and Information Flow

Decoder blocks are stacked sequentially. The output of block $l$ becomes the input to block $l + 1$. Each layer refines the representations, building increasingly abstract features.

In[14]:
Code
def trace_layer_outputs(model, token_ids):
    """
    Trace how representations evolve through decoder layers.

    Returns the hidden states after each layer.
    """
    seq_len = len(token_ids)

    # Initial embeddings
    tok_emb = model.token_embedding[token_ids]
    pos_emb = model.position_embedding[:seq_len]
    x = tok_emb + pos_emb

    layer_outputs = [x.copy()]

    # Through each block
    for block in model.blocks:
        x = block(x)
        layer_outputs.append(x.copy())

    # After final norm
    x = rms_norm(x, model.final_norm_gamma)
    layer_outputs.append(x.copy())

    return layer_outputs


# Trace through our model
test_tokens = np.array([42, 100, 256, 789, 1000, 500, 333, 42])
layer_outputs = trace_layer_outputs(model, test_tokens)
Out[15]:
Console
Tracing 8 tokens through 6 layers

Layer output statistics (L2 norm of representations):
--------------------------------------------------
Embedding   : mean norm = 0.445, std = 0.011
Block 1     : mean norm = 22.717, std = 3.942
Block 2     : mean norm = 34.438, std = 4.544
Block 3     : mean norm = 42.757, std = 4.166
Block 4     : mean norm = 51.345, std = 4.582
Block 5     : mean norm = 61.620, std = 5.051
Block 6     : mean norm = 68.161, std = 4.536
Final Norm  : mean norm = 16.000, std = 0.000

The representation norms reveal how the network transforms token embeddings. Starting from the embedding layer with relatively small norms (around 0.45), the representations grow as they pass through successive blocks. This growth reflects the accumulation of information via residual connections: each block adds its contribution to the running representation. The final layer norm rescales the output, ensuring consistent magnitudes before the language model head computes logits.

Out[16]:
Visualization
Line plot showing L2 norm of representations for 8 tokens across embedding layer, 6 decoder blocks, and final normalization.
Representation norm growth through decoder layers. Each line represents one of the 8 input tokens. Norms generally increase through the decoder blocks as residual connections accumulate contributions, then are rescaled by the final layer norm.

As we go deeper, representation norms tend to grow or stabilize depending on initialization and normalization. The final layer norm ensures consistent scale before the language model head.
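The cell that computes the layer-to-layer similarity shown next is not included on the page; one way to derive a cosine-similarity matrix from the traced layer_outputs (for the final token position) could look like this:

Code
def layer_cosine_similarity(layer_outputs, position=-1):
    """Cosine similarity between layer representations at one position."""
    vecs = np.stack([h[position] for h in layer_outputs])  # (n_states, d_model)
    unit = vecs / np.linalg.norm(vecs, axis=-1, keepdims=True)
    return unit @ unit.T                                    # (n_states, n_states)


similarity = layer_cosine_similarity(layer_outputs, position=-1)
print(similarity.shape)  # (8, 8): embedding, 6 blocks, final norm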

Out[17]:
Visualization
Heatmap showing cosine similarity between decoder layer outputs, with diagonal of 1.0 and off-diagonal values showing layer relationships.
Cosine similarity between layer representations for the final token position. Earlier layers show higher similarity; deeper layers diverge as they develop specialized features.

The similarity matrix reveals how representations evolve. Adjacent layers tend to be more similar (the residual connections preserve information), while distant layers may diverge significantly as the model builds higher-level abstractions.

Visualizing Causal Attention Patterns

Each attention head in a decoder develops specialized patterns. Some heads attend strongly to the previous token (local context). Others attend to the beginning of the sequence. Still others develop positional patterns or semantic groupings.

Let's examine attention patterns in a decoder block.

In[18]:
Code
def extract_attention_weights(block, x):
    """
    Extract attention weights from a decoder block.

    Returns attention weights of shape (n_heads, seq_len, seq_len).
    """
    attn = block.attention
    seq_len = x.shape[0]

    # Replicate attention computation to capture weights
    normed = rms_norm(x, block.norm1_gamma)

    Q = normed @ attn.W_q
    K = normed @ attn.W_k

    Q = Q.reshape(seq_len, attn.n_heads, attn.d_k).transpose(1, 0, 2)
    K = K.reshape(seq_len, attn.n_heads, attn.d_k).transpose(1, 0, 2)

    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(attn.d_k)
    scores = scores + create_causal_mask(seq_len)
    weights = softmax(scores, axis=-1)

    return weights


# Extract attention from first block
x_test = np.random.randn(8, 256) * 0.02
attn_weights = extract_attention_weights(model.blocks[0], x_test)
Out[19]:
Visualization
Lower triangular heatmap showing attention weights for head 0 with varying intensity.
Head 0 attention pattern. This head shows a strong diagonal pattern, focusing primarily on recent positions with causal masking creating the characteristic lower-triangular structure.
Lower triangular heatmap showing attention weights for head 1 with different pattern from head 0.
Head 1 attention pattern. Different heads learn different attention strategies. This head may focus on different positional relationships or semantic patterns.

Both heads exhibit the causal structure with zeros in the upper triangle. The intensity patterns within the lower triangle vary between heads, showing how different heads attend to different parts of the available context. With random initialization, these patterns are not yet meaningful, but training shapes them into useful attention strategies.

The Generation Loop

At inference time, the decoder generates text autoregressively. Each iteration:

  1. Processes the current sequence through all decoder blocks
  2. Takes the logits at the final position
  3. Samples or greedily selects the next token
  4. Appends that token to the sequence
  5. Repeats until a stopping condition
In[20]:
Code
def generate(model, prompt_tokens, max_new_tokens=20, temperature=1.0):
    """
    Generate tokens autoregressively.

    Args:
        model: GPTDecoder instance
        prompt_tokens: Initial token indices
        max_new_tokens: Maximum tokens to generate
        temperature: Sampling temperature (higher = more random)

    Returns:
        Complete sequence including prompt and generated tokens
    """
    tokens = list(prompt_tokens)

    for _ in range(max_new_tokens):
        # Get current sequence length
        current_len = len(tokens)
        if current_len >= model.max_seq_len:
            break

        # Forward pass
        token_array = np.array(tokens)
        logits = model(token_array)

        # Get logits for next token prediction (last position)
        next_logits = logits[-1]  # (vocab_size,)

        # Apply temperature
        if temperature != 1.0:
            next_logits = next_logits / temperature

        # Convert to probabilities
        probs = softmax(next_logits)

        # Sample next token
        next_token = np.random.choice(len(probs), p=probs)
        tokens.append(next_token)

        # Simple stop condition: token 0 is end-of-sequence
        if next_token == 0:
            break

    return np.array(tokens)


# Generate from a prompt
prompt = np.array([42, 100, 256])
generated = generate(model, prompt, max_new_tokens=10, temperature=0.8)
Out[21]:
Console
Autoregressive Generation Demo
=============================================
Prompt tokens: [42, 100, 256]
Generated tokens: [42, 100, 256, 8014, 7624, 6249, 8208, 4081, 6138, 2831, 6387, 5384, 8695]
New tokens: 10

The model generated 10 new tokens from the 3-token prompt. Since this is a randomly initialized model (not trained on any text), the generated token IDs are essentially random samples from the vocabulary. A trained model would produce coherent continuations. The temperature of 0.8 makes the distribution slightly sharper than sampling at temperature 1.0, biasing toward higher-probability tokens.

Let's visualize how temperature affects the sampling distribution. Lower temperatures concentrate probability mass on the highest-scoring tokens, while higher temperatures flatten the distribution:
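The plotting cell is not shown; the underlying computation is just a temperature-scaled softmax, sketched below on some illustrative logits (the logit values are assumptions, not from the model above):

Code
example_logits = np.random.randn(50) * 2.0  # illustrative logits over a tiny vocabulary

for temperature in [0.5, 1.0, 2.0]:
    probs = softmax(example_logits / temperature)
    entropy = -(probs * np.log(probs)).sum()
    # Lower temperature -> higher max probability, lower entropy (sharper distribution)
    print(f"T={temperature}: max prob = {probs.max():.3f}, entropy = {entropy:.2f}")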

Out[22]:
Visualization
Line plot showing three probability distributions over token indices with different temperatures, demonstrating how lower temperatures create sharper peaks.
Effect of temperature on token probability distribution. Temperature=0.5 (blue) sharpens the distribution, concentrating probability on high-scoring tokens. Temperature=1.0 (orange) preserves the original logit ratios. Temperature=2.0 (green) flattens the distribution, making unlikely tokens more probable.

At temperature 0.5, the top tokens dominate the distribution, making generation more deterministic. At temperature 2.0, even low-scoring tokens have non-negligible probability, introducing more randomness and diversity. The choice of temperature trades off between coherence (low temperature) and creativity (high temperature).

In practice, generation involves beam search, top-k sampling, nucleus sampling, and other decoding strategies to balance quality and diversity. The decoder architecture itself doesn't change; only the sampling logic varies.
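As an illustration of how such strategies plug into the same loop, here is a hedged sketch of top-k and nucleus (top-p) filtering applied to next-token logits before sampling; the function name and thresholds are illustrative, not from the chapter:

Code
def filter_logits(logits, top_k=None, top_p=None):
    """Return a copy of logits with low-probability tokens masked to -inf."""
    filtered = logits.copy()

    if top_k is not None:
        # Keep only the k highest-scoring tokens
        kth_value = np.sort(filtered)[-top_k]
        filtered[filtered < kth_value] = float("-inf")

    if top_p is not None:
        # Keep the smallest set of tokens whose cumulative probability >= top_p
        probs = softmax(filtered)
        order = np.argsort(probs)[::-1]
        cumulative = np.cumsum(probs[order])
        cutoff = np.searchsorted(cumulative, top_p) + 1
        filtered[order[cutoff:]] = float("-inf")

    return filtered


next_logits = np.random.randn(100)
probs = softmax(filter_logits(next_logits, top_k=10, top_p=0.9))
next_token = np.random.choice(len(probs), p=probs)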

Key-Value Caching for Efficient Generation

A naive implementation recomputes attention for all positions at every generation step. This is wasteful: the representations for tokens $0$ through $t-1$ don't change when we add token $t$. KV caching stores the key and value vectors from previous positions, avoiding redundant computation.

In[23]:
Code
class CausalAttentionWithCache:
    """
    Multi-head attention with KV cache for efficient generation.
    """

    def __init__(self, d_model, n_heads):
        self.d_model = d_model
        self.n_heads = n_heads
        self.d_k = d_model // n_heads

        scale = np.sqrt(2.0 / (d_model + self.d_k))
        self.W_q = np.random.randn(d_model, d_model) * scale
        self.W_k = np.random.randn(d_model, d_model) * scale
        self.W_v = np.random.randn(d_model, d_model) * scale
        self.W_o = np.random.randn(d_model, d_model) * scale

        # Cache for past keys and values
        self.cache_k = None
        self.cache_v = None

    def clear_cache(self):
        """Reset the KV cache."""
        self.cache_k = None
        self.cache_v = None

    def __call__(self, x, use_cache=False):
        """
        Apply attention, optionally using and updating cache.

        When use_cache=True:
        - x should be just the new token(s)
        - Keys/values are appended to cache
        - Queries attend to all cached positions
        """
        seq_len = x.shape[0]

        # Compute Q, K, V for new positions
        Q = x @ self.W_q
        K = x @ self.W_k
        V = x @ self.W_v

        # Reshape for multi-head
        Q = Q.reshape(seq_len, self.n_heads, self.d_k).transpose(1, 0, 2)
        K = K.reshape(seq_len, self.n_heads, self.d_k).transpose(1, 0, 2)
        V = V.reshape(seq_len, self.n_heads, self.d_k).transpose(1, 0, 2)

        if use_cache:
            # Append new K, V to cache
            if self.cache_k is None:
                self.cache_k = K
                self.cache_v = V
            else:
                self.cache_k = np.concatenate([self.cache_k, K], axis=1)
                self.cache_v = np.concatenate([self.cache_v, V], axis=1)

            # Use full cached K, V
            K_full = self.cache_k
            V_full = self.cache_v
        else:
            K_full = K
            V_full = V

        total_len = K_full.shape[1]

        # Compute attention scores
        scores = Q @ K_full.transpose(0, 2, 1) / np.sqrt(self.d_k)

        # Apply causal mask (only for positions we're querying)
        # Shape: (seq_len, total_len)
        mask = np.zeros((seq_len, total_len))
        start_pos = total_len - seq_len
        for i in range(seq_len):
            # Position i in query corresponds to position start_pos + i in full sequence
            full_pos = start_pos + i
            # Can attend to positions 0 to full_pos
            mask[i, full_pos + 1 :] = float("-inf")

        scores = scores + mask

        # Softmax and apply to values
        attn_weights = softmax(scores, axis=-1)
        attn_output = attn_weights @ V_full

        # Reshape and project
        attn_output = attn_output.transpose(1, 0, 2).reshape(
            seq_len, self.d_model
        )
        output = attn_output @ self.W_o

        return output
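The demo cell that produced the output below is not shown; a minimal sketch (assuming a small configuration with d_model=64 and 4 heads, which matches the reported shapes; the seed and input scale are arbitrary) might be:

Code
np.random.seed(0)
attn_cached = CausalAttentionWithCache(d_model=64, n_heads=4)

# Process an initial 5-token "prompt", filling the cache
prompt = np.random.randn(5, 64) * 0.02
_ = attn_cached(prompt, use_cache=True)
print("Cache K shape after init:", attn_cached.cache_k.shape)  # (4, 5, 16)

# Process one new token; only its K/V are computed and appended
new_token = np.random.randn(1, 64) * 0.02
out = attn_cached(new_token, use_cache=True)
print("Cache K shape after addition:", attn_cached.cache_k.shape)  # (4, 6, 16)
print("Output shape for new token:", out.shape)                    # (1, 64)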
Out[24]:
Console
KV Cache Demo
=============================================
Initial sequence: 5 tokens
Cache K shape after init: (4, 5, 16)
Added 1 token
Cache K shape after addition: (4, 6, 16)
Output shape for new token: (1, 64)

The cache grows from shape (4, 5, 16) to (4, 6, 16) after adding one token. Here, 4 is the number of heads, the middle dimension is sequence length (growing from 5 to 6), and 16 is the head dimension (64 / 4 heads). When processing the new token, we only compute queries for that single position, but the keys and values span all 6 positions. This asymmetry is what makes caching efficient: the expensive key-value computation for old tokens is reused.

With KV caching, generating $n$ new tokens requires $O(n \cdot L)$ attention compute instead of roughly $O(n \cdot L^2)$, where:

  • $n$: the number of new tokens to generate
  • $L$: the total sequence length (prompt plus generated tokens)
  • $O(n \cdot L)$: linear scaling because each new token only computes attention against the cached keys and values
  • $O(n \cdot L^2)$: the cost without caching, since every generation step reprocesses all positions and recomputes the full attention matrix

This optimization is critical for efficient inference in production systems. For a 1000-token prompt with 100 generated tokens, caching avoids reprocessing the entire prompt at every step, reducing attention computation by roughly a factor of the sequence length.

Decoder Architecture Comparison

Different models use slightly different decoder designs. The core pattern is consistent, but details vary:

Decoder architecture comparison across popular language models. All models share the same fundamental structure but differ in scale and specific design choices.
| Model | Layers | Heads | $d_{\text{model}}$ | FFN Ratio | Norm | Notable Features |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-2 Small | 12 | 12 | 768 | 4x | Layer Norm | Learned position embeddings |
| GPT-2 Medium | 24 | 16 | 1024 | 4x | Layer Norm | Pre-LN architecture |
| GPT-3 | 96 | 96 | 12288 | 4x | Layer Norm | Sparse attention in some heads |
| Llama 2 7B | 32 | 32 | 4096 | 2.7x | RMSNorm | RoPE, SwiGLU FFN |
| Llama 3 8B | 32 | 32 | 4096 | 3.5x | RMSNorm | Grouped-query attention |
| Mistral 7B | 32 | 32 | 4096 | 3.5x | RMSNorm | Sliding window attention |

Modern decoders favor RMSNorm over LayerNorm for efficiency, SwiGLU or GeGLU activations in the FFN for better performance, and positional encodings like RoPE that generalize to longer sequences. Grouped-query attention reduces the KV cache size by sharing key-value projections across groups of heads.
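To make the SwiGLU variant concrete, here is a hedged sketch of a gated feed-forward layer of the kind used in Llama-family models; the class name, initialization, and exact expansion factor are illustrative assumptions, and real implementations differ in details:

Code
def silu(x):
    """SiLU (swish) activation: x * sigmoid(x)."""
    return x / (1 + np.exp(-x))


class SwiGLUFeedForward:
    """Gated FFN: (silu(x W_gate) * (x W_up)) W_down."""

    def __init__(self, d_model, d_ff):
        scale = np.sqrt(2.0 / (d_model + d_ff))
        self.W_gate = np.random.randn(d_model, d_ff) * scale
        self.W_up = np.random.randn(d_model, d_ff) * scale
        self.W_down = np.random.randn(d_ff, d_model) * scale

    def __call__(self, x):
        # The gate modulates the up-projection before projecting back down
        return (silu(x @ self.W_gate) * (x @ self.W_up)) @ self.W_down


ffn = SwiGLUFeedForward(d_model=256, d_ff=int(256 * 8 / 3))  # ~2.7x expansion
print(ffn(np.random.randn(10, 256)).shape)  # (10, 256)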

When to Use Decoder-Only Models

Decoder-only architectures excel at:

  • Text generation: Stories, code, conversations, completions
  • Language modeling: Predicting the next token given context
  • In-context learning: Few-shot learning through prompting
  • General-purpose AI assistants: Handling diverse tasks through natural language

They are less naturally suited for:

  • Bidirectional understanding: Tasks like named entity recognition or question answering over a document, where seeing the full context helps
  • Sequence-to-sequence with misaligned lengths: Machine translation where the source and target have different structures
  • Fixed-length encoding: Producing a single vector representation of a document

For these tasks, encoder-only (like BERT) or encoder-decoder models (like T5) may be more appropriate. However, large decoder-only models trained with instruction tuning have proven surprisingly capable across task types, blurring these traditional boundaries.

Limitations and Impact

The decoder-only architecture has transformed language AI, but it comes with inherent constraints that shape both its capabilities and its failure modes.

The autoregressive nature means generation is inherently sequential at inference time. Each new token depends on all previous tokens, preventing parallel generation. While KV caching mitigates redundant computation, the fundamental serial dependency remains. For real-time applications, this creates latency constraints that scale with output length. Techniques like speculative decoding, where a smaller model drafts tokens that a larger model verifies in parallel, partially address this, but the sequential bottleneck persists.

Causal masking also limits the model's understanding of each position. When processing position $t$, the decoder cannot consider positions $t+1, t+2, \ldots$ even though, for understanding tasks, that future context would be valuable. Bidirectional models like BERT can leverage the full context for each position, giving them an advantage for tasks like classification or extraction. Decoder-only models compensate through scale and training data, but the architectural constraint remains.

Despite these limitations, the decoder-only paradigm has proven remarkably versatile. The GPT series demonstrated that next-token prediction at scale produces emergent capabilities in reasoning, code generation, and multilingual understanding. The simplicity of the architecture, a single stack of nearly identical blocks, enables efficient scaling and has become the foundation for modern generative AI. From GPT-2's surprising text generation to GPT-4's multimodal reasoning, the decoder architecture continues to define the frontier of language AI.

Key Parameters

When configuring a decoder-only transformer, the following parameters have the greatest impact on model capacity and performance:

  • d_model (model dimension): The dimensionality of token representations throughout the network. Common values range from 256 (small models) to 12288 (GPT-3 scale). Must be divisible by n_heads. Larger values increase capacity but quadratically increase attention computation.

  • n_heads (number of attention heads): How many parallel attention patterns the model can learn. Typically set so that d_model / n_heads yields head dimensions of 64-128. More heads enable diverse attention patterns but have diminishing returns beyond a point.

  • n_layers (number of decoder blocks): The depth of the transformer stack. Deeper models learn more abstract representations. GPT-2 Small uses 12 layers; GPT-3 uses 96. Depth trades off against width for a given parameter budget.

  • d_ff (feed-forward hidden dimension): The expansion dimension in the FFN sublayer. Traditionally 4x d_model, but modern models use 2.7x-3.5x with gated activations like SwiGLU. Larger FFN dimensions increase the network's non-linear transformation capacity.

  • vocab_size: The number of tokens in the vocabulary. Larger vocabularies reduce sequence length for the same text but increase embedding and LM head parameters. Typical values range from about 50K (GPT-2) to 128K (Llama 3).

  • max_seq_len: Maximum sequence length the model can process. Affects position embedding size and memory requirements during training. Modern models support 2K-128K tokens through efficient attention mechanisms.

  • temperature (generation): Controls randomness during sampling. Values below 1.0 sharpen the distribution (more deterministic); values above 1.0 flatten it (more random). Common range is 0.7-1.0 for generation tasks.

Summary

The decoder architecture powers modern language generation by combining self-attention with causal masking to enforce left-to-right information flow. Key takeaways:

  • Causal masking prevents each position from attending to future positions, matching the autoregressive generation process
  • Decoder blocks consist of masked self-attention and feed-forward layers, each with residual connections and normalization
  • GPT-style models stack decoder blocks between embedding and language model head layers, forming a unified architecture for understanding and generation
  • Layer stacking builds increasingly abstract representations, with each layer refining outputs from the previous layer
  • KV caching enables efficient generation by storing key-value pairs from previous positions
  • Generation proceeds token by token, with each new token conditioned on all previous ones through the attention mechanism

The decoder-only design has become the default architecture for large language models, offering simplicity, scalability, and surprising generality. While encoder and encoder-decoder architectures retain advantages for specific tasks, the decoder's unified approach to understanding and generation has proven remarkably effective across the spectrum of language AI applications.

