Decoder Architecture: Causal Masking & Autoregressive Generation

Michael Brenndoerfer · Updated June 19, 2025 · 39 min read

Master decoder-only transformers powering GPT, Llama, and modern LLMs. Learn causal masking, autoregressive generation, KV caching, and GPT-style architecture from scratch.

Decoder Architecture

The transformer decoder is the engine behind text generation. While encoders excel at understanding input sequences, decoders specialize in producing sequences, one token at a time. Every time you interact with ChatGPT, Claude, or Llama, you're witnessing a decoder at work: taking your prompt and generating a coherent response by predicting one token after another.

What makes the decoder special is its causal nature. Unlike encoders that can look at the entire input simultaneously, decoders must respect temporal order. When generating the fourth word, the model can only consider words one through three. It cannot peek ahead at words five or six because those don't exist yet. This constraint, enforced through causal masking, defines the decoder's fundamental architecture.

In this chapter, we'll explore the decoder-only design that powers modern language models like GPT, the mechanism of causal masking that enforces left-to-right generation, and how multiple decoder layers stack to create increasingly sophisticated language understanding.

The Decoder-Only Design

The original transformer from "Attention Is All You Need" (2017) used both an encoder and a decoder for machine translation. But a pivotal discovery emerged: for language modeling, you don't need the encoder at all. A stack of decoder blocks, trained to predict the next token, learns rich representations of language without any encoder component.

Decoder-Only Architecture

A decoder-only model consists of a stack of transformer blocks that process tokens autoregressively. Each block uses causal (masked) self-attention to ensure that predictions at position $t$ depend only on positions $0, 1, \ldots, t-1$.

The decoder-only approach offers several advantages:

  • Simplicity: One stack of layers instead of two (no encoder-decoder attention needed)
  • Unified architecture: The same model handles both understanding and generation
  • Scalability: Easier to scale parameters when you have a single stack
  • Pretraining efficiency: Next-token prediction provides dense supervision at every position

GPT-1, GPT-2, GPT-3, and their successors all use this decoder-only design. So do Llama, Mistral, and most open-source language models. The architecture has become the default choice for generative language AI.

Understanding Autoregressive Generation

Before diving into the architecture, let's clarify what autoregressive generation means in practice. The decoder generates text one token at a time, where each new token is conditioned on all previously generated tokens.

Given a prompt "The cat sat on the", generation proceeds as follows:

  1. Process the prompt through the decoder to get a representation for each position
  2. Use the representation at the final position to predict the next token: "mat"
  3. Append "mat" to the sequence and process again
  4. Use the new final position to predict the next token: "and"
  5. Continue until generating a stop token or reaching a length limit

Mathematically, we're factorizing the joint probability of a sequence into a product of conditional probabilities. This is known as the chain rule of probability, and it expresses the idea that generating a sequence is equivalent to making a series of next-token predictions:

$$P(x_1, x_2, \ldots, x_T) = \prod_{t=1}^{T} P(x_t | x_1, x_2, \ldots, x_{t-1})$$

where:

  • $P(x_1, x_2, \ldots, x_T)$: the joint probability of the entire sequence of $T$ tokens
  • $x_t$: the token at position $t$ in the sequence
  • $T$: the total sequence length
  • $\prod_{t=1}^{T}$: the product over all positions from 1 to $T$
  • $P(x_t | x_1, \ldots, x_{t-1})$: the conditional probability of token $x_t$ given all previous tokens

Each factor in this product represents one prediction step. The factor $P(x_1)$ predicts the first token with no context. The factor $P(x_2 | x_1)$ predicts the second token given the first. By the time we reach $P(x_T | x_1, \ldots, x_{T-1})$, we're predicting the final token given the entire preceding context.

The decoder computes each of these conditional probabilities. It takes the sequence so far, applies its transformer blocks to build contextual representations, and outputs a probability distribution over the vocabulary for the next token.
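To make the factorization concrete, here is a small illustrative sketch (not part of the chapter's model) showing how a sequence's log-probability decomposes into per-step next-token log-probabilities. The `next_token_probs` function and the toy uniform model are assumptions for illustration only:

Code
import numpy as np


def sequence_log_prob(next_token_probs, tokens):
    """
    Compute log P(x_1, ..., x_T) as the sum of per-step conditional
    log-probabilities, mirroring the chain-rule factorization.

    next_token_probs(prefix) is assumed to return a probability
    distribution over the vocabulary for the next token.
    """
    total = 0.0
    for t in range(len(tokens)):
        probs = next_token_probs(tokens[:t])  # condition on tokens before position t
        total += np.log(probs[tokens[t]])     # log P(x_t | x_1, ..., x_{t-1})
    return total


# Toy example: a uniform "model" over a 5-token vocabulary
vocab_size = 5


def uniform_model(prefix):
    return np.ones(vocab_size) / vocab_size


print(sequence_log_prob(uniform_model, [2, 0, 4]))  # 3 * log(1/5) ≈ -4.83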

The Causal Masking Requirement

To understand why causal masking is essential, consider what happens without it. In standard self-attention, every position can attend to every other position. When processing the word "cat" in "The cat sat on the mat," the model can look at "sat," "on," "the," and "mat" to help represent "cat." This bidirectional view is powerful for understanding, but it creates a fundamental problem for generation.

During generation, when we're about to predict "sat," the words "on," "the," and "mat" don't exist yet. If the model learned to rely on future context during training, it would fail at inference time. Even worse, during training on complete sequences, the model could "cheat" by peeking at the answer. Position 3 could look at position 4 to see what word should come next, trivially solving the prediction task without learning meaningful patterns.

Causal Masking

Causal masking restricts each position to attend only to itself and previous positions. This enforces a left-to-right information flow that matches the autoregressive generation process, where tokens are produced one at a time and future tokens don't exist.

The solution is to enforce a strict temporal ordering: position $t$ can only attend to positions $0, 1, \ldots, t$. This constraint, called causal masking or autoregressive masking, ensures the model learns to predict using only past context.

From Intuition to Formulation

How do we mathematically enforce "only attend to the past"? We need to modify the attention mechanism so that certain query-key pairs produce zero attention weight. The key insight is that attention weights come from a softmax over scores. If we can make certain scores extremely negative before the softmax, they'll become negligible in the final weights.

Standard scaled dot-product attention computes:

$$\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_k}}\right)\mathbf{V}$$

The term $\mathbf{Q}\mathbf{K}^T$ produces an $n \times n$ matrix of scores, where entry $(i, j)$ measures how much query $i$ should attend to key $j$. To block future positions, we add a mask matrix $\mathbf{M}$ to these scores before the softmax:

$$\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_k}} + \mathbf{M}\right)\mathbf{V}$$

where:

  • $\mathbf{Q} \in \mathbb{R}^{n \times d_k}$: query matrix containing $n$ query vectors, each of dimension $d_k$
  • $\mathbf{K} \in \mathbb{R}^{n \times d_k}$: key matrix containing $n$ key vectors, each of dimension $d_k$
  • $\mathbf{V} \in \mathbb{R}^{n \times d_v}$: value matrix containing $n$ value vectors, each of dimension $d_v$
  • $\mathbf{Q}\mathbf{K}^T \in \mathbb{R}^{n \times n}$: the raw attention score matrix, where entry $(i, j)$ measures similarity between query $i$ and key $j$
  • $\sqrt{d_k}$: scaling factor that prevents dot products from growing too large as dimension increases
  • $\mathbf{M} \in \mathbb{R}^{n \times n}$: the causal mask matrix that blocks future positions

Designing the Mask

The mask $\mathbf{M}$ must satisfy a simple requirement: add 0 to scores we want to keep, and add a very large negative number to scores we want to block. Formally:

$$M_{ij} = \begin{cases} 0 & \text{if } j \leq i \\ -\infty & \text{if } j > i \end{cases}$$

where:

  • $i$: the query position (row index), representing the token doing the attending
  • $j$: the key position (column index), representing the token being attended to
  • $M_{ij}$: the mask value added to the attention score at position $(i, j)$

The condition $j \leq i$ captures "key is at or before query," which means the token at position $j$ is in the past (or present) relative to position $i$. When this holds, we add 0, leaving the score unchanged. When $j > i$, the key is in the future, so we add $-\infty$ to block that attention path.

This creates a characteristic triangular pattern:

  • Row 0 (first token): only column 0 is allowed (the token can only see itself)
  • Row 1 (second token): columns 0 and 1 are allowed (can see itself and the first token)
  • Row $n-1$ (last token): all columns 0 through $n-1$ are allowed (can see the entire sequence)

Why $-\infty$ Creates Zero Attention

The mathematical trick is elegant. After adding the mask, we apply softmax to convert scores to attention weights. For a masked position:

$$\text{softmax}(s_{ij} + (-\infty)) = \frac{e^{s_{ij} - \infty}}{\sum_k e^{s_{ik} + M_{ik}}}$$

As the mask value approaches $-\infty$, the numerator $e^{-\infty}$ approaches 0. The denominator remains positive because unmasked positions contribute finite, positive exponentials. The result:

$$\frac{e^{-\infty}}{\text{positive sum}} = \frac{0}{\text{positive sum}} = 0$$

Those positions receive exactly zero attention weight. The token at position $i$ effectively cannot see any token at positions $i+1, i+2, \ldots, n-1$. In practice, we use a large finite value like $-10^9$ instead of true infinity, which achieves the same effect with standard floating-point arithmetic.
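A quick numeric check (an illustrative snippet, not from the original page) makes this concrete: adding a large negative value to two of four scores drives their softmax weights to zero.

Code
import numpy as np

scores = np.array([2.0, 1.0, 0.5, 0.0])    # raw attention scores for one query
mask = np.array([0.0, 0.0, -1e9, -1e9])    # block the last two (future) positions

masked = scores + mask
weights = np.exp(masked - masked.max())    # numerically stable softmax
weights /= weights.sum()
print(weights)  # ≈ [0.731, 0.269, 0.0, 0.0]: masked positions get zero weight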

Implementation

Let's implement causal masking and visualize the resulting attention pattern. The mask is simply an upper triangular matrix of negative infinities:

In[2]:
Code
import numpy as np


def create_causal_mask(seq_len):
    """
    Create a causal attention mask.

    Returns a matrix where:
    - 0 for positions that CAN be attended (j <= i)
    - -inf for positions that CANNOT be attended (j > i)
    """
    # Upper triangular matrix with -inf above the diagonal
    mask = np.triu(np.ones((seq_len, seq_len)) * float("-inf"), k=1)
    return mask


# Example: 5-token sequence
mask = create_causal_mask(5)
Out[3]:
Console
Causal Mask (5 tokens):
Rows = query position, Columns = key position
0 = can attend, -inf = blocked

Position 0: [  0  -inf -inf -inf -inf ]
Position 1: [  0    0  -inf -inf -inf ]
Position 2: [  0    0    0  -inf -inf ]
Position 3: [  0    0    0    0  -inf ]
Position 4: [  0    0    0    0    0  ]

Position 0 can only attend to itself. Position 1 can attend to positions 0 and 1. Position 4 can attend to all five positions. The triangular structure ensures strictly left-to-right information flow.

Out[4]:
Visualization
Triangular heatmap showing causal mask with lower triangle green and upper triangle red.
Causal attention mask for a 6-token sequence. Green cells indicate allowed attention (the query can attend to that key position). Red cells indicate blocked attention (future positions). Each row shows what a given position can see.

Anatomy of a Decoder Block

Now that we understand causal masking, let's see how it fits into a complete decoder block. A decoder block is structurally similar to an encoder block, with one critical difference: its self-attention layer applies causal masking. But the block is more than just masked attention; it's a carefully designed composition of components that work together to transform token representations.

The Two-Sublayer Structure

Each decoder block consists of two sublayers that serve complementary purposes:

  1. Causal Multi-Head Self-Attention: This sublayer enables cross-position communication. Each token gathers information from previous tokens (and itself), building a representation that incorporates relevant context. The causal mask ensures this communication respects temporal order.

  2. Feed-Forward Network (FFN): This sublayer provides position-wise transformation. The same two-layer MLP is applied independently to each position, adding non-linear representational capacity. Unlike attention, the FFN doesn't mix information across positions; it processes each token's representation in isolation.

These sublayers address different aspects of representation learning. Attention handles where to look in the sequence. The FFN handles how to transform what was gathered. Together, they enable the block to both aggregate context and compute complex functions of that context.

Supporting Components

Two additional components wrap each sublayer to enable deep networks:

  • Residual Connections: Skip connections around each sublayer create additive shortcuts. Instead of computing $y = f(x)$, the block computes $y = x + f(x)$. This means the sublayer only needs to learn the delta, or refinement, to add to the representation. Residuals also provide direct gradient paths that bypass potentially problematic nonlinearities.

  • Layer Normalization: Normalizes activations to stabilize training. Modern decoders use Pre-LN architecture, applying normalization before each sublayer rather than after. Pre-LN produces more stable gradients at initialization and enables training without careful learning rate warmup.

The data flow through a Pre-LN decoder block follows this pattern:

  1. Normalize → Attention → Add residual
  2. Normalize → FFN → Add residual

Let's implement each component and assemble them into a complete decoder block.

Utility Functions

First, we need a few utility functions. The softmax function converts raw scores to probabilities. RMS normalization scales activations to have unit root-mean-square. GELU provides a smooth nonlinearity for the feed-forward network:

In[5]:
Code
def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x_max = np.max(x, axis=axis, keepdims=True)
    exp_x = np.exp(x - x_max)
    return exp_x / np.sum(exp_x, axis=axis, keepdims=True)


def rms_norm(x, gamma, eps=1e-6):
    """RMS Layer Normalization."""
    rms = np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)
    return gamma * (x / rms)


def gelu(x):
    """Gaussian Error Linear Unit activation."""
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

GELU (Gaussian Error Linear Unit) is the activation function used in GPT and most modern transformers. Unlike ReLU, which has a hard cutoff at zero, GELU smoothly transitions between suppressing and passing values:

Out[6]:
Visualization
Line plot comparing GELU and ReLU activation functions from -3 to 3, showing GELU's smooth curve versus ReLU's sharp corner at zero.
Comparison of GELU and ReLU activation functions. GELU (blue) provides a smooth, probabilistic transition that allows small negative values to pass through, while ReLU (orange, dashed) has a hard cutoff at zero.

The smooth transition of GELU can improve gradient flow during training compared to ReLU's sharp corner at zero.

Causal Multi-Head Attention

The attention layer is the heart of the decoder block. It projects input tokens into queries, keys, and values, then computes attention with causal masking. Multiple heads allow the model to attend to different aspects of the context simultaneously:

In[7]:
Code
class CausalMultiHeadAttention:
    """
    Multi-head attention with causal masking.

    Each head learns different aspects of token relationships,
    but all heads respect the causal constraint.
    """

    def __init__(self, d_model, n_heads):
        self.d_model = d_model
        self.n_heads = n_heads
        self.d_k = d_model // n_heads

        # Initialize projection matrices
        scale = np.sqrt(2.0 / (d_model + self.d_k))
        self.W_q = np.random.randn(d_model, d_model) * scale
        self.W_k = np.random.randn(d_model, d_model) * scale
        self.W_v = np.random.randn(d_model, d_model) * scale
        self.W_o = np.random.randn(d_model, d_model) * scale

    def __call__(self, x):
        """
        Apply causal multi-head attention.

        Args:
            x: Input tensor of shape (seq_len, d_model)

        Returns:
            Output tensor of shape (seq_len, d_model)
        """
        seq_len = x.shape[0]

        # Project to queries, keys, values
        Q = x @ self.W_q  # (seq_len, d_model)
        K = x @ self.W_k
        V = x @ self.W_v

        # Reshape for multi-head: (seq_len, n_heads, d_k)
        Q = Q.reshape(seq_len, self.n_heads, self.d_k)
        K = K.reshape(seq_len, self.n_heads, self.d_k)
        V = V.reshape(seq_len, self.n_heads, self.d_k)

        # Transpose to (n_heads, seq_len, d_k) for batch matmul
        Q = Q.transpose(1, 0, 2)
        K = K.transpose(1, 0, 2)
        V = V.transpose(1, 0, 2)

        # Compute attention scores: (n_heads, seq_len, seq_len)
        scores = Q @ K.transpose(0, 2, 1) / np.sqrt(self.d_k)

        # Apply causal mask
        causal_mask = create_causal_mask(seq_len)
        scores = scores + causal_mask  # Broadcasting adds mask to each head

        # Softmax to get attention weights
        attn_weights = softmax(scores, axis=-1)

        # Apply attention to values: (n_heads, seq_len, d_k)
        attn_output = attn_weights @ V

        # Transpose back and reshape: (seq_len, d_model)
        attn_output = attn_output.transpose(1, 0, 2).reshape(
            seq_len, self.d_model
        )

        # Output projection
        output = attn_output @ self.W_o

        return output

    def num_parameters(self):
        """Count total parameters in this layer."""
        return 4 * self.d_model * self.d_model

The attention layer computes the same scaled dot-product attention as an encoder, but the causal mask ensures the upper triangle of attention weights becomes zero. Each position's output is a weighted sum of value vectors from itself and all previous positions.
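One way to sanity-check the causal constraint (a quick illustrative test, not part of the original chapter; the small dimensions and seed are arbitrary) is to perturb a future token and confirm that outputs at earlier positions do not change:

Code
np.random.seed(0)
attn = CausalMultiHeadAttention(d_model=64, n_heads=4)

x = np.random.randn(6, 64)
out_before = attn(x)

x_perturbed = x.copy()
x_perturbed[5] += 10.0          # change only the last token
out_after = attn(x_perturbed)

# Positions 0-4 never attend to position 5, so their outputs are unchanged
print(np.allclose(out_before[:5], out_after[:5]))  # True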

Position-Wise Feed-Forward Network

After attention aggregates context, the FFN transforms each position independently. It expands the representation to a larger hidden dimension, applies a nonlinearity, then projects back to the model dimension. This expansion-contraction pattern adds significant representational capacity:

In[8]:
Code
class FeedForward:
    """
    Position-wise feed-forward network.

    Applies the same two-layer MLP independently to each position.
    """

    def __init__(self, d_model, d_ff):
        self.d_model = d_model
        self.d_ff = d_ff

        # Initialize weights
        scale1 = np.sqrt(2.0 / (d_model + d_ff))
        scale2 = np.sqrt(2.0 / (d_ff + d_model))

        self.W1 = np.random.randn(d_model, d_ff) * scale1
        self.b1 = np.zeros(d_ff)
        self.W2 = np.random.randn(d_ff, d_model) * scale2
        self.b2 = np.zeros(d_model)

    def __call__(self, x):
        """
        Apply feed-forward transformation.

        Args:
            x: Input of shape (seq_len, d_model)

        Returns:
            Output of shape (seq_len, d_model)
        """
        # First linear + GELU activation
        hidden = gelu(x @ self.W1 + self.b1)
        # Second linear
        output = hidden @ self.W2 + self.b2
        return output

    def num_parameters(self):
        return (
            self.d_model * self.d_ff
            + self.d_ff
            + self.d_ff * self.d_model
            + self.d_model
        )

Assembling the Decoder Block

Now we combine attention, FFN, normalization, and residual connections into a complete decoder block. The Pre-LN architecture applies normalization before each sublayer, which produces more stable gradients and enables training without warmup:

In[9]:
Code
class DecoderBlock:
    """
    A single transformer decoder block.

    Uses Pre-LN architecture: normalization before each sublayer.
    Causal masking is applied in the self-attention.
    """

    def __init__(self, d_model, n_heads, d_ff):
        self.d_model = d_model

        # Sublayers
        self.attention = CausalMultiHeadAttention(d_model, n_heads)
        self.ffn = FeedForward(d_model, d_ff)

        # Layer norm parameters (for RMS norm, just scale)
        self.norm1_gamma = np.ones(d_model)
        self.norm2_gamma = np.ones(d_model)

    def __call__(self, x):
        """
        Process input through the decoder block.

        Args:
            x: Input of shape (seq_len, d_model)

        Returns:
            Output of shape (seq_len, d_model)
        """
        # Self-attention with residual
        normed = rms_norm(x, self.norm1_gamma)
        attn_out = self.attention(normed)
        x = x + attn_out

        # FFN with residual
        normed = rms_norm(x, self.norm2_gamma)
        ffn_out = self.ffn(normed)
        x = x + ffn_out

        return x

    def num_parameters(self):
        """Count parameters in this block."""
        attn_params = self.attention.num_parameters()
        ffn_params = self.ffn.num_parameters()
        norm_params = 2 * self.d_model  # Two RMS norms
        return attn_params + ffn_params + norm_params

Testing the Decoder Block

Let's instantiate a decoder block and verify it works correctly. We'll use typical hyperparameters: 256-dimensional model, 8 attention heads (32 dimensions each), and a 4x FFN expansion:
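The instantiation cell itself is not shown on this page; a minimal sketch that matches the configuration reported below (the random seed and input scale are assumptions for illustration) might look like this:

Code
np.random.seed(0)

d_model, n_heads, d_ff = 256, 8, 1024
block = DecoderBlock(d_model=d_model, n_heads=n_heads, d_ff=d_ff)
print(f"Parameters: {block.num_parameters():,}")  # 788,224

x = np.random.randn(10, d_model) * 0.02  # 10 tokens, 256-dim each
out = block(x)
print("Input shape: ", x.shape)    # (10, 256)
print("Output shape:", out.shape)  # (10, 256)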

Out[10]:
Console
Decoder Block Configuration
========================================
  Model dimension (d_model): 256
  Number of heads: 8
  Head dimension: 32
  FFN hidden dim (d_ff): 1024
  Parameters: 788,224

Input shape:  (10, 256)
Output shape: (10, 256)

The block contains roughly 790K parameters, dominated by the feed-forward network (which uses a 4x expansion ratio: $256 \to 1024 \to 256$). The input and output shapes match, allowing blocks to be stacked without dimension changes. This "residual-friendly" design is what enables deep networks: each block can refine the representation without fundamentally altering its structure.

The decoder block transforms each position independently while allowing controlled information flow from past positions through causal attention. The FFN adds non-linear transformation capacity without any cross-position interaction.

GPT-Style Decoder Architecture

GPT (Generative Pre-trained Transformer) established the dominant decoder-only paradigm. The full GPT architecture consists of:

  1. Token Embedding: Maps input token IDs to dense vectors
  2. Position Embedding: Adds positional information (learned or sinusoidal)
  3. Decoder Stack: Multiple decoder blocks in sequence
  4. Final Layer Norm: Normalizes the output of the last block
  5. Language Model Head: Projects to vocabulary size for next-token prediction

Let's build a complete GPT-style decoder.

In[11]:
Code
class GPTDecoder:
    """
    Complete GPT-style decoder-only transformer.

    Takes token indices as input and produces logits over the vocabulary.
    """

    def __init__(
        self, vocab_size, max_seq_len, d_model, n_heads, d_ff, n_layers
    ):
        self.vocab_size = vocab_size
        self.max_seq_len = max_seq_len
        self.d_model = d_model
        self.n_layers = n_layers

        # Token and position embeddings
        self.token_embedding = np.random.randn(vocab_size, d_model) * 0.02
        self.position_embedding = np.random.randn(max_seq_len, d_model) * 0.02

        # Stack of decoder blocks
        self.blocks = [
            DecoderBlock(d_model, n_heads, d_ff) for _ in range(n_layers)
        ]

        # Final layer norm
        self.final_norm_gamma = np.ones(d_model)

        # Language model head (projects to vocabulary)
        self.lm_head = np.random.randn(d_model, vocab_size) * 0.02

    def __call__(self, token_ids):
        """
        Forward pass through the decoder.

        Args:
            token_ids: Array of token indices, shape (seq_len,)

        Returns:
            logits: Unnormalized scores over vocabulary, shape (seq_len, vocab_size)
        """
        seq_len = len(token_ids)

        # Get embeddings
        tok_emb = self.token_embedding[token_ids]  # (seq_len, d_model)
        pos_emb = self.position_embedding[:seq_len]  # (seq_len, d_model)

        # Combine embeddings
        x = tok_emb + pos_emb

        # Pass through decoder blocks
        for block in self.blocks:
            x = block(x)

        # Final normalization
        x = rms_norm(x, self.final_norm_gamma)

        # Project to vocabulary
        logits = x @ self.lm_head  # (seq_len, vocab_size)

        return logits

    def num_parameters(self):
        """Count total parameters."""
        embed_params = self.vocab_size * self.d_model  # Token embedding
        embed_params += self.max_seq_len * self.d_model  # Position embedding
        block_params = sum(b.num_parameters() for b in self.blocks)
        norm_params = self.d_model  # Final norm
        head_params = self.d_model * self.vocab_size  # LM head
        return embed_params + block_params + norm_params + head_params
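The cell that builds the model and reports its configuration is not shown; a minimal sketch using the GPTDecoder class above with the hyperparameters listed below (the smoke-test tokens are arbitrary) might be:

Code
model = GPTDecoder(
    vocab_size=10_000,
    max_seq_len=512,
    d_model=256,
    n_heads=8,
    d_ff=1024,
    n_layers=6,
)
print(f"Total parameters: {model.num_parameters():,}")  # 9,980,672

# Quick smoke test: one row of vocabulary logits per input position
token_ids = np.array([42, 100, 256, 789])
logits = model(token_ids)
print(logits.shape)  # (4, 10000)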
Out[12]:
Console
GPT Decoder Configuration
=============================================
  Vocabulary size: 10,000
  Max sequence length: 512
  Model dimension: 256
  Number of heads: 8
  FFN dimension: 1024
  Number of layers: 6
  Total parameters: 9,980,672

The parameter count breaks down into three main components: embeddings (token + position), the decoder stack, and the language model head. With a 10K vocabulary and 256-dimensional embeddings, the token embedding alone accounts for 2.56M parameters. The 6 decoder blocks contribute about 4.7M parameters (roughly 790K each), and the LM head adds another 2.56M for projecting back to vocabulary size.

Out[13]:
Visualization
Horizontal bar chart showing parameter counts for token embedding, position embedding, decoder blocks, and LM head components.
Parameter distribution in our roughly 10M parameter GPT model. The embeddings and language model head together account for just over half of the parameters, while the 6 decoder blocks contribute the rest. This ratio shifts in larger models, where deeper stacks dominate.

This roughly 10M parameter model is tiny by modern standards. GPT-2 Small has 117M parameters, GPT-3 has 175B, and GPT-4 is estimated at over a trillion. But the architectural pattern is identical: stack more layers, increase dimensions, and scale up the training data.

Layer Stacking and Information Flow

Decoder blocks are stacked sequentially. The output of block $l$ becomes the input to block $l + 1$. Each layer refines the representations, building increasingly abstract features.

In[14]:
Code
def trace_layer_outputs(model, token_ids):
    """
    Trace how representations evolve through decoder layers.

    Returns the hidden states after each layer.
    """
    seq_len = len(token_ids)

    # Initial embeddings
    tok_emb = model.token_embedding[token_ids]
    pos_emb = model.position_embedding[:seq_len]
    x = tok_emb + pos_emb

    layer_outputs = [x.copy()]

    # Through each block
    for block in model.blocks:
        x = block(x)
        layer_outputs.append(x.copy())

    # After final norm
    x = rms_norm(x, model.final_norm_gamma)
    layer_outputs.append(x.copy())

    return layer_outputs


# Trace through our model
test_tokens = np.array([42, 100, 256, 789, 1000, 500, 333, 42])
layer_outputs = trace_layer_outputs(model, test_tokens)
Out[15]:
Console
Tracing 8 tokens through 6 layers

Layer output statistics (L2 norm of representations):
--------------------------------------------------
Embedding   : mean norm = 0.445, std = 0.011
Block 1     : mean norm = 22.717, std = 3.942
Block 2     : mean norm = 34.438, std = 4.544
Block 3     : mean norm = 42.757, std = 4.166
Block 4     : mean norm = 51.345, std = 4.582
Block 5     : mean norm = 61.620, std = 5.051
Block 6     : mean norm = 68.161, std = 4.536
Final Norm  : mean norm = 16.000, std = 0.000

The representation norms reveal how the network transforms token embeddings. Starting from the embedding layer with relatively small norms (around 0.45), the representations grow as they pass through successive blocks. This growth reflects the accumulation of information via residual connections: each block adds its contribution to the running representation. The final layer norm rescales the output, ensuring consistent magnitudes before the language model head computes logits.

Out[16]:
Visualization
Line plot showing L2 norm of representations for 8 tokens across embedding layer, 6 decoder blocks, and final normalization.
Representation norm growth through decoder layers. Each line represents one of the 8 input tokens. Norms generally increase through the decoder blocks as residual connections accumulate contributions, then are rescaled by the final layer norm.

As we go deeper, representation norms tend to grow or stabilize depending on initialization and normalization. The final layer norm ensures consistent scale before the language model head.
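The cell that computes the layer-to-layer similarity shown next is not included on the page; one way to derive a cosine-similarity matrix from the traced layer_outputs (for the final token position) could look like this:

Code
def layer_cosine_similarity(layer_outputs, position=-1):
    """Cosine similarity between layer representations at one position."""
    vecs = np.stack([h[position] for h in layer_outputs])  # (n_states, d_model)
    unit = vecs / np.linalg.norm(vecs, axis=-1, keepdims=True)
    return unit @ unit.T                                    # (n_states, n_states)


similarity = layer_cosine_similarity(layer_outputs, position=-1)
print(similarity.shape)  # (8, 8): embedding, 6 blocks, final norm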

Out[17]:
Visualization
Heatmap showing cosine similarity between decoder layer outputs, with diagonal of 1.0 and off-diagonal values showing layer relationships.
Cosine similarity between layer representations for the final token position. Earlier layers show higher similarity; deeper layers diverge as they develop specialized features.

The similarity matrix reveals how representations evolve. Adjacent layers tend to be more similar (the residual connections preserve information), while distant layers may diverge significantly as the model builds higher-level abstractions.

Visualizing Causal Attention Patterns

Each attention head in a decoder develops specialized patterns. Some heads attend strongly to the previous token (local context). Others attend to the beginning of the sequence. Still others develop positional patterns or semantic groupings.

Let's examine attention patterns in a decoder block.

In[18]:
Code
def extract_attention_weights(block, x):
    """
    Extract attention weights from a decoder block.

    Returns attention weights of shape (n_heads, seq_len, seq_len).
    """
    attn = block.attention
    seq_len = x.shape[0]

    # Replicate attention computation to capture weights
    normed = rms_norm(x, block.norm1_gamma)

    Q = normed @ attn.W_q
    K = normed @ attn.W_k

    Q = Q.reshape(seq_len, attn.n_heads, attn.d_k).transpose(1, 0, 2)
    K = K.reshape(seq_len, attn.n_heads, attn.d_k).transpose(1, 0, 2)

    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(attn.d_k)
    scores = scores + create_causal_mask(seq_len)
    weights = softmax(scores, axis=-1)

    return weights


# Extract attention from first block
x_test = np.random.randn(8, 256) * 0.02
attn_weights = extract_attention_weights(model.blocks[0], x_test)
Out[19]:
Visualization
Lower triangular heatmap showing attention weights for head 0 with varying intensity.
Head 0 attention pattern. This head shows a strong diagonal pattern, focusing primarily on recent positions with causal masking creating the characteristic lower-triangular structure.
Lower triangular heatmap showing attention weights for head 1 with different pattern from head 0.
Head 1 attention pattern. Different heads learn different attention strategies. This head may focus on different positional relationships or semantic patterns.

Both heads exhibit the causal structure with zeros in the upper triangle. The intensity patterns within the lower triangle vary between heads, showing how different heads attend to different parts of the available context. With random initialization, these patterns are not yet meaningful, but training shapes them into useful attention strategies.

The Generation Loop

At inference time, the decoder generates text autoregressively. Each iteration:

  1. Processes the current sequence through all decoder blocks
  2. Takes the logits at the final position
  3. Samples or greedily selects the next token
  4. Appends that token to the sequence
  5. Repeats until a stopping condition
In[20]:
Code
def generate(model, prompt_tokens, max_new_tokens=20, temperature=1.0):
    """
    Generate tokens autoregressively.

    Args:
        model: GPTDecoder instance
        prompt_tokens: Initial token indices
        max_new_tokens: Maximum tokens to generate
        temperature: Sampling temperature (higher = more random)

    Returns:
        Complete sequence including prompt and generated tokens
    """
    tokens = list(prompt_tokens)

    for _ in range(max_new_tokens):
        # Get current sequence length
        current_len = len(tokens)
        if current_len >= model.max_seq_len:
            break

        # Forward pass
        token_array = np.array(tokens)
        logits = model(token_array)

        # Get logits for next token prediction (last position)
        next_logits = logits[-1]  # (vocab_size,)

        # Apply temperature
        if temperature != 1.0:
            next_logits = next_logits / temperature

        # Convert to probabilities
        probs = softmax(next_logits)

        # Sample next token
        next_token = np.random.choice(len(probs), p=probs)
        tokens.append(next_token)

        # Simple stop condition: token 0 is end-of-sequence
        if next_token == 0:
            break

    return np.array(tokens)


# Generate from a prompt
prompt = np.array([42, 100, 256])
generated = generate(model, prompt, max_new_tokens=10, temperature=0.8)
Out[21]:
Console
Autoregressive Generation Demo
=============================================
Prompt tokens: [42, 100, 256]
Generated tokens: [42, 100, 256, 8014, 7624, 6249, 8208, 4081, 6138, 2831, 6387, 5384, 8695]
New tokens: 10

The model generated 10 new tokens from the 3-token prompt. Since this is a randomly initialized model (not trained on any text), the generated token IDs are essentially random samples from the vocabulary. A trained model would produce coherent continuations. The temperature of 0.8 makes the distribution slightly sharper than sampling at temperature 1.0, biasing toward higher-probability tokens.

Let's visualize how temperature affects the sampling distribution. Lower temperatures concentrate probability mass on the highest-scoring tokens, while higher temperatures flatten the distribution:
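The plotting cell is not shown; the underlying computation is just a temperature-scaled softmax, sketched below on some illustrative logits (the logit values are assumptions, not from the model above):

Code
example_logits = np.random.randn(50) * 2.0  # illustrative logits over a tiny vocabulary

for temperature in [0.5, 1.0, 2.0]:
    probs = softmax(example_logits / temperature)
    entropy = -(probs * np.log(probs)).sum()
    # Lower temperature -> higher max probability, lower entropy (sharper distribution)
    print(f"T={temperature}: max prob = {probs.max():.3f}, entropy = {entropy:.2f}")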

Out[22]:
Visualization
Line plot showing three probability distributions over token indices with different temperatures, demonstrating how lower temperatures create sharper peaks.
Effect of temperature on token probability distribution. Temperature=0.5 (blue) sharpens the distribution, concentrating probability on high-scoring tokens. Temperature=1.0 (orange) preserves the original logit ratios. Temperature=2.0 (green) flattens the distribution, making unlikely tokens more probable.

At temperature 0.5, the top tokens dominate the distribution, making generation more deterministic. At temperature 2.0, even low-scoring tokens have non-negligible probability, introducing more randomness and diversity. The choice of temperature trades off between coherence (low temperature) and creativity (high temperature).

In practice, generation involves beam search, top-k sampling, nucleus sampling, and other decoding strategies to balance quality and diversity. The decoder architecture itself doesn't change; only the sampling logic varies.
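As an illustration of how such strategies plug into the same loop, here is a hedged sketch of top-k and nucleus (top-p) filtering applied to next-token logits before sampling; the function name and thresholds are illustrative, not from the chapter:

Code
def filter_logits(logits, top_k=None, top_p=None):
    """Return a copy of logits with low-probability tokens masked to -inf."""
    filtered = logits.copy()

    if top_k is not None:
        # Keep only the k highest-scoring tokens
        kth_value = np.sort(filtered)[-top_k]
        filtered[filtered < kth_value] = float("-inf")

    if top_p is not None:
        # Keep the smallest set of tokens whose cumulative probability >= top_p
        probs = softmax(filtered)
        order = np.argsort(probs)[::-1]
        cumulative = np.cumsum(probs[order])
        cutoff = np.searchsorted(cumulative, top_p) + 1
        filtered[order[cutoff:]] = float("-inf")

    return filtered


next_logits = np.random.randn(100)
probs = softmax(filter_logits(next_logits, top_k=10, top_p=0.9))
next_token = np.random.choice(len(probs), p=probs)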

Key-Value Caching for Efficient Generation

A naive implementation recomputes attention for all positions at every generation step. This is wasteful: the representations for tokens $0$ through $t-1$ don't change when we add token $t$. KV caching stores the key and value vectors from previous positions, avoiding redundant computation.

In[23]:
Code
class CausalAttentionWithCache:
    """
    Multi-head attention with KV cache for efficient generation.
    """

    def __init__(self, d_model, n_heads):
        self.d_model = d_model
        self.n_heads = n_heads
        self.d_k = d_model // n_heads

        scale = np.sqrt(2.0 / (d_model + self.d_k))
        self.W_q = np.random.randn(d_model, d_model) * scale
        self.W_k = np.random.randn(d_model, d_model) * scale
        self.W_v = np.random.randn(d_model, d_model) * scale
        self.W_o = np.random.randn(d_model, d_model) * scale

        # Cache for past keys and values
        self.cache_k = None
        self.cache_v = None

    def clear_cache(self):
        """Reset the KV cache."""
        self.cache_k = None
        self.cache_v = None

    def __call__(self, x, use_cache=False):
        """
        Apply attention, optionally using and updating cache.

        When use_cache=True:
        - x should be just the new token(s)
        - Keys/values are appended to cache
        - Queries attend to all cached positions
        """
        seq_len = x.shape[0]

        # Compute Q, K, V for new positions
        Q = x @ self.W_q
        K = x @ self.W_k
        V = x @ self.W_v

        # Reshape for multi-head
        Q = Q.reshape(seq_len, self.n_heads, self.d_k).transpose(1, 0, 2)
        K = K.reshape(seq_len, self.n_heads, self.d_k).transpose(1, 0, 2)
        V = V.reshape(seq_len, self.n_heads, self.d_k).transpose(1, 0, 2)

        if use_cache:
            # Append new K, V to cache
            if self.cache_k is None:
                self.cache_k = K
                self.cache_v = V
            else:
                self.cache_k = np.concatenate([self.cache_k, K], axis=1)
                self.cache_v = np.concatenate([self.cache_v, V], axis=1)

            # Use full cached K, V
            K_full = self.cache_k
            V_full = self.cache_v
        else:
            K_full = K
            V_full = V

        total_len = K_full.shape[1]

        # Compute attention scores
        scores = Q @ K_full.transpose(0, 2, 1) / np.sqrt(self.d_k)

        # Apply causal mask (only for positions we're querying)
        # Shape: (seq_len, total_len)
        mask = np.zeros((seq_len, total_len))
        start_pos = total_len - seq_len
        for i in range(seq_len):
            # Position i in query corresponds to position start_pos + i in full sequence
            full_pos = start_pos + i
            # Can attend to positions 0 to full_pos
            mask[i, full_pos + 1 :] = float("-inf")

        scores = scores + mask

        # Softmax and apply to values
        attn_weights = softmax(scores, axis=-1)
        attn_output = attn_weights @ V_full

        # Reshape and project
        attn_output = attn_output.transpose(1, 0, 2).reshape(
            seq_len, self.d_model
        )
        output = attn_output @ self.W_o

        return output
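The demo cell that produced the output below is not shown; a minimal sketch (assuming a small configuration with d_model=64 and 4 heads, which matches the reported shapes; the seed and input scale are arbitrary) might be:

Code
np.random.seed(0)
attn_cached = CausalAttentionWithCache(d_model=64, n_heads=4)

# Process an initial 5-token "prompt", filling the cache
prompt = np.random.randn(5, 64) * 0.02
_ = attn_cached(prompt, use_cache=True)
print("Cache K shape after init:", attn_cached.cache_k.shape)  # (4, 5, 16)

# Process one new token; only its K/V are computed and appended
new_token = np.random.randn(1, 64) * 0.02
out = attn_cached(new_token, use_cache=True)
print("Cache K shape after addition:", attn_cached.cache_k.shape)  # (4, 6, 16)
print("Output shape for new token:", out.shape)                    # (1, 64)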
Out[24]:
Console
KV Cache Demo
=============================================
Initial sequence: 5 tokens
Cache K shape after init: (4, 5, 16)
Added 1 token
Cache K shape after addition: (4, 6, 16)
Output shape for new token: (1, 64)

The cache grows from shape (4, 5, 16) to (4, 6, 16) after adding one token. Here, 4 is the number of heads, the middle dimension is sequence length (growing from 5 to 6), and 16 is the head dimension (64 / 4 heads). When processing the new token, we only compute queries for that single position, but the keys and values span all 6 positions. This asymmetry is what makes caching efficient: the expensive key-value computation for old tokens is reused.

With KV caching, generating $n$ new tokens requires $O(n \cdot L)$ attention compute instead of roughly $O(n \cdot L^2)$, where:

  • $n$: the number of new tokens to generate
  • $L$: the total sequence length (prompt plus generated tokens)
  • $O(n \cdot L)$: linear scaling because each new token only computes attention against the cached keys and values
  • $O(n \cdot L^2)$: the cost without caching, since every generation step reprocesses all positions and recomputes the full attention matrix

This optimization is critical for efficient inference in production systems. For a 1000-token prompt with 100 generated tokens, caching avoids reprocessing the entire prompt at every step, reducing attention computation by roughly a factor of the sequence length.

Decoder Architecture Comparison

Different models use slightly different decoder designs. The core pattern is consistent, but details vary:

Decoder architecture comparison across popular language models. All models share the same fundamental structure but differ in scale and specific design choices.
| Model | Layers | Heads | $d_{\text{model}}$ | FFN Ratio | Norm | Notable Features |
| --- | --- | --- | --- | --- | --- | --- |
| GPT-2 Small | 12 | 12 | 768 | 4x | Layer Norm | Learned position embeddings |
| GPT-2 Medium | 24 | 16 | 1024 | 4x | Layer Norm | Pre-LN architecture |
| GPT-3 | 96 | 96 | 12288 | 4x | Layer Norm | Sparse attention in some heads |
| Llama 2 7B | 32 | 32 | 4096 | 2.7x | RMSNorm | RoPE, SwiGLU FFN |
| Llama 3 8B | 32 | 32 | 4096 | 3.5x | RMSNorm | Grouped-query attention |
| Mistral 7B | 32 | 32 | 4096 | 3.5x | RMSNorm | Sliding window attention |

Modern decoders favor RMSNorm over LayerNorm for efficiency, SwiGLU or GeGLU activations in the FFN for better performance, and positional encodings like RoPE that generalize to longer sequences. Grouped-query attention reduces the KV cache size by sharing key-value projections across groups of heads.
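To make the SwiGLU variant concrete, here is a hedged sketch of a gated feed-forward layer of the kind used in Llama-family models; the class name, initialization, and exact expansion factor are illustrative assumptions, and real implementations differ in details:

Code
def silu(x):
    """SiLU (swish) activation: x * sigmoid(x)."""
    return x / (1 + np.exp(-x))


class SwiGLUFeedForward:
    """Gated FFN: (silu(x W_gate) * (x W_up)) W_down."""

    def __init__(self, d_model, d_ff):
        scale = np.sqrt(2.0 / (d_model + d_ff))
        self.W_gate = np.random.randn(d_model, d_ff) * scale
        self.W_up = np.random.randn(d_model, d_ff) * scale
        self.W_down = np.random.randn(d_ff, d_model) * scale

    def __call__(self, x):
        # The gate modulates the up-projection before projecting back down
        return (silu(x @ self.W_gate) * (x @ self.W_up)) @ self.W_down


ffn = SwiGLUFeedForward(d_model=256, d_ff=int(256 * 8 / 3))  # ~2.7x expansion
print(ffn(np.random.randn(10, 256)).shape)  # (10, 256)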

When to Use Decoder-Only Models

Decoder-only architectures excel at:

  • Text generation: Stories, code, conversations, completions
  • Language modeling: Predicting the next token given context
  • In-context learning: Few-shot learning through prompting
  • General-purpose AI assistants: Handling diverse tasks through natural language

They are less naturally suited for:

  • Bidirectional understanding: Tasks like named entity recognition or question answering over a document, where seeing the full context helps
  • Sequence-to-sequence with misaligned lengths: Machine translation where the source and target have different structures
  • Fixed-length encoding: Producing a single vector representation of a document

For these tasks, encoder-only (like BERT) or encoder-decoder models (like T5) may be more appropriate. However, large decoder-only models trained with instruction tuning have proven surprisingly capable across task types, blurring these traditional boundaries.

Limitations and Impact

The decoder-only architecture has transformed language AI, but it comes with inherent constraints that shape both its capabilities and its failure modes.

The autoregressive nature means generation is inherently sequential at inference time. Each new token depends on all previous tokens, preventing parallel generation. While KV caching mitigates redundant computation, the fundamental serial dependency remains. For real-time applications, this creates latency constraints that scale with output length. Techniques like speculative decoding, where a smaller model drafts tokens that a larger model verifies in parallel, partially address this, but the sequential bottleneck persists.

Causal masking also limits the model's understanding of each position. When processing position $t$, the decoder cannot consider positions $t+1, t+2, \ldots$ even though, for understanding tasks, that future context would be valuable. Bidirectional models like BERT can leverage the full context for each position, giving them an advantage for tasks like classification or extraction. Decoder-only models compensate through scale and training data, but the architectural constraint remains.

Despite these limitations, the decoder-only paradigm has proven remarkably versatile. The GPT series demonstrated that next-token prediction at scale produces emergent capabilities in reasoning, code generation, and multilingual understanding. The simplicity of the architecture, a single stack of nearly identical blocks, enables efficient scaling and has become the foundation for modern generative AI. From GPT-2's surprising text generation to GPT-4's multimodal reasoning, the decoder architecture continues to define the frontier of language AI.

Key Parameters

When configuring a decoder-only transformer, the following parameters have the greatest impact on model capacity and performance:

  • d_model (model dimension): The dimensionality of token representations throughout the network. Common values range from 256 (small models) to 12288 (GPT-3 scale). Must be divisible by n_heads. Larger values increase capacity but quadratically increase attention computation.

  • n_heads (number of attention heads): How many parallel attention patterns the model can learn. Typically set so that d_model / n_heads yields head dimensions of 64-128. More heads enable diverse attention patterns but have diminishing returns beyond a point.

  • n_layers (number of decoder blocks): The depth of the transformer stack. Deeper models learn more abstract representations. GPT-2 Small uses 12 layers; GPT-3 uses 96. Depth trades off against width for a given parameter budget.

  • d_ff (feed-forward hidden dimension): The expansion dimension in the FFN sublayer. Traditionally 4x d_model, but modern models use 2.7x-3.5x with gated activations like SwiGLU. Larger FFN dimensions increase the network's non-linear transformation capacity.

  • vocab_size: The number of tokens in the vocabulary. Larger vocabularies reduce sequence length for the same text but increase embedding and LM head parameters. Typical values range from about 50K (GPT-2) to 128K (Llama 3).

  • max_seq_len: Maximum sequence length the model can process. Affects position embedding size and memory requirements during training. Modern models support 2K-128K tokens through efficient attention mechanisms.

  • temperature (generation): Controls randomness during sampling. Values below 1.0 sharpen the distribution (more deterministic); values above 1.0 flatten it (more random). Common range is 0.7-1.0 for generation tasks.

Summary

The decoder architecture powers modern language generation by combining self-attention with causal masking to enforce left-to-right information flow. Key takeaways:

  • Causal masking prevents each position from attending to future positions, matching the autoregressive generation process
  • Decoder blocks consist of masked self-attention and feed-forward layers, each with residual connections and normalization
  • GPT-style models stack decoder blocks between embedding and language model head layers, forming a unified architecture for understanding and generation
  • Layer stacking builds increasingly abstract representations, with each layer refining outputs from the previous layer
  • KV caching enables efficient generation by storing key-value pairs from previous positions
  • Generation proceeds token by token, with each new token conditioned on all previous ones through the attention mechanism

The decoder-only design has become the default architecture for large language models, offering simplicity, scalability, and surprising generality. While encoder and encoder-decoder architectures retain advantages for specific tasks, the decoder's unified approach to understanding and generation has proven remarkably effective across the spectrum of language AI applications.

