Pre-Norm vs Post-Norm: Choosing Layer Normalization Placement for Training Stability

Michael Brenndoerfer · Updated June 10, 2025 · 36 min read

Explore how moving layer normalization before the sublayer (pre-norm) rather than after (post-norm) enables stable training of deep transformers like GPT and LLaMA.

Pre-Norm vs Post-Norm

The original transformer paper placed layer normalization after the residual connection, a design choice known as post-norm. This seemed natural: apply the sublayer, add the residual, then normalize the combined result. But as researchers pushed transformers to greater depths, they discovered that this ordering creates training instabilities that become severe in very deep networks. The solution was elegantly simple: move the normalization before the sublayer. This pre-norm formulation has become the default in modern architectures like GPT and LLaMA, enabling stable training of models with hundreds of layers.

Understanding the difference between pre-norm and post-norm is essential for implementing transformers, diagnosing training issues, and choosing architectures. The choice affects not just stability but also the final model's behavior, with subtle trade-offs that practitioners should understand.

The Original Transformer: Post-Norm

The 2017 "Attention Is All You Need" paper introduced what we now call the post-norm configuration. In this design, each sublayer (attention or feed-forward) follows a specific pattern: apply the sublayer transformation, add the residual, then normalize.

Post-Norm Formulation

In post-norm transformers, layer normalization is applied after the residual addition. For a sublayer function $\text{Sublayer}(\cdot)$, the output is:

$$\mathbf{x}' = \text{LayerNorm}(\mathbf{x} + \text{Sublayer}(\mathbf{x}))$$

where:

  • $\mathbf{x} \in \mathbb{R}^d$: the input representation (a $d$-dimensional vector for each position)
  • $\text{Sublayer}(\mathbf{x})$: the sublayer transformation (either self-attention or feed-forward network)
  • $\mathbf{x} + \text{Sublayer}(\mathbf{x})$: the residual connection, adding the original input to the sublayer output
  • $\text{LayerNorm}(\cdot)$: layer normalization applied to the combined signal
  • $\mathbf{x}' \in \mathbb{R}^d$: the output of the block, which becomes the input to the next block

The intuition behind post-norm is straightforward. The sublayer computes its transformation, the residual connection preserves the original signal, and layer normalization ensures the combined output has stable statistics. Each component does one job, and they compose cleanly.

Let's implement a post-norm transformer block to see this in action:

In[2]:
Code
import numpy as np


def layer_norm(x, gamma, beta, eps=1e-5):
    """
    Apply layer normalization.

    Args:
        x: Input of shape (seq_len, d_model)
        gamma: Scale parameter of shape (d_model,)
        beta: Shift parameter of shape (d_model,)
        eps: Small constant for numerical stability

    Returns:
        Normalized output of shape (seq_len, d_model)
    """
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_norm = (x - mean) / np.sqrt(var + eps)
    return gamma * x_norm + beta


def post_norm_block(x, sublayer_fn, gamma, beta):
    """
    Post-norm block: Sublayer -> Add -> Norm

    Args:
        x: Input tensor
        sublayer_fn: Function computing the sublayer transformation
        gamma, beta: LayerNorm parameters

    Returns:
        Output after post-norm block
    """
    # Apply sublayer
    sublayer_output = sublayer_fn(x)

    # Add residual
    residual_sum = x + sublayer_output

    # Normalize
    output = layer_norm(residual_sum, gamma, beta)

    return output

The ordering is explicit: sublayer first, residual addition second, normalization third. This creates a clean signal flow where the normalized output is always what enters the next layer.

In[3]:
Code
# Demonstrate post-norm with a simple sublayer
np.random.seed(42)
seq_len, d_model = 4, 8

# Input and normalization parameters
x = np.random.randn(seq_len, d_model)
gamma = np.ones(d_model)
beta = np.zeros(d_model)

# Simple sublayer (simulating attention or FFN)
W = np.random.randn(d_model, d_model) * 0.1


def simple_sublayer(x):
    return np.tanh(x @ W)


# Apply post-norm block
output_post = post_norm_block(x, simple_sublayer, gamma, beta)
Out[4]:
Console
Post-Norm Block Statistics
========================================
Input mean:  -0.1373, std: 0.9311
Output mean: 0.0000, std: 1.0000

The output has approximately zero mean and unit variance (std ≈ 1), confirming that layer normalization has done its job. Regardless of what the sublayer computes or how the residual shifts the distribution, the final normalization step ensures stable statistics. This normalized output is what enters the next block in the network, preventing activation magnitudes from growing unboundedly.

The Pre-Norm Alternative

The pre-norm configuration emerged from research on training very deep networks. Instead of normalizing after the residual, pre-norm normalizes before applying the sublayer:

Pre-Norm Formulation

In pre-norm transformers, layer normalization is applied before the sublayer. For a sublayer function $\text{Sublayer}(\cdot)$, the output is:

$$\mathbf{x}' = \mathbf{x} + \text{Sublayer}(\text{LayerNorm}(\mathbf{x}))$$

where:

  • $\mathbf{x} \in \mathbb{R}^d$: the input representation (a $d$-dimensional vector for each position)
  • $\text{LayerNorm}(\mathbf{x})$: layer normalization applied to the input before the sublayer
  • $\text{Sublayer}(\cdot)$: the sublayer transformation (either self-attention or feed-forward network), operating on the normalized input
  • $\mathbf{x} + \cdots$: the residual connection, adding the original (unnormalized) input to the sublayer output
  • $\mathbf{x}' \in \mathbb{R}^d$: the output of the block, which is not normalized

Notice the key difference: normalization happens inside the residual branch, not after the addition. This seemingly minor change has profound implications for gradient flow.

In[5]:
Code
def pre_norm_block(x, sublayer_fn, gamma, beta):
    """
    Pre-norm block: Norm -> Sublayer -> Add

    Args:
        x: Input tensor
        sublayer_fn: Function computing the sublayer transformation
        gamma, beta: LayerNorm parameters

    Returns:
        Output after pre-norm block
    """
    # Normalize first
    x_norm = layer_norm(x, gamma, beta)

    # Apply sublayer to normalized input
    sublayer_output = sublayer_fn(x_norm)

    # Add residual (to original x, not normalized x)
    output = x + sublayer_output

    return output

The ordering change is subtle but crucial: normalize, apply sublayer, add residual. The residual connection bypasses both the normalization and the sublayer, creating a direct gradient path from output to input.

In[6]:
Code
# Apply pre-norm block to the same input
output_pre = pre_norm_block(x, simple_sublayer, gamma, beta)
Out[7]:
Console
Pre-Norm Block Statistics
========================================
Input mean:  -0.1373, std: 0.9311
Output mean: -0.1117, std: 0.9280

Notice the key difference: the output statistics differ from the normalized values we saw with post-norm. The output is not normalized because the residual adds the original (unnormalized) input to the sublayer output. This accumulation behavior becomes important when we stack many blocks, as activations can grow in magnitude through the network.

Visualizing the Difference

The structural difference between pre-norm and post-norm becomes clearer when we visualize the computation graphs:

Out[8]:
Visualization
Block diagram showing post-norm flow: input to sublayer to add to layer norm to output.
Post-norm: The input flows through the sublayer, combines with the residual, then gets normalized. Layer normalization is the last operation.
Block diagram showing pre-norm flow: input to layer norm to sublayer to add with original input to output.
Pre-norm: The input is normalized before the sublayer. The residual connection adds the original input, bypassing normalization entirely.

The diagrams reveal the fundamental difference. In post-norm, the residual enters the add block alongside the sublayer output, and layer normalization processes their sum. In pre-norm, the residual bypasses both layer normalization and the sublayer, creating a direct path from input to output.

Training Stability: The Gradient Flow Perspective

The practical motivation for pre-norm comes from training stability. While the forward pass differences we've seen are instructive, the real story unfolds during training, when gradients must flow backward through potentially hundreds of layers. Understanding this gradient flow explains why pre-norm enables stable training of very deep networks while post-norm struggles.

The Vanishing Gradient Problem in Deep Networks

Training neural networks requires computing how changes in each parameter affect the final loss. We do this through backpropagation: starting from the loss, we work backward through the network, computing gradients layer by layer using the chain rule.

Here's the challenge. The chain rule multiplies gradients together. If you have 100 layers and each layer slightly shrinks the gradient (say, by a factor of 0.99), the gradient reaching the first layer is $0.99^{100} \approx 0.37$ of what it started as. If each layer shrinks it by 0.95, you get $0.95^{100} \approx 0.006$, and the earliest layers barely learn anything. This is the vanishing gradient problem.

The converse, exploding gradients, happens when each layer slightly amplifies the gradient. Even small amplification factors compound exponentially, leading to numerical overflow and unstable training.
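
A quick back-of-the-envelope calculation makes this compounding concrete. The sketch below uses illustrative per-layer decay factors, not measurements from a real network, and simply raises them to typical depths:

# Illustrative per-layer gradient scaling factors (assumed for demonstration)
decay_factors = [0.99, 0.98, 0.95]
depths = [12, 24, 100]

for factor in decay_factors:
    survived = ", ".join(f"{d} layers -> {factor**d:.3f}" for d in depths)
    print(f"Per-layer factor {factor:.2f}: {survived}")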

Out[9]:
Visualization
Line plot showing exponential decay of gradient magnitude for different per-layer scaling factors (0.99, 0.98, 0.95) across 100 layers.
Exponential gradient decay across network depth. Even small per-layer decay factors compound dramatically: a 5% reduction per layer leaves only 0.6% of the original gradient after 100 layers. The dashed lines mark typical transformer depths (12 and 24 layers).

The visualization makes the exponential decay visceral. Even at the modest depth of 12 layers (BERT's configuration), a 2% per-layer reduction leaves only about 78% of the gradient. At 100 layers, a 5% reduction per layer is catastrophic, leaving less than 1% of the original gradient signal.

Residual connections were designed to address this problem by providing a direct path for gradients. But as we'll see, where you place the layer normalization determines whether this path remains truly direct.

Deriving the Gradient Flow

To understand the stability difference, we need to trace how gradients flow through each block type. Let's denote:

  • $L$: the loss function we're minimizing
  • $\mathbf{x}$: the input to a transformer block
  • $\mathbf{x}'$: the output of the block
  • $\frac{\partial L}{\partial \mathbf{x}'}$: the gradient arriving from subsequent layers (we receive this during backpropagation)
  • $\frac{\partial L}{\partial \mathbf{x}}$: the gradient we need to compute and pass to earlier layers

The chain rule tells us that $\frac{\partial L}{\partial \mathbf{x}} = \frac{\partial L}{\partial \mathbf{x}'} \cdot \frac{\partial \mathbf{x}'}{\partial \mathbf{x}}$. The key question is: what does $\frac{\partial \mathbf{x}'}{\partial \mathbf{x}}$ look like for each block type?

Gradient Flow in Post-Norm

Recall the post-norm formulation: $\mathbf{x}' = \text{LayerNorm}(\mathbf{x} + \text{Sublayer}(\mathbf{x}))$. To compute the gradient, we apply the chain rule from outside to inside:

  1. First, the gradient passes through LayerNorm
  2. Then, it splits into two paths: through the residual ($+\mathbf{x}$) and through the sublayer

Mathematically, this gives:

$$\frac{\partial L}{\partial \mathbf{x}} = \frac{\partial L}{\partial \mathbf{x}'} \cdot \frac{\partial \text{LayerNorm}}{\partial (\mathbf{x} + \text{Sublayer}(\mathbf{x}))} \cdot \left(1 + \frac{\partial \text{Sublayer}}{\partial \mathbf{x}}\right)$$

where:

  • $\frac{\partial L}{\partial \mathbf{x}}$: the gradient of the loss with respect to the block input, which we want to compute
  • $\frac{\partial L}{\partial \mathbf{x}'}$: the gradient of the loss with respect to the block output (provided by the next layer during backpropagation)
  • $\frac{\partial \text{LayerNorm}}{\partial (\cdot)}$: the Jacobian of the layer normalization operation, a matrix that describes how small changes in the input to LayerNorm affect its output
  • $1 + \frac{\partial \text{Sublayer}}{\partial \mathbf{x}}$: the combined contribution from the residual path (the $1$) and the sublayer path

Notice the structure: the LayerNorm Jacobian appears as a multiplicative factor in the main gradient path. This is the crux of the instability problem. Even though LayerNorm has reasonably well-behaved gradients for a single layer (typically with eigenvalues close to 1), stacking many post-norm blocks means multiplying many of these Jacobians together. Small deviations from 1 compound across layers, leading to vanishing or exploding gradients in deep networks.

Gradient Flow in Pre-Norm

Now consider pre-norm: $\mathbf{x}' = \mathbf{x} + \text{Sublayer}(\text{LayerNorm}(\mathbf{x}))$. The structural difference is subtle but crucial. Applying the chain rule:

  1. First, the gradient splits at the addition: one path goes through the residual ($+\mathbf{x}$), another through the sublayer
  2. Only the sublayer path passes through LayerNorm

This yields:

$$\frac{\partial L}{\partial \mathbf{x}} = \frac{\partial L}{\partial \mathbf{x}'} \cdot \left(1 + \frac{\partial \text{Sublayer}}{\partial \text{LayerNorm}(\mathbf{x})} \cdot \frac{\partial \text{LayerNorm}}{\partial \mathbf{x}}\right)$$

where:

  • $\frac{\partial L}{\partial \mathbf{x}}$: the gradient of the loss with respect to the block input
  • $\frac{\partial L}{\partial \mathbf{x}'}$: the gradient of the loss with respect to the block output
  • $1$: the gradient contribution from the direct residual path (since $\frac{\partial \mathbf{x}}{\partial \mathbf{x}} = 1$)
  • $\frac{\partial \text{Sublayer}}{\partial \text{LayerNorm}(\mathbf{x})}$: the Jacobian of the sublayer with respect to its (normalized) input
  • $\frac{\partial \text{LayerNorm}}{\partial \mathbf{x}}$: the Jacobian of layer normalization

The Gradient Highway

The key insight lies in that $+1$ term inside the parentheses. In pre-norm, the gradient from the output $\frac{\partial L}{\partial \mathbf{x}'}$ has a direct, unmodified path back to the input. This path bypasses both the sublayer and the layer normalization entirely.

Think of it as a gradient highway: no matter what happens in the sublayer (vanishing activations, saturated nonlinearities, poorly conditioned weight matrices), the gradient can always flow back through the residual connection. The $+1$ term ensures that at minimum, the full gradient $\frac{\partial L}{\partial \mathbf{x}'}$ reaches the input.

In post-norm, this highway doesn't exist in the same form. The LayerNorm Jacobian gates all gradient flow, including the residual path. If these Jacobians systematically shrink or grow gradients even slightly, the effects compound across many layers.
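
We can check this structure numerically before looking at its empirical consequences. The sketch below is not part of the original derivation: it introduces a small helper, jvp_finite_diff, and reuses the layer_norm, post_norm_block, pre_norm_block, simple_sublayer, gamma, and beta objects defined earlier to estimate how a perturbation of the block input propagates to the output. For pre-norm, the result is the perturbation itself plus a sublayer correction, so the gap to the identity path measures how much the sublayer modifies the gradient; for post-norm there is no such decomposition, because everything passes through the LayerNorm Jacobian.

def jvp_finite_diff(block_fn, x_in, v, eps=1e-5):
    """Finite-difference estimate of the block's Jacobian-vector product."""
    return (block_fn(x_in + eps * v) - block_fn(x_in)) / eps


# Random unit-norm perturbation of the block input from earlier
np.random.seed(0)
v = np.random.randn(*x.shape)
v = v / np.linalg.norm(v)

jvp_post = jvp_finite_diff(
    lambda z: post_norm_block(z, simple_sublayer, gamma, beta), x, v
)
jvp_pre = jvp_finite_diff(
    lambda z: pre_norm_block(z, simple_sublayer, gamma, beta), x, v
)

print(f"Post-norm ||Jv||: {np.linalg.norm(jvp_post):.3f}")
print(f"Pre-norm  ||Jv||: {np.linalg.norm(jvp_pre):.3f}")
print(f"Pre-norm  ||Jv - v|| (gap to identity path): {np.linalg.norm(jvp_pre - v):.3f}")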

This mathematical property explains the empirical observations:

  • Pre-norm networks train stably even at 100+ layers
  • Post-norm networks require careful warmup schedules and smaller learning rates
  • Pre-norm tolerates larger learning rates without diverging
Out[10]:
Visualization
Heatmap showing gradient flow paths in post-norm, with LayerNorm Jacobian gating all paths.
Post-norm gradient paths: All gradients must pass through the LayerNorm Jacobian (shown in pink). There is no direct path from output to input, so every gradient signal is modulated.
Heatmap showing gradient flow paths in pre-norm, with a clear diagonal identity path.
Pre-norm gradient paths: The identity path (dark blue diagonal) provides a direct gradient highway. Gradients can flow unimpeded from any layer's output to any earlier layer's input.

The heatmaps reveal the structural difference in gradient flow. In post-norm, gradient strength decays as it travels backward through layers, with no escape from the cascading LayerNorm Jacobians. In pre-norm, the bright diagonal represents the gradient highway: full-strength gradients flowing directly from any layer's output to its input, bypassing all intermediate transformations.

Let's simulate this difference with a numerical experiment:

In[11]:
Code
def simulate_gradient_flow(n_layers, norm_type="pre"):
    """
    Simulate gradient magnitude through stacked transformer blocks.

    Args:
        n_layers: Number of transformer blocks to stack
        norm_type: 'pre' or 'post'

    Returns:
        List of gradient magnitudes at each layer
    """
    np.random.seed(42)
    d_model = 64

    # Initialize parameters for each layer
    sublayer_weights = [
        np.random.randn(d_model, d_model) * 0.02 for _ in range(n_layers)
    ]
    gamma = np.ones(d_model)
    beta = np.zeros(d_model)

    # Forward pass: track activations
    x = np.random.randn(1, d_model)
    activations = [x.copy()]

    for i in range(n_layers):
        W = sublayer_weights[i]

        if norm_type == "pre":
            x_norm = layer_norm(x, gamma, beta)
            sublayer_out = np.tanh(x_norm @ W)
            x = x + sublayer_out
        else:  # post
            sublayer_out = np.tanh(x @ W)
            x = layer_norm(x + sublayer_out, gamma, beta)

        activations.append(x.copy())

    # Backward pass: track gradient magnitudes
    # Start with unit gradient at output
    grad = np.ones_like(x)
    gradient_magnitudes = [np.linalg.norm(grad)]

    for i in range(n_layers - 1, -1, -1):
        W = sublayer_weights[i]
        x_prev = activations[i]

        if norm_type == "pre":
            # Simplified gradient (actual would involve LN Jacobian)
            # The key point is the +1 from residual
            x_norm = layer_norm(x_prev, gamma, beta)
            tanh_deriv = 1 - np.tanh(x_norm @ W) ** 2
            sublayer_grad = (grad * tanh_deriv) @ W.T

            # Gradient through residual (direct path)
            residual_grad = grad.copy()

            # Combine (simplified: ignoring LN Jacobian details)
            grad = residual_grad + sublayer_grad * 0.5  # LN scaling effect
        else:
            # Post-norm: the combined gradient must pass through the LayerNorm
            # Jacobian; approximate its effect as a mild per-layer damping
            tanh_deriv = 1 - np.tanh(x_prev @ W) ** 2
            sublayer_grad = (grad * tanh_deriv) @ W.T
            grad = (grad + sublayer_grad) * 0.95  # approximate LN Jacobian effect

        gradient_magnitudes.append(np.linalg.norm(grad))

    return gradient_magnitudes[::-1]


# Compare gradient flow for different depths
depths = [6, 12, 24, 48]
pre_norm_grads = {d: simulate_gradient_flow(d, "pre") for d in depths}
post_norm_grads = {d: simulate_gradient_flow(d, "post") for d in depths}
Out[12]:
Visualization
Line plot showing stable gradient magnitudes across layers for pre-norm at various depths.
Pre-norm gradient magnitudes remain stable across layers and depths. The direct residual path ensures gradients don't vanish even in very deep networks.
Line plot showing declining gradient magnitudes in deeper post-norm networks.
Post-norm gradient magnitudes can decay over many layers. Deeper networks show more pronounced gradient reduction, potentially causing training difficulties.

The visualization illustrates the stability difference. Pre-norm maintains consistent gradient magnitudes across all depths because the residual connection provides a direct gradient path. Post-norm gradients tend to decay in deeper networks, making optimization more challenging.

Output Scale: An Important Trade-off

Pre-norm's stability advantage comes with a trade-off: the output of each block accumulates without normalization. This can lead to growing activation magnitudes as we stack more layers.

Let's measure this effect:

In[13]:
Code
def measure_activation_growth(n_layers, norm_type="pre"):
    """
    Measure how activation magnitudes evolve through stacked blocks.
    """
    np.random.seed(42)
    d_model = 64

    x = np.random.randn(4, d_model)  # 4 tokens
    x = x / np.linalg.norm(x, axis=-1, keepdims=True)  # Normalize input

    gamma = np.ones(d_model)
    beta = np.zeros(d_model)

    magnitudes = [np.mean(np.linalg.norm(x, axis=-1))]

    for i in range(n_layers):
        np.random.seed(42 + i)
        W = np.random.randn(d_model, d_model) * 0.1

        if norm_type == "pre":
            x_norm = layer_norm(x, gamma, beta)
            sublayer_out = np.tanh(x_norm @ W)
            x = x + sublayer_out
        else:
            sublayer_out = np.tanh(x @ W)
            x = layer_norm(x + sublayer_out, gamma, beta)

        magnitudes.append(np.mean(np.linalg.norm(x, axis=-1)))

    return magnitudes


# Compare activation growth
n_layers = 24
pre_magnitudes = measure_activation_growth(n_layers, "pre")
post_magnitudes = measure_activation_growth(n_layers, "post")
Out[14]:
Visualization
Line plot comparing activation magnitudes for pre-norm (growing) and post-norm (constant) across 24 layers.
Activation magnitude across layers. Pre-norm (blue) shows gradual accumulation as residuals add up without normalization. Post-norm (orange) maintains constant magnitude due to normalization after each block.
Out[15]:
Console
Activation Magnitude Summary
========================================
Pre-Norm:  Start=1.00, End=19.48, Growth=19.48x
Post-Norm: Start=1.00, End=8.00, Growth=8.00x

The growth in pre-norm isn't necessarily problematic, as the final layer normalization (applied before the output projection in most architectures) handles the accumulated magnitude. However, it does mean that numerical precision matters more in very deep pre-norm networks.

To understand where this growth comes from, we can decompose the output into contributions from the original input and each sublayer:
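
The code behind the figure below isn't shown, but a decomposition along these lines can be sketched as follows, assuming the same toy pre-norm stack used in measure_activation_growth: keep track of the input's magnitude and record each sublayer's output magnitude as it is added to the residual stream.

def decompose_pre_norm_contributions(n_layers=12, d_model=64, seed=42):
    """Track the input's magnitude and each sublayer's additive contribution."""
    np.random.seed(seed)
    x = np.random.randn(4, d_model)
    x = x / np.linalg.norm(x, axis=-1, keepdims=True)  # unit-norm input rows

    gamma = np.ones(d_model)
    beta = np.zeros(d_model)

    input_contribution = np.mean(np.linalg.norm(x, axis=-1))
    sublayer_contributions = []

    for i in range(n_layers):
        np.random.seed(seed + i)
        W = np.random.randn(d_model, d_model) * 0.1
        sublayer_out = np.tanh(layer_norm(x, gamma, beta) @ W)
        sublayer_contributions.append(
            np.mean(np.linalg.norm(sublayer_out, axis=-1))
        )
        x = x + sublayer_out  # residual accumulation, never re-normalized

    return input_contribution, sublayer_contributions


inp_mag, contribs = decompose_pre_norm_contributions()
print(f"Input contribution: {inp_mag:.2f}")
print("Sublayer contributions:", [round(c, 2) for c in contribs])

By the triangle inequality, the input contribution plus the per-sublayer contributions is an upper bound on the total magnitude; the actual norm grows somewhat more slowly because the contributions are not perfectly aligned.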

Out[16]:
Visualization
Stacked bar chart showing the decomposition of activation magnitude into input and sublayer contributions across 12 layers.
Residual contribution decomposition in pre-norm. Each bar shows how the total activation magnitude is built up from the original input (blue) and successive sublayer contributions. The input signal persists through all layers while sublayer contributions accumulate.

The decomposition reveals the structure of pre-norm's activation growth. The original input signal (dark blue) persists unchanged through all layers, which is the residual connection at work. Each sublayer adds its own contribution on top, and these contributions accumulate. The total magnitude (red line) grows steadily, but this growth is predictable and well-behaved, making it easy to compensate for with a final normalization.

Modern Architectures: The Pre-Norm Consensus

The empirical evidence strongly favors pre-norm for training stability, especially in large models. Let's examine how major architectures make this choice:

Normalization placement in major transformer architectures. Modern large models (GPT, LLaMA, PaLM) use pre-norm for training stability.
Architecture | Norm Placement | Notes
Original Transformer | Post-Norm | First formulation, requires warmup
GPT-2, GPT-3 | Pre-Norm | Enabled stable training of large models
BERT | Post-Norm | Relatively shallow (12/24 layers)
RoBERTa | Post-Norm | Follows BERT architecture
T5 | Pre-Norm | Stable training for encoder-decoder
LLaMA, LLaMA 2 | Pre-Norm | Uses RMSNorm variant
PaLM | Pre-Norm | Parallel attention variant
GPT-4 (likely) | Pre-Norm | Standard for modern LLMs

The pattern is clear: newer and larger models overwhelmingly choose pre-norm. The stability benefits outweigh any theoretical concerns about activation accumulation.

The Final Layer Norm

Pre-norm architectures typically add a final layer normalization after all transformer blocks but before the output projection. This serves two purposes: it normalizes the accumulated activations, and it provides a consistent interface for the output layer.

In[17]:
Code
def pre_norm_transformer(x, n_layers, d_model, d_ff):
    """
    Complete pre-norm transformer stack with final normalization.

    Args:
        x: Input embeddings of shape (seq_len, d_model)
        n_layers: Number of transformer blocks
        d_model: Model dimension
        d_ff: Feed-forward hidden dimension

    Returns:
        Output representations of shape (seq_len, d_model)
    """
    np.random.seed(42)

    gamma = np.ones(d_model)
    beta = np.zeros(d_model)

    for i in range(n_layers):
        # Pre-norm attention block
        x_norm = layer_norm(x, gamma, beta)
        # Simplified attention (just a projection for demonstration)
        W_attn = np.random.randn(d_model, d_model) * 0.02
        attn_out = x_norm @ W_attn
        x = x + attn_out

        # Pre-norm feed-forward block
        x_norm = layer_norm(x, gamma, beta)
        W1 = np.random.randn(d_model, d_ff) * 0.02
        W2 = np.random.randn(d_ff, d_model) * 0.02
        ff_out = np.maximum(0, x_norm @ W1) @ W2  # ReLU activation
        x = x + ff_out

    # Final layer normalization
    x = layer_norm(x, gamma, beta)

    return x
In[18]:
Code
# Test the complete stack
np.random.seed(42)
x_input = np.random.randn(4, 64)
x_output = pre_norm_transformer(x_input, n_layers=12, d_model=64, d_ff=256)
Out[19]:
Console
Pre-Norm Transformer Stack
========================================
Input shape:  (4, 64)
Output shape: (4, 64)
Input magnitude:  7.78
Output magnitude: 8.00

Despite passing through 12 transformer blocks (each adding residual contributions), the final output magnitude remains controlled. The final layer normalization ensures that regardless of how activations accumulated through the network, the output has well-controlled statistics suitable for the output projection. This design pattern is why pre-norm architectures include a final LayerNorm before the language modeling head.

Initialization Considerations

The choice between pre-norm and post-norm affects how we should initialize the network. Post-norm requires more careful initialization because gradients must flow through the normalization layers.

For pre-norm, a common approach is to scale the output projection of each sublayer by a factor related to the network depth:

In[20]:
Code
def initialize_pre_norm_weights(d_model, n_layers):
    """
    Initialize weights for a pre-norm transformer with proper scaling.

    Following the GPT-2 approach: scale residual connections by 1/sqrt(2*n_layers)
    """
    # Standard initialization
    std = 0.02

    # Residual scaling factor
    residual_scale = 1 / np.sqrt(2 * n_layers)

    # Output projections that write into the residual stream are scaled down
    weights = {
        "attention_proj": np.random.randn(d_model, d_model) * std * residual_scale,
        "ff_proj": np.random.randn(d_model, d_model) * std * residual_scale,
    }

    return weights, residual_scale


weights, scale = initialize_pre_norm_weights(512, 24)
Out[21]:
Console
Initialization for 24-layer Pre-Norm Transformer
=============================================
Standard weight std:    0.02
Residual scale factor:  0.1443
Scaled projection std:  0.002887

The residual scale factor of approximately 0.14 significantly reduces the contribution of each sublayer's output. With 24 layers (and 2 sublayers per block, hence 48 residual additions), this scaling prevents the accumulated sum from exploding. The GPT-2 paper introduced this technique specifically to enable training of deeper pre-norm networks.
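
A rough way to see why this particular scale is chosen is the sketch below, which assumes each residual branch adds an independent, roughly unit-variance vector (real sublayers only approximate this):

np.random.seed(0)
d_model, n_layers = 512, 24
n_residuals = 2 * n_layers  # attention + feed-forward contribution per block
residual_scale = 1 / np.sqrt(n_residuals)

# Model the residual stream as a sum of independent unit-variance branches
branches = np.random.randn(n_residuals, d_model)

unscaled_stream = branches.sum(axis=0)
scaled_stream = (residual_scale * branches).sum(axis=0)

print(f"Residual-stream std without scaling: {unscaled_stream.std():.2f}")
print(f"Residual-stream std with 1/sqrt(2N) scaling: {scaled_stream.std():.2f}")

Under this simplification, the unscaled stream's standard deviation grows like $\sqrt{2N}$, while the $1/\sqrt{2N}$ factor keeps it near one, which is the intuition behind the GPT-2-style initialization.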

Learning Rate Sensitivity

Post-norm architectures are notoriously sensitive to learning rate choices. Large learning rates can cause training to diverge, necessitating careful warmup schedules.

Pre-norm is more forgiving because the direct gradient path through residual connections prevents gradient explosion. This allows training with larger learning rates and simpler schedules.

Out[22]:
Visualization
Line plot showing post-norm training loss with and without warmup, where no warmup causes immediate divergence.
Post-norm training requires warmup. Without warmup (orange), training diverges immediately at the target learning rate. With warmup (blue), the gradual learning rate increase allows the model to find a stable optimization trajectory.
Line plot showing pre-norm training loss with and without warmup, where both converge successfully.
Pre-norm handles aggressive learning rates. Both with warmup (blue) and without (orange), training proceeds stably. The gradient highway ensures that even large initial updates don't cause divergence.

The simulated training curves illustrate the practical difference. Post-norm without warmup experiences immediate instability: the gradient explosion causes the loss to spike and training to fail. With warmup, the gradual learning rate increase gives the optimizer time to find a stable path. Pre-norm, by contrast, handles both scenarios gracefully because the gradient highway prevents the cascading instabilities that plague post-norm.

In[23]:
Code
def simulate_lr_sensitivity(norm_type, learning_rates):
    """
    Simulate training stability across learning rates.
    Returns a stability score (lower is better, 0 = diverged).
    """
    np.random.seed(42)
    d_model = 32
    n_layers = 12

    stability_scores = []

    for lr in learning_rates:
        # Initialize
        x = np.random.randn(4, d_model)
        gamma = np.ones(d_model)
        beta = np.zeros(d_model)

        diverged = False
        for step in range(100):
            # Forward pass
            for i in range(n_layers):
                np.random.seed(i)
                W = np.random.randn(d_model, d_model) * 0.1

                if norm_type == "pre":
                    x_norm = layer_norm(x, gamma, beta)
                    out = np.tanh(x_norm @ W)
                    x = x + out
                else:
                    out = np.tanh(x @ W)
                    x = layer_norm(x + out, gamma, beta)

            # Check for numerical issues
            if np.any(np.isnan(x)) or np.any(np.abs(x) > 1e6):
                diverged = True
                break

            # Simulate gradient update (simplified)
            grad = np.random.randn(*x.shape) * 0.01
            x = x - lr * grad

        if diverged:
            stability_scores.append(0)
        else:
            stability_scores.append(1 / (1 + np.std(x)))

    return stability_scores


learning_rates = [1e-5, 1e-4, 1e-3, 5e-3, 1e-2, 5e-2]
pre_stability = simulate_lr_sensitivity("pre", learning_rates)
post_stability = simulate_lr_sensitivity("post", learning_rates)
Out[24]:
Visualization
Bar chart comparing stability scores for pre-norm and post-norm across learning rates from 1e-5 to 5e-2.
Learning rate sensitivity comparison. Pre-norm (blue) maintains stability across a wider range of learning rates, while post-norm (orange) becomes unstable at higher learning rates, requiring more careful hyperparameter tuning.

The stability comparison shows that pre-norm tolerates higher learning rates. This translates to faster convergence in practice, as larger learning rates allow taking bigger optimization steps.

When to Use Each Variant

Despite the modern consensus favoring pre-norm, there are situations where post-norm might be appropriate:

Use Pre-Norm when:

  • Training deep networks (more than 12 layers)
  • Building large language models (GPT-style)
  • You want simpler hyperparameter tuning
  • Training without extensive warmup periods
  • Prioritizing training stability

Consider Post-Norm when:

  • Working with shallow networks (12 layers or fewer)
  • Fine-tuning existing post-norm models (like BERT)
  • Reproducing results from older papers
  • Situations where final representation normalization is critical

The theoretical argument for post-norm is that normalized outputs at each layer provide a cleaner signal for subsequent processing. Some researchers have found that post-norm achieves slightly better final performance when training succeeds, though the stability challenges often make this advantage inaccessible in practice.

Hybrid Approaches

Recent work has explored combining the benefits of both approaches. One variant uses post-norm with scaled residuals:

In[25]:
Code
def scaled_post_norm_block(x, sublayer_fn, gamma, beta, alpha=0.1):
    """
    Scaled post-norm: reduces residual contribution for stability.

    Args:
        alpha: Scaling factor for sublayer output (typically 0.1-0.5)
    """
    sublayer_output = sublayer_fn(x)
    residual_sum = x + alpha * sublayer_output
    output = layer_norm(residual_sum, gamma, beta)
    return output

Another approach, used in some recent models, applies normalization both before and after the sublayer:

In[26]:
Code
def sandwich_norm_block(x, sublayer_fn, gamma1, beta1, gamma2, beta2):
    """
    Sandwich normalization: LayerNorm before and after sublayer.
    Provides extra stability at the cost of additional computation.
    """
    # Pre-normalize
    x_norm = layer_norm(x, gamma1, beta1)

    # Apply sublayer
    sublayer_output = sublayer_fn(x_norm)

    # Post-normalize the sublayer output
    sublayer_output = layer_norm(sublayer_output, gamma2, beta2)

    # Add residual
    output = x + sublayer_output

    return output

These hybrid approaches add complexity but can provide benefits in specific situations, particularly for extremely deep or wide networks.

Implementation Comparison

Let's implement both variants in a clean, comparable format that highlights their differences:

In[27]:
Code
class TransformerBlock:
    """
    Unified transformer block supporting both pre-norm and post-norm.
    """

    def __init__(self, d_model, d_ff, norm_type="pre"):
        """
        Args:
            d_model: Model dimension
            d_ff: Feed-forward hidden dimension
            norm_type: 'pre' or 'post'
        """
        self.d_model = d_model
        self.d_ff = d_ff
        self.norm_type = norm_type

        np.random.seed(42)

        # Attention weights (simplified)
        self.W_attn = np.random.randn(d_model, d_model) * 0.02

        # Feed-forward weights
        self.W1 = np.random.randn(d_model, d_ff) * 0.02
        self.W2 = np.random.randn(d_ff, d_model) * 0.02

        # Layer norm parameters (2 norms per block)
        self.gamma1 = np.ones(d_model)
        self.beta1 = np.zeros(d_model)
        self.gamma2 = np.ones(d_model)
        self.beta2 = np.zeros(d_model)

    def attention(self, x):
        """Simplified attention (just projection for demo)."""
        return x @ self.W_attn

    def feed_forward(self, x):
        """Two-layer feed-forward with ReLU."""
        hidden = np.maximum(0, x @ self.W1)
        return hidden @ self.W2

    def forward(self, x):
        """Forward pass with specified normalization strategy."""
        if self.norm_type == "pre":
            # Pre-norm attention
            x_norm = layer_norm(x, self.gamma1, self.beta1)
            x = x + self.attention(x_norm)

            # Pre-norm feed-forward
            x_norm = layer_norm(x, self.gamma2, self.beta2)
            x = x + self.feed_forward(x_norm)
        else:
            # Post-norm attention
            x = layer_norm(x + self.attention(x), self.gamma1, self.beta1)

            # Post-norm feed-forward
            x = layer_norm(x + self.feed_forward(x), self.gamma2, self.beta2)

        return x
In[28]:
Code
# Compare forward pass behavior
np.random.seed(42)
x = np.random.randn(4, 64)

pre_block = TransformerBlock(64, 256, norm_type="pre")
post_block = TransformerBlock(64, 256, norm_type="post")

pre_output = pre_block.forward(x)
post_output = post_block.forward(x)
Out[29]:
Console
Single Block Comparison
========================================
Input magnitude:       7.7756
Pre-norm output mag:   7.8477
Post-norm output mag:  8.0000

Even after just one block, the difference is visible. The post-norm output magnitude is close to $\sqrt{d}$ (the expected magnitude for a normalized $d$-dimensional vector), while pre-norm's output reflects the accumulation of the input and sublayer contributions. This difference compounds across many layers, which is why the architectural choice matters significantly for deep networks.
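
To see the compounding directly, we can stack several of these blocks and track the magnitude after each one. This is a small sketch reusing the TransformerBlock class above; because the class seeds NumPy in its constructor, every block ends up with identical weights, which is fine for illustrating scale but not something you would do in a real model.

def stack_blocks(x_in, n_blocks, norm_type):
    """Run the input through a stack of blocks, recording mean row magnitude."""
    magnitudes = [np.mean(np.linalg.norm(x_in, axis=-1))]
    for _ in range(n_blocks):
        block = TransformerBlock(64, 256, norm_type=norm_type)
        x_in = block.forward(x_in)
        magnitudes.append(np.mean(np.linalg.norm(x_in, axis=-1)))
    return magnitudes


np.random.seed(42)
x_stack = np.random.randn(4, 64)

pre_mags = stack_blocks(x_stack, 12, "pre")
post_mags = stack_blocks(x_stack, 12, "post")

print(f"Pre-norm  magnitude after 12 blocks: {pre_mags[-1]:.2f}")
print(f"Post-norm magnitude after 12 blocks: {post_mags[-1]:.2f}")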

Limitations and Impact

The pre-norm versus post-norm choice exemplifies a common pattern in deep learning architecture design: seemingly minor structural changes can have profound effects on trainability. Pre-norm solved the training stability challenges that limited early transformer scaling, directly enabling the large language models that define modern NLP.

The main limitation of pre-norm is the accumulating activation scale through the network. While the final layer normalization handles this for most purposes, it can create numerical precision challenges in extremely deep networks (hundreds of layers). Some architectures address this with periodic intermediate normalization or careful scaling of residual contributions.

Post-norm's limitation is its training instability for deep networks. The requirement for warmup schedules and careful hyperparameter tuning increases the complexity of training pipelines. For practitioners working with pre-trained models like BERT, the post-norm design is fixed, and fine-tuning inherits these stability considerations.

The broader impact of understanding normalization placement extends beyond transformers. Similar considerations apply to any deep residual network: where you normalize relative to the residual connection fundamentally changes how gradients flow. This principle guides architecture design across computer vision, speech processing, and other domains.

Key Parameters

When implementing pre-norm or post-norm transformer blocks, these parameters control the behavior and stability:

  • norm_type: Choice between "pre" or "post". Pre-norm places LayerNorm before the sublayer; post-norm places it after the residual addition. Modern large models (GPT, LLaMA) use pre-norm for stability.

  • eps (LayerNorm epsilon): Small constant (typically 1e-5 or 1e-6) added to the variance for numerical stability. Prevents division by zero when variance is very small.

  • gamma, beta (LayerNorm parameters): Learnable scale and shift parameters. Initialized to ones and zeros respectively, allowing the model to learn the optimal normalization behavior for each layer.

  • residual_scale: For pre-norm networks, scaling the output of sublayers by a factor like $1/\sqrt{2n}$ (where $n$ is the number of layers) prevents activation explosion. GPT-2 uses this technique.

  • n_layers: Network depth directly impacts the choice between pre-norm and post-norm. Networks deeper than 12 layers benefit significantly from pre-norm's gradient stability.

  • learning_rate: Post-norm requires smaller learning rates and warmup schedules. Pre-norm tolerates larger learning rates (often 3-10x higher) and simpler schedules.

Summary

The placement of layer normalization relative to residual connections determines whether a transformer uses pre-norm or post-norm architecture. This choice significantly impacts training dynamics, stability, and practical usability.

Key takeaways from this chapter:

  • Post-norm normalizes after the residual addition: The original transformer design applies normalization to the sum of the sublayer output and residual. This ensures each layer produces normalized outputs but can create gradient flow challenges in deep networks.

  • Pre-norm normalizes before the sublayer: Moving normalization inside the residual branch provides a direct gradient path through the residual connection. This enables stable training of very deep networks without complex warmup schedules.

  • Gradient highways explain stability: The $+1$ term from the residual connection in pre-norm ensures gradients can flow backward without attenuation, regardless of sublayer behavior. This mathematical property is why pre-norm trains more stably.

  • Activation accumulation is the trade-off: Pre-norm outputs are not normalized, leading to growing activation magnitudes. A final layer normalization before the output projection addresses this for most purposes.

  • Modern architectures favor pre-norm: GPT, LLaMA, and most large language models use pre-norm due to its stability benefits. Post-norm persists in some encoder models like BERT where network depth is more modest.

  • Learning rate tolerance differs: Pre-norm allows training with larger learning rates and simpler schedules. Post-norm requires careful warmup and learning rate selection to avoid divergence.

In the next chapter, we'll explore feed-forward networks, the other major component of transformer blocks alongside attention. Understanding how these two components work together completes the picture of transformer block architecture.
