Understand how residual connections solve the vanishing gradient problem in deep networks. Learn the math behind skip connections, gradient highways, residual scaling, and pre-norm vs post-norm configurations.

This article is part of the free-to-read Language AI Handbook.
Residual Connections
Deep neural networks learn by stacking layers, with each layer refining the representation from the previous one. In theory, adding more layers should only help: a deeper network can learn anything a shallower network can, plus potentially more complex patterns. In practice, something strange happens. Beyond a certain depth, networks become harder to train. Loss stops decreasing. Accuracy plateaus or even drops.
Residual connections solve this paradox. By adding a simple shortcut that lets information bypass each layer, residual connections transform the optimization landscape. Instead of learning absolute transformations, layers learn small adjustments. Instead of gradients dying across dozens of layers, they flow freely through shortcut paths. This architectural pattern is so effective that it's now standard in virtually every deep learning architecture, from ResNets in computer vision to transformers in NLP.
This chapter explores the mathematics behind residual connections, their interpretation as gradient highways, and their specific role in transformer architectures. We'll also examine the choice between pre-norm and post-norm configurations, which affects training stability in ways that matter for very deep models.
The Deep Network Problem
Before residual connections, training deep networks was notoriously difficult. The culprit is the chain rule of calculus: during backpropagation, gradients are multiplied layer by layer. With many layers, these products either explode or vanish.
Deep networks have more representational capacity than shallow ones, but they're harder to optimize. The gradient signal that guides learning must travel through every layer, and each layer can amplify or diminish it. With dozens of layers, even small per-layer effects compound catastrophically.
Consider a network with $L$ layers. During backpropagation, we need to compute how the loss changes with respect to parameters deep in the network. The chain rule tells us to multiply derivatives through each intermediate layer. For parameters $\theta_l$ in layer $l$, this gradient involves products of Jacobian matrices from all subsequent layers:

$$\frac{\partial \mathcal{L}}{\partial \theta_l} = \frac{\partial \mathcal{L}}{\partial h_L} \left( \prod_{i=l+1}^{L} \frac{\partial h_i}{\partial h_{i-1}} \right) \frac{\partial h_l}{\partial \theta_l}$$

where:
- $\mathcal{L}$: the loss function we're minimizing (e.g., cross-entropy)
- $\theta_l$: parameters (weights and biases) in layer $l$
- $h_i$: the hidden representation (activation vector) at layer $i$
- $h_L$: the final output of the network
- $\frac{\partial h_i}{\partial h_{i-1}}$: the Jacobian matrix of layer $i$'s transformation, showing how each output dimension depends on each input dimension
- $\prod_{i=l+1}^{L} \frac{\partial h_i}{\partial h_{i-1}}$: the product of all Jacobians from layer $l+1$ to the output layer
The critical term is the product of Jacobians. Each Jacobian is a $d \times d$ matrix (where $d$ is the hidden dimension), and we multiply $L - l$ of them together. If each Jacobian has spectral norm (largest singular value) slightly greater than 1, the product grows exponentially. If slightly less than 1, it shrinks exponentially. For example, with spectral norm 0.9 across 50 layers, the gradient shrinks by a factor of $0.9^{50} \approx 0.005$, effectively killing the learning signal.
Keeping gradients stable across 50 or 100 layers requires exquisite balance that's nearly impossible to maintain during training.
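The chapter's interactive simulation isn't reproduced here, but a few lines of plain Python capture the same compounding effect. The per-layer factors 0.95 and 1.05 are illustrative stand-ins for Jacobian spectral norms, not measurements from a real network:

```python
# Illustrative only: treat each layer's effect on the gradient as a single
# scalar factor (a stand-in for the Jacobian's spectral norm) and compound it.
depths = [10, 25, 50]
factors = [0.95, 1.00, 1.05]  # slight attenuation, perfect preservation, slight amplification

for factor in factors:
    row = ", ".join(f"depth {d}: {factor ** d:.4f}" for d in depths)
    print(f"per-layer factor {factor:.2f} -> {row}")

# Roughly: 0.95^50 ~ 0.08 (over 90% of the signal gone), 1.05^50 ~ 11.5 (signal blown up).
```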
The simulation reveals the challenge. A mere 5% deviation from perfect gradient preservation compounds across 50 layers to produce 10x amplification or 90% attenuation. Real networks with nonlinearities, varying weight magnitudes, and changing activation patterns face even more severe instabilities. Training such networks requires fighting against mathematics.
Why Adding Layers Can Hurt Performance
A counterintuitive phenomenon plagued early deep learning: adding more layers to a well-performing network often decreased accuracy. This wasn't just about training difficulty. Even on training data, deeper networks performed worse than shallower ones.
This violates basic intuition. A 56-layer network should be at least as expressive as a 20-layer network: it could theoretically learn to make the extra 36 layers act as identity mappings, reproducing the shallower network's behavior. Yet empirically, the 56-layer network trains to higher error.
The pattern is unmistakable: deeper plain networks achieve higher error on both training and test sets. If this were overfitting, we'd see low training error and high test error. Instead, both increase monotonically with depth. The optimization itself is failing. The deeper networks simply cannot learn the transformations they theoretically can represent.
The Residual Connection Solution
The degradation problem presents a paradox: deeper networks should be more expressive, yet they perform worse. The solution comes from rethinking what we ask each layer to learn.
Consider a network that already performs well at 20 layers. If we add 10 more layers, the ideal outcome might be for those extra layers to simply pass information through unchanged, effectively becoming identity mappings. But asking a layer to learn such an identity mapping through its weights is surprisingly difficult. The layer must coordinate all its parameters to reproduce the input exactly, and any training noise pushes it away from this fragile configuration.
What if, instead of learning the complete transformation, each layer only learned the adjustment to make? This shift in perspective is the key insight behind residual learning.
From Transformations to Adjustments
In a traditional neural network, each layer maps its input to a completely new representation. Given input $x$, the layer computes:

$$y = H(x)$$

where:
- $x$: the input representation (a vector of activations from the previous layer)
- $y$: the output representation (passed to the next layer)
- $H$: the layer's learned transformation, typically combining linear projections (weights), nonlinearities (like ReLU), and possibly normalization

The layer bears full responsibility for producing $y$ from scratch. Every useful feature in the output must be constructed by the learned function $H$.
Now consider what happens if the optimal output is simply the input, unchanged. The layer must learn $H(x) = x$. This is an identity mapping, and while mathematically simple, it requires precise coordination of weights. Any perturbation during training disrupts this delicate balance.
Residual connections flip this relationship. Instead of learning the output directly, the layer learns only the difference between the desired output and the input:

$$y = x + F(x)$$

where:
- $F(x)$: the residual function, representing only the change the layer should make
- $x$: added directly to form the output (the skip or shortcut connection)
- $y$: equals the input plus whatever adjustment $F$ learns

The addition of $x$ is the skip connection or shortcut. It provides a direct path for the input to flow to the output unchanged. The function $F$ now only needs to learn what to add to the input, rather than recreating the input plus modifications.
A residual block computes $y = x + F(x)$, where $F(x)$ is the transformation learned by the layer and $x$ is the input. The layer learns the residual rather than the full mapping. If the optimal transformation is close to identity, the layer only needs to learn small adjustments.
The Identity Mapping Advantage
This reformulation seems minor, just adding the input back, but it fundamentally changes the optimization landscape. Consider what each architecture must learn to achieve an identity mapping:

| Architecture | To produce $y = x$, it must learn | Difficulty |
|---|---|---|
| Traditional | $H(x) = x$ | Hard: requires precise weight configuration |
| Residual | $F(x) = 0$ | Easy: just set weights to zero |

For the traditional layer, learning identity requires that every weight matrix perfectly reproduces the input. For the residual block, learning identity requires only that $F(x) = 0$, which happens automatically when weights are small or zero. The skip connection handles the identity part for free.
This asymmetry has profound implications:
- Easy initialization: A residual block initialized with small weights is nearly an identity function, passing information through without disruption.
- Graceful degradation: If a layer's contribution isn't helpful, it can "turn off" by driving $F(x)$ toward zero. The network doesn't lose information because $x$ still flows through.
- Additive refinement: Each layer adds information rather than replacing it. The network builds up its representation incrementally, with each layer contributing what it can.
Demonstrating the Difference
Let's see this principle in action. We'll create two blocks, one traditional and one residual, and observe how they behave when initialized with small weights.
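The original listing isn't reproduced here, but a minimal PyTorch sketch of the two blocks might look like the following. The class names `PlainBlock` and `ResidualBlock` and the two-layer MLP inside them are illustrative choices, not the author's exact definitions:

```python
import torch
import torch.nn as nn

class PlainBlock(nn.Module):
    """Learns the full mapping: output = F(x)."""
    def __init__(self, dim: int):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.fc2 = nn.Linear(dim, dim)

    def forward(self, x):
        out = self.fc2(torch.relu(self.fc1(x)))
        return out                 # the output replaces the input entirely

class ResidualBlock(nn.Module):
    """Learns only the adjustment: output = x + F(x)."""
    def __init__(self, dim: int):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.fc2 = nn.Linear(dim, dim)

    def forward(self, x):
        residual = self.fc2(torch.relu(self.fc1(x)))
        return x + residual        # skip connection adds the input back
```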
The two classes are nearly identical, differing only in their final line: the plain block returns out while the residual block returns x + residual. This single change, adding the input back, creates the skip connection.
Now we'll test what happens when the output layer is initialized to zero, simulating a layer that hasn't yet learned to contribute:
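Continuing the sketch above (it assumes the class definitions and imports from the previous listing), we zero out each block's final linear layer and compare output norms:

```python
torch.manual_seed(0)
dim = 64
x = torch.randn(8, dim)  # a batch of 8 random input vectors

plain, residual = PlainBlock(dim), ResidualBlock(dim)

# Zero the output layer of each block: the learned function now contributes nothing.
for block in (plain, residual):
    nn.init.zeros_(block.fc2.weight)
    nn.init.zeros_(block.fc2.bias)

with torch.no_grad():
    print(f"input norm:            {x.norm().item():.4f}")
    print(f"plain block output:    {plain(x).norm().item():.4f}")     # ~0: signal destroyed
    print(f"residual block output: {residual(x).norm().item():.4f}")  # equals input norm: signal preserved
```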
The results reveal the fundamental difference. The plain block, with its output layer zeroed, produces output with near-zero norm. It has destroyed the input signal entirely. The residual block, in contrast, produces output identical to its input. The skip connection preserves information perfectly when the learned function contributes nothing.
This behavior at initialization cascades through deep networks. Stack 50 plain blocks with zero output layers, and the signal is completely lost. Stack 50 residual blocks with zero output layers, and the input passes through unchanged. As training progresses, each layer can gradually "turn on" its residual contribution without ever disrupting the signal flow.
The Gradient Highway Interpretation
The forward pass benefits of residual connections, preserving information and enabling identity mappings, are compelling. But the real power emerges during backpropagation. Skip connections don't just help information flow forward; they create express lanes for gradients to flow backward.
To understand why this matters, recall the vanishing gradient problem from earlier. In plain networks, gradients must traverse every layer's Jacobian matrix, and even slight attenuation at each layer compounds to near-zero gradients deep in the network. Residual connections fundamentally change this dynamic.
Unrolling the Residual Network
Consider a deep network with $L$ residual blocks. Let $x_0$ be the network input and $x_l$ be the output of block $l$. Each block computes:

$$x_l = x_{l-1} + F_l(x_{l-1})$$

where:
- $x_{l-1}$: the input to block $l$ (the output of block $l-1$)
- $x_l$: the output of block $l$
- $F_l$: the residual function for block $l$ (the learned transformation, excluding the skip connection)

Unlike traditional layers that chain together as compositions $x_L = f_L(f_{L-1}(\cdots f_1(x_0)\cdots))$, residual blocks have an additive structure. We can unroll this recurrence to see the elegant form it takes:

$$x_L = x_0 + \sum_{l=1}^{L} F_l(x_{l-1})$$
This formulation reveals something remarkable: the output is the sum of the original input and all intermediate residuals. The input isn't transformed through a chain of compositions. Instead, it's preserved and augmented. Each layer contributes additively rather than multiplicatively.
The Gradient Decomposition
Now we trace gradients backward. We want to compute how the loss changes with respect to an early representation $x_l$. The chain rule gives:

$$\frac{\partial \mathcal{L}}{\partial x_l} = \frac{\partial \mathcal{L}}{\partial x_L} \cdot \frac{\partial x_L}{\partial x_l}$$

where:
- $\frac{\partial \mathcal{L}}{\partial x_L}$: the gradient at the output layer (computed from the loss)
- $\frac{\partial x_L}{\partial x_l}$: how changes in $x_l$ affect the final output $x_L$
The critical question is: what is $\frac{\partial x_L}{\partial x_l}$? From the unrolled formulation $x_L = x_l + \sum_{i=l+1}^{L} F_i(x_{i-1})$, we differentiate with respect to $x_l$:

$$\frac{\partial x_L}{\partial x_l} = 1 + \frac{\partial}{\partial x_l} \sum_{i=l+1}^{L} F_i(x_{i-1})$$

where:
- The 1 comes from the direct path: $x_l$ contributes to $x_L$ through the chain of skip connections ($x_l \to x_{l+1} \to \cdots \to x_L$)
- The sum captures indirect contributions through the residual functions $F_i$
This decomposition is the heart of residual learning. The gradient splits into two components:
- Direct path (the "1"): Gradients flow directly through skip connections, completely bypassing all layer transformations. This path always contributes exactly 1 to the gradient magnitude.
- Residual path (the sum): Gradients flow through the learned functions $F_i$, just as they would in a plain network. This path can vanish, explode, or behave unpredictably.
The key insight: even if the residual path completely vanishes (all $\partial F_i / \partial x_l = 0$), the gradient still maintains magnitude 1 through the identity path. The skip connection guarantees a floor on gradient magnitude.
Seeing the Gradient Highway in Action
Let's simulate this gradient decomposition to see how the two paths behave as we go deeper into the network:
We simulate 20 layers where the residual path loses half its gradient at each layer (a 50% attenuation, more optimistic than many real networks). The direct path, by design, contributes exactly 1.0 at every layer:
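A plain-Python sketch of this simulation follows. The table format and the indexing (the residual contribution at layer $k$ has passed through $k+1$ attenuation steps) are illustrative choices:

```python
# Decompose the gradient reaching each layer into the direct (skip) path, which
# always contributes 1.0, and the residual path, which attenuates by 50% per layer.
num_layers = 20
attenuation = 0.5  # the residual path keeps half of its gradient at each layer

print(f"{'layer':>5} {'direct path':>12} {'residual path':>14} {'total gradient':>15}")
for layer in range(num_layers):
    direct = 1.0                            # identity path: unaffected by depth
    residual = attenuation ** (layer + 1)   # learned path: decays exponentially
    total = direct + residual
    print(f"{layer:>5} {direct:>12.1f} {residual:>14.6f} {total:>15.6f}")
```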
The table reveals the mechanism with striking clarity. The direct path column shows unwavering 1.0 values at every depth. This is the gradient highway, completely unaffected by layer count. Meanwhile, the residual path decays exponentially: by layer 16, it contributes only 0.000008 to the gradient, essentially vanishing as it would in a plain network.
But here's the crucial difference from plain networks: the total gradient never drops below 1.0. In a 20-layer plain network with 50% gradient attenuation per layer, the gradient would be $0.5^{20} \approx 10^{-6}$, effectively zero. In the residual network, it's 1.000001. The skip connection provides a floor that prevents gradients from vanishing entirely.
Visualizing the Gradient Highway
The term "gradient highway" captures this behavior perfectly. Think of a city's road network: local streets pass through many intersections, each one potentially causing delays or blockages. An expressway bypasses these intersections entirely, providing guaranteed throughput regardless of local traffic conditions. Skip connections serve the same purpose for gradients. They bypass the potentially problematic layer transformations, providing guaranteed gradient flow to early parameters.
The visualization captures the essential difference. In plain networks, gradients decay exponentially. In residual networks, the direct path provides a floor: gradients never fall below 1.0 (plus whatever the residual path contributes). This guaranteed minimum gradient flow is why residual networks can train effectively at depths that would be impossible for plain networks.
Residual Connections in Transformers
Having established why residual connections work in general, we now turn to their specific application in transformers. The transformer architecture relies heavily on residual connections, not as an optional enhancement, but as a fundamental design principle that makes training possible.
A transformer encoder or decoder consists of stacked identical blocks. Each block contains two distinct operations: multi-head self-attention (which lets tokens gather information from each other) and a position-wise feed-forward network (which processes each token independently). Without residual connections, stacking 12, 24, or 96 of these blocks would be impossible to train. With them, modern language models routinely use such depths.
The Transformer Block Pattern
A standard transformer block processes input $x$ through two sub-layers, each wrapped with its own residual connection. The computation proceeds as:

$$h = x + \text{Attention}(x)$$
$$y = h + \text{FFN}(h)$$

where:
- $x$: input to the transformer block, with $n$ tokens and $d_{\text{model}}$-dimensional embeddings
- $\text{Attention}(x)$: the attention sub-layer output, same shape as $x$
- $h$: intermediate representation after the first residual addition
- $\text{FFN}(h)$: the feed-forward sub-layer output, same shape as $h$
- $y$: final block output, passed to the next transformer block

Each addition (the "$+$" in the two equations) represents a skip connection. Notice the interpretive power this structure provides:
- Attention as adjustment: The attention sub-layer doesn't produce a completely new representation; it learns what context-aware information to add to each token based on other positions in the sequence.
- FFN as refinement: The feed-forward network learns what position-wise transformations to add to the post-attention representation, processing each token independently.
- Graceful bypass: If either sub-layer's contribution isn't helpful for a particular input, it can learn to produce near-zero output. The information flows through unchanged via the skip connection.
This framing helps understand what transformers actually learn: not arbitrary transformations at each layer, but successive refinements to a representation that remains connected to the original input.
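The chapter's original listing isn't shown here; the sketch below is one plausible PyTorch version of such a block, assuming $d_{\text{model}} = 512$, 8 attention heads, and the usual 4x FFN expansion. The author's exact configuration, and hence the parameter count, may differ:

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Two sub-layers, each wrapped in its own residual connection:
       h = x + Attention(x),  y = h + FFN(h)."""
    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)   # self-attention: queries, keys, values are all x
        h = x + attn_out                   # first skip connection
        y = h + self.ffn(h)                # second skip connection
        return y

block = TransformerBlock()
x = torch.randn(4, 16, 512)                # batch of 4 sequences, 16 tokens, 512-dim embeddings
y = block(x)
print(x.shape, y.shape)                    # both torch.Size([4, 16, 512])
# Parameter count depends on the configuration chosen above.
print(sum(p.numel() for p in block.parameters()))
```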
The block processes a batch of 4 sequences, each with 16 tokens and 512-dimensional embeddings. The output maintains the same shape, a requirement for residual connections to work since the addition requires matching dimensions. With over 6 million parameters split between attention and FFN, each sub-layer contributes its learned adjustments through the residual pattern x + SubLayer(x).
Why Every Sub-Layer Gets a Residual
You might wonder: why not just one residual connection per transformer block, wrapping both attention and FFN together? The answer relates to optimization stability and gradient flow.
With residual connections around each sub-layer:
- Gradients have two "highways" per block: one bypassing attention, one bypassing FFN
- Each sub-layer can be trained somewhat independently
- The network can easily learn to skip either sub-layer if it's not helpful
With a single residual per block:
- Gradients must flow through both attention and FFN or bypass both
- Harder to learn that attention is useful but FFN isn't (or vice versa)
- Effectively half as many gradient highways
The empirical result is that per-sub-layer residuals train more stably and achieve better final performance, especially in deep networks.
Residual Scaling
Residual connections solve the vanishing gradient problem beautifully, but they introduce a subtler issue in very deep networks. Recall the unrolled form: $x_L = x_0 + \sum_{l=1}^{L} F_l(x_{l-1})$. We're summing $L$ residuals. If each residual has similar magnitude, the sum grows with depth.
For a 12-layer transformer, this growth is manageable. For a 96-layer model, summing residuals can cause the representation magnitude to explode, destabilizing training even though gradients flow well. The network needs a mechanism to control this accumulation.
The Scaling Factor
Residual scaling dampens each residual contribution before adding it to the skip path. Instead of $y = x + F(x)$, we compute:

$$y = x + \alpha \cdot F(x)$$

where:
- $\alpha$: a scaling factor, typically less than 1
- $F(x)$: the residual function output
- $\alpha \cdot F(x)$: the dampened residual, smaller in magnitude than the original
The design question is: what value $\alpha$ should take? Several strategies have proven effective:
- Fixed scaling: a constant such as $\alpha = 0.5$, chosen empirically. Simple and effective for moderately deep networks.
- Depth-dependent: $\alpha = 1/\sqrt{L}$, where $L$ is the total number of layers. This ensures the variance of the sum stays bounded as depth increases, based on the statistics of summing independent random variables.
- Learned: $\alpha$ as a trainable parameter per layer, allowing the network to discover appropriate scaling during training. More flexible but adds parameters.
Each approach reduces each layer's contribution, preventing the sum of residuals from growing unboundedly. Importantly, scaling doesn't hurt gradient flow: the gradient still flows through the unscaled skip connection, maintaining the gradient highway effect.
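A rough sketch of the accumulation effect, using independent random residuals of unit norm; the trend (unscaled sums grow with depth, scaled sums grow far more slowly) is what matters here, not the exact curve, and the depth of 48 layers and the specific α values are illustrative:

```python
import torch

torch.manual_seed(0)
dim, depth = 512, 48

for alpha in (1.0, 0.5, 0.1):
    x = torch.randn(dim)
    x = x / x.norm()                              # unit-norm starting representation
    report = []
    for layer in range(1, depth + 1):
        residual = torch.randn(dim)
        residual = residual / residual.norm()     # unit-norm residual from this layer
        x = x + alpha * residual                  # scaled residual addition
        if layer in (12, 24, 48):
            report.append(f"layer {layer}: {x.norm().item():.2f}")
    print(f"alpha = {alpha:.1f} -> " + ", ".join(report))
```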
The visualization reveals the trade-off clearly. Without scaling (α = 1.0), the representation magnitude grows steadily as each layer adds its full residual contribution. The growth appears roughly linear because we're adding vectors of similar magnitude. With moderate scaling (α = 0.5), growth slows substantially. With aggressive scaling (α = 0.1), the magnitude stays nearly constant. Each layer contributes so little that the sum barely increases.
The optimal choice balances two concerns: too little scaling allows magnitude explosion in very deep networks, while too much scaling limits each layer's expressive contribution. A depth-dependent rule like $\alpha = 1/\sqrt{L}$ provides automatic balance: deeper networks use smaller scaling factors.
ReZero: Learning to "Turn On" Layers
The fixed and depth-dependent scaling approaches require choosing before training. But what if the network could discover the right scaling for each layer on its own? The ReZero approach takes this idea to its logical extreme: initialize every scaling factor to zero, and let the network learn when and how much to "turn on" each layer.
$$y = x + \alpha \cdot F(x)$$

where:
- $\alpha$: a learnable scalar parameter (one per layer), initialized to 0
- $y$: at the start of training, $\alpha = 0$, so $y = x$ exactly
At initialization, every residual block is exactly an identity mapping, not approximately, but perfectly. The entire network is just a pass-through. As training progresses, gradient descent increases $\alpha$ values where the residual function provides useful contributions. Layers that aren't helpful can keep $\alpha$ near zero.
This approach offers elegant properties:
- Perfect identity initialization: No approximation errors, no disruption to signal flow at the start of training
- Layer-specific adaptation: Important layers develop larger $\alpha$ values; less useful layers remain suppressed
- Automatic depth scaling: In very deep networks, the collective $\alpha$ values naturally stay small enough to prevent magnitude explosion
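A minimal ReZero-style block in PyTorch; the two-layer MLP used as the residual function is an illustrative stand-in for an attention or FFN sub-layer:

```python
import torch
import torch.nn as nn

class ReZeroBlock(nn.Module):
    """Residual block with a learnable gate: y = x + alpha * F(x), alpha initialized to 0."""
    def __init__(self, dim: int):
        super().__init__()
        self.fn = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.alpha = nn.Parameter(torch.zeros(1))   # starts at exactly zero

    def forward(self, x):
        return x + self.alpha * self.fn(x)          # at init: alpha = 0, so y == x exactly

torch.manual_seed(0)
block = ReZeroBlock(dim=64)
x = torch.randn(8, 64)
y = block(x)
print(torch.equal(x, y))        # True: a perfect identity mapping at initialization
print(block.alpha.item())       # 0.0
```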
The output confirms that ReZero works as intended. With alpha initialized to exactly zero, the block is a perfect identity function: input and output are identical, not merely close. The residual function still computes something (it has weights and activations), but its contribution is completely suppressed by the zero multiplier.
During training, gradient descent will increase alpha when doing so reduces the loss. Early in training, alpha values remain small, keeping the network close to identity. As the residual functions learn useful transformations, their alpha values grow to let those transformations contribute. This creates a natural curriculum: the network starts simple and gradually increases complexity.
Pre-Norm vs Post-Norm Residuals
We've established that residual connections enable deep network training through gradient highways. But there's a subtle design choice that significantly affects training dynamics: where do you place layer normalization relative to the residual addition?
Layer normalization is essential for stable training. It prevents activations from exploding or vanishing by normalizing to zero mean and unit variance. But its placement interacts with the skip connection in ways that matter for very deep networks. The two configurations are post-norm (the original transformer design) and pre-norm (now the dominant choice for large models).
Post-Norm: The Original Configuration
The original "Attention Is All You Need" paper uses post-norm, where normalization comes after the residual addition:
$$y = \text{LayerNorm}(x + \text{SubLayer}(x))$$

where:
- $x$: the input to the block
- $\text{SubLayer}(x)$: the sub-layer output (e.g., attention or FFN)
- $x + \text{SubLayer}(x)$: the residual sum
- $\text{LayerNorm}(\cdot)$: layer normalization applied to the combined result
- $y$: the final block output, normalized
Pre-Norm: The Modern Default
Pre-norm rearranges the components, applying normalization before the sub-layer rather than after the residual addition:
$$y = x + \text{SubLayer}(\text{LayerNorm}(x))$$

where:
- $\text{LayerNorm}(x)$: normalization applied to the input before the sub-layer
- $\text{SubLayer}(\text{LayerNorm}(x))$: the sub-layer operates on the normalized input
- $x$: the original input, added directly (not normalized in the skip path)
- $y$: the output, combining the raw input with the transformed normalized input
This rearrangement might seem cosmetic, but it has a crucial consequence for gradient flow. In post-norm, gradients flowing through the skip connection must pass through the LayerNorm operation, where they're multiplied by the normalization's derivative. In pre-norm, the skip path is completely clean: the $x$ term passes gradients backward without any modification.
The pure gradient highway we derived earlier, where the direct path contributes exactly 1 to the gradient, only holds perfectly for pre-norm. Post-norm introduces the normalization's Jacobian into the skip path, which can amplify or attenuate gradients depending on the input statistics.
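One way to see this numerically is to stack the two block types and compare the gradient norm that reaches the input. The sketch below (20 blocks, a simple linear-plus-ReLU residual function, a toy loss, random input) illustrates the effect and is not the chapter's exact experiment; the exact numbers depend on the random seed, but pre-norm typically shows the larger input gradient:

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """A toy residual block whose norm placement can be switched."""
    def __init__(self, dim: int, pre_norm: bool):
        super().__init__()
        self.pre_norm = pre_norm
        self.norm = nn.LayerNorm(dim)
        self.fn = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        if self.pre_norm:
            return x + self.fn(self.norm(x))     # pre-norm: the skip path stays untouched
        return self.norm(x + self.fn(x))         # post-norm: the skip path passes through LayerNorm

def input_gradient_norm(pre_norm: bool, depth: int = 20, dim: int = 256) -> float:
    torch.manual_seed(0)                          # same weights and input for both configurations
    blocks = nn.Sequential(*[Block(dim, pre_norm) for _ in range(depth)])
    x = torch.randn(32, dim, requires_grad=True)
    loss = blocks(x).pow(2).mean()                # a simple stand-in loss
    loss.backward()
    return x.grad.norm().item()

print(f"pre-norm  gradient norm at the input: {input_gradient_norm(pre_norm=True):.4f}")
print(f"post-norm gradient norm at the input: {input_gradient_norm(pre_norm=False):.4f}")
```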
The gradient norm comparison quantifies what the theory predicts. Pre-norm maintains substantially stronger gradients at the input layer after passing through 20 blocks. The ratio between them indicates that post-norm's normalization operations in the skip path attenuate the gradient signal.
For a 20-layer network, this difference is noticeable but perhaps manageable. For a 96-layer model like GPT-3, the gap compounds dramatically. Pre-norm's clean gradient path becomes essential for stable training at that scale.
Training Stability in Practice
The gradient analysis translates directly to practical training differences. Pre-norm transformers are more stable and forgiving during training, especially in challenging regimes:
- Very deep networks (50+ layers): The cumulative effect of normalization Jacobians in post-norm's skip path compounds with depth. Pre-norm avoids this entirely.
- Large learning rates: Post-norm requires more careful learning rate tuning because gradient magnitudes are less predictable. Pre-norm tolerates larger learning rates.
- Reduced warmup schedules: Post-norm typically needs extended warmup periods to avoid early training instabilities. Pre-norm can often skip or shorten warmup.
The mathematical reason is clear: with pre-norm, the skip connection contributes exactly 1 to the gradient at any earlier layer, with no modification and no layer-dependent scaling. With post-norm, this contribution is modulated by the chain of normalization derivatives, introducing variance that can destabilize training.
When to Use Each Configuration
The choice between pre-norm and post-norm depends on your specific situation:
Use Pre-Norm when:
- Training very deep transformers (24+ layers)
- You want faster training convergence
- Stability is more important than squeezing out final performance
- Using larger learning rates or shorter warmup schedules
Use Post-Norm when:
- Fine-tuning pre-trained models that used post-norm
- Willing to use careful learning rate scheduling and warmup
- Optimizing for final quality over training convenience
- Working with shallower networks where stability is less critical
Modern large language models predominantly use pre-norm due to its training stability at scale. GPT-2, GPT-3, LLaMA, and many other recent models adopt pre-norm configurations.
Practical Implementation
Having covered the theory (residual learning, gradient highways, scaling, and normalization placement), we now synthesize everything into a practical implementation. The following code creates a flexible residual block that supports all the variants we've discussed: pre-norm or post-norm configuration, fixed or learned scaling, and ReZero initialization.
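The chapter's original listing isn't reproduced here; the sketch below is one way such a block might be written in PyTorch. The constructor arguments (`dim`, `norm_position`, `scale`, `rezero`) mirror the parameters summarized at the end of the chapter, but the author's exact interface may differ:

```python
import torch
import torch.nn as nn

class FlexibleResidualBlock(nn.Module):
    """Residual block supporting pre-/post-norm, fixed scaling, and ReZero initialization."""
    def __init__(self, dim: int, norm_position: str = "pre",
                 scale: float = 1.0, rezero: bool = False):
        super().__init__()
        assert norm_position in ("pre", "post")
        self.norm_position = norm_position
        self.norm = nn.LayerNorm(dim)
        self.fn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        if rezero:
            self.scale = nn.Parameter(torch.zeros(1))            # learnable gate, starts at zero
        else:
            self.register_buffer("scale", torch.tensor(scale))   # fixed scaling factor

    def forward(self, x):
        if self.norm_position == "pre":
            return x + self.scale * self.fn(self.norm(x))        # pre-norm: clean skip path
        return self.norm(x + self.scale * self.fn(x))            # post-norm: normalize the sum

torch.manual_seed(0)
x = torch.randn(4, 16, 512)
configs = [
    ("pre-norm,  scale=1.0", FlexibleResidualBlock(512, "pre", scale=1.0)),
    ("post-norm, scale=1.0", FlexibleResidualBlock(512, "post", scale=1.0)),
    ("rezero,    scale=0.0", FlexibleResidualBlock(512, "pre", rezero=True)),
]
for name, block in configs:
    y = block(x)
    print(f"{name}: output mean {y.mean().item():+.4f}, std {y.std().item():.4f}, "
          f"identical to input: {torch.equal(x, y)}")
```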
The output statistics reveal each configuration's character. Pre-norm and post-norm, both with scale 1.0, transform the input substantially. Notice how the output statistics differ from input. The key difference between them appears in the output distribution: post-norm produces output with statistics closer to zero mean and unit variance (the normalization's effect on the final output), while pre-norm allows more drift.
ReZero behaves distinctly: with scale 0.0, input and output are identical. The residual function computes something, but the zero multiplier suppresses it completely. This is exactly the "perfect identity initialization" property: at the start of training, before alpha increases, ReZero blocks pass information through unchanged.
Deep Network Test
The theoretical benefits of residual connections should manifest in concrete, measurable ways when we build deep networks. Let's construct 30-layer networks with and without residual connections and compare their gradient flow characteristics:
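One plausible way to run this comparison is sketched below: 30 stacked blocks, a toy loss, and the first layer's gradient norm as the measurement. The details (tanh nonlinearity, dimensions, loss) are illustrative assumptions:

```python
import torch
import torch.nn as nn

class Layer(nn.Module):
    """One layer of a toy deep network, optionally with a skip connection."""
    def __init__(self, dim: int, residual: bool):
        super().__init__()
        self.residual = residual
        self.fn = nn.Sequential(nn.Linear(dim, dim), nn.Tanh())

    def forward(self, x):
        out = self.fn(x)
        return x + out if self.residual else out   # with or without the skip connection

def measure(residual: bool, depth: int = 30, dim: int = 128):
    torch.manual_seed(0)                            # identical weights and data for both networks
    net = nn.Sequential(*[Layer(dim, residual) for _ in range(depth)])
    x = torch.randn(64, dim)
    out = net(x)
    out.pow(2).mean().backward()                    # a simple stand-in loss
    first_layer_grad = net[0].fn[0].weight.grad.norm().item()
    return out.norm().item(), first_layer_grad

for residual in (True, False):
    out_norm, grad_norm = measure(residual)
    label = "residual" if residual else "plain"
    print(f"{label:>8} network: output norm {out_norm:10.2f}, "
          f"first-layer gradient norm {grad_norm:.8f}")
```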
The comparison reveals why residual connections have become ubiquitous in deep learning. At 30 layers of depth, the residual network maintains healthy gradient and output norms. The training signal reaches early parameters with sufficient magnitude to drive meaningful updates. The plain network's first-layer gradients are vanishingly small by comparison, a sign of the compounding effects we analyzed earlier.
This gap widens dramatically with depth. At 50 layers, plain networks become nearly untrainable. At 100 layers, they're hopeless. Residual networks, in contrast, scale gracefully: models with 96, 128, or even 1000+ layers train successfully when equipped with skip connections. This scalability is why every major deep learning architecture since 2015, from ResNets to transformers to diffusion models, incorporates residual connections as a foundational element.
Limitations and Trade-offs
Residual connections have proven remarkably effective, so much so that they're essentially mandatory for deep networks. But no architectural choice is without trade-offs. Understanding the limitations helps make informed design decisions.
Memory Overhead
The skip connection requires storing the input until the residual is computed, so they can be added. In a single block, this is trivial. In a deep network during training, it accumulates:
- Plain networks: Store only the current layer's activations (can discard earlier layers)
- Residual networks: Store input to each block until the addition completes
For training with large batch sizes and long sequences, this memory overhead becomes significant. A 32-layer transformer processing 2048-token sequences might need to store 32 copies of the activation tensor for the residual additions alone. Techniques like gradient checkpointing help by recomputing some activations during the backward pass instead of storing them, trading compute for memory.
Architectural Constraints
The residual formulation imposes a fundamental constraint: $x$ and $F(x)$ must have matching dimensions for the addition to be valid. In transformers, this is straightforward because the hidden dimension stays constant throughout. But in convolutional networks or architectures that change dimensionality, this creates design challenges.
When dimensions must change (e.g., increasing channels, reducing spatial resolution, or transitioning between embedding sizes), you need projection shortcuts:
$$y = W_s x + F(x)$$

where:
- $W_s$: a learned projection matrix of shape $d_{\text{out}} \times d_{\text{in}}$ that maps the input dimension to the output dimension
- $W_s x$: the projected input, now matching the dimension of $F(x)$
- $F(x)$: the residual transformation, which outputs vectors of dimension $d_{\text{out}}$
Projection shortcuts work, but they sacrifice some benefits. The gradient through the shortcut is multiplied by $W_s^\top$, reintroducing potential gradient scaling issues. The skip path is no longer "free": it adds parameters, computation, and a transformation that can amplify or attenuate gradients. This is why transformer architectures prefer to maintain constant dimensions throughout, using the simpler $y = x + F(x)$ formulation everywhere.
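A minimal sketch of such a dimension-changing block; the `ProjectionResidualBlock` name and the two-layer residual function are illustrative:

```python
import torch
import torch.nn as nn

class ProjectionResidualBlock(nn.Module):
    """Residual block whose output dimension differs from its input dimension."""
    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.fn = nn.Sequential(nn.Linear(d_in, d_out), nn.ReLU(), nn.Linear(d_out, d_out))
        self.proj = nn.Linear(d_in, d_out, bias=False)   # W_s: projects the skip path

    def forward(self, x):
        return self.proj(x) + self.fn(x)   # y = W_s x + F(x)

block = ProjectionResidualBlock(d_in=256, d_out=512)
x = torch.randn(8, 256)
print(block(x).shape)   # torch.Size([8, 512])
```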
The Identity Plateau Problem
The ease of learning identity mappings, a key feature of residual connections, can become a bug. If the optimization finds a local minimum where $F(x) \approx 0$ for many layers, the network effectively collapses to a much shallower architecture. The layers exist and consume memory, but they contribute nothing.
This "identity plateau" isn't catastrophic (the network still functions), but it fails to utilize the model's full capacity. You've paid for 96 layers but are effectively using 20. Careful initialization, appropriate learning rates, and normalization help avoid this plateau, as does training long enough for gradients to "turn on" dormant layers.
Despite these limitations, residual connections have become indispensable. The benefits (stable gradient flow, trainable depth, additive representation refinement) far outweigh the costs. Any serious deep learning practitioner will encounter residual connections repeatedly; understanding their mechanics is foundational knowledge.
Summary
Residual connections enable training of very deep neural networks by providing direct paths for information and gradients to flow. The core insight is simple: instead of learning transformations directly, layers learn adjustments to an identity baseline. This chapter covered the key concepts and their practical implications.
Key takeaways:
- The depth problem: Without residual connections, gradients in deep networks either explode or vanish exponentially with depth, making optimization impossible.
- Residual learning: The $y = x + F(x)$ formulation lets layers learn residuals (adjustments) rather than full transformations. Identity mappings become trivial, removing a major optimization barrier.
- Gradient highways: Skip connections provide direct paths for gradient flow. Even if layer gradients vanish, gradients through the skip path maintain a constant magnitude of 1.0.
- Transformer pattern: Each transformer sub-layer (attention and FFN) is wrapped with its own residual connection. This provides multiple gradient highways per block and allows selective use of each sub-layer.
- Residual scaling: For very deep networks, scaling residuals by $\alpha < 1$ (or learning $\alpha$ from zero) prevents representation magnitude from growing unboundedly.
- Pre-norm vs post-norm: Pre-norm places normalization before the sub-layer, keeping the skip path clean. It's more stable for deep networks and is the modern default. Post-norm was the original configuration and can achieve slightly better final performance with careful training.
Residual connections transformed deep learning from an art of careful initialization and shallow architectures into a science of stacking many simple blocks. Every major architecture since 2015, from ResNets to transformers to diffusion models, builds on this foundation.
Key Parameters
When implementing residual connections in neural networks, several key parameters control their behavior and training dynamics:
- scale ($\alpha$): The residual scaling factor that multiplies $F(x)$ before adding it to the skip path. Values less than 1.0 dampen residual contributions, preventing magnitude explosion in very deep networks. Common values range from 0.1 to 1.0, with $\alpha = 1/\sqrt{L}$ (where $L$ is the total number of layers) providing depth-adaptive scaling.
- norm_position: Controls whether layer normalization is applied before the sub-layer (pre-norm) or after the residual addition (post-norm). Pre-norm provides cleaner gradient flow and is preferred for deep networks (24+ layers). Post-norm was the original transformer configuration and may achieve marginally better final performance with careful training.
- rezero: When enabled, initializes the scaling factor to zero, making each block an exact identity function at initialization. The network learns appropriate scaling during training. This eliminates approximation errors at initialization and can speed up early training.
- dim: The hidden dimension of the residual block. Must match between input and output for the addition to be valid. When dimensions change, projection shortcuts are required.
- expansion_factor: In feed-forward networks within transformers, this controls the ratio between the inner dimension and the model dimension. The typical value is 4, meaning $d_{\text{ff}} = 4 \cdot d_{\text{model}}$.
For transformer implementations in PyTorch, these parameters interact with nn.LayerNorm and nn.MultiheadAttention. The choice between pre-norm and post-norm affects the placement of LayerNorm calls, while residual scaling can be implemented as a learnable nn.Parameter or a fixed constant multiplier.