Vanishing Gradients in RNNs: Why Neural Networks Forget Long Sequences

Michael Brenndoerfer · December 16, 2025 · 31 min read

Master the vanishing gradient problem in recurrent neural networks. Learn why gradients decay exponentially, how this prevents learning long-range dependencies, and the solutions that led to LSTM.

Vanishing Gradients

Recurrent neural networks promise to model sequences of arbitrary length by sharing parameters across time. In theory, an RNN can learn dependencies between the first word of a sentence and the last, no matter how far apart they are. In practice, this promise often breaks down. The culprit? The vanishing gradient problem.

When you train an RNN using backpropagation through time, gradients flow backward from the loss at the final timestep all the way to the earliest inputs. At each step backward, the gradient gets multiplied by the same weight matrix. If the spectral radius of this matrix, the magnitude of its largest eigenvalue, is less than 1, these repeated multiplications cause the gradient to shrink exponentially. By the time the gradient reaches early timesteps, it has essentially vanished, carrying no useful learning signal.

This chapter dissects the vanishing gradient problem in depth. We'll derive exactly why gradients decay exponentially, visualize this decay, and demonstrate how it prevents RNNs from learning long-range dependencies. We'll also examine the flip side, exploding gradients, and survey the architectural solutions that address these fundamental limitations.

The Gradient Product Chain

To understand why gradients vanish, we need to trace exactly how learning signals flow through an RNN during training. The story begins with a simple question: when we compute a loss at the end of a sequence, how does that error signal reach the weights that processed the very first input?

The answer involves a chain of multiplications, and this chain structure is precisely what causes the vanishing gradient problem. Let's build up to this insight step by step, starting with the basic RNN computation and then deriving exactly how gradients propagate backward.

The Recurrence Equation

Before diving into gradients, let's establish the forward computation. An RNN processes a sequence by maintaining a hidden state that gets updated at each timestep. This update combines three things: the previous hidden state (memory of what came before), the current input (new information), and learned weights (how to combine them).

The standard RNN computes hidden states using the recurrence:

\mathbf{h}_t = \tanh(\mathbf{W}_{hh} \mathbf{h}_{t-1} + \mathbf{W}_{xh} \mathbf{x}_t + \mathbf{b}_h)

where:

  • \mathbf{h}_t \in \mathbb{R}^d: the hidden state at timestep t, a d-dimensional vector encoding what the network "remembers"
  • \mathbf{W}_{hh} \in \mathbb{R}^{d \times d}: the recurrent weight matrix that transforms the previous hidden state
  • \mathbf{W}_{xh} \in \mathbb{R}^{d \times n}: the input weight matrix that transforms the current input
  • \mathbf{x}_t \in \mathbb{R}^n: the input at timestep t
  • \mathbf{b}_h \in \mathbb{R}^d: the bias vector

The tanh activation squashes the output to the range (-1, 1), preventing hidden states from growing unboundedly. But as we'll see, this squashing has consequences for gradient flow.

The crucial observation is the recursive dependency: \mathbf{h}_t depends on \mathbf{h}_{t-1}, which depends on \mathbf{h}_{t-2}, and so on, all the way back to \mathbf{h}_0. This means the hidden state at any timestep implicitly depends on all previous inputs and all applications of the weight matrices. When we want to compute how the loss depends on early hidden states, we must trace through this entire chain.
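
To make the recurrence concrete, here is a minimal NumPy sketch of the forward pass. The dimensions, random weights, and random inputs are illustrative assumptions, not values from any trained model:

import numpy as np

rng = np.random.default_rng(0)
d, n, T = 4, 3, 6  # hidden size, input size, sequence length (illustrative)

W_hh = rng.normal(scale=0.5, size=(d, d))  # recurrent weight matrix
W_xh = rng.normal(scale=0.5, size=(d, n))  # input weight matrix
b_h = np.zeros(d)                          # bias vector

h = np.zeros(d)                            # initial hidden state h_0
for t in range(T):
    x_t = rng.normal(size=n)               # stand-in for the input at timestep t
    h = np.tanh(W_hh @ h + W_xh @ x_t + b_h)  # the RNN recurrence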

Deriving the Gradient Chain

Now we arrive at the heart of the matter: how do gradients flow backward through this recursive structure?

Suppose we compute a loss \mathcal{L} at the final timestep T. To update the recurrent weights \mathbf{W}_{hh} using gradient descent, we need \frac{\partial \mathcal{L}}{\partial \mathbf{W}_{hh}}, which tells us how to adjust each weight to reduce the loss. But here's the complication: the same weight matrix \mathbf{W}_{hh} is used at every timestep. A small change to \mathbf{W}_{hh} affects \mathbf{h}_1, which affects \mathbf{h}_2, which affects \mathbf{h}_3, and so on. The total effect on the loss accumulates contributions from all these pathways.

Using the chain rule, we can express this accumulated gradient as:

\frac{\partial \mathcal{L}}{\partial \mathbf{W}_{hh}} = \sum_{t=1}^{T} \frac{\partial \mathcal{L}}{\partial \mathbf{h}_T} \frac{\partial \mathbf{h}_T}{\partial \mathbf{h}_t} \frac{\partial \mathbf{h}_t}{\partial \mathbf{W}_{hh}}

where:

  • \frac{\partial \mathcal{L}}{\partial \mathbf{W}_{hh}}: the gradient of the loss with respect to the recurrent weight matrix
  • \sum_{t=1}^{T}: summation over all timesteps, because \mathbf{W}_{hh} influences every hidden state
  • \frac{\partial \mathcal{L}}{\partial \mathbf{h}_T}: how the loss changes with the final hidden state (computed directly from the loss function)
  • \frac{\partial \mathbf{h}_T}{\partial \mathbf{h}_t}: how the final hidden state depends on an earlier hidden state at time t
  • \frac{\partial \mathbf{h}_t}{\partial \mathbf{W}_{hh}}: how the hidden state at time t depends directly on the weights

Let's focus on the critical term: \frac{\partial \mathbf{h}_T}{\partial \mathbf{h}_t}. This measures how sensitive the final hidden state is to changes in an earlier hidden state. If this term is small, then even if the loss is very sensitive to \mathbf{h}_T, that sensitivity won't propagate back to influence how we update weights that affected \mathbf{h}_t.

Because each hidden state depends on the previous one through the recurrence (\mathbf{h}_k depends on \mathbf{h}_{k-1}), we must chain together all the intermediate derivatives. This is where the product structure emerges. By the chain rule:

\frac{\partial \mathbf{h}_T}{\partial \mathbf{h}_t} = \prod_{k=t+1}^{T} \frac{\partial \mathbf{h}_k}{\partial \mathbf{h}_{k-1}}

where:

  • \prod_{k=t+1}^{T}: a product of T - t terms, one for each timestep between t and T
  • \frac{\partial \mathbf{h}_k}{\partial \mathbf{h}_{k-1}}: the Jacobian matrix describing how hidden state k depends on hidden state k-1

This product structure is the heart of the vanishing gradient problem. To go from timestep t to timestep T, we multiply together T - t matrices. If each matrix shrinks vectors (has norm less than 1), the product shrinks exponentially. If each matrix expands vectors (has norm greater than 1), the product explodes exponentially.

To understand what these Jacobian matrices look like, we differentiate the recurrence equation. Starting from \mathbf{h}_k = \tanh(\mathbf{W}_{hh} \mathbf{h}_{k-1} + \mathbf{W}_{xh} \mathbf{x}_k + \mathbf{b}_h), we apply the chain rule to find how \mathbf{h}_k changes when we perturb \mathbf{h}_{k-1}:

\frac{\partial \mathbf{h}_k}{\partial \mathbf{h}_{k-1}} = \text{diag}(\tanh'(\mathbf{z}_k)) \cdot \mathbf{W}_{hh}

where:

  • \mathbf{z}_k = \mathbf{W}_{hh} \mathbf{h}_{k-1} + \mathbf{W}_{xh} \mathbf{x}_k + \mathbf{b}_h: the pre-activation (input to tanh) at timestep k
  • \text{diag}(\tanh'(\mathbf{z}_k)): a d \times d diagonal matrix where entry (i, i) equals \tanh'(z_k^{(i)}), the derivative of tanh evaluated at the i-th component of \mathbf{z}_k
  • \tanh'(z) = 1 - \tanh^2(z): the derivative of the tanh function, which is always between 0 and 1
  • \mathbf{W}_{hh}: the recurrent weight matrix

The Jacobian factors into two components:

  1. The diagonal matrix \text{diag}(\tanh'(\mathbf{z}_k)): This captures how the tanh activation attenuates the gradient. Since \tanh'(z) \leq 1 everywhere, this matrix can only shrink gradients, never amplify them.

  2. The weight matrix \mathbf{W}_{hh}: This determines how hidden state dimensions mix. Its eigenvalues control whether the overall transformation shrinks or expands vectors.

Jacobian Matrix

The Jacobian \frac{\partial \mathbf{h}_k}{\partial \mathbf{h}_{k-1}} is a d \times d matrix where entry (i, j) tells us how much h_k^{(i)} changes when we perturb h_{k-1}^{(j)}. For RNNs, this Jacobian factors into the elementwise activation derivative times the weight matrix.

Now we can write the complete expression for how gradients flow from timestep T back to timestep t:

\frac{\partial \mathbf{h}_T}{\partial \mathbf{h}_t} = \prod_{k=t+1}^{T} \text{diag}(\tanh'(\mathbf{z}_k)) \cdot \mathbf{W}_{hh}

This is a product of T - t matrices, each of size d \times d (the hidden state dimension). The behavior of this product determines whether gradients vanish, explode, or remain stable. If we're computing the gradient for a timestep 50 steps back, we multiply 50 matrices together. Small changes in the magnitude of each factor compound exponentially across these multiplications.

Why Gradients Vanish

We've established that gradients flow backward through a product of Jacobian matrices. Now let's understand precisely when and why this product shrinks to near-zero.

The key insight comes from linear algebra: when you multiply matrices repeatedly, the behavior of the product is governed by the eigenvalues of those matrices. Eigenvalues tell us how much a matrix stretches or shrinks vectors in different directions. If all eigenvalues have magnitude less than 1, repeated multiplication shrinks any vector toward zero.

The Eigenvalue Perspective

To build intuition, consider a simplified case where the tanh derivative is constant (say, close to 1) and we're repeatedly multiplying by \mathbf{W}_{hh}. The spectral radius \rho(\mathbf{W}_{hh}) is the largest eigenvalue in absolute value, denoted |\lambda_{\max}|. This single number largely determines the fate of our gradients.

After n multiplications by \mathbf{W}_{hh} (the short numerical check after this list illustrates all three regimes):

  • If |\lambda_{\max}| < 1: the product \mathbf{W}_{hh}^n \to \mathbf{0} as n \to \infty
  • If |\lambda_{\max}| > 1: the product \mathbf{W}_{hh}^n \to \infty as n \to \infty
  • If |\lambda_{\max}| = 1: the product may remain bounded
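
To see these three regimes numerically, we can rescale a random matrix to a chosen spectral radius, raise it to a high power, and watch the norm of the result. This is a quick illustrative check, not tied to any particular RNN:

import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(10, 10))

for rho in [0.9, 1.0, 1.1]:
    # Rescale W so its largest eigenvalue magnitude equals rho
    W_scaled = W * (rho / np.abs(np.linalg.eigvals(W)).max())
    power = np.linalg.matrix_power(W_scaled, 50)
    print(f"spectral radius {rho}: ||W^50|| = {np.linalg.norm(power):.3e}")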

But we can't ignore the tanh derivative. The derivative \tanh'(z) = 1 - \tanh^2(z) has a crucial property: it's always between 0 and 1, with equality to 1 only when z = 0. In practice, pre-activation values are rarely exactly zero, so the derivative is typically around 0.5 or less. When activations saturate (large positive or negative values), the derivative approaches 0.

This means the diagonal matrix \text{diag}(\tanh'(\mathbf{z}_k)) acts as an additional shrinking factor at every timestep. Even if \mathbf{W}_{hh} has spectral radius close to 1, the tanh derivative pulls the effective decay factor below 1.

To see what typical tanh derivative values look like in practice, let's simulate an RNN and collect the pre-activation values:
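
The cell that generates these histograms isn't reproduced above, but a minimal sketch of the underlying simulation might look like the following, assuming a randomly initialized tanh RNN driven by Gaussian inputs (the histogram plotting is omitted):

import numpy as np

rng = np.random.default_rng(42)
d, n, T = 50, 10, 200  # hidden size, input size, timesteps (assumed values)

W_hh = rng.normal(scale=1.0 / np.sqrt(d), size=(d, d))
W_xh = rng.normal(scale=1.0 / np.sqrt(n), size=(d, n))

h = np.zeros(d)
pre_activations = []
for t in range(T):
    z = W_hh @ h + W_xh @ rng.normal(size=n)  # pre-activation fed into tanh
    pre_activations.append(z)
    h = np.tanh(z)

z_all = np.concatenate(pre_activations)
tanh_derivs = 1 - np.tanh(z_all) ** 2  # derivative values, always in (0, 1]
print(f"mean tanh derivative: {tanh_derivs.mean():.2f}")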

Out[3]:
Visualization
Histogram of pre-activation values centered near zero.
Distribution of pre-activation values (inputs to tanh) in a simulated RNN. Values cluster around zero but spread across a wide range, with most falling between -2 and 2.
Histogram of tanh derivatives clustered below 1 with mean line.
Distribution of tanh derivatives showing most values fall below 0.8 with a mean around 0.6. This confirms that tanh typically shrinks gradients by 40-50% at each timestep.

The simulation reveals why gradients vanish: even though tanh derivatives can reach 1, they rarely do in practice. The mean derivative is around 0.6, meaning each timestep shrinks gradients by roughly 40%. Combined with the weight matrix's contribution, the effective decay factor is typically well below 1.

Out[4]:
Visualization
Plot of tanh function ranging from -1 to 1.
The tanh activation function maps inputs to the range (-1, 1), with saturation occurring for large positive or negative inputs.
Plot of tanh derivative peaking at 1 near zero input.
The tanh derivative peaks at 1 when input is 0 and approaches 0 as input magnitude increases. Gradients are maximally preserved only near zero.

Quantifying the Decay

Let's put concrete numbers to this analysis. We can approximate the gradient magnitude by considering the "effective decay factor" at each timestep. This factor combines the contribution from the weight matrix eigenvalues and the tanh derivative.

Suppose the tanh derivative averages 0.5 across timesteps (a reasonable estimate for non-saturated activations), and suppose the spectral radius of \mathbf{W}_{hh} is 0.9. The gradient magnitude after n backward steps scales roughly as:

\|\text{gradient}\| \propto \gamma^n

where:

  • \|\text{gradient}\|: the magnitude (norm) of the gradient vector
  • \gamma: the effective decay factor per timestep, approximately \gamma \approx \bar{\sigma} \cdot |\lambda_{\max}|
  • \bar{\sigma}: the average tanh derivative (here \bar{\sigma} \approx 0.5)
  • |\lambda_{\max}|: the spectral radius of \mathbf{W}_{hh} (here |\lambda_{\max}| = 0.9)
  • n: the number of timesteps the gradient must travel backward

With our example values, \gamma = 0.5 \times 0.9 = 0.45, so:

\|\text{gradient}\| \propto 0.45^n

After just 10 timesteps, this factor is 0.45^{10} \approx 0.00034, meaning the gradient has shrunk by a factor of nearly 3,000. After 20 timesteps, it's 0.45^{20} \approx 10^{-7}. The gradient has essentially vanished into numerical noise.

In[5]:
Code
# Simulate gradient decay over timesteps
timesteps = np.arange(1, 51)

# Different decay factors representing various weight/activation combinations
decay_factors = {
    "γ = 0.9 (mild decay)": 0.9,
    "γ = 0.7 (moderate decay)": 0.7,
    "γ = 0.5 (typical RNN)": 0.5,
    "γ = 0.3 (severe decay)": 0.3,
}

gradient_magnitudes = {}
for label, gamma in decay_factors.items():
    gradient_magnitudes[label] = gamma**timesteps
Out[6]:
Visualization
Semi-log plot showing exponential gradient decay curves for four different decay factors.
Gradient magnitude decay over timesteps for different effective decay factors γ. Even with mild decay (γ = 0.9), gradients shrink by orders of magnitude over 50 timesteps. Typical RNNs with tanh activation experience decay factors around 0.5, causing gradients to become negligible within 20 steps.

This exponential decay is the core of the vanishing gradient problem. Information from early timesteps cannot effectively influence the weight updates because the gradient signal decays to numerical insignificance before reaching them.

Visualizing Gradient Flow

Let's build a simple RNN and observe gradient flow directly. We'll create a sequence prediction task and measure how gradients decay as they propagate backward through time.

In[7]:
Code
import torch
import torch.nn as nn

# Set seed for reproducibility
torch.manual_seed(42)


class SimpleRNN(nn.Module):
    """A minimal RNN for gradient analysis."""

    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.W_xh = nn.Linear(input_size, hidden_size, bias=False)
        self.W_hh = nn.Linear(hidden_size, hidden_size, bias=False)
        self.W_hy = nn.Linear(hidden_size, 1, bias=False)

        # Initialize with smaller weights to avoid immediate explosion
        nn.init.xavier_uniform_(self.W_xh.weight)
        nn.init.xavier_uniform_(self.W_hh.weight, gain=0.5)
        nn.init.xavier_uniform_(self.W_hy.weight)

    def forward(self, x, return_hiddens=False):
        """
        x: (batch_size, seq_len, input_size)
        Returns: output and optionally all hidden states
        """
        batch_size, seq_len, _ = x.shape
        h = torch.zeros(batch_size, self.hidden_size)

        hiddens = [h]
        for t in range(seq_len):
            h = torch.tanh(self.W_xh(x[:, t]) + self.W_hh(h))
            hiddens.append(h)

        output = self.W_hy(h)

        if return_hiddens:
            return output, hiddens
        return output

This RNN stores all hidden states so we can examine gradients at each timestep. Let's create a simple task and measure gradient magnitudes.

In[8]:
Code
# Create a sequence where the target depends on early inputs
seq_len = 30
input_size = 5
hidden_size = 20
batch_size = 1

# Random input sequence
x = torch.randn(batch_size, seq_len, input_size)

# Target depends on sum of first few inputs (long-range dependency)
target = x[:, :5].sum()

# Create model and compute forward pass
model = SimpleRNN(input_size, hidden_size)
output, hiddens = model(x, return_hiddens=True)

# Compute loss
loss = (output - target) ** 2

# Backpropagate
loss.backward()

Now we can examine how gradients flow to each hidden state. We'll re-run the forward pass while calling retain_grad() on each intermediate hidden state, so that backpropagation records a gradient we can measure at every timestep.

In[9]:
Code
def measure_gradient_flow(model, x, target, seq_len):
    """Measure gradient magnitudes at each timestep."""
    model.zero_grad()

    # Store hidden states with gradient tracking
    batch_size = x.shape[0]
    h = torch.zeros(batch_size, model.hidden_size, requires_grad=True)

    hidden_states = [h]
    for t in range(seq_len):
        h = torch.tanh(model.W_xh(x[:, t]) + model.W_hh(h))
        h.retain_grad()  # Keep gradients for intermediate tensors
        hidden_states.append(h)

    output = model.W_hy(h)
    loss = (output - target) ** 2
    loss.backward()

    # Collect gradient magnitudes
    gradient_norms = []
    for h in hidden_states[1:]:  # Skip initial zero state
        if h.grad is not None:
            gradient_norms.append(h.grad.norm().item())
        else:
            gradient_norms.append(0.0)

    return gradient_norms


# Measure gradients
gradient_norms = measure_gradient_flow(model, x, target, seq_len)
Out[10]:
Visualization
Bar chart showing gradient magnitudes decreasing exponentially from timestep 30 back to timestep 1.
Gradient magnitudes at each timestep in a 30-step RNN. The loss is computed at the final timestep (t=30), and gradients flow backward. Notice how gradient magnitude drops precipitously for earlier timesteps, with a ratio of last-to-first gradient exceeding 10^4, making it nearly impossible to learn dependencies on inputs from the beginning of the sequence.

The visualization reveals the problem starkly. Gradients at the final timesteps are orders of magnitude larger than those at early timesteps. Any information about how early inputs should influence the output is washed away by this exponential decay.

Long-Range Dependency Failure

The vanishing gradient problem isn't just a mathematical curiosity. It has direct consequences for what RNNs can learn. Let's construct a task that explicitly requires long-range dependencies and watch a vanilla RNN struggle.

The Copying Task

A classic benchmark for long-range dependencies is the copying task. The network sees a sequence of random symbols, followed by a delay period of blank symbols, then must reproduce the original sequence. Success requires remembering information across the entire delay.

In[11]:
Code
def generate_copying_data(batch_size, seq_len, delay, n_symbols=8):
    """
    Generate copying task data.

    Input: [random symbols] [blanks for delay] [marker] [blanks]
    Target: [blanks] [blanks for delay] [blanks] [original symbols]
    """
    # Random symbols (1 to n_symbols, 0 is blank)
    symbols = torch.randint(1, n_symbols + 1, (batch_size, seq_len))

    # Input sequence: symbols, delay blanks, marker (n_symbols + 1), output blanks
    marker = n_symbols + 1
    total_len = seq_len + delay + 1 + seq_len

    inputs = torch.zeros(batch_size, total_len, dtype=torch.long)
    inputs[:, :seq_len] = symbols
    inputs[:, seq_len + delay] = marker

    # Target: blanks until after marker, then original symbols
    targets = torch.zeros(batch_size, total_len, dtype=torch.long)
    targets[:, seq_len + delay + 1 :] = symbols

    return inputs, targets


# Generate example data
batch_size = 32
seq_len = 10
delay = 20  # Must remember across 20+ timesteps
n_symbols = 8

inputs, targets = generate_copying_data(batch_size, seq_len, delay, n_symbols)
Out[12]:
Console
Example copying task:
Sequence length: 10, Delay: 20

Input (first example):  [5, 6, 1, 5, 1, 4, 2, 6, 2, 1, 0, 0, 0, 0, 0] ... [0, 0, 0, 0, 9, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Target (first example): [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] ... [0, 0, 0, 0, 0, 5, 6, 1, 5, 1, 4, 2, 6, 2, 1]

The network must remember the first 10 symbols across 20 blank timesteps.

The input sequence shows random symbols (1-8) at the beginning, followed by zeros (blanks) during the delay period, then a marker (9), and finally more blanks where the network should output its predictions. The target is all zeros except at the very end, where it contains the original symbols the network must recall. This task is particularly challenging because the gradient signal from the recall phase must travel backward through 20+ timesteps of blanks to reach the encoding phase.

Training Comparison

Let's train a vanilla RNN on this task and observe its failure to learn long-range dependencies.

In[13]:
Code
class CopyingRNN(nn.Module):
    """RNN for the copying task."""

    def __init__(self, n_symbols, hidden_size, use_lstm=False):
        super().__init__()
        self.embedding = nn.Embedding(
            n_symbols + 2, hidden_size
        )  # +2 for blank and marker

        if use_lstm:
            self.rnn = nn.LSTM(hidden_size, hidden_size, batch_first=True)
        else:
            self.rnn = nn.RNN(
                hidden_size, hidden_size, batch_first=True, nonlinearity="tanh"
            )

        self.output = nn.Linear(hidden_size, n_symbols + 2)
        self.use_lstm = use_lstm

    def forward(self, x):
        embedded = self.embedding(x)
        rnn_out, _ = self.rnn(embedded)
        return self.output(rnn_out)


def train_copying_model(
    model, n_epochs=500, batch_size=32, seq_len=10, delay=20, n_symbols=8
):
    """Train model on copying task and return loss history."""
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    criterion = nn.CrossEntropyLoss()

    losses = []
    accuracies = []

    for epoch in range(n_epochs):
        inputs, targets = generate_copying_data(
            batch_size, seq_len, delay, n_symbols
        )

        optimizer.zero_grad()
        outputs = model(inputs)

        # Reshape for loss computation
        outputs_flat = outputs.view(-1, n_symbols + 2)
        targets_flat = targets.view(-1)

        loss = criterion(outputs_flat, targets_flat)
        loss.backward()

        # Clip gradients to prevent explosion
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        optimizer.step()

        # Compute accuracy on the recall portion
        with torch.no_grad():
            recall_start = seq_len + delay + 1
            predictions = outputs[:, recall_start:].argmax(dim=-1)
            correct = (predictions == targets[:, recall_start:]).float().mean()
            accuracies.append(correct.item())

        losses.append(loss.item())

    return losses, accuracies


# Train vanilla RNN
hidden_size = 64
rnn_model = CopyingRNN(n_symbols, hidden_size, use_lstm=False)
rnn_losses, rnn_accuracies = train_copying_model(
    rnn_model, n_epochs=500, delay=20
)
Out[14]:
Visualization
Line plot showing training loss decreasing but plateauing.
Training loss for vanilla RNN on the copying task with delay=20. The loss decreases initially but plateaus at a high value, indicating the model cannot fully solve the task.
Line plot showing recall accuracy remaining low near random chance.
Recall accuracy showing the RNN's ability to reproduce the original sequence after the delay. Performance barely exceeds random chance because gradients cannot effectively reach the encoding phase.

The vanilla RNN fails to learn the copying task. Its recall accuracy hovers near random chance because gradients from the output phase cannot effectively propagate back through 20+ timesteps to inform how the network should encode the original symbols.
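
For contrast, the CopyingRNN class above already supports a gated variant via use_lstm=True, so the identical training loop can be rerun with an LSTM. This is only a sketch and its results aren't reproduced here, but gated architectures are known to handle delays like this far better than vanilla RNNs:

# Train an LSTM on the same copying task for comparison (sketch; results not shown)
lstm_model = CopyingRNN(n_symbols, hidden_size, use_lstm=True)
lstm_losses, lstm_accuracies = train_copying_model(
    lstm_model, n_epochs=500, delay=20
)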

Vanishing vs. Exploding Gradients

We've focused on vanishing gradients, but the same mathematical machinery can cause the opposite problem: exploding gradients. When the spectral radius of the Jacobian exceeds 1, gradients grow exponentially instead of shrinking.

The Exploding Case

The same analysis that explains vanishing gradients also predicts when gradients will explode. If the effective decay factor exceeds 1, gradients grow instead of shrink. Specifically, if |\lambda_{\max}| > 1 (and the tanh derivatives don't shrink the gradients enough to compensate), the gradient product explodes:

\left\| \prod_{k=t+1}^{T} \frac{\partial \mathbf{h}_k}{\partial \mathbf{h}_{k-1}} \right\| \to \infty \quad \text{as } T - t \to \infty

where:

  • \left\| \cdot \right\|: the matrix norm (measuring the "size" of the gradient)
  • \prod_{k=t+1}^{T}: the product of Jacobian matrices from timestep t+1 to T
  • T - t: the number of timesteps the gradient travels backward

As the sequence length increases, this product grows without bound. Exploding gradients manifest as:

  • NaN or Inf values during training
  • Extremely large weight updates that destabilize learning
  • Loss values that spike unpredictably
In[15]:
Code
# Demonstrate exploding gradients with larger weight initialization
def simulate_gradient_behavior(weight_scale, n_steps=50, hidden_size=10):
    """Simulate gradient flow with different weight scales."""
    W = torch.randn(hidden_size, hidden_size) * weight_scale

    gradient = torch.eye(hidden_size)
    gradient_norms = [gradient.norm().item()]

    for _ in range(n_steps):
        # Simulate Jacobian: diag(tanh') @ W
        # Use average tanh derivative of 0.65
        tanh_deriv = 0.65 * torch.ones(hidden_size)
        jacobian = torch.diag(tanh_deriv) @ W
        gradient = jacobian @ gradient
        gradient_norms.append(gradient.norm().item())

    return gradient_norms


# Different weight scales
scales = {
    "Small (σ=0.3)": 0.3,
    "Medium (σ=0.5)": 0.5,
    "Large (σ=0.8)": 0.8,
    "Very large (σ=1.2)": 1.2,
}

torch.manual_seed(42)
gradient_behaviors = {
    label: simulate_gradient_behavior(scale) for label, scale in scales.items()
}
Out[16]:
Visualization
Semi-log plot showing four curves with different gradient behaviors from vanishing to exploding based on weight scale.
Gradient behavior for different weight initialization scales. Small weights cause vanishing gradients (blue, orange curves decaying to zero). Large weights cause exploding gradients (purple curve growing exponentially). The dashed line at y=1 marks stable gradient flow, which is difficult to maintain in vanilla RNNs.

The Asymmetry Problem

Here's a crucial insight: vanishing and exploding gradients are not symmetric problems. Exploding gradients can be mitigated with gradient clipping, a simple technique that rescales gradients when their norm exceeds a threshold:

In[17]:
Code
def clip_gradient(gradient, max_norm=1.0):
    """Clip gradient to maximum norm."""
    grad_norm = gradient.norm()
    if grad_norm > max_norm:
        gradient = gradient * (max_norm / grad_norm)
    return gradient, grad_norm.item()


# Example: clip an exploding gradient
large_gradient = torch.randn(100) * 100
clipped, original_norm = clip_gradient(large_gradient)
Out[18]:
Console
Original gradient norm: 981.63
Clipped gradient norm: 1.00

The original gradient had a norm of approximately 1,000 (100 standard-normal elements each scaled by 100 give an expected norm near 100 × √100 = 1,000), but after clipping it's reduced to exactly 1.0. The gradient direction is preserved; only the magnitude is scaled down. This prevents the catastrophically large weight updates that would otherwise destabilize training.

Gradient clipping is standard practice in RNN training. However, there is no analogous fix for vanishing gradients. You cannot amplify a gradient that has already decayed to zero, because you've lost the information about which direction to go. This asymmetry is why vanishing gradients are the more fundamental problem.

Gradient Clipping

Gradient clipping rescales the gradient vector when its norm exceeds a threshold, preventing extremely large updates. It addresses exploding gradients but cannot recover information lost to vanishing gradients.

The Fundamental Limitation

The vanishing gradient problem reveals a fundamental tension in RNN design. To learn long-range dependencies, gradients must flow across many timesteps without decaying. But the repeated multiplication inherent in backpropagation through time makes stable gradient flow mathematically difficult.

Why Standard Techniques Fail

You might wonder: can't we just initialize weights carefully or use different activation functions? Let's examine why these approaches fall short.

Careful initialization: We could initialize \mathbf{W}_{hh} as an orthogonal matrix (all eigenvalues with magnitude 1). This helps initially, but as training progresses, the weights change and typically lose their orthogonality. The gradient problem resurfaces.

ReLU activation: Unlike tanh, ReLU has derivative 1 for positive inputs. But ReLU creates a different problem: hidden states can grow unboundedly since there's no squashing. The network becomes unstable.

Smaller learning rates: This slows down training but doesn't address the fundamental issue. Gradients still vanish; we just take smaller steps in directions that are already nearly zero.
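
To illustrate the first point, PyTorch's orthogonal initializer produces a recurrent matrix whose eigenvalues all have magnitude 1 at initialization; nothing, however, keeps them there once gradient descent starts changing the weights. A quick check of the initial spectrum (illustrative only):

import torch
import torch.nn as nn

W = torch.empty(20, 20)
nn.init.orthogonal_(W)  # orthogonal matrix: all eigenvalue magnitudes equal 1

magnitudes = torch.linalg.eigvals(W).abs()
print(magnitudes.min().item(), magnitudes.max().item())  # both very close to 1.0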

In[19]:
Code
# Compare activation function derivatives
z = np.linspace(-3, 3, 200)

activations = {
    "tanh": (np.tanh(z), 1 - np.tanh(z) ** 2),
    "sigmoid": (
        1 / (1 + np.exp(-z)),
        (1 / (1 + np.exp(-z))) * (1 - 1 / (1 + np.exp(-z))),
    ),
    "ReLU": (np.maximum(0, z), (z > 0).astype(float)),
}
Out[20]:
Visualization
Plot of sigmoid derivative peaking at 0.25.
Sigmoid derivative bounded well below 1 for most inputs, peaking at 0.25 when z=0. This causes significant gradient shrinkage at every timestep.
Plot of tanh derivative peaking at 1.
Tanh derivative peaks at 1 when z=0 but falls rapidly for larger inputs. Most hidden states have derivatives around 0.5, contributing to vanishing gradients.
Plot of ReLU derivative as step function.
ReLU has derivative exactly 1 for positive inputs but 0 for negative inputs. While it doesn't shrink gradients for positive activations, it creates 'dead' neurons.

The Need for Architectural Solutions

The vanishing gradient problem cannot be solved by tweaking hyperparameters or initialization schemes. It requires a fundamental change in how information flows through the network. The solution is to create alternative paths for gradient flow that bypass the problematic repeated multiplications.

This insight led to the development of gated architectures:

  • LSTM (Long Short-Term Memory): Introduces a cell state that acts as a "gradient highway," allowing information to flow across many timesteps with minimal transformation.
  • GRU (Gated Recurrent Unit): A simplified gating mechanism that achieves similar benefits with fewer parameters.

These architectures don't eliminate the Jacobian products entirely. Instead, they introduce additive connections that allow gradients to flow without repeated multiplication. We'll explore these solutions in detail in the following chapters.

A Worked Example: Gradient Decay Analysis

The mathematical analysis above tells us that gradients should decay exponentially. But how does this play out in practice with actual numbers? Let's trace through a complete example, computing every intermediate value by hand, to see the vanishing gradient problem unfold step by step.

This worked example serves two purposes: it grounds the abstract formulas in concrete computation, and it reveals the quantitative severity of the problem even for very short sequences.

Setup

We'll use the simplest possible RNN: hidden size d=2d = 2 (so we can easily visualize the matrices) and a simplified recurrence that ignores inputs:

\mathbf{h}_t = \tanh(\mathbf{W}_{hh} \mathbf{h}_{t-1})

where:

  • \mathbf{h}_t \in \mathbb{R}^2: the hidden state vector at timestep t
  • \mathbf{W}_{hh} \in \mathbb{R}^{2 \times 2}: the recurrent weight matrix
  • \tanh: applied elementwise to the vector

We use a specific weight matrix with moderate values:

\mathbf{W}_{hh} = \begin{pmatrix} 0.5 & 0.3 \\ -0.2 & 0.4 \end{pmatrix}

Let's verify our expectation about this matrix. The eigenvalues of \mathbf{W}_{hh} are approximately 0.45 + 0.24i and 0.45 - 0.24i, both with magnitude around 0.51. Since |\lambda_{\max}| < 1, we expect gradients to decay. Combined with the tanh derivative (which further shrinks gradients), the decay should be substantial even over just 5 timesteps.
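
We can confirm these eigenvalues directly before running the forward pass:

import numpy as np

W = np.array([[0.5, 0.3], [-0.2, 0.4]])
eigenvalues = np.linalg.eigvals(W)
print(eigenvalues)          # approximately 0.45 + 0.24j and 0.45 - 0.24j
print(np.abs(eigenvalues))  # both approximately 0.51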

Starting from the initial hidden state \mathbf{h}_0 = (1, 0)^T, we'll trace through two phases:

  1. Forward pass: Compute \mathbf{h}_1, \mathbf{h}_2, \ldots, \mathbf{h}_5 and the pre-activations \mathbf{z}_1, \ldots, \mathbf{z}_5
  2. Backward pass: Compute the Jacobians at each timestep and multiply them to find \frac{\partial \mathbf{h}_5}{\partial \mathbf{h}_0}
In[21]:
Code
# Define the weight matrix
W = np.array([[0.5, 0.3], [-0.2, 0.4]])

# Forward pass
h = np.array([1.0, 0.0])
hidden_states = [h.copy()]
pre_activations = []

for t in range(5):
    z = W @ h
    pre_activations.append(z.copy())
    h = np.tanh(z)
    hidden_states.append(h.copy())
Out[22]:
Console
Forward pass through 5 timesteps:
h_0 = [1. 0.]
z_1 = [ 0.5 -0.2], h_1 = [ 0.4621 -0.1974]
z_2 = [ 0.1718 -0.1714], h_2 = [ 0.1702 -0.1697]
z_3 = [ 0.0342 -0.1019], h_3 = [ 0.0342 -0.1016]
z_4 = [-0.0134 -0.0475], h_4 = [-0.0134 -0.0474]
z_5 = [-0.0209 -0.0163], h_5 = [-0.0209 -0.0163]

The forward pass shows how the hidden state evolves over time. Starting from \mathbf{h}_0 = (1, 0), each timestep applies the weight matrix and then the tanh nonlinearity. Because the spectral radius of \mathbf{W}_{hh} is only about 0.51 and the tanh derivative never exceeds 1, the hidden state shrinks toward zero, and the same factors will shrink gradients in the backward pass.

Computing the Jacobian Chain

Now comes the crucial part: computing how gradients flow backward. We want \frac{\partial \mathbf{h}_5}{\partial \mathbf{h}_0}, which tells us how a small perturbation to the initial hidden state affects the final hidden state. By the chain rule, this equals the product of all five intermediate Jacobians:

\frac{\partial \mathbf{h}_5}{\partial \mathbf{h}_0} = \frac{\partial \mathbf{h}_5}{\partial \mathbf{h}_4} \cdot \frac{\partial \mathbf{h}_4}{\partial \mathbf{h}_3} \cdot \frac{\partial \mathbf{h}_3}{\partial \mathbf{h}_2} \cdot \frac{\partial \mathbf{h}_2}{\partial \mathbf{h}_1} \cdot \frac{\partial \mathbf{h}_1}{\partial \mathbf{h}_0}

Each Jacobian has the form \text{diag}(\tanh'(\mathbf{z}_k)) \cdot \mathbf{W}_{hh}. Let's compute them:

In[23]:
Code
def compute_jacobian(W, z):
    """Compute Jacobian of h_t with respect to h_{t-1}."""
    tanh_deriv = 1 - np.tanh(z) ** 2
    return np.diag(tanh_deriv) @ W


# Compute all Jacobians
jacobians = []
for t in range(5):
    J = compute_jacobian(W, pre_activations[t])
    jacobians.append(J)

# Multiply Jacobians to get total gradient
total_jacobian = np.eye(2)
jacobian_norms = [np.linalg.norm(total_jacobian)]

for J in reversed(jacobians):
    total_jacobian = J @ total_jacobian
    jacobian_norms.append(np.linalg.norm(total_jacobian))
Out[24]:
Console

Jacobian at each timestep:
J_1 = diag(tanh'(z_1)) @ W:
  [[ 0.3932  0.2359]
 [-0.1922  0.3844]]
  Spectral radius: 0.4433
J_2 = diag(tanh'(z_2)) @ W:
  [[ 0.4855  0.2913]
 [-0.1942  0.3885]]
  Spectral radius: 0.4952
J_3 = diag(tanh'(z_3)) @ W:
  [[ 0.4994  0.2996]
 [-0.1979  0.3959]]
  Spectral radius: 0.5070
J_4 = diag(tanh'(z_4)) @ W:
  [[ 0.4999  0.2999]
 [-0.1996  0.3991]]
  Spectral radius: 0.5093
J_5 = diag(tanh'(z_5)) @ W:
  [[ 0.4998  0.2999]
 [-0.1999  0.3999]]
  Spectral radius: 0.5097

Total Jacobian (∂h_5/∂h_0):
  [[-0.016452  0.02102 ]
 [-0.017035 -0.028961]]
  Frobenius norm: 0.042912

Each Jacobian has a spectral radius of roughly 0.44 to 0.51, well below 1, meaning each backward step shrinks the gradient. The total Jacobian, computed by multiplying all five intermediate Jacobians, has a Frobenius norm of about 0.043, a small fraction of the identity matrix's norm (1.41). This quantifies exactly how much gradient information is lost over just 5 timesteps.

To visualize this decay, let's look at the cumulative Jacobian as we multiply backward through time. We start with the identity matrix (representing "no transformation yet") and progressively multiply by each Jacobian:

Out[25]:
Visualization
Six heatmaps showing cumulative Jacobian matrices from identity to final product, with color intensity decreasing.
Heatmaps showing how the cumulative Jacobian shrinks as we multiply backward through timesteps. Starting from the identity matrix (left), each multiplication by a Jacobian with spectral radius < 1 shrinks the matrix entries. By step 5, the matrix entries are nearly zero, meaning gradients from the initial timestep carry almost no information.

The heatmaps make the decay viscerally clear: the identity matrix starts with 1s on the diagonal, but after just 5 multiplications, all entries have shrunk toward zero. The gradient information has been systematically attenuated at each step.

Out[26]:
Visualization
Line plot showing decreasing Jacobian norm from 1.4 to near 0 over 5 backward steps.
Cumulative Jacobian norm as we multiply backward through timesteps. Starting from the identity matrix (norm = 1.41 for 2x2), each Jacobian multiplication shrinks the gradient. After 5 steps, the gradient norm has decayed from 1.41 to approximately 0.02, a reduction of nearly 99%.

This worked example crystallizes the vanishing gradient problem:

  1. Each Jacobian shrinks: Every intermediate Jacobian has spectral radius below 1 (between roughly 0.44 and 0.51), meaning each step backward shrinks the gradient.

  2. Decay compounds multiplicatively: The total Jacobian norm after 5 steps is roughly (0.5)^5 \approx 0.03 times its starting value (from 1.41 down to about 0.043), meaning gradients have decayed by a factor of about 30.

  3. Extrapolation is sobering: For a 50-timestep sequence, the decay factor would be approximately (0.5)^{50} \approx 10^{-15}, far smaller than single-precision round-off and useless as a learning signal.

For real-world sequences with hundreds or thousands of timesteps, the decay is catastrophic. This is why vanilla RNNs cannot learn long-range dependencies, no matter how much data or compute you throw at them.

Limitations and Impact

The vanishing gradient problem fundamentally limits what vanilla RNNs can learn. Understanding these limitations clarifies why more sophisticated architectures became necessary.

What Vanilla RNNs Cannot Learn

Vanilla RNNs struggle with tasks requiring dependencies beyond approximately 10-20 timesteps. This includes:

  • Language modeling with long context: Predicting the next word often requires understanding sentence structure established many words earlier. A vanilla RNN cannot maintain this context.
  • Document classification: Understanding a document's topic requires integrating information across paragraphs. Gradients from the classification loss cannot reach early sentences.
  • Time series with long periodicity: Seasonal patterns in data (daily, weekly, yearly cycles) require remembering information across the cycle length.
  • Question answering: Answering questions about a passage requires connecting the question to relevant parts of the passage, which may be far apart.

The Path Forward

The vanishing gradient problem, first clearly articulated by Hochreiter (1991) and Bengio et al. (1994), motivated a generation of architectural innovations. These researchers showed mathematically why RNNs fail on long sequences and pointed toward solutions involving gating mechanisms.

The LSTM architecture, introduced by Hochreiter and Schmidhuber in 1997, directly addresses vanishing gradients by creating a "cell state" that can carry information across many timesteps without repeated nonlinear transformations. The key insight is additive updates: instead of computing \mathbf{h}_t = f(\mathbf{W}\mathbf{h}_{t-1}) where gradients must pass through a nonlinearity at every step, LSTMs compute:

\mathbf{c}_t = \mathbf{c}_{t-1} + \Delta\mathbf{c}_t

where:

  • \mathbf{c}_t: the cell state at timestep t
  • \mathbf{c}_{t-1}: the cell state from the previous timestep
  • \Delta\mathbf{c}_t: the update to the cell state, controlled by learned gates

The addition operation is crucial: when we backpropagate through \mathbf{c}_t = \mathbf{c}_{t-1} + \Delta\mathbf{c}_t, the gradient with respect to \mathbf{c}_{t-1} is simply 1 (plus any contribution through \Delta\mathbf{c}_t). This creates a "gradient highway" where information can flow backward across many timesteps without multiplicative decay.
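
A tiny autograd check makes this pass-through behavior concrete. Here the update term is a stand-in that doesn't depend on the previous cell state, which isolates the additive path (in a real LSTM, \Delta\mathbf{c}_t is produced by gates):

import torch

c_prev = torch.ones(3, requires_grad=True)  # previous cell state c_{t-1}
delta = torch.randn(3)                      # stand-in for the gated update (independent of c_prev)
c_t = c_prev + delta                        # additive cell-state update

c_t.sum().backward()
print(c_prev.grad)                          # tensor([1., 1., 1.]): the gradient passes through unchanged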

This architectural shift enabled RNNs to finally learn long-range dependencies, unlocking applications in machine translation, speech recognition, and text generation that had been impossible with vanilla RNNs. The next chapter explores the LSTM architecture in detail.

Summary

The vanishing gradient problem is a fundamental limitation of vanilla RNNs that prevents them from learning long-range dependencies. Here are the key takeaways:

  • Gradient chain multiplication: Backpropagation through time computes gradients as products of Jacobian matrices, one for each timestep. When these Jacobians have spectral radius less than 1, gradients shrink exponentially.

  • Tanh makes it worse: The tanh activation function has derivative bounded by 1, with typical values around 0.5. This compounds the decay at each timestep.

  • Quantitative decay: With typical RNN parameters, gradients decay by factors of 10^3 to 10^7 over just 20-30 timesteps, making early inputs effectively invisible to the learning process.

  • Asymmetric problem: While exploding gradients can be addressed with gradient clipping, vanishing gradients cannot be recovered because the directional information is lost.

  • Architectural solution needed: No amount of hyperparameter tuning can fix the vanishing gradient problem. It requires architectural changes that create alternative gradient pathways, leading to gated architectures like LSTM and GRU.

Understanding the vanishing gradient problem is essential for appreciating why modern sequence models are designed the way they are. The LSTM architecture, which we explore next, directly addresses these limitations by introducing gates that control information flow and a cell state that acts as a gradient highway.

Key Parameters

When working with vanilla RNNs and diagnosing gradient flow issues, several parameters significantly impact whether gradients vanish or explode (a short diagnostic sketch follows this list):

  • hidden_size: The dimensionality of the hidden state vector. Larger hidden sizes increase model capacity but don't directly solve vanishing gradients. The gradient behavior depends more on the weight matrix eigenvalues than the hidden dimension.

  • Weight initialization scale: Controls the spectral radius of \mathbf{W}_{hh}. Smaller scales (e.g., gain=0.5 in Xavier initialization) help prevent exploding gradients but accelerate vanishing. Orthogonal initialization can maintain gradient magnitude initially but loses effectiveness as weights change during training.

  • nonlinearity: The activation function choice affects gradient flow. Tanh derivatives are bounded by 1 and typically average around 0.5, compounding decay. ReLU has derivative 1 for positive inputs but causes unbounded hidden state growth. Neither solves the fundamental problem in vanilla RNNs.

  • Gradient clipping threshold (max_norm): When using torch.nn.utils.clip_grad_norm_, this parameter caps the gradient norm to prevent exploding gradients. Common values range from 1.0 to 5.0. Lower thresholds provide more stability but may slow learning; higher thresholds allow faster updates but risk instability.

  • Sequence length: Longer sequences exacerbate vanishing gradients because gradients must travel through more multiplicative steps. For vanilla RNNs, effective gradient flow typically degrades beyond 10-20 timesteps. Tasks requiring longer dependencies need gated architectures like LSTM or GRU.
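
As a practical diagnostic, one simple habit is to log per-parameter gradient norms right after each backward pass: vanishing shows up as norms that collapse toward zero for the recurrent weights, while explosion shows up as sudden spikes. A minimal sketch, assuming a PyTorch model whose loss has already been backpropagated:

def log_gradient_norms(model):
    """Print the gradient norm of every parameter (call after loss.backward())."""
    for name, param in model.named_parameters():
        if param.grad is not None:
            print(f"{name}: grad norm = {param.grad.norm().item():.3e}")

# Typical placement inside a training loop:
#   loss.backward()
#   log_gradient_norms(model)
#   torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
#   optimizer.step()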

