Master BPTT for training recurrent neural networks. Learn unrolling, gradient accumulation, truncated BPTT, and understand the vanishing gradient problem.

This article is part of the free-to-read Language AI Handbook
Backpropagation Through Time
In the previous chapter, we introduced RNN architecture: how recurrent connections create a hidden state that flows across timesteps, enabling the network to process sequences of arbitrary length. We saw the elegant parameter sharing that makes RNNs practical. But we left a critical question unanswered: how do we actually train these networks?
Standard backpropagation, which you learned in the neural network foundations, assumes a feedforward structure where information flows in one direction. RNNs break this assumption. The same weights are used at every timestep, and the hidden state at time $t$ depends on the hidden state at time $t-1$, which depends on the hidden state at time $t-2$, and so on. This creates a temporal dependency chain that standard backpropagation cannot directly handle.
Backpropagation Through Time (BPTT) solves this problem by conceptually "unrolling" the RNN across time, transforming the recurrent network into a deep feedforward network with shared weights. Once unrolled, we can apply the chain rule just as we did for feedforward networks, but now the gradients flow backward through time as well as through layers.
This chapter derives BPTT from first principles, traces gradient flow through the unrolled computation graph, and implements the algorithm from scratch. You'll see exactly how gradients accumulate across timesteps and why this creates both computational challenges and the infamous vanishing gradient problem that we'll tackle in the next chapter.
From Feedforward to Recurrent: The Training Challenge
Let's first recall how we train feedforward networks, then see why RNNs require a different approach.
In a feedforward network, each layer transforms its input exactly once. The loss depends on the final output, and backpropagation traces gradients backward through each layer. Each weight appears in exactly one place in the computation graph, so its gradient is computed once.
RNNs are fundamentally different. Consider a sequence of length $T$. At each timestep $t$, the RNN computes:

$$h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h)$$
where:
- $h_t$: the hidden state at timestep $t$
- $h_{t-1}$: the hidden state from the previous timestep
- $x_t$: the input at timestep $t$
- $W_{hh}$: the hidden-to-hidden weight matrix (recurrent weights)
- $W_{xh}$: the input-to-hidden weight matrix
- $b_h$: the bias vector
The key observation is that $W_{hh}$, $W_{xh}$, and $b_h$ are the same parameters at every timestep. This parameter sharing is what gives RNNs their power, but it also means each parameter affects the loss through multiple computational paths, one for each timestep.
Unlike feedforward networks where each weight appears once in the computation graph, RNN weights are reused at every timestep. When computing gradients, we must account for how each weight influences the loss through all timesteps, not just one.
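To make the parameter sharing concrete, here is a minimal NumPy sketch of the forward recurrence. The sizes and random stand-in inputs are illustrative; the point is that the same three parameters are applied at every timestep.

```python
import numpy as np

# One RNN cell applied repeatedly: the same W_hh, W_xh, b_h at every timestep.
rng = np.random.default_rng(0)
H, D, T = 4, 3, 5                            # hidden size, input size, sequence length
W_hh = rng.normal(0, 0.1, (H, H))
W_xh = rng.normal(0, 0.1, (H, D))
b_h = np.zeros((H, 1))

h = np.zeros((H, 1))                         # initial hidden state h_0
for t in range(T):
    x_t = rng.normal(size=(D, 1))            # stand-in input at timestep t
    h = np.tanh(W_hh @ h + W_xh @ x_t + b_h) # h_t depends on h_{t-1} and x_t
```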
Unrolling the RNN
The conceptual trick that makes BPTT work is unrolling: we create a copy of the RNN for each timestep, treating the recurrent network as a very deep feedforward network where each "layer" corresponds to a timestep.
Consider a sequence of length 4. The unrolled computation graph looks like this:
In the unrolled view, we see that:
- Each timestep has its own copy of the RNN cell, but they all share the same weights
- The hidden state flows horizontally, carrying information from past to future
- Inputs enter from below at each timestep
- The network is now a deep feedforward network with one "layer" per timestep
This unrolling is purely conceptual. We don't actually create copies of the weights in memory. But thinking about the RNN this way allows us to apply the chain rule systematically.
The BPTT Derivation
With the unrolled view in mind, we can now derive the BPTT algorithm systematically. The key challenge is this: how do we compute the gradient of the loss with respect to weights that are used repeatedly across time? The answer comes from carefully applying the chain rule while tracking how each weight influences the loss through multiple paths.
We'll build up the derivation in three stages, starting with the simplest case and progressively tackling more complex dependencies:
- Output weights (straightforward: no temporal dependencies)
- Hidden state gradients (the recursive heart of BPTT)
- Recurrent weights (combining recursion with gradient accumulation)
Throughout, we'll consider a sequence-to-sequence task where we compute a loss at each timestep and sum them. The total loss over the entire sequence is:

$$L = \sum_{t=1}^{T} L_t$$
where:
- $L$: the total loss over all timesteps, which we want to minimize
- $T$: the sequence length (total number of timesteps)
- $L_t$: the loss at timestep $t$, typically computed from the hidden state $h_t$ through an output layer (e.g., cross-entropy for classification)
Gradient with Respect to Output Weights
We begin with the simplest case: the output weights $W_{hy}$ that transform hidden states into predictions. This serves as a warm-up because these weights don't create temporal dependencies. Each use of $W_{hy}$ at timestep $t$ affects only the loss at that timestep, not any future losses.
If we have an output layer that produces predictions from the hidden state:

$$y_t = W_{hy} h_t + b_y$$
where:
- $y_t$: the output (logits) at timestep $t$
- $W_{hy}$: the hidden-to-output weight matrix
- $h_t$: the hidden state at timestep $t$
- $b_y$: the output bias vector
The gradient with respect to $W_{hy}$ is straightforward because $W_{hy}$ only affects the loss through the outputs at each timestep. Applying the chain rule:

$$\frac{\partial L}{\partial W_{hy}} = \sum_{t=1}^{T} \frac{\partial L_t}{\partial y_t}\, h_t^\top$$
where:
- $\frac{\partial L}{\partial W_{hy}}$: the total gradient of the loss with respect to the output weights
- $\frac{\partial L_t}{\partial y_t}$: the gradient of the loss at timestep $t$ with respect to the output, which depends on your loss function (e.g., softmax cross-entropy)
- $h_t^\top$: the transpose of the hidden state, needed for the outer product that produces a matrix gradient
This is just standard backpropagation applied independently at each timestep, then summed. The output weights are straightforward because they don't create temporal dependencies: the output at timestep 3 doesn't affect the hidden state at timestep 4.
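A minimal NumPy sketch of this accumulation, using random stand-in values for the hidden states and the per-timestep output gradients (shapes and data here are illustrative, not from the chapter's model):

```python
import numpy as np

rng = np.random.default_rng(0)
T, H, V = 5, 4, 6                        # timesteps, hidden size, output (vocab) size
hs = [rng.normal(size=(H, 1)) for _ in range(T)]   # stand-in hidden states h_t
dy = [rng.normal(size=(V, 1)) for _ in range(T)]   # stand-in dL_t/dy_t values

# The W_hy gradient is a per-timestep outer product, summed over all timesteps.
dW_hy = np.zeros((V, H))
db_y = np.zeros((V, 1))
for t in range(T):
    dW_hy += dy[t] @ hs[t].T             # (V,1) @ (1,H) outer product
    db_y += dy[t]
```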
Gradient with Respect to Hidden State
Now we tackle the heart of BPTT: computing gradients for the hidden states. This is where temporal dependencies make things interesting, and where the "through time" in BPTT becomes essential.
Consider the hidden state $h_t$ at some timestep $t$. In a feedforward network, each activation affects the loss through exactly one path. But in an RNN, $h_t$ influences the loss through two distinct channels:
- Direct effect: $h_t$ produces output $y_t$, which contributes to loss $L_t$
- Indirect effect: $h_t$ flows into $h_{t+1}$, which affects $L_{t+1}$, and then into $h_{t+2}$, affecting $L_{t+2}$, and so on through all future timesteps
This dual influence is the fundamental difference from feedforward networks. The total gradient must account for both channels, which leads us to a recursive formulation.
For the final timestep $T$, the gradient is simple because there are no future timesteps to consider:

$$\frac{\partial L}{\partial h_T} = \frac{\partial L_T}{\partial h_T}$$
For earlier timesteps, we must account for both the direct contribution to $L_t$ and the indirect contribution through future hidden states. The multivariate chain rule gives us:

$$\frac{\partial L}{\partial h_t} = \frac{\partial L_t}{\partial h_t} + \left(\frac{\partial h_{t+1}}{\partial h_t}\right)^{\!\top} \frac{\partial L}{\partial h_{t+1}}$$
where:
- $\frac{\partial L}{\partial h_t}$: the total gradient of the loss with respect to hidden state $h_t$, accounting for all paths to the loss
- $\frac{\partial L_t}{\partial h_t}$: the direct gradient from the loss at timestep $t$ (through the output layer)
- $\frac{\partial L}{\partial h_{t+1}}$: the total gradient at the next timestep (already computed when working backward)
- $\frac{\partial h_{t+1}}{\partial h_t}$: how much $h_{t+1}$ changes when we perturb $h_t$ (the Jacobian of the RNN transition)
This recursive formula is the heart of BPTT. To make the recursion explicit, let's define $\delta_t = \frac{\partial L}{\partial h_t}$ as the "error signal" at timestep $t$. This gives us:

$$\delta_t = \frac{\partial L_t}{\partial h_t} + \left(\frac{\partial h_{t+1}}{\partial h_t}\right)^{\!\top} \delta_{t+1}$$
The term $\frac{\partial h_{t+1}}{\partial h_t}$ is the Jacobian of the hidden state transition. To derive it, recall that $h_{t+1} = \tanh(W_{hh} h_t + W_{xh} x_{t+1} + b_h)$. Applying the chain rule:

$$\frac{\partial h_{t+1}}{\partial h_t} = \operatorname{diag}\!\left(1 - h_{t+1}^2\right) W_{hh}$$
where:
- $\operatorname{diag}(1 - h_{t+1}^2)$: a diagonal matrix of size $H \times H$, where the $i$-th diagonal entry is $1 - h_{t+1,i}^2$, the derivative of tanh at the $i$-th hidden unit
- $W_{hh}$: the recurrent weight matrix, which determines how the previous hidden state influences the current one
The tanh derivative ranges from 0 (when $|z|$ is large, the saturation regions) to 1 (when $z = 0$). This bounded derivative will become critically important when we discuss vanishing gradients: if the hidden units saturate (approach $\pm 1$), the gradient signal gets attenuated at each timestep.
The plot reveals why tanh can cause gradient problems. In the green region (|z| < 1), the derivative is close to 1, allowing gradients to flow relatively unimpeded. But in the red saturation regions (|z| > 1), the derivative drops rapidly toward zero. If hidden units frequently saturate, each timestep multiplies the gradient by a small number, causing exponential decay.
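A small sketch that builds the Jacobian $\operatorname{diag}(1 - h_{t+1}^2)\, W_{hh}$ for a few hypothetical pre-activation scales and reports its largest singular value; as the units saturate, the Jacobian's gain drops, which is exactly the attenuation described above. All values here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
H = 8
W_hh = rng.normal(0, 0.3, (H, H))

for scale in [0.1, 1.0, 3.0]:                 # mild -> strongly saturated units
    z = rng.normal(0, scale, (H, 1))          # hypothetical pre-activations
    h_next = np.tanh(z)
    jac = np.diag(1.0 - h_next.ravel() ** 2) @ W_hh   # dh_{t+1}/dh_t
    gain = np.linalg.svd(jac, compute_uv=False)[0]    # largest singular value
    print(f"pre-activation scale {scale}: Jacobian gain {gain:.3f}")
```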
Gradient with Respect to Recurrent Weights
With the hidden state gradients in hand, we can finally tackle the recurrent weights $W_{hh}$. This is where gradient accumulation becomes essential.
The challenge is that $W_{hh}$ appears at every timestep in the unrolled computation graph. When we change $W_{hh}$, we change the computation at timestep 1, timestep 2, timestep 3, and so on. Each of these changes affects the loss. The total gradient must sum all these contributions.
More formally, since $W_{hh}$ is used at every timestep, its gradient is the sum of contributions from all timesteps:

$$\frac{\partial L}{\partial W_{hh}} = \sum_{t=1}^{T} \frac{\partial L}{\partial h_t}\,\frac{\partial h_t}{\partial W_{hh}} = \sum_{t=1}^{T} \delta_t\,\frac{\partial h_t}{\partial W_{hh}}$$
where:
- $\frac{\partial L}{\partial W_{hh}}$: the total gradient of the loss with respect to the recurrent weights
- $\delta_t$: the error signal at timestep $t$ (computed in the backward pass)
- $\frac{\partial h_t}{\partial W_{hh}}$: how the hidden state at timestep $t$ changes with respect to the weights (the local gradient)
At each timestep $t$, the local gradient comes from differentiating $h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h)$ with respect to $W_{hh}$. Entry by entry:

$$\frac{\partial h_{t,i}}{\partial (W_{hh})_{ij}} = \left(1 - h_{t,i}^2\right) h_{t-1,j}$$
where $h_{t-1}$ appears because the gradient of a matrix-vector product with respect to $W_{hh}$ involves the vector $h_{t-1}$.
Combining the error signal with the local gradient at each timestep:

$$\frac{\partial L}{\partial W_{hh}} = \sum_{t=1}^{T} \left[\delta_t \odot (1 - h_t^2)\right] h_{t-1}^\top$$
where:
- $\delta_t \odot (1 - h_t^2)$: the error signal multiplied element-wise by the tanh derivative, giving the gradient at the pre-activation
- $h_{t-1}^\top$: the hidden state from the previous timestep, transposed for the outer product
- $\odot$: element-wise (Hadamard) multiplication
The sum over all timesteps reflects the fact that the same weight matrix $W_{hh}$ is used at every step. Each use contributes to the total gradient.
The key insight of BPTT is gradient accumulation: each weight's gradient is the sum of its contributions from all timesteps. This is why we must store all hidden states during the forward pass, as we need them when computing the backward pass.
Visualizing Gradient Flow
To understand BPTT intuitively, let's visualize how gradients flow backward through the unrolled network.
The diagram reveals a crucial insight: the gradient at early timesteps accumulates contributions from all future losses. The gradient at the first hidden state includes:
- The direct gradient from $L_1$
- The gradient from $L_2$ flowing back through $h_2$
- The gradient from $L_3$ flowing back through $h_3$ and $h_2$
- The gradient from $L_4$ flowing back through $h_4$, $h_3$, and $h_2$
This accumulation is what allows RNNs to learn long-range dependencies, but it's also the source of the vanishing gradient problem we'll explore in the next chapter.
A Worked Example
The formulas above can feel abstract until you trace through them with concrete numbers. Let's work through BPTT step by step on a tiny example, computing every intermediate value by hand. This will solidify your understanding of how gradients flow backward through time and accumulate across timesteps.
We'll use a sequence of length 3 with scalar hidden states (dimension 1) to keep the arithmetic tractable. In practice, hidden states are vectors with hundreds or thousands of dimensions, but the same principles apply.
Setup:
- Sequence length: $T = 3$
- Hidden dimension: 1 (scalar for simplicity)
- Input dimension:
- Inputs: , ,
- Initial hidden state: $h_0 = 0$
- Weights: , ,
- Target outputs: , ,
- Loss: Mean squared error at each timestep
Forward Pass:
Backward Pass:
Now we compute gradients flowing backward from $t = 3$ to $t = 1$:
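Since the hand-computed values aren't reproduced here, the following scalar sketch carries out the same forward and backward computation with illustrative stand-in numbers ($T = 3$, hidden dimension 1, $h_0 = 0$, a squared-error loss with a 1/2 factor). The weights, inputs, and targets below are assumptions, not the chapter's original values.

```python
import numpy as np

# Illustrative scalar setup (stand-in values, not the chapter's numbers).
W_hh, W_xh, W_hy, b_h, b_y = 0.5, 1.0, 1.0, 0.0, 0.0
xs, targets = [1.0, 0.5, -0.5], [0.5, 0.2, -0.3]

# Forward pass: h_t = tanh(W_hh h_{t-1} + W_xh x_t), y_t = W_hy h_t, L_t = (y_t - target)^2 / 2
hs, ys = [0.0], []                          # hs[0] is h_0
for x in xs:
    h = np.tanh(W_hh * hs[-1] + W_xh * x + b_h)
    hs.append(h)
    ys.append(W_hy * h + b_y)

# Backward pass: delta_t = dL_t/dh_t + W_hh * tanh'(z_{t+1}) * delta_{t+1}
dW_hh, dh_next = 0.0, 0.0
for t in reversed(range(3)):
    dy = ys[t] - targets[t]                 # dL_t/dy_t for squared error
    dh = W_hy * dy + dh_next                # total gradient at h_t (direct + future)
    dz = dh * (1.0 - hs[t + 1] ** 2)        # back through tanh
    dW_hh += dz * hs[t]                     # accumulate; zero at t=1 since h_0 = 0
    dh_next = W_hh * dz                     # pass gradient to the previous timestep
    print(f"t={t + 1}: dh={dh:+.4f}, dW_hh contribution={dz * hs[t]:+.4f}")
print(f"total dL/dW_hh = {dW_hh:+.4f}")
```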
This worked example illustrates the two key aspects of BPTT:
- Recursive gradient computation: Notice how $\delta_1$ at the first timestep includes contributions from all three losses. The gradient from $L_3$ flows backward through $h_3$ and $h_2$ before reaching $h_1$, getting multiplied by the tanh derivative and $W_{hh}$ at each step. This is the "through time" in Backpropagation Through Time.
- Gradient accumulation for shared weights: The gradient for $W_{hh}$ is the sum of contributions from all timesteps. At timestep 1, the contribution is zero because $h_0 = 0$, but timesteps 2 and 3 both contribute. In a longer sequence, this sum would have many more terms, all computed in a single backward pass.
The small numbers in this example hint at a deeper issue: the gradient contributions shrink as we go further back in time. The tanh derivative (around 0.78-0.92 in our example) multiplied by $W_{hh}$ gives a factor less than 1 at each step. Over many timesteps, this shrinking compounds exponentially, leading to the vanishing gradient problem we'll explore in the next chapter.
Let's visualize how gradient magnitude decays as it flows backward through time:
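A sketch of such a plot, assuming a few illustrative per-step decay factors (these are not the exact values from the worked example):

```python
import numpy as np
import matplotlib.pyplot as plt

# |gradient| after k backward steps is roughly gamma^k, where gamma is the
# per-step factor (tanh derivative times W_hh). Gamma values are illustrative.
steps = np.arange(0, 21)
for gamma in [0.5, 0.7, 0.9, 1.0]:
    plt.semilogy(steps, gamma ** steps, label=f"decay factor {gamma}")
plt.axhline(0.01, color="gray", linestyle="--", label="1% of original magnitude")
plt.xlabel("timesteps backward")
plt.ylabel("relative gradient magnitude (log scale)")
plt.title("Exponential gradient decay under repeated multiplication")
plt.legend()
plt.show()
```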
The logarithmic scale reveals the exponential nature of gradient decay. With the per-step factor from our worked example, gradients fall below 1% of their original magnitude after just 8 timesteps. Even with a larger decay factor, gradients still decay significantly over 20 steps. Only when the effective decay factor approaches 1.0 do gradients maintain their magnitude, but this requires careful initialization and is difficult to maintain during training.
Implementing BPTT from Scratch
Having derived BPTT mathematically and traced through a worked example, we're ready to implement the algorithm in code. This implementation will handle the general case: vector-valued hidden states, multiple weight matrices, and cross-entropy loss for classification.
We'll build a simple character-level language model, a classic application of RNNs that predicts the next character given the previous characters. This task is simple enough to train quickly but complex enough to demonstrate all aspects of BPTT.
RNN Forward Pass
The forward pass computes hidden states and outputs while storing everything needed for backpropagation. The key insight is that we must save all intermediate values: the hidden states at every timestep (for computing weight gradients) and the pre-activation values (for computing tanh derivatives).
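A possible NumPy implementation of this forward pass; the function and parameter names here (such as `rnn_forward` and `init_params`) are my own choices for this sketch, not a fixed API.

```python
import numpy as np

def init_params(vocab_size, hidden_size, seed=0):
    """Small random weights; the 0.01 scale keeps tanh out of saturation early on."""
    rng = np.random.default_rng(seed)
    return {
        "W_xh": rng.normal(0, 0.01, (hidden_size, vocab_size)),   # input-to-hidden
        "W_hh": rng.normal(0, 0.01, (hidden_size, hidden_size)),  # hidden-to-hidden
        "W_hy": rng.normal(0, 0.01, (vocab_size, hidden_size)),   # hidden-to-output
        "b_h":  np.zeros((hidden_size, 1)),
        "b_y":  np.zeros((vocab_size, 1)),
    }

def rnn_forward(inputs, h0, params):
    """Run the RNN over a list of one-hot input column vectors.

    Returns the logits at each timestep plus a cache of everything the backward
    pass needs: the inputs, all hidden states (including h0), and pre-activations.
    """
    hs, zs, ys = {-1: h0}, {}, {}
    for t, x_t in enumerate(inputs):
        z_t = params["W_xh"] @ x_t + params["W_hh"] @ hs[t - 1] + params["b_h"]
        hs[t] = np.tanh(z_t)                             # hidden state h_t
        ys[t] = params["W_hy"] @ hs[t] + params["b_y"]   # output logits y_t
        zs[t] = z_t                                      # pre-activation, for tanh'
    cache = {"inputs": inputs, "hidden_states": hs, "pre_activations": zs}
    return ys, cache
```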
The forward pass stores three critical pieces of information:
- inputs: The input vectors at each timestep
- hidden_states: All hidden states including $h_0$, needed for computing weight gradients
- pre_activations: The values before tanh, needed for computing tanh derivatives
BPTT Backward Pass
The backward pass is where BPTT comes to life. We iterate through timesteps in reverse order, computing the error signal at each step and accumulating gradients for all weight matrices. The code directly implements the formulas we derived earlier:
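A matching backward-pass sketch, continuing the `rnn_forward`/`init_params` naming from above. It assumes `dys[t]` holds the gradient of the loss at timestep $t$ with respect to the logits (for softmax cross-entropy, the predicted probabilities minus the one-hot target).

```python
def rnn_backward(dys, cache, params):
    """BPTT: walk backward through time, accumulating gradients for the shared weights."""
    hs, xs = cache["hidden_states"], cache["inputs"]
    grads = {name: np.zeros_like(value) for name, value in params.items()}
    dh_next = np.zeros_like(hs[-1])              # gradient flowing in from the future

    for t in reversed(range(len(xs))):
        dy = dys[t]
        # Output layer: y_t = W_hy h_t + b_y
        grads["W_hy"] += dy @ hs[t].T
        grads["b_y"] += dy
        # Total gradient at h_t: direct path through y_t plus path from h_{t+1}
        dh = params["W_hy"].T @ dy + dh_next
        # Back through tanh: tanh'(z_t) = 1 - tanh(z_t)^2 = 1 - h_t^2
        dz = dh * (1.0 - hs[t] ** 2)
        # Accumulate gradients for the shared weights
        grads["W_xh"] += dz @ xs[t].T
        grads["W_hh"] += dz @ hs[t - 1].T
        grads["b_h"] += dz
        # Send the gradient to the previous timestep through W_hh
        dh_next = params["W_hh"].T @ dz
    return grads
```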
The backward pass implements the BPTT algorithm:
- Compute output gradients: For each timestep, compute the gradient of the cross-entropy loss with respect to the output logits.
- Backward through time: Starting from the last timestep, propagate gradients backward. At each step:
- Add the gradient from the output layer and the gradient from future timesteps
- Backpropagate through the tanh activation
- Accumulate gradients for all weight matrices
- Pass the gradient to the previous timestep through $W_{hh}^\top$
- Gradient accumulation: Each weight matrix accumulates contributions from all timesteps.
Verifying with Numerical Gradients
Let's verify our implementation by comparing analytical gradients to numerical approximations:
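A gradient-checking sketch using central differences, reusing `init_params`, `rnn_forward`, and `rnn_backward` from the earlier sketches; the loss helper and the tiny random test problem below are illustrative.

```python
import numpy as np  # assumes init_params, rnn_forward, rnn_backward from above

def softmax(y):
    e = np.exp(y - np.max(y))
    return e / e.sum()

def sequence_loss(inputs, targets, h0, params):
    """Total cross-entropy loss over the sequence plus dL_t/dy_t at each timestep."""
    ys, cache = rnn_forward(inputs, h0, params)
    loss, dys = 0.0, {}
    for t, target in enumerate(targets):
        p = softmax(ys[t])
        loss += -float(np.log(p[target, 0] + 1e-12))
        dy = p.copy()
        dy[target, 0] -= 1.0                 # softmax-cross-entropy gradient
        dys[t] = dy
    return loss, dys, cache

def grad_check(inputs, targets, h0, params, name, eps=1e-5):
    """Compare BPTT gradients against central-difference numerical gradients."""
    _, dys, cache = sequence_loss(inputs, targets, h0, params)
    analytic = rnn_backward(dys, cache, params)[name]
    numeric = np.zeros_like(params[name])
    it = np.nditer(params[name], flags=["multi_index"])
    while not it.finished:
        idx, old = it.multi_index, params[name][it.multi_index]
        params[name][idx] = old + eps
        loss_plus, _, _ = sequence_loss(inputs, targets, h0, params)
        params[name][idx] = old - eps
        loss_minus, _, _ = sequence_loss(inputs, targets, h0, params)
        params[name][idx] = old              # restore the original value
        numeric[idx] = (loss_plus - loss_minus) / (2 * eps)
        it.iternext()
    rel_err = np.abs(analytic - numeric) / (np.abs(analytic) + np.abs(numeric) + 1e-12)
    print(f"{name}: max relative error = {rel_err.max():.2e}")

# Tiny random test problem (illustrative sizes).
V, H, T = 5, 8, 6
params = init_params(V, H)
rng = np.random.default_rng(1)
inputs = []
for _ in range(T):
    x = np.zeros((V, 1)); x[rng.integers(V), 0] = 1.0
    inputs.append(x)
targets = list(rng.integers(0, V, size=T))
h0 = np.zeros((H, 1))
for name in ["W_hh", "W_xh", "W_hy", "b_h", "b_y"]:
    grad_check(inputs, targets, h0, params, name)
```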
The analytical and numerical gradients match to within a tiny relative error, well within what floating-point precision allows. This level of agreement confirms our BPTT implementation is mathematically correct. Gradient checking is an essential debugging technique: if the relative error is large (a common rule of thumb is anything above roughly $10^{-4}$), there's likely a bug in the backward pass.
Training on Character Sequences
Let's train our RNN on a simple character prediction task:
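A minimal training-loop sketch on repeated "hello world" text, reusing the functions defined above; the hyperparameters (hidden size 32, chunks of 16 characters, learning rate 0.1, clipping at 5.0) are illustrative choices, not the chapter's exact settings.

```python
# Assumes init_params, sequence_loss, rnn_backward from the earlier sketches.
text = "hello world " * 50
chars = sorted(set(text))
stoi = {c: i for i, c in enumerate(chars)}
vocab_size, hidden_size, seq_len, lr = len(chars), 32, 16, 0.1

params = init_params(vocab_size, hidden_size)

def one_hot(i):
    v = np.zeros((vocab_size, 1))
    v[i, 0] = 1.0
    return v

h = np.zeros((hidden_size, 1))
pos = 0
for step in range(2000):
    if pos + seq_len + 1 >= len(text):
        pos, h = 0, np.zeros((hidden_size, 1))       # wrap around, reset state
    chunk = [stoi[c] for c in text[pos:pos + seq_len + 1]]
    inputs = [one_hot(i) for i in chunk[:-1]]
    targets = chunk[1:]                              # predict the next character

    loss, dys, cache = sequence_loss(inputs, targets, h, params)
    grads = rnn_backward(dys, cache, params)
    for name in params:                              # clipped SGD update
        np.clip(grads[name], -5.0, 5.0, out=grads[name])
        params[name] -= lr * grads[name]

    h = cache["hidden_states"][len(inputs) - 1]      # carry state to the next chunk
    pos += seq_len
    if step % 200 == 0:
        print(f"step {step:4d}  loss {loss:.3f}")
```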
The loss decreases steadily, showing that BPTT successfully trains the RNN. Let's sample from the trained model:
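A sampling sketch that feeds each generated character back in as the next input, reusing `one_hot`, `softmax`, and the trained `params` from the training sketch above.

```python
def sample(params, start_char, stoi, n_chars=40):
    """Sample characters one at a time from the trained model."""
    itos = {i: c for c, i in stoi.items()}
    h = np.zeros((params["W_hh"].shape[0], 1))
    x = one_hot(stoi[start_char])
    out = [start_char]
    for _ in range(n_chars):
        h = np.tanh(params["W_xh"] @ x + params["W_hh"] @ h + params["b_h"])
        p = softmax(params["W_hy"] @ h + params["b_y"]).ravel()
        i = int(np.random.choice(len(p), p=p))       # sample from the distribution
        out.append(itos[i])
        x = one_hot(i)
    return "".join(out)

print(sample(params, "h", stoi))
```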
The generated text shows the RNN has learned the basic structure of the training data. While not perfect (the simple model and limited training produce some errors), the output demonstrates that BPTT successfully trained the network to capture character-level patterns. The model learned common bigrams like "he", "ll", "wo", and "ld" from the repeating "hello world" sequence.
Truncated BPTT
Full BPTT requires storing all hidden states for the entire sequence, which becomes prohibitive for long sequences. For a sequence of length $T$ with hidden dimension $H$, we need $O(TH)$ memory just for hidden states.
Truncated Backpropagation Through Time is a practical approximation to full BPTT that limits gradient flow to a fixed number of timesteps $k$. This reduces memory requirements from $O(TH)$ to $O(kH)$, at the cost of not learning dependencies longer than $k$ steps.
Truncated BPTT addresses this by limiting how far back gradients flow. Instead of backpropagating through the entire sequence, we:
- Divide the sequence into chunks of length $k$
- Run forward through the entire sequence, but only backpropagate through the last $k$ timesteps
- Carry the hidden state forward between chunks, but not the gradients
The trade-off is clear: truncated BPTT uses less memory but cannot learn dependencies longer than $k$ timesteps. In practice, $k$ is often set to 20-50 timesteps, which works well for many tasks but fundamentally limits what the RNN can learn.
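One common way to implement the chunked variant described above, sketched with the helpers from the earlier code (`one_hot`, `sequence_loss`, `rnn_backward`); the hidden state is carried across chunks as a plain value, so no gradient flows back past a chunk boundary.

```python
def train_truncated(text_ids, params, hidden_size, k=20, lr=0.1):
    """Truncated BPTT: backpropagate within chunks of k timesteps only."""
    h = np.zeros((hidden_size, 1))
    for start in range(0, len(text_ids) - k - 1, k):
        chunk = text_ids[start:start + k + 1]
        inputs = [one_hot(i) for i in chunk[:-1]]
        targets = chunk[1:]
        # Forward + backward only through this chunk of k timesteps.
        loss, dys, cache = sequence_loss(inputs, targets, h, params)
        grads = rnn_backward(dys, cache, params)
        for name in params:
            np.clip(grads[name], -5.0, 5.0, out=grads[name])
            params[name] -= lr * grads[name]
        # Carry the final hidden state forward, but treat it as a constant:
        # no gradient will flow from this chunk back into the previous one.
        h = cache["hidden_states"][len(inputs) - 1]
    return params

# Example usage (reusing text/stoi from the training sketch):
# params = train_truncated([stoi[c] for c in text], params, hidden_size, k=20)
```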
Memory Requirements
Understanding BPTT's memory requirements is crucial for practical deep learning. Let's analyze what we need to store during training.
Forward Pass Storage:
- Hidden states: $T \times H$ values (one hidden state vector per timestep)
- Pre-activation values: $T \times H$ values (needed to compute tanh derivatives during backprop)
- Inputs: $T \times D$ values, where $D$ is the input dimension (already stored as part of the input data)
Backward Pass Computation:
- Gradient for each weight matrix: same size as the weight (e.g., $H \times H$ for $W_{hh}$)
- Temporary gradient signals: $H$ values for $\delta_t$ (reused at each timestep)
The dominant cost is storing hidden states: $O(TH)$ memory. For a sequence of length 1000 with hidden dimension 512, that's roughly half a million floating-point numbers, or about 2 MB in float32 (double that once pre-activations are stored as well). With batch processing (multiple sequences simultaneously), multiply by the batch size.
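A quick back-of-the-envelope helper for these numbers, assuming float32 and counting hidden states plus pre-activations:

```python
def bptt_activation_mb(seq_len, hidden_size, batch_size=1, stores=2):
    """Approximate activation memory in MB; stores=2 counts hidden states + pre-activations."""
    floats = seq_len * hidden_size * batch_size * stores
    return floats * 4 / 1e6                      # 4 bytes per float32

print(bptt_activation_mb(1000, 512))             # single sequence: a few MB
print(bptt_activation_mb(1000, 1024, 64))        # ~500 MB configuration
```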
The table and plots show how memory requirements scale dramatically with sequence length, hidden size, and batch size. A modest configuration (1000 timesteps, 1024 hidden units, batch size 64) already requires over 500 MB just for activation storage, not including model parameters or optimizer state. For very long sequences, memory becomes the bottleneck. This is one motivation for truncated BPTT and later architectural innovations like transformers that can process sequences in parallel.
Limitations and Impact
BPTT is a foundational algorithm that made training recurrent neural networks practical, but it comes with significant challenges that shaped the evolution of sequence modeling.
The Vanishing Gradient Problem
The most critical limitation of BPTT is the vanishing gradient problem. When gradients flow backward through many timesteps, they pass through repeated multiplications by $W_{hh}$ and the tanh derivative $\operatorname{diag}(1 - h_t^2)$. If these multiplications consistently shrink the gradient (eigenvalues less than 1), the gradient signal decays exponentially.
To see this mathematically, consider how the gradient at $h_1$ depends on the loss at the final timestep $T$. Unrolling the recursive formula, we get a product of Jacobians:

$$\frac{\partial L_T}{\partial h_1} = \frac{\partial L_T}{\partial h_T} \prod_{t=2}^{T} \frac{\partial h_t}{\partial h_{t-1}} = \frac{\partial L_T}{\partial h_T} \prod_{t=2}^{T} \operatorname{diag}\!\left(1 - h_t^2\right) W_{hh}$$
where:
- $\frac{\partial L_T}{\partial h_1}$: how much the loss at the final timestep changes when we perturb the first hidden state
- $\frac{\partial L_T}{\partial h_T}$: the direct gradient at the final timestep
- $\prod_{t=2}^{T} \frac{\partial h_t}{\partial h_{t-1}}$: the product of Jacobian matrices, each of the form $\operatorname{diag}(1 - h_t^2)\, W_{hh}$
Each Jacobian has magnitude bounded by the product of the largest tanh derivative (at most 1) and the spectral norm of $W_{hh}$. If this product is less than 1, the gradient shrinks at each step. After $T$ steps, the gradient is multiplied by a factor that decays exponentially with sequence length.
For long sequences, this product approaches zero, meaning early timesteps receive essentially no learning signal from late losses. The network cannot learn long-range dependencies because the gradient information is lost before it reaches the relevant parameters.
This isn't just a theoretical concern. In practice, vanilla RNNs trained with BPTT struggle with dependencies beyond 10-20 timesteps. A language model might fail to connect a pronoun with its antecedent if they're separated by many words. A time series model might miss patterns that span long intervals.
The heatmap makes the vanishing gradient problem visually stark. Each row represents a loss at a specific timestep; each column represents a hidden state that might need to be updated. The diagonal is bright (recent losses strongly affect recent hidden states), but the lower-left is dark (late losses barely affect early hidden states). If the correct prediction at timestep 50 depends on information from timestep 5, the gradient signal has been attenuated through 45 Jacobian multiplications, leaving it essentially zero for practical purposes.
Exploding Gradients
The opposite problem, exploding gradients, occurs when the repeated multiplications amplify rather than shrink the gradient. The gradient grows exponentially, leading to numerical overflow and unstable training. Gradient clipping (limiting gradient magnitudes) provides a practical workaround, but it doesn't solve the underlying issue of gradient instability.
Computational Cost
BPTT is inherently sequential. We must compute the forward pass through all timesteps before starting the backward pass, and the backward pass must proceed in reverse order. This sequential dependency prevents parallelization across timesteps, making RNN training slow compared to feedforward networks of similar depth.
What BPTT Unlocked
Despite these limitations, BPTT was transformative. Before BPTT, training recurrent networks was largely impractical. BPTT made it possible to train networks that could:
- Model sequential dependencies in text, speech, and time series
- Learn from variable-length inputs without fixed-size windows
- Capture patterns that span multiple timesteps
The limitations of vanilla BPTT directly motivated architectural innovations. LSTMs and GRUs, which we'll cover in upcoming chapters, were designed specifically to address the vanishing gradient problem by creating "highways" for gradient flow. More recently, transformers abandoned recurrence entirely, using attention mechanisms that allow direct connections between any two positions in a sequence.
Understanding BPTT deeply, including its failure modes, is essential for appreciating why these newer architectures were developed and how they solve the problems that BPTT exposed.
Summary
This chapter derived and implemented Backpropagation Through Time, the algorithm that makes training recurrent neural networks possible.
The key concepts are:
- Unrolling: Conceptually unfold the RNN across time, treating it as a deep feedforward network with shared weights. This allows us to apply the chain rule systematically.
- Recursive gradient computation: The gradient at each hidden state combines the local loss gradient with the gradient flowing back from future timesteps: $\delta_t = \frac{\partial L_t}{\partial h_t} + \left(\frac{\partial h_{t+1}}{\partial h_t}\right)^{\!\top} \delta_{t+1}$.
- Gradient accumulation: Since weights are shared across timesteps, we sum gradient contributions from all timesteps. This is the defining characteristic of BPTT compared to standard backpropagation.
- Memory-computation trade-off: Full BPTT requires storing all hidden states, with memory scaling as $O(TH)$. Truncated BPTT reduces memory by limiting how far back gradients flow, at the cost of not learning very long-range dependencies.
- Gradient instability: The repeated multiplication of gradients through time leads to vanishing or exploding gradients, fundamentally limiting what vanilla RNNs can learn.
In the next chapter, we'll dive deeper into the vanishing gradient problem: why it occurs mathematically, how to visualize it, and why it motivated the development of gated architectures like LSTMs and GRUs.
Key Parameters
When implementing BPTT, several parameters significantly affect training behavior and memory usage:
- hidden_size: The dimension of the hidden state vector. Larger values increase model capacity but also increase memory requirements quadratically (due to $W_{hh}$ being $H \times H$). Typical values range from 128 to 1024.
- learning_rate: Controls the step size during gradient descent. RNNs are often sensitive to learning rate; values between 0.001 and 0.1 are common starting points. Too high causes instability; too low causes slow convergence.
- gradient_clip_value: The maximum allowed gradient magnitude. Clipping gradients to values like 1.0 or 5.0 prevents exploding gradients from destabilizing training, though it doesn't address vanishing gradients.
- truncation_length (k): For truncated BPTT, the number of timesteps to backpropagate through. Larger values capture longer dependencies but require more memory. Values of 20-50 are common; the optimal choice depends on the temporal structure of your data.
- weight_scale: The standard deviation for weight initialization. Small values (0.01-0.1) prevent initial activations from saturating, which would immediately cause vanishing gradients.