Master backpropagation from computational graphs to gradient flow. Learn the chain rule, implement forward/backward passes, and understand automatic differentiation.

Backpropagation
How does a neural network learn? In the previous chapters, we built multilayer perceptrons that transform inputs through layers of weights and activations, and we defined loss functions that measure prediction quality. But we haven't answered the fundamental question: how do we update the millions of parameters in a neural network to reduce the loss?
The answer is backpropagation, short for "backward propagation of errors." This algorithm computes the gradient of the loss function with respect to every weight in the network, telling us exactly how to adjust each parameter to improve predictions. Backpropagation is arguably the most important algorithm in deep learning. Without it, training neural networks would be computationally intractable.
This chapter builds backpropagation from the ground up. We start with computational graphs that visualize how operations compose, then review the chain rule that makes gradient computation tractable. You'll trace through forward and backward passes step by step, understand gradient accumulation, and implement backpropagation from scratch. By the end, you'll see exactly how modern deep learning frameworks like PyTorch compute gradients automatically.
The Gradient Problem
Neural networks have millions of parameters. GPT-3 has 175 billion. How do we figure out which direction to nudge each one to reduce the loss?
Calculus gives us the answer: the gradient. For a function $L(\theta)$, where $\theta$ is a vector of parameters, the gradient $\nabla_\theta L$ is a vector that points in the direction of steepest increase of $L$. Each component of the gradient tells us how sensitive the function is to changes in the corresponding parameter. To minimize the loss, we move in the opposite direction, the direction of steepest decrease:

$$\theta_{\text{new}} = \theta_{\text{old}} - \eta \nabla_\theta L$$
where:
- $\theta_{\text{new}}$: the updated parameter vector after one step
- $\theta_{\text{old}}$: the current parameter vector before the update
- $\eta$: the learning rate, a small positive scalar (typically 0.001 to 0.1) that controls how large each step is
- $\nabla_\theta L$: the gradient of the loss with respect to the parameters, a vector with the same shape as $\theta$, where each entry indicates how much the loss would increase if we increased the corresponding parameter by a tiny amount
- $L$: the loss function, a scalar value measuring prediction error
The subtraction makes intuitive sense: if $\frac{\partial L}{\partial \theta_i} > 0$, increasing $\theta_i$ would increase the loss, so we decrease it instead. This is gradient descent: simple in concept, but the challenge is computing $\nabla_\theta L$ when the loss depends on parameters through many nested operations.
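As a tiny illustration, here is one gradient descent step in NumPy. The parameter and gradient values are made up for the example; producing the gradient is the job of backpropagation, which the rest of this chapter builds up to.

```python
import numpy as np

# One gradient descent step. The gradient values here are illustrative;
# computing them for a real network is what backpropagation does.
theta = np.array([0.5, -1.2, 3.0])   # current parameters (theta_old)
grad = np.array([0.1, -0.4, 0.2])    # gradient dL/dtheta
lr = 0.01                            # learning rate (eta)

theta = theta - lr * grad            # step in the direction of steepest decrease
print(theta)                         # [ 0.499 -1.196  2.998]
```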
Consider a 3-layer network. The loss depends on the output, which depends on the third layer's activations, which depend on the second layer's activations, which depend on the first layer's activations, which depend on the input and the first layer's weights. This chain of dependencies seems hopelessly complex to differentiate.
Backpropagation is an algorithm for efficiently computing gradients in neural networks by applying the chain rule systematically from output to input. It computes the gradient of the loss with respect to every parameter in a single backward pass through the network.
Backpropagation solves this problem elegantly. By applying the chain rule in reverse order, from output back to input, we can compute all gradients in time proportional to a single forward pass. This efficiency is what makes training deep networks practical.
Computational Graphs
Before diving into backpropagation, we need a way to visualize how computations flow through a network. A computational graph represents a mathematical expression as a directed graph where nodes are operations and edges show data flow.
Building Blocks of Computation
Every neural network computation breaks down into elementary operations: addition, multiplication, matrix products, and activation functions. A computational graph makes these explicit. Consider the simple expression:

$$f(x, y, z) = (x + y) \cdot z$$

This decomposes into two operations:
- $q = x + y$ (addition)
- $f = q \cdot z$ (multiplication)
where:
- $x, y, z$: the input variables to the function
- $q$: an intermediate value that stores the result of the first operation
- $f$: the final output of the computation

The graph shows data flowing from inputs (x, y, z) through operations (+, ×) to the output (f). The intermediate value $q$ is stored at the addition node. This decomposition is the key insight: complex expressions are just chains of simple operations.
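A few lines of Python make this concrete. The sketch below runs the forward pass for $f = (x + y) \cdot z$ and then computes every gradient by hand using the local derivatives of the two operations; the input values are arbitrary illustrations.

```python
# Forward and backward pass through the graph for f = (x + y) * z
x, y, z = -2.0, 5.0, -4.0      # illustrative inputs

# Forward pass: compute and store the intermediate value q
q = x + y                      # q = 3.0 (addition node)
f = q * z                      # f = -12.0 (multiplication node)

# Backward pass: each node's local gradient, combined by the chain rule
df_dq = z                      # d(q*z)/dq = z = -4.0
df_dz = q                      # d(q*z)/dz = q = 3.0
df_dx = df_dq * 1.0            # d(x+y)/dx = 1, so df/dx = -4.0
df_dy = df_dq * 1.0            # d(x+y)/dy = 1, so df/dy = -4.0

print(f, df_dx, df_dy, df_dz)  # -12.0 -4.0 -4.0 3.0
```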
Neural Network as a Computational Graph
A neural network layer performs the computation $a = f(Wx + b)$, where:
- $x$: the input vector (or the output from the previous layer)
- $W$: the weight matrix, with shape (output neurons × input neurons)
- $b$: the bias vector, with one value per output neuron
- $f$: the activation function (e.g., ReLU, sigmoid)
- $a$: the output activations
Let's break this into its constituent operations:
- $u = Wx$ (matrix multiplication: transforms the input through learned weights)
- $z = u + b$ (bias addition: shifts the output)
- $a = f(z)$ (activation: introduces nonlinearity)
Each node in the graph represents an operation that we know how to differentiate. The power of computational graphs is that they make the structure of the computation explicit, which is exactly what we need for backpropagation.
Why Computational Graphs Matter
Computational graphs provide two crucial benefits for gradient computation:
- Modularity: Each node has a well-defined local gradient. We don't need to derive the gradient of the entire network at once; we just need the gradient of each primitive operation.
- Efficiency: By storing intermediate values during the forward pass, we can reuse them during the backward pass. This avoids redundant computation.
The forward pass computes values flowing left to right. The backward pass computes gradients flowing right to left. This symmetry is the essence of backpropagation.
The Chain Rule: Foundation of Backpropagation
We've seen how computational graphs break complex neural network computations into simple, modular operations. Now we face the central question: how do we compute the gradient of the loss with respect to every parameter in this graph?
The answer lies in the chain rule from calculus. To truly understand backpropagation, we need to see the chain rule not as a formula to memorize, but as a way of thinking about how influence propagates through a system of nested computations. Every operation in a neural network transforms its inputs in some way, and the chain rule tells us exactly how to trace the effect of any input change through all those transformations to the final output.
The Core Intuition: Tracing Influence Through Compositions
Before diving into formulas, let's build intuition with a physical analogy. Imagine a factory assembly line where raw materials pass through several processing stations, each transforming the product in some way, until a final item emerges. If you want to know how changing the raw material affects the final product, you need to trace that change through every station in sequence.
The same logic applies to neural networks. An input $x$ passes through layer after layer of transformations until it produces a loss value $L$. To understand how $x$ affects $L$, we trace the influence through each transformation, and the chain rule gives us the precise mathematical tool to do this.
Single Variable Chain Rule
Let's start with the simplest case: two functions arranged in a pipeline. First, $g$ transforms the input $x$ into an intermediate value $u = g(x)$, then $f$ transforms that intermediate value into the final output $y$. We write this composition as $y = f(g(x))$.
The question we want to answer is: if we nudge $x$ by a tiny amount, how much does the final output $y$ change?
Think of it as a two-stage amplification process:
- Stage 1: The nudge to $x$ causes $u = g(x)$ to change. If $g$ is very sensitive to $x$ (steep slope), a small change in $x$ produces a large change in $u$. If $g$ is insensitive (flat slope), the change is small.
- Stage 2: This change in $u$ then causes the output $y = f(u)$ to change. Again, the magnitude depends on how sensitive $f$ is to its input.

The total effect is the product of these two sensitivities, as each stage amplifies (or attenuates) the signal from the previous stage:

$$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$$
where:
- $\frac{dy}{dx}$: the total derivative, how much the final output changes per unit change in the original input $x$. This is what we ultimately want to know.
- $\frac{dy}{du}$: the local gradient of the outer function, how sensitive $f$ is to changes in its immediate input $u$. We can compute this knowing only the function $f$, without caring about where $u$ came from.
- $\frac{du}{dx}$: the local gradient of the inner function, how sensitive $u = g(x)$ is to changes in $x$. We can compute this knowing only the function $g$.

Why multiplication? If $g$ doubles any change in $x$ (i.e., $\frac{du}{dx} = 2$), and $f$ triples any change in $u$ (i.e., $\frac{dy}{du} = 3$), then a unit change in $x$ ultimately causes a $2 \times 3 = 6$ unit change in $y$. The effects compound multiplicatively.
This modularity is the key insight: we can break a complex derivative into a product of simple, local derivatives. Each function in the chain only needs to know how to differentiate itself with respect to its immediate inputs. It doesn't need to understand the entire pipeline.
Let's make this concrete with an example. Consider the composite function $y = \sin(x^2)$ (a simple stand-in; any smooth composition works the same way), which we can decompose as:
- Inner function: $u = g(x) = x^2$
- Outer function: $y = f(u) = \sin(u)$

To find $\frac{dy}{dx}$, we compute each local gradient and multiply:
- $\frac{du}{dx} = 2x$ (derivative of $x^2$)
- $\frac{dy}{du} = \cos(u)$ (derivative of $\sin(u)$)

Multiplying and substituting $u = x^2$ gives $\frac{dy}{dx} = 2x \cos(x^2)$.
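A quick numerical check confirms the result. The sketch below compares the chain-rule gradient of $y = \sin(x^2)$ against a centered-difference approximation at an arbitrary point:

```python
import numpy as np

def f(x):
    return np.sin(x ** 2)          # outer sin, inner square

x = 1.5
analytical = 2 * x * np.cos(x ** 2)    # chain rule: (dy/du) * (du/dx)

eps = 1e-5
numerical = (f(x + eps) - f(x - eps)) / (2 * eps)   # centered difference

print(analytical, numerical, abs(analytical - numerical))
```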
The analytical gradient computed via the chain rule matches the numerical approximation essentially exactly, well within floating-point precision. This verification technique (comparing analytical gradients to numerical approximations) will become crucial when we implement backpropagation and need to verify our code is correct.
Multivariate Chain Rule: When Paths Diverge
The single-variable chain rule handles a simple pipeline where one thing leads to another in a straight line. But neural networks are more complex: a single variable can influence the output through multiple intermediate pathways. This is where the multivariate chain rule becomes essential.
The Problem: Multiple Pathways of Influence
Consider a concrete scenario: in a neural network, a single weight might affect multiple neurons in the next layer, each of which contributes to the final loss. How do we account for all these influences?
Think of it like a river that splits into multiple tributaries before reaching the ocean. If you add a drop of dye at the source, it flows through all the tributaries. To measure the total amount of dye reaching the ocean, you must sum the contributions from each path.
The same principle applies to gradients: when a variable affects the output through multiple paths, we sum the contributions from each path. If changing increases the loss through one path and decreases it through another, the net effect is the sum of these opposing influences.
The Formula
Mathematically, if a variable $x$ affects the loss $L$ through several intermediate variables $u_1, u_2, \ldots, u_n$, the total gradient is:

$$\frac{\partial L}{\partial x} = \sum_{i=1}^{n} \frac{\partial L}{\partial u_i} \cdot \frac{\partial u_i}{\partial x}$$
Let's unpack each component:
- $L$: the final loss, a single scalar value that we're trying to minimize
- $x$: the variable whose influence we want to understand (e.g., a weight or bias)
- $u_1, \ldots, u_n$: all the intermediate values that directly depend on $x$
- $\frac{\partial L}{\partial u_i}$: how sensitive the loss is to changes in $u_i$ (this is computed by backpropagating from the output)
- $\frac{\partial u_i}{\partial x}$: how sensitive $u_i$ is to changes in $x$ (this is computed locally)

Each term $\frac{\partial L}{\partial u_i} \cdot \frac{\partial u_i}{\partial x}$ represents one "pathway of influence," the effect of $x$ on $L$ through the specific route via $u_i$. The sum aggregates all these pathways, as the short sketch below makes concrete.
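A minimal example, assuming a toy computation where $x$ feeds two intermediates $u_1 = xy$ and $u_2 = xz$ that are summed into the loss:

```python
# x reaches L through two paths: u1 = x*y and u2 = x*z, with L = u1 + u2
x, y, z = 2.0, 3.0, -1.0

u1 = x * y                     # path 1
u2 = x * z                     # path 2
L = u1 + u2

dL_du1, dL_du2 = 1.0, 1.0      # L = u1 + u2, so both are 1
du1_dx, du2_dx = y, z          # local gradients of each intermediate

# Multivariate chain rule: sum the contribution of every path
dL_dx = dL_du1 * du1_dx + dL_du2 * du2_dx   # = y + z = 2.0
print(dL_dx)
```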
Why This Matters for Neural Networks
This summation principle has profound implications for neural network training:
- Gradient accumulation: When the same weight is used multiple times (as in recurrent networks processing sequences, or in architectures with shared weights), its gradient naturally accumulates contributions from each use.
- Fan-out patterns: When a layer's output feeds into multiple downstream computations, the gradient flowing back must aggregate all these influences.
- Batch processing: When we process multiple examples simultaneously, each example contributes to the gradient, and we sum (or average) these contributions.
Chain Rule on Computational Graphs: The Backpropagation Algorithm
Now we can see how the chain rule maps onto computational graphs, and this mapping is what transforms the abstract mathematics into a practical algorithm.
From Math to Algorithm
Recall that a computational graph represents a neural network's computation as a directed graph: nodes are operations, and edges show how data flows from inputs to outputs. During the forward pass, we compute values flowing from left to right. During the backward pass, we compute gradients flowing from right to left, from the loss back toward the inputs.
The key insight is that each node only needs to know its local gradient. A node doesn't need to understand the entire network; it just needs to know two things:
- Forward: How to compute its output from its inputs
- Backward: How to compute the gradient of its output with respect to its inputs
This locality is what makes backpropagation modular and scalable.
The Backward Pass Formula
For any node $v$ in the graph, the gradient of the loss with respect to $v$ follows directly from the multivariate chain rule:

$$\frac{\partial L}{\partial v} = \sum_{c \,\in\, \text{children}(v)} \frac{\partial L}{\partial c} \cdot \frac{\partial c}{\partial v}$$
where:
- $L$: the loss function (a scalar value we're trying to minimize)
- $v$: the current node whose gradient we're computing
- $\text{children}(v)$: the set of all nodes that directly receive $v$ as input (i.e., nodes downstream from $v$)
- $\frac{\partial L}{\partial c}$: the gradient of the loss with respect to each child node $c$ (already computed during backpropagation)
- $\frac{\partial c}{\partial v}$: the local gradient, how much node $c$'s output changes when $v$ changes
The Algorithm in Plain Language
This formula encodes a simple, elegant algorithm. For each node $v$, working backward from the loss:

1. Look downstream: Find all nodes that use $v$ as an input (these are $v$'s "children" in graph terminology)
2. Collect incoming gradients: Each child $c$ sends back a gradient signal $\frac{\partial L}{\partial c}$, which tells us how sensitive the loss is to $c$'s output
3. Scale by local sensitivity: Multiply each incoming gradient by $\frac{\partial c}{\partial v}$, the local gradient that describes how much $c$'s output changes when $v$ changes
4. Sum the contributions: Add up all these scaled gradients to get the total gradient at $v$
Why This Works: The Recursive Structure
The beauty of this approach is its recursive structure. We start at the loss node, where $\frac{\partial L}{\partial L} = 1$ (the loss is perfectly sensitive to itself). Then we work backward through the graph, layer by layer:
- Local information is sufficient: Each node computes its local gradient using only the operation type and values stored during the forward pass
- Gradients flow backward: Each node receives gradient signals from its children (which have already been computed) and passes gradient signals to its parents
- No global coordination needed: The algorithm is inherently parallel, and nodes at the same depth can compute their gradients simultaneously
This is the essence of backpropagation: a systematic, efficient way to apply the chain rule to computational graphs.
When a variable affects the loss through multiple paths, we sum the gradients from each path. This is why the chain rule involves a sum in the multivariate case.
Forward Pass: Computing Values
With the chain rule understood, we're ready to trace through an actual neural network. The forward pass propagates input data through the network, computing and storing intermediate values at each node. These stored values are essential for the backward pass; without them, we couldn't compute the local gradients.
Think of the forward pass as leaving breadcrumbs: as we compute each intermediate value, we save it for later use when gradients flow backward.
Step-by-Step Forward Computation
Let's trace through a concrete 2-layer network with the following architecture:
- Input: $x \in \mathbb{R}^2$ (2 features)
- Hidden layer: 3 neurons with ReLU activation
- Output layer: 1 neuron with sigmoid activation (for binary classification)
This small network is simple enough to trace by hand, yet complex enough to illustrate all the key concepts.
Now let's perform the forward pass, storing all intermediate values:
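The original interactive code isn't reproduced here, so the sketch below reconstructs the trace in NumPy with assumed weight values, chosen so the intermediate values line up with the discussion that follows (in particular, $z_1[2] = -0.1$ and a prediction near 0.55):

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Input: 2 features, stored as a column vector
x = np.array([[1.0], [0.5]])                 # shape (2, 1)

# Illustrative (assumed) parameters
W1 = np.array([[0.3, 0.2],
               [-0.4, 0.6],
               [0.1, -0.4]])                 # shape (3, 2)
b1 = np.array([[0.1], [0.2], [0.0]])         # shape (3, 1)
W2 = np.array([[0.4, -0.3, 0.5]])            # shape (1, 3)
b2 = np.array([[0.05]])                      # shape (1, 1)

# Forward pass, storing every intermediate value for the backward pass
z1 = W1 @ x + b1        # pre-activations: [0.5, 0.1, -0.1]
a1 = relu(z1)           # activations:     [0.5, 0.1,  0.0]  (third zeroed)
z2 = W2 @ a1 + b2       # output logit:    [0.22]
y_hat = sigmoid(z2)     # prediction:      [0.5548]

print(z1.ravel(), a1.ravel(), z2.ravel(), y_hat.ravel())
```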
Notice how the ReLU activation zeroed out the negative pre-activation value in position 2 (z1[2] = -0.1 became a1[2] = 0). This sparsity is characteristic of ReLU networks, and it will have important implications for the backward pass, as we'll see shortly.
Computing the Loss
With the forward pass complete, we've arrived at a prediction. But how good is it? We need a way to quantify the error, and this is the role of the loss function.
For binary classification, we use binary cross-entropy (BCE), which measures how well the predicted probability matches the true label:

$$L = -\left[\, y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \,\right]$$
where:
- $y$: the true label, either 0 or 1
- $\hat{y}$: the predicted probability from the sigmoid output, a value between 0 and 1
- $\log$: the natural logarithm
- $L$: the loss value (always non-negative)
Understanding the Formula
This formula has an elegant interpretation that becomes clear when we consider the two cases:
When $y = 1$ (positive class): The second term vanishes (since $1 - y = 0$), leaving $L = -\log \hat{y}$.
- If $\hat{y} \approx 1$ (correct prediction): $-\log \hat{y} \approx 0$, so loss is small
- If $\hat{y} \approx 0$ (wrong prediction): $-\log \hat{y} \to \infty$, so loss is large

When $y = 0$ (negative class): The first term vanishes (since $y = 0$), leaving $L = -\log(1 - \hat{y})$.
- If $\hat{y} \approx 0$ (correct prediction): $-\log(1 - \hat{y}) \approx 0$, so loss is small
- If $\hat{y} \approx 1$ (wrong prediction): $-\log(1 - \hat{y}) \to \infty$, so loss is large
The logarithm creates an asymmetric penalty: confident wrong predictions are punished severely, while confident correct predictions are rewarded with near-zero loss.
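Continuing the running example, a two-line sketch computes the loss for our prediction against the true label $y = 1$:

```python
# BCE loss for the forward pass above, with true label y = 1
y = 1.0
loss = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
print(float(loss))   # about 0.589
```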
The loss of approximately 0.59 is moderate: not terrible, but with clear room for improvement. Our prediction of roughly 0.55 is below the target of 1.0, meaning the network is underconfident about the positive class. The loss quantifies this gap: a perfect prediction of 1.0 would yield a loss near 0, while our current prediction produces a loss that will drive gradient updates to increase the output probability. Now we need to compute gradients to reduce this loss.
Backward Pass: Computing Gradients
Now comes the heart of backpropagation. We've completed the forward pass, storing all intermediate values. We've computed the loss. Now we need to answer the fundamental question: how should we adjust each weight to reduce the loss?
The backward pass answers this by computing for every weight in the network. Armed with these gradients, we can update weights in the direction that decreases the loss.
The backward pass is like unwinding a ball of yarn. We start at the end (the loss) and work our way back to the beginning (the inputs), carefully tracking how each intermediate value influenced the final result. At each step, we apply the chain rule to decompose the complex derivative into simpler, local derivatives.
Step 1: Gradient of the Loss with Respect to the Prediction
Every backward pass begins at the loss function. The loss is our starting point because it's what we're trying to minimize. It's the "error signal" that will propagate backward through the network.
Before we can backpropagate through the network layers, we need to know: how does the loss change when the prediction changes? This is the gradient $\frac{\partial L}{\partial \hat{y}}$, the sensitivity of the loss to our prediction.
Let's derive this step by step, starting from the BCE formula:
Step 1: Write out the loss function:

$$L = -\left[\, y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \,\right]$$

This formula has two terms, but only one is ever "active" for a given example. When $y = 1$ (positive class), the second term vanishes because $1 - y = 0$. When $y = 0$ (negative class), the first term vanishes because $y = 0$.

Step 2: Differentiate each term with respect to $\hat{y}$:

For the first term, we use the fact that $\frac{d}{d\hat{y}} \log \hat{y} = \frac{1}{\hat{y}}$:

$$\frac{\partial}{\partial \hat{y}} \left[ -y \log \hat{y} \right] = -\frac{y}{\hat{y}}$$

For the second term, we need the chain rule because we're differentiating $\log(1 - \hat{y})$ (the inner function $1 - \hat{y}$ contributes a factor of $-1$):

$$\frac{\partial}{\partial \hat{y}} \left[ -(1 - y) \log(1 - \hat{y}) \right] = \frac{1 - y}{1 - \hat{y}}$$

Step 3: Combine the terms:

$$\frac{\partial L}{\partial \hat{y}} = -\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}}$$
Interpreting the Gradient
This formula reveals something important about how the loss "pushes" the prediction:
| Scenario | Gradient | Effect |
|---|---|---|
| $y = 1$, $\hat{y}$ small (wrong) | $-\frac{1}{\hat{y}}$ (large negative) | Strong push to increase $\hat{y}$ |
| $y = 1$, $\hat{y}$ close to 1 (correct) | $\approx -1$ (small negative) | Gentle push to increase $\hat{y}$ |
| $y = 0$, $\hat{y}$ large (wrong) | $\frac{1}{1 - \hat{y}}$ (large positive) | Strong push to decrease $\hat{y}$ |
| $y = 0$, $\hat{y}$ close to 0 (correct) | $\approx 1$ (small positive) | Gentle push to decrease $\hat{y}$ |
The loss function naturally provides stronger correction signals for more confident wrong predictions. This is exactly what we want: big mistakes get big corrections.
Step 2: Backpropagating Through Sigmoid
Now we need to push the gradient one step further back, through the sigmoid activation function. This is our first example of backpropagating through a nonlinearity, and it reveals a beautiful mathematical property that makes the combined BCE + sigmoid gradient remarkably simple.
The sigmoid function $\sigma(z) = \frac{1}{1 + e^{-z}}$ squashes any input into the range (0, 1). But what's its derivative? How sensitive is the sigmoid output to changes in its input?
Deriving the sigmoid derivative:
Step 1: Rewrite sigmoid using exponent notation for easier differentiation:

$$\sigma(z) = (1 + e^{-z})^{-1}$$

Step 2: Apply the chain rule. Let $u = 1 + e^{-z}$, so $\sigma = u^{-1}$:

$$\frac{d\sigma}{dz} = \frac{d\sigma}{du} \cdot \frac{du}{dz} = (-u^{-2}) \cdot (-e^{-z}) = \frac{e^{-z}}{(1 + e^{-z})^2}$$

where:
- $u$: a substitution variable to simplify the differentiation
- $\frac{d\sigma}{du} = -u^{-2}$: the derivative of $u^{-1}$ with respect to $u$
- $\frac{du}{dz} = -e^{-z}$: the derivative of $u$ with respect to $z$ (the constant 1 vanishes, and $\frac{d}{dz} e^{-z} = -e^{-z}$)

The two negative signs cancel, giving us a positive result.

Step 3: Here's where it gets elegant. We can rewrite this messy expression entirely in terms of $\sigma(z)$ itself:
- Note that $\sigma(z) = \frac{1}{1 + e^{-z}}$
- And $1 - \sigma(z) = \frac{e^{-z}}{1 + e^{-z}}$
- Therefore: $\frac{e^{-z}}{(1 + e^{-z})^2} = \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}} = \sigma(z)(1 - \sigma(z))$

This gives us the remarkably simple result:

$$\sigma'(z) = \sigma(z)(1 - \sigma(z))$$
where:
- : the sigmoid output, a value between 0 and 1
- : the complement of the sigmoid output
- : the product, which is maximized when
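The identity is easy to verify numerically. A small sketch comparing $\sigma(z)(1 - \sigma(z))$ against a centered-difference estimate at a few points:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Check sigma'(z) = sigma(z) * (1 - sigma(z)) at several inputs
for z in [-4.0, -1.0, 0.0, 1.0, 4.0]:
    s = sigmoid(z)
    analytical = s * (1 - s)
    eps = 1e-5
    numerical = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
    print(f"z={z:+.1f}  analytical={analytical:.6f}  numerical={numerical:.6f}")
# The derivative peaks at 0.25 when z = 0 and decays toward 0 in the tails.
```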
The Geometry of the Sigmoid Derivative
This formula has a beautiful geometric interpretation:
- Maximum at the midpoint: The derivative peaks at $z = 0$, where $\sigma(0) = 0.5$ and $\sigma'(0) = 0.25$. This is where the sigmoid is steepest.
- Vanishes at the extremes: As $z \to +\infty$ or $z \to -\infty$, the derivative approaches zero. The sigmoid flattens out.
This behavior has profound implications for training: when activations saturate near 0 or 1, gradients become tiny, making learning slow or impossible. This is the vanishing gradient problem for sigmoid activations.
This explains why sigmoid activations are problematic in deep networks. When the pre-activation $z$ is far from zero (either very positive or very negative), the derivative approaches zero, causing gradients to vanish during backpropagation. Only inputs near $z = 0$ produce meaningful gradients.
Step 3: The Beautiful Cancellation (Combining BCE and Sigmoid)
Now comes the mathematical payoff that makes backpropagation through classification networks elegant. When we chain the BCE gradient with the sigmoid gradient using the chain rule, something remarkable happens:

$$\frac{\partial L}{\partial z} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z}$$
where:
- $z$: the pre-activation value (logit) before the sigmoid, i.e., the raw output of the linear layer
- $\hat{y} = \sigma(z)$: the sigmoid output (predicted probability)
- $\frac{\partial L}{\partial \hat{y}} = -\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}}$: the gradient of the loss with respect to the prediction (computed in the previous section)
- $\frac{\partial \hat{y}}{\partial z} = \hat{y}(1 - \hat{y})$: the sigmoid derivative (computed above)
Substituting our expressions for $\frac{\partial L}{\partial \hat{y}}$ and $\frac{\partial \hat{y}}{\partial z}$:

$$\frac{\partial L}{\partial z} = \left( -\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}} \right) \cdot \hat{y}(1 - \hat{y})$$

Let's expand this carefully:

$$\frac{\partial L}{\partial z} = -y(1 - \hat{y}) + (1 - y)\hat{y} = -y + y\hat{y} + \hat{y} - y\hat{y}$$

The complex-looking derivatives collapse into something beautifully simple:

$$\frac{\partial L}{\partial z} = \hat{y} - y$$
where:
- $\hat{y}$: the predicted probability (sigmoid output)
- $y$: the true label (0 or 1)
- $\hat{y} - y$: the prediction error, which is positive when we over-predict and negative when we under-predict
This is remarkably simple: the gradient at the pre-sigmoid layer is just the prediction minus the target, the prediction error!
This elegant simplification is not a coincidence. Sigmoid and cross-entropy were designed to work together, and the mathematical complexity cancels out perfectly. The result is a gradient that's both easy to compute and deeply intuitive:
| Condition | Gradient | Interpretation |
|---|---|---|
| $\hat{y} > y$ | Positive | Push logit down to reduce prediction |
| $\hat{y} < y$ | Negative | Push logit up to increase prediction |
| $\hat{y} = y$ | Zero | Perfect prediction, no update needed |
The gradient magnitude is proportional to how wrong we are. A prediction of 0.9 when the target is 0 produces gradient 0.9, while a prediction of 0.6 produces gradient 0.6. Bigger mistakes get bigger corrections.
Step 4: Implementing the Full Backward Pass
Now we can implement the complete backward pass. The code below traces gradients from the loss back through both layers, computing gradients for all weights and biases.
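As with the forward pass, the original code isn't shown here; this is a minimal NumPy reconstruction that assumes the variables from the forward-pass and loss sketches (x, z1, a1, W2, y_hat, y) are still in scope:

```python
# Step 1: gradient at the logit, using the BCE + sigmoid cancellation
dz2 = y_hat - y                  # shape (1, 1): [[-0.4452]]

# Step 2: gradients for the output layer parameters
dW2 = dz2 @ a1.T                 # shape (1, 3), matches W2
db2 = dz2                        # shape (1, 1), matches b2

# Step 3: gradient flowing back into the hidden activations
da1 = W2.T @ dz2                 # shape (3, 1)

# Step 4: through the ReLU gate; blocked wherever z1 <= 0
dz1 = da1 * (z1 > 0)             # third entry zeroed (z1[2] = -0.1)

# Step 5: gradients for the hidden layer parameters
dW1 = dz1 @ x.T                  # shape (3, 2), matches W1
db1 = dz1                        # shape (3, 1), matches b1

print("dW2:", dW2)               # [[-0.2226 -0.0445  0.    ]]
print("dW1:", dW1)
```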
Understanding the Backward Pass Step by Step
Let's trace through what just happened, connecting each computation to the chain rule:
1. Starting at the loss: We computed $\frac{\partial L}{\partial z_2} = \hat{y} - y$, the gradient at the output layer. This single number tells us how wrong our prediction was, and it's the "error signal" that will propagate backward.

2. Gradient for output weights ($W_2$): We asked: "How does each weight in $W_2$ affect $L$?" Since $z_2 = W_2 a_1 + b_2$, each weight is multiplied by the corresponding activation in $a_1$. By the chain rule:

$$\frac{\partial L}{\partial W_2} = \frac{\partial L}{\partial z_2} \cdot a_1^T$$

3. Gradient for hidden activations ($a_1$): To continue backpropagating, we need to know how $a_1$ affected the loss. Since $z_2 = W_2 a_1 + b_2$, changing $a_1$ by a small amount changes $z_2$ by $W_2$ times that amount:

$$\frac{\partial L}{\partial a_1} = W_2^T \cdot \frac{\partial L}{\partial z_2}$$

4. Through the ReLU ($z_1$): The ReLU function has a simple derivative:

$$\text{ReLU}'(z) = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{otherwise} \end{cases}$$

This acts as a gate: gradients pass through unchanged for positive pre-activations, but are completely blocked for negative ones. The gradient becomes $\frac{\partial L}{\partial z_1} = \frac{\partial L}{\partial a_1} \odot \mathbb{1}[z_1 > 0]$, where $\odot$ denotes element-wise multiplication.

5. Gradient for hidden weights ($W_1$): Following the same pattern as the output layer:

$$\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial z_1} \cdot x^T$$
The ReLU Gate and "Dying Neurons"
Notice how the ReLU gradient acts as a gate. Where $z_1 > 0$, the gradient passes through unchanged. Where $z_1 \leq 0$, the gradient is blocked (zeroed). In our example, the third neuron had $z_1 = -0.1$, so its gradient was blocked entirely.
This is the "dying ReLU" phenomenon: neurons that never activate don't receive gradient updates. If a neuron's pre-activation is always negative (perhaps due to poor initialization or a large negative bias), it will never learn. This is why variants like Leaky ReLU (which allow small gradients for negative inputs) are sometimes preferred.
Understanding the Gradient Shapes
Each gradient has a specific shape that matches its corresponding parameter:
| Parameter | Shape | Gradient Shape | Explanation |
|---|---|---|---|
| W1 | (3, 2) | (3, 2) | 3 hidden neurons, 2 inputs |
| b1 | (3, 1) | (3, 1) | One bias per hidden neuron |
| W2 | (1, 3) | (1, 3) | 1 output, 3 hidden inputs |
| b2 | (1, 1) | (1, 1) | One bias for output |
The gradient shapes always match the parameter shapes. This must be true because we subtract gradients from parameters during the update step, and subtraction requires matching shapes.
Verifying Gradients: The Gradient Checking Technique
We've derived and implemented the backward pass analytically. But how do we know we got it right? A single sign error or transposed matrix can silently produce wrong gradients, leading to a network that trains poorly or not at all.
The solution is gradient checking: verify analytical gradients against numerical approximations. The idea is simple: perturb each parameter slightly and measure how the loss changes. If our analytical gradient is correct, it should match this numerical estimate.
The Centered Difference Formula
The centered difference formula gives an accurate numerical approximation:

$$\frac{\partial L}{\partial \theta} \approx \frac{L(\theta + \epsilon) - L(\theta - \epsilon)}{2\epsilon}$$
where:
- $\theta$: a single scalar parameter we're checking
- $\epsilon$: a small perturbation, typically around $10^{-5}$
- $L(\theta + \epsilon)$: the loss computed with $\theta$ increased by $\epsilon$
- $L(\theta - \epsilon)$: the loss computed with $\theta$ decreased by $\epsilon$
Why centered difference? The one-sided formula $\frac{L(\theta + \epsilon) - L(\theta)}{\epsilon}$ has $O(\epsilon)$ error. The centered difference cancels the first-order error terms, giving $O(\epsilon^2)$ accuracy, typically 5-7 orders of magnitude better for the same $\epsilon$.
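Here is a sketch of that check applied to dW2 from the backward pass above (again assuming the earlier variables are in scope):

```python
import numpy as np

def forward_loss(W2_val):
    """Recompute the loss from scratch with a perturbed W2."""
    z1 = W1 @ x + b1
    a1 = np.maximum(0, z1)
    z2 = W2_val @ a1 + b2
    y_hat = 1.0 / (1.0 + np.exp(-z2))
    return float(-(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)))

eps = 1e-5
numerical_dW2 = np.zeros_like(W2)
for idx in np.ndindex(W2.shape):
    W2_plus, W2_minus = W2.copy(), W2.copy()
    W2_plus[idx] += eps
    W2_minus[idx] -= eps
    numerical_dW2[idx] = (forward_loss(W2_plus) - forward_loss(W2_minus)) / (2 * eps)

print("analytical:", dW2)
print("numerical: ", numerical_dW2)
print("max abs difference:", np.abs(dW2 - numerical_dW2).max())
```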
The analytical and numerical gradients match to within the expected $O(\epsilon^2)$ precision for $\epsilon = 10^{-5}$. This confirms our backpropagation implementation is correct.
Gradient checking is computationally expensive (it requires two forward passes per parameter), so don't use it during normal training. Instead, use it:
- When implementing a new layer or activation function
- When debugging a network that won't train
- When porting code to a new framework
- Once, after implementing backpropagation, to verify correctness
A relative error below $10^{-7}$ typically indicates a correct implementation. Errors above $10^{-4}$ suggest a bug.
Gradient Accumulation
When a variable is used multiple times in a computation, its gradient accumulates contributions from all uses. This is the multivariate chain rule in action.
Why Gradients Accumulate
Consider a weight matrix $W$ used to process a batch of $N$ examples. Each example contributes to the total loss, and each contribution produces its own gradient for $W$. Since the total loss is the average (or sum) of individual losses, the total gradient is the corresponding average (or sum) of individual gradients. For the average-loss case:

$$\frac{\partial L}{\partial W} = \frac{1}{N} \sum_{i=1}^{N} \frac{\partial L_i}{\partial W}$$
where:
- $N$: the batch size (number of examples processed together)
- $L_i$: the loss contribution from example $i$
- $\frac{\partial L_i}{\partial W}$: the gradient of the loss with respect to $W$ from example $i$ alone, a matrix with the same shape as $W$
- $\frac{\partial L}{\partial W}$: the total gradient, aggregated across all examples

This summation happens automatically in matrix operations when processing batches. When we compute $\frac{\partial L}{\partial W_2} = \frac{1}{N} \frac{\partial L}{\partial z_2} \, a_1^T$, the matrix multiplication implicitly sums over the batch dimension, and the $\frac{1}{N}$ factor converts the sum to an average.
Batch Processing and Gradient Accumulation
Let's see how gradient accumulation works with a batch of examples:
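The sketch below reuses the illustrative weights from the earlier forward-pass example and pushes a batch of four random examples through the network; the inputs and labels are assumptions for the demonstration:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 4                                  # batch size
X = rng.standard_normal((2, N))        # each column is one example
Y = rng.integers(0, 2, size=(1, N)).astype(float)   # random binary labels

# Forward pass on the whole batch; broadcasting adds biases to every column
z1 = W1 @ X + b1                       # (3, 4)
a1 = np.maximum(0, z1)                 # (3, 4)
z2 = W2 @ a1 + b2                      # (1, 4)
y_hat = 1.0 / (1.0 + np.exp(-z2))      # (1, 4)

# Backward pass: the matrix product sums over the batch dimension,
# and the 1/N factor turns the sum into an average
dz2 = (y_hat - Y) / N                  # (1, 4)
dW2 = dz2 @ a1.T                       # (1, 4) @ (4, 3) -> (1, 3)
db2 = dz2.sum(axis=1, keepdims=True)   # (1, 1)

print("predictions:", y_hat.round(2))
print("dW2 shape:", dW2.shape)         # (1, 3), matches W2
```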
The predictions show mixed accuracy on this random batch, with some examples classified correctly and others not. This is expected: the weights are illustrative values rather than the product of training, so they carry no information about these random inputs. The key observation is the gradient shape: dW2 has shape (1, 3), matching W2 exactly. The matrix multiplication dz2 @ a1.T (shapes 1×4 and 4×3) automatically sums over the batch dimension, and the $\frac{1}{N}$ factor converts this sum to an average, making the effective learning rate independent of batch size.
Visualizing Gradient Flow
Gradient magnitudes can vary significantly across layers as gradients flow backward through the network. In deep networks, this can lead to vanishing gradients (gradients become tiny in early layers) or exploding gradients (gradients become huge). Techniques like batch normalization, residual connections, and careful initialization help mitigate these issues.
Tracking Gradients During Training
Let's train a deeper network and monitor how gradient magnitudes evolve across layers and epochs:
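A compact sketch of that experiment, with an assumed 4-layer architecture and random data. All layers use sigmoid activations here so every backward step multiplies by an activation derivative, which makes the shrinking effect easy to see:

```python
import numpy as np

rng = np.random.default_rng(1)
sizes = [2, 16, 16, 16, 1]             # 4 weight layers
Ws = [rng.standard_normal((sizes[i + 1], sizes[i])) * 0.5 for i in range(4)]
bs = [np.zeros((sizes[i + 1], 1)) for i in range(4)]
X = rng.standard_normal((2, 64))
Y = (X[0:1] * X[1:2] > 0).astype(float)   # arbitrary binary target

for epoch in range(3):
    # Forward pass, keeping every activation
    acts = [X]
    for W, b in zip(Ws, bs):
        acts.append(1.0 / (1.0 + np.exp(-(W @ acts[-1] + b))))

    # Backward pass: BCE + sigmoid at the output, sigmoid derivative inside
    delta = (acts[-1] - Y) / X.shape[1]
    grads = []
    for i in reversed(range(4)):
        grads.append(delta @ acts[i].T)
        if i > 0:
            delta = (Ws[i].T @ delta) * acts[i] * (1 - acts[i])
    grads.reverse()

    report = "  ".join(f"|dW{i + 1}|={np.abs(g).mean():.2e}"
                       for i, g in enumerate(grads))
    print(f"epoch {epoch}: {report}")

    for W, g in zip(Ws, grads):        # simple gradient descent step
        W -= 0.5 * g
```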
The gradient magnitudes reveal a key pattern: gradients tend to be smaller in earlier layers (closer to the input). This is the vanishing gradient effect in action. Each layer's backward pass multiplies gradients by weight matrices and activation derivatives, which can compound to shrink gradients exponentially. In this 4-layer network, the effect is modest, but in networks with 50+ layers, it becomes severe without architectural interventions like residual connections.
Complexity Analysis
Understanding the computational cost of backpropagation helps explain why it's so efficient and why training is more expensive than inference.
Time Complexity
For a single layer with $n$ inputs and $m$ outputs:

| Operation | Forward | Backward |
|---|---|---|
| Matrix multiply | $O(nm)$ | $O(nm)$ |
| Bias addition | $O(m)$ | $O(m)$ |
| Activation | $O(m)$ | $O(m)$ |
| Total per layer | $O(nm)$ | $O(nm)$ |
The backward pass has the same asymptotic complexity as the forward pass. This is a remarkable property: computing all gradients costs roughly the same as computing the output.
For a network with $L$ layers and average layer size $n$, the total time complexity is:

$$O(L \cdot n^2)$$
where:
- $L$: number of layers
- $n$: average number of neurons per layer
- $n^2$: comes from the matrix multiplication at each layer (an $n \times n$ weight matrix multiplied by an $n$-dimensional vector)
This holds for both forward and backward passes. The constant factor for backward is typically 2-3× the forward pass due to additional operations (computing gradients for both weights and activations).
Space Complexity
Backpropagation requires storing intermediate activations from the forward pass because these values are needed to compute gradients during the backward pass. For a batch of size $B$ through $L$ layers with average layer size $n$ neurons, the activation storage is:

$$O(B \cdot L \cdot n)$$
where:
- $B$: batch size (number of examples processed together)
- $L$: number of layers in the network
- $n$: average number of neurons per layer

The total number of stored values is approximately $B \cdot L \cdot n$, since we store one activation value per neuron per example. This memory requirement is why:
- Larger batch sizes require more GPU memory
- Deeper networks require more memory
- Techniques like gradient checkpointing trade compute for memory by recomputing activations during the backward pass
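A back-of-the-envelope calculation, assuming a 784-256-128-64-10 network (MNIST-sized input), a batch size of 32, and float32 activations:

```python
# One stored activation per neuron per example, 4 bytes each (float32)
layer_sizes = [784, 256, 128, 64, 10]   # assumed architecture
batch_size = 32

values = sum(layer_sizes) * batch_size
print(f"{values * 4:,} bytes")           # 158,976 bytes, about 158 KB
```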
For this small network with a batch size of 32, storing activations requires only about 158 KB. However, this scales linearly with both batch size and network depth. A ResNet-50 with 50 layers and batch size 256 would require gigabytes of activation memory. The input layer (784 neurons for MNIST images) dominates the memory footprint here, but in deeper networks, the cumulative storage across many hidden layers becomes the bottleneck.
Why Backpropagation is Efficient
Before backpropagation, computing gradients required either:
- Numerical differentiation: Perturb each of the network's $P$ parameters and measure the loss change. Cost: $O(P)$ forward passes per gradient computation. For networks with millions of parameters, this is prohibitively expensive.
- Symbolic differentiation: Derive analytical gradient formulas. This produces expressions that grow exponentially with network depth due to repeated subexpressions.
Backpropagation achieves $O(L \cdot n^2)$ complexity, the same order as a single forward pass, by:
- Computing gradients in a single backward sweep
- Reusing intermediate values computed during the forward pass
- Avoiding redundant computation through dynamic programming
The gap between the two methods grows linearly with network size. For a network with 1 million parameters, numerical differentiation would be roughly 1 million times slower than backpropagation.
Automatic Differentiation
Modern deep learning frameworks like PyTorch and TensorFlow implement automatic differentiation (autodiff), which automates the backpropagation process. Understanding autodiff helps you use these frameworks more effectively.
How Autodiff Works
Autodiff systems build a computational graph during the forward pass, recording every operation. During the backward pass, they traverse this graph in reverse, applying the chain rule at each node.
There are two modes of autodiff:
- Forward mode: Computes $\frac{\partial y_j}{\partial x}$ for all outputs $y_j$ with respect to one input $x$. Efficient when there are few inputs and many outputs.
- Reverse mode: Computes $\frac{\partial y}{\partial x_i}$ for all inputs $x_i$ with respect to one output $y$. Efficient when there are many inputs and few outputs (like a scalar loss).
Neural network training uses reverse mode because we have many parameters (inputs) and one loss (output).
PyTorch's Autograd
PyTorch implements reverse-mode autodiff through its autograd system. Every tensor operation is recorded, and calling .backward() on a scalar triggers gradient computation.
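The sketch below rebuilds our 2-layer example in PyTorch, reusing the illustrative weights from the NumPy version so the gradients can be compared directly:

```python
import torch

x = torch.tensor([[1.0], [0.5]])
y = torch.tensor([[1.0]])

# Same illustrative weights as before, now tracked by autograd
W1 = torch.tensor([[0.3, 0.2], [-0.4, 0.6], [0.1, -0.4]], requires_grad=True)
b1 = torch.tensor([[0.1], [0.2], [0.0]], requires_grad=True)
W2 = torch.tensor([[0.4, -0.3, 0.5]], requires_grad=True)
b2 = torch.tensor([[0.05]], requires_grad=True)

# Forward pass: every operation is recorded in the computational graph
z1 = W1 @ x + b1
a1 = torch.relu(z1)
y_hat = torch.sigmoid(W2 @ a1 + b2)
loss = -(y * torch.log(y_hat) + (1 - y) * torch.log(1 - y_hat)).sum()

loss.backward()          # reverse-mode autodiff through the recorded graph
print(W2.grad)           # matches the manual dW2: [[-0.2226, -0.0445, 0.0]]
print(W1.grad)
```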
PyTorch's gradients match our manual implementation exactly. The framework handles all the bookkeeping automatically, tracking operations and computing gradients with a single .backward() call.
The Computational Graph in PyTorch
PyTorch builds a dynamic computational graph, meaning the graph is constructed fresh for each forward pass. This allows for dynamic control flow (if statements, loops) that can change based on input data.
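A small sketch: the branch taken, and therefore the graph recorded, depends on the input data. The data-dependent activation rule here is purely illustrative:

```python
import torch

def forward(x, W):
    h = W @ x
    if h.norm() > 1.0:             # data-dependent control flow
        return torch.tanh(h).sum()
    return torch.relu(h).sum()

W = torch.randn(3, 2, requires_grad=True)
x = torch.randn(2, 1)

out = forward(x, W)
out.backward()                     # differentiates whichever branch ran
print(W.grad)
```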
Despite the conditional logic that selects different activation functions at runtime, PyTorch successfully computed gradients by building the computational graph on-the-fly during the forward pass. This dynamic graph capability is what enables models with variable-length inputs (like RNNs processing sentences of different lengths), recursive structures (like tree-structured neural networks), and attention mechanisms where computation paths depend on the data.
Implementing Backpropagation from Scratch
We've traced through backpropagation step by step, derived the key formulas, and verified them numerically. Now let's consolidate everything into a complete, working implementation.
The NeuralNetwork class below encapsulates the full training loop: forward pass, loss computation, backward pass, and parameter updates. This is the same algorithm that powers deep learning frameworks, but we're implementing it explicitly to understand every detail.
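Below is one way to write such a class, a minimal sketch of the chapter's ideas rather than a canonical implementation. It uses a 2-layer architecture with ReLU and sigmoid, BCE loss, and full-batch gradient descent:

```python
import numpy as np

class NeuralNetwork:
    """A 2-layer network: ReLU hidden layer, sigmoid output, BCE loss."""

    def __init__(self, n_input, n_hidden, n_output, seed=42):
        rng = np.random.default_rng(seed)
        # Small random weights; biases start at zero
        self.W1 = rng.standard_normal((n_hidden, n_input)) * 0.5
        self.b1 = np.zeros((n_hidden, 1))
        self.W2 = rng.standard_normal((n_output, n_hidden)) * 0.5
        self.b2 = np.zeros((n_output, 1))

    def forward(self, X):
        # Store every intermediate value; the backward pass needs them
        self.X = X
        self.z1 = self.W1 @ X + self.b1
        self.a1 = np.maximum(0, self.z1)              # ReLU
        self.z2 = self.W2 @ self.a1 + self.b2
        self.y_hat = 1.0 / (1.0 + np.exp(-self.z2))   # sigmoid
        return self.y_hat

    def backward(self, Y):
        N = Y.shape[1]
        dz2 = (self.y_hat - Y) / N                    # BCE + sigmoid shortcut
        self.dW2 = dz2 @ self.a1.T
        self.db2 = dz2.sum(axis=1, keepdims=True)
        dz1 = (self.W2.T @ dz2) * (self.z1 > 0)       # ReLU gate
        self.dW1 = dz1 @ self.X.T
        self.db1 = dz1.sum(axis=1, keepdims=True)

    def step(self, lr):
        self.W1 -= lr * self.dW1
        self.b1 -= lr * self.db1
        self.W2 -= lr * self.dW2
        self.b2 -= lr * self.db2

    def loss(self, Y):
        p = np.clip(self.y_hat, 1e-12, 1 - 1e-12)     # avoid log(0)
        return float(-(Y * np.log(p) + (1 - Y) * np.log(1 - p)).mean())

    def train(self, X, Y, lr=0.5, epochs=5000):
        for epoch in range(epochs):
            self.forward(X)
            self.backward(Y)
            self.step(lr)
            if epoch % 1000 == 0:
                print(f"epoch {epoch}: loss = {self.loss(Y):.4f}")
```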
Training on a Simple Dataset: The XOR Problem
To verify our implementation works, let's train on the XOR problem, a classic test case that single-layer networks cannot solve. XOR requires learning a nonlinear decision boundary, making it the perfect sanity check for our multi-layer network.
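Using the class above (the hyperparameters are reasonable guesses, not tuned values):

```python
import numpy as np

# XOR: inputs as columns; the label is 1 when exactly one input is 1
X = np.array([[0, 0, 1, 1],
              [0, 1, 0, 1]], dtype=float)
Y = np.array([[0, 1, 1, 0]], dtype=float)

net = NeuralNetwork(n_input=2, n_hidden=4, n_output=1)
net.train(X, Y, lr=0.5, epochs=5000)

print("predictions:", net.forward(X).round(3))   # should approach [0, 1, 1, 0]
```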
The network successfully learns XOR! All four examples are classified correctly, with predictions very close to the targets. This demonstrates that our backpropagation implementation is working correctly. The gradients are flowing properly and the weights are being updated in the right direction.
The loss curve printed during training shows rapid initial learning followed by gradual refinement, a typical pattern in neural network training. Plotting the decision boundary reveals how the network solves XOR: it creates a curved region that separates the (0,1) and (1,0) points (class 1) from (0,0) and (1,1) (class 0).
This curved boundary is impossible for a linear classifier to achieve. The hidden layer transforms the input space, making the originally non-linearly-separable problem linearly separable in the hidden representation. This is the power of multi-layer networks with nonlinear activations.
Common Pitfalls and Debugging
Implementing backpropagation correctly requires attention to detail. Here are common issues and how to diagnose them.
Gradient Checking
Always verify your gradients numerically when implementing new layers:
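A reusable checker, sketched against the NeuralNetwork class above. It perturbs one parameter at a time with the centered-difference formula and reports a relative error:

```python
import numpy as np

def gradient_check(param, analytical_grad, loss_fn, eps=1e-5):
    """Compare an analytical gradient to centered differences.

    loss_fn() must recompute the loss using `param` in place.
    """
    numerical = np.zeros_like(param)
    for idx in np.ndindex(param.shape):
        old = param[idx]
        param[idx] = old + eps
        loss_plus = loss_fn()
        param[idx] = old - eps
        loss_minus = loss_fn()
        param[idx] = old                    # restore the original value
        numerical[idx] = (loss_plus - loss_minus) / (2 * eps)
    denom = np.abs(analytical_grad).max() + np.abs(numerical).max() + 1e-12
    return np.abs(analytical_grad - numerical).max() / denom

# Example: check dW1 for the trained XOR network
net.forward(X)
net.backward(Y)

def loss_fn():
    net.forward(X)
    return net.loss(Y)

error = gradient_check(net.W1, net.dW1, loss_fn)
print(f"relative error: {error:.2e}")       # expect well below 1e-4
```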
Common Bugs and Symptoms
| Symptom | Possible Cause | Solution |
|---|---|---|
| Loss doesn't decrease | Learning rate too small or gradients wrong | Check gradient numerically; try larger LR |
| Loss explodes to NaN | Learning rate too large or numerical overflow | Reduce LR; add gradient clipping |
| Gradients are all zero | Dead ReLUs or vanishing gradients | Use Leaky ReLU; check initialization |
| Gradients don't match numerical | Shape mismatch or wrong formula | Print shapes; verify math step by step |
Shape Debugging
Shape errors are the most common bugs. Always verify:
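For the XOR setup above, a quick shape audit looks like this:

```python
net.forward(X)
net.backward(Y)

print("X:", X.shape, " z1:", net.z1.shape, " a1:", net.a1.shape,
      " y_hat:", net.y_hat.shape)
print("W1:", net.W1.shape, " dW1:", net.dW1.shape)   # both (4, 2)
print("W2:", net.W2.shape, " dW2:", net.dW2.shape)   # both (1, 4)

# Every gradient must match its parameter exactly
assert net.dW1.shape == net.W1.shape
assert net.dW2.shape == net.W2.shape
```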
The shapes reveal the network's structure: 4 examples (batch size) flow through 2 input features to 4 hidden neurons to 1 output. Notice that dW1 has shape (4, 2), exactly matching W1. This is required because gradient descent subtracts gradients from parameters element-wise. Similarly, dW2 matches W2 at (1, 4). If any gradient shape differs from its parameter shape, there's a bug in the backward pass.
Limitations and Practical Considerations
Backpropagation is powerful but has important limitations that affect how we design and train neural networks.
Vanishing and Exploding Gradients
In deep networks, gradients can become exponentially small (vanishing) or large (exploding) as they propagate backward. To understand why, consider a simplified network with $L$ layers where each layer multiplies the gradient by some factor $r$ during backpropagation. After passing through all layers, the gradient at the first layer is scaled by:

$$r^L$$
where:
- $r$: the average gradient scaling factor per layer, determined by the product of weight magnitudes and activation derivatives. For sigmoid activations, this is typically less than 1 since $\sigma'(z) \leq 0.25$.
- $L$: the number of layers the gradient must traverse
- $r^L$: the cumulative scaling effect, which grows or shrinks exponentially with depth

If $r < 1$ (e.g., $r = 0.9$), then $r^L$ shrinks exponentially: for $L = 50$ layers, $0.9^{50} \approx 0.005$, meaning gradients nearly vanish. If $r > 1$ (e.g., $r = 1.1$), then $r^L$ grows exponentially: $1.1^{50} \approx 117$, meaning gradients explode.
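The compounding is easy to tabulate:

```python
# Cumulative gradient scaling r^L for a few per-layer factors and depths
for r in [0.90, 0.95, 1.00, 1.05, 1.10]:
    row = "  ".join(f"L={L:3d}: {r ** L:9.3e}" for L in [10, 50, 100])
    print(f"r={r:.2f}  {row}")
```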
These numbers make the problem visceral: even a 5% deviation from $r = 1$ leads to gradients that are roughly 100× too small or too large after just 100 layers. This is why deep networks were considered untrainable before techniques like residual connections, which allow gradients to bypass layers and maintain $r \approx 1$.
This problem is particularly severe with sigmoid and tanh activations, whose derivatives are bounded below 1. ReLU helps because its derivative is exactly 1 for positive inputs, but "dead" ReLU neurons (always outputting 0) still block gradient flow entirely.
Modern solutions include:
- Residual connections: Allow gradients to flow directly through skip connections
- Batch normalization: Stabilizes activations and gradients
- Careful initialization: Xavier or He initialization keeps gradients in a healthy range
- Gradient clipping: Caps gradient magnitude to prevent explosions
Memory Requirements
Backpropagation requires storing all intermediate activations from the forward pass. For large models and batch sizes, this memory cost can be prohibitive. A model with 1 billion parameters might need 10-20 GB just for activations during training, on top of the memory for parameters and gradients.
Gradient checkpointing trades compute for memory by recomputing activations during the backward pass instead of storing them all. Instead of storing all $L$ layers of activations, we only store activations at every $\sqrt{L}$ layers (the "checkpoints"). During the backward pass, we recompute the activations between checkpoints as needed. This reduces memory from $O(L)$ to $O(\sqrt{L})$ stored activation layers, at the cost of roughly doubling compute time (since we recompute most forward activations once during the backward pass).
Local Minima and Saddle Points
Backpropagation finds directions that decrease the loss locally, but it doesn't guarantee finding the global minimum. In high-dimensional spaces, saddle points (where gradients are zero but it's not a minimum) are far more common than local minima. The good news is that most saddle points in neural networks are "escapable" with enough noise from stochastic gradient descent.
To visualize what gradient descent "sees," let's plot a 2D slice of the loss landscape for our XOR network by varying two weights while keeping others fixed:
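A sketch of that experiment, reusing the trained XOR network from earlier; it sweeps two entries of W1 while holding everything else fixed, producing a grid that can be rendered with matplotlib's contourf:

```python
import numpy as np

w_orig = net.W1[0, :2].copy()          # the two weights we will vary
span = np.linspace(-3, 3, 61)          # offsets around the trained values
landscape = np.zeros((len(span), len(span)))

for i, d0 in enumerate(span):
    for j, d1 in enumerate(span):
        net.W1[0, 0] = w_orig[0] + d0
        net.W1[0, 1] = w_orig[1] + d1
        net.forward(X)
        landscape[i, j] = net.loss(Y)

net.W1[0, :2] = w_orig                 # restore the trained weights
print("loss over the slice: min", landscape.min().round(3),
      "max", landscape.max().round(3))
```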
The loss landscape reveals the non-convex nature of neural network optimization. Gradient descent follows the steepest downhill direction at each step, navigating through valleys toward a minimum. The path isn't straight because the gradient direction changes as we move through the landscape. In higher dimensions (where real networks live), the landscape is even more complex, with many saddle points and local minima. Empirically, most local minima in deep networks achieve similar loss values.
Despite these limitations, backpropagation remains the foundation of deep learning. Its efficiency makes training billion-parameter models practical, and its modularity allows researchers to easily experiment with new architectures and loss functions.
Summary
Backpropagation is the algorithm that makes deep learning possible. By systematically applying the chain rule from output to input, it computes gradients for all parameters in time proportional to a single forward pass. We've traced through every step of this algorithm, from mathematical foundations to working implementation.
The Core Ideas:
- Computational graphs decompose complex neural network computations into simple, modular operations. Each node knows only how to compute its output and its local gradient, with no global knowledge required.
- The chain rule tells us how to trace influence through compositions. For a single path, we multiply derivatives. For multiple paths, we sum the contributions:

$$\frac{\partial L}{\partial x} = \sum_i \frac{\partial L}{\partial u_i} \cdot \frac{\partial u_i}{\partial x}$$

  where $L$ is the loss function, $x$ is the variable whose gradient we want (e.g., a weight), the $u_i$ are intermediate variables that depend on $x$, and the sum aggregates contributions from all paths connecting $x$ to $L$.
- The forward pass computes values from input to output, storing intermediate results. The backward pass computes gradients from output to input, using those stored values.
- The BCE + sigmoid simplification shows how careful mathematical design pays off: the complex gradient $\frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z}$ collapses to the simple form $\hat{y} - y$.
Practical Takeaways:
- Gradient checking (comparing analytical gradients to numerical approximations) is essential for verifying implementations
- Gradient accumulation naturally handles batch processing and shared weights
- Time complexity of the backward pass matches the forward pass: $O(L \cdot n^2)$
- Automatic differentiation frameworks like PyTorch implement all of this automatically, building computational graphs on-the-fly
The backward pass produces gradients, but we still need to decide how to use them. The next chapter explores stochastic gradient descent and its variants, which use these gradients to actually update parameters and train the network.
Key Parameters
When implementing or using backpropagation, several parameters affect correctness and numerical stability:
Numerical Stability:
- Epsilon for gradient checking: Use $\epsilon$ between $10^{-7}$ and $10^{-4}$ (around $10^{-5}$ is a common default). Too small causes floating-point errors; too large gives inaccurate approximations.
- Clipping for sigmoid/softmax: Clip inputs to prevent overflow (e.g., restrict the argument of $e^{-z}$ to roughly $[-500, 500]$) and clip outputs away from exactly 0 and 1 to prevent $\log(0)$.
Memory Management:
- Batch size: Larger batches require more memory for storing activations. Memory scales as $O(B \cdot L \cdot n)$.
- Gradient checkpointing: Trade roughly 2× compute for reducing stored activations from $O(L)$ to $O(\sqrt{L})$ in $L$-layer networks.
Gradient Health:
- Gradient magnitude: Monitor gradient norms during training. Healthy gradients typically have norms roughly between $10^{-3}$ and $10^{1}$.
- Gradient clipping threshold: Common values are 1.0 to 5.0 for the maximum gradient norm.
Initialization:
- Xavier/Glorot initialization: Scale weights by $\sqrt{\frac{2}{n_{\text{in}} + n_{\text{out}}}}$ for tanh/sigmoid activations, where $n_{\text{in}}$ is the number of input neurons (fan-in) and $n_{\text{out}}$ is the number of output neurons (fan-out). This keeps the variance of activations roughly constant across layers.
- He initialization: Scale weights by $\sqrt{\frac{2}{n_{\text{in}}}}$ for ReLU activations, where $n_{\text{in}}$ is the number of input neurons to the layer. The factor of 2 accounts for ReLU zeroing out half the inputs on average.