Gradient Clipping
Deep neural networks are powerful function approximators, but training them can be surprisingly fragile. One moment your loss is decreasing steadily, the next it explodes to infinity. The culprit? Exploding gradients. When gradients grow too large during backpropagation, weight updates become catastrophically destabilizing, and training collapses.
Gradient clipping provides a simple but effective solution. By capping gradients at a maximum threshold before applying updates, we prevent the runaway feedback loops that cause explosions. This technique is essential for training recurrent neural networks, transformers, and many other deep architectures. Without it, models with long computational paths would be nearly impossible to train.
This chapter covers how to detect exploding gradients, the two main clipping strategies (by value and by global norm), and when to apply each approach. We'll implement gradient clipping from scratch and explore how to choose appropriate thresholds through gradient norm monitoring.
The Exploding Gradient Problem
During backpropagation, gradients flow backward through the network, accumulating contributions from each layer. In deep networks or recurrent architectures, this can create a feedback loop where gradients compound multiplicatively.
Exploding gradients occur when gradient magnitudes grow exponentially during backpropagation. This happens when the chain rule produces products of values greater than 1 that compound across many layers, resulting in weight updates so large they destabilize training.
Consider a simple recurrent network processing a sequence of $T$ timesteps. At each step, the gradient is multiplied by the weight matrix $W$. If the largest eigenvalue of that matrix exceeds 1, gradients grow exponentially. After $T$ multiplications, the gradient magnitude scales as:

$$\|g_T\| \approx \lambda_{\max}^{T} \, \|g_0\|$$

where:
- $\|g_0\|$: the initial gradient magnitude at the output layer
- $\|g_T\|$: the gradient magnitude after backpropagating through $T$ timesteps
- $\lambda_{\max}$: the largest eigenvalue of the weight matrix (or its spectral norm)
- $T$: the number of timesteps (or layers) the gradient passes through

With $T = 100$ timesteps and $\lambda_{\max} = 1.1$, the gradient grows by a factor of $1.1^{100} \approx 13{,}780$. Even this modest 10% amplification per step compounds into a catastrophic explosion.
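A quick sanity check in Python reproduces this arithmetic:

```python
# Gradient magnitude growth when each backward step multiplies
# the gradient by lambda_max (here 1.1, i.e. 10% amplification)
lambda_max = 1.1
T = 100  # timesteps to backpropagate through

growth = lambda_max ** T
print(f"Growth factor after {T} steps: {growth:,.0f}")  # ~13,781
```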
The exponential growth is striking. With just a 10% amplification per step, the gradient grows nearly 14,000 times larger by the end of the sequence. This illustrates why even seemingly small spectral norms above 1.0 become catastrophic in deep or recurrent networks. In practice, these explosive gradients lead to NaN losses, parameter values shooting to infinity, and complete training failure.
Detecting Gradient Explosions
Before clipping, you need to know when gradients are problematic. The clearest signal is the gradient norm, which measures the overall magnitude of gradients across all parameters. For a model with parameters $\theta_1, \ldots, \theta_n$, each with gradient $g_i = \nabla_{\theta_i} L$, the global L2 norm is:

$$\|g\|_2 = \sqrt{\sum_{i=1}^{n} \|g_i\|_2^2}$$

where:
- $g_i$: the gradient of the loss with respect to parameter $\theta_i$ (a tensor)
- $\|g_i\|_2$: the L2 norm of that parameter's gradient
- $\|g\|_2$: the global gradient norm across all parameters
This single scalar summarizes the entire gradient's magnitude, making it easy to monitor and compare across training steps.
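To make this concrete, here's a minimal sketch of computing the global norm in PyTorch. The two-layer network and random data are illustrative stand-ins, not a prescribed setup:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))

# One forward/backward pass on random data to populate .grad
x, y = torch.randn(16, 10), torch.randn(16, 1)
loss = nn.functional.mse_loss(model(x), y)
loss.backward()

# Global L2 norm: square root of the sum of squared per-parameter norms
total_sq = sum(p.grad.norm(2).item() ** 2 for p in model.parameters())
global_norm = total_sq ** 0.5
print(f"Global gradient norm: {global_norm:.4f}")

# Per-layer breakdown of contributions to the total
for name, p in model.named_parameters():
    print(f"  {name}: {p.grad.norm(2).item():.4f}")
```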
The global gradient norm for this network is relatively modest, typical for a well-initialized shallow network on random data. The per-layer breakdown shows how each parameter contributes to the total. In practice, you'd monitor this norm over training batches to detect sudden spikes, which signal potential gradient explosions.
What Causes Gradient Explosions?
Several architectural and training choices make gradient explosions more likely:
- Deep architectures: More layers mean more gradient multiplications. Each layer's Jacobian contributes to the product, and if many have spectral norms greater than 1, gradients explode.
- Recurrent networks: RNNs and LSTMs process sequences by repeatedly applying the same weights. Long sequences create deep computational graphs with shared parameters.
- Large learning rates: Even moderate gradients become destructive updates with high learning rates. The explosion may be in the update magnitude rather than the gradient itself.
- Improper initialization: Weight initialization that starts parameters too large can push networks into explosive regimes from the first forward pass.
Gradient Clipping Strategies
Now that we understand why exploding gradients are dangerous, let's examine how to prevent them. The core idea is simple: if gradients become too large, we shrink them before applying the weight update. But "too large" can mean different things, and how we shrink gradients matters for optimization. This leads to two distinct strategies, each with its own geometric interpretation and trade-offs.
The Central Question: What Should "Too Large" Mean?
When we say a gradient is "too large," we could mean two different things:
- Individual elements are extreme: Some specific gradient values are huge, even if others are small
- The overall magnitude is extreme: The gradient vector as a whole points too far, even if no single element is unusual
These two interpretations lead to fundamentally different clipping strategies. The first gives us "clip by value," which treats each gradient element independently. The second gives us "clip by global norm," which treats the gradient as a unified vector and scales it uniformly. Understanding this distinction is crucial, because the choice affects not just the magnitude of updates but potentially their direction.
Clip by Value: Element-wise Constraint
The simplest approach treats each gradient element as an independent quantity to be bounded. If any element exceeds a threshold $c$, we clamp it to that threshold. Elements within bounds remain unchanged.
Think of this geometrically: we're constraining the gradient to lie within a hypercube centered at the origin. In 2D, this is a square; in 3D, a cube; in higher dimensions, a hypercube with sides of length $2c$.
Given a gradient vector $g$ and a clipping threshold $c$, clip by value applies the following transformation to each element independently:

$$g_i' = \max\big(-c, \min(c, g_i)\big)$$

where:
- $g_i$: the $i$-th element of the original gradient vector
- $c$: the clipping threshold, defining the maximum allowed absolute value
- $g_i'$: the resulting element, guaranteed to satisfy $|g_i'| \le c$

The $\min$ operation caps positive values at $c$, while $\max$ floors negative values at $-c$. Together, they project each component onto the interval $[-c, c]$.
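A small sketch using torch.clamp; the gradient values are illustrative, chosen so that indices 0, 4, and 5 exceed a threshold of 2.0:

```python
import torch

c = 2.0  # clipping threshold

# Illustrative gradient with extreme values at indices 0, 4, and 5
grad = torch.tensor([-5.0, 0.5, 1.2, 2.0, 10.0, -3.5])

# Clip by value: clamp each element independently to [-c, c]
clipped = torch.clamp(grad, min=-c, max=c)

print("Original:", grad.tolist())     # [-5.0, 0.5, 1.2, 2.0, 10.0, -3.5]
print("Clipped: ", clipped.tolist())  # [-2.0, 0.5, 1.2, 2.0, 2.0, -2.0]
```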
Three of the six elements exceeded the threshold and were clamped. The extreme values at indices 0, 4, and 5 are now bounded to ±2.0, while the moderate values at indices 1, 2, and 3 pass through unchanged.
This asymmetric treatment reveals the core limitation of clip by value: it changes the gradient direction. Before clipping, the gradient might have pointed predominantly toward adjusting parameter 4 (with gradient 10.0). After clipping, that dimension is capped at 2.0, the same as dimension 3. The relative importance of different parameters has been distorted. In optimization terms, we're no longer moving in the direction the loss landscape suggested, but in a direction bent by our clipping constraints.
Clip by Global Norm: Preserving Direction
The direction distortion problem motivates a different approach. Instead of asking "is each element too large?", we ask "is the gradient vector as a whole too long?" If yes, we scale the entire vector down uniformly, preserving all relative proportions.
Geometrically, this constrains the gradient to lie within a hypersphere of radius $\tau$ centered at the origin. In 2D, this is a circle; in 3D, a sphere. Any gradient landing outside this sphere gets projected back onto its surface by scaling, not by truncating individual components.
Why does this preserve direction? Consider a gradient vector with norm $\|g\|_2 = 10$ when our threshold is $\tau = 5$. To bring the norm down to 5, we multiply every component by $\tau / \|g\|_2 = 0.5$. Each component shrinks by the same factor, so the ratios between components remain unchanged. The vector points in exactly the same direction, just with half the length.
Given a gradient vector $g$ and a maximum norm threshold $\tau$, clip by global norm applies the following transformation:

$$g' = \begin{cases} g & \text{if } \|g\|_2 \le \tau \\[4pt] \dfrac{\tau}{\|g\|_2} \, g & \text{otherwise} \end{cases}$$

where:
- $g$: the full gradient vector, treating all parameters as a single concatenated vector
- $\|g\|_2$: the L2 (Euclidean) norm, measuring the vector's total magnitude
- $\tau$: the maximum allowed norm (clipping threshold)
- $\tau / \|g\|_2$: the scaling factor, always less than 1 when clipping occurs

After clipping, the resulting gradient satisfies $\|g'\|_2 \le \tau$ while maintaining $g_i' / g_j' = g_i / g_j$ for all pairs of components.
The key insight is that uniform scaling preserves the update direction. If the original gradient suggested parameter A should change twice as much as parameter B, that relationship holds after clipping. We simply take a smaller step in the same direction.
Let's implement this and verify that direction is preserved. We'll use two gradient tensors representing different parameters and show that their relative proportions remain unchanged after clipping.
The implementation follows three steps:
- Compute the global norm by summing squared elements across all gradient tensors, then taking the square root
- Calculate the scaling factor as $\min(1, \tau / \|g\|_2)$, which will be less than 1 if clipping is needed
- Apply uniform scaling to every gradient tensor if the norm exceeds the threshold
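Here's a minimal from-scratch sketch of those three steps. The two tensors are illustrative, constructed so that every element of the first is exactly half the corresponding element of the second:

```python
import torch

max_norm = 5.0

# Two gradient tensors; each element of g1 is 0.5x the matching element of g2
g1 = torch.tensor([3.0, 4.0])
g2 = torch.tensor([6.0, 8.0])
grads = [g1, g2]

# Step 1: global norm across all tensors
global_norm = torch.sqrt(sum((g ** 2).sum() for g in grads)).item()
print(f"Global norm before: {global_norm:.2f}")  # 11.18

# Step 2: scaling factor, less than 1 only when clipping is needed
scale = min(1.0, max_norm / global_norm)
print(f"Scaling factor: {scale:.4f}")  # 0.4472

# Step 3: uniform scaling of every tensor
clipped = [g * scale for g in grads]
new_norm = torch.sqrt(sum((g ** 2).sum() for g in clipped)).item()
print(f"Global norm after: {new_norm:.2f}")  # 5.00

# Direction check: element-wise ratios are unchanged
print("Ratios before:", (g1 / g2).tolist())                  # [0.5, 0.5]
print("Ratios after: ", (clipped[0] / clipped[1]).tolist())  # [0.5, 0.5]
```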
The results confirm the theory. The original global norm of 11.18 exceeded our threshold of 5.0, triggering clipping. Both gradient tensors were scaled by the same factor (approximately 0.45), bringing the global norm to exactly 5.0. Most importantly, the ratio between corresponding elements remains identical: 0.5 before and 0.5 after. The gradient direction is perfectly preserved.
This is why norm clipping is generally preferred for optimization. When we clip by norm, we're still moving in the direction the loss landscape suggested, just taking a more cautious step. The relative importance of each parameter's update remains unchanged.
Comparing the Two Approaches: A Geometric View
The difference between clip by value and clip by norm becomes crystal clear when we visualize them geometrically. Consider a 2D gradient vector, which we can plot as an arrow in a plane. The two clipping methods impose different constraints on where this arrow can point and how long it can be.
Clip by value constrains the gradient to a square (or hypercube in higher dimensions). The boundary is defined by $|g_1| \le c$ and $|g_2| \le c$ independently. A gradient near the corner of this square, like $(3, 2.5)$ with threshold $c = 2$, gets clipped to $(2, 2)$, which changes its direction only subtly. But a gradient like $(10, 1)$ with the same threshold becomes $(2, 1)$, dramatically shifting its angle.
Clip by norm constrains the gradient to a circle (or hypersphere). The boundary is defined by $\|g\|_2 = \tau$. Any gradient outside this circle gets scaled toward the origin until it touches the circle's edge, preserving its angle perfectly.
Let's see both methods applied to the same gradient vector:
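The sketch below produces such a side-by-side plot with matplotlib, using an illustrative gradient of $(4, 3)$ (norm 5.0) and a threshold of 3.0 for both methods:

```python
import matplotlib.pyplot as plt
import numpy as np

grad = np.array([4.0, 3.0])  # original gradient, norm = 5.0
c = 3.0                      # threshold for both methods

by_value = np.clip(grad, -c, c)                      # -> [3.0, 3.0]
by_norm = grad * min(1.0, c / np.linalg.norm(grad))  # -> [2.4, 1.8]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 5))

# Left: clip by value constrains the gradient to a square
ax1.add_patch(plt.Rectangle((-c, -c), 2 * c, 2 * c, fill=False, linestyle="--"))
ax1.annotate("", xy=grad, xytext=(0, 0), arrowprops=dict(color="gray", arrowstyle="->"))
ax1.annotate("", xy=by_value, xytext=(0, 0), arrowprops=dict(color="red", arrowstyle="->"))
ax1.set_title("Clip by value (square)")

# Right: clip by norm constrains the gradient to a circle
ax2.add_patch(plt.Circle((0, 0), c, fill=False, linestyle="--"))
ax2.annotate("", xy=grad, xytext=(0, 0), arrowprops=dict(color="gray", arrowstyle="->"))
ax2.annotate("", xy=by_norm, xytext=(0, 0), arrowprops=dict(color="blue", arrowstyle="->"))
ax2.set_title("Clip by norm (circle)")

for ax in (ax1, ax2):
    ax.set_xlim(-4, 5)
    ax.set_ylim(-4, 5)
    ax.set_aspect("equal")
plt.show()
```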
The visualization crystallizes the key insight. On the left, clip by value creates a box (square) constraint. The original gradient lies outside this box, so its first component is clipped to 3. But the second component was already within bounds and remains unchanged. The result? The clipped vector points at a different angle than the original, pulling the update direction away from where the loss landscape suggested we should go.
On the right, clip by norm creates a circular constraint. The same gradient with norm 5.0 exceeds our threshold of 3.0, so we scale the entire vector by $3/5 = 0.6$. Both components shrink proportionally, from $(4, 3)$ to $(2.4, 1.8)$, and the angle is perfectly preserved. We're still heading in the right direction, just taking a smaller step.
This geometric intuition extends to any number of dimensions. In the parameter spaces of neural networks with millions of dimensions, clip by value constrains gradients to a hypercube while clip by norm constrains them to a hypersphere. The direction preservation property of norm clipping becomes even more valuable in high dimensions, where the corners of a hypercube can point in vastly different directions than the original gradient.
Implementing Gradient Clipping in PyTorch
PyTorch provides built-in functions for both clipping strategies. Here's how to integrate them into a training loop.
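A minimal training-step sketch; the model, data, and hyperparameters are stand-ins for illustration:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x, y = torch.randn(64, 10), torch.randn(64, 1)

optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()

# Clip by global norm BEFORE the optimizer step; the function
# modifies gradients in-place and returns the pre-clip norm
grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
print(f"Gradient norm before clipping: {grad_norm.item():.4f}")

# Alternative: clip by value, clamping each element to [-0.5, 0.5]
# torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=0.5)

optimizer.step()
```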
The gradient norm determines whether clipping activates: whenever it exceeds the max_norm of 1.0, all gradients are scaled down proportionally. PyTorch's clip_grad_norm_ function handles this computation automatically, returning the original norm for logging purposes. The underscore suffix indicates it modifies gradients in-place.
Tracking Clipping Frequency
Monitoring how often gradients get clipped provides insight into training stability. If clipping happens on every batch, your threshold might be too low. If it never happens, you might not need clipping at all.
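One way to track this, sketched below, exploits the fact that clip_grad_norm_ returns the norm as it was before clipping (model and data are again stand-ins):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
max_norm = 1.0

clipped_batches, num_batches = 0, 100
for step in range(num_batches):
    x, y = torch.randn(32, 10), torch.randn(32, 1)
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()

    # The returned norm is the pre-clip value, so comparing it to the
    # threshold tells us whether this batch was actually clipped
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    if grad_norm.item() > max_norm:
        clipped_batches += 1
    optimizer.step()

print(f"Clipping rate: {clipped_batches / num_batches:.1%}")
```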
The clipping statistics reveal training dynamics. A high initial clipping rate indicates that a threshold of 1.0 is relatively aggressive for the model, catching most gradient updates. As training progresses, gradient norms typically decrease and stabilize, leading to fewer clipping events. A healthy training run usually sees clipping on 5-20% of batches, preventing occasional spikes without constantly dampening gradients. If clipping occurs on nearly every batch, consider raising the threshold.
Visualizing Gradient Norms During Training
Plotting gradient norms over training reveals patterns that inform threshold selection. Let's train a model and visualize the gradient dynamics.
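Here's one way such monitoring might look, using a small synthetic regression task as a stand-in; passing float('inf') as max_norm lets clip_grad_norm_ measure the global norm without modifying the gradients:

```python
import matplotlib.pyplot as plt
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Fixed synthetic regression task so the model can actually converge
x, y = torch.randn(256, 10), torch.randn(256, 1)

norms = []
for step in range(300):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    # An infinite max_norm records the norm without clipping anything
    norms.append(
        torch.nn.utils.clip_grad_norm_(model.parameters(), float("inf")).item()
    )
    optimizer.step()

plt.plot(norms)
plt.axhline(1.0, color="red", linestyle="--", label="candidate threshold")
plt.xlabel("Training step")
plt.ylabel("Global gradient norm")
plt.legend()
plt.show()
```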
The gradient norm plot reveals the training dynamics. Early in training, gradients tend to be larger and more variable. As the model converges, gradients stabilize and shrink. The occasional spikes are the dangerous outliers that gradient clipping targets.
Choosing the Right Threshold
Selecting an appropriate clipping threshold requires balancing two concerns: clip too aggressively and you slow down learning; clip too loosely and explosions still occur.
Empirical Guidelines
Start with a threshold based on observed gradient norms during initial training runs:
- Conservative: Set threshold at the 90th percentile of observed gradient norms. This clips only the most extreme values.
- Moderate: Use the 75th percentile. Clips more frequently but still allows most gradients through unchanged.
- Aggressive: Set threshold at the median. Clips roughly half of all gradient updates, useful for very unstable training.
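Given norms logged from an exploratory run without clipping, these candidate thresholds are simple percentiles. A minimal NumPy sketch, with simulated norms standing in for real logs:

```python
import numpy as np

# Simulated gradient norms standing in for logs from an exploratory run
rng = np.random.default_rng(0)
norms = rng.lognormal(mean=0.0, sigma=0.5, size=500)

conservative = np.percentile(norms, 90)  # clips only the top ~10% of batches
moderate = np.percentile(norms, 75)      # clips ~25% of batches
aggressive = np.percentile(norms, 50)    # clips ~50% of batches

print(f"Conservative (p90): {conservative:.3f}")
print(f"Moderate     (p75): {moderate:.3f}")
print(f"Aggressive   (p50): {aggressive:.3f}")
```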
The conservative threshold at the 90th percentile would clip only the most extreme 10% of gradients, allowing normal training dynamics while preventing outlier spikes. The moderate threshold clips about 25% of batches, providing more aggressive stabilization. The aggressive option clips half of all gradients, useful only for severely unstable training where stability matters more than convergence speed.
Common Threshold Values
Certain threshold values have become standard through empirical practice:
- 1.0: A common default for transformer models. Works well with Adam optimizer and typical learning rates.
- 5.0: More permissive, used when gradients are naturally larger or training is more stable.
- 0.5: Aggressive clipping for very deep or recurrent networks with known stability issues.
The right value depends on your architecture, optimizer, and learning rate. Higher learning rates generally require lower clipping thresholds to prevent large updates.
When to Use Gradient Clipping
Gradient clipping isn't always necessary. Some architectures and training setups are naturally stable, while others require careful gradient management.
Use Cases Where Clipping Helps
The following scenarios benefit most from gradient clipping:
- Recurrent Neural Networks: LSTMs and GRUs process sequences by repeatedly applying the same weights. Long sequences create deep computational graphs prone to gradient explosion. Clipping is nearly mandatory for RNN training.
- Transformers: Self-attention mechanisms can create gradient paths that amplify through many layers. Most transformer implementations use gradient clipping by default.
- Reinforcement Learning: Policy gradients in RL can be extremely noisy and variable. Clipping stabilizes training when rewards vary dramatically.
- Large Batch Training: Gradient averaging across large batches can occasionally produce extreme values when batches contain outliers.
- Fine-tuning with High Learning Rates: When adapting pre-trained models, initial gradients can be large relative to the already-trained weights.
When Clipping May Not Help
In some situations, gradient clipping addresses symptoms rather than causes:
- Vanishing gradients: Clipping only helps with exploding gradients. If your gradients are too small, clipping does nothing. Techniques like residual connections and layer normalization address vanishing gradients.
- Fundamentally broken architectures: If gradient explosions happen constantly and early, the architecture may need revision. Clipping is a band-aid, not a fix for poor design.
- Well-conditioned shallow networks: Simple feedforward networks with proper initialization often train stably without clipping.
Gradient Clipping with Different Optimizers
The interaction between gradient clipping and optimizers matters. Adaptive optimizers like Adam already normalize gradients per-parameter, which can reduce the need for clipping.
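One way to examine this, sketched below with a stand-in model and task, is to record raw gradient norms (before the optimizer's own transformations) under both SGD and Adam and compare their variability:

```python
import torch
import torch.nn as nn

def gradient_norm_stats(optimizer_cls, steps=200, **opt_kwargs):
    """Train a small stand-in model; return mean/std of raw gradient norms."""
    torch.manual_seed(0)
    model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
    optimizer = optimizer_cls(model.parameters(), **opt_kwargs)
    x, y = torch.randn(256, 10), torch.randn(256, 1)

    norms = []
    for _ in range(steps):
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()
        # Measure the raw gradient norm before the optimizer transforms it
        norms.append(
            torch.nn.utils.clip_grad_norm_(model.parameters(), float("inf")).item()
        )
        optimizer.step()
    norms = torch.tensor(norms)
    return norms.mean().item(), norms.std().item()

for name, cls, kwargs in [
    ("SGD", torch.optim.SGD, {"lr": 0.01}),
    ("Adam", torch.optim.Adam, {"lr": 1e-3}),
]:
    mean, std = gradient_norm_stats(cls, **kwargs)
    print(f"{name:>4}: mean={mean:.4f}, std={std:.4f}, CV={std / mean:.2f}")
```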
Both optimizers show similar gradient norm distributions since we measure gradients before the optimizer transforms them. The coefficient of variation (std/mean) indicates gradient stability. While Adam's adaptive per-parameter scaling helps with the actual updates, it doesn't change the raw gradient magnitudes that clipping operates on. For very unstable training, combining Adam with gradient clipping provides complementary stability mechanisms.
Limitations and Practical Considerations
Gradient clipping is a stabilization technique, not an optimization improvement. Understanding its limitations helps you use it appropriately.
The Direction Distortion Trade-off
When using clip by value, we sacrifice gradient direction for bounded magnitude. This trade-off becomes more severe when gradients are highly unbalanced across dimensions. If one parameter has gradients of magnitude 100 while another has gradients of magnitude 0.1, clipping to 1.0 collapses the 1000:1 ratio between their updates into just 10:1, bending the update direction toward the smaller gradient's dimension. In such cases, clip by norm is strongly preferred.
Even with norm clipping, there's a subtle issue: clipping changes the effective learning rate. When gradients are clipped by a factor of 10, it's equivalent to temporarily using 1/10th of your learning rate. This can slow convergence during periods of high gradient activity, though it's usually a worthwhile trade-off for stability.
Interaction with Learning Rate Schedules
If you use learning rate warmup or decay, consider how clipping interacts with these schedules. Early in training with low learning rates, gradients might be naturally bounded. As learning rates increase during warmup, gradient clipping becomes more relevant. Some practitioners adjust clipping thresholds alongside learning rate schedules, though this adds complexity.
Monitoring in Production
For production training runs, log gradient norms and clipping events. A sudden increase in clipping frequency can indicate data distribution shift, model instability, or other issues that warrant investigation. Conversely, if clipping never triggers, you might remove it to simplify your training pipeline.
Key Parameters
When implementing gradient clipping in PyTorch, understanding the key parameters helps you configure clipping effectively for your specific training scenario:
- max_norm (for clip_grad_norm_): The maximum allowed L2 norm for the entire gradient vector. Gradients exceeding this threshold are scaled down proportionally. Common values range from 0.5 to 5.0, with 1.0 being a typical default for transformer training.
- clip_value (for clip_grad_value_): The maximum absolute value for any single gradient element. Elements outside $[-\text{clip\_value}, \text{clip\_value}]$ are clamped. Typically set higher than max_norm since it applies element-wise rather than globally.
- norm_type: The type of norm to use for clipping (default: 2 for the L2 norm). Can be set to float('inf') for the infinity norm or other positive values for different $p$-norms.
- error_if_nonfinite (PyTorch 1.9+): Whether to raise an error if gradients contain NaN or infinity values. Setting it to True helps catch gradient explosions early rather than silently propagating corrupted values.
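Putting these parameters together, a fully spelled-out call might look like the following sketch (the one-layer model is a stand-in):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
loss = nn.functional.mse_loss(model(torch.randn(8, 10)), torch.randn(8, 1))
loss.backward()

total_norm = torch.nn.utils.clip_grad_norm_(
    model.parameters(),
    max_norm=1.0,             # cap on the global L2 norm
    norm_type=2.0,            # L2 norm; use float('inf') for the max-norm
    error_if_nonfinite=True,  # fail fast on NaN/inf gradients
)
print(f"Pre-clip global norm: {total_norm.item():.4f}")
```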
When selecting thresholds, start with an exploratory training run without clipping to observe your gradient norm distribution. Set the threshold at the 90th percentile for conservative clipping, or lower if training remains unstable.
Summary
Gradient clipping prevents training instability caused by exploding gradients. The technique caps gradient magnitudes before weight updates, keeping optimization on track when backpropagation produces unreasonably large values.
The key takeaways from this chapter:
- Exploding gradients occur when gradient magnitudes grow exponentially during backpropagation, especially in deep or recurrent architectures. They manifest as NaN losses and training collapse.
- Clip by value constrains each gradient element independently to $[-c, c]$, where $c$ is the clipping threshold. It's simple but can distort gradient direction.
- Clip by global norm rescales the entire gradient vector to have at most norm $\tau$ while preserving relative proportions. This preserves the update direction and is generally preferred.
- PyTorch's clip_grad_norm_ implements global norm clipping and is the standard choice for most applications.
- Threshold selection should be based on observed gradient norm distributions. Common values range from 0.5 to 5.0, with 1.0 being a typical default.
- Monitor clipping frequency to calibrate thresholds. Clipping on 5-20% of batches usually indicates a well-tuned threshold.
Gradient clipping is essential for training RNNs, transformers, and other deep architectures. While it doesn't improve optimization directly, it prevents the catastrophic failures that would otherwise make training impossible.