Master Adam optimization with exponential moving averages, bias correction, and per-parameter learning rates. Build Adam from scratch and compare with SGD.

Adam Optimizer
Training neural networks requires navigating complex loss landscapes with varying curvature across different parameter dimensions. Some parameters need large updates to escape flat regions, while others need small updates to avoid overshooting sharp valleys. Standard gradient descent treats all parameters equally, using a single learning rate for everything. Momentum helps by smoothing gradients over time, but it still applies the same effective step size everywhere.
Adam, short for Adaptive Moment Estimation, solves this problem by maintaining per-parameter learning rates that automatically adjust based on the history of gradients. It combines two powerful ideas: momentum to smooth gradient direction, and adaptive learning rates to scale step sizes appropriately. Published by Kingma and Ba in 2014, Adam quickly became the default optimizer for deep learning due to its robust performance across architectures and tasks.
This chapter builds Adam from first principles. You'll understand the exponential moving averages that power it, derive the bias correction terms that make it work from the first iteration, and implement Adam from scratch.
Exponential Moving Averages
Adam's core mechanism is the exponential moving average (EMA), a technique for tracking statistics over time while giving more weight to recent observations. Understanding EMA is essential for grasping how Adam adapts to gradient patterns.
An exponential moving average maintains a running estimate that smoothly blends new observations with the historical average. At each step, the estimate is updated as $v_t = \beta\, v_{t-1} + (1 - \beta)\, x_t$, where $\beta$ controls how much weight goes to the past versus the present.
The update rule is simple:

$$v_t = \beta\, v_{t-1} + (1 - \beta)\, x_t$$

where:
- $v_t$: the exponential moving average at time step $t$
- $v_{t-1}$: the previous estimate
- $x_t$: the new observation at time $t$
- $\beta$: the decay rate, typically between 0.9 and 0.999
- $1 - \beta$: the weight assigned to the current observation
The parameter $\beta$ controls the trade-off between stability and responsiveness. High values such as 0.99 or 0.999 create smooth, slowly-changing estimates that ignore short-term fluctuations. Lower values react quickly to new data but can be noisy. The name "exponential" comes from how the influence of past observations decays exponentially with age.
Let's see how the EMA behaves with different decay rates:
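Here is a small NumPy sketch of that experiment; the noisy signal and the decay rates 0.5, 0.9, and 0.99 are illustrative choices, not the exact values behind the original figure.

```python
import numpy as np

def ema(observations, beta):
    """Exponential moving average: v_t = beta * v_{t-1} + (1 - beta) * x_t."""
    v = 0.0
    history = []
    for x in observations:
        v = beta * v + (1 - beta) * x
        history.append(v)
    return np.array(history)

rng = np.random.default_rng(0)
t = np.arange(200)
signal = np.sin(t / 20.0)                                   # slow underlying trend
observations = signal + rng.normal(0, 0.3, size=t.shape)    # plus noise

for beta in (0.5, 0.9, 0.99):
    smoothed = ema(observations, beta)
    # Average deviation from the true signal after a warm-up period
    lag_error = np.mean(np.abs(smoothed[50:] - signal[50:]))
    print(f"beta={beta:.2f}  mean |EMA - true signal| = {lag_error:.3f}")
```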
The comparison reveals the trade-off clearly. With a low decay rate, the EMA closely tracks the noisy observations, reacting quickly to changes but inheriting much of the noise. With a decay rate close to 1, the EMA is much smoother but lags significantly behind the true signal. A middle value balances responsiveness with stability.
Why Exponential Decay?
To understand why past observations decay exponentially, let's expand the EMA formula recursively. Starting with $v_t = \beta\, v_{t-1} + (1 - \beta)\, x_t$, we can substitute the formula for $v_{t-1}$:

$$v_t = \beta\bigl(\beta\, v_{t-2} + (1 - \beta)\, x_{t-1}\bigr) + (1 - \beta)\, x_t$$

Distributing the $\beta$ term and continuing this expansion backward to the first observation:

$$v_t = (1 - \beta)\, x_t + (1 - \beta)\beta\, x_{t-1} + (1 - \beta)\beta^2\, x_{t-2} + \cdots + (1 - \beta)\beta^{t-1}\, x_1 + \beta^t v_0$$

where:
- $(1 - \beta)\, x_t$: the contribution from the current observation (weight $1 - \beta$)
- $(1 - \beta)\beta\, x_{t-1}$: the contribution from the previous observation (weight $(1 - \beta)\beta$)
- $(1 - \beta)\beta^{\,t-k}\, x_k$: the general form for the contribution from observation $x_k$
- $\beta^t v_0$: the residual influence of the initial value (typically zero)
Each observation $x_k$ contributes with weight $(1 - \beta)\beta^{\,t-k}$. Older observations have higher powers of $\beta$, so their influence decays exponentially. After about $1/(1 - \beta)$ time steps, an observation's weight has decayed to roughly $1/e \approx 37\%$ of its original value.
The effective memory window of roughly $1/(1 - \beta)$ steps determines how many past observations significantly influence the current estimate. To see this decay in action, let's visualize the actual weights assigned to each past observation:
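A short sketch of that computation: the weight assigned to an observation made $k$ steps ago is $(1 - \beta)\beta^k$, shown here for the two decay rates Adam uses.

```python
import numpy as np

def ema_weights(beta, num_steps):
    """Weight assigned to an observation made `age` steps ago: (1 - beta) * beta**age."""
    ages = np.arange(num_steps)
    return (1 - beta) * beta ** ages

for beta in (0.9, 0.999):
    w = ema_weights(beta, 5000)
    # Effective window: steps until the weight decays to 1/e of its initial value
    window = int(np.argmax(w < w[0] / np.e))
    print(f"beta={beta}: weight on current obs {w[0]:.4f}, "
          f"~1/e decay after {window} steps, "
          f"weight on obs 10 steps ago {w[10]:.6f}")
```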
With $\beta = 0.9$, the EMA effectively averages over the last 10 observations, making it responsive to recent changes. With $\beta = 0.999$, it averages over approximately 1000 observations, providing much greater stability at the cost of slower adaptation. Adam uses these different decay rates for its two moment estimates: $\beta_1 = 0.9$ for the first moment (quick response to gradient direction changes) and $\beta_2 = 0.999$ for the second moment (stable learning rate adaptation).
First Moment: Mean Estimation
Adam tracks two separate exponential moving averages of the gradients. The first moment estimate approximates the mean of recent gradients:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t$$

where:
- $m_t$: the first moment estimate (gradient mean) at step $t$
- $g_t$: the gradient computed at step $t$
- $\beta_1$: the decay rate for the first moment, typically 0.9
This is exactly momentum under a different name. The first moment estimate accumulates a velocity in parameter space, helping the optimizer maintain direction through noisy gradient estimates. When gradients consistently point in the same direction, $m_t$ grows in that direction. When gradients oscillate, $m_t$ dampens the oscillations by averaging out the positive and negative contributions.
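The behavior described below can be reproduced with a small simulation; the oscillation and noise model here are assumptions chosen so that the true mean gradient is $-0.5$.

```python
import numpy as np

rng = np.random.default_rng(42)
steps = np.arange(500)

# Gradients with true mean -0.5, plus an oscillating component and random noise (assumed)
gradients = -0.5 + 2.0 * np.sin(steps / 3.0) + rng.normal(0, 1.0, size=steps.shape)

beta1 = 0.9
m = 0.0
for g in gradients:
    m = beta1 * m + (1 - beta1) * g

print(f"raw mean of gradients:  {gradients.mean():+.3f}")
print(f"final first moment m_t: {m:+.3f}   (true mean is -0.5)")
```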
The first moment estimate converges toward the true mean gradient of -0.5, effectively filtering out both the oscillations and the random noise. This momentum effect helps the optimizer move consistently toward the optimum even when individual gradient estimates are unreliable.
First Moment as a Low-Pass Filter
Another way to understand the first moment is as a low-pass filter in signal processing terms. It smooths high-frequency noise while preserving the low-frequency trend. Let's decompose the gradient signal to see this filtering effect:
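A sketch of that decomposition, assuming a gradient stream built from a slow trend plus high-frequency noise; the EMA should track the trend while suppressing most of the noise.

```python
import numpy as np

rng = np.random.default_rng(7)
t = np.arange(1000)

trend = -0.5 + 0.3 * np.sin(t / 100.0)        # low-frequency component
noise = rng.normal(0, 1.0, size=t.shape)      # high-frequency component
gradients = trend + noise

beta1 = 0.9
m, filtered = 0.0, []
for g in gradients:
    m = beta1 * m + (1 - beta1) * g
    filtered.append(m)
filtered = np.array(filtered)

# The EMA stays close to the trend and suppresses most of the noise
print(f"std of raw gradients around the trend:   {np.std(gradients - trend):.3f}")
print(f"std of filtered signal around the trend: {np.std(filtered[50:] - trend[50:]):.3f}")
```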
The filtering analogy explains why momentum helps optimization: it extracts the consistent direction from noisy gradient estimates, allowing the optimizer to make confident progress even when individual gradients are unreliable.
Second Moment: Variance Estimation
The second key innovation in Adam is tracking the second moment of gradients, an estimate of their variance. This enables per-parameter learning rate adaptation. The second moment is computed as an EMA of squared gradients:

$$v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$$

where:
- $v_t$: the second moment estimate (uncentered variance) at step $t$
- $g_t^2$: the element-wise square of the gradient at step $t$
- $\beta_2$: the decay rate for the second moment, typically 0.999
The squared gradients measure how much each parameter's gradient varies over time. Parameters with consistently large gradients will have large $v_t$ values. Parameters with small or sparse gradients will have small $v_t$ values. Adam uses this information to scale learning rates: parameters with high variance get smaller effective learning rates, while parameters with low variance get larger ones.
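The two-parameter example described below can be sketched as follows; the gradient scales (roughly 10 for parameter 1 and 0.1 for parameter 2) are assumptions chosen to match the description.

```python
import numpy as np

rng = np.random.default_rng(0)
num_steps = 5000
beta2 = 0.999

# Assumed gradient scales: parameter 1 ~ N(10, 5), parameter 2 ~ N(0.1, 0.05)
grads_p1 = rng.normal(10.0, 5.0, num_steps)
grads_p2 = rng.normal(0.1, 0.05, num_steps)

v1 = v2 = 0.0
for g1, g2 in zip(grads_p1, grads_p2):
    v1 = beta2 * v1 + (1 - beta2) * g1 ** 2
    v2 = beta2 * v2 + (1 - beta2) * g2 ** 2

print(f"second moment, parameter 1: {v1:10.3f}   (E[g^2] = 10^2 + 5^2 = 125)")
print(f"second moment, parameter 2: {v2:10.5f}   (E[g^2] = 0.1^2 + 0.05^2 = 0.0125)")
print(f"sqrt(v) scaling factors:    {np.sqrt(v1):.2f} vs {np.sqrt(v2):.4f}")
```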
The contrast is striking. Parameter 1's second moment estimate stabilizes around 125 (approximately its mean squared gradient $\mathbb{E}[g_t^2]$), reflecting its consistently large gradients. Parameter 2's second moment remains below 1, reflecting its much smaller typical gradient magnitude. When Adam normalizes updates by $\sqrt{v_t}$, parameter 1 will receive smaller effective steps, while parameter 2 will receive larger ones. This automatic scaling is what makes Adam robust across parameters with very different gradient scales.
The Ratio That Matters
What ultimately determines the effective learning rate is the ratio between the first and second moments. Let's visualize how this ratio differs between our two example parameters:
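A short sketch of that ratio using the same assumed gradient scales as above:

```python
import numpy as np

rng = np.random.default_rng(0)
beta1, beta2, eps = 0.9, 0.999, 1e-8
m, v = np.zeros(2), np.zeros(2)
num_steps = 2000

for t in range(1, num_steps + 1):
    # Assumed scales: parameter 1's gradients are ~100x larger than parameter 2's
    g = np.array([rng.normal(10.0, 5.0), rng.normal(0.1, 0.05)])
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2

m_hat = m / (1 - beta1 ** num_steps)
v_hat = v / (1 - beta2 ** num_steps)

print("typical raw gradients: ~[10.0, 0.1]")
print("normalized updates:    ", m_hat / (np.sqrt(v_hat) + eps))  # similar magnitudes
```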
The normalized updates have similar magnitudes despite the 100× difference in raw gradient scales. This is the essence of adaptive learning rates: Adam automatically compensates for gradient scale differences, allowing the same base learning rate to work across diverse parameters.
The Bias Correction Problem
There's a subtle but critical problem with the EMA estimates as we've defined them. At initialization, $m_0 = 0$ and $v_0 = 0$. In the first few iterations, the estimates are severely biased toward zero, not because the true gradient mean or variance is near zero, but because the EMA hasn't had time to accumulate information.
To see why this happens, consider the first moment after one step. Starting with $m_0 = 0$, we apply the EMA update:

$$m_1 = \beta_1 m_0 + (1 - \beta_1)\, g_1 = (1 - \beta_1)\, g_1$$

where:
- $m_1$: the first moment estimate after one step
- $\beta_1 m_0$: the contribution from the previous estimate (zero because $m_0 = 0$)
- $(1 - \beta_1)\, g_1$: the contribution from the current gradient
- $g_1$: the gradient at step 1
With $\beta_1 = 0.9$, we get $m_1 = 0.1\, g_1$. The estimate is only 10% of the true gradient! This bias gradually decreases as more observations accumulate, but it takes many steps to fully overcome the zero initialization.
Deriving the Bias Correction
Let's derive the exact bias correction term step by step. We'll show why the biased estimate undershoots and how to correct it.
Step 1: Expand the recurrence relation
Assuming the gradients have a constant true mean $\mathbb{E}[g_i] = \mu$ over time, we can unroll the EMA recurrence to express $m_t$ as a weighted sum of all past gradients:

$$m_t = (1 - \beta_1) \sum_{i=1}^{t} \beta_1^{\,t-i}\, g_i$$

where:
- $(1 - \beta_1)$: the weight given to each gradient observation
- $\beta_1^{\,t-i}$: the decay factor for gradient $g_i$, which depends on how many steps ago it was observed
- The sum runs from $i = 1$ (first gradient) to $i = t$ (current gradient)
Step 2: Compute the expected value
Taking the expectation of both sides and using the fact that all gradients have the same expected value $\mu$:

$$\mathbb{E}[m_t] = (1 - \beta_1) \sum_{i=1}^{t} \beta_1^{\,t-i}\, \mathbb{E}[g_i] = \mu\,(1 - \beta_1) \sum_{i=1}^{t} \beta_1^{\,t-i}$$
Step 3: Evaluate the geometric series
The sum $\sum_{i=1}^{t} \beta_1^{\,t-i}$ is a geometric series. To evaluate it, we substitute $j = t - i$ to change the indexing. When $i = t$, we have $j = 0$; when $i = 1$, we have $j = t - 1$. This transforms the sum:

$$\sum_{i=1}^{t} \beta_1^{\,t-i} = \sum_{j=0}^{t-1} \beta_1^{\,j} = \frac{1 - \beta_1^{\,t}}{1 - \beta_1}$$

The last equality uses the standard geometric series formula: $\sum_{j=0}^{n-1} r^{\,j} = \frac{1 - r^n}{1 - r}$ for $r \neq 1$.
Step 4: Derive the bias factor
Substituting the geometric series result back:

$$\mathbb{E}[m_t] = \mu\,(1 - \beta_1)\,\frac{1 - \beta_1^{\,t}}{1 - \beta_1} = \mu\,(1 - \beta_1^{\,t})$$

The $(1 - \beta_1)$ terms cancel, leaving us with a clean expression. The expected value of $m_t$ is not $\mu$ but rather $\mu\,(1 - \beta_1^{\,t})$, which is smaller in magnitude than $\mu$ since $1 - \beta_1^{\,t} < 1$ for all finite $t$.
Step 5: Apply the correction
The bias factor is exactly $1 - \beta_1^{\,t}$. To get an unbiased estimate, we divide by this factor:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^{\,t}}$$

where:
- $\hat{m}_t$: the bias-corrected first moment estimate
- $m_t$: the raw (biased) first moment estimate
- $1 - \beta_1^{\,t}$: the correction factor that compensates for zero initialization

The same derivation applies to the second moment:

$$\hat{v}_t = \frac{v_t}{1 - \beta_2^{\,t}}$$

These bias-corrected estimates converge to the true moments as $t$ grows large, since $\beta^{\,t} \to 0$ and the correction factor approaches 1.
Let's make this concrete with specific values at early time steps:
| Step $t$ | $\beta_1^t$ ($\beta_1 = 0.9$) | Correction $1 - \beta_1^t$ | $\beta_2^t$ ($\beta_2 = 0.999$) | Correction $1 - \beta_2^t$ |
|---|---|---|---|---|
| 1 | 0.900 | 0.100 | 0.999 | 0.001 |
| 5 | 0.590 | 0.410 | 0.995 | 0.005 |
| 10 | 0.349 | 0.651 | 0.990 | 0.010 |
| 50 | 0.005 | 0.995 | 0.951 | 0.049 |
| 100 | 0.000 | 1.000 | 0.905 | 0.095 |
| 1000 | 0.000 | 1.000 | 0.368 | 0.632 |
Notice the dramatic difference: dividing by 0.001 at step 1 multiplies the raw second moment by 1000! Without this correction, the denominator $\sqrt{v_t}$ would be roughly $\sqrt{1000} \approx 32\times$ too small, making the effective learning rate more than an order of magnitude too large in early training.
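A quick numerical check of the correction, assuming a constant gradient of 1.0 so that the true moments are exactly 1:

```python
beta1, beta2 = 0.9, 0.999
g = 1.0                 # assumed constant gradient, so the true moments are m = 1 and v = 1
m = v = 0.0

for t in range(1, 11):
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)        # bias-corrected estimates
    v_hat = v / (1 - beta2 ** t)
    print(f"t={t:2d}  raw m={m:.3f}  corrected m={m_hat:.3f}  "
          f"raw v={v:.4f}  corrected v={v_hat:.4f}")
```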
The visualization makes the importance of bias correction clear. Without it, early training steps would use severely underestimated gradients, potentially causing the optimizer to move too slowly or in the wrong direction. With bias correction, Adam produces accurate moment estimates from the very first step.
The Adam Update Rule
We've now developed all the ingredients needed for Adam: exponential moving averages to track gradient statistics, first moment estimates for momentum, second moment estimates for adaptive scaling, and bias correction to handle initialization. The question becomes: how do we combine these pieces into a coherent optimization algorithm?
The key insight is that we want to use the first moment to determine the direction of our update (like momentum), while using the second moment to determine the magnitude of the step for each parameter individually. This combination gives us the best of both worlds: smooth, consistent updates that automatically adapt to each parameter's gradient characteristics.
Building the Update Step by Step
Let's construct the Adam update rule piece by piece, understanding the purpose of each component as we go.
Step 1: Compute the gradient
Every optimization step begins with computing the gradient of the loss function. This tells us which direction would decrease (or increase) the loss for each parameter:

$$g_t = \nabla_\theta \mathcal{L}(\theta_{t-1})$$

where:
- $g_t$: the gradient vector at step $t$, containing one value per parameter
- $\nabla_\theta$: the gradient operator, which computes partial derivatives with respect to each parameter
- $\mathcal{L}(\theta_{t-1})$: the loss function evaluated at the current parameter values
The gradient points in the direction of steepest increase in loss. To minimize the loss, we'll move in the opposite direction. But rather than using this gradient directly, we'll first filter it through our moment estimates.
Step 2: Update the first moment estimate
Next, we incorporate the new gradient into our running estimate of the gradient mean. This provides momentum, smoothing out noise and oscillations:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t$$

where:
- $m_t$: the first moment estimate at step $t$
- $\beta_1$: the decay rate, typically 0.9, controlling how much we weight past gradients versus the current one
- $m_{t-1}$: the previous first moment estimate
- $g_t$: the current gradient
Think of $m_t$ as a velocity that accumulates over time. When gradients consistently point in the same direction, the velocity builds up. When gradients oscillate, the velocity averages them out. This is why the first moment provides momentum: it helps the optimizer maintain its trajectory through noisy gradient landscapes.
Step 3: Update the second moment estimate
Simultaneously, we update our estimate of gradient variance by tracking squared gradients:

$$v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$$

where:
- $v_t$: the second moment estimate at step $t$
- $\beta_2$: the decay rate, typically 0.999, providing a longer memory than the first moment
- $v_{t-1}$: the previous second moment estimate
- $g_t^2$: the element-wise square of the current gradient
The second moment tracks how large gradients tend to be for each parameter. Parameters that consistently receive large gradients will have large $v_t$ values. This information is crucial for adaptive learning rates: we'll use it to shrink the step size for parameters with historically large gradients.
Notice that $\beta_2$ is larger than $\beta_1$ (0.999 vs. 0.9). This means the second moment has a longer memory, averaging over roughly 1000 recent gradients. The rationale is that the direction we want to move (first moment) should respond quickly to changes, but the scale of how fast to move (second moment) should be more stable.
Step 4: Apply bias correction
Both moment estimates are biased toward zero in early training because they're initialized at zero. We correct this by dividing by the appropriate correction factors:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^{\,t}}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^{\,t}}$$

where:
- $\hat{m}_t$: the bias-corrected first moment estimate
- $\hat{v}_t$: the bias-corrected second moment estimate
- $\beta_1^{\,t}$ and $\beta_2^{\,t}$: the decay rates raised to the power $t$ (the current step number)
- $1 - \beta_1^{\,t}$ and $1 - \beta_2^{\,t}$: correction factors that start small and approach 1 as $t$ grows
At step $t = 1$, the correction factors are $1 - \beta_1 = 0.1$ for the first moment and $1 - \beta_2 = 0.001$ for the second moment. Dividing by these small values amplifies the raw estimates to their unbiased values. As training progresses, $\beta^{\,t}$ approaches zero, the correction factors approach 1, and the correction becomes negligible.
Step 5: Update the parameters
Finally, we combine the bias-corrected moments into the parameter update:

$$\theta_t = \theta_{t-1} - \alpha\,\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

where:
- $\theta_t$: the updated model parameters
- $\theta_{t-1}$: the parameters before this step
- $\alpha$: the learning rate, a hyperparameter controlling overall step size
- $\hat{m}_t$: the bias-corrected first moment, determining the update direction
- $\sqrt{\hat{v}_t}$: the square root of the bias-corrected second moment, scaling each parameter's step size
- $\epsilon$: a small constant (typically $10^{-8}$) preventing division by zero
This is the heart of Adam. The numerator provides a momentum-smoothed gradient direction. The denominator scales each parameter's update inversely to its typical gradient magnitude. Parameters with large historical gradients get smaller steps; parameters with small historical gradients get larger steps.
The division is element-wise, so each parameter receives its own personalized learning rate. This is fundamentally different from SGD or momentum, which apply the same learning rate to all parameters.
Why This Formula Works
To understand why Adam's update rule is effective, consider what happens to the effective learning rate for each parameter. The effective learning rate for parameter $i$ at step $t$ is:

$$\alpha_{\text{eff},i} = \frac{\alpha}{\sqrt{\hat{v}_{t,i}} + \epsilon}$$

where:
- $\alpha_{\text{eff},i}$: the effective learning rate for parameter $i$ at step $t$
- $\alpha$: the base learning rate specified by the user
- $\hat{v}_{t,i}$: the bias-corrected second moment for parameter $i$
- $\epsilon$: the numerical stability constant
This formula reveals Adam's adaptive behavior:
- Large gradients → small effective learning rate: When gradients for parameter $i$ are consistently large, $\hat{v}_{t,i}$ grows and the denominator increases. This shrinks the effective learning rate, preventing the optimizer from taking steps that are too large.
- Small gradients → large effective learning rate: When gradients are consistently small, $\hat{v}_{t,i}$ stays small, keeping the denominator small. The effective learning rate stays large (bounded above by $\alpha/\epsilon$), allowing the optimizer to take meaningful steps even when gradients are tiny.
- Variable gradients → intermediate behavior: Parameters with fluctuating gradient magnitudes get intermediate effective learning rates that adapt as the training dynamics evolve.
This adaptive behavior has several practical benefits:
- Scale invariance: If we rescale the gradients for a parameter by a positive constant factor $c$, both the numerator $\hat{m}_t$ and the denominator $\sqrt{\hat{v}_t}$ scale by $c$, leaving the update essentially unchanged (up to the effect of $\epsilon$). This makes Adam less sensitive to how features are scaled.
- Robustness across architectures: Different layers in a neural network often have very different gradient magnitudes. Embedding layers might have sparse, large gradients while output layers have dense, small gradients. Adam automatically adjusts for these differences.
- Handling sparse gradients: In NLP models with large vocabularies, most word embeddings receive zero gradients on any given batch. When a word does appear, Adam gives it a larger update because its second moment is small. This helps rare words learn effectively.
The difference in effective learning rates spans orders of magnitude: the parameter with large gradients ends up with a much smaller effective learning rate, while the parameter with small gradients retains a much larger one. This automatic adaptation is why Adam often works well "out of the box" without extensive learning rate tuning.
Visualizing the Complete Update
To see how all the pieces fit together, let's trace through a single Adam update step, showing how the gradient flows through each transformation:
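A sketch of that trace for two parameters whose raw gradients differ by roughly 100× (the specific values are assumptions):

```python
import numpy as np

# One Adam step at t = 1, starting from zero moments
alpha, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8
t = 1
g = np.array([10.0, 0.1])                   # assumed raw gradients: parameter 1 is ~100x larger

m = beta1 * 0.0 + (1 - beta1) * g           # first moment update
v = beta2 * 0.0 + (1 - beta2) * g ** 2      # second moment update
m_hat = m / (1 - beta1 ** t)                # bias correction
v_hat = v / (1 - beta2 ** t)
scaling = 1.0 / (np.sqrt(v_hat) + eps)      # inverse scaling factor per parameter
update = alpha * m_hat * scaling            # final parameter update

print("raw gradient:        ", g)
print("bias-corrected m_hat:", m_hat)
print("inverse scaling:     ", scaling)
print("final update:        ", update)
```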
The breakdown reveals Adam's balancing act. Parameter 1 has a raw gradient ~100× larger than Parameter 2, but its inverse scaling factor is correspondingly smaller. The final updates end up much closer in magnitude than the raw gradients, ensuring both parameters make meaningful progress without destabilizing the optimization.
Implementing Adam from Scratch
With the theory in place, let's translate the Adam algorithm into code. Building an optimizer from scratch solidifies understanding and reveals the elegance of the approach. We'll then test our implementation on a challenging optimization problem to see Adam in action.
The Adam Class
Our implementation needs to track several pieces of state:
- The parameters being optimized
- The hyperparameters ($\alpha$, $\beta_1$, $\beta_2$, $\epsilon$)
- The first and second moment estimates for each parameter
- The current time step for bias correction
Here's a clean implementation that mirrors the mathematical formulation:
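A minimal NumPy sketch of such an implementation follows; the class and method names are one reasonable choice rather than a fixed API.

```python
import numpy as np

class Adam:
    """Minimal Adam optimizer operating on a list of NumPy parameter arrays."""

    def __init__(self, params, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        self.params = params                            # references to the parameter arrays
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = [np.zeros_like(p) for p in params]     # first moment estimates
        self.v = [np.zeros_like(p) for p in params]     # second moment estimates
        self.t = 0                                      # time step for bias correction

    def step(self, grads):
        """Apply one Adam update given gradients matching self.params."""
        self.t += 1
        for param, grad, m, v in zip(self.params, grads, self.m, self.v):
            # Update the biased moment estimates in place
            m *= self.beta1
            m += (1 - self.beta1) * grad
            v *= self.beta2
            v += (1 - self.beta2) * grad ** 2
            # Bias correction
            m_hat = m / (1 - self.beta1 ** self.t)
            v_hat = v / (1 - self.beta2 ** self.t)
            # Parameter update (in place, so the caller sees the change)
            param -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
```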
The implementation is compact. The __init__ method sets up the hyperparameters and initializes the moment estimates to zero. The step method implements exactly the five steps we derived: it updates both moment estimates, applies bias correction, and computes the parameter update.
Notice that the parameters are modified in-place (param -= ...). This is a common pattern in optimization: the optimizer receives references to the actual parameter arrays and modifies them directly. The caller doesn't need to extract updated values from the optimizer.
Testing on the Rosenbrock Function
To see Adam in action, we'll minimize the Rosenbrock function, a classic benchmark that has tormented optimizers since 1960. The function is defined as:

$$f(x, y) = (1 - x)^2 + 100\,(y - x^2)^2$$

The global minimum is at $(x, y) = (1, 1)$, where $f(1, 1) = 0$. What makes this function challenging is its shape: a long, narrow, curved valley where the gradient points mostly across the valley rather than along it toward the minimum. Simple gradient descent tends to oscillate across the valley while making slow progress along it.
Let's define the function and its gradient, then optimize with Adam from a starting point well away from the minimum:
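A sketch of that experiment using the Adam class from the previous listing; the starting point, learning rate, and step budget are assumptions.

```python
import numpy as np

def rosenbrock(p):
    x, y = p
    return (1 - x) ** 2 + 100 * (y - x ** 2) ** 2

def rosenbrock_grad(p):
    x, y = p
    dx = -2 * (1 - x) - 400 * x * (y - x ** 2)
    dy = 200 * (y - x ** 2)
    return np.array([dx, dy])

# Assumed starting point and hyperparameters
params = [np.array([-1.5, 1.5])]
optimizer = Adam(params, lr=0.05)           # Adam class from the sketch above

for step in range(20000):
    grads = [rosenbrock_grad(params[0])]
    optimizer.step(grads)

print("final parameters:    ", params[0])
print("distance to optimum: ", np.linalg.norm(params[0] - np.array([1.0, 1.0])))
print("final loss:          ", rosenbrock(params[0]))
```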
Adam converges to the optimum with impressive precision: the distance to the true minimum becomes vanishingly small and the final loss approaches machine precision. This demonstrates that Adam's adaptive learning rates successfully navigate the challenging curved valley structure.
Visualizing the Optimization Trajectory
To understand how Adam navigates the Rosenbrock landscape, let's visualize both the trajectory through parameter space and the loss curve over time:
The trajectory reveals Adam's strategy. Rather than oscillating across the narrow valley like simple gradient descent would, Adam quickly finds the valley floor and then follows it toward the minimum. The adaptive learning rates are essential here: the curvature differs sharply between the two coordinate directions, and Adam automatically uses a different step size for each.
The loss curve shows characteristic behavior. There's an initial rapid decrease as Adam approaches the valley, followed by slower but steady progress along the valley floor. The log scale reveals that Adam continues making progress even when the loss appears to plateau on a linear scale.
Comparing with SGD and Momentum
How does Adam compare to simpler optimizers? Let's benchmark against SGD and SGD with momentum on the same problem:
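A sketch of such a benchmark; the learning rates are assumptions, chosen small enough that SGD and momentum remain stable on this problem while Adam uses a much larger rate.

```python
import numpy as np

def rosenbrock(p):
    x, y = p
    return (1 - x) ** 2 + 100 * (y - x ** 2) ** 2

def rosenbrock_grad(p):
    x, y = p
    return np.array([-2 * (1 - x) - 400 * x * (y - x ** 2), 200 * (y - x ** 2)])

def run(update_fn, lr, steps=5000):
    p = np.array([-1.5, 1.5])
    state = {}
    for t in range(1, steps + 1):
        p = update_fn(p, rosenbrock_grad(p), lr, state, t)
    return rosenbrock(p)

def sgd(p, g, lr, state, t):
    return p - lr * g

def momentum(p, g, lr, state, t):
    v = 0.9 * state.get("v", np.zeros_like(p)) - lr * g
    state["v"] = v
    return p + v

def adam(p, g, lr, state, t, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * state.get("m", np.zeros_like(p)) + (1 - beta1) * g
    v = beta2 * state.get("v", np.zeros_like(p)) + (1 - beta2) * g ** 2
    state["m"], state["v"] = m, v
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return p - lr * m_hat / (np.sqrt(v_hat) + eps)

# Assumed learning rates: larger values make plain SGD and momentum unstable here
print("SGD final loss:      ", run(sgd, lr=1e-4))
print("Momentum final loss: ", run(momentum, lr=1e-4))
print("Adam final loss:     ", run(adam, lr=0.1))
```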
The comparison reveals Adam's advantages. With comparable learning rates, SGD would barely move because it can't handle the different scales across the two dimensions. With a larger learning rate, it would oscillate wildly. Momentum helps by building up velocity in consistent directions, but it still struggles with the changing curvature. Adam's per-parameter adaptation allows it to use a much larger effective learning rate while maintaining stability.
Quantifying the Difference
Let's measure the concrete performance difference between optimizers:
Adam achieves orders of magnitude better convergence in half the iterations. The adaptive learning rates allow it to use an effective step size 1000× larger than what SGD can safely handle, accelerating progress through the curved valley.
Adam Hyperparameters
Adam has four hyperparameters, but in practice only the learning rate usually requires tuning. The default values work well for most problems.
Learning rate ($\alpha$): The most important hyperparameter. Common values range from 0.0001 to 0.01, with 0.001 as a typical starting point for deep learning. Unlike SGD, Adam is relatively robust to learning rate choice due to its adaptive scaling, but extreme values can still cause problems. Too high leads to instability; too low leads to slow convergence.
First moment decay ($\beta_1$): Controls momentum. The default value of 0.9 works well for most problems. Lower values (0.8) reduce momentum and make the optimizer more responsive to recent gradients. Higher values (0.95, 0.99) increase smoothing but may slow adaptation to changing loss landscapes. For problems with very noisy gradients, consider increasing $\beta_1$.
Second moment decay ($\beta_2$): Controls the learning rate adaptation timescale. The default of 0.999 provides stable second moment estimates. Lower values (0.99, 0.9) adapt learning rates more quickly but can be unstable. For sparse gradient problems (like NLP with large vocabularies), the default 0.999 is usually appropriate.
Epsilon ($\epsilon$): Numerical stability constant. The default of $10^{-8}$ is almost always fine; some frameworks use slightly larger defaults. Larger values put a floor under the denominator, capping the effective learning rate when second moments become very small, which can improve stability.
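Reusing the rosenbrock and rosenbrock_grad helpers defined earlier, a compact learning-rate sweep can be sketched as follows; the rates match those discussed below, while the step budget is an assumption.

```python
import numpy as np

def adam_final_loss(lr, steps=2000, beta1=0.9, beta2=0.999, eps=1e-8):
    """Run Adam on the Rosenbrock function and return the final loss."""
    p = np.array([-1.5, 1.5])
    m, v = np.zeros(2), np.zeros(2)
    for t in range(1, steps + 1):
        g = rosenbrock_grad(p)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        p -= lr * (m / (1 - beta1 ** t)) / (np.sqrt(v / (1 - beta2 ** t)) + eps)
    return rosenbrock(p)

for lr in (0.01, 0.1, 0.5):
    print(f"lr={lr}: final loss after 2000 steps = {adam_final_loss(lr):.3e}")
```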
The learning rate of 0.1 achieves the fastest convergence and lowest final loss, while 0.01 converges more slowly but steadily. The highest rate of 0.5 may introduce instability depending on the problem. For most deep learning tasks, starting with 0.001 and adjusting based on training dynamics is a practical approach.
Effect of β₁ on Momentum
While the learning rate is the primary tuning knob, β₁ controls how much the optimizer relies on past gradients versus the current one. Let's visualize this trade-off:
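A sketch of that experiment: a synthetic gradient stream whose sign flips at step 50, filtered with several β₁ values (the values themselves are assumptions).

```python
import numpy as np

# Synthetic gradient: constant +1 for 50 steps, then it abruptly flips to -1
gradients = np.concatenate([np.ones(50), -np.ones(50)])

for beta1 in (0.5, 0.9, 0.99):
    m, reversal_delay = 0.0, None
    for t, g in enumerate(gradients, start=1):
        m = beta1 * m + (1 - beta1) * g
        if t > 50 and reversal_delay is None and m < 0:
            reversal_delay = t - 50     # steps needed after the change to reverse direction
    print(f"beta1={beta1}: first moment reverses sign {reversal_delay} steps after the change")
```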
When the gradient direction changes abruptly at step 50, lower β₁ values adapt within a few steps while higher values take tens of steps to reverse direction. For training scenarios with non-stationary objectives (like curriculum learning), consider reducing β₁ for faster adaptation.
Convergence Properties
Adam's theoretical convergence properties have been studied extensively. Under certain conditions (convex loss, bounded gradients, appropriate learning rate decay), Adam converges to the optimum. However, several practical considerations affect real-world performance.
When Adam Excels
Adam performs particularly well in several scenarios:
- Sparse gradients: NLP models with large vocabularies, where most embedding gradients are zero on any given batch
- Non-stationary objectives: Training with data augmentation or curriculum learning where the effective loss changes over time
- Noisy gradients: Small batch sizes where gradient estimates have high variance
- Different parameter scales: Models combining embeddings, convolutions, and dense layers with very different gradient magnitudes
Known Issues
Adam has some documented failure modes:
- Poor generalization: Empirically, Adam sometimes generalizes worse than SGD with momentum on image classification, possibly due to converging to sharper minima
- Weight decay interaction: Standard L2 regularization doesn't work correctly with Adam due to the adaptive learning rates (this motivates AdamW, covered in the next chapter)
- Non-convergence cases: Reddi et al. (2018) showed examples where Adam fails to converge, leading to AMSGrad, a variant that maintains maximum second moments
The flat vs. sharp minimum distinction matters for generalization. Test data comes from a slightly different distribution than training data, which effectively perturbs the loss surface. Models that find flat minima are more robust to this distribution shift.
Using Adam in PyTorch
In practice, you'll use framework implementations. PyTorch's Adam is highly optimized and handles all the bookkeeping automatically:
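A minimal sketch of such a training loop on synthetic regression data; the architecture, data, and hyperparameters are assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic regression data (assumed): 256 samples, 10 features
X = torch.randn(256, 10)
y = X @ torch.randn(10, 1) + 0.1 * torch.randn(256, 1)

class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(10, 32)
        self.fc2 = nn.Linear(32, 1)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

model = MLP()
# lr=0.01 is an assumption; this toy problem converges faster with it than with 0.001
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, betas=(0.9, 0.999), eps=1e-8)
loss_fn = nn.MSELoss()

initial_loss = None
for epoch in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()
    if initial_loss is None:
        initial_loss = loss.item()

print(f"initial loss: {initial_loss:.4f}")
print(f"final loss:   {loss.item():.4f}")
print(f"improvement:  {100 * (1 - loss.item() / initial_loss):.1f}%")
```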
The network reduces its loss substantially over 100 epochs. The improvement percentage shows how much the model has learned from the synthetic data. In practice, you would monitor both training and validation loss to detect overfitting.
Inspecting Optimizer State
One advantage of PyTorch is easy access to the optimizer's internal state. Let's examine the moment estimates for different layers:
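Continuing from the training loop above, a sketch of how that state can be read; exp_avg and exp_avg_sq are the keys PyTorch's Adam uses for the first and second moment tensors.

```python
# Inspect Adam's internal state after training (continues the previous snippet)
for name, param in model.named_parameters():
    state = optimizer.state[param]
    m = state["exp_avg"]        # first moment estimate for this tensor
    v = state["exp_avg_sq"]     # second moment estimate for this tensor
    print(f"{name:12s} |m| mean = {m.abs().mean():.2e}   "
          f"v mean = {v.mean():.2e}   steps = {int(state['step'])}")
```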
The state reveals interesting patterns. Weights and biases have different moment statistics, reflecting their different gradient characteristics during training. The first layer (fc1) and second layer (fc2) also show distinct patterns, demonstrating how Adam maintains per-parameter adaptation throughout the network.
PyTorch's implementation includes several optimizations beyond our simple version, including fused CUDA kernels for GPU training and memory-efficient gradient handling. For production use, always prefer the framework implementation.
Limitations and Practical Considerations
While Adam is effective, it isn't perfect. Understanding its limitations helps you know when to consider alternatives.
The most significant practical issue is Adam's interaction with weight decay regularization. Standard L2 regularization adds a term $\frac{\lambda}{2}\lVert\theta\rVert^2$ to the loss, which adds $\lambda\theta$ to the gradient, where $\lambda$ is the regularization strength and $\theta$ represents the model parameters. But Adam's adaptive learning rates interfere with this: parameters with large gradients have large second moments $v_t$, which reduces their effective learning rate. This means the regularization gradient is also scaled down, so parameters with large gradients effectively receive less regularization. This motivated the development of AdamW, which decouples weight decay from the gradient computation.
Another limitation is memory usage. Adam stores two additional values (first and second moments) per parameter, tripling memory requirements compared to SGD. For very large models, this can be a significant constraint. Optimizers like Adafactor address this by factorizing the moment matrices.
Adam's default hyperparameters work well across many problems, but they aren't universally optimal. For some tasks, especially those where SGD ultimately achieves better generalization, Adam may converge quickly to a suboptimal solution. In these cases, practitioners sometimes use Adam for initial rapid progress, then switch to SGD for fine-tuning.
Key Parameters
When using Adam in practice, these are the parameters you'll encounter most frequently:
lr (learning rate, $\alpha$): Controls the overall step size. Start with 0.001 for most deep learning tasks. Increase to 0.01 for faster initial convergence on simpler problems, or decrease to 0.0001 for fine-tuning pretrained models. This is the only hyperparameter that typically requires tuning.
betas (momentum coefficients): A tuple of $(\beta_1, \beta_2)$ controlling the decay rates for first and second moment estimates. The defaults (0.9, 0.999) work well for most problems. Consider increasing $\beta_1$ to 0.95 for very noisy gradients, or decreasing $\beta_2$ to 0.99 for sparse gradient scenarios.
eps (epsilon): Numerical stability constant added to the denominator. The default of $10^{-8}$ is appropriate for most cases. Increase it if you observe numerical instability with very small second moments.
weight_decay: In standard Adam, this applies L2 regularization through the gradient (not recommended). For proper weight decay, use AdamW instead, which decouples the regularization from the adaptive learning rate mechanism.
Summary
Adam combines momentum and adaptive learning rates into a single, robust optimizer. By tracking exponential moving averages of gradients (first moment) and squared gradients (second moment), it automatically scales updates for each parameter based on its gradient history.
Key takeaways from this chapter:
- Exponential moving averages smooth noisy signals by blending new observations with historical estimates, with the decay rate $\beta$ controlling the trade-off between stability and responsiveness
- First moment estimation ($m_t$) provides momentum, accumulating velocity in consistent gradient directions
- Second moment estimation ($v_t$) tracks gradient variance, enabling per-parameter learning rate adaptation
- Bias correction compensates for zero initialization, ensuring accurate moment estimates from the first step
- The Adam update divides the bias-corrected first moment by the square root of the bias-corrected second moment, naturally scaling step sizes
- Hyperparameters include learning rate ($\alpha$, typically 0.001), first moment decay ($\beta_1$, typically 0.9), second moment decay ($\beta_2$, typically 0.999), and numerical stability constant ($\epsilon$, typically $10^{-8}$)
- Adaptive learning rates make Adam robust across parameters with different gradient scales, often working well without extensive tuning
Adam's practical success made it the default optimizer for much of deep learning. However, its interaction with weight decay regularization led to the development of AdamW, which we'll explore in the next chapter.
Quiz
Ready to test your understanding? Take this quick quiz to reinforce what you've learned about the Adam optimizer.























