Adam Optimizer: Adaptive Learning Rates for Neural Network Training

Michael Brenndoerfer · Updated April 27, 2025 · 51 min read

Master Adam optimization with exponential moving averages, bias correction, and per-parameter learning rates. Build Adam from scratch and compare with SGD.


Adam Optimizer

Training neural networks requires navigating complex loss landscapes with varying curvature across different parameter dimensions. Some parameters need large updates to escape flat regions, while others need small updates to avoid overshooting sharp valleys. Standard gradient descent treats all parameters equally, using a single learning rate for everything. Momentum helps by smoothing gradients over time, but it still applies the same effective step size everywhere.

Adam, short for Adaptive Moment Estimation, solves this problem by maintaining per-parameter learning rates that automatically adjust based on the history of gradients. It combines two powerful ideas: momentum to smooth gradient direction, and adaptive learning rates to scale step sizes appropriately. Published by Kingma and Ba in 2014, Adam quickly became the default optimizer for deep learning due to its robust performance across architectures and tasks.

This chapter builds Adam from first principles. You'll understand the exponential moving averages that power it, derive the bias correction terms that make it work from the first iteration, and implement Adam from scratch.

Exponential Moving Averages

Adam's core mechanism is the exponential moving average (EMA), a technique for tracking statistics over time while giving more weight to recent observations. Understanding EMA is essential for grasping how Adam adapts to gradient patterns.

Exponential Moving Average

An exponential moving average maintains a running estimate that smoothly blends new observations with the historical average. At each step, the estimate is updated as $v_t = \beta v_{t-1} + (1-\beta) x_t$, where $\beta$ controls how much weight goes to the past versus the present.

The update rule is simple:

$$v_t = \beta v_{t-1} + (1 - \beta) x_t$$

where:

  • $v_t$: the exponential moving average at time step $t$
  • $v_{t-1}$: the previous estimate
  • $x_t$: the new observation at time $t$
  • $\beta$: the decay rate, typically between 0.9 and 0.999
  • $(1-\beta)$: the weight assigned to the current observation

The parameter $\beta$ controls the trade-off between stability and responsiveness. High values like $\beta = 0.99$ create smooth, slowly-changing estimates that ignore short-term fluctuations. Low values like $\beta = 0.5$ react quickly to new data but can be noisy. The name "exponential" comes from how the influence of past observations decays exponentially with age.

Let's see how the EMA behaves with different decay rates:

In[2]:
Code
import numpy as np

# Generate a noisy signal with an underlying trend
np.random.seed(42)
n_steps = 100
true_signal = np.sin(np.linspace(0, 4 * np.pi, n_steps))
noise = np.random.randn(n_steps) * 0.3
observations = true_signal + noise


def compute_ema(observations, beta):
    """Compute exponential moving average of observations."""
    ema = np.zeros(len(observations))
    ema[0] = observations[0]  # Initialize with first observation
    for t in range(1, len(observations)):
        ema[t] = beta * ema[t - 1] + (1 - beta) * observations[t]
    return ema


# Compute EMA with different decay rates
betas = [0.5, 0.9, 0.99]
emas = {beta: compute_ema(observations, beta) for beta in betas}
Out[3]:
Visualization
Line plot comparing exponential moving averages with beta values 0.5, 0.9, and 0.99.
Exponential moving averages with different decay rates. Higher beta values produce smoother estimates that lag behind the true signal, while lower beta values track changes more quickly but are noisier.

The visualization reveals the trade-off clearly. With $\beta = 0.5$, the EMA closely tracks the noisy observations, reacting quickly to changes but inheriting much of the noise. With $\beta = 0.99$, the EMA is much smoother but lags significantly behind the true signal. The middle ground of $\beta = 0.9$ balances responsiveness with stability.

Why Exponential Decay?

To understand why past observations decay exponentially, let's expand the EMA formula recursively. Starting with $v_t = \beta v_{t-1} + (1-\beta) x_t$, we can substitute the formula for $v_{t-1}$:

$$v_t = \beta \left[ \beta v_{t-2} + (1-\beta) x_{t-1} \right] + (1-\beta) x_t$$

Distributing the $\beta$ term and continuing this expansion backward to the first observation:

$$v_t = (1-\beta) x_t + \beta(1-\beta) x_{t-1} + \beta^2 (1-\beta) x_{t-2} + \cdots + \beta^{t-1}(1-\beta) x_1 + \beta^t v_0$$

where:

  • $(1-\beta) x_t$: the contribution from the current observation (weight $1-\beta$)
  • $\beta(1-\beta) x_{t-1}$: the contribution from the previous observation (weight $\beta(1-\beta)$)
  • $\beta^{t-k}(1-\beta) x_k$: the general form for the contribution from observation $x_k$
  • $\beta^t v_0$: the residual influence of the initial value $v_0$ (typically zero)

Each observation $x_k$ contributes with weight $\beta^{t-k}(1-\beta)$. Older observations have higher powers of $\beta$, so their influence decays exponentially. After about $\frac{1}{1-\beta}$ time steps, an observation's weight has decayed to roughly $\frac{1}{e} \approx 0.37$ of its original value.

In[4]:
Code
# Calculate effective window size for different beta values
betas_analysis = [0.9, 0.99, 0.999]
window_sizes = {beta: 1 / (1 - beta) for beta in betas_analysis}
Out[5]:
Console
Effective Memory Window by Decay Rate:
---------------------------------------------
β = 0.9    → window ≈ 10 time steps
β = 0.99   → window ≈ 100 time steps
β = 0.999  → window ≈ 1000 time steps

The effective memory window determines how many past observations significantly influence the current estimate. To see this decay in action, let's visualize the actual weights assigned to each past observation:

Out[6]:
Visualization
Line plot showing how exponential weights decay over time for different beta values.
Weight decay for past observations in exponential moving averages. With β = 0.9, only the last ~30 observations have significant influence. With β = 0.999, observations from hundreds of steps ago still contribute meaningfully.

With $\beta = 0.9$, the EMA effectively averages over the last 10 observations, making it responsive to recent changes. With $\beta = 0.999$, it averages over approximately 1000 observations, providing much greater stability at the cost of slower adaptation. Adam uses these different decay rates for its two moment estimates: $\beta_1 = 0.9$ for the first moment (quick response to gradient direction changes) and $\beta_2 = 0.999$ for the second moment (stable learning rate adaptation).
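To connect the window size back to the actual weights, here is a minimal sketch that computes the weight an observation carries $k$ steps after it arrives; the 50-step horizon is just an illustrative choice:

betas_decay = [0.9, 0.99, 0.999]
ages = np.arange(50)

for beta in betas_decay:
    # Weight of an observation that is k steps old: beta^k * (1 - beta)
    weights = (1 - beta) * beta**ages
    print(f"beta = {beta}: weight at age 10 = {weights[10]:.4f}, "
          f"sum over 50 steps = {weights.sum():.3f}")

The printed sums illustrate the same point as the window-size table: with a small decay rate the first 50 weights already account for nearly all of the average, while with $\beta = 0.999$ most of the weight still lies further in the past.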

First Moment: Mean Estimation

Adam tracks two separate exponential moving averages of the gradients. The first moment estimate $m_t$ approximates the mean of recent gradients:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$$

where:

  • $m_t$: the first moment estimate (gradient mean) at step $t$
  • $g_t$: the gradient computed at step $t$
  • $\beta_1$: the decay rate for the first moment, typically 0.9

This is exactly momentum under a different name. The first moment estimate accumulates a velocity in parameter space, helping the optimizer maintain direction through noisy gradient estimates. When gradients consistently point in the same direction, $m_t$ grows in that direction. When gradients oscillate, $m_t$ dampens the oscillations by averaging out the positive and negative contributions.

In[7]:
Code
# Simulate gradients with oscillation and noise
np.random.seed(123)
n_steps = 50

# Gradient with consistent direction plus noise and oscillation
base_gradient = -0.5  # Consistent downward direction
oscillation = 0.3 * np.sin(np.linspace(0, 6 * np.pi, n_steps))
noise = np.random.randn(n_steps) * 0.2
gradients = base_gradient + oscillation + noise


def compute_first_moment(gradients, beta1=0.9):
    """Compute first moment estimate (momentum)."""
    m = np.zeros(len(gradients))
    for t in range(len(gradients)):
        if t == 0:
            m[t] = (1 - beta1) * gradients[t]
        else:
            m[t] = beta1 * m[t - 1] + (1 - beta1) * gradients[t]
    return m


m = compute_first_moment(gradients)
Out[8]:
Visualization
Line plot comparing raw gradients with first moment estimate showing smoothing effect.
First moment estimate smooths noisy, oscillating gradients. The raw gradients fluctuate around -0.5, but the first moment estimate converges to a stable value near the mean, reducing the impact of noise and oscillation on parameter updates.

The first moment estimate converges toward the true mean gradient of -0.5, effectively filtering out both the oscillations and the random noise. This momentum effect helps the optimizer move consistently toward the optimum even when individual gradient estimates are unreliable.

First Moment as a Low-Pass Filter

Another way to understand the first moment is as a low-pass filter in signal processing terms. It smooths high-frequency noise while preserving the low-frequency trend. Let's decompose the gradient signal to see this filtering effect:

Out[9]:
Visualization
Line plot showing raw gradients decomposed into signal and noise components.
The raw gradient contains both the underlying signal (consistent -0.5 direction) and high-frequency noise. The noise makes raw gradients unreliable for optimization.
Line plot comparing first moment to true underlying signal with noise region shaded.
The first moment estimate (red) tracks the underlying signal much more closely than raw gradients would. The shaded region shows the filtered-out noise.

The filtering analogy explains why momentum helps optimization: it extracts the consistent direction from noisy gradient estimates, allowing the optimizer to make confident progress even when individual gradients are unreliable.
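To quantify the filtering effect, here is a minimal sketch that reuses the gradients and first moment m from the cell above and measures how far each strays from the underlying -0.5 direction:

# Reuse gradients and the first moment m computed earlier in this section
raw_dev = np.abs(gradients - base_gradient).mean()
ema_dev = np.abs(m - base_gradient).mean()
print(f"Mean absolute deviation from -0.5, raw gradients: {raw_dev:.3f}")
print(f"Mean absolute deviation from -0.5, first moment:  {ema_dev:.3f}")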

Second Moment: Variance Estimation

The second key innovation in Adam is tracking the second moment of gradients, an estimate of their variance. This enables per-parameter learning rate adaptation. The second moment is computed as an EMA of squared gradients:

$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$

where:

  • $v_t$: the second moment estimate (uncentered variance) at step $t$
  • $g_t^2$: the element-wise square of the gradient at step $t$
  • $\beta_2$: the decay rate for the second moment, typically 0.999

The squared gradients measure how much each parameter's gradient varies over time. Parameters with consistently large gradients will have large $v_t$ values. Parameters with small or sparse gradients will have small $v_t$ values. Adam uses this information to scale learning rates: parameters with high variance get smaller effective learning rates, while parameters with low variance get larger ones.

In[10]:
Code
# Two parameters with very different gradient magnitudes
np.random.seed(42)
n_steps = 100

# Parameter 1: Large, consistent gradients
grad_param1 = np.random.randn(n_steps) * 10 + 5

# Parameter 2: Small, sparse gradients (mostly zero with occasional spikes)
grad_param2 = np.random.randn(n_steps) * 0.1
grad_param2[np.random.choice(n_steps, 10, replace=False)] = (
    np.random.randn(10) * 2
)


def compute_second_moment(gradients, beta2=0.999):
    """Compute second moment estimate."""
    v = np.zeros(len(gradients))
    for t in range(len(gradients)):
        if t == 0:
            v[t] = (1 - beta2) * gradients[t] ** 2
        else:
            v[t] = beta2 * v[t - 1] + (1 - beta2) * gradients[t] ** 2
    return v


v1 = compute_second_moment(grad_param1)
v2 = compute_second_moment(grad_param2)
Out[11]:
Visualization
Plot of gradients and second moment for a parameter with large consistent gradients.
Parameter 1 has large, consistent gradients, leading to a high second moment estimate that grows quickly and stabilizes around 125.
Plot of gradients and second moment for a parameter with small sparse gradients.
Parameter 2 has small gradients with occasional spikes. The second moment stays low, growing slowly, allowing for larger effective learning rates.

The contrast is striking. Parameter 1's second moment estimate stabilizes around 125 (approximately $10^2 + 5^2$), reflecting its consistently large gradients. Parameter 2's second moment remains below 1, reflecting its much smaller typical gradient magnitude. When Adam normalizes updates by $\sqrt{v_t}$, parameter 1 will receive smaller effective steps, while parameter 2 will receive larger ones. This automatic scaling is what makes Adam robust across parameters with very different gradient scales.

The Ratio That Matters

What ultimately determines the effective learning rate is the ratio between the first and second moments. Let's visualize how this ratio differs between our two example parameters:

Out[12]:
Visualization
Line plot comparing the m/sqrt(v) ratio for parameters with large vs small gradients.
The ratio m/√v determines the effective update magnitude. Despite having gradients 100× larger, Parameter 1's ratio is similar to Parameter 2's because both numerator and denominator scale together. This automatic normalization is key to Adam's robustness.

The normalized updates have similar magnitudes despite the 100× difference in raw gradient scales. This is the essence of adaptive learning rates: Adam automatically compensates for gradient scale differences, allowing the same base learning rate to work across diverse parameters.
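As a concrete sketch of this normalization, the ratio can be computed directly from the estimates built earlier in this section; bias correction, introduced in the next section, is omitted here for simplicity:

# Ratio m / sqrt(v) for both example parameters, reusing the estimates above
eps = 1e-8
m_p1 = compute_first_moment(grad_param1, beta1=0.9)
m_p2 = compute_first_moment(grad_param2, beta1=0.9)
ratio1 = m_p1 / (np.sqrt(v1) + eps)
ratio2 = m_p2 / (np.sqrt(v2) + eps)
print(f"Mean |m / sqrt(v)|: param 1 = {np.abs(ratio1).mean():.2f}, "
      f"param 2 = {np.abs(ratio2).mean():.2f}")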

The Bias Correction Problem

There's a subtle but critical problem with the EMA estimates as we've defined them. At initialization, $m_0 = 0$ and $v_0 = 0$. In the first few iterations, the estimates are severely biased toward zero, not because the true gradient mean or variance is near zero, but because the EMA hasn't had time to accumulate information.

To see why this happens, consider the first moment after one step. Starting with $m_0 = 0$, we apply the EMA update:

$$m_1 = \beta_1 \cdot 0 + (1 - \beta_1) g_1 = (1 - \beta_1) g_1$$

where:

  • $m_1$: the first moment estimate after one step
  • $\beta_1 \cdot 0$: the contribution from the previous estimate (zero because $m_0 = 0$)
  • $(1 - \beta_1) g_1$: the contribution from the current gradient
  • $g_1$: the gradient at step 1

With $\beta_1 = 0.9$, we get $m_1 = 0.1 g_1$. The estimate is only 10% of the true gradient! This bias gradually decreases as more observations accumulate, but it takes many steps to fully overcome the zero initialization.

Deriving the Bias Correction

Let's derive the exact bias correction term step by step. We'll show why the biased estimate undershoots and how to correct it.

Step 1: Expand the recurrence relation

Assuming the gradients have true mean $\mathbb{E}[g_t] = g$ (constant over time), we can unroll the EMA recurrence to express $m_t$ as a weighted sum of all past gradients:

$$m_t = (1 - \beta_1) \sum_{i=1}^{t} \beta_1^{t-i} g_i$$

where:

  • $(1 - \beta_1)$: the weight given to each gradient observation
  • $\beta_1^{t-i}$: the decay factor for gradient $g_i$, which depends on how many steps ago it was observed
  • The sum runs from $i=1$ (first gradient) to $i=t$ (current gradient)

Step 2: Compute the expected value

Taking the expectation of both sides and using the fact that all gradients have the same expected value $g$:

$$\mathbb{E}[m_t] = (1 - \beta_1) \sum_{i=1}^{t} \beta_1^{t-i} \mathbb{E}[g_i] = (1 - \beta_1) g \sum_{i=1}^{t} \beta_1^{t-i}$$

Step 3: Evaluate the geometric series

The sum is a geometric series. To evaluate it, we substitute $j = t - i$ to change the indexing. When $i = 1$, we have $j = t - 1$; when $i = t$, we have $j = 0$. This transforms the sum:

$$\sum_{i=1}^{t} \beta_1^{t-i} = \sum_{j=0}^{t-1} \beta_1^{j} = \frac{1 - \beta_1^t}{1 - \beta_1}$$

The last equality uses the standard geometric series formula: $\sum_{j=0}^{n-1} r^j = \frac{1 - r^n}{1 - r}$.

Step 4: Derive the bias factor

Substituting the geometric series result back:

$$\mathbb{E}[m_t] = (1 - \beta_1) g \cdot \frac{1 - \beta_1^t}{1 - \beta_1} = g(1 - \beta_1^t)$$

The $(1 - \beta_1)$ terms cancel, leaving us with a clean expression. The expected value of $m_t$ is not $g$ but rather $g(1 - \beta_1^t)$, which is less than $g$ since $(1 - \beta_1^t) < 1$ for all finite $t$.

Step 5: Apply the correction

The bias factor is exactly $(1 - \beta_1^t)$. To get an unbiased estimate, we divide by this factor:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$$

where:

  • $\hat{m}_t$: the bias-corrected first moment estimate
  • $m_t$: the raw (biased) first moment estimate
  • $(1 - \beta_1^t)$: the correction factor that compensates for zero initialization

The same derivation applies to the second moment:

$$\hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

These bias-corrected estimates converge to the true moments as $t$ grows large, since $\beta^t \to 0$ and the correction factor $(1 - \beta^t)$ approaches 1.
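A quick numerical check of the derivation: the EMA weights really do sum to $1 - \beta_1^t$, so dividing by that factor restores a properly normalized average. A minimal sketch:

beta1 = 0.9
for t in [1, 5, 10, 50]:
    # Sum of the EMA weights (1 - beta1) * beta1^j for j = 0..t-1 (geometric series)
    weight_sum = (1 - beta1) * sum(beta1**j for j in range(t))
    print(f"t = {t:3d}: sum of EMA weights = {weight_sum:.4f},  1 - beta1^t = {1 - beta1**t:.4f}")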

Let's make this concrete with specific values at early time steps:

Bias correction factors at various training steps. The first moment correction becomes negligible after ~50 steps, while the second moment correction takes ~1000 steps to approach 1.

Step $t$ | $\beta_1^t$ ($\beta_1 = 0.9$) | Correction $1 - \beta_1^t$ | $\beta_2^t$ ($\beta_2 = 0.999$) | Correction $1 - \beta_2^t$
1        | 0.900                         | 0.100                      | 0.999                           | 0.001
5        | 0.590                         | 0.410                      | 0.995                           | 0.005
10       | 0.349                         | 0.651                      | 0.990                           | 0.010
50       | 0.005                         | 0.995                      | 0.951                           | 0.049
100      | 0.000                         | 1.000                      | 0.905                           | 0.095
1000     | 0.000                         | 1.000                      | 0.368                           | 0.632

Notice the dramatic difference: dividing by 0.001 at step 1 multiplies the raw second moment by 1000! Without this correction, the effective learning rate would be orders of magnitude too large in early training.

In[13]:
Code
# Demonstrate bias correction effect
beta1 = 0.9
beta2 = 0.999

# True gradient (constant for illustration)
true_gradient = 5.0
n_steps = 100

# Compute raw and bias-corrected first moments
m_raw = np.zeros(n_steps)
m_corrected = np.zeros(n_steps)

for t in range(n_steps):
    if t == 0:
        m_raw[t] = (1 - beta1) * true_gradient
    else:
        m_raw[t] = beta1 * m_raw[t - 1] + (1 - beta1) * true_gradient

    # Bias correction
    correction = 1 - beta1 ** (t + 1)
    m_corrected[t] = m_raw[t] / correction
Out[14]:
Visualization
Line plot showing raw vs bias-corrected first moment estimates over time steps.
Bias correction ensures accurate moment estimates from the first iteration. Without correction (blue), the estimate starts at only 10% of the true value and takes many steps to converge. With correction (green), the estimate immediately reflects the true gradient value.

The visualization makes the importance of bias correction clear. Without it, early training steps would use severely underestimated gradients, potentially causing the optimizer to move too slowly or in the wrong direction. With bias correction, Adam produces accurate moment estimates from the very first step.

In[15]:
Code
# Show how bias correction factors evolve over time
steps = np.arange(1, 101)
correction_beta1 = 1 - beta1**steps
correction_beta2 = 1 - beta2**steps
Out[16]:
Visualization
Line plot showing how bias correction factors for beta1=0.9 and beta2=0.999 converge to 1 over time.
Bias correction factors approach 1 as training progresses. The first moment correction (beta1=0.9) converges quickly within 50 steps, while the second moment correction (beta2=0.999) takes longer due to the higher decay rate.

The Adam Update Rule

We've now developed all the ingredients needed for Adam: exponential moving averages to track gradient statistics, first moment estimates for momentum, second moment estimates for adaptive scaling, and bias correction to handle initialization. The question becomes: how do we combine these pieces into a coherent optimization algorithm?

The key insight is that we want to use the first moment to determine the direction of our update (like momentum), while using the second moment to determine the magnitude of the step for each parameter individually. This combination gives us the best of both worlds: smooth, consistent updates that automatically adapt to each parameter's gradient characteristics.

Building the Update Step by Step

Let's construct the Adam update rule piece by piece, understanding the purpose of each component as we go.

Step 1: Compute the gradient

Every optimization step begins with computing the gradient of the loss function. This tells us which direction would decrease (or increase) the loss for each parameter:

$$g_t = \nabla_\theta \mathcal{L}(\theta_{t-1})$$

where:

  • $g_t$: the gradient vector at step $t$, containing one value per parameter
  • $\nabla_\theta$: the gradient operator, which computes partial derivatives with respect to each parameter
  • $\mathcal{L}(\theta_{t-1})$: the loss function evaluated at the current parameter values

The gradient points in the direction of steepest increase in loss. To minimize the loss, we'll move in the opposite direction. But rather than using this gradient directly, we'll first filter it through our moment estimates.

Step 2: Update the first moment estimate

Next, we incorporate the new gradient into our running estimate of the gradient mean. This provides momentum, smoothing out noise and oscillations:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$$

where:

  • $m_t$: the first moment estimate at step $t$
  • $\beta_1$: the decay rate, typically 0.9, controlling how much we weight past gradients versus the current one
  • $m_{t-1}$: the previous first moment estimate
  • $g_t$: the current gradient

Think of $m_t$ as a velocity that accumulates over time. When gradients consistently point in the same direction, the velocity builds up. When gradients oscillate, the velocity averages them out. This is why the first moment provides momentum: it helps the optimizer maintain its trajectory through noisy gradient landscapes.

Step 3: Update the second moment estimate

Simultaneously, we update our estimate of gradient variance by tracking squared gradients:

$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$

where:

  • $v_t$: the second moment estimate at step $t$
  • $\beta_2$: the decay rate, typically 0.999, providing a longer memory than the first moment
  • $v_{t-1}$: the previous second moment estimate
  • $g_t^2$: the element-wise square of the current gradient

The second moment tracks how large gradients tend to be for each parameter. Parameters that consistently receive large gradients will have large $v_t$ values. This information is crucial for adaptive learning rates: we'll use it to shrink the step size for parameters with historically large gradients.

Notice that $\beta_2$ is larger than $\beta_1$ (0.999 vs 0.9). This means the second moment has a longer memory, averaging over roughly 1000 recent gradients. The rationale is that the direction we want to move (first moment) should respond quickly to changes, but the scale of how fast to move (second moment) should be more stable.

Step 4: Apply bias correction

Both moment estimates are biased toward zero in early training because they're initialized at zero. We correct this by dividing by the appropriate correction factors:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

where:

  • $\hat{m}_t$: the bias-corrected first moment estimate
  • $\hat{v}_t$: the bias-corrected second moment estimate
  • $\beta_1^t$ and $\beta_2^t$: the decay rates raised to the power $t$ (the current step number)
  • $(1 - \beta^t)$: correction factors that start small and approach 1 as $t$ grows

At step $t=1$, the correction factors are $(1 - 0.9^1) = 0.1$ for the first moment and $(1 - 0.999^1) = 0.001$ for the second moment. Dividing by these small values amplifies the raw estimates to their unbiased values. As training progresses, $\beta^t$ approaches zero, the correction factors approach 1, and the correction becomes negligible.

Step 5: Update the parameters

Finally, we combine the bias-corrected moments into the parameter update:

$$\theta_t = \theta_{t-1} - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

where:

  • $\theta_t$: the updated model parameters
  • $\theta_{t-1}$: the parameters before this step
  • $\alpha$: the learning rate, a hyperparameter controlling overall step size
  • $\hat{m}_t$: the bias-corrected first moment, determining the update direction
  • $\sqrt{\hat{v}_t}$: the square root of the bias-corrected second moment, scaling each parameter's step size
  • $\epsilon$: a small constant (typically $10^{-8}$) preventing division by zero

This is the heart of Adam. The numerator $\hat{m}_t$ provides a momentum-smoothed gradient direction. The denominator $\sqrt{\hat{v}_t} + \epsilon$ scales each parameter's update inversely to its typical gradient magnitude. Parameters with large historical gradients get smaller steps; parameters with small historical gradients get larger steps.

The division is element-wise, so each parameter receives its own personalized learning rate. This is fundamentally different from SGD or momentum, which apply the same learning rate to all parameters.
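The five steps translate directly into a few lines of NumPy. Here is a minimal sketch of a single Adam step for one parameter array; the names are illustrative, and a fuller implementation follows later in the chapter:

import numpy as np

def adam_step(theta, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameter array theta, given its gradient at step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad                       # Step 2: first moment
    v = beta2 * v + (1 - beta2) * grad**2                    # Step 3: second moment
    m_hat = m / (1 - beta1**t)                               # Step 4: bias correction
    v_hat = v / (1 - beta2**t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)   # Step 5: parameter update
    return theta, m, v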

Why This Formula Works

To understand why Adam's update rule is effective, consider what happens to the effective learning rate for each parameter. The effective learning rate for parameter $i$ at step $t$ is:

$$\alpha_{i,t}^{\text{eff}} = \frac{\alpha}{\sqrt{\hat{v}_{i,t}} + \epsilon}$$

where:

  • $\alpha_{i,t}^{\text{eff}}$: the effective learning rate for parameter $i$ at step $t$
  • $\alpha$: the base learning rate specified by the user
  • $\hat{v}_{i,t}$: the bias-corrected second moment for parameter $i$
  • $\epsilon$: the numerical stability constant

This formula reveals Adam's adaptive behavior:

  1. Large gradients → small effective learning rate: When gradients for parameter $i$ are consistently large, $\hat{v}_{i,t}$ grows, and the denominator increases. This shrinks the effective learning rate, preventing the optimizer from taking steps that are too large.

  2. Small gradients → large effective learning rate: When gradients are consistently small, $\hat{v}_{i,t}$ stays small, keeping the denominator near $\epsilon$. The effective learning rate remains close to $\alpha$, allowing the optimizer to take meaningful steps even when gradients are tiny.

  3. Variable gradients → intermediate behavior: Parameters with fluctuating gradient magnitudes get intermediate effective learning rates that adapt as the training dynamics evolve.

This adaptive behavior has several practical benefits:

  • Scale invariance: If we rescale the gradients for a parameter by a constant factor $c$, both the numerator $\hat{m}$ and denominator $\sqrt{\hat{v}}$ scale proportionally, leaving the update direction unchanged. This makes Adam less sensitive to how features are scaled; a quick numerical check of this property appears after this list.

  • Robustness across architectures: Different layers in a neural network often have very different gradient magnitudes. Embedding layers might have sparse, large gradients while output layers have dense, small gradients. Adam automatically adjusts for these differences.

  • Handling sparse gradients: In NLP models with large vocabularies, most word embeddings receive zero gradients on any given batch. When a word does appear, Adam gives it a larger update because its second moment is small. This helps rare words learn effectively.
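As the scale-invariance check promised above, the following minimal sketch compares the bias-corrected update direction for a synthetic gradient stream and the same stream scaled by 100; the helper name and the stream itself are illustrative:

def adam_direction(grads, beta1=0.9, beta2=0.999, eps=1e-8):
    """Bias-corrected m / (sqrt(v) + eps) after processing a sequence of gradients."""
    m, v = 0.0, 0.0
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    return m_hat / (np.sqrt(v_hat) + eps)


np.random.seed(0)
stream = np.random.randn(200) + 1.0
print(adam_direction(stream))          # update direction for the raw gradient stream
print(adam_direction(100 * stream))    # nearly identical for the 100x-scaled stream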

In[17]:
Code
# Demonstrate effective learning rate adaptation
np.random.seed(42)
n_steps = 100
alpha = 0.001
beta1 = 0.9
beta2 = 0.999
epsilon = 1e-8

# Two parameters with different gradient characteristics
# Parameter 1: Large, consistent gradients
grad1 = np.random.randn(n_steps) * 5 + 10
# Parameter 2: Small gradients
grad2 = np.random.randn(n_steps) * 0.1

# Track effective learning rates
m1, v1 = 0, 0
m2, v2 = 0, 0
effective_lr1 = []
effective_lr2 = []

for t in range(1, n_steps + 1):
    # Update moments
    m1 = beta1 * m1 + (1 - beta1) * grad1[t - 1]
    v1 = beta2 * v1 + (1 - beta2) * grad1[t - 1] ** 2
    m2 = beta1 * m2 + (1 - beta1) * grad2[t - 1]
    v2 = beta2 * v2 + (1 - beta2) * grad2[t - 1] ** 2

    # Bias correction
    m1_hat = m1 / (1 - beta1**t)
    v1_hat = v1 / (1 - beta2**t)
    m2_hat = m2 / (1 - beta1**t)
    v2_hat = v2 / (1 - beta2**t)

    # Effective learning rates
    eff1 = alpha / (np.sqrt(v1_hat) + epsilon)
    eff2 = alpha / (np.sqrt(v2_hat) + epsilon)

    effective_lr1.append(eff1)
    effective_lr2.append(eff2)
Out[18]:
Visualization
Line plot comparing effective learning rates for parameters with large vs small gradients.
Adam automatically adapts learning rates per parameter. The parameter with large gradients (blue) receives a much smaller effective learning rate than the parameter with small gradients (green), ensuring both make appropriately-sized updates.

The difference in effective learning rates spans orders of magnitude. The parameter with large gradients ends up with an effective learning rate around $10^{-4}$, while the parameter with small gradients retains a rate close to $10^{-2}$. This automatic adaptation is why Adam often works well "out of the box" without extensive learning rate tuning.

Visualizing the Complete Update

To see how all the pieces fit together, let's trace through a single Adam update step, showing how the gradient flows through each transformation:

Out[19]:
Visualization
Bar chart showing the transformation of gradients through Adam's update steps.
Anatomy of an Adam update for two parameters. The raw gradients (left) are transformed by momentum smoothing and adaptive scaling to produce the final updates (right). Despite having very different raw gradients, both parameters receive appropriately-sized updates.

The breakdown reveals Adam's balancing act. Parameter 1 has a raw gradient ~100× larger than Parameter 2, but its inverse scaling factor $1/\sqrt{\hat{v}_t}$ is correspondingly smaller. The final updates end up much closer in magnitude than the raw gradients, ensuring both parameters make meaningful progress without destabilizing the optimization.

Implementing Adam from Scratch

With the theory in place, let's translate the Adam algorithm into code. Building an optimizer from scratch solidifies understanding and reveals the elegance of the approach. We'll then test our implementation on a challenging optimization problem to see Adam in action.

The Adam Class

Our implementation needs to track several pieces of state:

  • The parameters being optimized
  • The hyperparameters ($\alpha$, $\beta_1$, $\beta_2$, $\epsilon$)
  • The first and second moment estimates for each parameter
  • The current time step $t$ for bias correction

Here's a clean implementation that mirrors the mathematical formulation:

In[20]:
Code
class Adam:
    """Adam optimizer implementation from scratch."""

    def __init__(self, params, lr=0.001, betas=(0.9, 0.999), eps=1e-8):
        """
        Initialize Adam optimizer.

        Args:
            params: List of parameter arrays to optimize
            lr: Learning rate (alpha)
            betas: Tuple of (beta1, beta2) decay rates
            eps: Small constant for numerical stability
        """
        self.params = params
        self.lr = lr
        self.beta1, self.beta2 = betas
        self.eps = eps
        self.t = 0

        # Initialize moment estimates for each parameter
        self.m = [np.zeros_like(p) for p in params]
        self.v = [np.zeros_like(p) for p in params]

    def step(self, grads):
        """
        Perform one optimization step.

        Args:
            grads: List of gradient arrays, one per parameter
        """
        self.t += 1

        for i, (param, grad) in enumerate(zip(self.params, grads)):
            # Update biased first moment estimate
            self.m[i] = self.beta1 * self.m[i] + (1 - self.beta1) * grad

            # Update biased second moment estimate
            self.v[i] = self.beta2 * self.v[i] + (1 - self.beta2) * (grad**2)

            # Compute bias-corrected estimates
            m_hat = self.m[i] / (1 - self.beta1**self.t)
            v_hat = self.v[i] / (1 - self.beta2**self.t)

            # Update parameters
            param -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)

The implementation is compact. The __init__ method sets up the hyperparameters and initializes the moment estimates to zero. The step method implements exactly the five steps we derived: it updates both moment estimates, applies bias correction, and computes the parameter update.

Notice that the parameters are modified in-place (param -= ...). This is a common pattern in optimization: the optimizer receives references to the actual parameter arrays and modifies them directly. The caller doesn't need to extract updated values from the optimizer.
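Before moving to a harder benchmark, here is a minimal usage sketch of the class on the one-dimensional quadratic $f(w) = w^2$, whose gradient is $2w$; the starting point and step count are arbitrary choices:

w = [np.array([5.0])]            # parameter list, as the Adam class expects
opt = Adam(w, lr=0.1)

for _ in range(200):
    grads = [2 * w[0]]           # gradient of f(w) = w^2
    opt.step(grads)

print(f"w after 200 steps: {w[0][0]:.4f}")   # Adam drives w toward the minimum at 0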

Testing on the Rosenbrock Function

To see Adam in action, we'll minimize the Rosenbrock function, a classic benchmark that has tormented optimizers since 1960. The function is defined as:

$$f(x, y) = (1 - x)^2 + 100(y - x^2)^2$$

The global minimum is at $(x, y) = (1, 1)$ where $f(1, 1) = 0$. What makes this function challenging is its shape: a long, narrow, curved valley where the gradient points mostly across the valley rather than along it toward the minimum. Simple gradient descent tends to oscillate across the valley while making slow progress along it.

Let's define the function and its gradient, then optimize from a starting point of $(-1, -1)$:

In[21]:
Code
def rosenbrock(x, y):
    """Rosenbrock function: f(x,y) = (1-x)^2 + 100(y-x^2)^2"""
    return (1 - x) ** 2 + 100 * (y - x**2) ** 2


def rosenbrock_grad(x, y):
    """Gradient of Rosenbrock function."""
    dx = -2 * (1 - x) - 400 * x * (y - x**2)
    dy = 200 * (y - x**2)
    return np.array([dx, dy])


# Optimize from starting point (-1, -1)
params = [np.array([-1.0]), np.array([-1.0])]
optimizer = Adam(params, lr=0.1)

# Track optimization trajectory
trajectory = [(params[0].copy()[0], params[1].copy()[0])]
losses = [rosenbrock(params[0][0], params[1][0])]

for step in range(500):
    x, y = params[0][0], params[1][0]
    grad = rosenbrock_grad(x, y)
    grads = [np.array([grad[0]]), np.array([grad[1]])]
    optimizer.step(grads)

    trajectory.append((params[0].copy()[0], params[1].copy()[0]))
    losses.append(rosenbrock(params[0][0], params[1][0]))
Out[22]:
Console
Optimization Results:
  Starting point: (-1.0, -1.0)
  Final point: (0.725187, 0.524829)
  Optimal point: (1.0, 1.0)
  Final loss: 7.563605e-02
  Distance to optimum: 5.489168e-01

Adam makes strong progress on this difficult landscape: starting from a loss of 404 at $(-1, -1)$, it drives the loss below 0.1 within 500 steps and moves most of the way along the curved valley toward the minimum at $(1, 1)$. It has not fully converged in this run, but the adaptive learning rates clearly handle the valley's awkward curvature, and more iterations (or a tuned learning rate) would close the remaining gap.

Visualizing the Optimization Trajectory

To understand how Adam navigates the Rosenbrock landscape, let's visualize both the trajectory through parameter space and the loss curve over time:

Out[23]:
Visualization
Contour plot of Rosenbrock function with Adam optimization trajectory overlaid.
Adam navigates the curved Rosenbrock valley efficiently. The trajectory shows the optimizer starting at (-1, -1), curving through the valley floor, and heading toward the global minimum at (1, 1).
Line plot showing loss decrease over 500 optimization steps.
Loss decreases rapidly over the first 100 steps, then continues slowly improving as Adam navigates the narrow valley toward the minimum.

The trajectory reveals Adam's strategy. Rather than oscillating across the narrow valley like simple gradient descent would, Adam quickly finds the valley floor and then follows it toward the minimum. The adaptive learning rates are essential here: the yy direction has much steeper curvature than the xx direction, and Adam automatically uses different step sizes for each.

The loss curve shows characteristic behavior. There's an initial rapid decrease as Adam approaches the valley, followed by slower but steady progress along the valley floor. The log scale reveals that Adam continues making progress even when the loss appears to plateau on a linear scale.

Comparing with SGD and Momentum

How does Adam compare to simpler optimizers? Let's benchmark against SGD and SGD with momentum on the same problem:

In[24]:
Code
class SGD:
    """Basic SGD optimizer."""

    def __init__(self, params, lr=0.001):
        self.params = params
        self.lr = lr

    def step(self, grads):
        for param, grad in zip(self.params, grads):
            param -= self.lr * grad


class SGDMomentum:
    """SGD with momentum."""

    def __init__(self, params, lr=0.001, momentum=0.9):
        self.params = params
        self.lr = lr
        self.momentum = momentum
        self.velocity = [np.zeros_like(p) for p in params]

    def step(self, grads):
        for i, (param, grad) in enumerate(zip(self.params, grads)):
            self.velocity[i] = self.momentum * self.velocity[i] + grad
            param -= self.lr * self.velocity[i]


def optimize(optimizer_class, optimizer_kwargs, n_steps=500):
    """Run optimization and return trajectory and losses."""
    params = [np.array([-1.0]), np.array([-1.0])]
    optimizer = optimizer_class(params, **optimizer_kwargs)

    trajectory = [(params[0].copy()[0], params[1].copy()[0])]
    losses = [rosenbrock(params[0][0], params[1][0])]

    for _ in range(n_steps):
        x, y = params[0][0], params[1][0]
        grad = rosenbrock_grad(x, y)
        grads = [np.array([grad[0]]), np.array([grad[1]])]
        optimizer.step(grads)

        trajectory.append((params[0].copy()[0], params[1].copy()[0]))
        losses.append(rosenbrock(params[0][0], params[1][0]))

    return trajectory, losses


# Run all optimizers
results = {
    "SGD": optimize(SGD, {"lr": 0.0001}, n_steps=1000),
    "SGD+Momentum": optimize(
        SGDMomentum, {"lr": 0.0001, "momentum": 0.9}, n_steps=1000
    ),
    "Adam": optimize(Adam, {"lr": 0.1}, n_steps=500),
}
Out[25]:
Visualization
Contour plot with three optimizer trajectories overlaid showing different paths to the minimum.
Optimization trajectories reveal different optimizer behaviors. SGD makes slow, steady progress. Momentum accelerates but oscillates. Adam finds an efficient path through the valley.
Loss curves for three optimizers showing convergence speed comparison.
Loss curves show Adam dropping rapidly within the first 100 steps, well ahead of plain SGD, with SGD plus momentum catching up only over many more iterations.

The comparison reveals Adam's advantages. With comparable learning rates, SGD would barely move because it can't handle the different scales across the two dimensions. With a larger learning rate, it would oscillate wildly. Momentum helps by building up velocity in consistent directions, but it still struggles with the changing curvature. Adam's per-parameter adaptation allows it to use a much larger effective learning rate while maintaining stability.

Quantifying the Difference

Let's measure the concrete performance difference between optimizers:

Out[26]:
Console
Optimizer Performance Comparison:
------------------------------------------------------------
Optimizer       Steps    Final Loss      Distance to Opt
------------------------------------------------------------
SGD             1001     8.723288e-01    1.365874e+00   
SGD+Momentum    1001     1.876361e-03    9.861708e-02   
Adam            501      7.563605e-02    5.489168e-01   

Adam reaches a loss roughly an order of magnitude lower than plain SGD in half the iterations, because its adaptive scaling lets it use a nominal learning rate 1000× larger than what SGD can tolerate without diverging. SGD with momentum ultimately attains the lowest loss here, but only after 1000 steps with a carefully chosen small learning rate; Adam makes most of its progress in the first hundred steps with far less tuning.

Adam Hyperparameters

Adam has four hyperparameters, but in practice only the learning rate usually requires tuning. The default values work well for most problems.

Learning rate ($\alpha$): The most important hyperparameter. Common values range from 0.0001 to 0.01, with 0.001 as a typical starting point for deep learning. Unlike SGD, Adam is relatively robust to learning rate choice due to its adaptive scaling, but extreme values can still cause problems. Too high leads to instability; too low leads to slow convergence.

First moment decay ($\beta_1$): Controls momentum. The default value of 0.9 works well for most problems. Lower values (0.8) reduce momentum and make the optimizer more responsive to recent gradients. Higher values (0.95, 0.99) increase smoothing but may slow adaptation to changing loss landscapes. For problems with very noisy gradients, consider increasing $\beta_1$.

Second moment decay ($\beta_2$): Controls the learning rate adaptation timescale. The default of 0.999 provides stable second moment estimates. Lower values (0.99, 0.9) adapt learning rates more quickly but can be unstable. For sparse gradient problems (like NLP with large vocabularies), the default 0.999 is usually appropriate.

Epsilon ($\epsilon$): Numerical stability constant. The default $10^{-8}$ is almost always fine. Some frameworks use $10^{-7}$ or $10^{-6}$. Larger values add a minimum "floor" to the effective learning rate, which can help when second moments become very small.

In[27]:
Code
# Demonstrate effect of different learning rates
learning_rates = [0.01, 0.1, 0.5]
lr_results = {}

for lr in learning_rates:
    params = [np.array([-1.0]), np.array([-1.0])]
    optimizer = Adam(params, lr=lr)

    losses = [rosenbrock(params[0][0], params[1][0])]
    for _ in range(300):
        x, y = params[0][0], params[1][0]
        grad = rosenbrock_grad(x, y)
        grads = [np.array([grad[0]]), np.array([grad[1]])]
        optimizer.step(grads)
        losses.append(rosenbrock(params[0][0], params[1][0]))

    lr_results[lr] = losses
Out[28]:
Visualization
Loss curves for Adam with different learning rates showing trade-off between speed and stability.
Learning rate affects convergence speed and stability in Adam. Higher rates converge faster initially but may oscillate near the minimum. Lower rates are more stable but slower to converge.
Out[29]:
Console
Learning Rate Comparison Results:
--------------------------------------------------
  α = 0.01: Final loss = 1.537509e+00
  α = 0.1: Final loss = 1.536032e-01
  α = 0.5: Final loss = 8.385772e-02

After 300 steps on this problem, the highest rate of 0.5 ends with the lowest loss, 0.1 is close behind, and 0.01 is roughly an order of magnitude worse. Aggressive rates can still oscillate or diverge on other problems, so the extra speed is not free. For most deep learning tasks, starting with 0.001 and adjusting based on training dynamics is a practical approach.

Effect of β₁ on Momentum

While the learning rate is the primary tuning knob, β₁ controls how much the optimizer relies on past gradients versus the current one. Let's visualize this trade-off:

Out[30]:
Visualization
Line plot showing first moment evolution with different beta1 values.
Effect of β₁ on first moment dynamics. Lower β₁ values respond quickly to gradient changes but are noisier. Higher values provide smoother updates but adapt more slowly to changing conditions.

When the gradient direction changes abruptly at step 50, lower β₁ values adapt within a few steps while higher values take tens of steps to reverse direction. For training scenarios with non-stationary objectives (like curriculum learning), consider reducing β₁ for faster adaptation.
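The experiment behind this figure can be reproduced in a few lines. A minimal sketch, assuming a noisy gradient whose underlying direction flips from -1 to +1 at step 50 (variable names are illustrative):

np.random.seed(7)
n_flip = 100
direction = np.where(np.arange(n_flip) < 50, -1.0, 1.0)    # underlying direction flips at step 50
noisy_grads = direction + np.random.randn(n_flip) * 0.3

for b1 in [0.5, 0.9, 0.99]:
    m_trace = compute_first_moment(noisy_grads, beta1=b1)
    post_flip = m_trace[50:]
    if np.any(post_flip > 0):
        lag = int(np.argmax(post_flip > 0))
        print(f"beta1 = {b1}: first moment turns positive {lag} steps after the flip")
    else:
        print(f"beta1 = {b1}: first moment is still negative 50 steps after the flip")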

Convergence Properties

Adam's theoretical convergence properties have been studied extensively. Under certain conditions (convex loss, bounded gradients, appropriate learning rate decay), Adam converges to the optimum. However, several practical considerations affect real-world performance.

When Adam Excels

Adam performs particularly well in several scenarios:

  • Sparse gradients: NLP models with large vocabularies, where most embedding gradients are zero on any given batch
  • Non-stationary objectives: Training with data augmentation or curriculum learning where the effective loss changes over time
  • Noisy gradients: Small batch sizes where gradient estimates have high variance
  • Different parameter scales: Models combining embeddings, convolutions, and dense layers with very different gradient magnitudes

Known Issues

Adam has some documented failure modes:

  • Poor generalization: Empirically, Adam sometimes generalizes worse than SGD with momentum on image classification, possibly due to converging to sharper minima
  • Weight decay interaction: Standard L2 regularization doesn't work correctly with Adam due to the adaptive learning rates (this motivates AdamW, covered in the next chapter)
  • Non-convergence cases: Reddi et al. (2018) showed examples where Adam fails to converge, leading to AMSGrad, a variant that maintains maximum second moments

In[31]:
Code
# Demonstrate the flat minimum vs sharp minimum concept
x = np.linspace(-3, 3, 1000)

# Two loss landscapes: flat and sharp minima
flat_loss = 0.1 * x**2  # Flat minimum
sharp_loss = x**2 + 0.5 * np.sin(10 * x)  # Sharp minimum with local structure

# Perturbation to show generalization
perturbation = 0.3
flat_perturbed = 0.1 * (x + perturbation) ** 2
sharp_perturbed = (x + perturbation) ** 2 + 0.5 * np.sin(
    10 * (x + perturbation)
)
Out[32]:
Visualization
Plot showing a flat minimum loss landscape and its behavior under perturbation.
A flat minimum remains low under small perturbations, suggesting good generalization to slightly different data distributions.
Plot showing a sharp minimum loss landscape and its sensitivity to perturbation.
A sharp minimum is sensitive to perturbations. Small shifts in the input distribution can cause significant loss increases, indicating potential overfitting.

The flat vs. sharp minimum distinction matters for generalization. Test data comes from a slightly different distribution than training data, analogous to the perturbation in the figure. Models that find flat minima are more robust to this distribution shift.

Using Adam in PyTorch

In practice, you'll use framework implementations. PyTorch's Adam is highly optimized and handles all the bookkeeping automatically:

In[33]:
Code
import torch
import torch.nn as nn


# Simple neural network for demonstration
class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(10, 50)
        self.fc2 = nn.Linear(50, 1)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        return self.fc2(x)


# Create model and optimizer
model = SimpleNet()
optimizer = torch.optim.Adam(
    model.parameters(), lr=0.001, betas=(0.9, 0.999), eps=1e-8
)

# Generate synthetic data
torch.manual_seed(42)
X = torch.randn(100, 10)
y = torch.randn(100, 1)

# Training loop example
losses = []
for epoch in range(100):
    optimizer.zero_grad()
    output = model(X)
    loss = nn.MSELoss()(output, y)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
Out[34]:
Console
PyTorch Adam Training Results:
  Initial loss: 1.1062
  Final loss: 0.6692
  Improvement: 39.5%

The network reduces its loss substantially over 100 epochs. The improvement percentage shows how much the model has learned from the synthetic data. In practice, you would monitor both training and validation loss to detect overfitting.

Inspecting Optimizer State

One advantage of PyTorch is easy access to the optimizer's internal state. Let's examine the moment estimates for different layers:

In[35]:
Code
# Extract optimizer state for analysis
state_analysis = {}
for name, param in model.named_parameters():
    if param in optimizer.state:
        state = optimizer.state[param]
        state_analysis[name] = {
            "exp_avg_mean": state["exp_avg"].mean().item(),
            "exp_avg_std": state["exp_avg"].std().item(),
            "exp_avg_sq_mean": state["exp_avg_sq"].mean().item(),
        }
Out[36]:
Console
Optimizer State After Training:
-----------------------------------------------------------------
Layer                m mean       m std        v mean      
-----------------------------------------------------------------
fc1.weight           -0.000065    0.006288     0.000007    
fc1.bias             0.000334     0.003029     0.000005    
fc2.weight           -0.002503    0.038516     0.000274    
fc2.bias             0.002117     nan          0.000970    

The state reveals interesting patterns. Weights and biases have different moment statistics, reflecting their different gradient characteristics during training. (The nan standard deviation for fc2.bias is expected: that bias is a single scalar, so a standard deviation across its elements is undefined.) The first layer (fc1) and second layer (fc2) also show distinct patterns, demonstrating how Adam maintains per-parameter adaptation throughout the network.

Out[37]:
Visualization
Loss curve showing neural network training progress with Adam optimizer.
Training a simple neural network with PyTorch's Adam optimizer. Loss decreases smoothly over 100 epochs, demonstrating Adam's stable convergence behavior.

PyTorch's implementation includes several optimizations beyond our simple version, including fused CUDA kernels for GPU training and memory-efficient gradient handling. For production use, always prefer the framework implementation.

Limitations and Practical Considerations

While Adam is effective, it isn't perfect. Understanding its limitations helps you know when to consider alternatives.

The most significant practical issue is Adam's interaction with weight decay regularization. Standard L2 regularization adds a term $\lambda \|\theta\|^2$ to the loss, which adds $2\lambda\theta$ to the gradient, where $\lambda$ is the regularization strength and $\theta$ represents the model parameters. But Adam's adaptive learning rates interfere with this: parameters with large gradients have large second moments $v_t$, which reduces their effective learning rate. This means the regularization gradient $2\lambda\theta$ is also scaled down, so parameters with large gradients effectively receive less regularization. This motivated the development of AdamW, which decouples weight decay from the gradient computation.

Another limitation is memory usage. Adam stores two additional values (first and second moments) per parameter, tripling memory requirements compared to SGD. For very large models, this can be a significant constraint. Optimizers like Adafactor address this by factorizing the moment matrices.
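To put the memory overhead in perspective, here is a back-of-the-envelope sketch; the one-billion-parameter model and fp32 optimizer state are illustrative assumptions:

n_params = 1_000_000_000                     # hypothetical 1B-parameter model
bytes_per_value = 4                          # fp32

params_gb = n_params * bytes_per_value / 1e9
adam_state_gb = 2 * n_params * bytes_per_value / 1e9   # m and v for every parameter

print(f"Model parameters:         {params_gb:.0f} GB")
print(f"Extra Adam state (m, v):  {adam_state_gb:.0f} GB")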

Adam's default hyperparameters work well across many problems, but they aren't universally optimal. For some tasks, especially those where SGD ultimately achieves better generalization, Adam may converge quickly to a suboptimal solution. In these cases, practitioners sometimes use Adam for initial rapid progress, then switch to SGD for fine-tuning.

Key Parameters

When using Adam in practice, these are the parameters you'll encounter most frequently:

lr (learning rate, $\alpha$): Controls the overall step size. Start with 0.001 for most deep learning tasks. Increase to 0.01 for faster initial convergence on simpler problems, or decrease to 0.0001 for fine-tuning pretrained models. This is the only hyperparameter that typically requires tuning.

betas (momentum coefficients): A tuple of ($\beta_1$, $\beta_2$) controlling the decay rates for first and second moment estimates. The defaults (0.9, 0.999) work well for most problems. Consider increasing $\beta_1$ to 0.95 for very noisy gradients, or decreasing $\beta_2$ to 0.99 for sparse gradient scenarios.

eps (epsilon): Numerical stability constant added to the denominator. The default of $10^{-8}$ is appropriate for most cases. Increase to $10^{-6}$ or $10^{-4}$ if you observe numerical instability with very small second moments.

weight_decay: In standard Adam, this applies L2 regularization through the gradient (not recommended). For proper weight decay, use AdamW instead, which decouples the regularization from the adaptive learning rate mechanism.
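As a quick preview of the next chapter, switching to decoupled weight decay in PyTorch is a one-line change; the weight_decay value here is only an illustrative choice:

# AdamW applies weight decay directly to the parameters rather than through the gradient
optimizer = torch.optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.01)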

Summary

Adam combines momentum and adaptive learning rates into a single, robust optimizer. By tracking exponential moving averages of gradients (first moment) and squared gradients (second moment), it automatically scales updates for each parameter based on its gradient history.

Key takeaways from this chapter:

  • Exponential moving averages smooth noisy signals by blending new observations with historical estimates, with the decay rate $\beta$ controlling the trade-off between stability and responsiveness
  • First moment estimation ($m_t$) provides momentum, accumulating velocity in consistent gradient directions
  • Second moment estimation ($v_t$) tracks gradient variance, enabling per-parameter learning rate adaptation
  • Bias correction compensates for zero initialization, ensuring accurate moment estimates from the first step
  • The Adam update divides the bias-corrected first moment by the square root of the bias-corrected second moment, naturally scaling step sizes
  • Hyperparameters include learning rate ($\alpha$, typically 0.001), first moment decay ($\beta_1$, typically 0.9), second moment decay ($\beta_2$, typically 0.999), and numerical stability constant ($\epsilon$, typically $10^{-8}$)
  • Adaptive learning rates make Adam robust across parameters with different gradient scales, often working well without extensive tuning

Adam's practical success made it the default optimizer for much of deep learning. However, its interaction with weight decay regularization led to the development of AdamW, which we'll explore in the next chapter.

