Master SGD optimization for neural networks, including minibatch training, learning rate schedules, and how gradient noise acts as implicit regularization.

This article is part of the free-to-read Language AI Handbook
Stochastic Gradient Descent
Training a neural network means finding weights that minimize a loss function. We've seen how backpropagation computes gradients, telling us which direction to move each weight to reduce the loss. But computing these gradients over an entire dataset is expensive. With millions of training examples, even a single gradient computation becomes prohibitively slow.
Stochastic Gradient Descent (SGD) solves this problem with a simple but powerful insight: we don't need perfect gradients. We just need gradients that point roughly in the right direction. By using small random samples instead of the full dataset, we trade precision for speed, often getting orders of magnitude faster training with minimal impact on the final result.
From Batch to Stochastic: The Core Idea
To understand why we need stochastic gradient descent, let's first examine what we're replacing and why the replacement works at all.
The Ideal: Computing the True Gradient
Imagine you want to know the average height of adults in a country. The most accurate approach is to measure everyone. Similarly, standard gradient descent, often called batch gradient descent, computes the loss gradient by averaging over every training example:
$$\nabla L(\mathbf{w}) = \frac{1}{N} \sum_{i=1}^{N} \nabla L_i(\mathbf{w})$$

where:
- $\nabla L(\mathbf{w})$: the gradient of the total loss with respect to all model parameters
- $N$: the total number of training examples in the dataset
- $\nabla L_i(\mathbf{w})$: the gradient computed from the $i$-th training example alone
- $\frac{1}{N}$: averaging factor that normalizes the sum
This formula computes the exact direction of steepest descent. Each training example contributes its own "opinion" about which way to adjust the weights, and averaging produces a consensus direction. The result is precise, but precision comes at a cost: you must process the entire dataset before taking a single optimization step.
For a dataset of one million images, that means one million forward passes, one million backward passes, and one million gradient computations, all before you can update a single weight. At this scale, batch gradient descent becomes impractical.
Batch gradient descent computes the gradient using the entire training set. This provides an accurate estimate of the true gradient direction but becomes computationally infeasible for large datasets.
The Insight: A Sample Can Substitute for the Population
Here's the key insight that makes stochastic gradient descent possible: you don't need to measure everyone to estimate the average.
Polling organizations don't survey every citizen; they sample a representative subset. The sample average is close to the true average, and more samples get you closer. The same logic applies to gradients. Instead of computing $\nabla L_i(\mathbf{w})$ for all $N$ examples and averaging, what if we just... picked one example at random?

$$\nabla L(\mathbf{w}) \approx \nabla L_i(\mathbf{w}), \quad i \sim \text{Uniform}\{1, \dots, N\}$$

where:
- $\nabla L_i(\mathbf{w})$: the gradient computed from a single randomly selected training example
- $i$: a random index, each value equally likely

This single-example gradient is an unbiased estimator of the true gradient. Mathematically, $\mathbb{E}_i[\nabla L_i(\mathbf{w})] = \nabla L(\mathbf{w})$: if you average over all possible random choices, you get the exact batch gradient back. Any individual estimate might be wrong, perhaps dramatically so, but on average, it points in the right direction.
The tradeoff is variance for speed. A single-example gradient might point 45° away from the true direction, or even in the opposite direction for pathological examples. But it costs only $1/N$ as much to compute. With a million examples, that's a million-fold speedup per gradient computation.
Visualizing the Tradeoff: Batch vs. Stochastic Gradients
Let's make this concrete. We'll create a simple linear regression problem and compare the batch gradient (computed from all 1000 examples) with stochastic gradients (each computed from a single example). The difference reveals exactly what we gain and what we sacrifice.
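The sketch below sets up such a problem. It assumes a hypothetical two-parameter linear model with ground-truth weights `w_true`; the weight values and noise level are illustrative choices, not taken from the original figures:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative linear regression problem: 1000 examples, 2 parameters
N = 1000
w_true = np.array([2.0, -3.0])                   # hypothetical ground-truth weights
X = rng.normal(size=(N, 2))                      # input features
y = X @ w_true + rng.normal(scale=0.5, size=N)   # targets with observation noise

def per_example_gradient(w, x_i, y_i):
    """Gradient of the squared error for a single training example."""
    return 2.0 * x_i * (x_i @ w - y_i)

def batch_gradient(w, X, y):
    """Exact gradient of the mean squared error over the full dataset."""
    return 2.0 * X.T @ (X @ w - y) / len(y)
```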
Now let's compute both types of gradients at the same point in parameter space and see how they compare:
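A minimal comparison at an arbitrary probe point, reusing the helpers above (the probe point and the number of sampled examples are arbitrary choices):

```python
w_probe = np.array([0.0, 0.0])    # arbitrary point in parameter space, far from w_true

g_batch = batch_gradient(w_probe, X, y)

# A handful of single-example (stochastic) gradients at the same point
idx = rng.choice(N, size=50, replace=False)
g_stochastic = np.array([per_example_gradient(w_probe, X[i], y[i]) for i in idx])

print("batch gradient:          ", g_batch)
print("mean stochastic gradient:", g_stochastic.mean(axis=0))
print("per-example std:         ", g_stochastic.std(axis=0))
```

Individual rows of `g_stochastic` scatter widely around `g_batch`, but their mean lands close to it.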
The visualization reveals the fundamental tradeoff at the heart of SGD. The batch gradient (red arrow) points precisely toward the optimum, getting the direction exactly right. The individual stochastic gradients (gray arrows) scatter wildly, some nearly orthogonal to the true direction, a few even pointing away from the optimum entirely. Any single gray arrow would be a poor substitute for the red one.
But look at the blue dashed arrow: the average of those scattered stochastic gradients. It aligns almost perfectly with the true batch gradient. This is not coincidence. It's the central limit theorem at work. Each stochastic gradient is an unbiased sample, so their average converges to the true gradient. The more samples you average, the closer you get.
This is the mathematical foundation of SGD: each step is imprecise, but the journey is correct on average.
The SGD Update Rule
Now that we understand why stochastic gradients work, let's formalize how we use them. At each iteration, SGD takes a step in the direction opposite to the gradient:
$$\mathbf{w}_{t+1} = \mathbf{w}_t - \eta \, \nabla L_i(\mathbf{w}_t)$$

where:
- $\mathbf{w}_t$: the weight vector at iteration $t$
- $\mathbf{w}_{t+1}$: the updated weight vector after iteration $t$
- $\eta$: the learning rate (step size), a positive scalar controlling how far we move
- $\nabla L_i(\mathbf{w}_t)$: the gradient computed from a single randomly selected example
Understanding Each Component
Why subtract? Gradients point "uphill," toward increasing loss. Since we want to decrease the loss, we move in the opposite direction. Subtracting the gradient is moving downhill.
Why scale by $\eta$? The gradient tells us the direction, but not how far to go. The learning rate controls step size. Too small, and we inch toward the minimum over thousands of steps. Too large, and we overshoot, potentially bouncing around forever or diverging entirely. Finding the right $\eta$ is one of the most important (and frustrating) aspects of training neural networks.
Why use $\nabla L_i$ instead of $\nabla L$? This is what makes it stochastic. Using the full gradient gives batch gradient descent; using a single-example gradient gives pure SGD. The update formula is identical, only the gradient source changes.
The Training Loop: Epochs and Shuffling
In practice, we don't pick random examples with replacement. Instead, we shuffle the training set once, then iterate through every example in order. Once we've used every example once, that's one epoch. Then we shuffle again and repeat:
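A minimal epoch-based training loop in this spirit, reusing `per_example_gradient` and the toy data from above (the learning rate and epoch count are illustrative):

```python
def sgd_train(X, y, lr=0.01, epochs=20, seed=0):
    """Pure SGD: one weight update per training example, reshuffling every epoch."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for epoch in range(epochs):
        order = rng.permutation(len(y))                       # fresh shuffle each epoch
        for i in order:
            w -= lr * per_example_gradient(w, X[i], y[i])     # one update per example
        loss = np.mean((X @ w - y) ** 2)
        print(f"epoch {epoch + 1:2d}  loss {loss:.4f}")
    return w

w_sgd = sgd_train(X, y)
```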
The results speak for themselves. SGD recovers weights extremely close to the ground truth, with the final loss dropping dramatically from its initial value. But here's the key insight: we made 1000 weight updates per epoch (one per training example), whereas batch gradient descent would make only one. Even though each SGD update is noisier than a batch update, we get 1000 chances to improve per epoch instead of just one.
This is why SGD often converges faster in wall-clock time despite being "noisier" per step. The total distance traveled toward the optimum per unit of computation is greater.
Visualizing the Optimization Path
Let's compare the actual trajectory through weight space for batch gradient descent vs. SGD. We'll run both starting from the same initial point and watch how they approach the optimum:
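A sketch of how such trajectories can be recorded on the same toy problem; the starting point, learning rate, and step count are arbitrary, and note that each SGD step here costs roughly $1/N$ of a batch step:

```python
def record_path(step_fn, w0, steps):
    """Apply an update function repeatedly, recording the weights after every step."""
    w, path = w0.copy(), [w0.copy()]
    for _ in range(steps):
        w = step_fn(w)
        path.append(w.copy())
    return np.array(path)

lr = 0.02
w0 = np.array([-4.0, 4.0])             # hypothetical shared starting point
rng = np.random.default_rng(1)

def batch_step(w):                      # one exact-gradient update
    return w - lr * batch_gradient(w, X, y)

def sgd_step(w):                        # one single-example update
    i = rng.integers(len(y))
    return w - lr * per_example_gradient(w, X[i], y[i])

path_batch = record_path(batch_step, w0, steps=100)   # smooth, direct path
path_sgd = record_path(sgd_step, w0, steps=100)       # noisy, zigzagging path
```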
The visualization reveals the fundamental difference in how these methods explore the loss landscape. Batch gradient descent (blue) takes measured, deliberate steps directly toward the minimum, with each step using perfect information from the entire dataset. SGD (red) wanders drunkenly, zigzagging across the landscape as different training examples pull it in different directions.
Yet both reach the same destination. And despite its erratic path, SGD made many more updates in the same computational budget. This is the core tradeoff: precision vs. speed.
Minibatch Gradient Descent: The Best of Both Worlds
We've now seen two extremes:
- Batch gradient descent: Use all examples. Precise gradients, but one update per full dataset pass.
- Pure SGD: Use 1 example. Fast updates, but extremely noisy gradients.
In practice, neither extreme is ideal. Batch is too slow; pure SGD is too noisy and can't exploit modern hardware. The solution? Meet in the middle.
The Minibatch Compromise
Instead of all $N$ examples or just one, we use a small random subset, a minibatch, of $B$ examples:

$$\nabla L_{\mathcal{B}}(\mathbf{w}) = \frac{1}{B} \sum_{i \in \mathcal{B}} \nabla L_i(\mathbf{w})$$

where:
- $\mathcal{B}$: a minibatch, a random subset of $B$ training examples
- $B$: the batch size (typically 32, 64, 128, or 256)
- $\nabla L_i(\mathbf{w})$: the gradient from the $i$-th example in the minibatch
- $\frac{1}{B}$: averaging factor that normalizes the sum
This is the same averaging formula as batch gradient descent, just applied to a smaller sample. If $B = N$, we get batch gradient descent. If $B = 1$, we get pure SGD. For $B$ somewhere in between, we get the advantages of both:
- Reduced variance: Averaging over $B$ examples smooths out the noise. The variance of the gradient estimate decreases as $1/B$.
- Efficient computation: GPUs are designed for parallel matrix operations. A minibatch of 64 examples can be processed almost as fast as a single example because the matrix multiplications happen in parallel.
- Practical balance: We still make $N/B$ updates per epoch (for 1000 examples and $B = 32$, that's about 31 updates), so we iterate quickly without extreme noise.
Minibatch gradient descent computes gradients using a small random subset of training examples (typically 32-256). This provides more stable gradients than pure SGD while maintaining computational efficiency.
Implementing Minibatch SGD
The implementation is nearly identical to pure SGD, but we process examples in groups rather than one at a time:
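A sketch of that loop on the same toy regression problem (the batch slicing is the only structural change from `sgd_train` above):

```python
def minibatch_sgd_train(X, y, lr=0.05, batch_size=32, epochs=20, seed=0):
    """Minibatch SGD: same update rule, gradients averaged over small slices of data."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    n = len(y)
    for epoch in range(epochs):
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]
            X_batch, y_batch = X[batch], y[batch]
            g = 2.0 * X_batch.T @ (X_batch @ w - y_batch) / len(batch)  # averaged gradient
            w -= lr * g
    return w

w_mb = minibatch_sgd_train(X, y)
```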
The key difference from pure SGD is in the loop structure: instead of iterating over individual indices, we iterate in steps of batch_size and slice out groups of examples. The gradient computation itself is unchanged; we just pass a matrix X_batch instead of a single row.
Let's compare how different batch sizes affect convergence:
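One way to generate such curves on the toy problem, tracking the full-dataset loss after every epoch (the batch sizes, learning rate, and epoch count are illustrative):

```python
def loss_curve(X, y, batch_size, lr=0.01, epochs=30, seed=0):
    """Full-dataset MSE after each epoch of minibatch SGD with a given batch size."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    curve = []
    for _ in range(epochs):
        order = rng.permutation(len(y))
        for start in range(0, len(y), batch_size):
            batch = order[start:start + batch_size]
            g = 2.0 * X[batch].T @ (X[batch] @ w - y[batch]) / len(batch)
            w -= lr * g
        curve.append(np.mean((X @ w - y) ** 2))
    return curve

curves = {B: loss_curve(X, y, B) for B in (1, 16, 64, 256)}
```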
The curves reveal an illuminating pattern:
- Small batches (B=1, B=16) converge faster initially because they take more update steps per epoch, covering more ground early. But the path is jagged, reflecting high gradient variance.
- Large batches (B=64, B=256) are smoother, with less oscillation. But they make fewer updates per epoch, so early progress is slower.
- All reach similar final losses, given enough epochs. The journey differs, but the destination is the same.
This is the batch size tradeoff in action: more noise vs. more updates. Neither extreme is optimal, which is why batch sizes in the 32-256 range are the practical sweet spot.
Why These Specific Batch Sizes?
You'll notice that batch sizes are almost always powers of 2: 32, 64, 128, 256. This isn't superstition; it's hardware optimization. GPUs process data in parallel using memory layouts that align to powers of 2. A batch of 64 examples may process in the same time as a batch of 50, simply because 64 fits the hardware's natural granularity.
Beyond hardware, batch size affects learning dynamics:
- Too small (B < 16): Gradient variance is high, often requiring smaller learning rates to compensate. Training becomes erratic.
- Sweet spot (B = 32-256): Variance is reduced enough for stable training, while still making many updates per epoch. Memory usage fits comfortably on most GPUs.
- Too large (B > 1024): Diminishing returns on gradient quality (variance doesn't decrease much beyond a point), and research suggests very large batches may find sharper minima that generalize worse.
Updates per Epoch: The Hidden Tradeoff
Beyond noise, batch size affects how many times we update weights per epoch. Smaller batches mean more updates per pass through the data:
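The count is simply the number of minibatches that fit into one pass over the data:

```python
n_examples = 1000
for batch_size in (1, 16, 64, 256):
    updates_per_epoch = int(np.ceil(n_examples / batch_size))
    print(f"B = {batch_size:4d}  ->  {updates_per_epoch} updates per epoch")
```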
With batch size 1 (pure SGD), we make 1000 updates per epoch, one per example. With batch size 256, we make only about 4 updates. This is why small batches often converge faster in terms of epochs: each epoch does more optimization work. But each individual update is noisier, so the tradeoff isn't straightforward.
Quantifying the Variance Reduction
The central limit theorem predicts that gradient variance should decrease as $1/B$: double the batch size, halve the variance. Let's verify this empirically:
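A sketch of one way to measure this: hold the weights fixed at a probe point and resample minibatches many times (the trial count and probe point are arbitrary choices):

```python
def gradient_variance(w, X, y, batch_size, n_trials=500, seed=0):
    """Empirical variance of the minibatch gradient estimate at a fixed point w."""
    rng = np.random.default_rng(seed)
    grads = []
    for _ in range(n_trials):
        batch = rng.choice(len(y), size=batch_size, replace=False)
        g = 2.0 * X[batch].T @ (X[batch] @ w - y[batch]) / batch_size
        grads.append(g)
    return np.var(np.array(grads), axis=0).sum()   # total variance across components

w_probe = np.array([0.0, 0.0])
for B in (1, 4, 16, 64, 256):
    print(f"B = {B:4d}  variance ≈ {gradient_variance(w_probe, X, y, B):.4f}")
```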
The log-log plot confirms the $1/B$ relationship: a straight line with slope $-1$. Doubling the batch size halves the variance, exactly as the central limit theorem predicts.
Let's visualize this variance reduction more intuitively by plotting the distribution of gradient estimates for different batch sizes:
The violin plots make the variance reduction visceral. With B=1, gradient estimates scatter wildly, some far from the true value (green dashed line). As batch size increases, the distribution narrows dramatically. By B=256, estimates cluster tightly around the truth. This is why larger batches allow larger learning rates: the gradient you compute is much closer to the true direction.
This relationship has a profound practical consequence: if you double the batch size, you can often double the learning rate while maintaining training stability. The variance reduction from larger batches compensates for the larger steps. This "linear scaling rule" is widely used when training on multiple GPUs, where larger effective batch sizes are natural.
Learning Rate: The Critical Hyperparameter
We've discussed batch size as a dial we can tune. But the learning rate is the critical hyperparameter, the one that makes or breaks training. Get it wrong, and your model either learns nothing or explodes.
The Goldilocks Problem
Unlike batch size, where 32-256 usually works fine, learning rate is problem-specific. A value that works perfectly for one model might cause another to diverge. Here's what happens at different settings:
- Too small: Each step is a timid shuffle toward the minimum. Training is stable and won't diverge, but it's glacially slow. You might need thousands of epochs to converge, and you may get stuck in shallow local minima along the way.
- Just right: Each step makes meaningful progress without overshooting. The loss decreases steadily, and you reach a good solution in a reasonable number of epochs.
- Too large: Each step overshoots the minimum, landing on the other side of the valley. The next step overshoots again. The loss oscillates wildly, or worse, increases exponentially until numerical overflow kills training.
Finding the right learning rate manually is tedious. The learning rate finder technique automates this by sweeping across learning rates and tracking when the loss starts increasing:
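A minimal sketch of the idea on the toy problem: sweep the learning rate geometrically, take one minibatch step at each value, and record the loss. The sweep range and batch size are assumptions:

```python
def lr_finder(X, y, lr_min=1e-5, lr_max=1.0, n_steps=100, batch_size=32, seed=0):
    """Sweep learning rates from lr_min to lr_max, one minibatch step per value,
    recording the loss after each step. The curve falls, flattens, then explodes."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    lrs = np.geomspace(lr_min, lr_max, n_steps)
    losses = []
    for lr in lrs:
        batch = rng.choice(len(y), size=batch_size, replace=False)
        g = 2.0 * X[batch].T @ (X[batch] @ w - y[batch]) / batch_size
        w -= lr * g
        losses.append(np.mean((X @ w - y) ** 2))
    return lrs, np.array(losses)

lrs, losses = lr_finder(X, y)
```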
The plot shows the classic learning rate finder shape: loss decreases as we increase the learning rate (we're making meaningful progress), then suddenly shoots up when the rate gets too large (we're overshooting). The optimal learning rate is typically 1-10× smaller than where the loss starts increasing, around 0.01-0.05 in this case. This technique, popularized by Leslie Smith, can save hours of manual tuning.
SGD Convergence Properties
We've seen that SGD trades exact gradients for speed. But does this tradeoff come at a cost to where we end up? Understanding SGD's convergence behavior reveals both its power and its subtleties.
The Noisy Path to Convergence
Batch gradient descent, with exact gradients, descends smoothly to the minimum. Each step moves directly downhill, and if the learning rate is small enough, the algorithm converges to a fixed point.
SGD is different. With a constant learning rate, SGD never converges to a point. It oscillates around the minimum, bouncing in a region whose diameter is proportional to $\eta$. The gradient noise prevents it from settling.
This might seem like a fundamental flaw, but it's actually a consequence of a tradeoff. To guarantee exact convergence, we need a decaying learning rate that satisfies two mathematical conditions:
$$\sum_{t=1}^{\infty} \eta_t = \infty \quad \text{and} \quad \sum_{t=1}^{\infty} \eta_t^2 < \infty$$

where:
- $\eta_t$: the learning rate at iteration $t$
- $\sum_{t=1}^{\infty} \eta_t$: the cumulative sum of all learning rates
These conditions may seem abstract, but they encode a precise balance:
- First condition ($\sum_t \eta_t = \infty$): We must be able to travel unbounded total distance. If learning rates decay too fast, like $\eta_t = 1/t^2$, the sum converges to a finite value, and we might stop moving before reaching the minimum. The algorithm gets "stuck" because it runs out of step budget.
- Second condition ($\sum_t \eta_t^2 < \infty$): The accumulated variance must be bounded. Each SGD step has variance proportional to $\eta_t^2$ (from the noisy gradient). If this sum diverges, we bounce around forever, never settling down.
A schedule like $\eta_t = \eta_0 / t$ threads this needle: it decays slowly enough that the sum diverges (we can always reach the goal), but fast enough that the sum of squares converges (we eventually stop bouncing).
From Theory to Practice: Learning Rate Schedules
In practice, few practitioners use theoretically-motivated schedules like $\eta_t = \eta_0 / t$. Instead, empirically-tuned schedules dominate. The key insight is that we want large steps early (when we're far from the optimum and any progress is good) and small steps late (when we're near the optimum and need precision).
Let's implement and compare several common schedules. We define four, each representing a different decay philosophy:
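A sketch of the four, written as functions of the step count (the default constants are illustrative):

```python
def constant_lr(step, lr0=0.1):
    """No decay: the same learning rate at every step."""
    return lr0

def step_decay(step, lr0=0.1, drop=0.5, every=100):
    """Multiply the rate by `drop` every `every` steps (sharp, discrete drops)."""
    return lr0 * drop ** (step // every)

def exponential_decay(step, lr0=0.1, gamma=0.995):
    """Smooth geometric decay: lr0 * gamma^t."""
    return lr0 * gamma ** step

def inverse_time_decay(step, lr0=0.1, k=0.01):
    """1/t-style decay: lr0 / (1 + k*t), the shape the convergence theory suggests."""
    return lr0 / (1.0 + k * step)
```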
The left plot shows how each schedule reduces the learning rate over training steps. The constant schedule stays flat; step decay has sharp drops; exponential and inverse-time decay smoothly.
The right plot reveals the consequence: decaying schedules achieve lower final losses. The constant schedule converges quickly but then oscillates because it can't settle precisely into the minimum with such large steps. The decaying schedules take big steps early (making fast initial progress) then small steps late (settling precisely into the minimum).
Common Learning Rate Schedules in Practice
Several schedules have become standard, each with particular strengths:
| Schedule | Formula | Use Case |
|---|---|---|
| Step decay | $\eta_t = \eta_0 \cdot \gamma^{\lfloor t/s \rfloor}$ | Most common; drop by factor $\gamma$ every $s$ steps |
| Exponential | $\eta_t = \eta_0 \cdot \gamma^t$ | Smooth decay; can be too aggressive |
| Cosine annealing | $\eta_t = \eta_{\min} + \tfrac{1}{2}(\eta_0 - \eta_{\min})\bigl(1 + \cos(\pi t / T)\bigr)$ | Popular for vision models; smooth with warm restarts |
| Linear warmup | $\eta_t = \eta_0 \cdot t / t_{\text{warmup}}$ for the first $t_{\text{warmup}}$ steps, then decay | Essential for transformers; prevents early instability |

In these formulas:
- $\eta_t$: learning rate at step $t$
- $\eta_0$: initial learning rate
- $\gamma$: decay factor (typically 0.1 for step decay, 0.99-0.999 for exponential)
- $s$: step interval for step decay
- $T$: total number of training steps for cosine annealing
- $\eta_{\min}$: minimum learning rate (floor for cosine schedule)
- $t_{\text{warmup}}$: number of warmup steps
Let's visualize all these schedules together, including cosine annealing and warmup:
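Sketches of those two remaining schedules, assuming a known total step budget (the warmup length, initial rate, and floor value are illustrative):

```python
import math

def cosine_annealing(step, total_steps, lr0=0.1, lr_min=1e-4):
    """Smooth cosine curve from lr0 down to lr_min over total_steps."""
    progress = min(step / total_steps, 1.0)
    return lr_min + 0.5 * (lr0 - lr_min) * (1.0 + math.cos(math.pi * progress))

def warmup_then_cosine(step, warmup_steps, total_steps, lr0=0.1, lr_min=1e-4):
    """Linear ramp from near 0 up to lr0 over warmup_steps, then cosine annealing."""
    if step < warmup_steps:
        return lr0 * (step + 1) / warmup_steps
    return cosine_annealing(step - warmup_steps, total_steps - warmup_steps, lr0, lr_min)
```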
The choice of schedule often matters less than having a schedule. The key is reducing the learning rate as training progresses; the specific shape is secondary. That said, each schedule has its niche:
- Cosine annealing is particularly popular for training vision models, as the smooth curve avoids sudden drops that can destabilize training
- Warmup is essential for transformers and models with layer normalization, preventing early training instability
- Step decay remains the workhorse for many applications, offering predictable drops at known epochs
SGD Noise as Implicit Regularization
Everything we've discussed so far treats SGD's noise as a necessary evil, the cost of speed. But here's a surprising fact: the noise actually helps generalization. What looks like a bug is actually a feature.
The Generalization Puzzle
Consider two training runs that achieve the same final loss on the training set. Both fit the data equally well. Yet one might generalize better to new data than the other. Why?
The answer lies in where they ended up in the loss landscape, not just how low they got.
Sharp vs. Flat Minima
Neural network loss landscapes have many local minima, perhaps exponentially many. Some are sharp: the loss increases steeply as you move away from the minimum in any direction. Others are flat: the loss changes gradually, creating a broad basin.
Research suggests that flat minima generalize better. The intuition is straightforward: test data comes from a slightly different distribution than training data. This shifts the loss landscape slightly. If you're in a sharp minimum, even a small shift might push you up a steep wall, dramatically increasing loss. If you're in a flat minimum, the same shift barely matters because you're still near the bottom of a broad basin.
How SGD Noise Finds Flat Minima
Here's where SGD's noise becomes a feature rather than a bug. The gradient noise has a specific structure: it depends on the local curvature of the loss.
Near sharp minima with high curvature, gradient variance is high. The loss changes rapidly in all directions, so different training examples disagree sharply about which way to go. This disagreement translates to noisy, high-variance gradients.
Near flat minima with low curvature, gradient variance is low. The loss is nearly constant in the neighborhood, so all examples agree: "we're in a good spot." Gradients are small and consistent.
The consequence? SGD naturally bounces out of sharp minima while settling into flat ones. The noise is proportional to curvature, so sharp minima are inherently unstable under SGD dynamics.
This relationship can be quantified. The effective noise scale in SGD, measuring how much the optimization path fluctuates, is approximately:
$$\text{noise scale} \approx \frac{\eta}{B}\,\sigma^2$$

where:
- $\eta$: the learning rate (step size)
- $B$: the batch size (number of examples per gradient computation)
- $\sigma^2$: the variance of per-example gradients
- $\frac{\eta}{B}\,\sigma^2$: the noise-to-signal ratio of each update
Larger steps ($\eta$) amplify the noise, while larger batches ($B$) reduce it through averaging. The ratio $\eta / B$ controls the net noise level. This explains several well-known empirical observations:
- Smaller batches → better generalization: More noise helps bounce out of sharp minima into flatter ones
- Larger learning rates → better generalization (up to a point): Same mechanism, more exploration
- The ratio $\eta / B$ matters more than either alone: Doubling batch size while doubling learning rate maintains similar dynamics
The contrast is striking. In the sharp minimum (left), SGD oscillates wildly because the high curvature creates high gradient variance, and the optimizer bounces around forever. In the flat minimum (right), SGD converges stably because the low curvature means low variance, and the optimizer settles down.
This isn't a flaw to be fixed; it's a feature to be exploited. The noise in SGD naturally selects for solutions that generalize well.
A Worked Example: Training a Classifier
We've explored SGD's mechanics through linear regression, a convex problem where the theory is clean. Now let's bring everything together with a more realistic example: training a neural network classifier from scratch using SGD. This will show how batch size, learning rate, and schedules interact when the loss landscape is non-convex.
The Dataset and Model
We'll use the "moons" dataset: two interleaved crescent shapes that require a nonlinear decision boundary. A simple 2-layer neural network with ReLU activations will learn to separate them.
The neural network implements the forward pass (input → hidden → output), the backward pass (computing gradients for each weight), and weight updates. The loss method computes binary cross-entropy, and accuracy measures classification performance.
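A minimal sketch of such a network; the class name `TwoLayerNet`, the hidden width, and the initialization scale are assumptions, not the article's exact implementation:

```python
class TwoLayerNet:
    """Minimal 2-layer MLP (input -> ReLU hidden -> sigmoid output) for binary classification."""

    def __init__(self, n_in=2, n_hidden=32, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(scale=0.5, size=(n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(scale=0.5, size=(n_hidden, 1))
        self.b2 = np.zeros(1)

    def forward(self, X):
        self.h = np.maximum(0.0, X @ self.W1 + self.b1)                   # ReLU hidden layer
        self.p = 1.0 / (1.0 + np.exp(-(self.h @ self.W2 + self.b2)))      # sigmoid output
        return self.p.ravel()

    def loss(self, X, y):
        p = np.clip(self.forward(X), 1e-7, 1 - 1e-7)
        return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))          # binary cross-entropy

    def accuracy(self, X, y):
        return np.mean((self.forward(X) > 0.5) == y)

    def gradients(self, X, y):
        """Backward pass for the cross-entropy loss; returns grads for all four parameters."""
        p = self.forward(X)
        n = len(y)
        dlogits = (p - y).reshape(-1, 1) / n               # dL/d(pre-sigmoid output)
        dW2 = self.h.T @ dlogits
        db2 = dlogits.sum(axis=0)
        dh = (dlogits @ self.W2.T) * (self.h > 0)          # backprop through ReLU
        dW1 = X.T @ dh
        db1 = dh.sum(axis=0)
        return dW1, db1, dW2, db2

    def update(self, grads, lr):
        dW1, db1, dW2, db2 = grads
        self.W1 -= lr * dW1
        self.b1 -= lr * db1
        self.W2 -= lr * dW2
        self.b2 -= lr * db2
```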
Training with SGD
Now let's train this network with SGD, comparing constant vs. decaying learning rates:
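A sketch of that experiment, assuming the `TwoLayerNet` above; the train/test split, learning rate, and decay factor are illustrative:

```python
from sklearn.datasets import make_moons

# Two interleaved crescents; 80/20 train/test split (sizes are illustrative)
X_all, y_all = make_moons(n_samples=1000, noise=0.2, random_state=0)
X_train, y_train = X_all[:800], y_all[:800]
X_test, y_test = X_all[800:], y_all[800:]

def train(net, X, y, epochs=200, lr0=0.5, decay=None, batch_size=32, seed=0):
    """Minibatch SGD training loop with an optional exponential learning rate decay."""
    rng = np.random.default_rng(seed)
    step = 0
    for _ in range(epochs):
        order = rng.permutation(len(y))
        for start in range(0, len(y), batch_size):
            batch = order[start:start + batch_size]
            lr = lr0 if decay is None else lr0 * decay ** step
            net.update(net.gradients(X[batch], y[batch]), lr)
            step += 1
    return net

net_const = train(TwoLayerNet(seed=1), X_train, y_train)                 # constant lr
net_decay = train(TwoLayerNet(seed=1), X_train, y_train, decay=0.9995)   # decaying lr

for name, net in [("constant", net_const), ("decay", net_decay)]:
    print(f"{name:8s}  train acc {net.accuracy(X_train, y_train):.3f}"
          f"  test acc {net.accuracy(X_test, y_test):.3f}")
```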
Both configurations achieve strong test accuracy, and our simple 2-layer network successfully separates the crescents. The generalization gap (train minus test accuracy) is small, indicating the models are not overfitting. Learning rate decay achieves slightly better final performance by taking smaller steps as training progresses, allowing finer adjustments near the optimum.
Visualizing the Learning Process
Let's examine the training dynamics more closely:
The loss curves show a familiar pattern: rapid initial decrease followed by slower refinement. The learning rate decay schedule (red) achieves lower final loss because it can take more precise steps late in training. The accuracy curves show corresponding behavior, with both reaching high accuracy quickly, though the decaying schedule achieves slightly better final values.
The Learned Decision Boundary
Finally, let's visualize what the network actually learned: the decision boundary separating the two classes.
The network learned a smooth, curved boundary that follows the natural shape of the moons. Correctly classified points appear as circles; the rare misclassified points (near the boundary where the classes overlap) appear as X markers. SGD with learning rate decay found weights that generalize well. The boundary isn't overfit to individual training points but captures the underlying pattern.
Complete SGD Implementation
Having explored SGD's theory and applications, let's consolidate everything into a reusable implementation. This SGDOptimizer class encapsulates the key components: learning rate scheduling and minibatch generation.
The class has three main responsibilities:
- `get_lr()`: Returns the current learning rate, applying the schedule if one is provided
- `step_update(params, gradients)`: Applies the SGD update rule $\mathbf{w} \leftarrow \mathbf{w} - \eta\,\nabla L_{\mathcal{B}}(\mathbf{w})$
- `create_batches(X, y)`: Generates shuffled minibatches for one epoch
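A minimal sketch of such a class for NumPy parameter arrays; the constructor arguments and the schedule interface (a callable mapping step count to learning rate) are assumptions:

```python
class SGDOptimizer:
    """Minimal SGD optimizer with optional learning rate scheduling and minibatch generation."""

    def __init__(self, lr=0.01, batch_size=32, schedule=None, seed=0):
        self.lr = lr
        self.batch_size = batch_size
        self.schedule = schedule          # optional callable: step count -> learning rate
        self.step_count = 0
        self.rng = np.random.default_rng(seed)

    def get_lr(self):
        """Current learning rate, applying the schedule if one was provided."""
        if self.schedule is None:
            return self.lr
        return self.schedule(self.step_count)

    def step_update(self, params, gradients):
        """Apply w <- w - lr * g to every parameter array, in place."""
        lr = self.get_lr()
        for p, g in zip(params, gradients):
            p -= lr * g
        self.step_count += 1

    def create_batches(self, X, y):
        """Yield shuffled (X_batch, y_batch) minibatches covering one epoch."""
        order = self.rng.permutation(len(y))
        for start in range(0, len(y), self.batch_size):
            idx = order[start:start + self.batch_size]
            yield X[idx], y[idx]
```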
Here's how you would use the optimizer in a training loop:
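For example, combined with the `TwoLayerNet` and moons data from the worked example above (the exponential-decay schedule shown is an illustrative choice):

```python
optimizer = SGDOptimizer(lr=0.5, batch_size=32, schedule=lambda t: 0.5 * 0.9995 ** t)
net = TwoLayerNet(seed=1)

for epoch in range(200):
    for X_batch, y_batch in optimizer.create_batches(X_train, y_train):
        grads = net.gradients(X_batch, y_batch)
        optimizer.step_update([net.W1, net.b1, net.W2, net.b2], grads)

print("test accuracy:", net.accuracy(X_test, y_test))
```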
The optimizer handles learning rate scheduling internally, so you don't need to manually track the step count. The create_batches method yields shuffled minibatches, ensuring each epoch sees the data in a different order (important for avoiding cyclic patterns that can hurt convergence).
Limitations and Challenges
SGD is remarkably effective. It's trained virtually every neural network you've ever used. But it has well-known limitations that motivate the more sophisticated optimizers we'll study next.
One Learning Rate for All Dimensions
The "optimal" learning rate varies across different dimensions of the parameter space. Dimensions with large gradients (steep loss surface) want small learning rates to avoid overshooting. Dimensions with small gradients (gentle slope) want large learning rates to make meaningful progress.
Vanilla SGD uses the same $\eta$ for all dimensions. This is a fundamental mismatch: the learning rate that works for steep dimensions is too small for gentle ones, and vice versa.
Difficulty with Saddle Points
In high-dimensional spaces, saddle points (where gradients are zero but it's not a minimum) are far more common than local minima. Near a saddle point, gradients become tiny, and SGD slows to a crawl. The optimizer might spend thousands of iterations in the "saddle" region before random noise eventually pushes it over the edge.
Oscillation in Ill-Conditioned Problems
When the loss surface is elongated, with high curvature in some directions and low curvature in others, SGD exhibits a characteristic pathology: it oscillates rapidly across the narrow valley while making slow progress along the long axis.
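The effect is easy to reproduce on a toy quadratic; plain (deterministic) gradient steps are used here to isolate the curvature effect, and the curvature values and learning rate are illustrative:

```python
# Hypothetical ill-conditioned quadratic: L(w) = 0.5 * (10*w1^2 + 0.1*w2^2)
curvatures = np.array([10.0, 0.1])        # steep along w1, shallow along w2

def grad_quadratic(w):
    return curvatures * w

w = np.array([1.0, 1.0])
lr = 0.18                                  # near the stability limit of the steep direction
for step in range(10):
    w = w - lr * grad_quadratic(w)
    print(f"step {step + 1:2d}  w1 = {w[0]:+.3f}  w2 = {w[1]:+.3f}")
```

The first coordinate flips sign on every step while shrinking slowly; the second creeps toward zero. Exactly this pattern appears in the SGD trajectory described below.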
The oscillation pattern is unmistakable. In the steep direction (horizontal axis), SGD overshoots repeatedly, bouncing back and forth across the valley. In the shallow direction (vertical axis), it makes painfully slow progress because the gradients are small.
This is an ill-conditioned problem, where the ratio of largest to smallest curvature is high. For vanilla SGD, ill-conditioning is poison. The learning rate that's appropriate for the steep direction is far too small for the shallow one.
These limitations motivate momentum-based methods, which we'll explore in the next chapter. By accumulating a "velocity" that builds up along consistent gradient directions, momentum methods dampen the oscillations while accelerating progress along the valley floor.
Summary
Stochastic Gradient Descent trades exact gradients for speed, making neural network training practical at scale. The key ideas:
Core mechanics:
- Batch gradient descent uses all data for each update; computationally expensive but precise
- Pure SGD uses single examples; fast but noisy
- Minibatch SGD (the practical choice) uses small batches (32-256), balancing speed and stability
Learning rate:
- Too small: training is slow but stable
- Too large: training is fast but may diverge
- Learning rate finder helps identify a good starting point
- Decaying schedules (step, exponential, cosine) improve final convergence
Noise as regularization:
- SGD noise helps escape sharp minima that generalize poorly
- Smaller batches and larger learning rates increase effective noise
- The ratio $\eta / B$ controls the noise scale
Practical considerations:
- Shuffle data each epoch to avoid cycles
- Monitor both training and validation loss
- Use learning rate schedules for best convergence
- Gradient variance decreases as $1/B$ with batch size
SGD's limitations (sensitivity to the learning rate, slow convergence on ill-conditioned problems, and difficulty with saddle points) motivate the momentum-based optimizers we'll explore in the next chapter.
Key Parameters
| Parameter | Typical Values | Effect |
|---|---|---|
| Learning rate ($\eta$) | 0.001-0.1 | Higher values mean faster but less stable training |
| Batch size ($B$) | 32-256 | Larger batches reduce gradient variance but may hurt generalization |
| Epochs | 50-500 | More epochs allow more updates; use early stopping to prevent overfitting |
| Learning rate decay | 0.1-0.5 every N steps | Enables convergence to precise minima |
When starting with a new problem, a reasonable baseline is:
- Batch size: 32 or 64
- Learning rate: Use a learning rate finder, or start with 0.01
- Schedule: Step decay by 0.1 every 30% of total training
- Shuffle: Always shuffle data each epoch
About the author: Michael Brenndoerfer
All opinions expressed here are my own and do not reflect the views of my employer.
Michael currently works as an Associate Director of Data Science at EQT Partners in Singapore, leading AI and data initiatives across private capital investments.
With over a decade of experience spanning private equity, management consulting, and software engineering, he specializes in building and scaling analytics capabilities from the ground up. He has published research in leading AI conferences and holds expertise in machine learning, natural language processing, and value creation through data.