Stochastic Gradient Descent: From Batch to Minibatch Optimization

Michael Brenndoerfer · December 15, 2025 · 41 min read

Master SGD optimization for neural networks, including minibatch training, learning rate schedules, and how gradient noise acts as implicit regularization.

Stochastic Gradient Descent

Training a neural network means finding weights that minimize a loss function. We've seen how backpropagation computes gradients, telling us which direction to move each weight to reduce the loss. But computing these gradients over an entire dataset is expensive. With millions of training examples, even a single gradient computation becomes prohibitively slow.

Stochastic Gradient Descent (SGD) solves this problem with a simple but powerful insight: we don't need perfect gradients. We just need gradients that point roughly in the right direction. By using small random samples instead of the full dataset, we trade precision for speed, often getting orders of magnitude faster training with minimal impact on the final result.

From Batch to Stochastic: The Core Idea

To understand why we need stochastic gradient descent, let's first examine what we're replacing and why the replacement works at all.

The Ideal: Computing the True Gradient

Imagine you want to know the average height of adults in a country. The most accurate approach is to measure everyone. Similarly, standard gradient descent, often called batch gradient descent, computes the loss gradient by averaging over every training example:

\nabla L = \frac{1}{N} \sum_{i=1}^{N} \nabla L_i

where:

  • \nabla L: the gradient of the total loss with respect to all model parameters
  • N: the total number of training examples in the dataset
  • \nabla L_i: the gradient computed from the i-th training example alone
  • \frac{1}{N}: averaging factor that normalizes the sum

This formula computes the exact direction of steepest descent. Each training example contributes its own "opinion" about which way to adjust the weights, and averaging produces a consensus direction. The result is precise, but precision comes at a cost: you must process the entire dataset before taking a single optimization step.

For a dataset of one million images, that means one million forward passes, one million backward passes, and one million gradient computations, all before you can update a single weight. At this scale, batch gradient descent becomes impractical.

Batch Gradient Descent

Batch gradient descent computes the gradient using the entire training set. This provides an accurate estimate of the true gradient direction but becomes computationally infeasible for large datasets.

The Insight: A Sample Can Substitute for the Population

Here's the key insight that makes stochastic gradient descent possible: you don't need to measure everyone to estimate the average.

Polling organizations don't survey every citizen; they sample a representative subset. The sample average is close to the true average, and more samples get you closer. The same logic applies to gradients. Instead of computing \nabla L_i for all N examples and averaging, what if we just... picked one example at random?

\nabla L \approx \nabla L_i \quad \text{where } i \text{ is chosen uniformly from } \{1, 2, \ldots, N\}

where:

  • \nabla L_i: the gradient computed from a single randomly selected training example
  • i: a random index, each value equally likely

This single-example gradient is an unbiased estimator of the true gradient. Mathematically, \mathbb{E}[\nabla L_i] = \nabla L: if you average over all possible random choices, you get the exact batch gradient back. Any individual estimate might be wrong, perhaps dramatically so, but on average, it points in the right direction.

The tradeoff is variance for speed. A single-example gradient might point 45° away from the true direction, or even in the opposite direction for pathological examples. But it costs 1/N as much to compute. With a million examples, that's a million-fold speedup per gradient computation.

Visualizing the Tradeoff: Batch vs. Stochastic Gradients

Let's make this concrete. We'll create a simple linear regression problem and compare the batch gradient (computed from all 1000 examples) with stochastic gradients (each computed from a single example). The difference reveals exactly what we gain and what we sacrifice.

In[2]:
Code
import numpy as np

np.random.seed(42)

# Generate simple 2D data for linear regression
N = 1000
X = np.random.randn(N, 2)
true_weights = np.array([2.0, -1.5])
y = X @ true_weights + np.random.randn(N) * 0.5


# Loss function: Mean Squared Error
def mse_loss(w, X, y):
    predictions = X @ w
    return np.mean((predictions - y) ** 2)


# Gradient of MSE loss
def mse_gradient(w, X, y):
    predictions = X @ w
    return 2 * X.T @ (predictions - y) / len(y)

Now let's compute both types of gradients at the same point in parameter space and see how they compare:

In[3]:
Code
# Starting point for gradient comparison
w = np.array([0.5, 0.5])

# Batch gradient (using all data)
batch_gradient = mse_gradient(w, X, y)

# Stochastic gradients (using single examples)
n_samples = 50
stochastic_gradients = []
for _ in range(n_samples):
    idx = np.random.randint(N)
    sg = mse_gradient(w, X[idx : idx + 1], y[idx : idx + 1])
    stochastic_gradients.append(sg)

stochastic_gradients = np.array(stochastic_gradients)
Out[4]:
Visualization
Vector plot showing one large red arrow for batch gradient and many scattered gray arrows for stochastic gradients, with their average shown as a blue dashed arrow.
Comparison of batch gradient (red arrow) and stochastic gradients (gray arrows). Each gray arrow is the gradient computed from a single training example. While individual stochastic gradients vary wildly, their average (blue dashed arrow) closely approximates the true batch gradient.

The visualization reveals the fundamental tradeoff at the heart of SGD. The batch gradient (red arrow) points precisely toward the optimum, getting the direction exactly right. The individual stochastic gradients (gray arrows) scatter wildly, some nearly orthogonal to the true direction, a few even pointing away from the optimum entirely. Any single gray arrow would be a poor substitute for the red one.

But look at the blue dashed arrow: the average of those scattered stochastic gradients. It aligns almost perfectly with the true batch gradient. This is no coincidence; it's the law of large numbers at work. Each stochastic gradient is an unbiased sample, so their average converges to the true gradient. The more samples you average, the closer you get.

This is the mathematical foundation of SGD: each step is imprecise, but the journey is correct on average.
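As a quick sanity check, we can compare the averaged stochastic gradients with the batch gradient numerically, reusing the batch_gradient and stochastic_gradients arrays from the comparison cell above (a small sketch; the cosine similarity should land close to 1).

Code
avg_stochastic = stochastic_gradients.mean(axis=0)

# Cosine similarity between the averaged stochastic gradient and the batch gradient
cos_sim = np.dot(avg_stochastic, batch_gradient) / (
    np.linalg.norm(avg_stochastic) * np.linalg.norm(batch_gradient)
)

print(f"Batch gradient:           {batch_gradient}")
print(f"Averaged stochastic grad: {avg_stochastic}")
print(f"Cosine similarity:        {cos_sim:.4f}")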

The SGD Update Rule

Now that we understand why stochastic gradients work, let's formalize how we use them. At each iteration, SGD takes a step in the direction opposite to the gradient:

w_{t+1} = w_t - \eta \nabla L_i

where:

  • w_t: the weight vector at iteration t
  • w_{t+1}: the updated weight vector after iteration t
  • \eta: the learning rate (step size), a positive scalar controlling how far we move
  • \nabla L_i: the gradient computed from a single randomly selected example i

Understanding Each Component

Why subtract? Gradients point "uphill," toward increasing loss. Since we want to decrease the loss, we move in the opposite direction. Subtracting the gradient is moving downhill.

Why scale by \eta? The gradient tells us the direction, but not how far to go. The learning rate controls step size. Too small, and we inch toward the minimum over thousands of steps. Too large, and we overshoot, potentially bouncing around forever or diverging entirely. Finding the right \eta is one of the most important (and frustrating) aspects of training neural networks.

Why use \nabla L_i instead of \nabla L? This is what makes it stochastic. Using the full gradient \nabla L gives batch gradient descent; using a single-example gradient \nabla L_i gives pure SGD. The update formula is identical, only the gradient source changes.

The Training Loop: Epochs and Shuffling

In practice, we don't pick random examples with replacement. Instead, we shuffle the training set once, then iterate through every example in order. Once we've used every example once, that's one epoch. Then we shuffle again and repeat:

In[5]:
Code
def sgd_train(X, y, learning_rate=0.01, n_epochs=10):
    """Train linear regression using SGD."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)

    history = {"weights": [w.copy()], "loss": [mse_loss(w, X, y)]}

    for epoch in range(n_epochs):
        # Shuffle data each epoch
        indices = np.random.permutation(n_samples)

        for idx in indices:
            # Compute gradient for single example
            xi = X[idx : idx + 1]
            yi = y[idx : idx + 1]
            gradient = mse_gradient(w, xi, yi)

            # Update weights
            w = w - learning_rate * gradient

        # Record progress after each epoch
        history["weights"].append(w.copy())
        history["loss"].append(mse_loss(w, X, y))

    return w, history
In[6]:
Code
# Train the model
w_sgd, history_sgd = sgd_train(X, y, learning_rate=0.01, n_epochs=20)
Out[7]:
Console
SGD Training Results:
---------------------------------------------
True weights:  [+2.0000, -1.5000]
Final weights: [+1.9887, -1.4916]
Initial loss:  6.2620
Final loss:    0.2419

Loss reduction: 96.1%
Weight error:   0.0141

The results speak for themselves. SGD recovers weights extremely close to the ground truth, with the final loss dropping dramatically from its initial value. But here's the key insight: we made 1000 weight updates per epoch (one per training example), whereas batch gradient descent would make only one. Even though each SGD update is noisier than a batch update, we get 1000 chances to improve per epoch instead of just one.

This is why SGD often converges faster in wall-clock time despite being "noisier" per step. The total distance traveled toward the optimum per unit of computation is greater.

Visualizing the Optimization Path

Let's compare the actual trajectory through weight space for batch gradient descent vs. SGD. We'll run both starting from the same initial point and watch how they approach the optimum:

In[8]:
Code
def batch_gd_train(X, y, learning_rate=0.1, n_epochs=20):
    """Train linear regression using batch gradient descent."""
    n_features = X.shape[1]
    w = np.array([0.5, 0.5])  # Same starting point

    history = {"weights": [w.copy()], "loss": [mse_loss(w, X, y)]}

    for epoch in range(n_epochs):
        gradient = mse_gradient(w, X, y)
        w = w - learning_rate * gradient
        history["weights"].append(w.copy())
        history["loss"].append(mse_loss(w, X, y))

    return w, history


def sgd_train_with_path(X, y, learning_rate=0.01, n_epochs=5):
    """SGD tracking every update for path visualization."""
    w = np.array([0.5, 0.5])
    history = {"weights": [w.copy()]}

    for epoch in range(n_epochs):
        indices = np.random.permutation(len(y))
        for idx in indices[:50]:  # Sample for visibility
            gradient = mse_gradient(w, X[idx : idx + 1], y[idx : idx + 1])
            w = w - learning_rate * gradient
            history["weights"].append(w.copy())

    return w, history


# Run both optimizers
np.random.seed(42)
_, batch_hist = batch_gd_train(X, y, learning_rate=0.1, n_epochs=20)
np.random.seed(42)
_, sgd_path_hist = sgd_train_with_path(X, y, learning_rate=0.01, n_epochs=3)

batch_weights = np.array(batch_hist["weights"])
sgd_weights = np.array(sgd_path_hist["weights"])
Out[9]:
Visualization
Contour plot of loss surface with two optimization trajectories: a smooth blue path for batch GD and a jagged red path for SGD, both converging toward the minimum.
Optimization paths through weight space. Batch gradient descent (blue) takes smooth, direct steps toward the optimum. SGD (red) follows a noisy, wandering path but makes many more updates. Despite the zigzagging, SGD reaches the vicinity of the optimum quickly.

The visualization reveals the fundamental difference in how these methods explore the loss landscape. Batch gradient descent (blue) takes measured, deliberate steps directly toward the minimum, with each step using perfect information from the entire dataset. SGD (red) wanders drunkenly, zigzagging across the landscape as different training examples pull it in different directions.

Yet both reach the same destination. And despite its erratic path, SGD made many more updates in the same computational budget. This is the core tradeoff: precision vs. speed.

Minibatch Gradient Descent: The Best of Both Worlds

We've now seen two extremes:

  1. Batch gradient descent: Use all N examples. Precise gradients, but one update per full dataset pass.
  2. Pure SGD: Use 1 example. Fast updates, but extremely noisy gradients.

In practice, neither extreme is ideal. Batch is too slow; pure SGD is too noisy and can't exploit modern hardware. The solution? Meet in the middle.

The Minibatch Compromise

Instead of all examples or just one, we use a small random subset, a minibatch, of B examples:

\nabla L \approx \frac{1}{B} \sum_{i \in \mathcal{B}} \nabla L_i

where:

  • \mathcal{B}: a minibatch, a random subset of B training examples
  • B: the batch size (typically 32, 64, 128, or 256)
  • \nabla L_i: the gradient from the i-th example in the minibatch
  • \frac{1}{B}: averaging factor that normalizes the sum

This is the same averaging formula as batch gradient descent, just applied to a smaller sample. If B = N, we get batch gradient descent. If B = 1, we get pure SGD. For B somewhere in between, we get the advantages of both:

  • Reduced variance: Averaging over B examples smooths out the noise. The variance of the gradient estimate decreases as 1/B.
  • Efficient computation: GPUs are designed for parallel matrix operations. A minibatch of 64 examples can be processed almost as fast as a single example because the matrix multiplications happen in parallel.
  • Practical balance: We still make N/B updates per epoch (for 1000 examples and B = 32, that's about 31 updates), so we iterate quickly without extreme noise.

Minibatch Gradient Descent

Minibatch gradient descent computes gradients using a small random subset of training examples (typically 32-256). This provides more stable gradients than pure SGD while maintaining computational efficiency.

Implementing Minibatch SGD

The implementation is nearly identical to pure SGD, but we process examples in groups rather than one at a time:

In[10]:
Code
def minibatch_sgd_train(X, y, batch_size=32, learning_rate=0.01, n_epochs=10):
    """Train linear regression using minibatch SGD."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)

    history = {"weights": [w.copy()], "loss": [mse_loss(w, X, y)]}

    for epoch in range(n_epochs):
        # Shuffle data
        indices = np.random.permutation(n_samples)

        # Process in batches
        for start in range(0, n_samples, batch_size):
            batch_idx = indices[start : start + batch_size]
            X_batch = X[batch_idx]
            y_batch = y[batch_idx]

            # Compute gradient for batch
            gradient = mse_gradient(w, X_batch, y_batch)

            # Update weights
            w = w - learning_rate * gradient

        history["weights"].append(w.copy())
        history["loss"].append(mse_loss(w, X, y))

    return w, history

The key difference from pure SGD is in the loop structure: instead of iterating over individual indices, we iterate in steps of batch_size and slice out groups of examples. The gradient computation itself is unchanged; we just pass a matrix X_batch instead of a single row.

Let's compare how different batch sizes affect convergence:

In[11]:
Code
# Compare different batch sizes
batch_sizes = [1, 16, 64, 256]
histories = {}

for bs in batch_sizes:
    _, hist = minibatch_sgd_train(
        X, y, batch_size=bs, learning_rate=0.01, n_epochs=20
    )
    histories[bs] = hist
Out[12]:
Visualization
Line plot showing loss decreasing over epochs for four different batch sizes, with smaller batches showing more oscillation.
Training loss curves for different batch sizes. Smaller batches (B=1, B=16) show noisier convergence but can escape shallow local minima. Larger batches (B=64, B=256) provide smoother convergence. All eventually reach similar final losses.

The curves reveal an illuminating pattern:

  • Small batches (B=1, B=16) converge faster initially because they take more update steps per epoch, covering more ground early. But the path is jagged, reflecting high gradient variance.
  • Large batches (B=64, B=256) are smoother, with less oscillation. But they make fewer updates per epoch, so early progress is slower.
  • All reach similar final losses, given enough epochs. The journey differs, but the destination is the same.

This is the batch size tradeoff in action: more noise vs. more updates. Neither extreme is optimal, which is why batch sizes in the 32-256 range are the practical sweet spot.

Why These Specific Batch Sizes?

You'll notice that batch sizes are almost always powers of 2: 32, 64, 128, 256. This isn't superstition; it's hardware optimization. GPUs process data in parallel using memory layouts that align to powers of 2. A batch of 64 examples may process in the same time as a batch of 50, simply because 64 fits the hardware's natural granularity.

Beyond hardware, batch size affects learning dynamics:

  • Too small (B < 16): Gradient variance is high, often requiring smaller learning rates to compensate. Training becomes erratic.
  • Sweet spot (B = 32-256): Variance is reduced enough for stable training, while still making many updates per epoch. Memory usage fits comfortably on most GPUs.
  • Too large (B > 1024): Diminishing returns on gradient quality (variance doesn't decrease much beyond a point), and research suggests very large batches may find sharper minima that generalize worse.

Updates per Epoch: The Hidden Tradeoff

Beyond noise, batch size affects how many times we update weights per epoch. Smaller batches mean more updates per pass through the data:

Out[13]:
Visualization
Bar chart showing updates per epoch decreasing as batch size increases, from 1000 updates at B=1 to about 4 updates at B=256.
Number of weight updates per epoch as a function of batch size. Smaller batches provide more frequent updates: pure SGD (B=1) makes N updates per epoch, while large batches make few. This tradeoff interacts with gradient noise, since more updates means faster exploration but noisier individual steps.

With batch size 1 (pure SGD), we make 1000 updates per epoch, one per example. With batch size 256, we make only about 4 updates. This is why small batches often converge faster in terms of epochs: each epoch does more optimization work. But each individual update is noisier, so the tradeoff isn't straightforward.
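The arithmetic behind these counts is simple; a short sketch (reusing N = 1000 from the data-generation cell) makes it explicit:

Code
# Updates per epoch = ceil(N / B): one weight update per minibatch
for bs in [1, 16, 64, 256]:
    updates = int(np.ceil(N / bs))
    print(f"Batch size {bs:>3}: {updates:>4} updates per epoch")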

Quantifying the Variance Reduction

Because a minibatch gradient is an average of B (roughly independent) per-example gradients, its variance should decrease as 1/B: double the batch size, halve the variance. Let's verify this empirically:

In[14]:
Code
def compute_gradient_variance(X, y, w, batch_size, n_samples=100):
    """Estimate variance of minibatch gradients.

    Computes 100 different minibatch gradients and measures their variance.
    """
    gradients = []
    n = len(y)

    for _ in range(n_samples):
        idx = np.random.choice(n, size=batch_size, replace=False)
        g = mse_gradient(w, X[idx], y[idx])
        gradients.append(g)

    gradients = np.array(gradients)
    return np.mean(np.var(gradients, axis=0))


# Measure gradient variance for different batch sizes
test_batch_sizes = [1, 4, 16, 64, 256, 512]
variances = []

w_test = np.array([0.5, 0.5])
for bs in test_batch_sizes:
    var = compute_gradient_variance(X, y, w_test, bs)
    variances.append(var)
Out[15]:
Visualization
Log-log plot showing gradient variance decreasing linearly with batch size.
Gradient variance decreases as batch size increases, following an approximately 1/B relationship. This explains why larger batches allow larger learning rates. The dashed line shows the theoretical 1/B decay.

The log-log plot confirms the 1/B relationship: a straight line with slope -1. Doubling the batch size halves the variance, exactly as the theory predicts.
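We can also check the slope directly by fitting a line to the log-log data from the previous cell (a quick sketch; a fitted slope near -1 confirms the 1/B decay):

Code
# Fit log(variance) ~ slope * log(batch size) + intercept
slope, intercept = np.polyfit(np.log(test_batch_sizes), np.log(variances), 1)
print(f"Fitted slope: {slope:.3f} (theory predicts -1)")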

Let's visualize this variance reduction more intuitively by plotting the distribution of gradient estimates for different batch sizes:

Out[16]:
Visualization
Violin plot showing gradient distributions narrowing dramatically as batch size increases from 1 to 256.
Distribution of minibatch gradient estimates for different batch sizes. With B=1 (pure SGD), gradients scatter widely around the true value. As batch size increases, the distribution tightens, concentrating near the true gradient. This is variance reduction in action.

The violin plots make the variance reduction visceral. With B=1, gradient estimates scatter wildly, some far from the true value (green dashed line). As batch size increases, the distribution narrows dramatically. By B=256, estimates cluster tightly around the truth. This is why larger batches allow larger learning rates: the gradient you compute is much closer to the true direction.

This relationship has a profound practical consequence: if you double the batch size, you can often double the learning rate while maintaining training stability. The variance reduction from larger batches compensates for the larger steps. This "linear scaling rule" is widely used when training on multiple GPUs, where larger effective batch sizes are natural.
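As a rough illustration of the linear scaling rule on our toy regression problem (a sketch, not a general recipe), we can grow the batch size and learning rate together so the ratio \eta/B stays fixed and compare the resulting runs:

Code
# Linear scaling sketch: double B and the learning rate together, keeping eta/B constant
configs = [(32, 0.01), (64, 0.02), (128, 0.04)]

for bs, lr in configs:
    _, hist = minibatch_sgd_train(X, y, batch_size=bs, learning_rate=lr, n_epochs=20)
    print(
        f"B={bs:>3}, lr={lr:.2f}, eta/B={lr / bs:.5f} -> final loss {hist['loss'][-1]:.4f}"
    )

Under this assumption, the three runs share the same effective noise level and should reach broadly similar final losses.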

Learning Rate: The Critical Hyperparameter

We've discussed batch size as a dial we can tune. But the learning rate \eta is the critical hyperparameter, the one that makes or breaks training. Get it wrong, and your model either learns nothing or explodes.

The Goldilocks Problem

Unlike batch size, where 32-256 usually works fine, learning rate is problem-specific. A value that works perfectly for one model might cause another to diverge. Here's what happens at different settings:

  • Too small (\eta = 0.0001): Each step is a timid shuffle toward the minimum. Training is stable and won't diverge, but it's glacially slow. You might need thousands of epochs to converge, and you may get stuck in shallow local minima along the way.

  • Just right (\eta = 0.01): Each step makes meaningful progress without overshooting. The loss decreases steadily, and you reach a good solution in a reasonable number of epochs.

  • Too large (\eta = 0.5): Each step overshoots the minimum, landing on the other side of the valley. The next step overshoots again. The loss oscillates wildly, or worse, increases exponentially until numerical overflow kills training.

In[17]:
Code
# Compare different learning rates
learning_rates = [0.0001, 0.001, 0.01, 0.1, 0.5]
lr_histories = {}

for lr in learning_rates:
    _, hist = minibatch_sgd_train(
        X, y, batch_size=32, learning_rate=lr, n_epochs=30
    )
    lr_histories[lr] = hist
Out[18]:
Visualization
Line plot showing loss curves diverging, oscillating, or converging smoothly depending on learning rate.
Training loss curves for different learning rates. Too small (0.0001) barely moves. Moderate values (0.001-0.01) converge well. Too large (0.1-0.5) either oscillates or diverges. Finding the right learning rate is crucial for efficient training.

Finding the right learning rate manually is tedious. The learning rate finder technique automates this by sweeping across learning rates and tracking when the loss starts increasing:

In[19]:
Code
def learning_rate_finder(
    X, y, batch_size=32, lr_min=1e-6, lr_max=1, n_steps=100
):
    """Find optimal learning rate range using exponential increase.

    Start with a tiny learning rate and exponentially increase it.
    Track how the loss changes. The optimal range is where loss
    decreases fastest, just before it starts increasing.
    """
    n_samples, n_features = X.shape
    w = np.zeros(n_features)

    # Exponentially spaced learning rates
    lrs = np.exp(np.linspace(np.log(lr_min), np.log(lr_max), n_steps))
    losses = []

    idx = 0
    for lr in lrs:
        # Get a batch
        batch_start = (idx * batch_size) % n_samples
        batch_end = batch_start + batch_size
        if batch_end > n_samples:
            idx = 0
            batch_start = 0
            batch_end = batch_size

        X_batch = X[batch_start:batch_end]
        y_batch = y[batch_start:batch_end]

        # Compute loss
        loss = mse_loss(w, X_batch, y_batch)
        losses.append(loss)

        # Update with current learning rate
        gradient = mse_gradient(w, X_batch, y_batch)
        w = w - lr * gradient

        idx += 1

        # Stop if loss explodes
        if loss > 4 * losses[0]:
            break

    return lrs[: len(losses)], losses


lrs_test, losses_test = learning_rate_finder(X, y)
Out[20]:
Visualization
Line plot showing loss decreasing then sharply increasing as learning rate grows, with optimal range highlighted.
Learning rate finder showing loss vs. learning rate on a log scale. The optimal learning rate is typically 1-10× smaller than where the loss starts increasing (around 0.01 here). This technique helps quickly identify a good starting point.

The plot shows the classic learning rate finder shape: loss decreases as we increase the learning rate (we're making meaningful progress), then suddenly shoots up when the rate gets too large (we're overshooting). The optimal learning rate is typically 1-10× smaller than where the loss starts increasing, around 0.01-0.05 in this case. This technique, popularized by Leslie Smith, can save hours of manual tuning.

SGD Convergence Properties

We've seen that SGD trades exact gradients for speed. But does this tradeoff come at a cost to where we end up? Understanding SGD's convergence behavior reveals both its power and its subtleties.

The Noisy Path to Convergence

Batch gradient descent, with exact gradients, descends smoothly to the minimum. Each step moves directly downhill, and if the learning rate is small enough, the algorithm converges to a fixed point.

SGD is different. With a constant learning rate, SGD never converges to a point. It oscillates around the minimum, bouncing in a region whose diameter is proportional to \eta. The gradient noise prevents it from settling.

This might seem like a fundamental flaw, but it's actually a consequence of a tradeoff. To guarantee exact convergence, we need a decaying learning rate that satisfies two mathematical conditions:

\sum_{t=1}^{\infty} \eta_t = \infty \quad \text{and} \quad \sum_{t=1}^{\infty} \eta_t^2 < \infty

where:

  • \eta_t: the learning rate at iteration t
  • \sum_{t=1}^{\infty} \eta_t: the cumulative sum of all learning rates

These conditions may seem abstract, but they encode a precise balance:

  1. First condition (\sum \eta_t = \infty): We must be able to travel unbounded total distance. If learning rates decay too fast, like \eta_t = 1/t^2, the sum converges to a finite value, and we might stop moving before reaching the minimum. The algorithm gets "stuck" because it runs out of step budget.

  2. Second condition (\sum \eta_t^2 < \infty): The accumulated variance must be bounded. Each SGD step has variance proportional to \eta_t^2 (from the noisy gradient). If this sum diverges, we bounce around forever, never settling down.

A schedule like \eta_t = \frac{\eta_0}{1 + t} threads this needle: it decays slowly enough that the sum diverges (we can always reach the goal), but fast enough that the sum of squares converges (we eventually stop bouncing).
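A small numeric sketch (separate from the training code above) makes these conditions tangible: summing the first million terms of a few candidate schedules shows which sums keep growing and which settle toward a finite value.

Code
# Partial sums over the first million steps for three candidate schedules (eta_0 = 1)
t = np.arange(1, 1_000_001)

schedules_to_check = {
    "eta_t = 1/(1+t)": 1.0 / (1 + t),          # sum keeps growing; sum of squares settles
    "eta_t = 1/t^2": 1.0 / t**2,               # decays too fast: even the plain sum settles
    "eta_t = 0.01 (constant)": np.full(len(t), 0.01),  # sum of squares keeps growing
}

for name, eta in schedules_to_check.items():
    print(f"{name:<24} sum = {eta.sum():12.2f}   sum of squares = {(eta**2).sum():10.4f}")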

From Theory to Practice: Learning Rate Schedules

In practice, few practitioners use theoretically-motivated schedules like 1/t. Instead, empirically-tuned schedules dominate. The key insight is that we want large steps early (when we're far from the optimum and any progress is good) and small steps late (when we're near the optimum and need precision).

Let's implement and compare several common schedules:

In[21]:
Code
def sgd_with_schedule(X, y, lr_schedule, batch_size=32, n_epochs=50):
    """SGD with custom learning rate schedule."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)

    history = {"weights": [], "loss": [], "lr": []}
    step = 0

    for epoch in range(n_epochs):
        indices = np.random.permutation(n_samples)

        for start in range(0, n_samples, batch_size):
            lr = lr_schedule(step)
            batch_idx = indices[start : start + batch_size]

            gradient = mse_gradient(w, X[batch_idx], y[batch_idx])
            w = w - lr * gradient

            if step % 100 == 0:
                history["weights"].append(w.copy())
                history["loss"].append(mse_loss(w, X, y))
                history["lr"].append(lr)

            step += 1

    return w, history

We define four schedules representing different decay philosophies:

In[22]:
Code
# Constant: no decay at all
def constant_lr(step, lr0=0.01):
    return lr0


# Step decay: drop by factor of 10 every 500 steps
def step_decay(step, lr0=0.1, decay_rate=0.1, decay_steps=500):
    return lr0 * (decay_rate ** (step // decay_steps))


# Exponential: continuous smooth decay
def exponential_decay(step, lr0=0.1, decay_rate=0.995):
    return lr0 * (decay_rate**step)


# Inverse time: theoretically-motivated 1/t style
def inverse_time_decay(step, lr0=0.1, decay_rate=0.01):
    return lr0 / (1 + decay_rate * step)


# Train with different schedules
schedules = {
    "Constant": lambda s: constant_lr(s, 0.01),
    "Step decay": lambda s: step_decay(s),
    "Exponential decay": lambda s: exponential_decay(s),
    "Inverse time": lambda s: inverse_time_decay(s),
}

schedule_histories = {}
for name, schedule in schedules.items():
    _, hist = sgd_with_schedule(X, y, schedule, batch_size=32, n_epochs=30)
    schedule_histories[name] = hist
Out[23]:
Visualization
Line plot showing four learning rate schedules decreasing at different rates over training steps.
Learning rate over time for different scheduling strategies. Constant stays flat while decay strategies reduce the rate progressively.
Line plot showing loss curves converging to different final values depending on schedule.
Training loss with different schedules. Decaying schedules converge to lower final losses by taking smaller steps near the optimum.

The left plot shows how each schedule reduces the learning rate over training steps. The constant schedule stays flat; step decay has sharp drops; exponential and inverse-time decay smoothly.

The right plot reveals the consequence: decaying schedules achieve lower final losses. The constant schedule converges quickly but then oscillates because it can't settle precisely into the minimum with such large steps. The decaying schedules take big steps early (making fast initial progress) then small steps late (settling precisely into the minimum).

Common Learning Rate Schedules in Practice

Several schedules have become standard, each with particular strengths:

Common learning rate schedules and their applications.

| Schedule | Formula | Use Case |
| --- | --- | --- |
| Step decay | \eta_t = \eta_0 \cdot \gamma^{\lfloor t/s \rfloor} | Most common; drop by factor \gamma every s steps |
| Exponential | \eta_t = \eta_0 \cdot \gamma^t | Smooth decay; can be too aggressive |
| Cosine annealing | \eta_t = \eta_{\min} + \frac{1}{2}(\eta_0 - \eta_{\min})(1 + \cos(\frac{t\pi}{T})) | Popular for vision models; smooth, with warm restarts |
| Linear warmup | Increase linearly for first W steps, then decay | Essential for transformers; prevents early instability |

In these formulas:

  • \eta_t: learning rate at step t
  • \eta_0: initial learning rate
  • \gamma: decay factor (typically 0.1 for step decay, 0.99-0.999 for exponential)
  • s: step interval for step decay
  • T: total number of training steps for cosine annealing
  • \eta_{\min}: minimum learning rate (floor for the cosine schedule)
  • W: number of warmup steps

Let's visualize all these schedules together, including cosine annealing and warmup:

In[24]:
Code
def cosine_annealing(step, lr0=0.1, T=1000, eta_min=0.001):
    """Cosine annealing schedule."""
    return eta_min + 0.5 * (lr0 - eta_min) * (1 + np.cos(np.pi * step / T))


def warmup_then_decay(step, lr0=0.1, warmup_steps=200, decay_rate=0.995):
    """Linear warmup followed by exponential decay."""
    if step < warmup_steps:
        return lr0 * (step / warmup_steps)
    else:
        return lr0 * (decay_rate ** (step - warmup_steps))


# Generate schedule curves for visualization
steps = np.arange(1000)

schedule_curves = {
    "Constant": [0.01] * len(steps),
    "Step decay": [step_decay(s, lr0=0.1) for s in steps],
    "Exponential": [
        exponential_decay(s, lr0=0.1, decay_rate=0.997) for s in steps
    ],
    "Cosine annealing": [cosine_annealing(s, lr0=0.1, T=1000) for s in steps],
    "Warmup + decay": [warmup_then_decay(s, lr0=0.1) for s in steps],
}
Out[25]:
Visualization
Line plot showing five different learning rate schedules over 1000 training steps, each with distinct decay patterns.
Comparison of common learning rate schedules. Step decay has sharp drops; exponential decays smoothly; cosine annealing follows a smooth curve that decreases slowly at the start and end, faster in the middle; warmup starts low and increases before decaying. Each has strengths for different training scenarios.

The choice of schedule often matters less than having a schedule. The key is reducing the learning rate as training progresses; the specific shape is secondary. That said, each schedule has its niche:

  • Cosine annealing is particularly popular for training vision models, as the smooth curve avoids sudden drops that can destabilize training
  • Warmup is essential for transformers and models with layer normalization, preventing early training instability
  • Step decay remains the workhorse for many applications, offering predictable drops at known epochs

SGD Noise as Implicit Regularization

Everything we've discussed so far treats SGD's noise as a necessary evil, the cost of speed. But here's a surprising fact: the noise actually helps generalization. What looks like a bug is actually a feature.

The Generalization Puzzle

Consider two training runs that achieve the same final loss on the training set. Both fit the data equally well. Yet one might generalize better to new data than the other. Why?

The answer lies in where they ended up in the loss landscape, not just how low they got.

Sharp vs. Flat Minima

Neural network loss landscapes have many local minima, perhaps exponentially many. Some are sharp: the loss increases steeply as you move away from the minimum in any direction. Others are flat: the loss changes gradually, creating a broad basin.

Research suggests that flat minima generalize better. The intuition is straightforward: test data comes from a slightly different distribution than training data. This shifts the loss landscape slightly. If you're in a sharp minimum, even a small shift might push you up a steep wall, dramatically increasing loss. If you're in a flat minimum, the same shift barely matters because you're still near the bottom of a broad basin.

In[26]:
Code
# Illustrate sharp vs flat minima with a 1D loss function
x = np.linspace(-2, 2, 1000)

# Sharp minimum at x=0
sharp_loss = 10 * x**2

# Flat minimum at x=0
flat_loss = x**4

# Second sharp minimum for contrast
sharp_loss_2 = 2 * (x - 0.5) ** 2 + 0.5
Out[27]:
Visualization
Line plot showing a narrow deep minimum (sharp) and a wide shallow minimum (flat) on a loss landscape.
Comparison of sharp and flat minima. The sharp minimum (red) has low loss but high curvature; small perturbations cause large loss increases. The flat minimum (green) is more robust to perturbations. SGD's noise naturally avoids sharp minima because escaping them is easy.

How SGD Noise Finds Flat Minima

Here's where SGD's noise becomes a feature rather than a bug. The gradient noise has a specific structure: it depends on the local curvature of the loss.

Near sharp minima with high curvature, gradient variance is high. The loss changes rapidly in all directions, so different training examples disagree sharply about which way to go. This disagreement translates to noisy, high-variance gradients.

Near flat minima with low curvature, gradient variance is low. The loss is nearly constant in the neighborhood, so all examples agree: "we're in a good spot." Gradients are small and consistent.

The consequence? SGD naturally bounces out of sharp minima while settling into flat ones. The noise is proportional to curvature, so sharp minima are inherently unstable under SGD dynamics.

This relationship can be quantified. The effective noise scale in SGD, measuring how much the optimization path fluctuates, is approximately:

\text{Noise scale} \propto \frac{\eta}{B} \cdot \sigma_g^2

where:

  • \eta: the learning rate (step size)
  • B: the batch size (number of examples per gradient computation)
  • \sigma_g^2: the variance of per-example gradients
  • \frac{\eta}{B}: the noise-to-signal ratio of each update

Larger steps (\eta) amplify the noise, while larger batches (B) reduce it through averaging. The ratio \eta/B controls the net noise level. This explains several well-known empirical observations:

  • Smaller batches → better generalization: More noise helps bounce out of sharp minima into flatter ones
  • Larger learning rates → better generalization (up to a point): Same mechanism, more exploration
  • The ratio \eta/B matters more than either alone: Doubling batch size while doubling learning rate maintains similar dynamics

In[28]:
Code
# Demonstrate how SGD escapes sharp minima
def synthetic_loss_1d(w, sharp=True):
    """1D loss with sharp or flat minimum."""
    if sharp:
        return 10 * w**2 + np.random.randn() * 0.5  # Sharp with noise
    else:
        return w**4 + np.random.randn() * 0.1  # Flat with less noise


def sgd_1d(sharp=True, n_steps=500, lr=0.1):
    """Run SGD on 1D problem."""
    w = 0.0
    history = [w]

    for _ in range(n_steps):
        # Gradient with noise
        if sharp:
            grad = 20 * w + np.random.randn() * 2
        else:
            grad = 4 * w**3 + np.random.randn() * 0.5

        w = w - lr * grad
        history.append(w)

    return np.array(history)


# Run multiple trials
n_trials = 20
sharp_trajectories = [sgd_1d(sharp=True, n_steps=200) for _ in range(n_trials)]
flat_trajectories = [sgd_1d(sharp=False, n_steps=200) for _ in range(n_trials)]
Out[29]:
Visualization
Multiple trajectory lines oscillating widely around zero in a sharp minimum scenario.
SGD trajectories in sharp minimum. High gradient variance causes large oscillations; the optimizer struggles to settle.
Multiple trajectory lines converging smoothly toward zero in a flat minimum scenario.
SGD trajectories in flat minimum. Low gradient variance allows stable convergence to the minimum.

The contrast is striking. In the sharp minimum (left), SGD oscillates wildly because the high curvature creates high gradient variance, and the optimizer bounces around forever. In the flat minimum (right), SGD converges stably because the low curvature means low variance, and the optimizer settles down.

This isn't a flaw to be fixed; it's a feature to be exploited. The noise in SGD naturally selects for solutions that generalize well.

A Worked Example: Training a Classifier

We've explored SGD's mechanics through linear regression, a convex problem where the theory is clean. Now let's bring everything together with a more realistic example: training a neural network classifier from scratch using SGD. This will show how batch size, learning rate, and schedules interact when the loss landscape is non-convex.

The Dataset and Model

We'll use the "moons" dataset: two interleaved crescent shapes that require a nonlinear decision boundary. A simple 2-layer neural network with ReLU activations will learn to separate them.

In[30]:
Code
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

# Generate dataset
X_clf, y_clf = make_moons(n_samples=2000, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X_clf, y_clf, test_size=0.2, random_state=42
)


# Simple 2-layer neural network
class SimpleNN:
    def __init__(self, input_dim, hidden_dim, output_dim):
        # Xavier initialization
        self.W1 = np.random.randn(input_dim, hidden_dim) * np.sqrt(
            2.0 / input_dim
        )
        self.b1 = np.zeros(hidden_dim)
        self.W2 = np.random.randn(hidden_dim, output_dim) * np.sqrt(
            2.0 / hidden_dim
        )
        self.b2 = np.zeros(output_dim)

    def relu(self, x):
        return np.maximum(0, x)

    def relu_derivative(self, x):
        return (x > 0).astype(float)

    def sigmoid(self, x):
        return 1 / (1 + np.exp(-np.clip(x, -500, 500)))

    def forward(self, X):
        self.z1 = X @ self.W1 + self.b1
        self.a1 = self.relu(self.z1)
        self.z2 = self.a1 @ self.W2 + self.b2
        self.a2 = self.sigmoid(self.z2)
        return self.a2.flatten()

    def backward(self, X, y):
        m = len(y)

        # Output layer gradient
        dz2 = (self.a2.flatten() - y).reshape(-1, 1)
        dW2 = self.a1.T @ dz2 / m
        db2 = np.mean(dz2, axis=0)

        # Hidden layer gradient
        da1 = dz2 @ self.W2.T
        dz1 = da1 * self.relu_derivative(self.z1)
        dW1 = X.T @ dz1 / m
        db1 = np.mean(dz1, axis=0)

        return {"W1": dW1, "b1": db1, "W2": dW2, "b2": db2}

    def update(self, gradients, lr):
        self.W1 -= lr * gradients["W1"]
        self.b1 -= lr * gradients["b1"]
        self.W2 -= lr * gradients["W2"]
        self.b2 -= lr * gradients["b2"]

    def loss(self, X, y):
        pred = self.forward(X)
        # Binary cross-entropy
        eps = 1e-15
        pred = np.clip(pred, eps, 1 - eps)
        return -np.mean(y * np.log(pred) + (1 - y) * np.log(1 - pred))

    def accuracy(self, X, y):
        pred = self.forward(X) > 0.5
        return np.mean(pred == y)

The neural network implements the forward pass (input → hidden → output), the backward pass (computing gradients for each weight), and weight updates. The loss method computes binary cross-entropy, and accuracy measures classification performance.
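Before training, a quick smoke test helps confirm the pieces fit together. The snippet below (a sketch using a throwaway nn_check instance, not part of the training runs that follow) checks that the forward pass produces probabilities and that the untrained loss and accuracy look sensible:

Code
np.random.seed(0)
nn_check = SimpleNN(input_dim=2, hidden_dim=8, output_dim=1)

# Forward pass on a handful of training points should yield values in (0, 1)
probs = nn_check.forward(X_train[:5])
print("Predicted probabilities:", np.round(probs, 3))

# Untrained network: expect a modest loss and roughly chance-level accuracy
print(f"Initial loss:     {nn_check.loss(X_train, y_train):.4f}")
print(f"Initial accuracy: {nn_check.accuracy(X_train, y_train):.4f}")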

Training with SGD

Now let's train this network with SGD, comparing constant vs. decaying learning rates:

In[31]:
Code
def train_nn_sgd(
    X_train,
    y_train,
    X_test,
    y_test,
    hidden_dim=32,
    batch_size=32,
    lr=0.1,
    n_epochs=50,
    lr_decay=None,
):
    """Train neural network with SGD."""
    np.random.seed(42)
    model = SimpleNN(2, hidden_dim, 1)

    n_samples = len(y_train)
    history = {
        "train_loss": [],
        "test_loss": [],
        "train_acc": [],
        "test_acc": [],
        "lr": [],
    }

    step = 0
    for epoch in range(n_epochs):
        indices = np.random.permutation(n_samples)

        for start in range(0, n_samples, batch_size):
            batch_idx = indices[start : start + batch_size]
            X_batch = X_train[batch_idx]
            y_batch = y_train[batch_idx]

            # Forward pass
            model.forward(X_batch)

            # Backward pass
            gradients = model.backward(X_batch, y_batch)

            # Learning rate schedule
            current_lr = lr if lr_decay is None else lr_decay(step, lr)

            # Update
            model.update(gradients, current_lr)
            step += 1

        # Record metrics
        history["train_loss"].append(model.loss(X_train, y_train))
        history["test_loss"].append(model.loss(X_test, y_test))
        history["train_acc"].append(model.accuracy(X_train, y_train))
        history["test_acc"].append(model.accuracy(X_test, y_test))
        history["lr"].append(current_lr)

    return model, history


# Train with step decay schedule
def step_decay_schedule(step, lr0, decay_rate=0.5, decay_every=400):
    return lr0 * (decay_rate ** (step // decay_every))


model_default, hist_default = train_nn_sgd(
    X_train, y_train, X_test, y_test, batch_size=32, lr=0.5, n_epochs=100
)

model_decay, hist_decay = train_nn_sgd(
    X_train,
    y_train,
    X_test,
    y_test,
    batch_size=32,
    lr=0.5,
    n_epochs=100,
    lr_decay=step_decay_schedule,
)
Out[32]:
Console
Training Results:
=======================================================
Configuration             Train Acc    Test Acc    
-------------------------------------------------------
Constant LR (η=0.5)       0.9725       0.9675
Step Decay LR             0.9700       0.9700

Generalization gap (train - test accuracy):
  Constant LR: 0.0050
  Step Decay:  0.0000

Both configurations achieve strong test accuracy, and our simple 2-layer network successfully separates the crescents. The generalization gap (train minus test accuracy) is small, indicating the models are not overfitting. Learning rate decay achieves slightly better final performance by taking smaller steps as training progresses, allowing finer adjustments near the optimum.

Visualizing the Learning Process

Let's examine the training dynamics more closely:

Out[33]:
Visualization
Line plot showing training and test loss decreasing over epochs for two learning rate strategies.
Training and test loss over epochs. Both methods converge, but learning rate decay achieves lower final loss by taking smaller steps near the optimum.
Line plot showing training and test accuracy increasing over epochs for two learning rate strategies.
Accuracy on training and test sets. The gap between train and test accuracy indicates model generalization. Both achieve similar test accuracy.

The loss curves show a familiar pattern: rapid initial decrease followed by slower refinement. The learning rate decay schedule (red) achieves lower final loss because it can take more precise steps late in training. The accuracy curves show corresponding behavior, with both reaching high accuracy quickly, though the decaying schedule achieves slightly better final values.

The Learned Decision Boundary

Finally, let's visualize what the network actually learned: the decision boundary separating the two classes.

Out[34]:
Visualization
Scatter plot showing two crescent moon shapes separated by a curved decision boundary.
Decision boundary learned by the neural network on the moons dataset. The nonlinear boundary correctly separates the two crescent-shaped classes. Training points are shown as circles (correct) or X markers (misclassified).

The network learned a smooth, curved boundary that follows the natural shape of the moons. Correctly classified points appear as circles; the rare misclassified points (near the boundary where the classes overlap) appear as X markers. SGD with learning rate decay found weights that generalize well. The boundary isn't overfit to individual training points but captures the underlying pattern.

Complete SGD Implementation

Having explored SGD's theory and applications, let's consolidate everything into a reusable implementation. This SGDOptimizer class encapsulates the key components: learning rate scheduling and minibatch generation.

In[35]:
Code
class SGDOptimizer:
    """
    Stochastic Gradient Descent optimizer with configurable options.

    Parameters
    ----------
    learning_rate : float
        Initial learning rate
    batch_size : int
        Number of samples per minibatch
    lr_schedule : callable, optional
        Function(step, lr0) -> learning rate at step
    """

    def __init__(self, learning_rate=0.01, batch_size=32, lr_schedule=None):
        self.lr0 = learning_rate
        self.batch_size = batch_size
        self.lr_schedule = lr_schedule
        self.step = 0

    def get_lr(self):
        """Get current learning rate."""
        if self.lr_schedule is None:
            return self.lr0
        return self.lr_schedule(self.step, self.lr0)

    def step_update(self, params, gradients):
        """
        Update parameters using SGD.

        Parameters
        ----------
        params : dict
            Dictionary of parameter arrays
        gradients : dict
            Dictionary of gradient arrays (same keys as params)

        Returns
        -------
        params : dict
            Updated parameters
        """
        lr = self.get_lr()

        for key in params:
            params[key] = params[key] - lr * gradients[key]

        self.step += 1
        return params

    def create_batches(self, X, y, shuffle=True):
        """
        Generate minibatches from data.

        Yields
        ------
        X_batch, y_batch : tuple
            Minibatch of features and labels
        """
        n_samples = len(y)
        indices = np.arange(n_samples)

        if shuffle:
            np.random.shuffle(indices)

        for start in range(0, n_samples, self.batch_size):
            batch_idx = indices[start : start + self.batch_size]
            yield X[batch_idx], y[batch_idx]

The class has three main responsibilities:

  1. get_lr(): Returns the current learning rate, applying the schedule if one is provided
  2. step_update(params, gradients): Applies the SGD update rule: w \leftarrow w - \eta \nabla L
  3. create_batches(X, y): Generates shuffled minibatches for one epoch

Here's how you would use the optimizer in a training loop:

In[58]:
Code
optimizer = SGDOptimizer(
    learning_rate=0.01,
    batch_size=32,
    lr_schedule=lambda step, lr0: lr0 * 0.99**step,
)

# Schematic loop: `model`, `compute_gradients`, and `n_epochs` are placeholders
# standing in for your own network, backprop code, and epoch budget.
for epoch in range(n_epochs):
    for X_batch, y_batch in optimizer.create_batches(X, y):
        # Forward pass
        predictions = model(X_batch)

        # Compute gradients (via backpropagation)
        gradients = compute_gradients(predictions, y_batch)

        # Update parameters: w = w - lr * grad
        model.params = optimizer.step_update(model.params, gradients)

The optimizer handles learning rate scheduling internally, so you don't need to manually track the step count. The create_batches method yields shuffled minibatches, ensuring each epoch sees the data in a different order (important for avoiding cyclic patterns that can hurt convergence).

Limitations and Challenges

SGD is remarkably effective. It's trained virtually every neural network you've ever used. But it has well-known limitations that motivate the more sophisticated optimizers we'll study next.

One Learning Rate for All Dimensions

The "optimal" learning rate varies across different dimensions of the parameter space. Dimensions with large gradients (steep loss surface) want small learning rates to avoid overshooting. Dimensions with small gradients (gentle slope) want large learning rates to make meaningful progress.

Vanilla SGD uses the same \eta for all dimensions. This is a fundamental mismatch: the learning rate that works for steep dimensions is too small for gentle ones, and vice versa.
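A tiny sketch makes the mismatch concrete. Take an assumed toy loss L(w) = 5 w_1^2 + 0.05 w_2^2, steep in w_1 and gentle in w_2 (a 100x curvature ratio). A learning rate chosen to be safe for the steep dimension leaves the gentle dimension crawling:

Code
# Gradient of the toy loss: dL/dw1 = 10 * w1, dL/dw2 = 0.1 * w2
w = np.array([1.0, 1.0])
lr = 0.15  # close to the largest stable step for the steep w1 direction

for _ in range(50):
    grad = np.array([10 * w[0], 0.1 * w[1]])
    w = w - lr * grad

# w1 has essentially converged to 0, while w2 has covered only about half the distance
print(f"After 50 steps: w1 = {w[0]:.2e}, w2 = {w[1]:.3f}")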

Difficulty with Saddle Points

In high-dimensional spaces, saddle points (where gradients are zero but it's not a minimum) are far more common than local minima. Near a saddle point, gradients become tiny, and SGD slows to a crawl. The optimizer might spend thousands of iterations in the "saddle" region before random noise eventually pushes it over the edge.
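A quick sketch with an assumed toy function f(w_1, w_2) = w_1^2 - w_2^2, which has a saddle at the origin, shows why: the closer the weights are to the saddle, the smaller the gradient, and the smaller each SGD step becomes.

Code
def saddle_gradient(w):
    # Gradient of f(w1, w2) = w1^2 - w2^2, whose saddle point sits at the origin
    return np.array([2 * w[0], -2 * w[1]])

lr = 0.01
for point in [np.array([1.0, 1.0]), np.array([1e-3, 1e-3]), np.array([1e-6, 1e-6])]:
    g = saddle_gradient(point)
    print(
        f"w = {point},  |grad| = {np.linalg.norm(g):.1e},  "
        f"step length = {lr * np.linalg.norm(g):.1e}"
    )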

Oscillation in Ill-Conditioned Problems

When the loss surface is elongated, with high curvature in some directions and low curvature in others, SGD exhibits a characteristic pathology: it oscillates rapidly across the narrow valley while making slow progress along the long axis.

Out[36]:
Visualization
Contour plot showing an elongated elliptical loss surface with an oscillating optimization path.
SGD struggling with an ill-conditioned loss surface. The optimization path (red) oscillates across the narrow valley while slowly progressing toward the minimum. This inefficiency motivates momentum-based methods covered in the next chapter.

The oscillation pattern is unmistakable. In the steep direction (w_1, horizontal axis), SGD overshoots repeatedly, bouncing back and forth across the valley. In the shallow direction (w_2, vertical axis), it makes painfully slow progress because the gradients are small.

This is an ill-conditioned problem, where the ratio of largest to smallest curvature is high. For vanilla SGD, ill-conditioning is poison. The learning rate that's appropriate for the steep direction is far too small for the shallow one.

These limitations motivate momentum-based methods, which we'll explore in the next chapter. By accumulating a "velocity" that builds up along consistent gradient directions, momentum methods dampen the oscillations while accelerating progress along the valley floor.

Summary

Stochastic Gradient Descent trades exact gradients for speed, making neural network training practical at scale. The key ideas:

Core mechanics:

  • Batch gradient descent uses all data for each update; computationally expensive but precise
  • Pure SGD uses single examples; fast but noisy
  • Minibatch SGD (the practical choice) uses small batches (32-256), balancing speed and stability

Learning rate:

  • Too small: training is slow but stable
  • Too large: training is fast but may diverge
  • Learning rate finder helps identify a good starting point
  • Decaying schedules (step, exponential, cosine) improve final convergence

Noise as regularization:

  • SGD noise helps escape sharp minima that generalize poorly
  • Smaller batches and larger learning rates increase effective noise
  • The ratio \eta/B controls the noise scale

Practical considerations:

  • Shuffle data each epoch to avoid cycles
  • Monitor both training and validation loss
  • Use learning rate schedules for best convergence
  • Gradient variance decreases as 1/B with batch size B

SGD's limitations (sensitivity to the learning rate, slow convergence on ill-conditioned problems, and difficulty with saddle points) motivate the momentum-based optimizers we'll explore in the next chapter.

Key Parameters

Key SGD hyperparameters and their effects on training.

| Parameter | Typical Values | Effect |
| --- | --- | --- |
| Learning rate (\eta) | 0.001-0.1 | Higher values mean faster but less stable training |
| Batch size (B) | 32-256 | Larger batches reduce gradient variance but may hurt generalization |
| Epochs | 50-500 | More epochs allow more updates; use early stopping to prevent overfitting |
| Learning rate decay | 0.1-0.5 every N steps | Enables convergence to precise minima |

When starting with a new problem, a reasonable baseline is:

  • Batch size: 32 or 64
  • Learning rate: Use a learning rate finder, or start with 0.01
  • Schedule: Step decay by 0.1 every 30% of total training
  • Shuffle: Always shuffle data each epoch
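Put together with the SGDOptimizer class from earlier, that baseline might look like the following sketch (the 3,000-step training length is an assumption you would replace with your own epoch count times updates per epoch):

Code
total_steps = 3000  # assumed training length: epochs * updates per epoch

baseline_optimizer = SGDOptimizer(
    learning_rate=0.01,
    batch_size=64,
    # Step decay: multiply the rate by 0.1 every 30% of training
    lr_schedule=lambda step, lr0: lr0 * (0.1 ** (step // int(0.3 * total_steps))),
)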
