Loss Functions: MSE, Cross-Entropy, Focal Loss & Custom Implementations

Michael Brenndoerfer · December 16, 2025 · 41 min read

Master neural network loss functions from MSE to cross-entropy, including numerical stability, label smoothing, and focal loss for imbalanced data.


Loss Functions

How does a neural network know it's wrong? When a model predicts that a movie review is positive but it's actually negative, something needs to measure that mistake. Loss functions quantify the difference between predictions and ground truth, providing the signal that guides learning. Without a loss function, a neural network has no compass, no way to improve.

In the previous chapters, we built the forward pass: data flows through layers, activations introduce non-linearity, and the network produces an output. But that output needs to be evaluated. Loss functions sit at the end of the forward pass, taking predictions and true labels, and producing a single number that summarizes how wrong the model is. This number then flows backward through backpropagation, telling each parameter how to adjust.

This chapter explores the mathematical foundations of loss functions, starting with mean squared error for regression and cross-entropy for classification. You'll learn why cross-entropy works better than squared error for classification, how to handle numerical stability issues that plague naive implementations, and advanced techniques like label smoothing and focal loss that address real-world challenges. By the end, you'll understand how to choose, implement, and even design custom loss functions for your specific needs.

The Role of Loss Functions

Before diving into specific formulas, let's establish what loss functions do and why they matter so much. A loss function serves three interconnected purposes.

First, it provides an optimization objective. Neural networks learn by minimizing a loss function through gradient descent. The loss surface defines the landscape the optimizer navigates, and the choice of loss function shapes that landscape. Smooth loss surfaces with clear gradients lead to stable training, while poorly chosen losses can create plateaus, sharp cliffs, or misleading local minima.

Second, it encodes the task. Different problems require different notions of "correctness." Predicting house prices within $10,000 is very different from predicting whether an email is spam. The loss function translates task requirements into mathematical objectives. Regression losses measure distance from target values, while classification losses measure how confidently the model assigns probability to the correct class.

Third, it shapes the gradients. During backpropagation, the gradient of the loss with respect to predictions determines how strongly each parameter gets updated. Some loss functions produce gradients that are proportional to the error magnitude, while others produce gradients that depend on confidence levels. This distinction profoundly affects training dynamics.

Loss Function

A loss function (also called cost function or objective function) maps model predictions and ground truth labels to a non-negative scalar that quantifies prediction error:

$$\mathcal{L}: \mathbb{R}^n \times \mathbb{R}^n \rightarrow \mathbb{R}^+$$

where:

  • $\hat{y} \in \mathbb{R}^n$: the model's predictions (a vector of $n$ values)
  • $y \in \mathbb{R}^n$: the ground truth labels
  • $\mathcal{L}(\hat{y}, y)$: the loss value, typically non-negative, where 0 indicates perfect prediction

The goal of training is to find parameters $\theta$ that minimize the expected loss over the training data.

Mean Squared Error for Regression

We start with the most intuitive loss function: mean squared error (MSE). When predicting continuous values, like house prices, stock returns, or temperature, we want predictions close to targets. MSE measures the average squared distance between predictions and targets.

Mathematical Formulation

For a single prediction, the squared error is simply the squared difference between prediction and target:

$$\text{SE} = (\hat{y} - y)^2$$

where:

  • $\hat{y}$: the model's prediction for this sample
  • $y$: the true target value
  • $\text{SE}$: the squared error, always non-negative

Squaring serves two purposes. It makes all errors positive (a prediction 5 units too high is as bad as 5 units too low), and it penalizes large errors more heavily than small ones. An error of 10 contributes 100 to the loss, while an error of 1 contributes only 1.

Mean Squared Error (MSE)

MSE averages the squared errors across all nn samples in a batch:

$$\mathcal{L}_{\text{MSE}} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2$$

where:

  • $n$: the number of samples in the batch
  • $\hat{y}_i$: the model's prediction for sample $i$
  • $y_i$: the true target value for sample $i$
  • The $\frac{1}{n}$ factor normalizes the loss so it doesn't grow with batch size

Why Squared Error?

Why square the error rather than take its absolute value, cube it, or use some other function? The answer lies in a beautiful connection between MSE and probability theory that reveals the hidden assumptions behind this seemingly arbitrary choice.

Imagine you're predicting house prices. Even with a perfect model, your predictions won't be exactly right because of factors you can't observe: the seller's mood, minor undisclosed repairs, or market fluctuations on the day of sale. These unobserved factors create random noise around the true relationship. The central limit theorem tells us that when many small, independent factors combine, their sum tends toward a Gaussian (normal) distribution. This is why assuming Gaussian noise is often reasonable.

Under this assumption, the observed target $y$ equals the model's prediction $\hat{y}$ plus some random noise $\epsilon$:

$$y = \hat{y} + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2)$$

where:

  • $y$: the observed target value
  • $\hat{y}$: the model's prediction
  • $\epsilon$: the random noise term, drawn from a Gaussian distribution
  • $\sigma^2$: the variance of the noise (how spread out the errors are)
  • $\mathcal{N}(0, \sigma^2)$: the Gaussian (normal) distribution with mean 0 and variance $\sigma^2$
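For reference, the Gaussian density makes this likelihood explicit; the normalization factor in front becomes the constant term once we take the logarithm:

$$p(y | \hat{y}) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y - \hat{y})^2}{2\sigma^2}\right)$$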

Now we can ask: given our prediction $\hat{y}$, what's the probability of observing the actual target $y$? The Gaussian distribution gives us this probability, and taking its logarithm (for computational convenience and to turn products into sums) yields:

$$\log p(y | \hat{y}) = -\frac{1}{2\sigma^2}(y - \hat{y})^2 + \text{const}$$

where:

  • $\log p(y | \hat{y})$: the log-probability of observing $y$ given that the model predicted $\hat{y}$
  • $-\frac{1}{2\sigma^2}$: a negative scaling factor (since $\sigma^2 > 0$, this term is always negative)
  • $(y - \hat{y})^2$: the squared error between target and prediction
  • $\text{const}$: terms that don't depend on $\hat{y}$ and thus don't affect optimization

The key insight is that maximizing this log-likelihood, which means finding predictions that make the observed data most probable, is mathematically equivalent to minimizing the squared error $(y - \hat{y})^2$. The negative sign flips maximization to minimization, and the constant scaling factor $\frac{1}{2\sigma^2}$ doesn't change which prediction is optimal.

This probabilistic interpretation has profound implications. When you minimize MSE, you're implicitly assuming that prediction errors follow a Gaussian distribution. If errors actually follow a different distribution, like one with heavy tails where outliers are common, MSE may not be the best choice. This is why robust alternatives like Mean Absolute Error (MAE) exist for situations where outliers are prevalent.
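To make the outlier sensitivity concrete, here is a tiny numerical sketch (with made-up numbers, not part of the original notebook). The constant prediction that minimizes MSE is the mean of the targets, while the constant that minimizes MAE is the median; a single outlier drags the mean far more than the median:

```python
import numpy as np

# Five well-behaved targets plus one outlier
targets = np.array([1.0, 1.1, 0.9, 1.0, 1.2])
with_outlier = np.append(targets, 10.0)

# MSE-optimal constant = mean, MAE-optimal constant = median
print(np.mean(targets), np.median(targets))            # 1.04  1.0
print(np.mean(with_outlier), np.median(with_outlier))  # ~2.53 1.05
```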

Out[3]:
Visualization
Line plot showing MSE curve rising steeply (quadratic) compared to MAE curve rising linearly as prediction error increases.
MSE vs MAE loss penalty by error size. MSE's quadratic penalty makes an error of 5 cost 25 times more than an error of 1, while MAE treats them proportionally.
Scatter plot with fitted lines showing MSE fit pulled toward outlier while MAE fit remains robust.
Effect of a single outlier on line fitting. MSE's minimum shifts toward the outlier while a robust MAE fit stays closer to the majority of data points.

Implementation

Let's implement MSE from scratch and verify against PyTorch's built-in implementation:

In[4]:
Code
# Imports (assumed here; in the original notebook they likely come from an earlier setup cell)
import numpy as np

# Generate sample regression data
np.random.seed(42)
n_samples = 100

# True relationship: y = 2x + 1 + noise
x = np.random.randn(n_samples)
y_true = 2 * x + 1 + 0.3 * np.random.randn(n_samples)

# Simulated predictions (imperfect model)
y_pred = 1.8 * x + 1.2


def mse_loss_numpy(y_pred, y_true):
    """Compute MSE loss from scratch."""
    squared_errors = (y_pred - y_true) ** 2
    return np.mean(squared_errors)


# Our implementation
our_mse = mse_loss_numpy(y_pred, y_true)
Out[5]:
Console
MSE Loss Comparison:
----------------------------------------
Our implementation:  0.145489
PyTorch MSELoss:     0.145489
Difference:          7.55e-09

The values match, confirming our implementation is correct. The tiny difference (if any) comes from floating-point precision.

The MSE Gradient

Understanding the gradient of MSE reveals why it works well for regression. Taking the derivative with respect to a single prediction $\hat{y}_i$:

$$\frac{\partial \mathcal{L}_{\text{MSE}}}{\partial \hat{y}_i} = \frac{2}{n}(\hat{y}_i - y_i)$$

where:

  • $\frac{\partial \mathcal{L}_{\text{MSE}}}{\partial \hat{y}_i}$: the partial derivative of the loss with respect to prediction $i$, telling us how to adjust $\hat{y}_i$ to reduce loss
  • $\frac{2}{n}$: a scaling factor arising from the derivative of $x^2$ (which gives $2x$) and the $\frac{1}{n}$ averaging
  • $(\hat{y}_i - y_i)$: the prediction error, positive if prediction is too high, negative if too low

This gradient has an elegant property: it's proportional to the error itself. If a prediction is 10 units too high, the gradient pushes it down with magnitude proportional to 10. If a prediction is 0.1 units too high, the gradient is proportionally smaller. This linear relationship between error and gradient leads to stable, predictable learning dynamics.

Out[6]:
Visualization
Parabolic curve showing MSE loss increasing quadratically as prediction error moves away from zero in either direction.
MSE loss as a function of prediction error. The parabolic shape means errors further from zero incur quadratically increasing penalties.
Straight line through origin showing MSE gradient increasing linearly with prediction error, negative for negative errors and positive for positive errors.
MSE gradient (derivative) as a function of prediction error. The linear relationship means gradient magnitude scales proportionally with error.

The parabolic loss curve and linear gradient are hallmarks of MSE. The minimum occurs exactly at zero error, and the gradient always points toward that minimum. However, the quadratic penalty means MSE is sensitive to outliers: a single prediction with error 10 contributes as much loss as 100 predictions with error 1.
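As a quick sanity check on this formula, the short sketch below (not part of the original notebook; tensor names are placeholders) compares the analytic gradient $\frac{2}{n}(\hat{y}_i - y_i)$ against the gradient PyTorch's autograd computes for `mse_loss` with mean reduction:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n = 8
y_hat = torch.randn(n, requires_grad=True)  # predictions
y = torch.randn(n)                          # targets

loss = F.mse_loss(y_hat, y)  # default reduction='mean'
loss.backward()

analytic = 2.0 / n * (y_hat.detach() - y)
print(torch.allclose(y_hat.grad, analytic))  # True
```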

Binary Cross-Entropy Loss

Regression predicts continuous values, but classification predicts discrete categories. For binary classification (two classes), we need a different approach. The model outputs a probability $p \in [0, 1]$ that the input belongs to class 1, and we need to measure how well that probability matches the true label $y \in \{0, 1\}$.

Why Not Use MSE for Classification?

It's tempting to apply MSE to classification: just measure the squared difference between predicted probability and target label. Let's see why this fails.

Consider a sample with true label $y = 1$ (positive class). The model predicts $p = 0.9$ (90% confident it's positive). The MSE is $(0.9 - 1)^2 = 0.01$, a small loss. Now suppose the model predicts $p = 0.1$ (90% confident it's negative, wrong!). The MSE is $(0.1 - 1)^2 = 0.81$, still less than 1.

The problem is the gradient. For MSE, $\frac{\partial \mathcal{L}}{\partial p} = 2(p - y)$. When $p = 0.1$ and $y = 1$, the gradient is $2(0.1 - 1) = -1.8$. But that gradient reaches the network's weights only after being multiplied by the sigmoid's derivative $p(1-p)$, which shrinks toward zero as the sigmoid saturates. The loss wants to push the probability higher, but the sigmoid's vanishing gradient prevents effective learning. This is the infamous vanishing gradient problem for classification.

Cross-entropy solves this by producing gradients that don't depend on the sigmoid's derivative.

Out[7]:
Visualization
Line plot showing MSE loss bounded near 1 while BCE loss increases sharply as predicted probability approaches 0.
Loss curves for a positive sample (y=1). MSE saturates at 1 for completely wrong predictions, while BCE approaches infinity, providing much stronger corrective signal.
Line plot showing BCE gradient magnitude increasing sharply for low probability predictions while MSE gradient remains bounded.
Gradient magnitude comparison. BCE produces large gradients for confident wrong predictions, while MSE gradients are bounded and small near the extremes.

The visualization reveals the critical difference. When the model confidently predicts $p = 0.1$ for a true positive ($y = 1$), MSE gives a loss of only 0.81 with a gradient magnitude of about 1.8. BCE gives a loss of 2.3 with a gradient magnitude of 10, over 5 times stronger. This stronger signal helps overcome the vanishing gradients from sigmoid saturation, making BCE far more effective for classification.
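The same comparison can be made with respect to the logit $z$ that feeds the sigmoid, which is what backpropagation actually sees. The sketch below (an illustrative check, not from the original notebook) uses a logit chosen so that $\sigma(z) \approx 0.1$ for a true positive; the MSE gradient is shrunk by the sigmoid's derivative, while the BCE gradient is simply $p - y$:

```python
import torch
import torch.nn.functional as F

y = torch.tensor([1.0])                              # true positive
z_mse = torch.tensor([-2.197], requires_grad=True)   # sigmoid(-2.197) ≈ 0.1
z_bce = torch.tensor([-2.197], requires_grad=True)

# MSE applied to the sigmoid output
((torch.sigmoid(z_mse) - y) ** 2).mean().backward()

# BCE applied to the sigmoid output
F.binary_cross_entropy(torch.sigmoid(z_bce), y).backward()

print(f"MSE dL/dz: {z_mse.grad.item():.3f}")  # ≈ -0.162
print(f"BCE dL/dz: {z_bce.grad.item():.3f}")  # ≈ -0.900
```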

Mathematical Derivation

To derive binary cross-entropy, we need to think about classification from a probabilistic perspective. The model outputs a probability $p$ that a sample belongs to class 1. What we want is to find the $p$ that makes the observed label $y$ most likely. This is the principle of maximum likelihood estimation.

Consider what "likelihood" means here. If the true label is $y = 1$ (positive class), then the model should output a high probability $p$. The likelihood of observing $y = 1$ given prediction $p$ is simply $p$ itself: a model predicting $p = 0.9$ makes observing a positive label 90% likely. Conversely, if $y = 0$ (negative class), the likelihood of observing this given prediction $p$ is $1 - p$: a model predicting $p = 0.1$ makes observing a negative label 90% likely.

We can express both cases in a single elegant formula using exponents:

$$P(y | p) = p^y (1 - p)^{1-y}$$

where:

  • $P(y | p)$: the probability of observing label $y$ given the model's predicted probability $p$
  • $p$: the model's predicted probability for class 1 (must be between 0 and 1)
  • $y \in \{0, 1\}$: the true label
  • $p^y$: equals $p$ when $y=1$, equals $1$ when $y=0$ (since any number to the power of 0 is 1)
  • $(1-p)^{1-y}$: equals $1-p$ when $y=0$, equals $1$ when $y=1$

The beauty of this formula is how the exponents act as switches. When $y = 1$, the formula simplifies to $p^1 \cdot (1-p)^0 = p \cdot 1 = p$. When $y = 0$, it becomes $p^0 \cdot (1-p)^1 = 1 \cdot (1-p) = 1 - p$. One formula handles both cases.

Now, we want to maximize this likelihood, but maximization is less convenient than minimization for gradient descent. Also, products of probabilities (which arise when we have multiple samples) become sums when we take logarithms, which is numerically more stable. So we take the negative log:

$$-\log P(y | p) = -y \log(p) - (1 - y) \log(1 - p)$$

where:

  • $-\log P(y | p)$: the negative log-likelihood, our loss function
  • $-y \log(p)$: the loss contribution when the true class is 1 (penalizes low $p$)
  • $-(1-y) \log(1-p)$: the loss contribution when the true class is 0 (penalizes high $p$)

This is the binary cross-entropy for a single sample. The logarithm has a crucial property: $-\log(x)$ approaches infinity as $x$ approaches 0. This means confident wrong predictions (like $p = 0.01$ when $y = 1$) incur enormous penalties, while confident correct predictions (like $p = 0.99$ when $y = 1$) incur almost no penalty. The loss function naturally focuses learning on the mistakes that matter most.

Binary Cross-Entropy (BCE) Loss

Binary cross-entropy measures the negative log-likelihood for binary classification:

$$\mathcal{L}_{\text{BCE}} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(\hat{p}_i) + (1 - y_i) \log(1 - \hat{p}_i) \right]$$

where:

  • $n$: number of samples
  • $y_i \in \{0, 1\}$: true label for sample $i$
  • $\hat{p}_i \in (0, 1)$: predicted probability of class 1 for sample $i$
  • $\log$: natural logarithm

The loss approaches 0 when predictions are confident and correct, and approaches infinity when predictions are confident and wrong.

Understanding the BCE Terms

The formula has two terms that activate depending on the true label:

  • When $y = 1$: Only $-\log(p)$ contributes. If $p \approx 1$, loss is near 0. If $p \approx 0$, loss explodes toward infinity.
  • When $y = 0$: Only $-\log(1 - p)$ contributes. If $p \approx 0$, loss is near 0. If $p \approx 1$, loss explodes.

This asymmetry is exactly what we want. A confident wrong prediction (predicting $p = 0.99$ when $y = 0$) incurs a huge penalty, forcing the model to avoid overconfident mistakes.

In[8]:
Code
def binary_cross_entropy_numpy(p_pred, y_true, epsilon=1e-15):
    """Compute binary cross-entropy from scratch.

    Args:
        p_pred: Predicted probabilities (0 to 1)
        y_true: True binary labels (0 or 1)
        epsilon: Small constant for numerical stability
    """
    # Clip predictions to avoid log(0)
    p_pred = np.clip(p_pred, epsilon, 1 - epsilon)

    # Compute BCE
    bce = -y_true * np.log(p_pred) - (1 - y_true) * np.log(1 - p_pred)
    return np.mean(bce)


# Generate sample binary classification data
np.random.seed(42)
n_samples = 100

# True labels (binary)
y_true_binary = np.random.randint(0, 2, n_samples).astype(float)

# Simulated predicted probabilities (imperfect model)
# Add noise to true labels and clip to valid probability range
noise = 0.3 * np.random.randn(n_samples)
p_pred = np.clip(y_true_binary + noise, 0.01, 0.99)

# Our implementation
our_bce = binary_cross_entropy_numpy(p_pred, y_true_binary)
Out[9]:
Console
Binary Cross-Entropy Comparison:
---------------------------------------------
Our implementation:  0.161924
PyTorch BCELoss:     0.161924
Difference:          9.14e-09

Our implementation matches PyTorch's built-in BCE loss, confirming the correctness of our formula. The loss value of approximately 0.16 reflects a reasonably good model since predictions correlate with true labels, but not perfectly, as we intentionally added noise to simulate real-world imperfection.

The BCE Gradient

The gradient of BCE with respect to predicted probability $p$ reveals why it works so well:

$$\frac{\partial \mathcal{L}_{\text{BCE}}}{\partial p} = -\frac{y}{p} + \frac{1 - y}{1 - p}$$

where:

  • $\frac{\partial \mathcal{L}_{\text{BCE}}}{\partial p}$: the derivative of the loss with respect to the predicted probability
  • $-\frac{y}{p}$: the gradient contribution when $y=1$, which is large (negative) when $p$ is small
  • $\frac{1-y}{1-p}$: the gradient contribution when $y=0$, which is large (positive) when $p$ is close to 1

For a single sample with $y = 1$, this simplifies to:

$$\frac{\partial \mathcal{L}}{\partial p} = -\frac{1}{p}$$

When $p$ is small (wrong prediction), the gradient magnitude $\frac{1}{p}$ is large, producing strong corrective signal. When $p$ is close to 1 (correct prediction), the gradient is small, leaving good predictions relatively undisturbed.

Critically, when combined with a sigmoid activation $p = \sigma(z)$, the chain rule gives:

$$\frac{\partial \mathcal{L}}{\partial z} = \frac{\partial \mathcal{L}}{\partial p} \cdot \frac{\partial p}{\partial z} = \left(-\frac{y}{p} + \frac{1-y}{1-p}\right) \cdot p(1-p) = p - y$$

where:

  • $z$: the pre-activation input to the sigmoid (the logit)
  • $\frac{\partial \mathcal{L}}{\partial z}$: the gradient we need for backpropagation
  • $\frac{\partial p}{\partial z} = p(1-p)$: the derivative of the sigmoid function
  • $p - y$: the remarkably simple final result

The sigmoid's $p(1-p)$ derivative combines with the BCE gradient terms to produce a beautifully simple result: just the difference between prediction and target. For $y=1$, we get $-\frac{1}{p} \cdot p(1-p) = -(1-p) = p - 1$. For $y=0$, we get $\frac{1}{1-p} \cdot p(1-p) = p$. Both cases simplify to $p - y$. This elegant cancellation is not coincidental; it's a fundamental property of the cross-entropy / sigmoid pairing that makes training stable.
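The same cancellation can be verified numerically. The sketch below (an illustrative check, not from the original notebook) asks autograd for the gradient of BCE-with-logits with respect to the logits and compares it to $p - y$:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
z = torch.randn(5, requires_grad=True)   # logits
y = torch.randint(0, 2, (5,)).float()    # binary targets

# Sum reduction keeps the per-sample gradient formula exact (no 1/n factor)
loss = F.binary_cross_entropy_with_logits(z, y, reduction="sum")
loss.backward()

p = torch.sigmoid(z.detach())
print(torch.allclose(z.grad, p - y))  # True: dL/dz_i = p_i - y_i
```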

Out[10]:
Visualization
Curve showing BCE loss decreasing from high values near p=0 to near zero as p approaches 1 for positive samples.
BCE loss for positive samples (y=1). The loss increases sharply as predicted probability approaches 0, heavily penalizing confident wrong predictions.
Curve showing BCE loss near zero for small p values increasing sharply as p approaches 1 for negative samples.
BCE loss for negative samples (y=0). The loss increases sharply as predicted probability approaches 1, penalizing false positive predictions.

The asymmetric penalty structure is clear: for positive samples, low predictions are heavily penalized; for negative samples, high predictions are heavily penalized. The vertical asymptotes at the extremes reflect the infinite loss of predicting 0 probability for a true positive or 1 probability for a true negative.

Multiclass Cross-Entropy Loss

Binary classification handles two classes. But what about classifying text into multiple categories: sentiment (positive, negative, neutral), intent (question, command, statement), or language (English, Spanish, French)? Multiclass cross-entropy extends BCE to handle any number of classes.

From Binary to Multiclass

Binary classification has a convenient property: with only two classes, knowing the probability of class 1 automatically tells you the probability of class 0 (it's just $1 - p$). But what happens when we have three, ten, or even thousands of classes? We need a way to convert the network's raw outputs into a valid probability distribution where all probabilities are positive and sum to 1.

The network produces raw scores called logits, one for each class. These logits can be any real number: positive, negative, or zero. They represent the network's "confidence" in each class, but they're not probabilities. A logit of 5 for class A and 3 for class B tells us the network prefers A, but by how much? And what about the other classes?

The softmax function solves this problem elegantly:

$$p_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}$$

where:

  • $p_k$: the output probability for class $k$, guaranteed to be between 0 and 1
  • $z_k$: the raw score (logit) for class $k$, the unnormalized output from the network
  • $e^{z_k}$: the exponential function applied to the logit, ensuring positivity
  • $\sum_{j=1}^{K} e^{z_j}$: the sum of exponentials over all $K$ classes, the normalization constant
  • $K$: the total number of classes

Why use the exponential function? It has three essential properties. First, $e^x$ is always positive, so every class gets a positive "vote." Second, the exponential preserves ordering: if $z_1 > z_2$, then $e^{z_1} > e^{z_2}$, so the class with the highest logit gets the highest probability. Third, and most importantly, the exponential amplifies differences. If $z_1 = 2$ and $z_2 = 1$, the difference is just 1, but $e^2 / e^1 \approx 2.7$, so class 1 gets nearly three times the probability of class 2. This amplification means the model can express strong preferences when confident and more uniform distributions when uncertain.

The denominator $\sum_{j=1}^{K} e^{z_j}$ is the normalization constant that ensures all probabilities sum to 1. Think of it as the total "votes" across all classes, and each class's probability is its share of the total.
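A tiny numerical example (values chosen here for illustration) shows the amplification at work: a logit gap of 1 becomes a probability ratio of roughly $e \approx 2.7$:

```python
import numpy as np

z = np.array([2.0, 1.0, 0.0])     # logits
p = np.exp(z) / np.exp(z).sum()   # softmax

print(np.round(p, 3))  # [0.665 0.245 0.09 ]
print(p[0] / p[1])     # ≈ 2.718, i.e. e^(2 - 1)
```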

Out[11]:
Visualization
Bar chart showing raw logits with mixed positive and negative values for 5 classes.
Raw logits from a neural network can be any real numbers, positive or negative. These represent unnormalized confidence scores for each class.
Bar chart showing softmax probabilities with all positive values summing to 1.
After softmax, all values are positive and sum to 1. The relative ordering is preserved but differences are amplified. The highest logit (class 2) gets the majority of probability mass.

Mathematical Formulation

With softmax converting logits to probabilities, we need a loss function that measures how well the predicted distribution matches reality. The key insight is that we don't need to penalize every class's probability. We only care about one thing: how much probability did the model assign to the correct class?

This leads to a remarkably simple loss function. Given a true label $y$ (an integer from 0 to $K-1$) and predicted probabilities $\mathbf{p} = [p_0, p_1, \ldots, p_{K-1}]$, we simply take the negative log of the probability assigned to the true class:

Categorical Cross-Entropy Loss

Categorical cross-entropy (also called softmax loss or negative log-likelihood) for multiclass classification:

$$\mathcal{L}_{\text{CE}} = -\log(p_y) = -\log\left(\frac{e^{z_y}}{\sum_{j=1}^{K} e^{z_j}}\right)$$

where:

  • $y$: the true class label (integer from 0 to $K-1$)
  • $p_y$: the predicted probability assigned to the true class
  • $z_y$: the logit (pre-softmax score) for the true class
  • $K$: the total number of classes

For a batch of $n$ samples:

$$\mathcal{L}_{\text{CE}} = -\frac{1}{n} \sum_{i=1}^{n} \log(p_{y_i}^{(i)})$$

The loss only considers the probability assigned to the correct class, but this single number captures everything we need. If the model assigns 90% probability to the correct class, loss is $-\log(0.9) \approx 0.105$, a small penalty. If it assigns only 10%, loss is $-\log(0.1) \approx 2.303$, a much larger penalty. The logarithm's shape, steep near zero and flat near one, means the model is strongly pushed to avoid low probabilities on correct classes but receives diminishing rewards for pushing already-high probabilities even higher.

Notice how this connects back to binary cross-entropy. In the binary case, we had two terms: $-y\log(p) - (1-y)\log(1-p)$. For multiclass, we're essentially doing the same thing, but with $K$ classes instead of 2. The one-hot encoding of the label (all zeros except a 1 for the true class) acts as the selector, zeroing out all terms except the one for the correct class.

One-Hot Encoding Perspective

An equivalent formulation uses one-hot encoded labels. If the true class is $y = 2$ among $K = 4$ classes, the one-hot vector is $\mathbf{t} = [0, 0, 1, 0]$. The cross-entropy becomes:

$$\mathcal{L}_{\text{CE}} = -\sum_{k=0}^{K-1} t_k \log(p_k)$$

where:

  • $\mathbf{t} = [t_0, t_1, \ldots, t_{K-1}]$: the one-hot target vector, with $t_y = 1$ for the true class and $t_k = 0$ for all other classes
  • $p_k$: the predicted probability for class $k$
  • $t_k \log(p_k)$: the weighted log-probability, which equals $\log(p_k)$ when $t_k = 1$ and 0 otherwise

Since $\mathbf{t}$ is one-hot, only the term where $t_k = 1$ survives, giving us $-\log(p_y)$. The one-hot formulation is conceptually useful and generalizes to soft labels (as we'll see with label smoothing).
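A few lines of NumPy (toy numbers, not from the original notebook) confirm that the one-hot form collapses to the index form:

```python
import numpy as np

p = np.array([0.1, 0.2, 0.6, 0.1])   # predicted probabilities over 4 classes
y = 2                                # true class index
t = np.eye(4)[y]                     # one-hot target: [0, 0, 1, 0]

loss_onehot = -np.sum(t * np.log(p))  # general form (also works for soft targets)
loss_index = -np.log(p[y])            # index form

print(np.isclose(loss_onehot, loss_index))  # True, both ≈ 0.511
```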

In[12]:
Code
def softmax(logits):
    """Compute softmax probabilities from logits."""
    # Subtract max for numerical stability
    exp_logits = np.exp(logits - np.max(logits, axis=-1, keepdims=True))
    return exp_logits / np.sum(exp_logits, axis=-1, keepdims=True)


def cross_entropy_loss_numpy(logits, labels):
    """Compute cross-entropy loss from scratch.

    Args:
        logits: Raw scores of shape (n_samples, n_classes)
        labels: Integer class labels of shape (n_samples,)
    """
    n_samples = logits.shape[0]

    # Compute softmax probabilities
    probs = softmax(logits)

    # Get probability of true class for each sample
    true_class_probs = probs[np.arange(n_samples), labels]

    # Compute negative log likelihood
    return -np.mean(np.log(true_class_probs + 1e-15))


# Generate sample multiclass data
np.random.seed(42)
n_samples = 100
n_classes = 5

# True labels (integers 0 to n_classes-1)
y_true_multi = np.random.randint(0, n_classes, n_samples)

# Simulated logits (model outputs before softmax)
# Make correct class have higher logits on average
logits = np.random.randn(n_samples, n_classes)
for i in range(n_samples):
    logits[i, y_true_multi[i]] += 2.0  # Boost correct class

# Our implementation
our_ce = cross_entropy_loss_numpy(logits, y_true_multi)
Out[13]:
Console
Multiclass Cross-Entropy Comparison:
---------------------------------------------
Our implementation:  0.811311
PyTorch CE Loss:     0.811311
Difference:          3.47e-08

The loss value of approximately 0.81 indicates a model that performs reasonably well but not perfectly. Recall that we boosted correct class logits by 2.0, which gives the model an advantage but doesn't guarantee perfect predictions due to the random logit initialization. A loss near zero would indicate nearly perfect predictions, while a loss near $\ln(5) \approx 1.61$ would suggest performance no better than random guessing across the five classes.

The Cross-Entropy Gradient

For multiclass cross-entropy with softmax, the gradient with respect to logit $z_k$ has an elegant form:

$$\frac{\partial \mathcal{L}}{\partial z_k} = p_k - \mathbf{1}[k = y]$$

where:

  • $\frac{\partial \mathcal{L}}{\partial z_k}$: the gradient of the loss with respect to the logit for class $k$
  • $p_k$: the predicted probability for class $k$ (from softmax)
  • $\mathbf{1}[k = y]$: the indicator function, which equals 1 if $k$ is the true class $y$, and 0 otherwise

In vector form, if $\mathbf{t}$ is the one-hot target:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{z}} = \mathbf{p} - \mathbf{t}$$

where:

  • $\mathbf{z} = [z_0, z_1, \ldots, z_{K-1}]$: the vector of all logits
  • $\mathbf{p} = [p_0, p_1, \ldots, p_{K-1}]$: the vector of predicted probabilities
  • $\mathbf{t}$: the one-hot target vector

This is remarkably simple: the gradient is just the difference between predicted probabilities and the target distribution. For the correct class, the gradient pushes to increase probability. For incorrect classes, it pushes to decrease probability. The magnitude is proportional to how wrong the prediction is.
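The identity is easy to confirm with autograd. The sketch below (illustrative values, not from the original notebook) uses a single sample so that the mean reduction introduces no extra scaling:

```python
import torch
import torch.nn.functional as F

z = torch.tensor([[1.0, 2.0, 0.5, -1.0, 0.0]], requires_grad=True)  # 1 sample, 5 classes
y = torch.tensor([2])                                               # true class index

loss = F.cross_entropy(z, y)  # softmax + negative log-likelihood in one call
loss.backward()

p = F.softmax(z.detach(), dim=1)
t = F.one_hot(y, num_classes=5).float()
print(torch.allclose(z.grad, p - t))  # True
```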

Out[14]:
Visualization
Bar chart showing gradients for 5 classes, with class 2 having a negative value and all other classes having positive values proportional to their predicted probabilities.
Cross-entropy gradients for a 5-class example. The true class (class 2) receives a negative gradient to increase its probability, while all other classes receive positive gradients to decrease their probabilities.

Class 2 (the true class) has probability 0.62 but the target is 1.0, so its gradient is negative ($0.62 - 1.0 = -0.38$), pushing the logit higher. All other classes have gradients equal to their probabilities, pushing their logits lower. This balanced competition drives the model toward confident, correct predictions.

Numerical Stability

The mathematical formulas for cross-entropy involve logarithms of probabilities and exponentials of logits. These operations are prone to numerical issues: $\log(0) = -\infty$, and $e^{1000}$ overflows to infinity. Practical implementations must handle these edge cases carefully.

The Log-Sum-Exp Trick

The formulas we've derived are mathematically correct, but computers have finite precision. When we try to compute $e^{1000}$, the result exceeds the largest number a 64-bit float can represent, returning infinity. Similarly, $e^{-1000}$ underflows to zero. Since neural networks can produce arbitrarily large or small logits, especially early in training, we need implementations that handle these edge cases gracefully.

The solution is elegant: instead of computing exponentials of the original logits, we shift all logits by subtracting the maximum value first:

$$p_k = \frac{e^{z_k - \max(\mathbf{z})}}{\sum_{j} e^{z_j - \max(\mathbf{z})}}$$

where:

  • $\max(\mathbf{z})$: the maximum value among all logits
  • $z_k - \max(\mathbf{z})$: the shifted logit, now guaranteed to be $\leq 0$
  • $e^{z_k - \max(\mathbf{z})}$: the exponentiated shifted logit, guaranteed to be $\leq 1$

Why does this work? The shift is mathematically equivalent to multiplying both numerator and denominator by $e^{-\max(\mathbf{z})}$, which cancels out. But numerically, it transforms all exponents to be at most 0, so the largest exponential is $e^0 = 1$, which never overflows. The smallest exponentials might still underflow to zero, but that's fine: they contribute negligibly to the sum anyway.

For cross-entropy specifically, we can be even cleverer. Instead of computing softmax probabilities and then taking their log (which risks taking $\log(0)$ for underflowed probabilities), we can compute the log-softmax directly using the log-sum-exp trick:

$$\log\left(\sum_j e^{z_j}\right) = \max(\mathbf{z}) + \log\left(\sum_j e^{z_j - \max(\mathbf{z})}\right)$$

The left side is the naive computation that would overflow. The right side splits it into two stable parts: the maximum (which is just a number) plus the log of a sum of small exponentials. Since cross-entropy is $-z_y + \log(\sum_j e^{z_j})$, we can compute the entire loss without ever materializing potentially problematic intermediate values.

In[15]:
Code
def stable_cross_entropy(logits, labels):
    """Numerically stable cross-entropy using log-sum-exp trick.

    Args:
        logits: Shape (n_samples, n_classes)
        labels: Shape (n_samples,), integer class labels
    """
    n_samples = logits.shape[0]

    # Log-sum-exp trick for numerical stability
    max_logits = np.max(logits, axis=1, keepdims=True)
    log_sum_exp = max_logits.squeeze() + np.log(
        np.sum(np.exp(logits - max_logits), axis=1)
    )

    # Get logit of true class
    true_logits = logits[np.arange(n_samples), labels]

    # Cross-entropy: -log(softmax) = -true_logit + log_sum_exp
    return np.mean(-true_logits + log_sum_exp)


# Test with extreme logits that would overflow naive implementation
extreme_logits = np.array([[1000.0, 999.0, 998.0]])
extreme_labels = np.array([0])

stable_loss = stable_cross_entropy(extreme_logits, extreme_labels)
Out[16]:
Console
Numerical Stability Test:
--------------------------------------------------
Extreme logits: [1000.  999.  998.]
Stable CE loss: 0.407606
PyTorch CE loss: 0.407606

Note: Without the log-sum-exp trick, exp(1000) would overflow!

The stable implementation correctly computes a loss of approximately 0.41 even with extreme logit values of 1000, 999, and 998. These values would cause exp(1000) to overflow to infinity in a naive implementation. The log-sum-exp trick makes this computation tractable by working with the differences between logits rather than their absolute values.

Clipping for BCE

For binary cross-entropy, the danger is $\log(0) = -\infty$. When predicted probability is exactly 0 or 1, the loss becomes infinite. The standard solution is to clip probabilities to a small epsilon range:

$$p_{\text{clipped}} = \text{clip}(p, \epsilon, 1 - \epsilon)$$

where:

  • $p$: the original predicted probability
  • $\epsilon$: a small constant, typically $10^{-7}$ or smaller
  • $\text{clip}(p, \epsilon, 1 - \epsilon)$: constrains $p$ to the range $[\epsilon, 1 - \epsilon]$
  • $p_{\text{clipped}}$: the clipped probability, now safe to use with $\log$

This introduces a tiny bias but prevents numerical explosions. PyTorch's BCEWithLogitsLoss combines sigmoid and BCE in a numerically stable way, avoiding the need for explicit clipping.

In[17]:
Code
def stable_bce(p_pred, y_true, epsilon=1e-7):
    """Stable BCE with probability clipping."""
    p_clipped = np.clip(p_pred, epsilon, 1 - epsilon)
    return -np.mean(
        y_true * np.log(p_clipped) + (1 - y_true) * np.log(1 - p_clipped)
    )


# Test with edge case probabilities
edge_probs = np.array([0.0, 1.0, 0.5])
edge_labels = np.array([1.0, 0.0, 1.0])

# Unstable version would give -inf and inf
unstable_terms = -edge_labels * np.log(edge_probs + 1e-15) - (
    1 - edge_labels
) * np.log(1 - edge_probs + 1e-15)
stable_loss = stable_bce(edge_probs, edge_labels)
Out[18]:
Console
BCE Edge Case Handling:
--------------------------------------------------
p=0.0, y=1: raw_loss=34.54, stable_loss=16.12
p=1.0, y=0: raw_loss=34.54, stable_loss=16.12
p=0.5, y=1: raw_loss=0.69, stable_loss=0.69

The first two rows show the worst-case scenarios: predicting exactly 0.0 for a positive sample ($y=1$) and exactly 1.0 for a negative sample ($y=0$). The raw loss shows approximately 34.5, which represents the capped value from adding $10^{-15}$. The stable version clips to $10^{-7}$, giving a loss of about 16. While still large, it's bounded and won't cause numerical issues during backpropagation. The third sample with $p=0.5$ shows normal behavior where raw and stable losses are nearly identical.
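To see why BCEWithLogitsLoss is preferred over applying sigmoid and BCE separately, the sketch below (illustrative values, not from the original notebook) feeds an extreme logit through both paths. The naive route saturates the sigmoid to exactly 1.0 in float32 and produces an infinite loss, while the fused version works directly on the logit and stays finite:

```python
import torch
import torch.nn.functional as F

z = torch.tensor([100.0])  # extreme logit
y = torch.tensor([0.0])    # true negative

# Naive path: sigmoid(100) rounds to exactly 1.0, so log(1 - p) = log(0) = -inf
p = torch.sigmoid(z)
naive = -(y * torch.log(p) + (1 - y) * torch.log(1 - p))
print(naive)   # tensor([inf])

# Fused path: computed directly from the logit in a numerically stable form
stable = F.binary_cross_entropy_with_logits(z, y)
print(stable)  # tensor(100.)
```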

Label Smoothing

Hard labels (one-hot vectors) assume perfect ground truth: the true class has probability 1, all others have probability 0. But in practice, labels can be noisy or ambiguous. A "positive" sentiment review might contain some negative elements. Label smoothing softens this assumption by distributing a small amount of probability to non-target classes.

Motivation

Hard labels encourage the model to become overconfident. To minimize cross-entropy toward zero, the model must push $p_y \to 1$, which requires $z_y \to \infty$ relative to other logits. This leads to large weight magnitudes and poor generalization.

Label smoothing regularizes by making the target distribution slightly uncertain:

Label Smoothing

Label smoothing replaces hard targets with soft targets by redistributing probability mass:

$$y_k^{\text{smooth}} = \begin{cases} 1 - \alpha + \frac{\alpha}{K} & \text{if } k = y \\ \frac{\alpha}{K} & \text{otherwise} \end{cases}$$

where:

  • $\alpha \in [0, 1]$: smoothing parameter (typically 0.1)
  • $K$: number of classes
  • $y$: true class label

For $\alpha = 0.1$ and $K = 10$: the true class gets probability $0.9 + 0.01 = 0.91$, and each wrong class gets $0.01$.

Implementation and Effect

In[19]:
Code
def label_smoothing_cross_entropy(logits, labels, alpha=0.1):
    """Cross-entropy with label smoothing.

    Args:
        logits: Shape (n_samples, n_classes)
        labels: Shape (n_samples,), integer class labels
        alpha: Smoothing parameter
    """
    n_samples, n_classes = logits.shape

    # Create smoothed targets
    smooth_targets = np.full((n_samples, n_classes), alpha / n_classes)
    smooth_targets[np.arange(n_samples), labels] += 1 - alpha

    # Compute log-softmax (stable)
    max_logits = np.max(logits, axis=1, keepdims=True)
    log_softmax = (
        logits
        - max_logits
        - np.log(np.sum(np.exp(logits - max_logits), axis=1, keepdims=True))
    )

    # Cross-entropy with soft targets: sum over classes
    return -np.mean(np.sum(smooth_targets * log_softmax, axis=1))


# Compare hard vs smoothed CE
np.random.seed(42)
sample_logits = np.random.randn(1000, 10)
sample_labels = np.random.randint(0, 10, 1000)

hard_ce = cross_entropy_loss_numpy(sample_logits, sample_labels)
smooth_ce = label_smoothing_cross_entropy(
    sample_logits, sample_labels, alpha=0.1
)
Out[20]:
Console
Label Smoothing Effect:
--------------------------------------------------
Hard labels CE:     2.755984
Smoothed CE (α=0.1): 2.753744

Visualization of target distributions:

Hard target (class 3 of 5):
  [0, 0, 0, 1, 0]

Smooth target (α=0.1, class 3 of 5):
  [0.02, 0.02, 0.02, 0.92, 0.02]

With these random, untrained logits the two losses are nearly identical, because the predictions are far from confident either way. The difference matters during training: with hard labels the model can keep reducing the loss by becoming ever more confident, while with smoothed labels the loss is bounded away from zero because the model can never perfectly match a distribution that assigns positive probability to wrong classes. This is the regularization effect: instead of driving toward infinite confidence, the model settles at a more moderate confidence level.

Out[21]:
Visualization
Bar chart showing one-hot encoding with single bar at 1.0 for the true class and 0 for others.
Hard labels (one-hot encoding) place all probability mass on the true class. This encourages the model to become overconfident.
Bar chart showing smoothed labels with main bar at 0.92 and small bars at 0.02 for other classes.
Soft labels with α=0.1 redistribute probability mass. The true class retains most probability (0.92) while incorrect classes receive a small uniform share (0.02 each).

When to Use Label Smoothing

Label smoothing is particularly effective for:

  • Large-scale classification with many classes, where overconfidence is common
  • Knowledge distillation, where soft targets from a teacher model naturally provide smoothing
  • Noisy labels, where hard targets may be incorrect

It's less useful when you need well-calibrated probabilities (label smoothing can hurt calibration) or when the number of classes is very small.

Focal Loss for Class Imbalance

Real-world classification problems often have imbalanced classes. In spam detection, perhaps 1% of emails are spam. In medical diagnosis, rare diseases appear in less than 0.1% of cases. Standard cross-entropy treats all samples equally, but when one class dominates, the model learns to predict the majority class and ignores the minority.

The Problem with Standard Cross-Entropy

Consider a dataset with 99% negative samples and 1% positive samples. A model that always predicts "negative" achieves 99% accuracy and relatively low cross-entropy. The gradients from the abundant negative samples overwhelm the gradients from rare positive samples.

Even when the model does learn to recognize positive samples, easy negatives (samples the model is already confident about) contribute the same gradient magnitude as hard positives. The model spends most of its capacity on samples it already handles well.

Focal Loss Formulation

Focal loss, introduced by Lin et al. (2017) for object detection, down-weights easy examples and focuses training on hard examples:

Focal Loss

Focal loss adds a modulating factor to cross-entropy that reduces loss for well-classified examples:

$$\mathcal{L}_{\text{focal}} = -\alpha_t (1 - p_t)^\gamma \log(p_t)$$

where:

  • $p_t$: probability of the true class (i.e., $p$ if $y=1$, else $1-p$ for binary)
  • $\gamma \geq 0$: focusing parameter (typically 2)
  • $\alpha_t$: class weight (optional, addresses class imbalance directly)
  • $(1 - p_t)^\gamma$: modulating factor that down-weights easy examples

When $\gamma = 0$, focal loss reduces to standard cross-entropy. As $\gamma$ increases, the effect of easy examples diminishes.

Understanding the Modulating Factor

The key insight is the $(1 - p_t)^\gamma$ term. When the model is confident and correct ($p_t \approx 1$), this factor is tiny: $(1 - 0.99)^2 = 0.0001$. When the model is uncertain or wrong ($p_t \approx 0.1$), the factor is large: $(1 - 0.1)^2 = 0.81$. This shifts attention toward hard examples.

Out[22]:
Visualization
Heatmap showing modulating factor values from 0 to 1, with rows for different gamma values (0-5) and columns for different p_t values (0.1-0.9). Darker colors indicate lower modulating factors.
Heatmap of the focal loss modulating factor (1 - p_t)^γ across different probability values and gamma settings. Higher gamma values more aggressively suppress the contribution of well-classified examples (high p_t), concentrating learning on hard examples.

The heatmap reveals the dramatic effect of gamma. With $\gamma = 0$ (standard cross-entropy), all samples contribute equally regardless of confidence. With $\gamma = 2$, samples with $p_t > 0.8$ contribute less than 4% of their original weight. With $\gamma = 5$, confident predictions are almost completely ignored, focusing nearly all learning signal on the hardest examples.

In[23]:
Code
def focal_loss_numpy(p_pred, y_true, gamma=2.0, alpha=0.25, epsilon=1e-15):
    """Compute focal loss for binary classification.

    Args:
        p_pred: Predicted probabilities for positive class
        y_true: True binary labels
        gamma: Focusing parameter
        alpha: Class weight for positive class
    """
    p_pred = np.clip(p_pred, epsilon, 1 - epsilon)

    # p_t: probability of true class
    p_t = np.where(y_true == 1, p_pred, 1 - p_pred)

    # alpha_t: class weight for true class
    alpha_t = np.where(y_true == 1, alpha, 1 - alpha)

    # Focal loss
    focal_weight = (1 - p_t) ** gamma
    ce = -np.log(p_t)

    return np.mean(alpha_t * focal_weight * ce)


# Compare CE vs Focal Loss on imbalanced data
np.random.seed(42)
n_samples = 1000

# Imbalanced: 95% negative, 5% positive
y_imbalanced = np.zeros(n_samples)
y_imbalanced[:50] = 1  # 50 positive samples

# Model predictions (slightly better than random)
p_pred = 0.1 + 0.4 * y_imbalanced + 0.2 * np.random.randn(n_samples)
p_pred = np.clip(p_pred, 0.01, 0.99)

bce = binary_cross_entropy_numpy(p_pred, y_imbalanced)
focal = focal_loss_numpy(p_pred, y_imbalanced, gamma=2.0)
Out[24]:
Console
BCE vs Focal Loss on Imbalanced Data:
--------------------------------------------------
Data: 50 positive, 950 negative

BCE Loss:    0.210921
Focal Loss:  0.020569

Focal loss is lower because easy negatives contribute less.

Visualizing the Focal Effect

Out[25]:
Visualization
Line plot comparing cross-entropy and focal loss curves, showing how focal loss drops faster for high probability predictions, with multiple gamma values displayed.
Comparison of cross-entropy and focal loss for correctly classified positive samples. Focal loss reduces contribution from confident predictions (high p), focusing training on uncertain samples.

The visualization shows how focal loss suppresses the contribution of well-classified examples. For $p_t > 0.6$, focal loss with $\gamma = 2$ is nearly zero, while cross-entropy still contributes significantly. This allows the model to focus its learning capacity on the hard examples that matter.

Custom Loss Functions

Sometimes standard losses don't capture what you really care about. Maybe you want to penalize false positives more than false negatives, or optimize for a specific business metric. PyTorch makes it easy to define custom loss functions.

Asymmetric Loss for Imbalanced Preferences

Suppose false positives are more costly than false negatives (e.g., flagging legitimate transactions as fraud annoys customers). We can create an asymmetric loss that penalizes each type differently:

In[26]:
Code
def asymmetric_bce(p_pred, y_true, fp_weight=2.0, fn_weight=1.0, epsilon=1e-15):
    """BCE with different weights for false positives and false negatives.

    Args:
        p_pred: Predicted probabilities
        y_true: True binary labels
        fp_weight: Weight for false positive penalty (predicting 1 when true is 0)
        fn_weight: Weight for false negative penalty (predicting 0 when true is 1)
    """
    p_pred = np.clip(p_pred, epsilon, 1 - epsilon)

    # False negative term: y=1, model predicts low p
    fn_loss = -y_true * np.log(p_pred)

    # False positive term: y=0, model predicts high p
    fp_loss = -(1 - y_true) * np.log(1 - p_pred)

    return np.mean(fn_weight * fn_loss + fp_weight * fp_loss)


# Compare symmetric vs asymmetric loss
np.random.seed(42)
y_test = np.array([1, 1, 0, 0])
p_test = np.array([0.3, 0.7, 0.6, 0.2])  # 2 errors: FN at [0], FP at [2]

symmetric = binary_cross_entropy_numpy(p_test, y_test)
asymmetric = asymmetric_bce(p_test, y_test, fp_weight=3.0, fn_weight=1.0)
Out[27]:
Console
Symmetric vs Asymmetric BCE:
--------------------------------------------------
Predictions:  [0.3 0.7 0.6 0.2]
True labels:  [1 1 0 0]

Errors:
  - Sample 0: y=1, p=0.3 (False Negative)
  - Sample 2: y=0, p=0.6 (False Positive)

Symmetric BCE:   0.6750
Asymmetric BCE:  1.2447
  (FP weighted 3x more than FN)

Implementing Custom Losses in PyTorch

For gradient-based optimization, custom losses need to be differentiable. PyTorch's autograd handles this automatically when you compose built-in operations:

In[28]:
Code
# Imports (assumed here; in the original notebook they likely come from an earlier setup cell)
import torch
import torch.nn as nn
import torch.nn.functional as F


class FocalLoss(nn.Module):
    """Focal Loss for binary classification."""

    def __init__(self, gamma=2.0, alpha=0.25):
        super().__init__()
        self.gamma = gamma
        self.alpha = alpha

    def forward(self, logits, targets):
        """
        Args:
            logits: Raw model outputs (before sigmoid)
            targets: Binary labels (0 or 1)
        """
        probs = torch.sigmoid(logits)

        # p_t: probability of true class
        p_t = probs * targets + (1 - probs) * (1 - targets)

        # alpha_t: class weight
        alpha_t = self.alpha * targets + (1 - self.alpha) * (1 - targets)

        # Focal weight
        focal_weight = (1 - p_t) ** self.gamma

        # BCE loss (without reduction)
        bce = F.binary_cross_entropy(probs, targets, reduction="none")

        # Apply focal weight
        loss = alpha_t * focal_weight * bce

        return loss.mean()


# Test the custom loss
torch.manual_seed(42)
logits = torch.randn(100, requires_grad=True)
targets = torch.randint(0, 2, (100,)).float()

focal_loss = FocalLoss(gamma=2.0, alpha=0.25)
loss = focal_loss(logits, targets)
Out[29]:
Console
Custom Focal Loss in PyTorch:
--------------------------------------------------
Loss value: 0.206375
Gradient computed: True
Gradient shape: torch.Size([100])

The custom loss integrates seamlessly with PyTorch's training loop. Autograd computes gradients automatically, enabling backpropagation through the custom focal loss formulation.

Comparing Loss Functions

Different tasks call for different losses. Let's summarize when to use each:

Summary of loss functions and their primary use cases.
| Loss Function | Use Case | Key Property |
|---|---|---|
| MSE | Regression | Assumes Gaussian errors, sensitive to outliers |
| Binary CE | Binary classification | Pairs well with sigmoid, stable gradients |
| Categorical CE | Multiclass classification | Standard choice, works with softmax |
| Label Smoothing | Large-scale classification | Prevents overconfidence, regularizes |
| Focal Loss | Imbalanced classification | Focuses on hard examples |

Choosing the Right Loss

Consider these factors when selecting a loss function:

  • Task type: Regression requires MSE or variants (MAE for robustness to outliers). Classification needs cross-entropy.
  • Class balance: Highly imbalanced data benefits from focal loss or class weighting.
  • Confidence requirements: If well-calibrated probabilities matter, avoid label smoothing.
  • Gradient behavior: Understand how the loss gradient behaves for different prediction values.
Out[30]:
Visualization
Line plot comparing gradient magnitudes of MSE, BCE, and focal loss as a function of predicted probability for a positive sample.
Gradient magnitude comparison across loss functions for a positive sample (y=1). Focal loss produces smaller gradients for confident correct predictions, while MSE gradients remain proportional to error.

The plot reveals key differences in gradient behavior. MSE provides constant gradient direction but varying magnitude. BCE produces large gradients for confident wrong predictions and small gradients for confident correct ones. Focal loss aggressively suppresses gradients for easy examples (high $p$ for positive samples).

Limitations and Practical Considerations

Loss functions are powerful tools, but they come with important caveats that affect training dynamics and model behavior.

The most significant limitation is the mismatch between training objective and evaluation metric. Cross-entropy optimizes probability calibration, but you often care about accuracy, F1 score, or AUC. These metrics involve non-differentiable operations (argmax for accuracy, thresholding for F1), so they can't be directly optimized. A model that minimizes cross-entropy isn't guaranteed to maximize accuracy. This disconnect means you should always evaluate on your actual metric of interest, not just the training loss. Some approaches, like using smooth approximations to step functions, attempt to bridge this gap, but the fundamental tension remains.

Numerical stability requires constant vigilance. Despite the log-sum-exp trick and probability clipping, edge cases can still cause issues. Very small learning rates combined with very confident predictions can lead to vanishing gradients. Very large logits can still overflow in float16 training. Mixed-precision training (common for efficiency) amplifies these concerns. Modern frameworks like PyTorch provide numerically stable implementations (CrossEntropyLoss, BCEWithLogitsLoss) that handle most cases, but understanding the underlying issues helps diagnose training failures.

The assumption of i.i.d. samples rarely holds in practice. Cross-entropy assumes each sample contributes independently to the loss. But in NLP, sentences in a document are related. In time series, adjacent samples are correlated. Batch construction, curriculum learning, and online learning all violate the i.i.d. assumption to varying degrees. While training usually still works, these violations can cause gradient estimates to have higher variance or systematic bias.

Finally, loss function choice interacts with model architecture and optimizer settings. Focal loss with aggressive $\gamma$ values can cause gradient instability early in training when the model makes many wrong predictions. Label smoothing changes the optimal temperature for softmax, potentially requiring learning rate adjustments. These interactions are often discovered empirically rather than theoretically, requiring careful hyperparameter tuning.

Summary

Loss functions translate prediction quality into a single number that guides learning. This chapter covered the mathematical foundations and practical considerations for choosing and implementing losses.

Key takeaways:

  • MSE measures squared distance between predictions and targets, suitable for regression with Gaussian assumptions. Its gradient is proportional to error magnitude.
  • Binary cross-entropy is the standard for two-class problems. It pairs naturally with sigmoid, producing stable gradients that don't vanish even for saturated activations.
  • Categorical cross-entropy extends to multiple classes. Combined with softmax, the gradient simplifies to prediction minus target, enabling stable multiclass training.
  • Numerical stability requires care: use log-sum-exp for softmax, clip probabilities for BCE, and prefer built-in stable implementations.
  • Label smoothing prevents overconfidence by softening targets, acting as a regularizer for large-scale classification.
  • Focal loss addresses class imbalance by down-weighting easy examples, focusing learning on hard samples.
  • Custom losses can encode domain-specific requirements. PyTorch's autograd handles differentiation automatically for composable operations.

The loss function defines what "correct" means for your model. In the next chapter, we'll see how backpropagation uses loss gradients to update parameters throughout the network, completing the training loop.

Key Parameters

Understanding the key parameters for each loss function helps you tune them effectively for your specific problem.

Mean Squared Error (nn.MSELoss)

  • reduction: How to aggregate losses across samples. Options are 'mean' (default, average over samples), 'sum' (total loss), or 'none' (return per-sample losses). Use 'none' when you need per-sample gradients for techniques like hard negative mining.

Binary Cross-Entropy (nn.BCELoss, nn.BCEWithLogitsLoss)

  • reduction: Same as MSE. The default 'mean' works for most cases.
  • weight: Per-sample weights to emphasize certain examples. Useful for class imbalance when combined with class-specific weights.
  • pos_weight (BCEWithLogitsLoss only): Weight for positive examples. Setting pos_weight=10 for a 10:1 class imbalance ratio can help balance gradient contributions.
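A minimal usage sketch (hypothetical tensors, shown for illustration): with roughly 10 negatives per positive, `pos_weight` scales up the loss contribution of positive samples:

```python
import torch
import torch.nn as nn

# Weight positive examples 10x to counter a ~10:1 negative:positive imbalance
criterion = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([10.0]))

logits = torch.randn(16)                      # raw model outputs
targets = torch.randint(0, 2, (16,)).float()  # binary labels
loss = criterion(logits, targets)
```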

Cross-Entropy (nn.CrossEntropyLoss)

  • weight: A tensor of size $K$ (number of classes) specifying the weight for each class. Classes with higher weights contribute more to the loss, useful for imbalanced datasets.
  • ignore_index: Class label to ignore when computing loss. Commonly set to -100 for padding tokens in sequence labeling tasks.
  • label_smoothing: Smoothing factor $\alpha$ between 0 and 1. Values like 0.1 prevent overconfidence without significantly hurting accuracy.
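A brief sketch combining these options (hypothetical tensors, for illustration only):

```python
import torch
import torch.nn as nn

num_classes = 5
class_weights = torch.tensor([1.0, 2.0, 2.0, 1.0, 4.0])  # up-weight rare classes

criterion = nn.CrossEntropyLoss(
    weight=class_weights,   # per-class weighting
    ignore_index=-100,      # skip padding positions
    label_smoothing=0.1,    # soften the one-hot targets
)

logits = torch.randn(8, num_classes)
labels = torch.tensor([0, 1, 2, 3, 4, 1, -100, 2])  # one padded position
loss = criterion(logits, labels)
```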

Focal Loss (custom implementation)

  • gamma: Focusing parameter that controls how much to down-weight easy examples. $\gamma = 0$ recovers standard cross-entropy. $\gamma = 2$ is the original paper's recommendation. Higher values (3-5) further suppress easy examples but may cause training instability.
  • alpha: Class weighting factor for the positive class. The original paper pairs $\alpha = 0.25$ with $\gamma = 2$; alternatively, set it based on your dataset's class frequencies.

General Guidelines

  • Start with default parameters and standard losses (MSE for regression, cross-entropy for classification)
  • Add class weighting or focal loss only if you observe the model ignoring minority classes
  • Use label smoothing when you see overconfident predictions (very high or low probabilities)
  • Monitor both training loss and validation metrics, as loss improvements don't always translate to metric improvements

