Router Z-Loss: Numerical Stability for MoE Training

Michael Brenndoerfer · Updated January 4, 2026 · 46 min read

Learn how z-loss stabilizes Mixture of Experts training by penalizing large router logits. Covers formulation, coefficient tuning, and implementation.


Router Z-Loss

The auxiliary balancing loss we covered in the previous chapter addresses one critical challenge in Mixture of Experts training: ensuring tokens distribute evenly across experts. But MoE models face another subtle yet dangerous problem: router instability during training. As router logits grow unboundedly large, softmax computations become numerically unstable, gradients explode or vanish, and training can collapse entirely. Router z-loss provides an elegant solution by penalizing large router logits, keeping the gating network well-behaved throughout training. Z-loss is critical for stable MoE training. The success of a training run often depends on this single auxiliary loss term.

The Router Instability Problem

To understand why router instability occurs, first revisit how routing decisions are made. Recall from the discussion of gating networks that the router produces logits $z_i = W_r h_i$ for each input token $h_i$. These logits represent the raw, unnormalized scores that indicate how strongly the router prefers each expert for a given token. They pass through a softmax to produce routing probabilities:

$$p_j = \frac{\exp(z_j)}{\sum_{k=1}^{N} \exp(z_k)}$$

where:

  • $p_j$: the probability assigned to expert $j$
  • $z_j$: the router logit for expert $j$
  • $N$: the total number of experts
  • $\exp(z_j)$: the exponential of the logit, ensuring a positive value
  • $\sum_{k=1}^{N} \exp(z_k)$: the sum over all $N$ experts, serving as the normalization constant

The softmax function transforms the arbitrary real-valued logits into a proper probability distribution where all values are positive and sum to one. This transformation is elegant and widely used, but it introduces subtle numerical challenges when the input logits span extreme ranges.
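To make the formula concrete, here is a minimal sketch of the routing softmax in NumPy, using a hypothetical 4-expert router (the logit values are illustrative, not from any real model):

```python
import numpy as np

def router_softmax(logits):
    """Turn raw router logits into routing probabilities."""
    exp_z = np.exp(logits)
    return exp_z / exp_z.sum()

# Hypothetical logits from a 4-expert router for one token.
logits = np.array([1.0, 0.8, 0.5, 0.2])
probs = router_softmax(logits)
# probs is a valid distribution: positive entries summing to one,
# with the largest logit receiving the largest probability.
```

This naive form is fine for small logits; the numerical hazards discussed next appear only when logit magnitudes grow.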

During training, nothing explicitly constrains the magnitude of these logits. The neural network is free to produce logits of any size, and the gradient descent process will adjust parameters in whatever direction minimizes the loss function. The softmax function is scale-invariant in its output probabilities: multiplying all logits by a constant doesn't change the relative probabilities. If you double all the logits, you still get the same ranking and the same probability assignments. However, scale dramatically affects the sharpness of the distribution and the numerical stability of the computation. This simple property creates a hidden danger that only appears as training progresses.

Logit Drift

As training progresses, the router learns to route tokens to appropriate experts by increasing the logits for preferred experts relative to others. This is exactly what we want: the router should learn to distinguish which expert is best suited for each type of input. Without constraints, however, this creates a natural drift toward larger logit magnitudes. If expert 3 is consistently preferred for certain tokens, the router learns to make $z_3$ large and positive while other logits become large and negative. The optimization process finds that making these differences more extreme leads to more confident routing decisions.

This drift accelerates over time through a self-reinforcing feedback loop. Larger logits create sharper probability distributions, which produce stronger gradients for the "winning" experts. When the probability assigned to the chosen expert approaches 1.0, the gradient signal becomes very clear: this expert should handle this token. These stronger gradients further increase the logit gap, which creates even sharper distributions, which produce even stronger gradients. This positive feedback loop pushes logits toward extreme values without any natural stopping point.

To illustrate this phenomenon concretely, consider a router that starts with logits around $[1.0, 0.8, 0.5, 0.2]$. The corresponding probabilities are roughly $[0.35, 0.29, 0.21, 0.15]$, a reasonably soft distribution. After many training steps, if expert 1 is consistently correct, the logits might evolve to $[10.0, 3.0, 2.0, 1.0]$, giving probabilities close to $[0.999, 0.0009, 0.0003, 0.0001]$. The training process has learned to be more confident, but the logit magnitudes have increased dramatically. Left unchecked, this drift continues until logits reach values that cause computational problems.
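The sharpening effect is easy to check directly. This NumPy sketch computes both distributions from the illustrative logits above:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # max-subtraction keeps exp() in range
    return e / e.sum()

early = softmax(np.array([1.0, 0.8, 0.5, 0.2]))   # soft distribution
late = softmax(np.array([10.0, 3.0, 2.0, 1.0]))   # sharp distribution
# early[0] is about 0.35; late[0] exceeds 0.99, even though the
# ranking of experts is unchanged.
```

The ranking is identical in both cases; only the confidence (and the numerical risk) differs.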

Figure: Probability distributions across four experts at increasing logit scales. As the scale factor increases from 1x to 10x, the probability mass concentrates on the highest-scoring expert (Expert 0), demonstrating how logit drift leads to overconfident routing decisions.

Numerical Consequences

Large logits cause several numerical problems that can derail training entirely. These issues show why z-loss targets logit magnitudes:

  • Softmax overflow: Computing $\exp(z_j)$ for $z_j > 709$ (float64) or $z_j > 88$ (float32) produces infinity. When any exponential term overflows, the entire softmax computation becomes undefined. Even a single overflowed value corrupts the probability calculation for all experts.

  • Softmax underflow: Computing $\exp(z_j)$ for very negative $z_j$ produces zero due to floating-point underflow. This might seem harmless since small probabilities should be near zero anyway. However, when combined with division operations, these zeros can cause division-by-zero errors or produce undefined gradients.

  • Gradient instability: The softmax gradient involves terms like $p_j(1 - p_j)$, which vanishes as probabilities approach 0 or 1. When the router becomes extremely confident, assigning probability 0.9999 to one expert and 0.0001 to others, the gradients become tiny. This gradient saturation prevents the router from learning to change its routing decisions even when it should.

  • Loss of precision: When logits span a huge range, floating-point arithmetic loses significant digits. A computation involving $\exp(50)$ and $\exp(-50)$ cannot represent both values accurately in the same calculation, even with double precision. This precision loss introduces numerical noise that accumulates over training steps.

While the softmax-with-max-subtraction trick (subtracting $\max_j z_j$ before exponentiation) prevents overflow in the forward pass, it doesn't address the fundamental instability that large logit magnitudes create for gradient flow. The backward pass still suffers from saturated gradients and precision issues. Moreover, the max-subtraction trick only helps with the exponential computation; it doesn't prevent the underlying logit drift that causes all these problems.
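A small NumPy experiment illustrates both points: max subtraction rescues the forward pass, but the saturated gradient factor $p_j(1 - p_j)$ remains (the logit values are illustrative):

```python
import numpy as np

def naive_softmax(z):
    e = np.exp(z)                 # overflows for large float32 logits
    return e / e.sum()

def stable_softmax(z):
    e = np.exp(z - np.max(z))     # max-subtraction trick
    return e / e.sum()

big = np.array([100.0, 50.0, 10.0], dtype=np.float32)
with np.errstate(over="ignore", invalid="ignore"):
    bad = naive_softmax(big)      # exp(100) -> inf, so inf/inf -> nan
good = stable_softmax(big)        # well-defined, roughly [1, 0, 0]

# The forward pass is rescued, but the gradient factor p*(1-p)
# still saturates to ~0, freezing the routing decision.
p = good[0]
grad_factor = p * (1 - p)
```

The `nan` in the naive version and the vanishing `grad_factor` in the stable one are the two distinct failure modes the text describes.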

Manifestation in Training

Router instability typically manifests in several recognizable ways that you should monitor:

  1. Training loss spikes: Sudden, unexpected increases in loss occur as numerical issues cascade through the network. The loss might be decreasing smoothly for thousands of steps, then suddenly jump to a much higher value or even become NaN. These spikes often indicate that router logits have grown large enough to cause computational problems.

  2. Gradient explosions: NaN or inf values appearing in gradients signal that the numerical instability has propagated to the backward pass. Once gradients contain invalid values, parameter updates become meaningless, and the model parameters can become corrupted. Recovering from gradient explosions often requires rolling back to an earlier checkpoint.

  3. Expert collapse: The router becoming stuck on a single expert due to saturated gradients represents a subtle but serious failure mode. When logits become extreme, the router cannot learn to route tokens to different experts even when the current routing is suboptimal. The gradients for non-selected experts become vanishingly small, freezing the routing pattern in place.

  4. Poor generalization: Overconfident routing that doesn't transfer to new inputs indicates that the model has memorized routing patterns rather than learning generalizable features. When the router makes extremely sharp decisions based on subtle input features, small changes in input distribution can completely change routing behavior in unpredictable ways.

These issues become more severe as model scale increases. Larger models have more parameters in the router network, more capacity to push logits to extreme values, and longer training runs that provide more opportunity for drift to accumulate. The ST-MoE paper from Google observed that without z-loss, larger MoE models frequently experienced training instabilities that smaller models avoided. This scale-dependent behavior makes z-loss increasingly important as the field moves toward ever-larger MoE architectures.

Z-Loss Formulation

Router z-loss directly penalizes large logit magnitudes by adding a regularization term to the training objective. Rather than placing hard constraints on logit values, which would require constrained optimization techniques, z-loss uses soft regularization that integrates smoothly with standard gradient descent. The formulation targets the log-sum-exp of router logits, a smooth, differentiable proxy for the maximum logit that has particularly nice mathematical properties.

Mathematical Definition

For a batch of $B$ tokens, where token $i$ has router logits $z_{i,j}$ for expert $j$, the z-loss is defined as:

$$\mathcal{L}_z = \frac{1}{B} \sum_{i=1}^{B} \left( \log \left( \sum_{j=1}^{N} \exp(z_{i,j}) \right) \right)^2$$

where:

  • $\mathcal{L}_z$: the router z-loss
  • $B$: the batch size (number of tokens)
  • $N$: the number of experts
  • $z_{i,j}$: the logit for token $i$ and expert $j$
  • $\log \sum \ldots$: the log-sum-exp term approximating the maximum logit

Let's unpack this formula to understand how it works. The innermost part, $\sum_{j=1}^{N} \exp(z_{i,j})$, computes the sum of exponentials of all logits for a single token. This sum is exactly the normalization constant that appears in the denominator of the softmax function. Taking the logarithm of this sum gives the log-sum-exp (LSE), which has a natural interpretation as a "soft maximum" of the logits.

The LSE has a useful property that makes it ideal for our purposes: it approximates the maximum logit while remaining smooth and differentiable everywhere:

$$\max_j z_{i,j} \leq \log \sum_{j=1}^{N} \exp(z_{i,j}) \leq \max_j z_{i,j} + \log N$$

where:

  • $\max_j z_{i,j}$: the largest logit value for token $i$
  • $N$: the total number of experts
  • $\log N$: the maximum possible approximation error

This inequality tells us that the LSE always lies between the maximum logit and the maximum logit plus $\log N$. For typical expert counts (8, 16, or 64 experts), $\log N$ is a small constant (approximately 2.1, 2.8, or 4.2 respectively), so the LSE closely tracks the maximum logit. However, unlike the hard maximum function, which has discontinuous gradients at points where multiple logits are tied, the LSE is smooth everywhere. This smoothness is crucial for gradient-based optimization.
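The sandwich bound is easy to verify numerically. A short sketch, assuming NumPy and illustrative logit values:

```python
import numpy as np

def lse(z):
    """Numerically stable log-sum-exp."""
    m = np.max(z)
    return m + np.log(np.exp(z - m).sum())

z = np.array([10.0, 3.0, 2.0, 1.0])
N = len(z)
value = lse(z)
lower = z.max()                # max_j z_j
upper = z.max() + np.log(N)    # max_j z_j + log N

# With all logits tied, the gap hits its log(N) ceiling exactly.
tied_gap = lse(np.array([10.0, 10.0, 10.0])) - 10.0
```

For the dominated logits above, `value` sits just barely above `lower`; the tied case shows the other extreme of the bound.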

Squaring this quantity and averaging across tokens creates a penalty that grows quadratically as logits increase. The squaring operation has two important effects: it ensures the loss is always non-negative, and it provides increasingly strong regularization as logits grow. A token with LSE value of 10 contributes 100 to the loss, while a token with LSE value of 20 contributes 400. This quadratic growth creates strong pressure to keep logits bounded without overly penalizing moderate values.
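Putting the pieces together, a minimal z-loss implementation might look like the following. This is a NumPy sketch; production code would typically use a framework's built-in stable `logsumexp`, and the batch values here are illustrative:

```python
import numpy as np

def router_z_loss(logits):
    """Z-loss over a batch of router logits with shape (B, N).

    Computes (1/B) * sum_i LSE(z_i)^2 using a stable LSE.
    """
    m = logits.max(axis=1, keepdims=True)
    lse = m[:, 0] + np.log(np.exp(logits - m).sum(axis=1))  # shape (B,)
    return float(np.mean(lse ** 2))

# Hypothetical batch: one well-behaved token, one with drifted logits.
logits = np.array([[1.0, 0.8, 0.5, 0.2],
                   [10.0, 3.0, 2.0, 1.0]])
loss = router_z_loss(logits)
# The drifted token (LSE ~ 10) dominates the penalty: its squared
# LSE contributes ~100 versus ~4 for the soft token.
```

The quadratic growth is visible in the two tokens' contributions: one token with drifted logits dominates the batch loss.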

Figure: Comparison of log-sum-exp (LSE) and maximum function behavior. The left panel shows LSE smoothly tracking the maximum logit $z_1$, while the right panel illustrates the approximation gap, which remains bounded by $\log N$.

Why Log-Sum-Exp?

The log-sum-exp formulation has several advantages over alternative approaches to constraining logit magnitudes. These advantages show why this design is standard:

  • Smooth gradients: Unlike penalties on $\max_j z_j$ (which has discontinuous gradients at points where the maximum switches from one logit to another) or $\sum_j z_j^2$ (which penalizes all logits equally regardless of their relative magnitudes), LSE smoothly emphasizes the largest logits while providing gradients to all experts. When one logit is much larger than the others, the LSE gradient is concentrated on that logit. When multiple logits are similar, the gradient is distributed among them. This smooth interpolation between the two cases makes optimization well-behaved.

  • Scale awareness: The LSE naturally captures the "effective scale" of the logit distribution in a way that considers both the maximum value and the spread. For example, a distribution with logits $[10, 10, 10]$ has an LSE of approximately $11.1$, while a distribution with logits $[30, 30, 30]$ has an LSE of approximately $31.1$. Although both produce identical uniform probability distributions (and thus have the same entropy), their LSE values differ significantly, correctly reflecting that the larger logits pose greater stability risks.

  • Numerical stability: The LSE is itself computed stably using the max-subtraction trick, avoiding the very overflow issues we're trying to prevent. Modern deep learning frameworks implement numerically stable versions of log-sum-exp that never overflow, making z-loss robust to compute even when logits have already drifted to large values. This means z-loss can be applied as a corrective measure even when training has already started experiencing instability.

Gradient Analysis

To understand the effect of z-loss on training dynamics, we derive its gradient with respect to the router logit $z_{i,j}$. This derivation reveals how the loss creates corrective pressure on large logits. Applying the chain rule to the squared term:

$$\begin{aligned} \frac{\partial \mathcal{L}_z}{\partial z_{i,j}} &= \frac{\partial}{\partial z_{i,j}} \left[ \frac{1}{B} \sum_{k=1}^{B} \text{LSE}(z_k)^2 \right] \\ &= \frac{1}{B} \cdot \frac{\partial}{\partial z_{i,j}} \left( \text{LSE}(z_i)^2 \right) && \text{(only term $i$ depends on $z_{i,j}$)} \\ &= \frac{1}{B} \cdot 2 \cdot \text{LSE}(z_i) \cdot \frac{\partial}{\partial z_{i,j}} \text{LSE}(z_i) && \text{(power rule)} \\ &= \frac{2}{B} \cdot \text{LSE}(z_i) \cdot \text{softmax}(z_i)_j && \text{(derivative of LSE is softmax)} \end{aligned}$$

where:

  • $\frac{\partial \mathcal{L}_z}{\partial z_{i,j}}$: the gradient of the loss with respect to a specific logit
  • $B$: the batch size
  • $\text{LSE}(z_i)$: the log-sum-exp value for token $i$
  • $\text{softmax}(z_i)_j$: the probability assigned to expert $j$

The key insight in this derivation is that the derivative of the log-sum-exp with respect to one of its inputs is exactly the corresponding softmax probability. This elegant relationship emerges from the structure of the exponential function and is a standard result in deep learning.

This gradient formula reveals important behavior that explains why z-loss is effective:

  • The gradient is proportional to the current softmax probability: experts receiving more routing get larger gradients pushing their logits down. This is exactly what we want, as the experts with the largest logits (and therefore highest probabilities) receive the strongest corrective signal. Experts that are already receiving little routing probability receive only small gradients from z-loss, allowing them to remain available for future routing without being artificially suppressed.

  • The gradient scales with the current LSE value: larger logits create stronger corrective pressure. When the LSE is small (indicating well-behaved logits), the z-loss gradient is correspondingly small and doesn't interfere much with task learning. When the LSE grows large (indicating potential instability), the gradient grows proportionally, creating a stronger restoring force.

  • The quadratic nature (from squaring) provides increasingly strong regularization as logits grow. The factor of 2 from the power rule, combined with the LSE scaling, means the gradient grows faster than linearly with logit magnitude. This provides gentle regularization for moderate logits but strong correction for extreme values, striking a balance between allowing useful routing patterns and preventing numerical instability.
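The closed-form gradient can be checked against finite differences. In this NumPy sketch, `z_loss_grad` implements the $\frac{2}{B} \cdot \text{LSE}(z_i) \cdot \text{softmax}(z_i)_j$ formula derived above (the logit values are illustrative):

```python
import numpy as np

def z_loss(logits):
    m = logits.max(axis=1, keepdims=True)
    lse = m[:, 0] + np.log(np.exp(logits - m).sum(axis=1))
    return np.mean(lse ** 2)

def z_loss_grad(logits):
    """Analytic gradient: (2/B) * LSE(z_i) * softmax(z_i)_j."""
    B = logits.shape[0]
    m = logits.max(axis=1, keepdims=True)
    e = np.exp(logits - m)
    probs = e / e.sum(axis=1, keepdims=True)
    lse = m[:, 0] + np.log(e.sum(axis=1))
    return (2.0 / B) * lse[:, None] * probs

logits = np.array([[2.0, 0.5, 0.3, 0.1]])
analytic = z_loss_grad(logits)

# Central finite differences as an independent check.
eps = 1e-6
numeric = np.zeros_like(logits)
for j in range(logits.shape[1]):
    plus, minus = logits.copy(), logits.copy()
    plus[0, j] += eps
    minus[0, j] -= eps
    numeric[0, j] = (z_loss(plus) - z_loss(minus)) / (2 * eps)
```

The check also confirms the qualitative claim above: the expert with the largest logit (and largest softmax probability) receives the strongest downward pressure.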

Figure: Z-loss gradient magnitude across expert indices and logit scales. The heatmap shows gradients increasing with logit scale and concentrating on experts with high routing probabilities (lower indices), providing targeted correction where logits are largest.

Figure: Gradient magnitude profiles for individual experts as logit scale increases. The preferred expert (Expert 0, red) receives significantly stronger gradients than other experts, ensuring the router focuses corrective pressure on the dominant logits causing instability.

Why Z-Loss Works

Z-loss addresses router instability through several complementary mechanisms that work together to maintain training stability.

Bounded Logit Growth

By continuously penalizing large LSE values, z-loss creates a restoring force that counteracts the natural drift toward larger logits. Whenever logits grow too large, the z-loss gradient pushes them back down. This establishes an equilibrium where the router learns discriminative logit patterns without unbounded growth. The router can still learn to prefer certain experts for certain inputs, but the absolute magnitude of the preferences remains bounded.

The key insight is that useful routing doesn't require extreme logits. A router with logits $[2.0, 0.5, 0.3, 0.1]$ can effectively route to expert 1 with probability approximately 0.64, which is sufficient to ensure expert 1 is selected in top-2 routing. Increasing these to $[20.0, 5.0, 3.0, 1.0]$ pushes the probability above 0.9999, but this extreme confidence adds no practical benefit while creating numerical hazards. The routing decision is the same in both cases, but the numerical properties are vastly different.

This equilibrium behavior is similar to other regularization techniques like weight decay, which creates restoring forces that prevent unbounded parameter growth. The difference is that z-loss specifically targets the router logits, which are the quantities most directly responsible for numerical stability in MoE training.

Softmax Temperature Effect

Z-loss implicitly encourages a "softer" probability distribution over experts, as if the softmax were computed at a higher temperature. By penalizing the logit scale, it prevents the extremely peaked distributions that arise from large logit differences. When logit differences are bounded, the resulting probability distributions maintain non-trivial probabilities for multiple experts rather than concentrating all probability mass on a single expert.

This softer routing has several benefits that extend beyond mere numerical stability:

  • Better gradient flow: Non-zero probabilities for multiple experts mean gradients flow to more expert parameters during each training step. Even experts that aren't selected still receive gradient signals through the softmax probabilities, allowing them to continue learning and remain competitive. This prevents the "rich get richer" dynamic where selected experts improve while others stagnate.

  • Improved exploration: The router remains willing to try alternative experts rather than fixating on past choices. Soft probabilities mean that stochastic routing (if used) can occasionally select non-preferred experts, allowing the model to discover better routing patterns. Even with deterministic top-k routing, softer probabilities mean the router can more easily shift its preferences as the experts evolve during training.

  • Smoother learning: Gradual probability shifts rather than sudden expert switches create more stable training dynamics. When routing probabilities can change continuously rather than jumping between extremes, the experts experience more consistent training signals. This consistency helps experts develop coherent specializations rather than being whipsawed by sudden changes in their token assignments.
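The temperature effect is visible in routing entropy. This NumPy sketch scales a fixed set of illustrative logits and measures how the entropy of the routing distribution collapses:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def entropy(p):
    """Shannon entropy (nats) of a probability vector."""
    return float(-np.sum(p * np.log(p)))

base = np.array([1.0, 0.8, 0.5, 0.2])      # illustrative logits
ent_soft = entropy(softmax(base))          # close to log(4) ~ 1.39
ent_sharp = entropy(softmax(10.0 * base))  # collapses toward 0
```

Scaling the logits by 10 leaves the expert ranking untouched but destroys most of the entropy, which is exactly the peaked regime z-loss discourages.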

Figure: Routing entropy and maximum probability as a function of logit scale. The left panel shows entropy decreasing into the unstable region (red) as logits grow, while the right panel shows maximum probability approaching 1.0. Z-loss keeps logits in the stable region (green), preserving gradient flow.

Regularization Properties

Beyond stability, z-loss provides genuine regularization that improves generalization to new data. By preventing overconfident routing, it encourages the model to develop routing patterns that depend on genuine, robust input features rather than spurious correlations or memorized associations. The regularization effect is similar to adding noise or using dropout, both of which prevent overconfident predictions.

Models trained with z-loss often show better performance on held-out data, particularly for inputs that differ from training distributions. When the router is forced to maintain some uncertainty in its routing decisions, it learns to base those decisions on stable, generalizable features rather than memorizing specific training examples. This improved generalization is especially valuable in language models, which must handle diverse inputs that may differ substantially from training data.

The regularization benefit of z-loss complements the stability benefit: both arise from the same mechanism of preventing extreme logit values, but they manifest in different ways. Stability is visible during training as the absence of loss spikes and gradient explosions. Regularization is visible during evaluation as improved performance on out-of-distribution inputs.

Z-Loss Coefficient

The z-loss coefficient $\alpha_z$ controls the strength of logit regularization and represents the primary hyperparameter that you must tune when using this technique:

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{task}} + \alpha_z \mathcal{L}_z$$

where:

  • $\mathcal{L}_{\text{total}}$: the combined training objective
  • $\mathcal{L}_{\text{task}}$: the primary task loss
  • $\alpha_z$: the z-loss weighting coefficient
  • $\mathcal{L}_z$: the auxiliary z-loss term

Choosing an appropriate coefficient requires balancing stability against routing effectiveness. Too small a coefficient provides insufficient regularization and allows instability. Too large a coefficient over-regularizes the router, preventing it from learning sharp routing patterns that enable expert specialization. The optimal coefficient lies in a middle ground that must be found empirically for each setting.

Coefficient Guidelines

Typical z-loss coefficients fall in the range $10^{-4}$ to $10^{-2}$, with several factors influencing the optimal choice:

  • Model scale: Larger models often need stronger z-loss (higher $\alpha_z$) because they have more capacity for logit growth. The router network in a large model has more parameters and processes higher-dimensional representations, giving it more "room" to develop extreme logit patterns. The longer training runs typical of larger models also provide more time for drift to accumulate.

  • Number of experts: More experts slightly increase the LSE for a given max logit because of the $\log N$ term in the LSE bounds, so the coefficient may need adjustment. However, this effect is typically small compared to other factors, since $\log N$ grows slowly with the number of experts.

  • Training stability requirements: If training instabilities occur despite using z-loss, increasing $\alpha_z$ by 2-5x often helps stabilize training. This is a common first response to observed instability. However, if instabilities persist even with high z-loss coefficients, other factors like learning rate or batch size may need adjustment.

  • Routing sharpness: Very high $\alpha_z$ can make routing too soft, reducing the benefit of expert specialization. When routing becomes nearly uniform, different experts receive similar token mixtures, preventing them from developing distinct specializations. Monitoring expert utilization and specialization can reveal if the z-loss coefficient is too high.

The ST-MoE paper used $\alpha_z = 10^{-4}$ as a starting point, finding it sufficient to stabilize training while preserving routing quality. This value has become a common default in subsequent MoE implementations, though you should treat it as a starting point rather than a universal optimum.

Tuning Strategy

A practical approach to setting the z-loss coefficient involves the following steps, which balance systematic exploration with responsive adjustment:

  1. Start conservative: Begin with $\alpha_z = 10^{-4}$ and monitor training stability. This conservative starting point provides some regularization without being so strong that it prevents useful routing patterns from developing. Most training runs will be stable at this coefficient.

  2. Track LSE statistics: Log the mean and max LSE values across training batches. Healthy values typically stay below 10, though this depends on the number of experts. If LSE values grow steadily throughout training, the z-loss coefficient may need to be increased. If LSE values drop quickly to very small values, the coefficient may be too high.

  3. Watch for instability: If loss spikes or NaN gradients occur, increase $\alpha_z$ by 2-5x. These events indicate that the current regularization strength is insufficient to prevent numerical problems. After increasing the coefficient, training may need to restart from an earlier checkpoint before the instability occurred.

  4. Check routing quality: If routing becomes too uniform (all experts receiving similar load), reduce $\alpha_z$. Uniform routing defeats the purpose of having multiple experts, since all experts would receive the same training signal and develop similar capabilities. Some variation in expert load is healthy and indicates that the router is learning useful distinctions.

  5. Validate on held-out data: Ensure z-loss isn't hurting generalization through over-regularization. While z-loss typically improves generalization by preventing overconfident routing, excessively strong regularization could prevent the model from learning useful routing patterns altogether.
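Step 2 of the list above, tracking LSE statistics, might be sketched as follows. The batches here are synthetic, purely to illustrate healthy versus drifting regimes:

```python
import numpy as np

def lse_stats(logits):
    """Mean and max log-sum-exp over a batch of router logits (B, N)."""
    m = logits.max(axis=1, keepdims=True)
    lse = m[:, 0] + np.log(np.exp(logits - m).sum(axis=1))
    return float(lse.mean()), float(lse.max())

rng = np.random.default_rng(0)
healthy = rng.normal(0.0, 1.0, size=(32, 8))   # well-behaved logits
drifting = healthy * 20.0                       # same ranking, 20x scale
mean_h, max_h = lse_stats(healthy)
mean_d, max_d = lse_stats(drifting)
# mean_h sits comfortably below 10; mean_d blows well past it,
# the kind of trend that suggests raising the z-loss coefficient.
```

Logging these two numbers every few hundred steps is cheap and gives early warning of drift well before loss spikes appear.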

Dynamic Coefficient Schedules

Some implementations use dynamic z-loss coefficients that change during training, adapting the regularization strength to the needs of different training phases:

  • Warmup: Start with higher $\alpha_z$ during early training when instabilities are most common, then reduce as training stabilizes. Early training often sees rapid parameter changes that can push logits to extreme values, making strong regularization especially valuable. As training converges and parameters change more slowly, weaker regularization allows sharper routing.

  • Curriculum: Increase $\alpha_z$ as model scale grows during progressive training. Some training approaches start with a small model and gradually increase its size. As the model grows, the capacity for logit drift increases, motivating stronger regularization.

  • Adaptive: Adjust $\alpha_z$ based on observed LSE statistics, increasing when values exceed a threshold and decreasing when values are safely bounded. This approach automatically responds to the actual training dynamics rather than following a predetermined schedule.
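An adaptive schedule could be sketched as a simple feedback rule. All thresholds, multipliers, and clamps below are hypothetical illustrations, not values from any published recipe:

```python
def adapt_z_coefficient(alpha_z, mean_lse, target_lse=5.0,
                        up=1.5, down=0.95,
                        lo=1e-5, hi=1e-1):
    """Feedback rule: strengthen regularization when LSE drifts up.

    Hypothetical controller; target, factors, and clamps are
    illustrative only.
    """
    if mean_lse > target_lse:
        alpha_z *= up     # logits drifting: push back harder
    else:
        alpha_z *= down   # logits bounded: relax slowly
    return min(max(alpha_z, lo), hi)

raised = adapt_z_coefficient(1e-4, mean_lse=12.0)  # drift observed
relaxed = adapt_z_coefficient(1e-4, mean_lse=2.0)  # well-behaved
```

The asymmetric factors (fast increase, slow decay) reflect the asymmetric costs: too little regularization risks a NaN run, too much merely softens routing.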

Combined Auxiliary Losses

Real MoE systems use both load balancing loss (from the previous chapter) and z-loss together. These losses address complementary problems and work synergistically to create stable, effective MoE training. Neither loss alone is sufficient: load balancing without z-loss can still experience numerical instability, while z-loss without load balancing allows expert collapse and utilization imbalance.

Joint Objective

The complete MoE training objective combines the task loss with both auxiliary losses:

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{task}} + \alpha_{\text{aux}} \mathcal{L}_{\text{aux}} + \alpha_z \mathcal{L}_z$$

where:

  • $\mathcal{L}_{\text{total}}$: the complete joint objective
  • $\mathcal{L}_{\text{task}}$: the primary task loss
  • $\mathcal{L}_{\text{aux}}$: the auxiliary balancing loss (encourages uniform expert utilization)
  • $\mathcal{L}_z$: the z-loss (prevents logit instability)
  • $\alpha_{\text{aux}}, \alpha_z$: weighting coefficients for the auxiliary terms

This combined objective guides the model toward three goals simultaneously: achieving good task performance, maintaining balanced expert utilization, and preserving numerical stability. The relative importance of these goals is controlled by the two auxiliary coefficients.
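The weighted sum itself is a one-liner. The sketch below plugs stand-in loss values (the numbers are illustrative, not from a real run) into the combined objective, using the commonly cited starting coefficients \alpha_{\text{aux}} = 10^{-2} and \alpha_z = 10^{-4}:

```python
import torch

alpha_aux, alpha_z = 1e-2, 1e-4  # common starting coefficients

# Stand-in loss values for one training step (illustrative only)
task_loss = torch.tensor(2.31)  # e.g. cross-entropy on the batch
aux_loss = torch.tensor(1.05)   # load balancing term
z_loss = torch.tensor(6.49)     # mean squared log-sum-exp

total_loss = task_loss + alpha_aux * aux_loss + alpha_z * z_loss
# 2.31 + 0.0105 + 0.000649 ≈ 2.3211: the auxiliary terms nudge the
# objective without dominating it
```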

Complementary Roles

The two auxiliary losses serve distinct purposes that together create a robust training framework:

Comparison of the distinct roles of load balancing loss and z-loss. Load balancing ensures tokens are distributed evenly across experts to prevent collapse, while z-loss penalizes large logit magnitudes to maintain numerical stability.
Aspect               Load Balancing Loss                        Z-Loss
Target               Expert load distribution                   Router logit magnitudes
Problem addressed    Expert collapse, imbalanced utilization    Numerical instability, gradient issues
Mechanism            Penalizes deviation from uniform load      Penalizes large log-sum-exp values
Effect on routing    Encourages diversity                       Encourages softness

Load balancing loss ensures that all experts are used and that no expert becomes overloaded or underutilized. Z-loss ensures that the routing computation itself remains numerically stable. These concerns are orthogonal: a model could have perfectly balanced load with numerically unstable logits, or stable logits with highly imbalanced load. Both auxiliary losses are needed to achieve both goals.
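This orthogonality can be made concrete with toy logits (8 tokens, 4 experts, values chosen for illustration): one set routes perfectly uniformly yet carries an enormous z-loss, while the other is numerically tame yet sends every token to the same expert.

```python
import torch

def z_loss(logits: torch.Tensor) -> torch.Tensor:
    """Mean squared log-sum-exp over the batch."""
    return torch.logsumexp(logits, dim=-1).pow(2).mean()

# Balanced but unstable: each expert is the argmax for exactly
# two of the eight tokens, but the winning logit is huge
balanced_unstable = torch.zeros(8, 4)
for i in range(8):
    balanced_unstable[i, i % 4] = 100.0

# Stable but imbalanced: small logits, yet every token prefers expert 0
stable_imbalanced = torch.zeros(8, 4)
stable_imbalanced[:, 0] = 2.0

loads_balanced = torch.bincount(balanced_unstable.argmax(dim=-1), minlength=4)
loads_imbalanced = torch.bincount(stable_imbalanced.argmax(dim=-1), minlength=4)
```

Load balancing loss is satisfied by the first set and z-loss by the second; only the combination rules out both failure modes.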

Interaction Effects

The two losses can interact in subtle ways that you should understand:

  • Synergy: Z-loss keeps routing soft, making it easier for load balancing to spread tokens across experts. Hard routing with very peaked probabilities resists load balancing pressure because the gradient from load balancing loss is proportional to softmax probabilities. When probabilities are nearly binary (0.99 for one expert, 0.01 for others), the load balancing gradient can only adjust the dominant expert's logit. Softer probabilities distribute the gradient across all experts, making load balancing more effective.

  • Tension: Very strong z-loss (\alpha_z too high) can make routing nearly uniform, reducing the importance of load balancing. When z-loss forces all logits to be similar, the resulting routing probabilities are already close to uniform regardless of load balancing. In this regime, further increasing the load balancing coefficient has little effect. Conversely, aggressive load balancing might push some expert logits up to attract more tokens, which increases z-loss.

  • Balance: Finding the right coefficient combination requires considering both effects. A common starting point is \alpha_{\text{aux}} = 10^{-2} and \alpha_z = 10^{-4}, making load balancing the stronger constraint with z-loss providing background stabilization. This ratio reflects the fact that expert imbalance is typically a more pressing problem than numerical instability in early training, though both must be controlled.
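The softmax-gradient argument behind the synergy bullet can be verified directly. The probe below is a toy setup with assumed names and shapes (not part of any MoE library): it differentiates the dominant expert's routing probability and measures how much gradient reaches the other experts.

```python
import torch

def grad_to_other_experts(logits: torch.Tensor) -> float:
    """Mean |gradient| reaching the non-dominant experts when
    expert 0's routing probability is differentiated through softmax."""
    logits = logits.clone().requires_grad_(True)
    probs = torch.softmax(logits, dim=-1)
    probs[:, 0].mean().backward()
    return logits.grad[:, 1:].abs().mean().item()

peaked = torch.tensor([[8.0, 0.0, 0.0, 0.0]])  # expert 0 gets ~99.9%
soft = torch.tensor([[1.0, 0.0, 0.0, 0.0]])    # expert 0 gets ~48%
```

In this toy case the soft distribution delivers over two orders of magnitude more gradient to the non-dominant experts, which is exactly why z-loss-softened routing makes load balancing pressure effective.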

Implementation

Let's implement z-loss and observe its effect on router behavior. We'll build on the MoE infrastructure from previous chapters.

In[7]:
Code
import torch
import numpy as np

## Set seeds for reproducibility
torch.manual_seed(42)
np.random.seed(42)

Z-Loss Computation

The core z-loss computation follows directly from the mathematical definition.

In[8]:
Code
def compute_z_loss(router_logits: torch.Tensor) -> torch.Tensor:
    """
    Compute router z-loss to encourage small logit magnitudes.

    Args:
        router_logits: Tensor of shape (batch_size, num_experts)
                      containing raw router logits before softmax

    Returns:
        Scalar z-loss value
    """
    # Compute log-sum-exp for each token (numerically stable)
    # logsumexp automatically handles the max-subtraction trick
    log_sum_exp = torch.logsumexp(router_logits, dim=-1)  # (batch_size,)

    # Square and average across batch
    z_loss = torch.mean(log_sum_exp**2)

    return z_loss
In[9]:
Code
## Test z-loss on different logit scales
small_logits = torch.randn(32, 8) * 1.0  # Standard normal scale
medium_logits = torch.randn(32, 8) * 5.0  # 5x scale
large_logits = torch.randn(32, 8) * 20.0  # 20x scale

## Compute losses for display
loss_small = compute_z_loss(small_logits)
loss_medium = compute_z_loss(medium_logits)
loss_large = compute_z_loss(large_logits)
Out[10]:
Console
Z-loss values for different logit scales:
  Small logits (std=1.0):  6.4938
  Medium logits (std=5.0): 53.8352
  Large logits (std=20.0): 1251.4348

The z-loss increases dramatically with logit scale, providing a strong incentive to keep logits bounded.

Out[11]:
Visualization
Line plot showing z-loss value (blue line) increasing quadratically as logit standard deviation increases from 0.1 to 25. Three red scatter points mark test values at standard deviations 1.0, 5.0, and 20.0 with annotated coordinates, demonstrating the rapid growth: small logits (std=1.0) produce low z-loss, while large logits (std=20.0) produce dramatically higher z-loss, providing strong incentive to keep logits bounded.
Z-loss magnitude vs. logit standard deviation. The curve demonstrates quadratic growth, providing gentle regularization for small logits (stable region) while imposing increasingly strong penalties on large logits (unstable region) to prevent explosion.

Router with Z-Loss

Now let's create a router class that computes both routing probabilities and the z-loss.

In[12]:
Code
import torch
import torch.nn as nn
import torch.nn.functional as F


class RouterWithZLoss(nn.Module):
    """Router network that tracks z-loss for stability."""

    def __init__(self, hidden_dim: int, num_experts: int):
        super().__init__()
        self.num_experts = num_experts
        self.gate = nn.Linear(hidden_dim, num_experts, bias=False)

        # Initialize with small weights to start with small logits
        nn.init.normal_(self.gate.weight, std=0.01)

    def forward(self, x: torch.Tensor, top_k: int = 2):
        """
        Compute routing probabilities and z-loss.

        Args:
            x: Input tensor of shape (batch_size, hidden_dim)
            top_k: Number of experts to select

        Returns:
            routing_weights: Top-k routing probabilities (batch_size, top_k)
            expert_indices: Selected expert indices (batch_size, top_k)
            z_loss: Scalar z-loss for this batch
        """
        # Compute router logits
        logits = self.gate(x)  # (batch_size, num_experts)

        # Compute z-loss before any masking
        z_loss = compute_z_loss(logits)

        # Get routing probabilities
        routing_probs = F.softmax(logits, dim=-1)

        # Select top-k experts
        top_k_probs, top_k_indices = torch.topk(routing_probs, top_k, dim=-1)

        # Renormalize top-k probabilities
        top_k_weights = top_k_probs / top_k_probs.sum(dim=-1, keepdim=True)

        return top_k_weights, top_k_indices, z_loss

Visualizing Z-Loss Effects

Let's visualize how z-loss affects logit distributions during training.

In[13]:
Code
import torch


def simulate_training_step(router, optimizer, inputs, z_loss_coeff):
    """Simulate one training step with z-loss."""
    optimizer.zero_grad()

    # Forward pass
    weights, indices, z_loss = router(inputs)

    # Simulated task loss (we'll use dummy loss for illustration)
    # In practice, this would be the actual language modeling loss
    task_loss = torch.tensor(0.0)  # Placeholder

    # Combined loss
    total_loss = task_loss + z_loss_coeff * z_loss

    # Backward pass
    total_loss.backward()
    optimizer.step()

    return z_loss.item()


def run_training_simulation(z_loss_coeff: float, steps: int = 100):
    """Run training simulation tracking z-loss and logit statistics."""
    hidden_dim = 64
    num_experts = 8
    batch_size = 32

    router = RouterWithZLoss(hidden_dim, num_experts)
    optimizer = torch.optim.Adam(router.parameters(), lr=0.01)

    z_losses = []
    max_logits = []
    mean_logits = []

    for step in range(steps):
        # Generate random inputs
        inputs = torch.randn(batch_size, hidden_dim)

        # Get current logit statistics before training step
        with torch.no_grad():
            logits = router.gate(inputs)
            max_logits.append(logits.abs().max().item())
            mean_logits.append(logits.abs().mean().item())

        # Training step
        z_loss = simulate_training_step(router, optimizer, inputs, z_loss_coeff)
        z_losses.append(z_loss)

    return z_losses, max_logits, mean_logits
In[14]:
Code
## Run simulations to compare different z-loss coefficients
coefficients = [0.0, 0.001, 0.01]
results = {}

for coeff in coefficients:
    z_losses, max_logits, _ = run_training_simulation(coeff, steps=100)
    results[coeff] = {"z_losses": z_losses, "max_logits": max_logits}
Out[15]:
Visualization
Two side-by-side line plots comparing training dynamics with and without z-loss. Left panel shows maximum absolute logit magnitude over 100 training steps: red line (no z-loss) grows unboundedly, while blue and green lines (with z-loss coefficients 0.001 and 0.01) remain bounded, demonstrating z-loss prevents logit drift. Right panel shows z-loss values stabilizing over training steps for the two regularized cases, confirming effective constraint satisfaction.
Impact of z-loss coefficients on training dynamics. The left panel shows that non-zero coefficients (blue, green) effectively bound maximum logit magnitudes compared to the unbounded growth of the unregularized case (red). The right panel shows the corresponding z-loss values stabilizing.

Without z-loss, the logit magnitudes grow freely. With z-loss applied, the optimizer continuously pushes logits back toward smaller values, establishing a stable equilibrium.
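The restoring force can be isolated by optimizing free logits under z-loss alone: minimizing the mean squared log-sum-exp drives each token's LSE toward zero (equivalently, the sum of exponentiated logits toward 1). A minimal sketch, with assumed sizes and optimizer settings:

```python
import torch

# Free logit parameters standing in for router outputs (toy shapes)
logits = torch.nn.Parameter(torch.randn(16, 8) * 5.0)
optimizer = torch.optim.Adam([logits], lr=0.1)

for _ in range(500):
    optimizer.zero_grad()
    z_loss = torch.logsumexp(logits, dim=-1).pow(2).mean()
    z_loss.backward()
    optimizer.step()

# The per-token LSE is pulled from an initial value around 7 toward 0,
# the equilibrium that z-loss alone defines
final_lse = torch.logsumexp(logits, dim=-1).abs().mean().item()
```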

Complete MoE with Combined Losses

Let's implement a complete MoE layer that computes both load balancing and z-loss.

In[16]:
Code
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoELayerWithAuxLosses(nn.Module):
    """
    Mixture of Experts layer with both load balancing and z-loss.
    """

    def __init__(
        self,
        hidden_dim: int,
        num_experts: int = 8,
        expert_dim: int = 256,
        top_k: int = 2,
        aux_loss_coeff: float = 0.01,
        z_loss_coeff: float = 0.001,
    ):
        super().__init__()
        self.num_experts = num_experts
        self.top_k = top_k
        self.aux_loss_coeff = aux_loss_coeff
        self.z_loss_coeff = z_loss_coeff

        # Router
        self.router = nn.Linear(hidden_dim, num_experts, bias=False)
        nn.init.normal_(self.router.weight, std=0.01)

        # Expert networks (simple FFN for illustration)
        self.experts = nn.ModuleList(
            [
                nn.Sequential(
                    nn.Linear(hidden_dim, expert_dim),
                    nn.GELU(),
                    nn.Linear(expert_dim, hidden_dim),
                )
                for _ in range(num_experts)
            ]
        )

        # Track auxiliary losses
        self.aux_loss = 0.0
        self.z_loss = 0.0

    def compute_load_balancing_loss(
        self, router_probs: torch.Tensor, expert_mask: torch.Tensor
    ) -> torch.Tensor:
        """Compute auxiliary load balancing loss."""
        # Fraction of tokens routed to each expert
        tokens_per_expert = expert_mask.float().mean(dim=0)

        # Average routing probability per expert
        router_prob_per_expert = router_probs.mean(dim=0)

        # Load balancing loss (encourage uniform distribution)
        aux_loss = self.num_experts * torch.sum(
            tokens_per_expert * router_prob_per_expert
        )

        return aux_loss

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Forward pass with auxiliary loss computation.

        Args:
            x: Input tensor of shape (batch_size, seq_len, hidden_dim)

        Returns:
            Output tensor of same shape as input
        """
        batch_size, seq_len, hidden_dim = x.shape
        x_flat = x.view(-1, hidden_dim)  # (batch_size * seq_len, hidden_dim)

        # Router forward pass
        router_logits = self.router(x_flat)  # (B*S, num_experts)
        router_probs = F.softmax(router_logits, dim=-1)

        # Compute z-loss
        self.z_loss = compute_z_loss(router_logits)

        # Select top-k experts
        top_k_probs, top_k_indices = torch.topk(
            router_probs, self.top_k, dim=-1
        )

        # Create expert mask for load balancing loss
        expert_mask = torch.zeros_like(router_probs)
        expert_mask.scatter_(1, top_k_indices, 1.0)

        # Compute load balancing loss
        self.aux_loss = self.compute_load_balancing_loss(
            router_probs, expert_mask
        )

        # Renormalize top-k weights
        top_k_weights = top_k_probs / top_k_probs.sum(dim=-1, keepdim=True)

        # Compute expert outputs (simplified loop implementation)
        output = torch.zeros_like(x_flat)
        for i in range(self.top_k):
            expert_idx = top_k_indices[:, i]  # (B*S,)
            weight = top_k_weights[:, i : i + 1]  # (B*S, 1)

            for e in range(self.num_experts):
                mask = expert_idx == e
                if mask.any():
                    expert_input = x_flat[mask]
                    expert_output = self.experts[e](expert_input)
                    output[mask] += weight[mask] * expert_output

        return output.view(batch_size, seq_len, hidden_dim)

    def get_total_aux_loss(self) -> torch.Tensor:
        """Get combined auxiliary loss for training."""
        return (
            self.aux_loss_coeff * self.aux_loss
            + self.z_loss_coeff * self.z_loss
        )
In[17]:
Code
## Test the complete MoE layer
moe = MoELayerWithAuxLosses(
    hidden_dim=128,
    num_experts=8,
    expert_dim=256,
    top_k=2,
    aux_loss_coeff=0.01,
    z_loss_coeff=0.001,
)

## Sample input
x = torch.randn(4, 16, 128)  # (batch, seq_len, hidden)

## Forward pass
output = moe(x)

## Extract values for display
input_shape = x.shape
output_shape = output.shape
load_balancing_loss = moe.aux_loss
z_loss_val = moe.z_loss
total_aux_loss = moe.get_total_aux_loss()
Out[18]:
Console
Input shape: torch.Size([4, 16, 128])
Output shape: torch.Size([4, 16, 128])
Load balancing loss: 2.0026
Z-loss: 4.3485
Total auxiliary loss: 0.0244

The model processes the input shape [4, 16, 128] and produces a matching output shape. The auxiliary losses are non-zero, indicating that both balancing and z-loss constraints are active and contributing to the training objective.

Monitoring Auxiliary Losses

During training, tracking both auxiliary losses helps diagnose issues.

In[19]:
Code
import torch
import torch.nn.functional as F


def training_loop_with_monitoring(
    model: MoELayerWithAuxLosses,
    num_steps: int = 200,
    batch_size: int = 8,
    seq_len: int = 32,
    hidden_dim: int = 128,
):
    """Simulated training loop with loss monitoring."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

    history = {"task_loss": [], "aux_loss": [], "z_loss": [], "total_loss": []}

    for step in range(num_steps):
        optimizer.zero_grad()

        # Generate random batch
        x = torch.randn(batch_size, seq_len, hidden_dim)
        target = torch.randn(batch_size, seq_len, hidden_dim)

        # Forward pass
        output = model(x)

        # Task loss (MSE for illustration)
        task_loss = F.mse_loss(output, target)

        # Combined loss
        aux_loss = model.get_total_aux_loss()
        total_loss = task_loss + aux_loss

        # Backward pass
        total_loss.backward()
        optimizer.step()

        # Record history
        history["task_loss"].append(task_loss.item())
        history["aux_loss"].append(model.aux_loss.item())
        history["z_loss"].append(model.z_loss.item())
        history["total_loss"].append(total_loss.item())

    return history
In[20]:
Code
## Train the model
moe = MoELayerWithAuxLosses(
    hidden_dim=128,
    num_experts=8,
    expert_dim=256,
    top_k=2,
    aux_loss_coeff=0.01,
    z_loss_coeff=0.001,
)
history = training_loop_with_monitoring(moe, num_steps=200)
Out[21]:
Visualization
Three side-by-side line plots showing training progression over 200 steps. Left panel shows task loss (blue line) decreasing steadily from initial value. Center panel shows load balancing loss (red line) quickly reaching stable equilibrium value. Right panel shows z-loss (green line) stabilizing at low value, indicating both auxiliary losses maintain routing stability without interfering with model learning.
Training progression of task and auxiliary losses over 200 steps. Task loss (left) decreases steadily, while load balancing loss (center) and z-loss (right) quickly reach stable equilibrium values, indicating successful constraint satisfaction.

The auxiliary losses quickly reach stable values while the task loss continues to improve. This indicates healthy MoE training where routing stability is maintained without interfering with model learning.

Expert Load Distribution

We can also track how the auxiliary losses affect expert utilization.

In[22]:
Code
import torch
import torch.nn.functional as F


def analyze_expert_distribution(
    model: MoELayerWithAuxLosses, num_samples: int = 100
):
    """Analyze expert load distribution after training."""
    model.eval()

    expert_counts = torch.zeros(model.num_experts)
    total_tokens = 0

    with torch.no_grad():
        for _ in range(num_samples):
            x = torch.randn(8, 32, 128)
            x_flat = x.view(-1, 128)

            router_logits = model.router(x_flat)
            router_probs = F.softmax(router_logits, dim=-1)
            _, top_k_indices = torch.topk(router_probs, model.top_k, dim=-1)

            for k in range(model.top_k):
                for e in range(model.num_experts):
                    expert_counts[e] += (top_k_indices[:, k] == e).sum()

            total_tokens += x_flat.shape[0] * model.top_k

    expert_fractions = expert_counts / total_tokens
    return expert_fractions.numpy()
In[23]:
Code
expert_fractions = analyze_expert_distribution(moe)
Out[24]:
Visualization
Bar chart showing token fraction percentage for each of eight experts (Expert 0-7) after training with combined auxiliary losses. All bars are close to 12.5% (ideal uniform distribution marked by red dashed line), with individual percentages annotated on each bar. The near-uniform distribution confirms that load balancing loss and z-loss together effectively prevent expert collapse and maintain balanced utilization.
Expert load distribution after training with combined auxiliary losses. The bar chart shows token fractions for each expert remaining close to the ideal uniform distribution (12.5%, red dashed line), confirming effective load balancing.

The combined effect of load balancing loss and z-loss produces well-distributed expert utilization, close to the ideal uniform distribution.

Key Parameters

The key parameters for the MoE layer with z-loss are:

  • aux_loss_coeff: Weighting coefficient for the load balancing loss (typically ~0.01).
  • z_loss_coeff: Weighting coefficient for the router z-loss (typically 1e-4 to 1e-2).
  • num_experts: Total number of experts in the layer.
  • top_k: Number of experts active for each token (usually 1 or 2).

Limitations and Impact

Router z-loss has proven essential for stable MoE training at scale, but it comes with trade-offs worth understanding.

The primary limitation is that z-loss introduces an additional hyperparameter (\alpha_z) that must be tuned. While default values work reasonably well across many settings, the optimal coefficient depends on model architecture, training dynamics, and the specific task. Setting \alpha_z too high can over-regularize the router, making it unable to develop sharp routing patterns that let experts specialize effectively. Too low, and the stability benefits disappear. This adds complexity to the already challenging process of training large MoE models.

Z-loss also has subtle interactions with other training choices. Batch size affects the LSE statistics, meaning coefficients tuned for one batch size may not transfer directly to another. Learning rate schedules that cause rapid parameter changes can temporarily spike router logits, triggering large z-loss gradients that may disrupt training. You might find that z-loss requires more careful warmup schedules than load balancing loss alone.

Despite these considerations, z-loss has become a standard component of MoE training. The ST-MoE paper demonstrated that without z-loss, larger MoE models frequently experienced training instabilities: loss spikes, gradient explosions, and in some cases complete training collapse. With z-loss, these same models trained smoothly to completion. This stability benefit scales with model size, becoming more important as MoE architectures grow larger.

The technique also provides a useful diagnostic. Monitoring z-loss values during training reveals router health: steadily increasing z-loss suggests the router is drifting toward instability, even before overt failures occur. This early warning allows you to intervene by adjusting coefficients or learning rates before training derails.
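A minimal version of such a diagnostic is a rolling window over the batch-mean LSE with an alert threshold. The class name, window size, and threshold below are illustrative assumptions:

```python
import torch
from collections import deque

class RouterHealthMonitor:
    """Hypothetical early-warning diagnostic for router logit drift."""

    def __init__(self, window: int = 50, alert_lse: float = 10.0):
        self.history = deque(maxlen=window)
        self.alert_lse = alert_lse

    def update(self, router_logits: torch.Tensor) -> float:
        # Track the batch-mean log-sum-exp of the router logits
        lse = torch.logsumexp(router_logits, dim=-1).mean().item()
        self.history.append(lse)
        return lse

    def drifting(self) -> bool:
        # Alert when the recent average LSE exceeds the threshold
        if not self.history:
            return False
        return sum(self.history) / len(self.history) > self.alert_lse

monitor = RouterHealthMonitor(window=10)
monitor.update(torch.randn(32, 8))  # unit-scale logits: well below threshold
```

Called once per step inside the training loop, drifting() provides the signal to lower the learning rate or raise \alpha_z before a spike materializes.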

Looking ahead to the next chapter on expert parallelism, z-loss becomes even more important. Distributed training across multiple devices introduces additional sources of instability from gradient synchronization and communication delays. The regularization provided by z-loss helps maintain consistent router behavior across the distributed system.

Summary

Router z-loss addresses a critical but often overlooked problem in MoE training: the tendency for router logits to grow unboundedly, causing numerical instabilities that can derail training. The key concepts are:

  • The instability problem: Router logits naturally drift toward larger values during training because nothing explicitly constrains them. Large logits cause softmax overflow, gradient instability, and poor generalization.

  • Z-loss formulation: The z-loss penalizes the squared log-sum-exp of router logits, \mathcal{L}_z = \frac{1}{B} \sum_i (\log \sum_j \exp(z_{i,j}))^2. This provides smooth gradients that push large logits back toward reasonable values.

  • Stabilization mechanism: Z-loss creates a restoring force that counteracts logit drift, maintaining bounded logits throughout training without preventing the router from learning discriminative patterns.

  • Coefficient selection: Typical values range from 10^{-4} to 10^{-2}, with larger models often needing higher coefficients. Start conservative and adjust based on observed LSE statistics and training stability.

  • Combined auxiliary losses: Z-loss works alongside load balancing loss to address complementary problems. Load balancing encourages uniform expert utilization while z-loss ensures numerical stability. The combined objective is \mathcal{L}_{\text{total}} = \mathcal{L}_{\text{task}} + \alpha_{\text{aux}} \mathcal{L}_{\text{aux}} + \alpha_z \mathcal{L}_z.

With both auxiliary losses in place, MoE models train stably while maintaining balanced expert utilization. The next chapter explores expert parallelism: how to distribute MoE computation across multiple devices efficiently.

