Learn how auxiliary balancing loss prevents expert collapse in MoE models. Covers loss formulations, coefficient tuning, and PyTorch implementation.

Auxiliary Balancing Loss
In the previous chapter, we explored why load balancing matters for Mixture of Experts models: without it, a few experts monopolize computation while others sit idle. But understanding the problem doesn't solve it. The gating network, trained purely to minimize task loss, has no incentive to spread tokens evenly. We need to explicitly tell the model that balance matters.
The solution is an auxiliary loss: a secondary objective added to the main task loss that penalizes imbalanced expert usage. This chapter covers the mathematical formulation of balancing losses, how to tune the coefficient that controls their strength, the fundamental tension between balancing and task performance, and how to implement these losses in practice.
Why Gating Networks Cause Imbalance
Before diving into loss formulations, let's understand why the problem exists in the first place. The imbalance we observe in MoE models isn't a bug in the gating mechanism; rather, it's an emergent property of how neural networks learn through gradient descent. To see this clearly, consider what happens during early training when all experts and the router begin with random weights.
By chance, some experts receive slightly more tokens than others. This initial asymmetry might seem insignificant, perhaps one expert receives 27% of tokens while another receives 23%. However, this small difference sets in motion a cascade of effects. The experts that receive more tokens also receive more gradient updates, since each processed token generates gradients that flow back through the expert network. With more updates, these experts improve faster. Their weights adjust more quickly to the training distribution, and they become genuinely better at processing the types of tokens they see.
The gating network, meanwhile, is doing its job diligently. It observes the reconstruction quality or task performance when tokens are routed to different experts, and it adjusts its weights to favor better-performing routes. When it notices that certain experts produce better outputs, it increases the routing probabilities toward those experts. This is exactly the behavior we want from a gating network when experts have different competencies, but during early training, this feedback creates a self-reinforcing cycle.
This creates a feedback loop: success breeds more success, failure breeds abandonment. The experts that received slightly more tokens initially become genuinely better, which makes them receive even more tokens, which makes them improve even further. Meanwhile, the experts that received fewer tokens fall behind. They update less frequently, improve more slowly, and become comparatively worse choices. The gating network, observing this growing gap, routes even fewer tokens to the struggling experts.
The gating network is doing exactly what we trained it to do: minimize task loss by routing tokens to the best experts. The problem is that "best" becomes self-fulfilling. Experts that receive no tokens never improve, so they remain poor choices, so they continue receiving no tokens. Left unchecked, this process converges to a degenerate state where one or two experts handle nearly all computation while the remaining experts contribute nothing. This represents a catastrophic waste of model capacity, as we've paid the memory cost of multiple expert networks but receive the computational benefit of only a few.
An auxiliary loss breaks this cycle by adding a cost for imbalance. The core insight is that we can modify what "success" means during training. Instead of optimizing purely for task performance, we add a secondary objective that penalizes concentration of expert usage. The total training objective becomes:
$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{task}} + \alpha \cdot \mathcal{L}_{\text{balance}}$$

where:
- $\mathcal{L}_{\text{total}}$: the combined objective function used for training
- $\mathcal{L}_{\text{task}}$: the primary objective (e.g., cross-entropy loss)
- $\alpha$: a scalar coefficient controlling the weight of the balancing penalty
- $\mathcal{L}_{\text{balance}}$: the auxiliary loss term that penalizes imbalanced expert usage
The balancing loss increases when expert usage becomes skewed, forcing the optimizer to consider both task performance and expert utilization. This formulation allows us to tune how much we care about balance relative to task performance through the coefficient $\alpha$. A small $\alpha$ gently encourages balance while prioritizing task quality; a larger $\alpha$ enforces stricter balance at the potential cost of task performance.
Importance-Based Auxiliary Loss
The original MoE paper by Shazeer et al. (2017) introduced an importance-based auxiliary loss. The intuition behind this approach is straightforward: if we can measure how much attention the router pays to each expert, we can penalize situations where some experts receive vastly more attention than others. The key insight is that router probabilities, before the hard selection of which expert to use, provide a soft measure of expert "importance" that we can aggregate and analyze.
For each token $x$ in a batch, the gating network produces probabilities $p_i(x)$ for routing to expert $i$. These probabilities sum to one across experts and represent the router's assessment of how suitable each expert is for processing that particular token. A high probability indicates the router considers that expert a good match; a low probability indicates a poor match. The importance of expert $i$ across the batch is the sum of these probabilities:

$$\text{Importance}(i) = \sum_{x \in \text{batch}} p_i(x)$$

where:
- $\text{Importance}(i)$: the total importance score for expert $i$ across the batch
- $x$: a token in the current batch
- $p_i(x)$: the probability assigned to expert $i$ by the gating network for token $x$
- $\sum_{x \in \text{batch}}$: the summation over all tokens in the batch
To understand what this measures, consider a concrete example. If our batch contains 100 tokens and the router assigns an average probability of 0.4 to expert 3 across all tokens, then the importance of expert 3 would be 40. If all tokens strongly prefer expert 3, then $\text{Importance}(3)$ will be large while other experts have small importance values. In a perfectly balanced scenario with 4 experts, each expert would have an importance of 25 (since probabilities must sum to 1 for each token, the total importance across all experts equals the batch size).
The importance measure captures something subtle but important: it reflects the router's preferences before any hard decisions are made. Even if a token ultimately gets routed to expert 1 because it has probability 0.35 compared to expert 2's 0.30, both experts contributed to the importance calculation. This provides a smoother signal than counting actual routing decisions.
To quantify the imbalance in importance scores, we need a metric that captures how spread out or concentrated the distribution is. We use the squared coefficient of variation (CV):

$$\mathcal{L}_{\text{importance}} = \text{CV}(\text{Importance})^2 = \left(\frac{\sigma_{\text{Importance}}}{\mu_{\text{Importance}}}\right)^2$$

where:
- $\mathcal{L}_{\text{importance}}$: the importance-based auxiliary loss
- $\sigma_{\text{Importance}}$: the standard deviation of importance scores across all experts
- $\mu_{\text{Importance}}$: the mean importance score across all experts
- $\text{CV} = \sigma_{\text{Importance}} / \mu_{\text{Importance}}$: the coefficient of variation, measuring relative dispersion
The coefficient of variation measures spread relative to the mean. This relative measure is important because we want the loss to be meaningful regardless of batch size. A standard deviation of 10 means something very different when the mean is 25 versus when the mean is 1000. By dividing by the mean, we get a dimensionless ratio that measures proportional variation.
Squaring the coefficient of variation serves two purposes. First, it ensures the loss is differentiable at zero, which matters for gradient-based optimization. Second, it penalizes large deviations more strongly than small ones, creating a steeper gradient when imbalance is severe and a gentler gradient when balance is nearly achieved. When all experts have equal importance, $\sigma_{\text{Importance}} = 0$ and the loss is zero. When importance is concentrated on one expert, the loss becomes large.
This formulation has an elegant property: it's batch-level, not token-level. The model can route individual tokens wherever it wants, as long as the aggregate distribution remains balanced. This flexibility is crucial because different tokens genuinely benefit from different experts. The loss doesn't micromanage individual routing decisions; instead, it sets a constraint on the overall outcome.
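As a rough PyTorch sketch of this formulation (the function name, tensor shape, and use of the default standard-deviation estimator are assumptions for illustration, not a canonical implementation):

```python
import torch

def importance_loss(router_probs: torch.Tensor) -> torch.Tensor:
    """Importance-based auxiliary loss: squared coefficient of variation.

    router_probs: (num_tokens, num_experts) softmax outputs of the gating network.
    Returns 0 when every expert receives equal total importance.
    """
    importance = router_probs.sum(dim=0)          # Importance(i) = sum_x p_i(x)
    cv = importance.std() / importance.mean()     # relative dispersion across experts
    return cv ** 2                                # squared CV, differentiable at zero
```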
Load Balancing Loss Formulation
The Switch Transformer paper introduced a refined formulation that became the standard in modern MoE architectures. This newer approach addresses some limitations of the importance-based loss by more directly measuring actual routing behavior rather than just router preferences. Instead of using the coefficient of variation, it directly measures two quantities and penalizes their correlation.
For a batch of $T$ tokens and $N$ experts, we define two complementary measures of expert usage. The first captures what actually happens during routing, while the second captures what the router intended to happen.

Fraction of tokens routed to expert $i$:

$$f_i = \frac{1}{T} \sum_{x \in \text{batch}} \mathbb{1}\left[\arg\max_j p_j(x) = i\right]$$

where:
- $f_i$: the fraction of tokens routed to expert $i$
- $T$: the total number of tokens in the batch
- $\mathbb{1}[\cdot]$: the indicator function, which is 1 if the condition is true and 0 otherwise
- $\arg\max_j p_j(x)$: the index of the expert with the highest probability for token $x$
- $\sum_{x \in \text{batch}} \mathbb{1}[\arg\max_j p_j(x) = i]$: the total count of tokens in the batch assigned to expert $i$
This quantity measures the actual routing outcome. When the router makes its final decision for each token, selecting the expert with the highest probability, some experts will be chosen more frequently than others. The fraction $f_i$ counts how many tokens were ultimately sent to expert $i$ and normalizes by the total token count. This is the fraction of tokens where expert $i$ was selected (had the highest gate probability). It represents the actual load on each expert.
The crucial aspect of $f_i$ is that it reflects discrete decisions. A token is either routed to an expert or it isn't, and this binary nature means $f_i$ captures the ground truth of computational load. If expert 3 processes 40% of tokens, then $f_3 = 0.4$, regardless of whether those routing decisions were made with high confidence (probability 0.9) or narrow margins (probability 0.26 versus 0.25 for the runner-up).
Fraction of router probability assigned to expert $i$:

$$P_i = \frac{1}{T} \sum_{x \in \text{batch}} p_i(x)$$

where:
- $P_i$: the fraction of router probability assigned to expert $i$
- $p_i(x)$: the routing probability assigned to expert $i$ for token $x$
- $T$: the total number of tokens in the batch
- $\sum_{x \in \text{batch}} p_i(x)$: the sum of routing probabilities for expert $i$ across all tokens
This quantity measures the router's overall preference for each expert. Unlike $f_i$, which counts discrete decisions, $P_i$ aggregates the continuous probability values. This is the average probability the router assigns to expert $i$ across all tokens. It represents the router's "intention" to use each expert, capturing not just which expert wins the argmax competition, but also how confident the router is in its choices.
The distinction between $f_i$ and $P_i$ is subtle but important. Consider a batch of two experts where every token gives expert 1 probability 0.51 and expert 2 probability 0.49. The probability fractions $P_1 = 0.51$ and $P_2 = 0.49$ look nearly balanced, reflecting the router's intentions. But the argmax favors expert 1 on every token, so $f_1 = 1$ while $f_2 = 0$, revealing severe imbalance in the actual routing.
The load balancing loss is the scaled dot product of these vectors:

$$\mathcal{L}_{\text{balance}} = N \cdot \sum_{i=1}^{N} f_i \cdot P_i$$

where:
- $\mathcal{L}_{\text{balance}}$: the auxiliary balancing loss
- $N$: the total number of experts
- $f_i$: the fraction of tokens routed to expert $i$ (actual load)
- $P_i$: the average probability assigned to expert $i$ (intended load)
- $\sum_{i=1}^{N}$: the summation over all experts
The factor $N$ normalizes the loss so that perfect balance gives $\mathcal{L}_{\text{balance}} = 1$. This normalization is a thoughtful design choice that makes the loss interpretable. A value of 1.0 means perfect balance, values above 1.0 indicate imbalance, and the magnitude tells us how severe the imbalance is. Under perfect balance, each expert receives a $1/N$ fraction of both tokens and probability, so $f_i = P_i = 1/N$ for all $i$. We can verify the loss value:

$$\mathcal{L}_{\text{balance}} = N \cdot \sum_{i=1}^{N} \frac{1}{N} \cdot \frac{1}{N} = N \cdot N \cdot \frac{1}{N^2} = 1$$
When load is imbalanced, both $f_i$ and $P_i$ grow for popular experts and shrink for unpopular ones. The correlation between these quantities drives the loss increase. Since both terms are larger for popular experts, their product is even larger, and the scaled sum $N \sum_i f_i P_i$ exceeds 1.
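To make this concrete, consider a hypothetical skewed case with $N = 4$ experts where the actual and intended loads both concentrate on one expert:

$$f = P = (0.7,\ 0.1,\ 0.1,\ 0.1) \quad\Rightarrow\quad \mathcal{L}_{\text{balance}} = 4 \cdot (0.49 + 0.01 + 0.01 + 0.01) = 2.08,$$

more than double the perfectly balanced value of 1.0.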
Why This Formulation Works
This loss formulation works because of what it penalizes. The product structure creates a natural feedback mechanism that targets exactly the behavior we want to discourage. Consider what happens when expert $i$ is overused:
- More tokens are routed to it, so $f_i$ increases
- The router assigns higher probabilities to it, so $P_i$ increases
- The product $f_i \cdot P_i$ grows quadratically
This quadratic growth is the key insight. When an expert is slightly overused, both $f_i$ and $P_i$ are slightly elevated, and their product reflects this doubly. When an expert is heavily overused, both quantities are substantially elevated, and their product amplifies the signal even further. The loss creates a gradient that pushes the router to reduce probabilities for overused experts.
Critically, only $P_i$ contributes gradients (since $f_i$ involves a discrete argmax, which is non-differentiable). But $f_i$ amplifies the gradient signal: popular experts get stronger pushback. This asymmetry is actually beneficial. The non-differentiable $f_i$ acts as a weighting factor, telling the optimization process where to focus its attention. Experts with high actual load receive stronger gradient signals on their probability terms, while experts with low actual load receive weaker signals. This adaptive weighting helps the optimizer prioritize fixing the most severe imbalances.
The formulation also avoids a subtle failure mode. If we only penalized variance, the router could game the system by assigning nearly-uniform probabilities while still routing most tokens to one expert (through tiny differences that all favor the same expert). Imagine a scenario where the router learns to give expert 1 probability 0.251 and all other experts probability 0.2497. The probability distribution looks nearly uniform, so a variance penalty would be small. Yet the argmax would consistently select expert 1, creating severe imbalance. By including $f_i$, we measure actual routing decisions, not just intentions.
Loss Coefficient Tuning
The coefficient $\alpha$ in $\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{task}} + \alpha \cdot \mathcal{L}_{\text{balance}}$ controls the strength of the balancing incentive. This single hyperparameter has outsized importance in MoE training, as it determines how the model allocates its optimization effort between learning useful representations and maintaining computational balance. Setting $\alpha$ correctly is crucial: too small and experts collapse; too large and the model sacrifices task performance to achieve perfect balance.
The key considerations for tuning are:
- Typical values: Most implementations use $\alpha$ between 0.01 and 0.1. Switch Transformer used 0.01; Mixtral uses 0.01 by default
- Scale dependence: The appropriate $\alpha$ depends on how large $\mathcal{L}_{\text{balance}}$ typically is. If task loss is around 2.0 and balancing loss is around 1.0, $\alpha = 0.01$ means balancing contributes about 0.5% of the total gradient signal
- Number of experts: More experts require careful tuning, as the minimum achievable imbalance increases with expert count
- Top-k value: With top-2 routing, each token contributes to two experts' load, which naturally improves balance compared to top-1
Effects of Extreme Coefficients
When $\alpha$ is too small, the model effectively ignores the balancing signal. The gradient contribution from the auxiliary loss becomes negligible compared to the task loss gradients. Expert collapse proceeds as if no auxiliary loss existed. You'll observe a few experts receiving most tokens while others remain unused throughout training. The training curves may look healthy in terms of task loss, but inspection of expert utilization reveals the underlying pathology.
When $\alpha$ is too large, the model prioritizes balance over task performance. The optimizer, faced with a strong gradient signal to equalize expert usage, adjusts the router weights primarily to reduce the balancing loss. In the extreme, the router learns to assign exactly uniform probabilities to all experts, achieving perfect balance but terrible task performance. The model loses its ability to specialize experts for different types of inputs. Every expert becomes equally mediocre at handling all tokens, squandering the capacity benefits that MoE architectures are designed to provide.
The sweet spot achieves "soft" balance: experts receive roughly similar token counts, but the router retains freedom to assign non-uniform probabilities based on input characteristics. In this regime, the auxiliary loss prevents pathological collapse without overly constraining the router's flexibility.
Practical Tuning Strategy
A reliable approach is to start with $\alpha = 0.01$ and monitor both metrics during training (a monitoring sketch follows this list):
- Track expert utilization (fraction of tokens per expert) across training
- Track the ratio of the weighted auxiliary loss to the task loss, $\alpha \cdot \mathcal{L}_{\text{balance}} / \mathcal{L}_{\text{task}}$
- If experts collapse (one expert gets >50% of tokens), increase $\alpha$ by 2-3x
- If task loss plateaus while balance loss keeps dropping, decrease $\alpha$
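As a rough illustration of this monitoring loop (the helper name and the thresholds simply mirror the heuristics above; they are not prescriptions):

```python
def check_balance_health(task_loss: float, aux_loss: float,
                         expert_fractions: list, alpha: float = 0.01) -> None:
    """Print heuristic diagnostics for tuning the balancing coefficient."""
    # Relative contribution of the weighted auxiliary loss to the task loss
    ratio = alpha * aux_loss / max(task_loss, 1e-9)
    max_share = max(expert_fractions)
    if max_share > 0.5:   # one expert handling most tokens suggests collapse
        print(f"warning: one expert handles {max_share:.0%} of tokens; "
              f"consider increasing alpha (currently {alpha}) by 2-3x")
    print(f"aux/task contribution ~ {ratio:.2%}, aux loss = {aux_loss:.3f}")
```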
We'll explore a related technique called Router Z-Loss in the next chapter, which provides additional stabilization with less impact on task performance.
Balancing vs Task Loss Tradeoffs
There's a fundamental tension between balancing and task performance that no formulation can eliminate. This tension isn't a flaw in the auxiliary loss approach; it reflects a genuine tradeoff inherent to MoE architectures. Understanding this tradeoff helps set realistic expectations and informs practical decisions about coefficient values.
Why perfect balance hurts performance: Different types of inputs genuinely benefit from different experts. A language model might naturally develop experts for code, dialogue, technical writing, and creative text. This specialization emerges because experts can become particularly good at certain patterns when they see them repeatedly. If code represents only 10% of training data, the code expert should receive only 10% of tokens. Forcing equal distribution means either routing non-code tokens to the code expert (hurting their performance) or routing code tokens away from the code expert (hurting code performance). In both cases, we sacrifice task quality for an arbitrary notion of fairness.
Why some imbalance is tolerable: In practice, moderate imbalance (say, 2:1 ratio between most and least popular experts) has minimal impact on inference efficiency. Modern accelerators handle this gracefully through techniques like dynamic batching and load balancing at the infrastructure level. The goal is preventing pathological collapse, not achieving perfect uniformity. A model where all eight experts receive between 10% and 15% of tokens is functioning well, even though the distribution isn't perfectly uniform.
Empirical findings: The Switch Transformer paper found that $\alpha = 0.01$ achieved a good balance between utilization and task performance. At this setting, expert utilization variance was significantly reduced compared to no auxiliary loss, while perplexity degradation was minimal (less than 0.5%). Higher values of $\alpha$ continued improving balance but with diminishing returns and increasing task performance cost.
The Capacity Factor Interaction
As we discussed in the Load Balancing chapter, the capacity factor limits how many tokens each expert can process. This creates an important interaction with the auxiliary loss, as both mechanisms influence expert utilization but through different means. The auxiliary loss and capacity factor work together:
- The auxiliary loss encourages the router to spread tokens evenly
- The capacity factor enforces a hard cap, dropping tokens when an expert is overloaded
These mechanisms are complementary. The auxiliary loss provides a soft incentive through gradients, nudging the router toward balance without forcing specific outcomes. The capacity factor provides a hard constraint that takes effect when soft incentives are insufficient. With a well-tuned auxiliary loss, fewer tokens hit the capacity cap, reducing wasted computation. But if the auxiliary loss is too weak, the capacity factor does most of the work, causing token dropping and information loss.
The interplay between these mechanisms suggests a practical principle: the auxiliary loss should be strong enough that the capacity factor rarely needs to drop tokens. Token dropping represents a failure mode where computation is wasted and information is lost. By tuning $\alpha$ to achieve reasonable balance, we can keep most tokens below the capacity threshold while still allowing meaningful expert specialization.
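To make the hard cap concrete, here is a minimal sketch of capacity-based token dropping, assuming the common convention of scaling the even-split token count by a `capacity_factor` and keeping tokens in order of arrival (real implementations differ in how they prioritize and re-dispatch overflow tokens):

```python
import math
import torch

def apply_capacity(expert_index: torch.Tensor, num_experts: int,
                   capacity_factor: float = 1.25):
    """Return a keep-mask that drops tokens beyond each expert's capacity."""
    num_tokens = expert_index.numel()
    capacity = math.ceil(capacity_factor * num_tokens / num_experts)

    keep = torch.zeros(num_tokens, dtype=torch.bool)
    counts = torch.zeros(num_experts, dtype=torch.long)
    for t in range(num_tokens):          # earlier tokens get priority
        e = int(expert_index[t])
        if counts[e] < capacity:
            keep[t] = True
            counts[e] += 1
    return keep, counts                  # keep mask and per-expert load
```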
Implementation
Let's implement the auxiliary balancing loss step by step. We'll create a module that can be integrated into any MoE training loop.
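Below is a minimal sketch of such a function, assuming top-1 routing and a `(num_tokens, num_experts)` tensor of raw router logits (the function name and shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, num_experts: int) -> torch.Tensor:
    """Auxiliary balancing loss N * sum_i f_i * P_i; equals 1.0 at perfect balance."""
    probs = F.softmax(router_logits, dim=-1)                       # p_i(x), shape (T, N)

    # f_i: fraction of tokens whose argmax selects expert i (non-differentiable)
    expert_index = probs.argmax(dim=-1)                            # (T,)
    f = F.one_hot(expert_index, num_experts).float().mean(dim=0)   # (N,)

    # P_i: mean router probability assigned to expert i (differentiable)
    P = probs.mean(dim=0)                                          # (N,)

    return num_experts * torch.sum(f * P)
```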
This function follows the formulation from the Switch Transformer paper: the product $f_i \cdot P_i$ summed over experts and scaled by $N$. Let's verify it behaves as expected with a simple example.
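A small check, reusing the sketch above (the specific logits are made up for illustration):

```python
import torch

# A toy batch of 8 tokens and 4 experts, with logits skewed toward expert 2
torch.manual_seed(0)
logits = torch.randn(8, 4)
logits[:, 2] += 2.0                                   # bias the router toward expert 2
print(load_balancing_loss(logits, num_experts=4))     # well above 1.0: imbalanced

# A balanced batch: each expert is clearly preferred by exactly 1/4 of the tokens
balanced = torch.full((8, 4), -2.0)
balanced[torch.arange(8), torch.arange(8) % 4] = 2.0
print(load_balancing_loss(balanced, num_experts=4))   # 1.0: perfect balance
```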
The loss is above 1.0 because expert 2 receives most tokens (high $f_2$) and also has a high average probability (high $P_2$). With perfect balance, we'd expect $f_i = P_i = 1/N$ for every expert and a loss of exactly 1.0.
Handling Top-K Routing
When using top-2 or top-k routing, each token contributes to multiple experts' load. We need to adjust the computation:
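One possible adjustment, under the same conventions as the sketch above: count a token toward every expert in its top-k set and divide $f_i$ by $k$ so that perfect balance still yields a loss of 1.0 (other normalizations appear in the literature):

```python
import torch
import torch.nn.functional as F

def load_balancing_loss_topk(router_logits: torch.Tensor,
                             num_experts: int, k: int = 2) -> torch.Tensor:
    """Balancing loss for top-k routing; still 1.0 under perfect balance."""
    probs = F.softmax(router_logits, dim=-1)                       # (T, N)

    # f_i: fraction of tokens whose top-k selection includes expert i, scaled by 1/k
    topk_idx = probs.topk(k, dim=-1).indices                       # (T, k)
    mask = F.one_hot(topk_idx, num_experts).sum(dim=1).float()     # (T, N), k ones per row
    f = mask.mean(dim=0) / k                                       # (N,), sums to 1

    # P_i: mean router probability per expert
    P = probs.mean(dim=0)

    return num_experts * torch.sum(f * P)
```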
The top-k loss is typically lower than top-1 because selecting two experts per token naturally spreads load more evenly.
Integrating with Training
Here's how to incorporate the auxiliary loss into a training loop:
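A minimal sketch of one training step, assuming (purely for illustration) that `model` returns task logits plus a list of per-layer router logits, and that `batch` is a dict with `input_ids` and `labels`:

```python
import torch
import torch.nn.functional as F

aux_loss_coef = 0.01   # the coefficient alpha discussed above

def training_step(model, batch, optimizer):
    logits, router_logits_per_layer = model(batch["input_ids"])

    # Primary objective: standard next-token cross-entropy
    task_loss = F.cross_entropy(logits.view(-1, logits.size(-1)),
                                batch["labels"].view(-1))

    # Auxiliary objective: mean balancing loss across all MoE layers
    aux_loss = torch.stack([
        load_balancing_loss(rl, num_experts=rl.size(-1))
        for rl in router_logits_per_layer
    ]).mean()

    total_loss = task_loss + aux_loss_coef * aux_loss
    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()
    return task_loss.item(), aux_loss.item()
```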
The total loss is the sum of task loss and weighted auxiliary loss. The auxiliary loss is initially high (>1.0), reflecting the random initialization of the router.
Monitoring Expert Utilization
Tracking expert utilization during training helps diagnose load balancing issues:
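A small helper along these lines, assuming top-1 assignments (the dictionary keys are illustrative); the max/min ratio it reports is the imbalance ratio discussed below:

```python
import torch

def expert_utilization(router_logits: torch.Tensor) -> dict:
    """Per-expert token shares under top-1 routing, plus a max/min imbalance ratio."""
    num_experts = router_logits.size(-1)
    assignments = router_logits.argmax(dim=-1)                      # winning expert per token
    counts = torch.bincount(assignments, minlength=num_experts).float()
    fractions = counts / counts.sum()
    imbalance = fractions.max() / fractions.min().clamp(min=1e-9)   # large if an expert is starved
    return {"fractions": fractions.tolist(), "imbalance_ratio": imbalance.item()}
```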
With random initialization, some imbalance naturally occurs. The auxiliary loss should push this ratio closer to 1.0 during training.
Visualizing the Balancing Effect
Let's visualize how the auxiliary loss affects routing over training iterations:
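Since we don't have a full model run at hand, the sketch below trains only a standalone linear router against the balancing loss on random features; that is enough to watch the auxiliary loss drift toward 1.0 while the imbalance ratio shrinks. All sizes, the learning rate, and the skewed initialization are made up for illustration, and the sketch reuses `load_balancing_loss` and `expert_utilization` from above:

```python
import torch

torch.manual_seed(0)
num_experts, dim, steps = 4, 32, 200
router = torch.nn.Linear(dim, num_experts)
with torch.no_grad():
    router.bias[0] = 1.5                 # start from a deliberately skewed router

optimizer = torch.optim.Adam(router.parameters(), lr=1e-2)
tokens = torch.randn(512, dim)           # stand-in for token representations

for step in range(steps + 1):
    logits = router(tokens)
    aux = load_balancing_loss(logits, num_experts)      # defined earlier
    if step % 50 == 0:
        stats = expert_utilization(logits.detach())     # defined earlier
        print(f"step {step:3d}  aux={aux.item():.3f}  "
              f"imbalance={stats['imbalance_ratio']:.2f}")
    optimizer.zero_grad()
    aux.backward()
    optimizer.step()
```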
The auxiliary loss pushes the model toward more balanced routing. As training progresses, the auxiliary loss approaches 1.0 (perfect balance), and the imbalance ratio decreases.
Key Parameters
The key parameters for the auxiliary balancing loss implementation are:
- num_experts: Number of experts in the MoE layer.
- aux_loss_coef: Coefficient scaling the auxiliary loss (typically 0.01-0.1).
- k: Number of experts selected per token (top-k routing).
- expert_size: Hidden dimension size of each expert network.
Limitations and Practical Considerations
While the auxiliary balancing loss is essential for stable MoE training, it has important limitations.
Batch-level balancing only: The loss encourages balance within each batch but doesn't guarantee global balance across the entire dataset. If certain input types cluster in specific batches, expert specialization patterns may still emerge unevenly. Larger batch sizes help mitigate this issue by providing more representative samples.
Gradient signal limitations: The fraction of tokens routed ($f_i$) doesn't contribute gradients because it involves a non-differentiable argmax operation. Only the router probabilities ($P_i$) are differentiable. This means the loss influences the router through probability adjustments, not through direct feedback about routing decisions. Some tokens may still be misrouted if their probability distribution is nearly uniform.
Tension with expert specialization: An overly strong auxiliary loss can prevent meaningful expert specialization. If the model is forced to route all input types equally across experts, each expert becomes a generalist rather than developing specific competencies. The optimal balance point depends on the data distribution and task requirements.
Sensitivity to expert count: The loss formulation scales with the number of experts, but the optimal $\alpha$ may still need adjustment as expert count changes. With 8 experts versus 64 experts, the same $\alpha$ may have different practical effects on routing behavior.
Interaction with capacity factor: When combined with capacity-based token dropping, the auxiliary loss and capacity factor can work at cross purposes. The loss pushes for balance while the capacity factor enforces hard limits. If $\alpha$ is too weak, the capacity factor does most of the balancing work by dropping tokens, which wastes computation and loses information.
These limitations motivate additional techniques like Router Z-Loss (covered in the next chapter), which provides complementary stabilization by penalizing extreme router logits.
Summary
The auxiliary balancing loss is a critical component of MoE training that prevents expert collapse and ensures efficient computation. The key concepts are:
- Problem: Gating networks naturally create feedback loops where successful experts receive more tokens and improve further, while unused experts stagnate
- Solution: Add an auxiliary loss term that penalizes imbalanced routing
- Components: The loss combines the token fraction ($f_i$, actual load) with the probability fraction ($P_i$, intended load), giving a loss of exactly 1.0 under perfect balance
- Coefficient tuning: Typical values of $\alpha$ between 0.01 and 0.1 balance load distribution against task performance; too low causes collapse, too high forces unproductive uniformity
- Tradeoff: Some imbalance is natural and desirable when data distributions are non-uniform; the goal is preventing pathological collapse, not perfect uniformity
- Implementation: Track both auxiliary loss and expert utilization during training to diagnose load balancing issues
With this foundation, the next chapter explores Router Z-Loss, which addresses router training stability by penalizing extreme logit values, complementing the auxiliary balancing loss in modern MoE architectures.