
This article is part of the free-to-read Language AI Handbook
Load Balancing
In the previous chapters, we explored how Mixture of Experts architectures use gating networks to route tokens to specialized expert networks. Top-k routing selects the most relevant experts for each token, enabling sparse computation while maintaining model capacity. However, this routing mechanism introduces a critical challenge: the gating network can develop strong preferences for certain experts while ignoring others. This creates an imbalance that wastes model capacity, causes computational inefficiencies during distributed training, and in extreme cases leads to complete model collapse.
Load balancing addresses these problems by ensuring all experts receive a roughly equal share of training tokens. Without explicit balancing mechanisms, MoE models consistently converge to pathological states where only a handful of experts process the vast majority of tokens. Understanding why this happens and how to measure it is essential before we explore solutions like auxiliary losses in subsequent chapters.
The Expert Utilization Problem
When a gating network routes tokens to experts, nothing inherently prevents it from developing a preference for certain experts over others. In fact, the opposite is true: the training dynamics of MoE systems naturally push toward imbalanced routing. This tendency emerges not from any flaw in the architecture design but from the fundamental way that learning systems respond to feedback signals.
To understand why imbalance occurs, consider what happens during the early stages of training. If one expert, call it $E_1$, happens to perform slightly better than its peers on a few early batches, perhaps due to nothing more than favorable random initialization, the gating network learns to send more tokens to $E_1$. This makes sense from an optimization perspective: the gating network is simply learning to route tokens to whichever expert currently produces the best outputs. However, this reasonable local decision creates a problematic global pattern.
With more tokens flowing to $E_1$, this expert receives more gradient updates and accumulates more training signal. These additional updates allow $E_1$ to improve further, which in turn reinforces the gating network's preference for routing to this expert. Meanwhile, experts that receive fewer tokens get less training signal and fall further behind. They have fewer opportunities to learn from data, fewer gradient updates to refine their weights, and consequently less capability to offer when tokens are occasionally routed their way. This disparity creates a positive feedback loop that amplifies initial imbalances over time.
This is the rich-get-richer dynamic: initially successful experts attract more tokens, which leads to further improvement and even more routing preference. This positive feedback loop is a form of preferential attachment that emerges from the interaction between the gating network and expert training.
The problem manifests at two distinct but interconnected levels. At the token level, some experts process disproportionately many tokens within each batch, meaning that the computational load is not evenly distributed across the available expert networks. At the training level, some experts receive far more gradient updates over the course of training, meaning that certain experts learn extensively while others remain undertrained. Both types of imbalance waste model capacity, and together they create a situation where the theoretical advantages of MoE architectures cannot be realized in practice.
Capacity Waste and Computational Inefficiency
An 8-expert MoE layer with severe imbalance might route 60% of tokens to two experts while the remaining six experts share only 40% of tokens. In the extreme, several experts might receive essentially no tokens at all, sitting idle while a small subset of their peers handle all the processing workload.
This imbalanced state creates two distinct but equally serious problems that undermine the value of the MoE architecture:
- Wasted parameters: Underutilized experts contribute little to model predictions despite consuming memory. A model with 8 experts where only 3 are active effectively has the capacity of a 3-expert model with extra overhead. The memory budget allocated to store the weights of unused experts could have been spent on additional layers, larger hidden dimensions, or other architectural improvements that would actually benefit the model's performance.
- Training bottlenecks: In distributed settings where each expert resides on a different accelerator, load imbalance creates stragglers. The accelerators hosting unpopular experts finish their small share of tokens quickly, then sit idle at the synchronization point waiting for the overloaded devices to catch up. This waiting time represents pure waste: expensive hardware consuming power without producing useful computation.
The computational inefficiency is particularly acute in expert parallelism, where each expert is placed on a separate device. Consider a scenario where expert $E_1$ receives 10 times more tokens than expert $E_2$. In this case, the device hosting $E_1$ performs 10 times more computation while the device hosting $E_2$ mostly waits. Since training requires synchronization across all devices at regular intervals, the entire system is bottlenecked by the slowest device. Paradoxically, the slowest device in this context is not the one with the least capable hardware but rather the one hosting the most overloaded expert. We'll explore expert parallelism in detail later in this part.
Expert Collapse
The most severe manifestation of load imbalance is expert collapse, a failure mode where the model converges to using only one or two experts for nearly all tokens. This isn't just an efficiency problem; it fundamentally breaks the sparse computation promise of MoE. When collapse occurs, the model has effectively reverted to a dense architecture, paying the memory costs of multiple experts while receiving the computational benefits of only one or two. The entire motivation for using MoE, the ability to scale model capacity without proportionally scaling computation, is lost.
How Collapse Happens
Expert collapse occurs when the rich-get-richer dynamics run to their logical extreme. The process typically unfolds through a predictable sequence of stages, each building on the previous one:
- Early training: The gating network's random initialization creates slight expert preferences. These initial biases are typically small and might seem insignificant, but they provide the seed from which imbalance can grow.
- Divergence: Preferred experts receive more tokens and more gradients, improving faster. The gap between experts begins to widen, though it may still be subtle enough to escape notice in standard training metrics.
- Reinforcement: The gating network learns to rely more heavily on better experts. At this stage, the feedback loop is firmly established and accelerating. The router's preferences become increasingly pronounced.
- Saturation: A small subset of experts handles nearly all tokens. The model has effectively reorganized itself around a handful of dominant experts, with others receiving only occasional, often random, token assignments.
- Collapse: Non-preferred experts stop receiving training signal entirely. The process reaches its endpoint, where certain experts are completely excluded from the model's functioning.
Once collapsed, recovery is nearly impossible through normal training. The ignored experts become stale: their weights remain at initialization values or drift randomly without meaningful gradients. The gating network has learned to avoid them, so they never get the chance to improve. This creates a stable but pathological equilibrium where the model is stuck with reduced capacity and no path back to balanced utilization.
Detecting Collapse
Collapse can be detected by monitoring expert utilization during training. A healthy MoE layer shows all experts receiving tokens in roughly equal proportions, with natural fluctuations based on the content of each batch. A collapsing layer shows increasing concentration in a small number of experts, with the trend persisting across batches rather than averaging out over time.
The warning signs include:
- Monotonically increasing utilization for a subset of experts
- Near-zero routing probabilities for some experts
- Decreasing entropy in the router's softmax outputs
- Sharp drops in validation performance as diversity decreases
Why Collapse Is Self-Reinforcing
The softmax operation in the gating network contributes to collapse in a fundamental way that makes this failure mode particularly difficult to avoid. Recall from the gating networks chapter that router scores pass through a softmax to produce routing probabilities:
$$p_i = \frac{e^{s_i}}{\sum_{j=1}^{E} e^{s_j}}$$
where:
- $p_i$: the routing probability assigned to expert $i$
- $s_i$: the raw score (logit) for expert $i$
- $E$: the total number of experts
- $e^{s_i}$: the exponential function applied to the score, ensuring a positive value
- $\sum_{j=1}^{E} e^{s_j}$: the normalizing sum over all experts
This formula reveals why collapse is self-reinforcing. The exponential function at the heart of softmax has a powerful sharpening effect on probability distributions. Small differences in raw scores get amplified by the exponential, and larger differences get amplified even more. To understand this intuitively, consider that if one score increases by just 1 unit, its contribution to the numerator increases by a factor of approximately 2.7 (since $e^1 \approx 2.718$). This multiplicative effect means that even modest advantages in raw scores translate into substantial advantages in routing probability.
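As a quick illustration of this sharpening effect, the following NumPy sketch (the score values are made up for illustration) compares a 0.1-unit score gap with a 3-unit gap among four experts:

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over the last axis."""
    z = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Four experts with nearly identical raw scores: routing stays near uniform.
print(softmax(np.array([0.1, 0.0, 0.0, 0.0])))

# The same ordering, but the gap has grown to 3 units:
# the leading expert now captures ~0.87 of the probability mass.
print(softmax(np.array([3.0, 0.0, 0.0, 0.0])))
```

The second distribution shows how a moderate logit advantage becomes near-total dominance after the exponential.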
If expert $E_1$ consistently has slightly higher scores than others, the softmax concentrates probability mass on $E_1$, reducing training signal to other experts. As the gap grows over successive training steps, the softmax's sharpening effect accelerates collapse. What starts as a small preference can rapidly snowball into complete dominance. This is why monitoring metrics like router entropy, which we will discuss shortly, provides valuable early warning signals before collapse becomes irreversible.
Load Metrics
To address load imbalance, we first need to quantify it. Measuring imbalance precisely allows us to detect problems early, compare different balancing strategies, and tune hyperparameters effectively. Several metrics capture different aspects of expert utilization, each offering a distinct perspective on the health of the routing system.
Token Fraction
The most direct and intuitive measure is the fraction of tokens routed to each expert within a batch. This metric simply counts how many tokens each expert receives and divides by the total, giving us a clear picture of the current workload distribution. For a batch of $T$ tokens with top-1 routing, the token fraction for expert $i$ is:
$$f_i = \frac{1}{T} \sum_{t=1}^{T} \mathbb{1}\left[\arg\max_j p_{t,j} = i\right]$$
where:
- $f_i$: the fraction of the batch assigned to expert $i$
- $T$: the total number of tokens in the batch
- $t$: the index variable iterating over all tokens
- $\mathbb{1}[\cdot]$: the indicator function, which is 1 if the condition inside is true and 0 otherwise
- $p_{t,j}$: the routing probability assigned to expert $j$ for token $t$
- $\arg\max_j p_{t,j}$: the index of the expert with the highest probability for token $t$
- $i$: the specific expert we are measuring
The formula works by iterating through every token in the batch and checking whether that token was routed to expert $i$. The indicator function returns 1 when token $t$'s highest-probability expert matches $i$, and 0 otherwise. Summing these indicators and dividing by $T$ gives us the fraction of all tokens that expert $i$ processed.
Perfect balance means $f_i = 1/E$ for all experts. In practice, some deviation is expected and even desirable, since not all tokens should require the same expert. Natural language contains diverse phenomena, and different token types may genuinely need different processing. However, severe skew indicates problems that will compound over training.
Load Imbalance Factor
While token fractions tell us how many tokens each expert received, interpreting these raw numbers requires context. Is a token fraction of 0.15 for one expert concerning in an 8-expert system? What about in a 16-expert system? The load imbalance factor provides a normalized measure that answers these questions by measuring how far from uniform the distribution is:
$$\text{LIF} = E \cdot \max_i f_i$$
where:
- $\text{LIF}$: the load imbalance factor
- $E$: the total number of experts
- $\max_i f_i$: the highest token fraction observed among all experts
- $f_i$: the fraction of tokens assigned to expert $i$
This metric compares the maximum utilization to the ideal uniform utilization ($1/E$). The intuition behind this formula is straightforward: if all experts received exactly their fair share, the maximum fraction would be $1/E$, and multiplying by $E$ would give exactly 1. Any deviation from uniformity pushes this value higher.
With perfect balance ($f_i = 1/E$ for all $i$), the load imbalance factor equals 1. If all tokens go to a single expert ($f_i = 1$ for one expert), the factor equals $E$. Values above 1 indicate imbalance, with higher values being worse. This scaling makes the metric interpretable across different numbers of experts: a load imbalance factor of 2 always means the most-used expert receives twice its fair share, regardless of whether the system has 4 experts or 64.
Coefficient of Variation
Another useful metric is the coefficient of variation of expert loads, which captures how spread out the utilization values are relative to their expected value:
$$\text{CV} = \frac{\sigma(f)}{\mu(f)}$$
where:
- $\text{CV}$: the coefficient of variation
- $\sigma(f)$: the standard deviation of the token fractions across all experts
- $\mu(f)$: the mean token fraction (equal to $1/E$ since fractions sum to 1)
- $E$: the number of experts
The coefficient of variation differs from the load imbalance factor in what it measures. While the load imbalance factor focuses on the single most overloaded expert, the coefficient of variation considers the entire distribution. A coefficient of variation of 0 indicates perfect balance, where all experts receive identical token counts. Higher values indicate greater dispersion in utilization. This metric is particularly useful for detecting cases where multiple experts are both over- and under-utilized, even if no single expert is dramatically overloaded.
Router Probability Metrics
Rather than measuring hard assignments, which can mask the router's underlying preferences, we can also examine the soft probabilities from the router. The mean routing probability for expert $i$ across a batch is:
$$\bar{p}_i = \frac{1}{T} \sum_{t=1}^{T} p_{t,i}$$
where:
- $\bar{p}_i$: the average probability assigned to expert $i$ across the batch
- $T$: the number of tokens in the batch
- $p_{t,i}$: the probability assigned to expert $i$ for token $t$
This metric reveals information that token fractions alone cannot. Even if tokens are routed via top-k selection, which makes discrete choices, the underlying probabilities reveal the router's preferences before discretization. An expert might receive the same number of tokens as its peers in terms of final assignments but might consistently be the second choice rather than the first, indicating that it is at risk of falling behind.
We can quantify the dispersion of these soft probabilities using the entropy of the mean routing probabilities:
$$H(\bar{p}) = -\sum_{i=1}^{E} \bar{p}_i \ln \bar{p}_i$$
where:
- $H(\bar{p})$: the entropy of the average routing distribution
- $E$: the total number of experts
- $\bar{p}_i$: the average probability assigned to expert $i$
- $\ln$: the natural logarithm
Entropy measures the uncertainty or uniformity of a probability distribution. High entropy in $H(\bar{p})$ suggests the router spreads probability across experts relatively evenly; low entropy indicates concentration on a small number of preferred experts. Maximum entropy, equal to $\ln E$, occurs when all experts have equal average probability ($\bar{p}_i = 1/E$ for all $i$), corresponding to a uniform distribution. As the router's preferences sharpen toward collapse, entropy decreases toward zero.
Why Balanced Routing Matters
Beyond computational efficiency, balanced routing directly affects model quality in ways that may not be immediately obvious. The relationship between load balance and model performance runs deeper than simply keeping hardware busy.
Capacity Utilization
Each expert in an MoE layer contains independent parameters that can specialize in different aspects of the input. The power of the MoE architecture comes from this distributed specialization: rather than a single network trying to handle all patterns, multiple networks can each become expert in their own domain. An 8-expert layer with 64M parameters per expert has 512M total parameters. If only 3 experts are used, the effective capacity drops to 192M parameters while memory consumption remains at 512M. Balanced routing ensures the model leverages its full parameter budget, achieving the capacity efficiency that motivates MoE designs in the first place.
Gradient Quality
Experts that receive few tokens get noisy gradient estimates. The mathematics of stochastic gradient descent relies on averaging over many samples to get accurate estimates of the true gradient direction. In the extreme, an expert that sees only 10 tokens per batch has high-variance gradients compared to one that sees 1,000 tokens. This variance makes optimization unstable for underutilized experts, preventing them from learning effectively even when they do receive tokens. The resulting expert weights may oscillate rather than converge, never settling into useful representations.
Specialization Diversity
With balanced routing, experts are forced to handle diverse token types. This encouragement of diversity leads to different experts specializing in different linguistic phenomena: one might become skilled at handling numerical expressions, another at processing named entities, and still another at understanding syntactic structures. If routing is imbalanced, overutilized experts become generalists (handling everything), while underutilized experts never develop meaningful specializations. The model loses the specialization benefits that make MoE architectures valuable.
Training Stability
Severe imbalance creates training instabilities that manifest in multiple ways. As expert utilization shifts, the effective model capacity changes mid-training. The model might behave differently from one epoch to the next not because it is learning but because different experts are being used. Sharp routing transitions, where the gating network suddenly starts preferring a different expert, cause loss spikes and can disrupt convergence. These instabilities make training progress unpredictable and can require extensive hyperparameter tuning to mitigate.
Measuring Load Balance in Practice
Let's implement functions to compute load balance metrics and visualize expert utilization across training.
We'll start by creating a simple gating network and generating routing decisions for a batch of tokens.
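A minimal sketch, using NumPy as a stand-in for a deep learning framework. The dimensions (16 sequences of 128 tokens, a hidden size of 64, and 8 experts) follow this walkthrough; the random embeddings substitute for real hidden states:

```python
import numpy as np

rng = np.random.default_rng(0)

# Configuration: 16 sequences of 128 tokens, 8 experts.
batch_size, seq_len, d_model, num_experts = 16, 128, 64, 8
num_tokens = batch_size * seq_len

# A gating network is just a linear projection from d_model to num_experts.
W_gate = rng.normal(scale=0.02, size=(d_model, num_experts))

# Random token embeddings standing in for real hidden states.
tokens = rng.normal(size=(num_tokens, d_model))

# Raw router scores (logits), one row per token.
router_logits = tokens @ W_gate
print("router logits shape:", router_logits.shape)
```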
Now let's compute routing probabilities and expert assignments.
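Continuing the sketch (the setup is repeated so the snippet runs on its own), we apply a softmax over the expert dimension and take the argmax for top-1 assignments:

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, seq_len, d_model, num_experts = 16, 128, 64, 8
num_tokens = batch_size * seq_len

W_gate = rng.normal(scale=0.02, size=(d_model, num_experts))
tokens = rng.normal(size=(num_tokens, d_model))
logits = tokens @ W_gate

# Softmax over the expert dimension yields a routing distribution per token.
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)

# Top-1 routing: each token goes to its highest-probability expert.
assignments = probs.argmax(axis=-1)

print("total tokens:", num_tokens)
print("routing probs shape:", probs.shape)
print("assignments shape:", assignments.shape)
```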
The output confirms we have 2,048 tokens in total (16 sequences × 128 tokens each). The routing probability tensor shape matches our expectation, containing a probability distribution over the 8 experts for every token in the batch.
Computing Token Fractions
Let's compute the fraction of tokens routed to each expert.
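A sketch of the token-fraction computation; the random top-1 assignments below stand in for the router's actual decisions:

```python
import numpy as np

rng = np.random.default_rng(0)
num_tokens, num_experts = 2048, 8

# Stand-in router logits; in the full example these come from the gating network.
logits = rng.normal(scale=0.5, size=(num_tokens, num_experts))
assignments = logits.argmax(axis=-1)

# f_i = (1/T) * sum over tokens of 1[token t routed to expert i]
token_fractions = np.bincount(assignments, minlength=num_experts) / num_tokens

print("token fractions:", np.round(token_fractions, 3))
print("sum:", token_fractions.sum())
```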
Even with random initialization, we can see some imbalance emerging. Let's compute our load balance metrics.
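The three metrics can be sketched as small helper functions; the random assignments again stand in for a freshly initialized router:

```python
import numpy as np

def load_imbalance_factor(fractions):
    """LIF = E * max_i f_i; equals 1.0 under perfect balance."""
    return len(fractions) * fractions.max()

def coefficient_of_variation(fractions):
    """CV = sigma(f) / mu(f); equals 0 under perfect balance."""
    return fractions.std() / fractions.mean()

def normalized_entropy(fractions, eps=1e-12):
    """Entropy of the fractions divided by ln(E), so 1.0 means uniform."""
    f = np.clip(fractions, eps, None)
    return float(-(f * np.log(f)).sum() / np.log(len(f)))

rng = np.random.default_rng(0)
# Random top-1 assignments standing in for a freshly initialized router.
assignments = rng.integers(0, 8, size=2048)
fractions = np.bincount(assignments, minlength=8) / 2048

print("LIF:               ", round(float(load_imbalance_factor(fractions)), 3))
print("CV:                ", round(float(coefficient_of_variation(fractions)), 3))
print("normalized entropy:", round(normalized_entropy(fractions), 4))
```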
The load imbalance factor above 1.0 indicates that the most popular expert receives more than its fair share of tokens. The normalized entropy below 1.0 shows that the distribution is not uniform.
Simulating Training Dynamics
To see how imbalance develops during training, let's simulate the rich-get-richer dynamics by iteratively biasing the gating network toward popular experts.
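A toy simulation of these dynamics; the logit update rule and the `reinforcement_strength` knob are illustrative stand-ins for the feedback loop, not an actual training procedure:

```python
import numpy as np

rng = np.random.default_rng(0)
num_experts, tokens_per_step, steps = 8, 2048, 100
reinforcement_strength = 0.1  # illustrative knob, not a real training hyperparameter

# Start from near-uniform router logits.
logits = rng.normal(scale=0.01, size=num_experts)

for step in range(steps):
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Sample token assignments from the current routing distribution.
    counts = rng.multinomial(tokens_per_step, probs)
    fractions = counts / tokens_per_step
    # Rich-get-richer: experts above their fair share get a logit boost,
    # experts below it get pushed down.
    logits += reinforcement_strength * num_experts * (fractions - 1 / num_experts)

print("final fractions:", np.round(fractions, 3))
print("final LIF:", round(float(num_experts * fractions.max()), 2))
```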
The simulation demonstrates how quickly expert utilization diverges from uniform. After only 100 steps, the load imbalance factor has risen substantially, with some experts receiving several times their fair share of tokens while others receive almost none.
Visualizing Expert Collapse
Let's run a longer simulation with stronger reinforcement to observe expert collapse.
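The same toy dynamics with a stronger `reinforcement_strength` and more steps; `history` records each step's token fractions so the run can be plotted:

```python
import numpy as np

rng = np.random.default_rng(1)
num_experts, tokens_per_step, steps = 8, 2048, 500
reinforcement_strength = 0.2  # stronger than before, to force full collapse

logits = rng.normal(scale=0.01, size=num_experts)
history = np.zeros((steps, num_experts))

for step in range(steps):
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    counts = rng.multinomial(tokens_per_step, probs)
    fractions = counts / tokens_per_step
    history[step] = fractions
    logits += reinforcement_strength * num_experts * (fractions - 1 / num_experts)

dead = int((history[-1] < 0.01).sum())
print("final fractions:", np.round(history[-1], 3))
print("experts below 1% of tokens:", dead)
```

If matplotlib is available, `plt.stackplot(range(steps), history.T)` renders the stacked area view of expert shares over the run.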
The stacked area chart shows expert collapse in action. By the end of training, nearly all tokens are routed to just one or two experts. The other experts, despite consuming memory and having trainable parameters, contribute nothing to model outputs.
Analyzing Router Entropy
Router entropy provides another view into load balance. Let's examine how the softmax output distribution changes as collapse progresses.
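A sketch that tracks the entropy of the routing distribution through the same illustrative collapse dynamics (deterministic here, for clarity):

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy H(p) = -sum_i p_i ln p_i (natural log)."""
    p = np.clip(p, eps, None)
    return float(-(p * np.log(p)).sum())

rng = np.random.default_rng(1)
num_experts, steps, reinforcement_strength = 8, 300, 0.2
logits = rng.normal(scale=0.01, size=num_experts)
entropies = []

for step in range(steps):
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    entropies.append(entropy(probs))
    # Deterministic rich-get-richer drift on the routing probabilities.
    logits += reinforcement_strength * num_experts * (probs - 1 / num_experts)

print("entropy at step 0:  ", round(entropies[0], 3))   # near ln(8) ≈ 2.079
print("entropy at the end: ", round(entropies[-1], 3))  # near 0
```

Plotting `entropies` against the step index reproduces the monotonic decline toward zero.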
The entropy plot clearly shows the progression toward collapse. Starting at maximum entropy (uniform routing), the entropy monotonically decreases as the router concentrates probability mass on fewer experts. This metric provides an early warning signal: entropy begins dropping before token fractions show obvious skew.
The Cost of Imbalance in Distributed Training
Load imbalance has particularly severe consequences when experts are distributed across devices. In expert parallelism, each expert resides on a separate accelerator, and the system's throughput is constrained by the slowest device. Let's quantify the efficiency loss that results from uneven workload distribution.
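Since throughput is set by the busiest device, parallel efficiency can be modeled as the ideal per-device load divided by the actual maximum load, which is exactly 1/LIF (a simplification that ignores communication costs):

```python
import numpy as np

def parallel_efficiency(fractions):
    """Expert-parallel efficiency: ideal per-device load / actual max load.
    Equivalent to 1 / LIF, since the slowest device sets the pace."""
    uniform = 1 / len(fractions)
    return uniform / fractions.max()

balanced = np.full(8, 1 / 8)
skewed = np.array([0.25, 0.25, 0.10, 0.10, 0.10, 0.10, 0.05, 0.05])
collapsed = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])

print(parallel_efficiency(balanced))    # 1.0:   every device equally busy
print(parallel_efficiency(skewed))      # 0.5:   LIF = 2 halves throughput
print(parallel_efficiency(collapsed))   # 0.125: one device does all the work
```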
The relationship is stark: efficiency is inversely proportional to the load imbalance factor. When the most-loaded expert receives twice its fair share of tokens (LIF = 2), parallel efficiency drops to 50%. When one expert receives all tokens (LIF $= E$), efficiency approaches $1/E$, since only one device does useful work while the others wait.
Capacity Factor Constraints
One approach to limiting imbalance is to impose a hard capacity constraint on each expert. Rather than allowing the router to send arbitrarily many tokens to a popular expert, we define a maximum limit that caps how many tokens any single expert can process. This constraint prevents the most extreme forms of imbalance by rejecting tokens that would overflow an expert's capacity.
We define a maximum capacity limit for each expert using the following formula:
$$C = \left\lfloor \alpha \cdot \frac{T}{E} \right\rfloor$$
where:
- $C$: the maximum number of tokens an expert is allowed to process
- $\alpha$: the capacity factor (typically $1.0 \le \alpha \le 1.5$), which determines the buffer size
- $T$: the total number of tokens in the batch
- $E$: the number of experts
- $\lfloor \cdot \rfloor$: the floor function, ensuring an integer result
This formula sets the limit to a multiple of the ideal uniform load ($T/E$), allowing for natural variance while capping extreme imbalance. The capacity factor $\alpha$ provides flexibility: a value of 1.0 would enforce strict uniformity, while higher values like 1.25 or 1.5 allow experts to handle somewhat more than their fair share before hitting the limit. The floor function ensures we get a whole number of tokens, since partial token assignments are not meaningful.
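A sketch of the capacity computation and the resulting token drops, using a deliberately skewed routing distribution (the 50% preference for expert 0 is made up for illustration):

```python
import math
import numpy as np

def expert_capacity(num_tokens, num_experts, capacity_factor=1.25):
    """C = floor(alpha * T / E): max tokens any one expert may process."""
    return math.floor(capacity_factor * num_tokens / num_experts)

rng = np.random.default_rng(0)
num_tokens, num_experts = 2048, 8
cap = expert_capacity(num_tokens, num_experts)  # floor(1.25 * 2048 / 8) = 320
print("capacity per expert:", cap)

# Skewed assignments: expert 0 attracts half of all tokens.
probs = np.full(num_experts, 0.5 / (num_experts - 1))
probs[0] = 0.5
assignments = rng.choice(num_experts, size=num_tokens, p=probs)

# Tokens beyond each expert's capacity are dropped (no expert processes them).
counts = np.bincount(assignments, minlength=num_experts)
dropped = int(np.maximum(counts - cap, 0).sum())
print("tokens per expert:", counts)
print("dropped tokens:", dropped)
```

With this skew, expert 0 receives roughly 1,024 tokens against a capacity of 320, so most of its overflow is dropped while the other experts stay well under their quota.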
Capacity constraints prevent catastrophic imbalance by rejecting tokens that would overflow an expert's capacity. However, this approach has a significant downside: dropped tokens receive no expert processing and must either be handled by a shared residual pathway or simply contribute less to the model's output. The next chapter explores auxiliary losses that encourage balance during training rather than enforcing it through hard constraints.
Key Parameters
The key parameters for managing load balance are:
- num_experts: The total number of experts in the MoE layer. More experts increase capacity but make balancing harder.
- capacity_factor: A scalar multiplier (typically 1.0 to 1.25) determining the maximum tokens an expert can process relative to uniform load.
- reinforcement_strength: A simulation parameter modeling the magnitude of the "rich-get-richer" feedback loop.
Limitations and Impact
Load balancing in MoE models involves navigating fundamental tensions. Perfect balance, where every expert receives exactly the same number of tokens, may not be optimal for model quality. Different token types genuinely require different processing, and some semantic categories may be more prevalent in the data than others. Forcing exact uniformity could degrade performance by preventing natural specialization patterns.
The metrics we've explored also have limitations. Token fractions measured per-batch can fluctuate substantially; an expert might appear underutilized in one batch but be essential for a specific topic that appears in the next. Longer-term averaging provides more stable estimates but responds slowly to genuine shifts in expert utility. Entropy-based metrics capture the router's confidence distribution but don't distinguish between an expert that's appropriately confident on relevant tokens versus one that's inappropriately dominant.
Despite these limitations, load balancing is critical for practical MoE deployment. Without it, the promised efficiency gains of sparse computation evaporate as models collapse to effectively dense computation through a small subset of experts. The training instabilities from severe imbalance, such as gradient variance, capacity oscillation, and convergence failures, make unbalanced MoE models difficult to train at scale.
The next two chapters address load balancing through loss function design. Auxiliary balancing losses add a penalty term that encourages the router to spread tokens across experts. Router z-loss specifically targets the softmax's tendency to sharpen into collapse-prone distributions. Together with capacity constraints, these mechanisms enable stable training of MoE models with balanced expert utilization.
Summary
Load balancing ensures that all experts in a Mixture of Experts model receive meaningful training signal and contribute to model predictions. Without explicit balancing mechanisms, MoE training exhibits rich-get-richer dynamics where initially preferred experts attract more tokens, receive more gradients, and become even more preferred; this positive feedback loop leads to expert collapse.
Key load balance metrics include the load imbalance factor (measuring deviation from uniform allocation), coefficient of variation (quantifying spread in token fractions), and router entropy (capturing the concentration of routing probability mass). These metrics enable early detection of imbalance before it degrades model quality or training stability.
The consequences of imbalance extend beyond wasted model capacity. In distributed training with expert parallelism, load imbalance creates stragglers that destroy parallel efficiency. A load imbalance factor of 2 halves throughput, and collapsed models may run at only $1/E$ efficiency despite having $E$ experts.
Capacity constraints provide one mechanism for limiting imbalance by rejecting tokens that would overflow an expert's quota. However, this approach drops tokens rather than preventing the router from developing imbalanced preferences in the first place. The auxiliary balancing loss and router z-loss covered in the next chapters address this by shaping the training objective itself.