
This article is part of the free-to-read Language AI Handbook
Load Balancing
In the previous chapters, we explored how Mixture of Experts architectures use gating networks to route tokens to specialized expert networks. Top-k routing selects the most relevant experts for each token, enabling sparse computation while maintaining model capacity. However, this routing mechanism introduces a critical challenge: the gating network can develop strong preferences for certain experts while ignoring others. This creates an imbalance that wastes model capacity, causes computational inefficiencies during distributed training, and in extreme cases leads to complete model collapse.
Load balancing addresses these problems by ensuring all experts receive a roughly equal share of training tokens. Without explicit balancing mechanisms, MoE models consistently converge to pathological states where only a handful of experts process the vast majority of tokens. Understanding why this happens and how to measure it is essential before we explore solutions like auxiliary losses in subsequent chapters.
The Expert Utilization Problem
When a gating network routes tokens to experts, nothing inherently prevents it from developing a preference for certain experts over others. In fact, the opposite is true: the training dynamics of MoE systems naturally push toward imbalanced routing. This tendency emerges not from any flaw in the architecture design but from the fundamental way that learning systems respond to feedback signals.
To understand why imbalance occurs, consider what happens during the early stages of training. If one expert, call it $E_1$, happens to perform slightly better than its peers on a few early batches, perhaps due to nothing more than favorable random initialization, the gating network learns to send more tokens to $E_1$. This makes sense from an optimization perspective: the gating network is simply learning to route tokens to whichever expert currently produces the best outputs. However, this reasonable local decision creates a problematic global pattern.
With more tokens flowing to $E_1$, this expert receives more gradient updates and accumulates more training signal. These additional updates allow $E_1$ to improve further, which in turn reinforces the gating network's preference for routing to this expert. Meanwhile, experts that receive fewer tokens get less training signal and fall further behind. They have fewer opportunities to learn from data, fewer gradient updates to refine their weights, and consequently less capability to offer when tokens are occasionally routed their way. This disparity creates a positive feedback loop that amplifies initial imbalances over time.
This is the rich-get-richer dynamic: initially successful experts attract more tokens, which leads to further improvement and even more routing preference. This positive feedback loop is a form of preferential attachment that emerges from the interaction between the gating network and expert training.
The problem manifests at two distinct but interconnected levels. At the token level, some experts process disproportionately many tokens within each batch, meaning that the computational load is not evenly distributed across the available expert networks. At the training level, some experts receive far more gradient updates over the course of training, meaning that certain experts learn extensively while others remain undertrained. Both types of imbalance waste model capacity, and together they create a situation where the theoretical advantages of MoE architectures cannot be realized in practice.
Capacity Waste and Computational Inefficiency
An 8-expert MoE layer with severe imbalance might route 60% of tokens to two experts while the remaining six experts share only 40% of tokens. In the extreme, several experts might receive essentially no tokens at all, sitting idle while a small subset of their peers handle all the processing workload.
This imbalanced state creates two distinct but equally serious problems that undermine the value of the MoE architecture:
- Wasted parameters: Underutilized experts contribute little to model predictions despite consuming memory. A model with 8 experts where only 3 are active effectively has the capacity of a 3-expert model with extra overhead. The memory budget allocated to store the weights of unused experts could have been spent on additional layers, larger hidden dimensions, or other architectural improvements that would actually benefit the model's performance.
- Training bottlenecks: In distributed settings where each expert resides on a different accelerator, load imbalance creates stragglers. The accelerators hosting unpopular experts finish their small share of tokens quickly, then sit idle at the synchronization point waiting for the overloaded devices to catch up. This waiting time represents pure waste: expensive hardware consuming power without producing useful computation.
The computational inefficiency is particularly acute in expert parallelism, where each expert is placed on a separate device. Consider a scenario where expert $E_1$ receives 10 times more tokens than expert $E_2$. In this case, the device hosting $E_1$ performs 10 times more computation while the device hosting $E_2$ mostly waits. Since training requires synchronization across all devices at regular intervals, the entire system is bottlenecked by the slowest device. Paradoxically, the slowest device in this context is not the one with the least capable hardware but rather the one hosting the most overloaded expert. We'll explore expert parallelism in detail later in this part.
Expert Collapse
The most severe manifestation of load imbalance is expert collapse, a failure mode where the model converges to using only one or two experts for nearly all tokens. This isn't just an efficiency problem; it fundamentally breaks the sparse computation promise of MoE. When collapse occurs, the model has effectively reverted to a dense architecture, paying the memory costs of multiple experts while receiving the computational benefits of only one or two. The entire motivation for using MoE, the ability to scale model capacity without proportionally scaling computation, is lost.
How Collapse Happens
Expert collapse occurs when the rich-get-richer dynamics run to their logical extreme. The process typically unfolds through a predictable sequence of stages, each building on the previous one:
- Early training: The gating network's random initialization creates slight expert preferences. These initial biases are typically small and might seem insignificant, but they provide the seed from which imbalance can grow.
- Divergence: Preferred experts receive more tokens and more gradients, improving faster. The gap between experts begins to widen, though it may still be subtle enough to escape notice in standard training metrics.
- Reinforcement: The gating network learns to rely more heavily on better experts. At this stage, the feedback loop is firmly established and accelerating. The router's preferences become increasingly pronounced.
- Saturation: A small subset of experts handles nearly all tokens. The model has effectively reorganized itself around a handful of dominant experts, with others receiving only occasional, often random, token assignments.
- Collapse: Non-preferred experts stop receiving training signal entirely. The process reaches its endpoint, where certain experts are completely excluded from the model's functioning.
Once collapsed, recovery is nearly impossible through normal training. The ignored experts become stale: their weights remain at initialization values or drift randomly without meaningful gradients. The gating network has learned to avoid them, so they never get the chance to improve. This creates a stable but pathological equilibrium where the model is stuck with reduced capacity and no path back to balanced utilization.
Detecting Collapse
Collapse can be detected by monitoring expert utilization during training. A healthy MoE layer shows all experts receiving tokens in roughly equal proportions, with natural fluctuations based on the content of each batch. A collapsing layer shows increasing concentration in a small number of experts, with the trend persisting across batches rather than averaging out over time.
The warning signs include:
- Monotonically increasing utilization for a subset of experts
- Near-zero routing probabilities for some experts
- Decreasing entropy in the router's softmax outputs
- Sharp drops in validation performance as diversity decreases
Why Collapse Is Self-Reinforcing
The softmax operation in the gating network contributes to collapse in a fundamental way that makes this failure mode particularly difficult to avoid. Recall from the gating networks chapter that router scores pass through a softmax to produce routing probabilities:
$$p_i = \frac{e^{s_i}}{\sum_{j=1}^{E} e^{s_j}}$$
where:
- $p_i$: the routing probability assigned to expert $i$
- $s_i$: the raw score (logit) for expert $i$
- $E$: the total number of experts
- $e^{s_i}$: the exponential function applied to the score, ensuring a positive value
- $\sum_{j=1}^{E} e^{s_j}$: the normalizing sum over all experts
This formula reveals why collapse is self-reinforcing. The exponential function at the heart of softmax has a powerful sharpening effect on probability distributions. Small differences in raw scores get amplified by the exponential, and larger differences get amplified even more. To understand this intuitively, consider that if one score increases by just 1 unit, its contribution to the numerator increases by a factor of approximately 2.7 (since $e^1 \approx 2.718$). This multiplicative effect means that even modest advantages in raw scores translate into substantial advantages in routing probability.
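As a quick illustration of this sharpening effect, the following NumPy sketch (the score values are made up for illustration) compares a 0.1-unit score gap with a 3-unit gap among four experts:

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over the last axis."""
    z = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Four experts with nearly identical raw scores: routing stays near uniform.
print(softmax(np.array([0.1, 0.0, 0.0, 0.0])))

# The same ordering, but the gap has grown to 3 units:
# the leading expert now captures ~0.87 of the probability mass.
print(softmax(np.array([3.0, 0.0, 0.0, 0.0])))
```

The second distribution shows how a moderate logit advantage becomes near-total dominance after the exponential.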
If expert $E_1$ consistently has slightly higher scores than others, the softmax concentrates probability mass on $E_1$, reducing training signal to other experts. As the gap grows over successive training steps, the softmax's sharpening effect accelerates collapse. What starts as a small preference can rapidly snowball into complete dominance. This is why monitoring metrics like router entropy, which we will discuss shortly, provides valuable early warning signals before collapse becomes irreversible.
Load Metrics
To address load imbalance, we first need to quantify it. Measuring imbalance precisely allows us to detect problems early, compare different balancing strategies, and tune hyperparameters effectively. Several metrics capture different aspects of expert utilization, each offering a distinct perspective on the health of the routing system.
Token Fraction
The most direct and intuitive measure is the fraction of tokens routed to each expert within a batch. This metric simply counts how many tokens each expert receives and divides by the total, giving us a clear picture of the current workload distribution. For a batch of $T$ tokens with top-1 routing, the token fraction for expert $i$ is:
$$f_i = \frac{1}{T} \sum_{t=1}^{T} \mathbb{1}\left[\arg\max_j p_{t,j} = i\right]$$
where:
- $f_i$: the fraction of the batch assigned to expert $i$
- $T$: the total number of tokens in the batch
- $t$: the index variable iterating over all tokens
- $\mathbb{1}[\cdot]$: the indicator function, which is 1 if the condition inside is true and 0 otherwise
- $p_{t,j}$: the routing probability assigned to expert $j$ for token $t$
- $\arg\max_j p_{t,j}$: the index of the expert with the highest probability for token $t$
- $i$: the specific expert we are measuring
The formula works by iterating through every token in the batch and checking whether that token was routed to expert $i$. The indicator function returns 1 when token $t$'s highest-probability expert matches $i$, and 0 otherwise. Summing these indicators and dividing by $T$ gives us the fraction of all tokens that expert $i$ processed.
Perfect balance means $f_i = 1/E$ for all experts. In practice, some deviation is expected and even desirable, since not all tokens should require the same expert. Natural language contains diverse phenomena, and different token types may genuinely need different processing. However, severe skew indicates problems that will compound over training.
Load Imbalance Factor
While token fractions tell us how many tokens each expert received, interpreting these raw numbers requires context. Is a token fraction of 0.15 for one expert concerning in an 8-expert system? What about in a 16-expert system? The load imbalance factor provides a normalized measure that answers these questions by measuring how far from uniform the distribution is:
$$\text{LIF} = E \cdot \max_i f_i$$
where:
- $\text{LIF}$: the load imbalance factor
- $E$: the total number of experts
- $\max_i f_i$: the highest token fraction observed among all experts
- $f_i$: the fraction of tokens assigned to expert $i$
This metric compares the maximum utilization to the ideal uniform utilization ($1/E$). The intuition behind this formula is straightforward: if all experts received exactly their fair share, the maximum fraction would be $1/E$, and multiplying by $E$ would give exactly 1. Any deviation from uniformity pushes this value higher.
With perfect balance ($f_i = 1/E$ for all $i$), the load imbalance factor equals 1. If all tokens go to a single expert ($f_i = 1$ for one expert), the factor equals $E$. Values above 1 indicate imbalance, with higher values being worse. This scaling makes the metric interpretable across different numbers of experts: a load imbalance factor of 2 always means the most-used expert receives twice its fair share, regardless of whether the system has 4 experts or 64.
Coefficient of Variation
Another useful metric is the coefficient of variation of expert loads, which captures how spread out the utilization values are relative to their expected value:
$$\text{CV} = \frac{\sigma(f)}{\mu(f)}$$
where:
- $\text{CV}$: the coefficient of variation
- $\sigma(f)$: the standard deviation of the token fractions across all experts
- $\mu(f)$: the mean token fraction (equal to $1/E$ since fractions sum to 1)
- $E$: the number of experts
The coefficient of variation differs from the load imbalance factor in what it measures. While the load imbalance factor focuses on the single most overloaded expert, the coefficient of variation considers the entire distribution. A coefficient of variation of 0 indicates perfect balance, where all experts receive identical token counts. Higher values indicate greater dispersion in utilization. This metric is particularly useful for detecting cases where multiple experts are both over- and under-utilized, even if no single expert is dramatically overloaded.
Router Probability Metrics
Rather than measuring hard assignments, which can mask the router's underlying preferences, we can also examine the soft probabilities from the router. The mean routing probability for expert $i$ across a batch is:
$$\bar{p}_i = \frac{1}{T} \sum_{t=1}^{T} p_{t,i}$$
where:
- $\bar{p}_i$: the average probability assigned to expert $i$ across the batch
- $T$: the number of tokens in the batch
- $p_{t,i}$: the probability assigned to expert $i$ for token $t$
This metric reveals information that token fractions alone cannot. Even if tokens are routed via top-k selection, which makes discrete choices, the underlying probabilities reveal the router's preferences before discretization. An expert might receive the same number of tokens as its peers in terms of final assignments but might consistently be the second choice rather than the first, indicating that it is at risk of falling behind.
We can quantify the dispersion of these soft probabilities using the entropy of the mean routing probabilities:
$$H(\bar{p}) = -\sum_{i=1}^{E} \bar{p}_i \ln \bar{p}_i$$
where:
- $H(\bar{p})$: the entropy of the average routing distribution
- $E$: the total number of experts
- $\bar{p}_i$: the average probability assigned to expert $i$
- $\ln$: the natural logarithm
Entropy measures the uncertainty or uniformity of a probability distribution. High entropy in $H(\bar{p})$ suggests the router spreads probability across experts relatively evenly; low entropy indicates concentration on a small number of preferred experts. Maximum entropy, equal to $\ln E$, occurs when all experts have equal average probability ($\bar{p}_i = 1/E$ for all $i$), corresponding to a uniform distribution. As the router's preferences sharpen toward collapse, entropy decreases toward zero.
Why Balanced Routing Matters
Beyond computational efficiency, balanced routing directly affects model quality in ways that may not be immediately obvious. The relationship between load balance and model performance runs deeper than simply keeping hardware busy.
Capacity Utilization
Each expert in an MoE layer contains independent parameters that can specialize in different aspects of the input. The power of the MoE architecture comes from this distributed specialization: rather than a single network trying to handle all patterns, multiple networks can each become expert in their own domain. An 8-expert layer with 64M parameters per expert has 512M total parameters. If only 3 experts are used, the effective capacity drops to 192M parameters while memory consumption remains at 512M. Balanced routing ensures the model leverages its full parameter budget, achieving the capacity efficiency that motivates MoE designs in the first place.
Gradient Quality
Experts that receive few tokens get noisy gradient estimates. The mathematics of stochastic gradient descent relies on averaging over many samples to get accurate estimates of the true gradient direction. In the extreme, an expert that sees only 10 tokens per batch has high-variance gradients compared to one that sees 1,000 tokens. This variance makes optimization unstable for underutilized experts, preventing them from learning effectively even when they do receive tokens. The resulting expert weights may oscillate rather than converge, never settling into useful representations.
Specialization Diversity
With balanced routing, experts are forced to handle diverse token types. This encouragement of diversity leads to different experts specializing in different linguistic phenomena: one might become skilled at handling numerical expressions, another at processing named entities, and still another at understanding syntactic structures. If routing is imbalanced, overutilized experts become generalists (handling everything), while underutilized experts never develop meaningful specializations. The model loses the specialization benefits that make MoE architectures valuable.
Training Stability
Severe imbalance creates training instabilities that manifest in multiple ways. As expert utilization shifts, the effective model capacity changes mid-training. The model might behave differently from one epoch to the next not because it is learning but because different experts are being used. Sharp routing transitions, where the gating network suddenly starts preferring a different expert, cause loss spikes and can disrupt convergence. These instabilities make training progress unpredictable and can require extensive hyperparameter tuning to mitigate.
Measuring Load Balance in Practice
Let's implement functions to compute load balance metrics and visualize expert utilization across training.
We'll start by creating a simple gating network and generating routing decisions for a batch of tokens.
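A minimal sketch, using NumPy as a stand-in for a deep learning framework. The dimensions (16 sequences of 128 tokens, a hidden size of 64, and 8 experts) follow this walkthrough; the random embeddings substitute for real hidden states:

```python
import numpy as np

rng = np.random.default_rng(0)

# Configuration: 16 sequences of 128 tokens, 8 experts.
batch_size, seq_len, d_model, num_experts = 16, 128, 64, 8
num_tokens = batch_size * seq_len

# A gating network is just a linear projection from d_model to num_experts.
W_gate = rng.normal(scale=0.02, size=(d_model, num_experts))

# Random token embeddings standing in for real hidden states.
tokens = rng.normal(size=(num_tokens, d_model))

# Raw router scores (logits), one row per token.
router_logits = tokens @ W_gate
print("router logits shape:", router_logits.shape)
```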
Now let's compute routing probabilities and expert assignments.
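Continuing the sketch (the setup is repeated so the snippet runs on its own), we apply a softmax over the expert dimension and take the argmax for top-1 assignments:

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, seq_len, d_model, num_experts = 16, 128, 64, 8
num_tokens = batch_size * seq_len

W_gate = rng.normal(scale=0.02, size=(d_model, num_experts))
tokens = rng.normal(size=(num_tokens, d_model))
logits = tokens @ W_gate

# Softmax over the expert dimension yields a routing distribution per token.
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)

# Top-1 routing: each token goes to its highest-probability expert.
assignments = probs.argmax(axis=-1)

print("total tokens:", num_tokens)
print("routing probs shape:", probs.shape)
print("assignments shape:", assignments.shape)
```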
The output confirms we have 2,048 tokens in total (16 sequences × 128 tokens each). The routing probability tensor shape matches our expectation, containing a probability distribution over the 8 experts for every token in the batch.
Computing Token Fractions
Let's compute the fraction of tokens routed to each expert.
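A sketch of the token-fraction computation; the random top-1 assignments below stand in for the router's actual decisions:

```python
import numpy as np

rng = np.random.default_rng(0)
num_tokens, num_experts = 2048, 8

# Stand-in router logits; in the full example these come from the gating network.
logits = rng.normal(scale=0.5, size=(num_tokens, num_experts))
assignments = logits.argmax(axis=-1)

# f_i = (1/T) * sum over tokens of 1[token t routed to expert i]
token_fractions = np.bincount(assignments, minlength=num_experts) / num_tokens

print("token fractions:", np.round(token_fractions, 3))
print("sum:", token_fractions.sum())
```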
Even with random initialization, we can see some imbalance emerging. Let's compute our load balance metrics.
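The three metrics can be sketched as small helper functions; the random assignments again stand in for a freshly initialized router:

```python
import numpy as np

def load_imbalance_factor(fractions):
    """LIF = E * max_i f_i; equals 1.0 under perfect balance."""
    return len(fractions) * fractions.max()

def coefficient_of_variation(fractions):
    """CV = sigma(f) / mu(f); equals 0 under perfect balance."""
    return fractions.std() / fractions.mean()

def normalized_entropy(fractions, eps=1e-12):
    """Entropy of the fractions divided by ln(E), so 1.0 means uniform."""
    f = np.clip(fractions, eps, None)
    return float(-(f * np.log(f)).sum() / np.log(len(f)))

rng = np.random.default_rng(0)
# Random top-1 assignments standing in for a freshly initialized router.
assignments = rng.integers(0, 8, size=2048)
fractions = np.bincount(assignments, minlength=8) / 2048

print("LIF:               ", round(float(load_imbalance_factor(fractions)), 3))
print("CV:                ", round(float(coefficient_of_variation(fractions)), 3))
print("normalized entropy:", round(normalized_entropy(fractions), 4))
```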
The load imbalance factor above 1.0 indicates that the most popular expert receives more than its fair share of tokens. The normalized entropy below 1.0 shows that the distribution is not uniform.
Simulating Training Dynamics
To see how imbalance develops during training, let's simulate the rich-get-richer dynamics by iteratively biasing the gating network toward popular experts.
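A toy simulation of these dynamics; the logit update rule and the `reinforcement_strength` knob are illustrative stand-ins for the feedback loop, not an actual training procedure:

```python
import numpy as np

rng = np.random.default_rng(0)
num_experts, tokens_per_step, steps = 8, 2048, 100
reinforcement_strength = 0.1  # illustrative knob, not a real training hyperparameter

# Start from near-uniform router logits.
logits = rng.normal(scale=0.01, size=num_experts)

for step in range(steps):
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Sample token assignments from the current routing distribution.
    counts = rng.multinomial(tokens_per_step, probs)
    fractions = counts / tokens_per_step
    # Rich-get-richer: experts above their fair share get a logit boost,
    # experts below it get pushed down.
    logits += reinforcement_strength * num_experts * (fractions - 1 / num_experts)

print("final fractions:", np.round(fractions, 3))
print("final LIF:", round(float(num_experts * fractions.max()), 2))
```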
The simulation demonstrates how quickly expert utilization diverges from uniform. After only 100 steps, the load imbalance factor has risen substantially, with some experts receiving several times their fair share of tokens while others receive almost none.
Visualizing Expert Collapse
Let's run a longer simulation with stronger reinforcement to observe expert collapse.
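The same toy dynamics with a stronger `reinforcement_strength` and more steps; `history` records each step's token fractions so the run can be plotted:

```python
import numpy as np

rng = np.random.default_rng(1)
num_experts, tokens_per_step, steps = 8, 2048, 500
reinforcement_strength = 0.2  # stronger than before, to force full collapse

logits = rng.normal(scale=0.01, size=num_experts)
history = np.zeros((steps, num_experts))

for step in range(steps):
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    counts = rng.multinomial(tokens_per_step, probs)
    fractions = counts / tokens_per_step
    history[step] = fractions
    logits += reinforcement_strength * num_experts * (fractions - 1 / num_experts)

dead = int((history[-1] < 0.01).sum())
print("final fractions:", np.round(history[-1], 3))
print("experts below 1% of tokens:", dead)
```

If matplotlib is available, `plt.stackplot(range(steps), history.T)` renders the stacked area view of expert shares over the run.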
The stacked area chart shows expert collapse in action. By the end of training, nearly all tokens are routed to just one or two experts. The other experts, despite consuming memory and having trainable parameters, contribute nothing to model outputs.
Analyzing Router Entropy
Router entropy provides another view into load balance. Let's examine how the softmax output distribution changes as collapse progresses.
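A sketch that tracks the entropy of the routing distribution through the same illustrative collapse dynamics (deterministic here, for clarity):

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy H(p) = -sum_i p_i ln p_i (natural log)."""
    p = np.clip(p, eps, None)
    return float(-(p * np.log(p)).sum())

rng = np.random.default_rng(1)
num_experts, steps, reinforcement_strength = 8, 300, 0.2
logits = rng.normal(scale=0.01, size=num_experts)
entropies = []

for step in range(steps):
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    entropies.append(entropy(probs))
    # Deterministic rich-get-richer drift on the routing probabilities.
    logits += reinforcement_strength * num_experts * (probs - 1 / num_experts)

print("entropy at step 0:  ", round(entropies[0], 3))   # near ln(8) ≈ 2.079
print("entropy at the end: ", round(entropies[-1], 3))  # near 0
```

Plotting `entropies` against the step index reproduces the monotonic decline toward zero.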
The entropy plot clearly shows the progression toward collapse. Starting at maximum entropy (uniform routing), the entropy monotonically decreases as the router concentrates probability mass on fewer experts. This metric provides an early warning signal: entropy begins dropping before token fractions show obvious skew.
The Cost of Imbalance in Distributed Training
Load imbalance has particularly severe consequences when experts are distributed across devices. In expert parallelism, each expert resides on a separate accelerator, and the system's throughput is constrained by the slowest device. Let's quantify the efficiency loss that results from uneven workload distribution.
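Since throughput is set by the busiest device, parallel efficiency can be modeled as the ideal per-device load divided by the actual maximum load, which is exactly 1/LIF (a simplification that ignores communication costs):

```python
import numpy as np

def parallel_efficiency(fractions):
    """Expert-parallel efficiency: ideal per-device load / actual max load.
    Equivalent to 1 / LIF, since the slowest device sets the pace."""
    uniform = 1 / len(fractions)
    return uniform / fractions.max()

balanced = np.full(8, 1 / 8)
skewed = np.array([0.25, 0.25, 0.10, 0.10, 0.10, 0.10, 0.05, 0.05])
collapsed = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])

print(parallel_efficiency(balanced))    # 1.0:   every device equally busy
print(parallel_efficiency(skewed))      # 0.5:   LIF = 2 halves throughput
print(parallel_efficiency(collapsed))   # 0.125: one device does all the work
```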
The relationship is stark: efficiency is inversely proportional to the load imbalance factor. When the most-loaded expert receives twice its fair share of tokens (LIF = 2), parallel efficiency drops to 50%. When one expert receives all tokens (LIF $= E$), efficiency approaches $1/E$, since only one device does useful work while the others wait.
Capacity Factor Constraints
One approach to limiting imbalance is to impose a hard capacity constraint on each expert. Rather than allowing the router to send arbitrarily many tokens to a popular expert, we define a maximum limit that caps how many tokens any single expert can process. This constraint prevents the most extreme forms of imbalance by rejecting tokens that would overflow an expert's capacity.
We define a maximum capacity limit for each expert using the following formula:
$$C = \left\lfloor \alpha \cdot \frac{T}{E} \right\rfloor$$
where:
- $C$: the maximum number of tokens an expert is allowed to process
- $\alpha$: the capacity factor (typically $1.0 \le \alpha \le 1.5$), which determines the buffer size
- $T$: the total number of tokens in the batch
- $E$: the number of experts
- $\lfloor \cdot \rfloor$: the floor function, ensuring an integer result
This formula sets the limit to a multiple of the ideal uniform load ($T/E$), allowing for natural variance while capping extreme imbalance. The capacity factor $\alpha$ provides flexibility: a value of 1.0 would enforce strict uniformity, while higher values like 1.25 or 1.5 allow experts to handle somewhat more than their fair share before hitting the limit. The floor function ensures we get a whole number of tokens, since partial token assignments are not meaningful.
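A sketch of the capacity computation and the resulting token drops, using a deliberately skewed routing distribution (the 50% preference for expert 0 is made up for illustration):

```python
import math
import numpy as np

def expert_capacity(num_tokens, num_experts, capacity_factor=1.25):
    """C = floor(alpha * T / E): max tokens any one expert may process."""
    return math.floor(capacity_factor * num_tokens / num_experts)

rng = np.random.default_rng(0)
num_tokens, num_experts = 2048, 8
cap = expert_capacity(num_tokens, num_experts)  # floor(1.25 * 2048 / 8) = 320
print("capacity per expert:", cap)

# Skewed assignments: expert 0 attracts half of all tokens.
probs = np.full(num_experts, 0.5 / (num_experts - 1))
probs[0] = 0.5
assignments = rng.choice(num_experts, size=num_tokens, p=probs)

# Tokens beyond each expert's capacity are dropped (no expert processes them).
counts = np.bincount(assignments, minlength=num_experts)
dropped = int(np.maximum(counts - cap, 0).sum())
print("tokens per expert:", counts)
print("dropped tokens:", dropped)
```

With this skew, expert 0 receives roughly 1,024 tokens against a capacity of 320, so most of its overflow is dropped while the other experts stay well under their quota.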
Capacity constraints prevent catastrophic imbalance by rejecting tokens that would overflow an expert's capacity. However, this approach has a significant downside: dropped tokens receive no expert processing and must either be handled by a shared residual pathway or simply contribute less to the model's output. The next chapter explores auxiliary losses that encourage balance during training rather than enforcing it through hard constraints.
Key Parameters
The key parameters for managing load balance are:
- num_experts: The total number of experts in the MoE layer. More experts increase capacity but make balancing harder.
- capacity_factor: A scalar multiplier (typically 1.0 to 1.25) determining the maximum tokens an expert can process relative to uniform load.
- reinforcement_strength: A simulation parameter modeling the magnitude of the "rich-get-richer" feedback loop.
Limitations and Impact
Load balancing in MoE models involves navigating fundamental tensions. Perfect balance, where every expert receives exactly the same number of tokens, may not be optimal for model quality. Different token types genuinely require different processing, and some semantic categories may be more prevalent in the data than others. Forcing exact uniformity could degrade performance by preventing natural specialization patterns.
The metrics we've explored also have limitations. Token fractions measured per-batch can fluctuate substantially; an expert might appear underutilized in one batch but be essential for a specific topic that appears in the next. Longer-term averaging provides more stable estimates but responds slowly to genuine shifts in expert utility. Entropy-based metrics capture the router's confidence distribution but don't distinguish between an expert that's appropriately confident on relevant tokens versus one that's inappropriately dominant.
Despite these limitations, load balancing is critical for practical MoE deployment. Without it, the promised efficiency gains of sparse computation evaporate as models collapse to effectively dense computation through a small subset of experts. The training instabilities from severe imbalance, such as gradient variance, capacity oscillation, and convergence failures, make unbalanced MoE models difficult to train at scale.
The next two chapters address load balancing through loss function design. Auxiliary balancing losses add a penalty term that encourages the router to spread tokens across experts. Router z-loss specifically targets the softmax's tendency to sharpen into collapse-prone distributions. Together with capacity constraints, these mechanisms enable stable training of MoE models with balanced expert utilization.
Summary
Load balancing ensures that all experts in a Mixture of Experts model receive meaningful training signal and contribute to model predictions. Without explicit balancing mechanisms, MoE training exhibits rich-get-richer dynamics where initially preferred experts attract more tokens, receive more gradients, and become even more preferred; this positive feedback loop leads to expert collapse.
Key load balance metrics include the load imbalance factor (measuring deviation from uniform allocation), coefficient of variation (quantifying spread in token fractions), and router entropy (capturing the concentration of routing probability mass). These metrics enable early detection of imbalance before it degrades model quality or training stability.
The consequences of imbalance extend beyond wasted model capacity. In distributed training with expert parallelism, load imbalance creates stragglers that destroy parallel efficiency. A load imbalance factor of 2 halves throughput, and collapsed models may run at only $1/E$ efficiency despite having $E$ experts.
Capacity constraints provide one mechanism for limiting imbalance by rejecting tokens that would overflow an expert's quota. However, this approach drops tokens rather than preventing the router from developing imbalanced preferences in the first place. The auxiliary balancing loss and router z-loss covered in the next chapters address this by shaping the training objective itself.