Learn how top-K routing selects experts in MoE architectures. Understand top-1 vs top-2 trade-offs, implementation details, and weighted output combination.

Top-K Routing
In the previous chapter, we explored gating networks that compute a probability distribution over all experts for each input token. Mixture of Experts remains computationally tractable because each token only needs to be processed by the K experts with the highest routing scores. This selective activation allows MoE to scale to trillions of parameters while maintaining low per-token computational costs.
Top-K routing answers a fundamental question: given a token and its routing scores across all experts, which experts should actually process that token, and how should we combine their outputs? The choice of K affects model capacity, stability, and inference efficiency. This chapter explains how top-K routing works, why different values of K lead to different trade-offs, and how to implement the routing and combination process correctly.
Top-1 Routing
The simplest form of expert selection routes each token to exactly one expert, the one with the highest gating score. This approach, called top-1 routing or hard routing, maximizes computational efficiency by ensuring each token passes through only a single expert network. This approach is simple and increases model capacity without raising per-token computation costs. When each token visits only one expert, we can scale the number of experts arbitrarily without changing the computational cost for processing any individual token.
To understand how top-1 routing works, we will walk through the process step by step. First, we compute the routing scores (logits) for all experts by projecting the token representation through the router's learnable weights $W_r$:

$$z = W_r x$$

where:
- $z \in \mathbb{R}^N$: the vector of routing logits (scores) for all $N$ experts
- $W_r \in \mathbb{R}^{N \times d}$: the router weight matrix
- $x \in \mathbb{R}^d$: the input token representation vector
This linear transformation serves as a compatibility function. Each row of the weight matrix can be thought of as a "query vector" for one expert, and the dot product between this query and the token representation measures how well-suited that expert is for processing this particular token. Higher scores indicate stronger affinity between the token and the expert.
Then, top-1 routing selects the single expert with the highest score:

$$i^* = \operatorname{argmax}_i \, z_i$$

where:
- $i^*$: the index of the selected expert
- $\operatorname{argmax}$: the operation that finds the index maximizing the value
- $z_i$: the routing logit (score) for the $i$-th expert

The argmax operation identifies the expert with the highest score. This expert then processes the token.
The output for that token becomes simply the output of the selected expert:

$$y = E_{i^*}(x)$$

where:
- $y$: the output vector for the token
- $E_{i^*}$: the feed-forward network function of the selected expert
- $x$: the input token representation vector
This is efficient. If we have 64 experts and use top-1 routing, each token uses exactly 1/64th of the expert capacity. A model with 64 experts, each containing 1 billion parameters in their feed-forward networks, would have 64 billion expert parameters total but only activate 1 billion of them for any given token. Separating total capacity from per-token costs allows MoE architectures to scale efficiently.
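To make the arithmetic concrete, here is a quick back-of-the-envelope calculation in Python using the numbers from the paragraph above:

```python
num_experts = 64
params_per_expert = 1_000_000_000      # 1B parameters in each expert's feed-forward network

total_expert_params = num_experts * params_per_expert   # parameters stored in the layer
active_expert_params = 1 * params_per_expert            # top-1: only one expert runs per token

print(f"Total expert parameters:   {total_expert_params:,}")
print(f"Active per token (top-1):  {active_expert_params:,}")
print(f"Fraction active per token: {active_expert_params / total_expert_params:.2%}")  # 1.56%
```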
However, top-1 routing introduces a significant challenge: the argmax operation is not differentiable. How do gradients flow back through a hard selection? The standard solution uses the straight-through estimator. During the forward pass, we use the hard selection, meaning the token genuinely goes through only one expert. During the backward pass, we pretend the selection was a soft weighting by the softmax probabilities, allowing gradients to flow to all experts in proportion to their routing probabilities. This creates a mismatch between forward and backward computations but works surprisingly well in practice. In effect, gradients from the soft probabilities update experts according to how likely they were to be selected, even though the forward pass made a hard decision.
The straight-through estimator is a technique for backpropagating through discrete operations: the discrete value is used in the forward pass, but the operation is treated as continuous, typically via the softmax probabilities, in the backward pass.
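Here is a minimal sketch of one common way to write this trick in PyTorch; the function name and exact formulation are illustrative rather than taken from any particular codebase:

```python
import torch
import torch.nn.functional as F

def straight_through_top1(logits: torch.Tensor) -> torch.Tensor:
    """Return a one-hot routing mask whose gradient flows through the softmax probabilities."""
    probs = torch.softmax(logits, dim=-1)                          # soft distribution over experts
    index = probs.argmax(dim=-1)                                   # hard top-1 choice (no gradient)
    hard = F.one_hot(index, num_classes=logits.shape[-1]).to(logits.dtype)
    # Forward pass: exactly the hard one-hot mask (probs - probs.detach() is zero).
    # Backward pass: the detached term has no gradient, so gradients follow the soft probs.
    return hard + probs - probs.detach()

logits = torch.randn(4, 8, requires_grad=True)     # 4 tokens, 8 experts
mask = straight_through_top1(logits)               # one-hot rows in the forward pass
per_expert_value = torch.randn(8)                  # stand-in for per-expert contributions
(mask * per_expert_value).sum().backward()
print(logits.grad)                                 # dense gradients despite the hard selection
```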
Top-1 routing has another characteristic that can be either a feature or a bug depending on your perspective: it forces complete specialization. Each token must commit fully to one expert's computation, with no blending of perspectives. This can lead to sharper expert specialization but also means the model cannot hedge its bets when a token might benefit from multiple experts' knowledge. Consider a token that sits at the boundary between two semantic categories. With top-1 routing, the model must make a definitive choice, potentially losing information that a second expert could have contributed.
Top-2 Routing
The most common choice in modern MoE architectures is top-2 routing, which selects the two experts with the highest routing scores and combines their outputs. This change from top-1 to top-2 significantly affects model behavior and training. By allowing each token to benefit from two different experts, we introduce a form of ensemble learning at the token level, where multiple specialized perspectives contribute to the final representation.
With top-2 routing, given routing logits $z$, we select the indices $i_1$ and $i_2$ corresponding to the two highest values. The process begins identically to top-1 routing: we compute the same routing logits using the same linear transformation. The difference lies in what we do with these logits. Instead of selecting just the maximum, we identify both the highest and second-highest scoring experts.
The gating weights for these experts are computed by applying softmax only over the selected experts:

$$w_1 = \frac{e^{z_{i_1}}}{e^{z_{i_1}} + e^{z_{i_2}}}, \qquad w_2 = \frac{e^{z_{i_2}}}{e^{z_{i_1}} + e^{z_{i_2}}}$$

where:
- $w_1, w_2$: the normalized gating weights for the first and second selected experts
- $z_{i_1}, z_{i_2}$: the raw routing logits for the selected experts
- $e^{(\cdot)}$: the exponential function, which ensures all weights are positive and amplifies differences between values
- $e^{z_{i_1}} + e^{z_{i_2}}$: the sum of exponentials, serving as a normalizing constant to ensure the weights sum to 1
Notice that we apply softmax only over the two selected logits, not over all expert logits. This is a crucial design choice. If we had applied softmax over all experts first and then taken the top-2 probabilities, those probabilities might not sum to 1, because the mass would be distributed across all experts. By computing softmax over just the selected pair, we ensure that the two weights always sum exactly to 1, providing a proper convex combination of the two expert outputs.
Note that $w_1 + w_2 = 1$. The final output combines both expert outputs weighted by these normalized scores:

$$y = w_1 \, E_{i_1}(x) + w_2 \, E_{i_2}(x)$$

where:
- $y$: the combined output vector for the token
- $w_1, w_2$: the normalized scalar weights for the two experts
- $E_{i_1}, E_{i_2}$: the functions computed by the selected experts
- $x$: the input token representation
This weighted combination means that the final representation is a blend of two expert perspectives. If expert $i_1$ had a much higher routing score than expert $i_2$, then $w_1$ will be close to 1 and $w_2$ will be close to 0, making the output dominated by the first expert. Conversely, if both experts had similar scores, the weights will be closer to 0.5 each, giving both experts roughly equal influence. This adaptive weighting allows the model to smoothly interpolate between relying on a single dominant expert and equally blending two perspectives.
Why does top-2 work better than top-1 in many settings? Several factors contribute:
The first is gradient flow. With two active experts per token, gradients reach twice as many expert parameters during each training step. This improves training efficiency and helps experts learn faster, particularly in the early stages of training when routing decisions are noisy. During early training, the router has not yet learned which experts are best for which tokens, so routing decisions are essentially random. With top-2 routing, more experts receive gradient signal during this critical period, helping the entire expert ensemble learn meaningful specializations more quickly.
The second is representation flexibility. Tokens often don't fit neatly into single categories. A token representing a technical term in a legal document might benefit from both a "technical/scientific" expert and a "legal language" expert. Top-2 routing allows this blending. Language is ambiguous, so forcing tokens into one category can be restrictive. The ability to combine two expert perspectives provides a richer representational palette.
The third is training stability. When only one expert processes each token, small changes in routing can cause dramatic shifts in which expert learns from which data. With two experts active, there's inherent smoothing that stabilizes training, as we'll see when we discuss load balancing in the next chapter. If the router makes a slightly different decision, the token might still visit at least one of the same experts, providing continuity in which experts receive which training signal.
The computational cost doubles compared to top-1, since each token now passes through two expert networks instead of one. For a model with 64 experts where each expert's feed-forward network has 1 billion parameters, top-2 routing activates 2 billion parameters per token instead of 1 billion. This is still a dramatic reduction from the 64 billion total expert parameters. The gains in stability and model quality are usually worth the doubled computation.
Selecting K: Trade-offs
Choosing $K$ involves several trade-offs. Let's examine what happens as $K$ increases from 1 toward the total number of experts $N$. Understanding these trade-offs helps you choose an architecture based on performance and capacity needs.
Computational cost scales linearly with $K$. If each expert's feed-forward network requires $F$ FLOPs per token, and you select $K$ experts, the expert computation costs $K \cdot F$ FLOPs per token. This is the most direct trade-off: higher $K$ means more computation. For a fixed computational budget, choosing a larger $K$ means you can afford fewer total experts, which reduces the model's total capacity advantage over dense architectures.
Model capacity utilization increases with $K$. With $K = 1$, each token sees $1/N$ of your expert capacity. With $K = 2$, it sees $2/N$. At $K = N$ (all experts), you've effectively built a very expensive dense model with $N$ times the feed-forward computation. The power of MoE comes from keeping $K \ll N$. The ratio $K/N$ represents what fraction of the expert capacity any given token can access. Keeping this ratio small is what enables the favorable scaling properties of MoE architectures.
Training stability generally improves with moderate $K$. The original Shazeer et al. MoE work used $K = 4$. The GShard paper found $K = 2$ worked well. Switch Transformer pushed to $K = 1$ but required careful auxiliary losses to maintain stability. Lower values of $K$ are more prone to training instability because routing decisions have larger effects. When $K$ is small, the decision of which expert processes a token becomes more consequential. A mistake in routing has larger downstream effects, and the router receives noisier gradient signals because fewer experts are active for each token.
Specialization sharpness decreases with higher $K$. When $K$ is small, experts must specialize to win the routing competition. When $K$ is large, experts can be more generalist since tokens see many of them anyway. This affects what kind of knowledge structure emerges in the expert networks. With aggressive specialization (low $K$), experts may develop distinct "personalities" focused on narrow domains. With milder specialization (higher $K$), experts may develop more overlapping capabilities.
Here's a summary of common choices:
| K Value | Use Case | Trade-off |
|---|---|---|
| K = 1 | Maximum efficiency (Switch Transformer) | Requires careful balancing; training can be unstable |
| K = 2 | Standard choice (GShard, Mixtral) | Good balance of efficiency and stability |
| K = 4 | Early MoE work | More computation but smoother training |
| K ≥ 8 | Rare in practice | Diminishing returns; approaches dense computation |
The Mixtral model uses $K = 2$ with $N = 8$ experts, meaning each token activates 2 out of 8 experts, or 25% of expert capacity. This keeps inference costs comparable to a dense model whose feed-forward layers are roughly twice the size of a single expert, while the full parameters of all eight experts remain available. This configuration works well, increasing capacity while keeping computation manageable.
Routing Implementation
Implementing top-K routing correctly requires handling several details: selecting the top K indices, computing normalized weights, dealing with numerical stability, and enabling gradient flow. Let's build this step by step. A robust implementation must be efficient and handle training edge cases.
The core routing operation takes router logits and produces both the selected expert indices and their corresponding weights:
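Here is a minimal sketch of what such a function can look like in PyTorch; the name `top_k_routing` and its exact signature are illustrative:

```python
import torch

def top_k_routing(router_logits: torch.Tensor, k: int = 2):
    """Select the top-k experts for each token and compute their combination weights.

    router_logits: (num_tokens, num_experts) raw scores from the router.
    Returns (indices, weights), each of shape (num_tokens, k).
    """
    # Find the k largest logits per token along the expert dimension.
    top_logits, top_indices = torch.topk(router_logits, k, dim=-1)
    # Softmax over only the selected logits so the k weights sum to 1 for each token.
    top_weights = torch.softmax(top_logits, dim=-1)
    return top_indices, top_weights
```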
The torch.topk function efficiently finds the K largest values along the expert dimension. We then apply softmax only to these selected logits, ensuring the weights sum to 1 across the K active experts. This two-step process, first selecting then normalizing, is essential for obtaining proper convex combination weights.
Let's verify this works as expected:
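A quick check along these lines, reusing the `top_k_routing` sketch above with made-up logits:

```python
torch.manual_seed(0)
router_logits = torch.randn(3, 8)                  # 3 tokens, 8 experts

indices, weights = top_k_routing(router_logits, k=2)
print("indices:\n", indices)                       # shape (3, 2): two expert ids per token
print("weights:\n", weights)                       # shape (3, 2): normalized per token
print("weight sums:", weights.sum(dim=-1))         # each entry is 1.0
```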
The output shows that for each token, we get exactly K = 2 expert indices and their corresponding normalized weights that sum to 1.
For top-1 routing specifically, we often want to preserve the routing probability for use in auxiliary losses while still making a hard selection. The challenge is that we need the discrete selection for the forward pass but the full probability distribution to compute load balancing penalties:
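A sketch of a top-1 variant that returns the full distribution alongside the hard choice (again, the names are illustrative):

```python
def top_1_routing(router_logits: torch.Tensor):
    """Hard top-1 selection that also returns the full routing distribution.

    Returns (index, weight, probs): the chosen expert per token, its routing
    probability (often used to scale the expert output), and the full softmax
    distribution needed later for auxiliary load-balancing losses.
    """
    probs = torch.softmax(router_logits, dim=-1)    # (num_tokens, num_experts)
    weight, index = probs.max(dim=-1)               # both have shape (num_tokens,)
    return index, weight, probs
```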
We keep the full routing probability distribution because we'll need it for load balancing losses, which we cover in the next chapter. This design pattern, where the routing function returns auxiliary information beyond just the selected experts, is common in practical MoE implementations.
Combining Expert Outputs
Once we've selected experts and computed their weights, we need to actually route tokens through the selected experts and combine their outputs. This is where implementation complexity increases, because we're dealing with different experts for different tokens. Unlike a standard feed-forward layer where all tokens pass through the same parameters, MoE requires dynamically dispatching different tokens to different experts and then gathering and combining the results.
The simplest approach processes tokens individually:
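A sketch of this naive version, assuming `experts` is a plain list of per-expert feed-forward modules and `indices`/`weights` come from the routing function above:

```python
def moe_forward_naive(x, experts, indices, weights):
    """x: (num_tokens, hidden_dim); experts: list of expert modules;
    indices, weights: (num_tokens, k) as returned by top_k_routing."""
    output = torch.zeros_like(x)
    num_tokens, k = indices.shape
    for t in range(num_tokens):                     # one token at a time (slow!)
        for slot in range(k):                       # each of the token's k selected experts
            expert = experts[int(indices[t, slot])]
            output[t] += weights[t, slot] * expert(x[t])
    return output
```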
This explicit loop makes the logic clear: for each token, iterate through its K selected experts, compute each expert's output, and add them together weighted by the routing weights. However, this approach is extremely slow because it processes tokens sequentially and can't leverage GPU parallelism. Modern GPUs are designed to process thousands of operations in parallel, and this nested loop structure forces sequential execution that drastically underutilizes the hardware.
The efficient approach groups tokens by their assigned experts:
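A sketch of the grouped version, under the same assumptions; instead of iterating over tokens, it iterates over routing slots and experts:

```python
def moe_forward_grouped(x, experts, indices, weights):
    """Batched dispatch: group tokens by expert so each expert runs on a full batch."""
    output = torch.zeros_like(x)
    num_tokens, k = indices.shape
    for slot in range(k):                           # outer loop: the k routing slots
        for e in range(len(experts)):               # inner loop: every expert
            mask = indices[:, slot] == e            # tokens whose slot-th choice is expert e
            if mask.any():
                expert_out = experts[e](x[mask])    # one batched forward pass
                output[mask] += weights[mask, slot].unsqueeze(-1) * expert_out
    return output
```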
This version groups all tokens assigned to each expert and processes them in a single batched forward pass through that expert. The key insight is that even though different tokens go to different experts, all tokens assigned to the same expert can be processed together. The outer loop over expert positions and the inner loop over experts means we make at most $K \times N$ batched forward passes through expert networks, but each pass processes all relevant tokens in parallel. This batching strategy transforms the problem from per-token sequential processing to per-expert parallel processing.
Let's create a complete working example:
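Below is a compact sketch of a full MoE layer that puts the router, the experts, and the grouped dispatch together; the class name `TopKMoELayer` and the GELU-based expert design are illustrative choices rather than a reference implementation:

```python
import torch
import torch.nn as nn

class Expert(nn.Module):
    """A standard two-layer feed-forward network; each expert has its own copy."""
    def __init__(self, hidden_dim: int, intermediate_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim, intermediate_dim),
            nn.GELU(),
            nn.Linear(intermediate_dim, hidden_dim),
        )

    def forward(self, x):
        return self.net(x)


class TopKMoELayer(nn.Module):
    """Sparse MoE layer: route each token to its top-k experts and mix their outputs."""
    def __init__(self, hidden_dim: int, intermediate_dim: int, num_experts: int, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(hidden_dim, num_experts, bias=False)   # the router weights W_r
        self.experts = nn.ModuleList(
            [Expert(hidden_dim, intermediate_dim) for _ in range(num_experts)]
        )

    def forward(self, x):
        batch, seq_len, hidden_dim = x.shape
        tokens = x.reshape(-1, hidden_dim)                   # flatten to (num_tokens, hidden_dim)

        router_logits = self.router(tokens)                  # (num_tokens, num_experts)
        top_logits, top_indices = torch.topk(router_logits, self.k, dim=-1)
        top_weights = torch.softmax(top_logits, dim=-1)      # k weights per token, summing to 1

        output = torch.zeros_like(tokens)
        for slot in range(self.k):                           # dispatch grouped by expert
            for e, expert in enumerate(self.experts):
                mask = top_indices[:, slot] == e
                if mask.any():
                    output[mask] += top_weights[mask, slot].unsqueeze(-1) * expert(tokens[mask])

        return output.reshape(batch, seq_len, hidden_dim)
```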
Let's test the complete layer:
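A small test along these lines, with made-up dimensions:

```python
torch.manual_seed(0)
layer = TopKMoELayer(hidden_dim=64, intermediate_dim=256, num_experts=8, k=2)

x = torch.randn(2, 10, 64)                          # (batch=2, seq_len=10, hidden_dim=64)
y = layer(x)
print("input shape: ", tuple(x.shape))              # (2, 10, 64)
print("output shape:", tuple(y.shape))              # (2, 10, 64)

expert_params = sum(p.numel() for p in layer.experts.parameters())
print(f"total expert parameters:        {expert_params:,}")
print(f"active expert params per token: {2 * expert_params // 8:,}")   # k = 2 of 8 experts
```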
The output confirms that our MoE layer maintains the expected dimensions while activating only K experts per token. The total expert parameters scale with the number of experts, but each token only uses K experts worth of computation. Separating total parameters from per-token computation is the main advantage of MoE architectures.
Worked Example
Let's trace through a concrete example of top-2 routing to solidify understanding. Consider a small MoE layer with 4 experts processing a single token. This example shows how routing scores translate into expert selection and weight assignment.
The token has hidden representation $x$ with dimension 4, and the router weight matrix $W_r$ maps it to logits for the 4 experts. For concreteness, suppose the routing logits come out to:

$$z = W_r x = (0.512,\ 0.310,\ 0.250,\ 0.520)$$

Now we select the top-2 experts:

$$i_1 = \operatorname{argmax}_i z_i = 3, \qquad i_2 = 0$$

We see that experts 3 ($z_3 = 0.520$) and 0 ($z_0 = 0.512$) have the highest logits. These two experts "won" the routing competition for this token. Notice that the margin between the winners and the losers (experts 1 and 2) is relatively small in this case, which means slightly different input representations could have resulted in different routing decisions. Now we compute normalized weights using softmax over only these two selected logits:

$$w_1 = \frac{e^{0.520}}{e^{0.520} + e^{0.512}} \approx 0.502, \qquad w_2 = \frac{e^{0.512}}{e^{0.520} + e^{0.512}} \approx 0.498$$
The routing has decided that expert 3 should contribute roughly 50.2% and expert 0 should contribute roughly 49.8% to this token's output. Experts 1 and 2 are completely bypassed; they don't compute anything for this token. Because the two selected logits were very close in value, the resulting weights are also close to equal. If one expert had dominated with a much higher logit, its weight would be closer to 1.0.
If expert 3 produces output $E_3(x)$ and expert 0 produces output $E_0(x)$, the final output would be:

$$y = 0.502 \cdot E_3(x) + 0.498 \cdot E_0(x)$$

where:
- $y$: the final combined output vector
- $E_3(x), E_0(x)$: the output vectors computed by expert 3 and expert 0
This nearly equal weighting means both experts contribute almost equally to the final representation. In contrast, if the logits had been more differentiated (say, 0.9 for expert 3 and 0.2 for expert 0), the softmax would produce weights closer to 0.67 and 0.33, giving expert 3 twice as much influence.
Key Parameters
The key parameters for the Top-K Routing implementation are:
- num_experts: Total number of experts in the MoE layer. This parameter determines the potential capacity of the MoE layer. More experts means more total parameters and potentially more specialized knowledge, but also more memory requirements and more complexity in managing load balance.
- k: Number of experts to select for each token (typically 1 or 2). This controls the trade-off between computational cost and model expressiveness. Lower values of K yield better efficiency, while higher values provide smoother training and richer token representations.
- hidden_dim: Dimensionality of the input and output representations. This must match the hidden dimension of the surrounding transformer architecture. The router uses this dimension to compute compatibility scores between tokens and experts.
- intermediate_dim: Dimensionality of the expert feed-forward networks. Larger intermediate dimensions increase the capacity of each individual expert but also increase the per-expert computational cost.
Visualizing Routing Patterns
To understand how tokens get distributed across experts, let's visualize the routing decisions for a sequence:
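One way to produce such a heatmap with matplotlib, reusing the `top_k_routing` sketch from above on random logits:

```python
import matplotlib.pyplot as plt

torch.manual_seed(0)
num_tokens, num_experts, k = 16, 8, 2
router_logits = torch.randn(num_tokens, num_experts)
indices, weights = top_k_routing(router_logits, k=k)

# Dense (tokens x experts) matrix of routing weights; unselected experts keep weight 0.
routing_matrix = torch.zeros(num_tokens, num_experts)
routing_matrix.scatter_(1, indices, weights)

plt.figure(figsize=(6, 5))
plt.imshow(routing_matrix.numpy(), cmap="Blues", aspect="auto")
plt.colorbar(label="Routing weight")
plt.xlabel("Expert")
plt.ylabel("Token position")
plt.title("Top-2 routing weights per token")
plt.show()
```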
The heatmap reveals the sparsity pattern of top-K routing. Each token (row) activates exactly 2 experts (colored cells), with the color intensity indicating the routing weight. White cells represent experts that receive zero weight for that token. This sparse activation pattern is what makes MoE computationally efficient: instead of all 8 experts processing every token, only 2 do. Looking at the pattern across tokens, we can also observe how different tokens prefer different expert combinations, reflecting the router's learned notion of which experts are appropriate for which input representations.
Let's also examine how tokens are distributed across experts:
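Continuing the same example, a simple bar chart of per-expert token counts:

```python
# Count how many (token, slot) assignments each expert received.
counts = torch.bincount(indices.flatten(), minlength=num_experts)

plt.figure(figsize=(6, 4))
plt.bar(range(num_experts), counts.numpy(), color="steelblue")
plt.axhline(num_tokens * k / num_experts, color="red", linestyle="--",
            label="perfectly uniform load")
plt.xlabel("Expert")
plt.ylabel("Number of token assignments")
plt.title("Token distribution across experts")
plt.legend()
plt.show()
```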
This distribution shows a common challenge with top-K routing: without explicit encouragement, tokens may not distribute evenly across experts. Some experts receive many tokens while others receive few. This load imbalance is a critical issue that we'll address in the next chapter on load balancing. The red dashed line shows what perfectly uniform distribution would look like; deviations from this line represent inefficiency in how we're using our expert capacity.
Limitations and Impact
Top-K routing enables the core promise of MoE architectures: scaling model capacity without proportionally scaling computation. However, this selective activation introduces several challenges that you must address.
The load imbalance problem. As our visualization showed, naive top-K routing can lead to severe load imbalance where some experts are overwhelmed with tokens while others sit idle. In the extreme case, a phenomenon called "expert collapse" can occur where the router learns to send all tokens to just one or two experts, effectively wasting the capacity of the other experts. This is particularly problematic during distributed training, where each expert typically resides on a different accelerator. If expert 0 receives 50% of tokens while expert 7 receives 5%, you've created a massive bottleneck. The next chapter introduces auxiliary losses that encourage balanced routing.
Discrete selection and gradient approximation. The top-K selection is fundamentally discrete, meaning we use approximations like the straight-through estimator to enable gradient flow. While these approximations work well in practice, they create a mismatch between the forward pass (hard selection) and backward pass (soft gradient flow). This can occasionally cause training instabilities, particularly with routing where the approximation is most severe.
Token dropping under capacity constraints. In distributed settings, each expert can only process a limited number of tokens per batch due to memory constraints. If the router sends more tokens to an expert than it can handle, some tokens must be dropped. Their representations pass through unchanged or via a simple residual connection. Dropped tokens miss the expert processing entirely, which can degrade model quality. This creates tension between model quality (wanting to route tokens optimally) and computational constraints (needing balanced, bounded expert loads).
Inference complexity. During inference, top-K routing requires dynamic batching where tokens are grouped by their selected experts. This is more complex than standard dense model inference and can be harder to optimize. When $K > 1$, you also need to gather and combine outputs from multiple experts per token, adding overhead.
Despite these challenges, top-K routing has proven remarkably effective. The Switch Transformer demonstrated that $K = 1$ can work with proper auxiliary losses. Mixtral showed that $K = 2$ with only 8 experts achieves excellent performance. The key insight is that the routing mechanism itself is not the whole story. It must be combined with load balancing techniques to realize the full potential of sparse expert architectures. We'll explore these techniques in detail in the upcoming chapters on load balancing and auxiliary losses.
Summary
Top-K routing makes Mixture of Experts computationally practical. Rather than using all experts for every token, we select only the K experts with the highest routing scores and ignore the rest.
The choice of $K$ involves fundamental trade-offs. Top-1 routing maximizes efficiency (only one expert per token) but can be unstable during training and requires careful load balancing. Top-2 routing, used by models like Mixtral, doubles the computation but provides smoother training dynamics and allows tokens to benefit from multiple experts' perspectives. Values of $K$ beyond 2 show diminishing returns and approach dense model computation.
The implementation involves two key steps: selecting the K highest-scoring experts from the routing logits, and combining expert outputs using normalized weights computed by applying softmax only over the selected logits. Efficient implementations group tokens by their assigned experts to enable batched processing through each expert network.
The main limitation of top-K routing is load imbalance. Without additional mechanisms, some experts may receive far more tokens than others, wasting capacity and creating computational bottlenecks. The next chapter addresses this directly with load balancing techniques and auxiliary losses that encourage more uniform token distribution across experts.