Gating Networks: Router Architecture in Mixture of Experts

Michael Brenndoerfer · Updated January 3, 2026 · 41 min read

Explore gating networks in MoE architectures. Learn router design, softmax gating, Top-K selection, training dynamics, and emergent specialization patterns.


Gating Networks

In the previous chapter, we explored how expert networks provide specialized computational pathways within a Mixture of Experts (MoE) architecture. But having multiple experts raises an immediate question: how does the model decide which expert to use for each input? This decision falls to the gating network, also called the router, which serves as the traffic controller directing tokens to appropriate experts.

The router might seem like a minor component (often just a single linear layer), but its design profoundly impacts everything from model quality to training stability. A poorly designed router can cause expert collapse (where only one expert gets used), severe load imbalance, or failure to learn meaningful specialization. Getting the router right is one of the central challenges in MoE systems.

This chapter examines how routers work: their architecture, how they compute routing scores, how they learn through backpropagation, and what behavioral patterns emerge during training. Understanding the router provides the foundation for the routing strategies and balancing techniques we'll explore in upcoming chapters.

Router Architecture

The gating network in modern MoE transformers is remarkably simple: a linear projection from the token representation to a score for each expert. Despite this simplicity, the router must learn to make complex decisions about which experts are best suited for each input. To understand why this seemingly minimal architecture succeeds, we need to examine both its mathematical structure and the reasoning behind key design choices.

The Linear Router

At the heart of every MoE layer lies a deceptively simple question: given a token's representation, which expert should handle it? The standard router architecture answers this question with a single linear layer without bias. Before we write down the formula, consider what we need: a way to transform each token's high-dimensional representation into a set of scores, one for each expert. These scores should capture how well-suited each expert is for processing that particular token. A linear projection accomplishes exactly this:

$$G(x) = x W_g$$

where:

  • $G(x)$: the vector of raw routing scores (logits) for the $n$ experts
  • $x$: the input token representation vector of dimension $d$
  • $W_g$: the learnable router weight matrix of dimension $d \times n$
  • $n$: the number of experts

This formula performs a matrix multiplication that projects the $d$-dimensional token representation onto $n$ scores, one for each expert. The result is a vector of routing logits that will later be converted to probabilities.
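To make the projection concrete, here is a minimal sketch in PyTorch. The sizes and names (`d_model`, `num_experts`) are illustrative, not taken from any particular model.

Code
import torch
import torch.nn as nn

d_model, num_experts = 16, 4                         # illustrative sizes
gate = nn.Linear(d_model, num_experts, bias=False)   # W_g as a bias-free linear layer

x = torch.randn(2, 3, d_model)                       # (batch, seq, d_model) token representations
logits = gate(x)                                     # G(x) = x W_g
print(logits.shape)                                  # torch.Size([2, 3, 4]): one score per expert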

This design choice, a simple linear projection, might seem limiting at first glance. After all, modern deep learning has shown the power of multi-layer networks for capturing complex patterns. Why not use a more expressive multi-layer network for routing? Several factors favor the linear design:

  • Computational efficiency: The router runs for every token, so it must be fast. A linear layer adds minimal overhead compared to the expert computations themselves.
  • Sufficient expressivity: The input representation $x$ already encodes rich semantic information from prior layers. By the time a token reaches the MoE layer, attention mechanisms have already captured contextual relationships. The router just needs to project this pre-processed information to expert preferences.
  • Training stability: Simpler architectures are easier to train and less prone to the instabilities that plague MoE systems. Adding nonlinearities to the router can create optimization challenges.
  • Interpretability: Linear routers allow clearer analysis of what input features influence routing decisions, since the mapping from input dimensions to expert scores is direct.

Input Representation

What exactly feeds into the router? Understanding the input is crucial because it determines what information the router can use to make its decisions. In a transformer's MoE layer, the router typically receives the same input that would go to a standard FFN block. This is the output of the attention sublayer (plus residual connection and normalization):

$$x = \text{LayerNorm}(\text{Attention}(h) + h)$$

where:

  • $x$: the input vector to the router
  • $h$: the input hidden state from the previous layer
  • $\text{Attention}(h)$: the output of the self-attention mechanism
  • $\text{LayerNorm}$: the normalization operation that stabilizes the representation

This formula shows that the router's input incorporates both the attention output and the original hidden state through the residual connection, all normalized for stable computation. This representation already captures contextual information about the token's role in the sequence. The attention mechanism has allowed the token to gather relevant information from other positions, so the router is not making decisions based on the token in isolation. Instead, the router's job is to translate this contextual representation into expert preferences.
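As a minimal sketch of how this input is formed (using PyTorch's built-in attention module purely for illustration, not the exact sublayer of any particular model):

Code
import torch
import torch.nn as nn

d_model, n_heads = 16, 4
attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
norm = nn.LayerNorm(d_model)

h = torch.randn(2, 5, d_model)      # hidden states entering the sublayer
attn_out, _ = attn(h, h, h)         # self-attention output
x = norm(attn_out + h)              # residual connection + normalization
# x is what both the router and the selected experts receive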

Some architectures experiment with alternative inputs. For instance, some routers use the pre-normalization representation or concatenate multiple representations to provide richer information. However, the standard approach (using the same input as the FFN) remains dominant due to its simplicity and effectiveness. Using the same input also ensures consistency: the expert receives the same representation that was used to decide to route to it.

Router Weight Matrix

Understanding the router weight matrix $W_g$, the router's only learnable component, provides intuition about how routing decisions are made. Each column $w_i$ of the matrix represents a "prototype" for expert $i$. You can think of these prototypes as characteristic patterns that expert $i$ is designed to handle. The routing score for expert $i$ is essentially the dot product between the input and that expert's prototype:

$$G(x)_i = x \cdot w_i$$

where:

  • $G(x)_i$: the routing score (logit) for expert $i$
  • $x$: the input token representation
  • $w_i$: the column vector from the weight matrix representing expert $i$'s prototype

This framing as a dot product helps build intuition about how routing works. Recall that the dot product measures similarity between vectors: it is large when vectors point in similar directions and small (or negative) when they point in different directions. Therefore, experts with prototypes that align well with the input representation receive higher scores. Conversely, experts whose prototypes point away from the input in representation space receive lower scores.
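A quick numerical illustration of the prototype view (with made-up values): computing the full projection $xW_g$ gives exactly the per-expert dot products $x \cdot w_i$.

Code
import torch

d_model, num_experts = 8, 4
W_g = torch.randn(d_model, num_experts)   # columns w_i are the expert "prototypes"
x = torch.randn(d_model)                  # one token representation

scores_matmul = x @ W_g                                                  # G(x) = x W_g
scores_dots = torch.stack([x @ W_g[:, i] for i in range(num_experts)])  # x · w_i per expert

print(torch.allclose(scores_matmul, scores_dots))  # True: same computation
print(scores_matmul)                               # prototypes aligned with x score higher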

During training, the router learns to position these prototypes in representation space to capture meaningful distinctions between token types. For example, one prototype might align with representations of scientific terminology, while another aligns with conversational language. The learning process adjusts these prototypes so that different types of tokens are routed to experts that specialize in handling them.

The router parameters are typically initialized with standard techniques. As we covered in Part VII, proper initialization prevents the router from starting with extreme biases toward particular experts, ensuring that all experts have a chance to receive tokens during early training.

Routing Score Computation

Raw routing logits from the linear projection must be converted into routing weights that determine how much each expert contributes to the output. The logits can be any real number, positive or negative, and they don't have any inherent probabilistic meaning. This transformation from arbitrary scores to meaningful weights happens through a softmax function, producing a probability distribution over experts that we can interpret and use.

Softmax Gating

The softmax function is the standard tool for converting a vector of arbitrary real numbers into a probability distribution. In the context of routing, it transforms our expert logits into normalized routing weights:

$$p(e_i|x) = \frac{\exp(G(x)_i)}{\sum_{j=1}^{n} \exp(G(x)_j)}$$

where:

  • $p(e_i|x)$: the probability of selecting expert $i$ for input $x$
  • $G(x)_i$: the raw routing score (logit) for expert $i$
  • $\exp(G(x)_i)$: the exponential function applied to the logit, ensuring the value is positive
  • $\sum_{j=1}^{n} \exp(G(x)_j)$: the sum of exponentials across all $n$ experts, serving as the normalizing constant

Let's see what this formula does. The exponential function serves two purposes: first, it ensures all values are positive (since probabilities cannot be negative); second, it amplifies differences between logits. If one expert has a logit of 3 and another has 1, the exponential makes the first about $e^2 \approx 7.4$ times larger, not just 2 points higher. This amplification makes the router's preferences more decisive.

The denominator serves as a normalizing constant, ensuring all probabilities sum to exactly 1. This normalization is essential because we want the routing weights to represent a proper probability distribution.
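The snippet below illustrates this conversion with arbitrary logits: the probabilities sum to 1, and a 2-point gap between logits becomes roughly a 7.4x gap before normalization.

Code
import torch
import torch.nn.functional as F

logits = torch.tensor([3.0, 1.0, 0.5, -1.0])         # arbitrary routing logits
probs = F.softmax(logits, dim=-1)

print(probs)                                         # expert 0 dominates
print(probs.sum())                                   # tensor(1.)
print(torch.exp(logits[0]) / torch.exp(logits[1]))   # ~7.39 = e^(3-1)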

Out[2]:
[Figure: Raw routing logits for six experts before softmax, and the resulting softmax probabilities showing how exponential amplification makes Expert 2's slight advantage decisive.]

These routing probabilities $p(e_i|x)$ represent the router's "confidence" that expert $i$ is appropriate for input $x$. Higher probabilities indicate stronger preference, and we can compare probabilities across experts to understand the router's relative confidence in each option.

In a soft MoE (which we introduced conceptually in the sparse models chapter), the final output would be a weighted combination of all expert outputs:

$$y = \sum_{i=1}^{n} p(e_i|x) \cdot E_i(x)$$

where:

  • $y$: the combined output of the MoE layer
  • $p(e_i|x)$: the routing probability for expert $i$
  • $E_i(x)$: the output computed by expert $i$ for input $x$

However, this soft combination requires computing all expert outputs, which defeats the purpose of sparsity. If we must run every expert for every token, we gain no computational savings from having multiple specialized experts rather than one large one.

Hard Routing with Top-K Selection

Practical MoE systems use hard routing to achieve the computational benefits of sparsity: only the top-$K$ experts (by routing probability) are actually computed. The next chapter covers Top-K routing in detail, but understanding the key idea is essential for grasping how the router functions in practice.

The process works in four steps:

  1. Compute all routing probabilities via softmax
  2. Select the $K$ highest-scoring experts
  3. Renormalize selected probabilities to sum to 1
  4. Compute only those $K$ experts

For example, with 8 experts and $K=2$, only 2 experts run per token, providing roughly 75% computational savings in the expert portion of the layer. The routing weights for selected experts become:

$$\tilde{p}(e_i|x) = \frac{p(e_i|x)}{\sum_{j \in \text{TopK}} p(e_j|x)}$$

where:

  • $\tilde{p}(e_i|x)$: the renormalized weight for selected expert $i$
  • $p(e_i|x)$: the original softmax probability for expert $i$
  • $\text{TopK}$: the set of indices corresponding to the $K$ experts with highest scores
  • $\sum_{j \in \text{TopK}} p(e_j|x)$: the sum of probabilities for the selected experts, used as a normalizing constant

This renormalization ensures the weighted combination of selected experts produces properly scaled outputs. Without renormalization, if the two selected experts had probabilities 0.3 and 0.2, they would sum to only 0.5, effectively scaling down the output. Renormalization adjusts them to approximately 0.6 and 0.4, ensuring the output magnitude remains appropriate.
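Here is a minimal sketch of Top-K selection and renormalization on a single token (the probabilities are invented for illustration; the `Router` module later in the chapter performs the same operation on batched tensors):

Code
import torch

probs = torch.tensor([0.05, 0.30, 0.20, 0.10, 0.15, 0.08, 0.07, 0.05])  # softmax output
k = 2

topk_probs, topk_idx = torch.topk(probs, k)   # select the K highest-scoring experts
weights = topk_probs / topk_probs.sum()       # renormalize so the selected weights sum to 1

print(topk_idx)   # tensor([1, 2]): only these two experts are computed
print(weights)    # tensor([0.6000, 0.4000]), matching the 0.3/0.2 example above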

Out[3]:
[Figure: Full softmax probabilities across all 8 experts; the top-2 selection with non-selected experts zeroed out; and the renormalized weights for the selected experts summing to 1.0.]

Temperature Scaling

Some implementations add temperature scaling to the softmax to provide additional control over routing behavior:

$$p(e_i|x) = \frac{\exp(G(x)_i / \tau)}{\sum_{j=1}^{n} \exp(G(x)_j / \tau)}$$

where:

  • $p(e_i|x)$: the temperature-scaled probability for expert $i$
  • $G(x)_i$: the raw routing logit
  • $\tau$: the temperature parameter that controls the sharpness of the distribution
  • $\exp(G(x)_i / \tau)$: the exponentiated scaled logit; using $\tau < 1$ amplifies differences, making the distribution sharper

This concept relates to the decoding temperature we covered in Part XVIII, where temperature controls the randomness of text generation. The principle is the same: temperature modulates how peaked or flat the probability distribution is.

Lower temperatures sharpen the distribution, pushing probability mass toward the highest-scoring expert and approaching hard assignment to a single expert. Higher temperatures soften the distribution, spreading probability more evenly and approaching uniform routing where every expert has roughly equal chance of selection.

Out[4]:
[Figure: Routing distributions at temperatures τ = 0.5, 1.0, 2.0, and 5.0. Lower temperatures sharpen the distribution toward hard assignment; higher temperatures flatten it toward uniform routing.]

Temperature provides a way to control the "confidence" of routing decisions across different training phases. Early in training, higher temperatures encourage exploration across experts, preventing the model from committing too strongly to particular routing patterns before it has learned enough. Later, lower temperatures can encourage more decisive routing, allowing the model to leverage the specialization it has developed.
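The effect is easy to reproduce. This sketch applies the temperature-scaled softmax to a fixed set of illustrative logits at several values of $\tau$:

Code
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.5, 0.5, 0.0])   # illustrative routing logits

for tau in [0.5, 1.0, 2.0, 5.0]:
    probs = F.softmax(logits / tau, dim=-1)
    print(f"tau={tau}: {probs.numpy().round(3)}")
# Lower tau concentrates probability on the top expert;
# higher tau flattens the distribution toward uniform (0.25 each).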

Router Training

Training the router presents a unique challenge that distinguishes it from standard neural network components: the router's output involves a discrete selection (which experts to use), but backpropagation requires continuous, differentiable operations. Discrete choices create discontinuities in the loss landscape, making gradients undefined at decision boundaries. How, then, do gradients flow through routing decisions?

End-to-End Gradients

The key insight that enables router training is that although expert selection is discrete, the routing weights are continuous. The model cannot backpropagate through the binary decision of "use expert 3 instead of expert 5," but it can backpropagate through the continuous weights applied to the selected experts:

$$y = \sum_{i \in \text{TopK}} \tilde{p}(e_i|x) \cdot E_i(x)$$

where:

  • $y$: the output of the MoE layer
  • $\tilde{p}(e_i|x)$: the renormalized routing weight for selected expert $i$
  • $E_i(x)$: the output computed by expert $i$
  • $\text{TopK}$: the set of indices for the selected experts

During backpropagation, the loss gradient with respect to router logits comes from two sources:

  1. Through the weights: How changing $\tilde{p}(e_i|x)$ affects the weighted combination of expert outputs
  2. Through the softmax: How the raw logits affect the routing probabilities

For a selected expert $i$, the gradient through the routing weights follows the chain rule:

$$\frac{\partial \mathcal{L}}{\partial G(x)_i} = \frac{\partial \mathcal{L}}{\partial y} \cdot \frac{\partial y}{\partial \tilde{p}(e_i|x)} \cdot \frac{\partial \tilde{p}(e_i|x)}{\partial G(x)_i}$$

where:

  • $\frac{\partial \mathcal{L}}{\partial G(x)_i}$: the gradient of the loss with respect to the router logits
  • $\frac{\partial \mathcal{L}}{\partial y}$: the gradient flowing back from the network output
  • $\frac{\partial y}{\partial \tilde{p}(e_i|x)}$: the sensitivity of the output to the expert's weight
  • $\frac{\partial \tilde{p}(e_i|x)}{\partial G(x)_i}$: the gradient of the softmax (or Top-K) function

This gradient chain has an intuitive interpretation. The first term captures how much the overall loss depends on this MoE layer's output. The second term captures how much changing the weight of a particular expert would change that output (essentially the expert's output vector). The third term captures how sensitive the routing weight is to changes in the raw logit.

This gradient updates the router to assign higher weights to experts that produce better outputs for each input. If an expert produces an output that reduces the loss, the gradient will adjust the router to give that expert a higher weight for similar inputs in the future.
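The sketch below demonstrates this gradient path with toy tensors: a dummy loss is computed from the weighted combination of the selected experts' outputs, and autograd then produces gradients for the router and for the selected experts only. The expert modules here are arbitrary linear maps used purely for illustration.

Code
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
d, n_experts, k = 8, 4, 2
x = torch.randn(d)

gate = nn.Linear(d, n_experts, bias=False)                            # router
experts = nn.ModuleList([nn.Linear(d, d) for _ in range(n_experts)])  # toy "experts"

probs = F.softmax(gate(x), dim=-1)
topk_vals, topk_idx = torch.topk(probs, k)
weights = topk_vals / topk_vals.sum()          # continuous routing weights

# Only the selected experts enter the computation graph
y = sum(w * experts[i](x) for w, i in zip(weights, topk_idx.tolist()))
y.pow(2).mean().backward()                     # dummy loss

print("router received gradient:", gate.weight.grad.abs().sum().item() > 0)
for i, expert in enumerate(experts):
    print(f"expert {i} received gradient: {expert.weight.grad is not None}")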

The Discrete Selection Problem

One subtlety deserves careful attention: gradients only flow through selected experts. If expert $j$ wasn't in the Top-K, there's no direct gradient signal about whether selecting it would have been better. The model never computes $E_j(x)$, so it cannot know how good that output would have been.

This creates a fundamental limitation in router learning. The router learns to improve its choices among selected experts based on their actual outputs, but has limited ability to discover that an unselected expert might be better. Imagine a scenario where expert 7 would be perfect for certain mathematical tokens, but the router never selects it for those tokens during early training. Without ever seeing expert 7's output on mathematical content, the gradient signal cannot tell the router to adjust toward selecting expert 7.

MoE systems can get stuck in suboptimal routing patterns. An expert might become permanently underutilized simply because random initialization didn't give it enough early opportunities.

Several techniques address this limitation:

  • Load balancing losses penalize uneven expert usage, encouraging exploration of all experts by adding a penalty when some experts receive many more tokens than others
  • Random routing occasionally selects experts not in the Top-K for gradient purposes, providing gradient signal about alternatives
  • Auxiliary losses provide additional training signals beyond the main task loss, guiding the router toward desirable behaviors

We'll explore these techniques in detail in the upcoming chapters on load balancing and auxiliary losses.

Gradient Scale Considerations

The routing weights are typically small values (especially when split across multiple experts), which can lead to gradient scaling issues. If two experts share a token with weights 0.5 and 0.5, the gradient flowing to each expert is scaled by 0.5. With more experts sharing, these weights become even smaller.

Beyond gradient scaling, another concern is numerical stability. If routing logits grow very large, the softmax can produce extreme probabilities (nearly 0 or 1), leading to vanishing gradients and unstable training. Some implementations use router z-loss to prevent routing logits from growing too large, maintaining stable training dynamics. The z-loss adds a penalty proportional to the log-sum-exp of routing logits:

$$\mathcal{L}_z = \frac{1}{B}\sum_{x} \left(\log \sum_{i=1}^{n} \exp(G(x)_i)\right)^2$$

where:

  • $\mathcal{L}_z$: the router z-loss
  • $B$: the batch size
  • $x$: an input token representation in the batch
  • $G(x)_i$: the raw routing score (logit) for expert $i$
  • $\sum_{i=1}^{n} \exp(G(x)_i)$: the partition function (denominator of the softmax) for input $x$
  • $\log \sum_{i=1}^{n} \exp(G(x)_i)$: the log-sum-exp of the logits, serving as a smooth approximation of the maximum logit

The log-sum-exp quantity is closely related to the maximum logit: as one logit grows much larger than the others, the log-sum-exp approaches that maximum value. By penalizing the squared log-sum-exp, the z-loss discourages any single expert from receiving extremely high routing scores. This encourages the router to produce moderate logits rather than extreme values, improving numerical stability throughout training.
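A direct translation of this formula into code might look like the sketch below (the training demo later in the chapter computes the same quantity inline):

Code
import torch

def router_z_loss(logits: torch.Tensor) -> torch.Tensor:
    """Router z-loss: mean squared log-sum-exp of the routing logits.

    logits: (num_tokens, num_experts) raw router scores.
    """
    z = torch.logsumexp(logits, dim=-1)   # log of the softmax partition function
    return (z ** 2).mean()                # average over the batch of tokens

logits = torch.randn(16, 8)               # 16 tokens, 8 experts (illustrative)
print(router_z_loss(logits))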

Router Learned Behavior

What patterns emerge in trained routers? Understanding these patterns provides insight into how MoE models organize their computation and what specializations develop naturally during training. MoE routers learn meaningful specialization, though patterns depend on the architecture, training data, and task.

Semantic Specialization

Studies of trained MoE language models show that experts often specialize by semantic domain. One expert might preferentially handle scientific text, another legal documents, another conversational content. This specialization emerges naturally from end-to-end training without any explicit labels about document types. The reason is that experts that specialize can model domain-specific patterns more precisely: an expert focused on legal language can learn the specific vocabulary, sentence structures, and logical patterns common in legal text.

To analyze this semantic specialization, we examine routing decisions on held-out data through a systematic process:

  1. Route many examples through the trained model
  2. Record which experts are selected for each token
  3. Aggregate by document type or topic
  4. Look for experts with concentrated domain preferences

These analyses reveal that domain specialization often develops in deeper layers, while earlier layers show more syntactic patterns. This makes sense from an information processing perspective: earlier layers establish basic linguistic structure, while later layers can afford to specialize on content-specific patterns.
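A sketch of this aggregation step is shown below. It assumes a trained `router` following the interface of the `Router` module defined later in this chapter, and a hypothetical collection of `(representation, domain_label)` pairs:

Code
import torch
from collections import Counter, defaultdict

def expert_usage_by_domain(router, samples):
    """Count how often each expert is selected, grouped by domain label.

    samples: iterable of (x, domain_label) pairs with x of shape (d_model,).
    Assumes `router` returns (routing_weights, selected_experts, router_probs).
    """
    usage = defaultdict(Counter)
    with torch.no_grad():
        for x, domain in samples:
            _, selected, _ = router(x.view(1, 1, -1))       # (1, 1, top_k)
            usage[domain].update(selected.flatten().tolist())
    return usage

# Experts whose counts concentrate in a single domain are candidates
# for the kind of semantic specialization described above.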

Syntactic Patterns

Early MoE layers often route based on syntactic features rather than semantic content. Before the model can specialize on meaning, it must establish grammatical structure. Patterns observed in early layers include:

  • Part-of-speech clustering: Experts that preferentially handle verbs, nouns, or punctuation
  • Position effects: Some experts specialize in sentence-initial or sentence-final tokens
  • Subword patterns: Different experts for complete words versus subword fragments

These syntactic patterns make sense given the information processing hierarchy: early layers need to handle positional and structural processing that applies broadly across domains. A verb requires similar processing whether it appears in a legal document or a casual conversation.

Token Frequency Effects

High-frequency tokens (function words, common subwords) often route to a small subset of experts, while rare tokens distribute more broadly across the expert pool. This asymmetry likely reflects that common tokens have simpler, more predictable patterns that a few specialized experts can handle efficiently. Function words like "the," "is," and "of" behave consistently across contexts, so dedicating experts to them makes sense.

Rare tokens, on the other hand, appear in diverse contexts and may benefit from accessing different experts depending on their specific usage. A rare technical term might need different processing when appearing in a definition versus an application.

Layer-by-Layer Evolution

Routing patterns differ across layers in the model, showing a progression from general to specific processing. A common pattern observed in trained models:

  • Early layers: More uniform routing, syntactic specialization
  • Middle layers: Emerging domain specialization, more concentrated routing
  • Later layers: Strong specialization, decisive routing (lower entropy in routing distribution)

This progression mirrors the general finding that transformer representations become more abstract and task-specific in deeper layers. Early layers establish shared representations useful for all types of content, while later layers can afford to specialize because they build on this shared foundation.

Out[5]:
[Figure: Simulated routing entropy across layers in an MoE model. Early layers show high entropy (near-uniform routing), while later layers show lower entropy (more decisive routing), reflecting increasing specialization as information flows through the network.]

Worked Example

Let's trace through routing computation for a concrete example to solidify the concepts. Working through the math shows how formulas produce routing decisions. Suppose we have:

  • 4 experts
  • Input dimension $d = 4$
  • An input token representation: $x = [0.5, -0.3, 0.8, 0.1]$

And a router weight matrix:

$$W_g = \begin{bmatrix} 0.2 & -0.1 & 0.4 & 0.1 \\ 0.3 & 0.2 & -0.2 & 0.5 \\ -0.1 & 0.5 & 0.3 & -0.3 \\ 0.4 & 0.1 & 0.2 & 0.2 \end{bmatrix}$$

Step 1: Compute routing logits

The first step applies the linear projection to compute raw scores for each expert. We multiply the input vector by the weight matrix:

$$G(x) = x W_g$$

For expert 1 (first column), we compute the dot product between the input and the first column of the weight matrix:

$$\begin{aligned} G(x)_1 &= 0.5(0.2) + (-0.3)(0.3) + 0.8(-0.1) + 0.1(0.4) \\ &= 0.1 - 0.09 - 0.08 + 0.04 \\ &= -0.03 \end{aligned}$$

Computing all four expert scores using the same dot product approach:

$$G(x) = [-0.03,\ 0.30,\ 0.52,\ -0.32]$$

Expert 3 has the highest logit (0.52), followed by Expert 2 (0.30). Expert 1 and Expert 4 have negative scores, indicating the input doesn't align well with their prototypes.

Step 2: Apply softmax

Next, we convert these raw logits to probabilities. First, compute the exponentials to ensure positive values:

$$\exp(G(x)) = [0.97,\ 1.35,\ 1.68,\ 0.73]$$

Sum: $0.97 + 1.35 + 1.68 + 0.73 = 4.73$

Divide each exponential by the sum to get normalized probabilities:

$$p = [0.21,\ 0.29,\ 0.36,\ 0.15]$$

Notice how the exponential amplified the differences: Expert 3's logit is only 0.22 higher than Expert 2's, but its probability is about 25% higher (0.36 vs. 0.29).

Step 3: Top-2 selection

The two highest-scoring experts are Expert 3 (0.36) and Expert 2 (0.29). These will be the only experts that actually compute their outputs.

Renormalized weights ensure the selected experts' contributions sum to 1:

$$\tilde{p}_2 = \frac{0.29}{0.36 + 0.29} \approx 0.45 \qquad \tilde{p}_3 = \frac{0.36}{0.36 + 0.29} \approx 0.55$$

The final output is: $y \approx 0.45 \cdot E_2(x) + 0.55 \cdot E_3(x)$

Only experts 2 and 3 are computed, saving roughly 50% of expert computation compared to running all four experts. Expert 3 contributes slightly more to the output due to its higher routing score.
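You can verify the worked example numerically. The sketch below reproduces Steps 1 through 3 with NumPy; the printed values are rounded, so they match the hand calculation up to rounding error.

Code
import numpy as np

x = np.array([0.5, -0.3, 0.8, 0.1])
W_g = np.array([
    [ 0.2, -0.1,  0.4,  0.1],
    [ 0.3,  0.2, -0.2,  0.5],
    [-0.1,  0.5,  0.3, -0.3],
    [ 0.4,  0.1,  0.2,  0.2],
])

logits = x @ W_g                                 # Step 1: raw routing logits
probs = np.exp(logits) / np.exp(logits).sum()    # Step 2: softmax
top2 = np.argsort(probs)[-2:]                    # Step 3: indices of the two largest probs
weights = probs[top2] / probs[top2].sum()        # renormalize the selected experts

print(logits.round(2))              # [-0.03  0.3   0.52 -0.32]
print(probs.round(2))               # [0.21 0.29 0.36 0.15]
print(top2 + 1, weights.round(2))   # experts 2 and 3 with weights [0.45 0.55]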

Out[6]:
[Figure: The worked example step by step: raw routing logits, softmax probabilities with the top-2 experts (E2 and E3) highlighted, and the renormalized weights summing to 1.0.]

Code Implementation

Let's implement a gating network from scratch to see these concepts in action.

In[7]:
Code
import torch
import numpy as np

## Set random seed for reproducibility
torch.manual_seed(42)
np.random.seed(42)

Basic Router Module

We start with the core router implementation. The router is a linear layer that projects token representations to expert scores.

In[8]:
Code
import torch
import torch.nn as nn
import torch.nn.functional as F


class Router(nn.Module):
    """
    Gating network for Mixture of Experts.
    Routes tokens to top-k experts based on learned routing scores.
    """

    def __init__(self, d_model: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.num_experts = num_experts
        self.top_k = top_k

        # Single linear layer: projects from d_model to num_experts
        # No bias term, common in production MoE implementations
        self.gate = nn.Linear(d_model, num_experts, bias=False)

    def forward(self, x: torch.Tensor):
        """
        Compute routing weights for input tokens.

        Args:
            x: Input tensor of shape (batch_size, seq_len, d_model)

        Returns:
            routing_weights: Weights for selected experts (batch, seq, top_k)
            selected_experts: Indices of selected experts (batch, seq, top_k)
            router_probs: Full probability distribution (batch, seq, num_experts)
        """
        # Compute routing logits
        logits = self.gate(x)  # (batch, seq, num_experts)

        # Convert to probabilities via softmax
        router_probs = F.softmax(logits, dim=-1)

        # Select top-k experts
        routing_weights, selected_experts = torch.topk(
            router_probs, self.top_k, dim=-1
        )

        # Renormalize weights to sum to 1
        routing_weights = routing_weights / routing_weights.sum(
            dim=-1, keepdim=True
        )

        return routing_weights, selected_experts, router_probs

Let's verify the router produces expected outputs:

In[9]:
Code
## Create a router with 8 experts, selecting top-2
d_model = 256
num_experts = 8
router = Router(d_model, num_experts, top_k=2)

## Simulate a batch of token representations
batch_size = 2
seq_len = 4
x = torch.randn(batch_size, seq_len, d_model)

## Get routing decisions
routing_weights, selected_experts, router_probs = router(x)
Out[10]:
Console
Input shape: torch.Size([2, 4, 256])
Routing weights shape: torch.Size([2, 4, 2])
Selected experts shape: torch.Size([2, 4, 2])

First token routing probabilities (all experts):
[0.184 0.073 0.083 0.066 0.336 0.049 0.031 0.178]

First token selected experts: [4, 0]
First token weights for selected experts: [0.646 0.354]
Weights sum: 1.000

The router produces probability distributions over all experts, then selects the top-2 and renormalizes their weights.

Routing Score Analysis

To understand router behavior, let's examine how routing probabilities distribute across experts for different inputs.

In[11]:
Code
def analyze_routing_distribution(router, num_samples=1000, d_model=256):
    """
    Analyze how the router distributes tokens across experts.
    """
    # Generate random token representations
    x = torch.randn(num_samples, 1, d_model)

    with torch.no_grad():
        _, selected_experts, router_probs = router(x)

    # Count how often each expert is selected
    expert_counts = torch.zeros(router.num_experts)
    for i in range(router.num_experts):
        expert_counts[i] = (selected_experts == i).sum().item()

    # Compute mean probability assigned to each expert
    mean_probs = router_probs.squeeze(1).mean(dim=0)

    return expert_counts, mean_probs
In[12]:
Code
## Analyze an untrained router
counts, mean_probs = analyze_routing_distribution(router)
Out[13]:
Console
Expert selection counts (out of 2000 selections):
  Expert 0:  265 ██████████████████████████████████████
  Expert 1:  251 ████████████████████████████████████
  Expert 2:  261 █████████████████████████████████████
  Expert 3:  215 ██████████████████████████████
  Expert 4:  241 ██████████████████████████████████
  Expert 5:  227 ████████████████████████████████
  Expert 6:  278 ████████████████████████████████████████
  Expert 7:  262 █████████████████████████████████████

Mean routing probabilities:
  Expert 0: 0.126 ██████████
  Expert 1: 0.126 ██████████
  Expert 2: 0.128 ██████████
  Expert 3: 0.120 █████████
  Expert 4: 0.122 █████████
  Expert 5: 0.122 █████████
  Expert 6: 0.129 ██████████
  Expert 7: 0.126 ██████████

With random initialization, the router distributes fairly evenly across experts. This is expected before training shapes the routing behavior.

Out[14]:
[Figure: Selection counts and mean routing probabilities per expert for the untrained router, both close to the uniform baseline of 1/8.]

Visualizing Routing Entropy

Routing entropy measures how "decisive" the router is. Low entropy means the router strongly prefers specific experts; high entropy means uncertainty across many experts.

In[15]:
Code
def compute_routing_entropy(router_probs):
    """
    Compute entropy of routing distributions.
    Higher entropy = more uniform, lower = more concentrated.
    """
    # Avoid log(0)
    probs = router_probs.clamp(min=1e-10)
    entropy = -torch.sum(probs * torch.log(probs), dim=-1)
    return entropy


## Compute entropy for our sample
_, _, router_probs = router(x)
entropies = compute_routing_entropy(router_probs)
In[16]:
Code
## Maximum possible entropy (uniform distribution over 8 experts)
max_entropy = np.log(num_experts)
Out[17]:
Console
Maximum entropy (uniform over 8 experts): 2.079
Mean observed entropy: 1.904
Entropy ratio (observed/max): 91.56%

An untrained router produces near-maximum entropy, indicating uniform routing. Training should reduce this as the router learns to make more decisive choices.

Simulating Trained Router Behavior

Let's simulate what happens when a router learns domain-specific patterns. We'll create synthetic "domain" embeddings and train the router to recognize them.

Why can the loss hit ~0 almost immediately in this demo?

This is a supervised toy classification setup: we generate synthetic "domains" as Gaussian clusters and directly train the router to predict the domain label. If the clusters are well separated (large prototype_scale) and noise is small (noise_std), a linear gate can drive logits to large magnitudes, making softmax saturate and the loss appear to drop to ~0 in just a few optimization steps.

To keep the curve informative (and connect to the stability discussion above), the demo below uses overlapping domains and adds a small router z-loss term (z_loss_coef) to discourage runaway logits.

In[18]:
Code
## Create synthetic domain representations
## Imagine these are averaged representations from different document types
num_domains = 4
prototype_scale = 1.0
noise_std = 0.8
batch_size_per_domain = 64

## NOTE: If you crank up prototype_scale and reduce noise_std, this becomes
## an almost trivial supervised classification problem and the loss can drop
## to ~0 in just a few optimization steps (softmax saturation).
domain_prototypes = torch.randn(num_domains, d_model) * prototype_scale

## Create a router with one expert per domain
domain_router = Router(d_model, num_domains, top_k=1)

## Simulate "labeled" training: each sample comes from a known domain
## We'll train the router to route domain samples to corresponding experts

z_loss_coef = 1e-2
optimizer = torch.optim.Adam(
    domain_router.parameters(), lr=0.005, weight_decay=1e-4
)

losses = []
for epoch in range(200):
    epoch_loss = 0
    for domain_idx in range(num_domains):
        # Generate samples from this domain (prototype + noise)
        samples = (
            domain_prototypes[domain_idx]
            + torch.randn(batch_size_per_domain, d_model) * noise_std
        )
        samples = samples.unsqueeze(1)  # Add seq dimension

        # Train on logits directly for numerical stability
        logits = domain_router.gate(samples).squeeze(1)  # (batch, num_domains)
        labels = torch.full(
            (logits.shape[0],),
            domain_idx,
            dtype=torch.long,
            device=logits.device,
        )

        # Cross-entropy: encourage routing to the correct expert
        nll = F.cross_entropy(logits, labels)

        # Router z-loss: discourages runaway logits (keeps softmax from saturating)
        z = torch.logsumexp(logits, dim=-1)
        z_loss = (z**2).mean()

        loss = nll + z_loss_coef * z_loss

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        epoch_loss += loss.item()

    losses.append(epoch_loss / num_domains)
Out[19]:
[Figure: Router training loss over 200 epochs of domain-specific learning (with overlapping synthetic domains and a small router z-loss to prevent softmax saturation). The decreasing loss reflects the router's improving accuracy in mapping inputs to their corresponding experts.]

The router quickly learns to distinguish domain-specific patterns. Let's verify it routes correctly:

In[20]:
Code
## Test routing accuracy on new samples
correct = 0
total = 0
domain_accuracies = []

with torch.no_grad():
    for domain_idx in range(num_domains):
        samples = (
            domain_prototypes[domain_idx] + torch.randn(100, d_model) * 0.3
        )
        samples = samples.unsqueeze(1)

        _, selected, _ = domain_router(samples)
        selected = selected.squeeze()

        # Calculate accuracy for this domain
        domain_acc = (selected == domain_idx).float().mean().item()
        domain_accuracies.append(domain_acc)

        correct += (selected == domain_idx).sum().item()
        total += 100
Out[21]:
Console
Routing accuracy: 100.0%

Per-domain accuracy:
  Domain 0: 100.0%
  Domain 1: 100.0%
  Domain 2: 100.0%
  Domain 3: 100.0%

The trained router successfully learns to route tokens from different domains to appropriate experts.

Out[22]:
[Figure: Routing confusion matrix for the trained domain router. Strong diagonal values indicate that tokens from each domain are correctly routed to their corresponding expert, demonstrating learned specialization.]

Router Weight Visualization

We can visualize the router's learned prototypes to understand what patterns it captures.

In[23]:
Code
## Extract router weights for visualization
weights = domain_router.gate.weight.detach().numpy()
Out[24]:
[Figure: Heatmap of the learned router weight matrix (first 50 dimensions of each expert's prototype). Each expert's prototype responds to a distinct set of features in the input representation.]

Each row shows a distinct pattern, representing the "prototype" that input must match to route to that expert. The router learns to position these prototypes to distinguish between domain-specific representations.

Key Parameters

The key parameters for the Gating Network are:

  • d_model: The dimension of input token representations.
  • num_experts: The total number of experts available in the layer.
  • top_k: The number of experts selected for each token. Common values are 1 or 2.
  • bias: Whether to include a bias term in the linear projection. Typically set to False.

Limitations and Impact

The gating network design we've covered represents the standard approach used in models like Switch Transformer and Mixtral. While effective, this design has important limitations that shape MoE research and practice.

Routers make discrete decisions while training requires continuous gradients. The standard approach handles this by flowing gradients through routing weights, not expert selection. This works but creates blind spots: the model never learns whether unselected experts might have been better choices. Several techniques address this limitation. Load balancing losses encourage the router to explore all experts by penalizing uneven usage. Random routing occasionally computes non-selected experts for gradient purposes. And capacity factors limit how many tokens any single expert can handle, forcing distribution across experts. We'll explore these techniques in the upcoming chapters on load balancing and auxiliary losses.

Another limitation is the simplicity of the routing decision. A single linear layer might not capture complex routing logic. Some research explores hierarchical routers (first choose an expert cluster, then an expert within it) or multi-layer routers. However, the simple linear approach remains dominant because more complex routers add overhead and don't consistently improve performance.

Despite these limitations, gating networks successfully enable sparse expert models that match or exceed dense model quality at lower computational cost. The router's ability to learn meaningful specialization, routing different types of content to appropriate experts, demonstrates that even simple gating mechanisms can capture important structure in language. This specialization emerges naturally from end-to-end training, without explicit supervision about which expert should handle which content. The router discovers useful divisions of labor by learning which expert produces better outputs for each input type. This emergent specialization is a compelling aspect of MoE architectures.

Summary

Gating networks serve as the decision-making component in Mixture of Experts architectures, determining which experts process each token. The key concepts covered in this chapter include:

  • Router architecture: Standard routers use a single linear layer projecting token representations to expert scores. This simple design balances computational efficiency with sufficient expressivity.

  • Routing score computation: Raw logits pass through softmax to produce a probability distribution over experts. Top-K selection then identifies which experts actually compute, with renormalized weights for combining their outputs.

  • Router training: Gradients flow through routing weights (which are continuous) even though expert selection is discrete. This creates blind spots for unselected experts, motivating auxiliary training techniques.

  • Learned behavior: Trained routers exhibit meaningful specialization, including semantic domain preferences, syntactic patterns, and layer-dependent routing strategies. This specialization emerges naturally from end-to-end training.

The next chapter covers Top-K routing in detail, examining how the number of selected experts affects model capacity, computation costs, and training dynamics.

