
Expert Networks
In the previous chapter, we explored how sparse models achieve computational efficiency by activating only a subset of parameters for each input. The Mixture of Experts (MoE) architecture realizes this principle through a collection of specialized subnetworks, each designed to handle different aspects of the input distribution. These subnetworks, called experts, form the computational backbone of MoE models.
This chapter examines what experts are, how they're structured, and the architectural decisions that determine their effectiveness. We'll see why the feed-forward network became the dominant choice for expert implementation, how to manage the flow of tokens through experts during batch processing, and where to place expert layers within the transformer architecture.
What Is an Expert?
An expert is an independent neural network component within an MoE layer. Each expert processes tokens routed to it, producing output representations that are combined based on routing weights.
The term "expert" suggests specialization, and MoE experts often develop distinct functional roles during training. Some experts might specialize in processing mathematical expressions, others in handling conversational text, and still others in domain-specific terminology. This specialization emerges naturally from the routing mechanism rather than being explicitly programmed. The model discovers, through gradient descent, that certain experts become more effective at processing certain types of inputs, and the routing mechanism learns to exploit this emergent structure.
To understand what an expert actually does, we need to think about it as a transformation function. Conceptually, an expert is a function that maps an input vector to an output vector within the same space. This means that when a token representation enters an expert, it emerges transformed but still compatible with the rest of the model's architecture:
$$E_i : \mathbb{R}^{d_{\text{model}}} \to \mathbb{R}^{d_{\text{model}}}$$
where:
- $E_i$: the function representing the $i$-th expert
- $d_{\text{model}}$: the dimension of the token representation (matching the model's hidden dimension)
This function transforms an input representation into an output representation of the same dimensionality. The notation $\mathbb{R}^{d_{\text{model}}}$ indicates that both the input and output live in a $d_{\text{model}}$-dimensional real vector space. This dimensionality constraint is essential because the expert's output must integrate seamlessly with residual connections and subsequent layers in the transformer.
For a set of experts $\{E_1, E_2, \ldots, E_N\}$, each processes its assigned tokens independently. The key insight is that different tokens can be routed to different experts, enabling the model to apply specialized processing based on input characteristics. Consider a sentence containing both technical terminology and conversational phrases: the technical terms might be routed to one expert that has learned relevant transformations for such vocabulary, while the conversational phrases flow to a different expert optimized for casual language patterns. This routing happens dynamically for each forward pass, adapting the computation to the specific characteristics of each input.
Experts as Feed-Forward Networks
The most common implementation uses standard feed-forward networks as experts. This choice isn't arbitrary; it stems from a fundamental property of transformer FFN layers that we discussed in Part XII. Understanding this property reveals why FFN layers are uniquely suited for expert-based sparsification.
The Position-Independence Property
Recall that the feed-forward network in a transformer block operates identically on each position in the sequence, processing each token in isolation. This property, sometimes called "position-wise" or "token-wise" processing, means that the FFN applies the exact same transformation to every token without any awareness of neighboring tokens. Mathematically, the FFN applies a point-wise transformation:
$$\text{FFN}(x) = W_2\, \sigma(W_1 x + b_1) + b_2$$
where:
- $x \in \mathbb{R}^{d_{\text{model}}}$: the input vector corresponding to a specific token position
- $W_1 \in \mathbb{R}^{d_{\text{ff}} \times d_{\text{model}}}$, $b_1 \in \mathbb{R}^{d_{\text{ff}}}$: the weight matrix and bias vector for the first linear transformation
- $W_1 x + b_1$: the projection into the intermediate expansion dimension
- $\sigma$: the non-linear activation function (typically ReLU or GELU)
- $W_2 \in \mathbb{R}^{d_{\text{model}} \times d_{\text{ff}}}$, $b_2 \in \mathbb{R}^{d_{\text{model}}}$: the weight matrix and bias vector for the second linear transformation
Let's trace through this computation to build intuition. The input vector $x$ first undergoes a linear transformation through $W_1$, which projects it from the model dimension $d_{\text{model}}$ into a higher-dimensional intermediate space of size $d_{\text{ff}}$ (typically 4× larger). The bias term $b_1$ shifts this projection, and the activation function $\sigma$ introduces non-linearity by selectively suppressing or passing through different components of the intermediate representation. Finally, $W_2$ projects the result back to the original dimensionality, with $b_2$ providing a final bias adjustment.
This position-independence is crucial for MoE. Because each token passes through the FFN independently, we can route different tokens to different FFN-style experts without disrupting the computational flow. If the FFN computation for the token at position 3 never interacts with the computation for the token at position 7, there is no reason both must use the same FFN weights. Attention layers, by contrast, require all positions to interact simultaneously through the query-key-value mechanism, making them poor candidates for expert-based sparsification.
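A quick way to see this property in code is the following sketch, which uses a generic two-layer FFN (the dimensions and module here are illustrative; the equality check is the point):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, d_ff, seq_len = 64, 256, 10
ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

x = torch.randn(seq_len, d_model)

# Process the whole sequence at once...
full = ffn(x)

# ...or process positions in an arbitrary order and reassemble.
idx = torch.randperm(seq_len)
pieces = ffn(x[idx])
reassembled = torch.empty_like(full)
reassembled[idx] = pieces

# Same result either way: the FFN never looks across positions,
# which is what lets different tokens be sent to different experts.
print(torch.allclose(full, reassembled, atol=1e-6))  # True
```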
Expert FFN Architecture
Each expert in an MoE layer typically mirrors the standard transformer FFN structure, but with its own independent set of parameters. Building on the gated architectures covered in Part XII, Chapter 7, most modern expert implementations use the SwiGLU variant. This architecture splits the input into two parallel paths, one for gating and one for features, allowing the network to dynamically control information flow:
$$E_i(x) = W_{\text{down}}^{(i)} \left( \text{SiLU}\!\left(W_{\text{gate}}^{(i)} x\right) \odot \left(W_{\text{up}}^{(i)} x\right) \right)$$
where:
- $x$: the input token representation vector
- $W_{\text{gate}}^{(i)}$: the gating projection weights for expert $i$
- $W_{\text{up}}^{(i)}$: the feature up-projection weights for expert $i$
- $\text{SiLU}$: the Sigmoid Linear Unit activation function
- $\text{SiLU}(W_{\text{gate}}^{(i)} x)$: the gating mask values computed by the first path
- $W_{\text{up}}^{(i)} x$: the candidate features computed by the second path
- $\odot$: the element-wise multiplication operator (Hadamard product), which filters the features using the gate
- $W_{\text{down}}^{(i)}$: the down-projection weights transforming back to the model dimension
To understand why this gated structure is valuable, consider what each component contributes. The architecture processes the input through two parallel paths that serve complementary purposes. The up-projection $W_{\text{up}}^{(i)} x$ generates candidate features by expanding the input into a higher-dimensional space where the model can represent more complex patterns. Meanwhile, the gating path computes the mask $\text{SiLU}(W_{\text{gate}}^{(i)} x)$: SiLU stays close to zero for negative pre-activations and grows roughly linearly for positive ones, so each element of this mask indicates how much the corresponding candidate feature should be preserved.
The element-wise product ($\odot$) then filters the candidates with this mask, allowing the network to suppress or preserve specific features before the final projection $W_{\text{down}}^{(i)}$. This gating mechanism gives the expert fine-grained control over which information flows through: some dimensions might be almost entirely suppressed (gate value near 0) while others pass through strongly (large gate value). This selective filtering is what allows experts to develop specialized behaviors for different input types.
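As a minimal sketch of this gated structure in PyTorch (bias-free projections, as is common for SwiGLU; the class name `SwiGLUExpert` and the dimensions are illustrative choices, not a reference implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUExpert(nn.Module):
    """One expert implemented as a SwiGLU feed-forward block."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)  # gating path
        self.w_up = nn.Linear(d_model, d_ff, bias=False)    # candidate features
        self.w_down = nn.Linear(d_ff, d_model, bias=False)  # back to model dimension

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate = F.silu(self.w_gate(x))        # SiLU(W_gate x)
        features = self.w_up(x)              # W_up x
        return self.w_down(gate * features)  # element-wise gating, then down-projection

# One token representation in, one out, same dimensionality.
expert = SwiGLUExpert(d_model=512, d_ff=1408)  # d_ff ≈ (8/3) * d_model, rounded up
y = expert(torch.randn(1, 512))
print(y.shape)  # torch.Size([1, 512])
```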
The key parameters defining an expert's architecture include:
- Input/output dimension ($d_{\text{model}}$): Matches the transformer's hidden dimension, typically 2048-8192 in large models
- Intermediate dimension ($d_{\text{ff}}$): The expansion dimension within the expert, often $4 \cdot d_{\text{model}}$ for standard FFNs or roughly $\tfrac{8}{3} \cdot d_{\text{model}}$ for SwiGLU to match parameter counts
- Activation function: Usually SiLU (Swish) for gated variants, or GELU/ReLU for standard two-layer FFNs
Each expert maintains its own weight matrices, completely independent of other experts. This independence is what enables specialization: expert $E_i$ can learn entirely different transformations than expert $E_j$. During training, the gradient updates for each expert depend only on the tokens routed to that expert, allowing different experts to develop distinct computational roles.
Parameter Counting
To evaluate the memory footprint of an MoE layer, we sum the parameters across all experts. Understanding this calculation helps you anticipate memory requirements and make informed architectural decisions. Consider an MoE layer with $N$ experts, where each expert is a standard two-layer FFN with intermediate dimension $d_{\text{ff}}$:
$$P_{\text{expert}} = P_{\text{weights}} + P_{\text{biases}} = 2 \cdot d_{\text{model}} \cdot d_{\text{ff}} + (d_{\text{ff}} + d_{\text{model}})$$
where:
- $P_{\text{expert}}$: the total parameter count for a single expert
- $P_{\text{weights}}$: the parameter count for the weight matrices
- $P_{\text{biases}}$: the parameter count for the bias vectors
- $d_{\text{model}}$: the hidden dimension size of the model
- $d_{\text{ff}}$: the intermediate dimension size of the feed-forward network
- $2 \cdot d_{\text{model}} \cdot d_{\text{ff}}$: parameters for the two weight matrices ($W_1$ and $W_2$)
- $d_{\text{ff}} + d_{\text{model}}$: parameters for the bias vectors ($b_1$ and $b_2$)
Let's trace through why the weight matrix formula has the factor of 2. The first weight matrix $W_1$ has shape $d_{\text{ff}} \times d_{\text{model}}$, mapping from the model dimension to the intermediate dimension, which contributes $d_{\text{model}} \cdot d_{\text{ff}}$ parameters. The second weight matrix $W_2$ has shape $d_{\text{model}} \times d_{\text{ff}}$, mapping back from the intermediate dimension to the model dimension, contributing another $d_{\text{model}} \cdot d_{\text{ff}}$ parameters. Together, these two matrices account for the $2 \cdot d_{\text{model}} \cdot d_{\text{ff}}$ term. The bias vectors add a relatively small number of additional parameters: $d_{\text{ff}}$ for $b_1$ and $d_{\text{model}}$ for $b_2$.
For $N$ experts, the total MoE layer parameters are approximately $N \cdot P_{\text{expert}} \approx 2 N \cdot d_{\text{model}} \cdot d_{\text{ff}}$. A model with 8 experts in each MoE layer has roughly 8× the FFN parameters of an equivalent dense model. However, if only 2 experts activate per token, the compute per forward pass remains comparable to the dense baseline. This is the fundamental trade-off of MoE: we store many more parameters than we use for any single input, gaining model capacity without proportional computational cost.
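As a concrete worked example (the dimensions here are illustrative, not tied to a specific model): with $d_{\text{model}} = 4096$ and $d_{\text{ff}} = 16384$, each expert holds roughly $2 \cdot 4096 \cdot 16384 \approx 134\text{M}$ parameters, so an 8-expert MoE layer stores about $1.07\text{B}$ FFN parameters, while a token that activates 2 experts touches only about $268\text{M}$ of them.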
Expert Capacity
When processing batches of sequences, MoE models face a unique challenge: different tokens route to different experts, but hardware prefers uniform workloads. Expert capacity mechanisms manage this tension by imposing limits on how many tokens each expert can process.
The Batching Problem
Consider a batch of 1024 tokens being processed by an 8-expert MoE layer. If routing were perfectly balanced, each expert would receive exactly 128 tokens. In practice, routing is rarely uniform: some experts attract a disproportionate share of tokens while others receive only a few. Without constraints, popular experts become computational bottlenecks while unpopular experts sit idle.
This load imbalance creates practical problems for efficient computation. Modern hardware, particularly GPUs, achieves peak throughput when performing regular, predictable operations on uniformly-sized data. When one expert must process 300 tokens while another processes only 50, we cannot fully utilize parallel processing capabilities. The expert with 300 tokens becomes a serialization point, and the expert with 50 tokens wastes potential compute cycles.
Capacity Factor
To prevent bottlenecks where popular experts become overloaded, we calculate the expert capacity (the maximum tokens per expert) based on the total batch size and a buffer multiplier. This approach provides a principled way to balance routing flexibility against computational efficiency:
$$C = \left\lceil \alpha \cdot \frac{T}{N} \right\rceil$$
where:
- $C$: the expert capacity (maximum tokens per expert)
- $\alpha$: the capacity factor (typically between 1.0 and 2.0), acting as a buffer multiplier
- $T$: the total number of tokens in the batch
- $N$: the total count of available experts
- $T/N$: the baseline load per expert under perfectly balanced routing
The intuition behind this formula is straightforward. The ratio $T/N$ represents the average number of tokens each expert would receive under perfectly balanced routing. In our example with 1024 tokens and 8 experts, this baseline is 128 tokens per expert. The capacity factor $\alpha$ then scales this baseline to determine how much deviation from perfect balance we allow.
A capacity factor of 1.0 means each expert can handle exactly its "fair share" of tokens. Values greater than 1.0 provide headroom for imbalanced routing.
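For instance, with $T = 1024$ tokens, $N = 8$ experts, and $\alpha = 1.25$, each expert can accept up to $C = \lceil 1.25 \cdot 1024 / 8 \rceil = 160$ tokens before overflowing.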
The capacity factor affects model behavior in several ways:
- $\alpha = 1.0$: Minimal buffer; tokens may overflow if routing is imbalanced
- $\alpha = 1.25$: Common choice, providing a 25% buffer for routing variance
- $\alpha = 2.0$: Large buffer; rarely causes overflow but increases memory usage
Choosing the right capacity factor involves balancing memory efficiency against routing flexibility. A higher capacity factor allocates more buffer space for each expert's token queue, consuming more memory. However, it also reduces the probability that tokens are dropped due to capacity overflow, improving model quality.
Overflow and Dropped Tokens
When an expert reaches capacity, additional tokens routed to it are typically "dropped"; they bypass the MoE layer entirely and pass through via the residual connection. This ensures forward passes complete but means some tokens receive less processing than intended.
Dropped tokens receive no expert processing, relying solely on the residual connection. High drop rates indicate routing imbalance and can degrade model quality.
The drop rate provides a diagnostic signal. If 10% of tokens are dropped during training, the routing mechanism isn't distributing load effectively, motivating the auxiliary losses we'll explore in the upcoming chapters on load balancing. Monitor drop rates as a health metric for your MoE training runs; sustained high drop rates often correlate with degraded downstream performance.
Capacity During Training vs. Inference
Training typically uses fixed capacity factors to enable efficient batched operations. The predictable memory allocation allows for optimized kernel implementations and stable training dynamics. During inference, especially for single sequences, capacity constraints may be relaxed. Some implementations process all routed tokens regardless of capacity during inference, accepting variable compute time for better quality. This flexibility makes sense because inference often involves smaller batches where strict capacity constraints provide less benefit, and you typically prefer accuracy over strict latency guarantees.
Expert Count Selection
Choosing the number of experts involves balancing specialization, memory, and computational efficiency. Different models have explored configurations ranging from 8 to thousands of experts. The optimal choice depends on the training data distribution, available hardware, and target use case.
Trade-offs in Expert Count
Increasing the number of experts has several effects:
Benefits of more experts:
- Greater total parameter count without proportional compute increase
- Finer-grained specialization potential
- Better scaling on distributed systems with expert parallelism
Costs of more experts:
- More memory to store all expert weights
- Communication overhead when experts span devices
- Risk of some experts being underutilized
Empirical Findings
Major MoE models have converged on different expert counts based on their design goals:
| Model | Experts | Active Experts | Rationale |
|---|---|---|---|
| Switch Transformer | 128-2048 | 1 | Maximum sparsity, extreme scaling |
| Mixtral 8x7B | 8 | 2 | Balance simplicity with capacity |
| GShard | 2048 | 2 | Very large scale translation |
| DeepSeek-MoE | 64 | 6 | Finer-grained routing |
Switch Transformer demonstrated that even single-expert routing (one active expert per token) works well, enabling massive expert counts. Mixtral showed that modest expert counts (8) with simple routing achieve strong results while remaining practical to deploy.
The Capacity-Compute Relationship
More experts with fewer active per token increases the capacity-to-compute ratio. A model with 64 experts but only 2 active per token stores 32× more expert parameters than any single token's forward pass uses. This ratio affects:
- Memory requirements (all expert weights must be stored)
- Potential for overfitting (more capacity than data may support)
- Inference efficiency (expert weights may need to be loaded from slower memory)
Shared vs. Independent Experts
Some architectures include "shared" experts that process every token alongside the sparse routing to specialized experts. DeepSeek-MoE uses this approach, with 2 shared experts plus routing among 64 specialized experts. Shared experts capture common transformations while specialized experts handle token-specific processing.
Expert Placement in Transformer
Not every layer in a transformer needs to be an MoE layer. The placement of expert layers significantly impacts model behavior and efficiency. This section examines the design patterns that have proven effective in practice.
FFN Replacement Pattern
The standard approach replaces some or all FFN layers with MoE layers while keeping attention layers unchanged. This follows naturally from the position-independence property discussed earlier.
In a transformer block, the computation flows through a sequence of operations. The input first passes through layer normalization, then through the attention mechanism where tokens exchange information. After a residual connection adds the input back to the attention output, another layer normalization prepares the representation for the FFN (or MoE) layer. Finally, another residual connection completes the block. The FFN position, coming after attention has already enabled cross-position communication, becomes the natural location for the MoE layer.
The FFN position becomes the MoE position. Attention layers remain dense, ensuring all positions can still interact before expert-specialized processing. This design principle reflects the distinct roles of attention and FFN layers in transformers: attention handles information routing between positions, while FFN layers transform the representation at each position independently.
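A schematic of this block structure is sketched below, assuming pre-norm residual blocks and treating the MoE layer as an opaque callable (routing is the subject of the next chapter). The class name `TransformerBlockWithMoE` and the dense stand-in for the MoE layer are illustrative assumptions:

```python
import torch
import torch.nn as nn

class TransformerBlockWithMoE(nn.Module):
    """Pre-norm transformer block where the FFN slot is filled by an MoE layer."""
    def __init__(self, d_model: int, n_heads: int, moe_layer: nn.Module):
        super().__init__()
        self.attn_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn_norm = nn.LayerNorm(d_model)
        self.moe = moe_layer  # occupies the position a dense FFN would normally fill

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Attention stays dense: every position interacts with every other position.
        h = self.attn_norm(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        # Position-wise processing: the MoE layer handles each token independently.
        x = x + self.moe(self.ffn_norm(x))
        return x

# For a runnable example, a dense FFN stands in for the MoE layer here.
block = TransformerBlockWithMoE(
    d_model=256, n_heads=4,
    moe_layer=nn.Sequential(nn.Linear(256, 1024), nn.GELU(), nn.Linear(1024, 256)),
)
print(block(torch.randn(2, 16, 256)).shape)  # torch.Size([2, 16, 256])
```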
Interleaving Strategies
Several patterns have been explored for alternating dense and MoE layers:
- Every layer MoE: All FFN layers become MoE layers. Mixtral uses this pattern. It maximizes the number of sparse layers but increases routing overhead and may not improve quality proportionally.
- Alternating (every 2nd layer): Common in practice. Every other FFN layer is replaced with MoE; GShard uses this pattern. It provides regular dense layers for shared processing between expert layers.
- Periodic (every nth layer): Sparser placement, such as MoE at only every 4th layer. This further reduces router overhead but limits expert specialization opportunities.
Layer-Specific Expert Counts
Some architectures vary the number of experts by layer depth. Early layers might use fewer experts (patterns are more uniform), while later layers use more experts (task-specific specialization increases). This pattern appears in certain custom architectures, though major models typically use uniform expert counts.
Cross-Layer Expert Sharing
An emerging technique shares expert weights across multiple layers while maintaining layer-specific routers. This reduces total parameters while preserving adaptive routing. Each layer can route differently even if the underlying expert computations are shared.
Code Implementation
Let's implement a simple expert network structure and explore capacity-constrained processing.
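The sketch below is a minimal PyTorch version, assuming each expert is a standard two-layer FFN with GELU. The configuration (`d_model = 256`, `d_ff = 512`, 8 experts) is an illustrative choice made so that the parameter counts come out near the figures quoted in the text:

```python
import torch
import torch.nn as nn

class Expert(nn.Module):
    """A single expert: a standard two-layer feed-forward network."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)    # d_model -> d_ff expansion
        self.act = nn.GELU()                  # non-linearity
        self.down = nn.Linear(d_ff, d_model)  # d_ff -> d_model projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(self.act(self.up(x)))

# A bank of independent experts for one MoE layer.
d_model, d_ff, num_experts = 256, 512, 8
experts = nn.ModuleList(Expert(d_model, d_ff) for _ in range(num_experts))
```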
Each expert is an independent FFN. Let's examine the parameter counts:
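Continuing from the snippet above, the following counts parameters for that configuration; the "active" figure assumes top-1 routing (one expert per token), which is an assumption of this sketch:

```python
def count_params(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())

per_expert = count_params(experts[0])
total = sum(count_params(e) for e in experts)
active_per_token = per_expert  # assuming top-1 routing: one expert per token

print(f"Per-expert parameters:   {per_expert:,}")         # 2*256*512 + 512 + 256 = 262,912
print(f"Total MoE parameters:    {total:,}")              # 8 * 262,912 ≈ 2.1M
print(f"Active params per token: {active_per_token:,}")   # ≈ 263k
```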
The total parameter count (approx. 2.1M) reflects the sum of all experts, but the active parameters per token remain small (approx. 260k). This demonstrates how MoE increases model capacity without proportionally increasing inference compute.
Now let's implement capacity-constrained expert processing. This simulates how tokens are assigned to experts with overflow handling:
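A sketch of capacity-constrained assignment, assuming each token has already been routed to a single expert (routing itself is covered in the next chapter); tokens arriving after an expert's capacity is full are marked as dropped:

```python
import math
import torch

def assign_with_capacity(expert_ids: torch.Tensor, num_experts: int,
                         capacity_factor: float = 1.25):
    """Assign tokens to experts, dropping any that exceed expert capacity.

    expert_ids: (num_tokens,) tensor of chosen expert indices, one per token.
    Returns (kept_mask, capacity, per-expert kept-token counts).
    """
    num_tokens = expert_ids.numel()
    capacity = math.ceil(capacity_factor * num_tokens / num_experts)

    kept = torch.zeros(num_tokens, dtype=torch.bool)
    counts = torch.zeros(num_experts, dtype=torch.long)
    for i, e in enumerate(expert_ids.tolist()):  # first-come, first-served slots
        if counts[e] < capacity:
            kept[i] = True
            counts[e] += 1
    return kept, capacity, counts
```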
Let's simulate imbalanced routing to see capacity constraints in action:
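To mimic imbalanced routing, the sketch below draws expert assignments from a skewed distribution (the exact skew is arbitrary, chosen only to make the imbalance visible) and reuses `assign_with_capacity` from above:

```python
torch.manual_seed(0)
num_tokens, num_experts = 1024, 8

# Skewed routing: a couple of experts are far more popular than the rest.
probs = torch.tensor([0.40, 0.20, 0.12, 0.08, 0.07, 0.05, 0.05, 0.03])
expert_ids = torch.multinomial(probs, num_tokens, replacement=True)

kept, capacity, counts = assign_with_capacity(expert_ids, num_experts, capacity_factor=1.25)
dropped = (~kept).sum().item()

print(f"Capacity per expert: {capacity}")        # ceil(1.25 * 1024 / 8) = 160
print(f"Tokens kept per expert: {counts.tolist()}")
print(f"Dropped tokens: {dropped} ({100 * dropped / num_tokens:.1f}%)")
```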
The imbalanced routing causes popular experts to drop tokens. Let's visualize how different capacity factors affect drop rates:
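One way to produce such a plot is to sweep a few capacity factors over the same skewed assignments (matplotlib usage here is a sketch; any plotting library would do):

```python
import matplotlib.pyplot as plt

factors = [1.0, 1.25, 1.5, 2.0, 2.5, 3.0]
drop_rates = []
for cf in factors:
    kept, _, _ = assign_with_capacity(expert_ids, num_experts, capacity_factor=cf)
    drop_rates.append(100 * (~kept).sum().item() / num_tokens)

plt.plot(factors, drop_rates, marker="o")
plt.xlabel("Capacity factor")
plt.ylabel("Dropped tokens (%)")
plt.title("Drop rate vs. capacity factor under skewed routing")
plt.grid(True)
plt.show()
```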
The plot demonstrates that increasing the capacity factor reduces the drop rate. In this simulation, even a modest buffer eliminates most drops, though the heavy skew toward the most popular expert means some tokens are still dropped at the higher capacity factors.
Worked Example: Expert Specialization
Let's trace through how different tokens might be processed by different experts. Consider a sequence containing both code and natural language:
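The sketch below, continuing from the expert bank defined earlier, illustrates the idea with a hypothetical token-to-expert mapping (the routing table is invented purely for illustration; a real router would learn it) and shows that even untrained experts apply different transformations to the same input:

```python
import torch
import torch.nn.functional as F

# A mixed sequence of code and natural-language tokens (illustrative only).
tokens = ["def", "compute_loss", "(", "x", ")", ":", "return", "the", "average", "loss"]

# Hypothetical post-training routing: code-like tokens to expert 0, prose to expert 3.
routing = {"def": 0, "compute_loss": 0, "(": 0, "x": 0, ")": 0, ":": 0,
           "return": 0, "the": 3, "average": 3, "loss": 3}
for tok in tokens:
    print(f"{tok!r:>16} -> expert {routing[tok]}")

# The same input vector produces a different output from each expert.
x = torch.randn(1, d_model)
reference = experts[0](x)
for i, expert in enumerate(experts):
    sim = F.cosine_similarity(expert(x), reference).item()
    print(f"Expert {i}: cosine similarity to expert 0's output = {sim:.2f}")
```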
Each expert transforms the same input differently. After training, a well-functioning MoE learns to route tokens to experts whose transformations are most appropriate for that token type. The router, which we'll explore in the next chapter, learns to make these assignments based on the input representation.
Key Parameters
The key parameters for the Expert implementation are:
- d_model: The dimensionality of input and output tokens. Matches the model's hidden size.
- d_ff: The intermediate expansion dimension within each expert FFN.
- num_experts: The total number of experts available in the layer.
- capacity_factor: A multiplier determining the maximum number of tokens an expert can process relative to a perfectly balanced load.
Limitations and Practical Considerations
Expert networks introduce several challenges that you must navigate. The most immediate is memory overhead: storing $N$ experts requires $N$ times the memory of a single FFN layer, even though most experts remain idle for any given token. A Mixtral 8x7B model stores 8 separate FFN copies per MoE layer, consuming significant GPU memory despite only activating 2 at inference time. This memory-compute asymmetry means MoE models often require more hardware to deploy than their per-token FLOP count would suggest.
Expert underutilization presents another concern. Without careful load balancing, some experts may receive far fewer tokens during training, leading to undertrained weights that hurt model quality when they are selected. This creates a feedback loop: underutilized experts produce worse outputs, causing the router to select them even less frequently. The auxiliary losses we'll discuss in upcoming chapters directly address this problem.
The capacity mechanism itself introduces discontinuities in the training signal. Tokens dropped due to capacity overflow still contribute to the loss but receive no expert processing, creating a mismatch between forward computation and gradient flow. Various techniques attempt to make this process more differentiable, but some information loss is inherent to hard capacity constraints.
Finally, expert placement decisions remain somewhat empirical. While alternating dense and MoE layers has become standard, the optimal interleaving pattern likely varies by task and scale. Some recent work suggests that certain layer positions benefit more from expert diversity than others, but clear design principles are still emerging.
Summary
Expert networks form the computational substrate of Mixture of Experts models. The key insights from this chapter include:
-
Experts are independent FFNs that mirror standard transformer feed-forward layers. The position-independence of FFN operations makes them natural candidates for sparse expert-based routing.
-
Expert capacity manages the tension between flexible routing and efficient batched computation. The capacity factor determines how much routing imbalance the system tolerates before dropping tokens.
-
Expert count involves trade-offs between specialization potential, memory requirements, and utilization efficiency. Successful models range from 8 experts (Mixtral) to thousands (Switch Transformer).
-
Expert placement typically replaces FFN layers while keeping attention layers dense. Alternating dense and MoE layers is a common pattern that balances routing overhead with expert specialization.
With the expert architecture established, the natural question becomes: how does the model decide which expert should process each token? The next chapter on gating networks addresses exactly this question, introducing the routing mechanisms that make sparse expert selection possible.