Expert Networks: MoE Architecture & FFN Implementation

Michael Brenndoerfer · November 13, 2025 · 31 min read

Learn how expert networks power Mixture of Experts models. Explore FFN-based experts, capacity factors, expert counts, and transformer placement strategies.


Expert Networks

In the previous chapter, we explored how sparse models achieve computational efficiency by activating only a subset of parameters for each input. The Mixture of Experts (MoE) architecture realizes this principle through a collection of specialized subnetworks, each designed to handle different aspects of the input distribution. These subnetworks, called experts, form the computational backbone of MoE models.

This chapter examines what experts are, how they're structured, and the architectural decisions that determine their effectiveness. We'll see why the feed-forward network became the dominant choice for expert implementation, how to manage the flow of tokens through experts during batch processing, and where to place expert layers within the transformer architecture.

What Is an Expert?

Expert Network

An expert is an independent neural network component within an MoE layer. Each expert processes tokens routed to it, producing output representations that are combined based on routing weights.

The term "expert" suggests specialization, and MoE experts often develop distinct functional roles during training. Some experts might specialize in processing mathematical expressions, others in handling conversational text, and still others in domain-specific terminology. This specialization emerges naturally from the routing mechanism rather than being explicitly programmed. The model discovers, through gradient descent, that certain experts become more effective at processing certain types of inputs, and the routing mechanism learns to exploit this emergent structure.

To understand what an expert actually does, we need to think about it as a transformation function. Conceptually, an expert is a function that maps an input vector to an output vector within the same space. This means that when a token representation enters an expert, it emerges transformed but still compatible with the rest of the model's architecture:

$$E_i: \mathbb{R}^d \to \mathbb{R}^d$$

where:

  • $E_i$: the function representing the $i$-th expert
  • $d$: the dimension of the token representation (matching the model's hidden dimension)

This function transforms an input representation into an output representation of the same dimensionality. The notation $\mathbb{R}^d$ indicates that both the input and output live in a $d$-dimensional real vector space. This dimensionality constraint is essential because the expert's output must integrate seamlessly with residual connections and subsequent layers in the transformer.

For a set of $N$ experts $\{E_1, E_2, \ldots, E_N\}$, each processes its assigned tokens independently. The key insight is that different tokens can be routed to different experts, enabling the model to apply specialized processing based on input characteristics. Consider a sentence containing both technical terminology and conversational phrases: the technical terms might be routed to one expert that has learned relevant transformations for such vocabulary, while the conversational phrases flow to a different expert optimized for casual language patterns. This routing happens dynamically for each forward pass, adapting the computation to the specific characteristics of each input.

Experts as Feed-Forward Networks

The most common implementation uses standard feed-forward networks as experts. This choice isn't arbitrary; it stems from a fundamental property of transformer FFN layers that we discussed in Part XII. Understanding this property reveals why FFN layers are uniquely suited for expert-based sparsification.

The Position-Independence Property

Recall that the feed-forward network in a transformer block operates identically on each position in the sequence, processing each token in isolation. This property, sometimes called "position-wise" or "token-wise" processing, means that the FFN applies the exact same transformation to every token without any awareness of neighboring tokens. Mathematically, the FFN applies a point-wise transformation:

$$\text{FFN}(x) = W_2 \cdot \sigma(W_1 \cdot x + b_1) + b_2$$

where:

  • $x$: the input vector corresponding to a specific token position
  • $W_1, b_1$: the weight matrix and bias vector for the first linear transformation
  • $W_1 \cdot x + b_1$: the projection into the intermediate expansion dimension
  • $\sigma$: the non-linear activation function (typically ReLU or GELU)
  • $W_2, b_2$: the weight matrix and bias vector for the second linear transformation

Let's trace through this computation to build intuition. The input vector $x$ first undergoes a linear transformation through $W_1$, which projects it from the model dimension $d$ into a higher-dimensional intermediate space (typically 4× larger). The bias term $b_1$ shifts this projection, and the activation function $\sigma$ introduces non-linearity by selectively suppressing or passing through different components of the intermediate representation. Finally, $W_2$ projects the result back to the original dimensionality, with $b_2$ providing a final bias adjustment.
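
To make this concrete, here is a minimal sketch of the two-layer FFN in PyTorch. The dimensions and the GELU activation are illustrative assumptions rather than values from any particular model; the point is that the same weights are applied at every position:

import torch
import torch.nn as nn

# Minimal position-wise FFN sketch; d_model and d_ff are illustrative choices
d_model, d_ff = 512, 2048
w1 = nn.Linear(d_model, d_ff)    # first linear map: W1·x + b1, expands into the intermediate space
w2 = nn.Linear(d_ff, d_model)    # second linear map: W2·h + b2, projects back to d_model
act = nn.GELU()                  # the non-linearity sigma

x = torch.randn(4, 16, d_model)  # (batch, sequence length, d_model)
out = w2(act(w1(x)))             # the same transformation is applied at every position
print(out.shape)                 # torch.Size([4, 16, 512]) -- same shape as the input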

This position-independence is crucial for MoE. Because each token passes through the FFN independently, we can route different tokens to different FFN-style experts without disrupting the computational flow. If the FFN applied to the token at position 3 never interacts with the FFN applied to the token at position 7, there is no reason both must use the same weights. Attention layers, by contrast, require all positions to interact simultaneously through the query-key-value mechanism, making them poor candidates for expert-based sparsification.

Expert FFN Architecture

Each expert in an MoE layer typically mirrors the standard transformer FFN structure, but with its own independent set of parameters. Building on the gated architectures covered in Part XII, Chapter 7, most modern expert implementations use the SwiGLU variant. This architecture splits the input into two parallel paths, one for gating and one for features, allowing the network to dynamically control information flow:

$$\begin{aligned}
\text{Gate}(x) &= \text{SiLU}(W_{\text{gate}}^{(i)} \cdot x) \\
\text{Features}(x) &= W_{\text{up}}^{(i)} \cdot x \\
E_i(x) &= W_{\text{down}}^{(i)} \cdot \left(\text{Gate}(x) \odot \text{Features}(x)\right)
\end{aligned}$$

where:

  • $x$: the input token representation vector
  • $W_{\text{gate}}^{(i)}$: the gating projection weights for expert $i$
  • $W_{\text{up}}^{(i)}$: the feature up-projection weights for expert $i$
  • $\text{SiLU}$: the Sigmoid Linear Unit activation function
  • $\text{Gate}(x)$: the gating mask values computed by the first path
  • $\text{Features}(x)$: the candidate features computed by the second path
  • $\odot$: the element-wise multiplication operator (Hadamard product), which filters the features using the gate
  • $W_{\text{down}}^{(i)}$: the down-projection weights transforming back to the model dimension

To understand why this gated structure is valuable, consider what each component contributes. The architecture processes the input through two parallel paths that serve complementary purposes. The up-projection $W_{\text{up}}^{(i)} x$ generates candidate features by expanding the input into a higher-dimensional space where the model can represent more complex patterns. Meanwhile, the gating path $\text{SiLU}(W_{\text{gate}}^{(i)} x)$ computes a mask of gating values: the SiLU activation pushes negative pre-activations toward zero while letting positive ones pass through almost linearly. Each element of this mask indicates how much the corresponding candidate feature should be preserved.

The element-wise product ($\odot$) then filters the candidates with this mask, allowing the network to suppress or preserve specific features before the final projection $W_{\text{down}}^{(i)}$. This gating mechanism gives the expert fine-grained control over which information flows through: some dimensions might be suppressed almost entirely (gate values near 0), while others pass through strongly or are even amplified. This selective filtering is what allows experts to develop specialized behaviors for different input types.
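
A minimal sketch of such a gated expert in PyTorch appears below. The bias-free projections and the specific intermediate size are assumptions for illustration, not a prescription from any particular model:

import torch
import torch.nn as nn
import torch.nn.functional as F


class SwiGLUExpert(nn.Module):
    """Sketch of one SwiGLU-style expert; dimensions are illustrative."""

    def __init__(self, d_model, d_ff):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)  # gating path
        self.w_up = nn.Linear(d_model, d_ff, bias=False)    # feature (up-projection) path
        self.w_down = nn.Linear(d_ff, d_model, bias=False)  # projection back to d_model

    def forward(self, x):
        # Element-wise product of the SiLU gate and the candidate features
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))


expert = SwiGLUExpert(d_model=256, d_ff=683)  # 683 ≈ (8/3) · 256, a common SwiGLU sizing
print(expert(torch.randn(10, 256)).shape)     # torch.Size([10, 256])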

The key parameters defining an expert's architecture include:

  • Input/output dimension ($d_{\text{model}}$): Matches the transformer's hidden dimension, typically 2048-8192 in large models
  • Intermediate dimension ($d_{\text{ff}}$): The expansion dimension within the expert, often $4 d_{\text{model}}$ for standard FFNs or $\frac{8}{3} d_{\text{model}}$ for SwiGLU to match parameter counts
  • Activation function: Usually SiLU (Swish) for gated variants, or GELU/ReLU for standard two-layer FFNs

Each expert maintains its own weight matrices, completely independent of other experts. This independence is what enables specialization: expert $E_1$ can learn entirely different transformations than expert $E_2$. During training, the gradient updates for each expert depend only on the tokens routed to that expert, allowing different experts to develop distinct computational roles.

Parameter Counting

To evaluate the memory footprint of an MoE layer, we sum the parameters across all experts. Understanding this calculation helps you anticipate memory requirements and make informed architectural decisions. Consider an MoE layer with $N$ experts, where each expert is a standard two-layer FFN with intermediate dimension $d_{\text{ff}}$:

$$\begin{aligned}
N_{\text{weights}} &= 2 \cdot d_{\text{model}} \cdot d_{\text{ff}} \\
N_{\text{biases}} &= d_{\text{ff}} + d_{\text{model}} \\
N_{\text{expert}} &= N_{\text{weights}} + N_{\text{biases}} \\
&= 2 \cdot d_{\text{model}} \cdot d_{\text{ff}} + d_{\text{ff}} + d_{\text{model}}
\end{aligned}$$

where:

  • $N_{\text{expert}}$: the total parameter count for a single expert
  • $N_{\text{weights}}$: the parameter count for the weight matrices
  • $N_{\text{biases}}$: the parameter count for the bias vectors
  • $d_{\text{model}}$: the hidden dimension size of the model
  • $d_{\text{ff}}$: the intermediate dimension size of the feed-forward network
  • $2 \cdot d_{\text{model}} \cdot d_{\text{ff}}$: parameters for the two weight matrices ($W_1$ and $W_2$)
  • $d_{\text{ff}} + d_{\text{model}}$: parameters for the bias vectors ($b_1$ and $b_2$)

Let's trace through why the weight matrix formula has the factor of 2. The first weight matrix $W_1$ has shape $(d_{\text{ff}}, d_{\text{model}})$, mapping from the model dimension to the intermediate dimension, which contributes $d_{\text{model}} \times d_{\text{ff}}$ parameters. The second weight matrix $W_2$ has shape $(d_{\text{model}}, d_{\text{ff}})$, mapping back from the intermediate dimension to the model dimension, contributing another $d_{\text{ff}} \times d_{\text{model}}$ parameters. Together, these two matrices account for the $2 \cdot d_{\text{model}} \cdot d_{\text{ff}}$ term. The bias vectors add a relatively small number of additional parameters: $d_{\text{ff}}$ for $b_1$ and $d_{\text{model}}$ for $b_2$.

For $N$ experts, the total MoE layer parameters are approximately $N \times 2 \times d_{\text{model}} \times d_{\text{ff}}$. A model with 8 experts in each MoE layer has roughly 8× the FFN parameters of an equivalent dense model. However, if only 2 experts activate per token, the compute per forward pass remains comparable to the dense baseline. This is the fundamental trade-off of MoE: we store many more parameters than we use for any single input, gaining model capacity without proportional computational cost.
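
A quick back-of-the-envelope calculation makes the trade-off tangible. The dimensions below are illustrative assumptions chosen to resemble a large model, not the specification of any particular architecture:

# Illustrative parameter count for an MoE layer of two-layer FFN experts
d_model, d_ff = 4096, 14336                        # assumed dimensions for this example
num_experts, active_experts = 8, 2

per_expert = 2 * d_model * d_ff + d_ff + d_model   # weights plus biases, per the formula above
total_params = num_experts * per_expert            # stored in memory
active_params = active_experts * per_expert        # actually used per token

print(f"Per expert:       {per_expert:>13,}")      # ~117.5M
print(f"Total (stored):   {total_params:>13,}")    # ~940M
print(f"Active per token: {active_params:>13,}")   # ~235M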

Out[2]:
Visualization
Comparison of total versus active parameters as expert counts increase. While total parameters (red) scale linearly with the number of experts, active parameters per token (green) remain nearly constant, enabling significant capacity scaling without proportional increases in computational cost.

Expert Capacity

When processing batches of sequences, MoE models face a unique challenge: different tokens route to different experts, but hardware prefers uniform workloads. Expert capacity mechanisms manage this tension by imposing limits on how many tokens each expert can process.

The Batching Problem

Consider a batch of 1024 tokens being processed by an 8-expert MoE layer. If routing were perfectly balanced, each expert would receive exactly 128 tokens. In practice, some tokens cluster around certain experts while others receive few tokens. Without constraints, popular experts become computational bottlenecks while unpopular experts sit idle.

This load imbalance creates practical problems for efficient computation. Modern hardware, particularly GPUs, achieves peak throughput when performing regular, predictable operations on uniformly-sized data. When one expert must process 300 tokens while another processes only 50, we cannot fully utilize parallel processing capabilities. The expert with 300 tokens becomes a serialization point, and the expert with 50 tokens wastes potential compute cycles.

Capacity Factor

To prevent bottlenecks where popular experts become overloaded, we calculate the expert capacity $K$ (the maximum tokens per expert) based on the total batch size and a buffer multiplier. This approach provides a principled way to balance routing flexibility against computational efficiency:

$$K = C \cdot \frac{T}{E}$$

where:

  • $K$: the expert capacity (maximum tokens per expert)
  • $C$: the capacity factor (typically $\ge 1.0$), acting as a buffer multiplier
  • $T$: the total number of tokens in the batch
  • $E$: the total count of available experts
  • $T/E$: the baseline load per expert under perfectly balanced routing

The intuition behind this formula is straightforward. The ratio $T/E$ represents the average number of tokens each expert would receive under perfectly balanced routing. In our example with 1024 tokens and 8 experts, this baseline is 128 tokens per expert. The capacity factor $C$ then scales this baseline to determine how much deviation from perfect balance we allow.

A capacity factor of 1.0 means each expert can handle exactly its "fair share" of tokens. Values greater than 1.0 provide headroom for imbalanced routing.
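
The calculation itself is a one-liner. The sketch below rounds up so the buffer never falls below the fair share; whether an implementation rounds up or truncates is a detail that varies (the demonstration code later in this chapter truncates):

import math

def expert_capacity(num_tokens, num_experts, capacity_factor):
    # K = C * T / E, rounded up to a whole number of token slots
    return math.ceil(capacity_factor * num_tokens / num_experts)

print(expert_capacity(1024, 8, 1.0))   # 128 tokens per expert (the fair share)
print(expert_capacity(1024, 8, 1.25))  # 160 tokens per expert (25% headroom)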

The capacity factor affects model behavior in several ways:

  • $C = 1.0$: Minimal buffer, but tokens may overflow if routing is imbalanced
  • $C = 1.25$: Common choice, providing 25% buffer for routing variance
  • $C = 2.0$: Large buffer, rarely causes overflow but increases memory usage

Choosing the right capacity factor involves balancing memory efficiency against routing flexibility. A higher capacity factor allocates more buffer space for each expert's token queue, consuming more memory. However, it also reduces the probability that tokens are dropped due to capacity overflow, improving model quality.

Out[3]:
Visualization
Expert capacity allocation for different capacity factors (C) with 1024 tokens and 8 experts. The base allocation (blue) represents perfect load balancing, while the buffer headroom (orange) scales with C to accommodate routing variance and prevent token dropping.

Overflow and Dropped Tokens

When an expert reaches capacity, additional tokens routed to it are typically "dropped"; they bypass the MoE layer entirely and pass through via the residual connection. This ensures forward passes complete but means some tokens receive less processing than intended.

Token Dropping

Dropped tokens receive no expert processing, relying solely on the residual connection. High drop rates indicate routing imbalance and can degrade model quality.

The drop rate provides a diagnostic signal. If 10% of tokens are dropped during training, the routing mechanism isn't distributing load effectively, motivating the auxiliary losses we'll explore in upcoming chapters on load balancing. You monitor drop rates as a health metric for your MoE training runs, as sustained high drop rates often correlate with degraded downstream performance.
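
A minimal way to compute this diagnostic from per-expert assignment counts is sketched below; the counts are made-up numbers for a hypothetical 1024-token batch with 8 experts:

import torch

# Hypothetical per-expert assignment counts for a 1024-token batch with 8 experts
assigned = torch.tensor([300, 220, 130, 110, 90, 74, 55, 45])
capacity = 160                                   # e.g. C = 1.25 with T = 1024, E = 8
dropped = (assigned - capacity).clamp(min=0)     # tokens beyond capacity are dropped
drop_rate = dropped.sum().item() / assigned.sum().item()
print(f"Drop rate: {100 * drop_rate:.1f}%")      # (140 + 60) / 1024 -> ~19.5%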

Capacity During Training vs. Inference

Training typically uses fixed capacity factors to enable efficient batched operations. The predictable memory allocation allows for optimized kernel implementations and stable training dynamics. During inference, especially for single sequences, capacity constraints may be relaxed. Some implementations process all routed tokens regardless of capacity during inference, accepting variable compute time for better quality. This flexibility makes sense because inference often involves smaller batches where strict capacity constraints provide less benefit, and you typically prefer accuracy over strict latency guarantees.

Expert Count Selection

Choosing the number of experts involves balancing specialization, memory, and computational efficiency. Different models have explored configurations ranging from 8 to thousands of experts. The optimal choice depends on the training data distribution, available hardware, and target use case.

Trade-offs in Expert Count

Increasing the number of experts has several effects:

Benefits of more experts:

  • Greater total parameter count without proportional compute increase
  • Finer-grained specialization potential
  • Better scaling on distributed systems with expert parallelism

Costs of more experts:

  • More memory to store all expert weights
  • Communication overhead when experts span devices
  • Risk of some experts being underutilized

Empirical Findings

Major MoE models have converged on different expert counts based on their design goals:

Expert count configurations across major MoE architectures:

| Model | Experts | Active Experts | Rationale |
|---|---|---|---|
| Switch Transformer | 128-2048 | 1 | Maximum sparsity, extreme scaling |
| Mixtral 8x7B | 8 | 2 | Balance simplicity with capacity |
| GShard | 2048 | 2 | Very large scale translation |
| DeepSeek-MoE | 64 | 6 | Finer-grained routing |

Switch Transformer demonstrated that even single-expert routing (one active expert per token) works well, enabling massive expert counts. Mixtral showed that modest expert counts (8) with simple routing achieve strong results while remaining practical to deploy.

Out[4]:
Visualization
Comparison of total and active expert counts across Mixtral, DeepSeek, Switch Transformer, and GShard. While total expert counts (gray) range from 8 to 2048, the number of active experts per token (green) remains consistently low, illustrating the fundamental sparsity of different MoE architectures.
Ratios of total capacity to active computation for different MoE architectures. The Switch Transformer achieves over 500x capacity leverage, demonstrating how sparse routing allows for massive parameter counts with minimal active compute per token.

The Capacity-Compute Relationship

Increasing the number of experts while keeping only a few active per token raises the capacity-to-compute ratio. A model with 64 experts but only 2 active has 32× more parameters than its per-token compute would suggest. This ratio affects:

  • Memory requirements (all expert weights must be stored)
  • Potential for overfitting (more capacity than data may support)
  • Inference efficiency (expert weights may need to be loaded from slower memory)

Shared vs. Independent Experts

Some architectures include "shared" experts that process every token alongside the sparse routing to specialized experts. DeepSeek-MoE uses this approach, with 2 shared experts plus routing among 64 specialized experts. Shared experts capture common transformations while specialized experts handle token-specific processing.
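
The sketch below shows one way such a combination could be wired, assuming a plain FFN for both the shared and routed experts and random stand-in router weights; it illustrates the idea rather than DeepSeek-MoE's actual implementation:

import torch
import torch.nn as nn
import torch.nn.functional as F


class FFNExpert(nn.Module):
    """Plain FFN expert used for both the shared and routed slots in this sketch."""

    def __init__(self, d_model, d_ff):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)
        self.w2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)))


d_model, d_ff, num_routed = 256, 512, 4
shared_expert = FFNExpert(d_model, d_ff)  # processes every token
routed_experts = nn.ModuleList([FFNExpert(d_model, d_ff) for _ in range(num_routed)])

x = torch.randn(6, d_model)                              # 6 tokens
router_weights = torch.rand(6, num_routed).softmax(-1)   # stand-in for learned router outputs
top_w, top_idx = router_weights.topk(2, dim=-1)          # 2 routed experts active per token

with torch.no_grad():  # demonstration only
    out = shared_expert(x)                               # shared path: applied to all tokens
    for t in range(x.shape[0]):                          # routed path: add weighted expert outputs
        for w, idx in zip(top_w[t], top_idx[t]):
            out[t] += w * routed_experts[int(idx)](x[t])
print(out.shape)  # torch.Size([6, 256])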

Expert Placement in Transformer

Not every layer in a transformer needs to be an MoE layer. The placement of expert layers significantly impacts model behavior and efficiency. This section examines the design patterns that have proven effective in practice.

FFN Replacement Pattern

The standard approach replaces some or all FFN layers with MoE layers while keeping attention layers unchanged. This follows naturally from the position-independence property discussed earlier.

In a transformer block, the computation flows through a sequence of operations. The input first passes through layer normalization, then through the attention mechanism where tokens exchange information. After a residual connection adds the input back to the attention output, another layer normalization prepares the representation for the FFN (or MoE) layer. Finally, another residual connection completes the block. The FFN position, coming after attention has already enabled cross-position communication, becomes the natural location for the MoE layer.

The FFN position becomes the MoE position. Attention layers remain dense, ensuring all positions can still interact before expert-specialized processing. This design principle reflects the distinct roles of attention and FFN layers in transformers: attention handles information routing between positions, while FFN layers transform the representation at each position independently.
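
Expressed as a rough PyTorch sketch, a pre-norm block with an MoE layer in the FFN slot could look like the following; the `attention` and `moe` arguments are assumed to be supplied modules, and the class is illustrative rather than a specific library's API:

import torch.nn as nn


class TransformerBlockWithMoE(nn.Module):
    """Sketch of a pre-norm transformer block where the FFN slot holds an MoE layer."""

    def __init__(self, d_model, attention, moe):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attention = attention     # dense: all positions interact here
        self.norm2 = nn.LayerNorm(d_model)
        self.moe = moe                 # sparse: per-token expert routing replaces the FFN

    def forward(self, x):
        x = x + self.attention(self.norm1(x))  # attention sublayer plus residual
        x = x + self.moe(self.norm2(x))        # MoE sublayer plus residual
        return x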

Interleaving Strategies

Several patterns have been explored for alternating dense and MoE layers:

  • Every layer MoE: All FFN layers become MoE layers. Maximizes expert utilization but increases routing overhead and may not improve quality proportionally.

  • Alternating (every 2nd layer): Common in practice. Every other FFN layer is replaced with MoE. Mixtral uses this pattern. Provides regular dense layers for shared processing between expert layers.

  • Periodic (every nth layer): More aggressive interleaving, such as MoE at every 4th layer. Reduces router overhead but limits expert specialization opportunities.

Out[5]:
Visualization
MoE layer placement strategies in a 12-layer transformer. The alternating pattern (every 2nd layer) is a common configuration that balances the routing overhead of experts (red) with the shared processing of dense FFN layers (gray).

Layer-Specific Expert Counts

Some architectures vary the number of experts by layer depth. Early layers might use fewer experts (patterns are more uniform), while later layers use more experts (task-specific specialization increases). This pattern appears in certain custom architectures though major models typically use uniform expert counts.

Cross-Layer Expert Sharing

An emerging technique shares expert weights across multiple layers while maintaining layer-specific routers. This reduces total parameters while preserving adaptive routing. Each layer can route differently even if the underlying expert computations are shared.

Code Implementation

Let's implement a simple expert network structure and explore capacity-constrained processing.

In[6]:
Code
import torch.nn as nn
import torch.nn.functional as F


## Expert as a standard FFN with SiLU activation
class Expert(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=True)
        self.w2 = nn.Linear(d_ff, d_model, bias=True)

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)))


## Create experts with shared configuration
d_model = 256
d_ff = 512
num_experts = 8

experts = nn.ModuleList([Expert(d_model, d_ff) for _ in range(num_experts)])

Each expert is an independent FFN. Let's examine the parameter counts:

In[7]:
Code
def count_parameters(module):
    return sum(p.numel() for p in module.parameters())


single_expert_params = count_parameters(experts[0])
total_expert_params = count_parameters(experts)
Out[8]:
Console
Parameters per expert: 262,912
Total expert parameters: 2,103,296
Expansion factor: 8x over single FFN

The total parameter count (approx. 2.1M) reflects the sum of all experts, but the active parameters per token remain small (approx. 260k). This demonstrates how MoE increases model capacity without proportionally increasing inference compute.

Now let's implement capacity-constrained expert processing. This simulates how tokens are assigned to experts with overflow handling:

In[9]:
Code
import torch


def process_with_capacity(
    tokens,  # (batch_size * seq_len, d_model)
    expert_assignments,  # (num_tokens,) - which expert each token goes to
    experts,  # list of expert modules
    capacity_factor=1.25,
):
    """
    Process tokens through experts with capacity constraints.
    Returns processed tokens and drop statistics.
    """
    num_tokens = tokens.shape[0]
    num_experts = len(experts)
    d_model = tokens.shape[1]

    # Calculate capacity per expert
    base_capacity = num_tokens // num_experts
    capacity = int(capacity_factor * base_capacity)

    # Track assignments and outputs
    output = torch.zeros_like(tokens)
    expert_counts = torch.zeros(num_experts, dtype=torch.long)
    dropped_counts = torch.zeros(num_experts, dtype=torch.long)

    # Process each token
    for i in range(num_tokens):
        expert_idx = expert_assignments[i].item()

        if expert_counts[expert_idx] < capacity:
            # Expert has capacity: process the token
            with torch.no_grad():  # For demonstration only
                output[i] = experts[expert_idx](tokens[i : i + 1]).squeeze(0)
            expert_counts[expert_idx] += 1
        else:
            # Expert at capacity: token is dropped (uses residual)
            output[i] = tokens[i]  # Pass through unchanged
            dropped_counts[expert_idx] += 1

    return output, expert_counts, dropped_counts

Let's simulate imbalanced routing to see capacity constraints in action:

In[10]:
Code
import torch
import torch.nn as nn
import torch.nn.functional as F

## Recreate the experts so this cell is self-contained
d_model = 256
d_ff = 512
num_experts = 8


class Expert(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=True)
        self.w2 = nn.Linear(d_ff, d_model, bias=True)

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)))


experts = nn.ModuleList([Expert(d_model, d_ff) for _ in range(num_experts)])

## Generate random tokens
torch.manual_seed(42)
batch_tokens = 256
tokens = torch.randn(batch_tokens, d_model)

## Create imbalanced routing: some experts more "popular"
## Expert 0 and 1 get most tokens
probs = torch.tensor([0.3, 0.25, 0.1, 0.1, 0.08, 0.07, 0.05, 0.05])
expert_assignments = torch.multinomial(probs, batch_tokens, replacement=True)

## Define capacity factor
capacity_factor = 1.25

## Process with capacity constraint
output, counts, dropped = process_with_capacity(
    tokens, expert_assignments, experts, capacity_factor=capacity_factor
)

total_dropped = dropped.sum().item()
drop_rate = 100 * total_dropped / batch_tokens

## Calculate per-expert statistics
expert_stats = []
for i in range(num_experts):
    assigned = (expert_assignments == i).sum().item()
    processed = counts[i].item()
    dropped_count = dropped[i].item()
    expert_stats.append((i, assigned, processed, dropped_count))
Out[11]:
Console
Expert utilization with capacity_factor=1.25:
Base capacity: 32 tokens per expert
Actual capacity: 40 tokens per expert

Expert 0: 82 assigned, 40 processed, 42 dropped
Expert 1: 65 assigned, 40 processed, 25 dropped
Expert 2: 28 assigned, 28 processed, 0 dropped
Expert 3: 22 assigned, 22 processed, 0 dropped
Expert 4: 22 assigned, 22 processed, 0 dropped
Expert 5: 15 assigned, 15 processed, 0 dropped
Expert 6: 10 assigned, 10 processed, 0 dropped
Expert 7: 12 assigned, 12 processed, 0 dropped

Total drop rate: 26.2%
Out[12]:
Visualization
Token load distribution across 8 experts under imbalanced routing conditions. Popular experts (e.g., E0, E1) exceed the capacity threshold (dashed line) and must drop excess tokens (red), while underutilized experts leave allocated capacity unused.

The imbalanced routing causes popular experts to drop tokens. Let's visualize how different capacity factors affect drop rates:

In[13]:
Code
capacity_factors = [1.0, 1.1, 1.25, 1.5, 1.75, 2.0]
drop_rates = []

for cf in capacity_factors:
    _, _, dropped = process_with_capacity(
        tokens, expert_assignments, experts, cf
    )
    drop_rate = 100 * dropped.sum().item() / batch_tokens
    drop_rates.append(drop_rate)
Out[14]:
Visualization
Relationship between capacity factor and token drop rate for the simulated imbalanced routing. Increasing the capacity factor from 1.0 to 2.0 reduces the drop rate from roughly 32% to about 7%, illustrating the trade-off between memory usage and information preservation.

The plot demonstrates that increasing the capacity factor substantially reduces the drop rate. Even at C=2.0, however, the heavy skew toward experts 0 and 1 means some tokens are still dropped.

Worked Example: Expert Specialization

Let's trace through how different tokens might be processed by different experts. Consider a sequence containing both code and natural language:

In[15]:
Code
import torch

## Simulated hidden representations for different token types
## (In practice, these come from earlier transformer layers)
code_token = torch.randn(1, d_model)  # Represents "def"
english_token = torch.randn(1, d_model)  # Represents "the"
math_token = torch.randn(1, d_model)  # Represents "∫"

## Each expert produces different outputs
## due to their independent weights
with torch.no_grad():
    code_outputs = [experts[i](code_token) for i in range(num_experts)]
    english_outputs = [experts[i](english_token) for i in range(num_experts)]

## Measure output diversity across experts
code_var = torch.stack(code_outputs).var(dim=0).mean().item()
english_var = torch.stack(english_outputs).var(dim=0).mean().item()
Out[16]:
Console
Code token output variance across experts: 0.0356
English token output variance across experts: 0.0363

Each expert transforms the same input differently. After training, a well-functioning MoE learns to route tokens to experts whose transformations are most appropriate for that token type. The router, which we'll explore in the next chapter, learns to make these assignments based on the input representation.

Out[17]:
Visualization
L2 distance between individual expert outputs and the mean output for the same code token. The substantial deviation of each expert from the mean demonstrates that experts learn distinct transformations, providing specialized processing for the same input.

Key Parameters

The key parameters for the Expert implementation are:

  • d_model: The dimensionality of input and output tokens. Matches the model's hidden size.
  • d_ff: The intermediate expansion dimension within each expert FFN.
  • num_experts: The total number of experts available in the layer.
  • capacity_factor: A multiplier determining the maximum number of tokens an expert can process relative to a perfectly balanced load.

Limitations and Practical Considerations

Expert networks introduce several challenges that you must navigate. The most immediate is memory overhead: storing $N$ experts requires $N\times$ the memory of a single FFN layer, even though most experts remain idle for any given token. A Mixtral 8x7B model stores 8 separate FFN copies per MoE layer, consuming significant GPU memory despite only activating 2 at inference time. This memory-compute asymmetry means MoE models often require more hardware to deploy than their per-token FLOP count would suggest.

Expert underutilization presents another concern. Without careful load balancing, some experts may receive far fewer tokens during training, leading to undertrained weights that hurt model quality when they are selected. This creates a feedback loop: underutilized experts produce worse outputs, causing the router to select them even less frequently. The auxiliary losses we'll discuss in upcoming chapters directly address this problem.

The capacity mechanism itself introduces discontinuities in the training signal. Tokens dropped due to capacity overflow still contribute to the loss but receive no expert processing, creating a mismatch between forward computation and gradient flow. Various techniques attempt to make this process more differentiable, but some information loss is inherent to hard capacity constraints.

Finally, expert placement decisions remain somewhat empirical. While alternating dense and MoE layers has become standard, the optimal interleaving pattern likely varies by task and scale. Some recent work suggests that certain layer positions benefit more from expert diversity than others, but clear design principles are still emerging.

Summary

Expert networks form the computational substrate of Mixture of Experts models. The key insights from this chapter include:

  • Experts are independent FFNs that mirror standard transformer feed-forward layers. The position-independence of FFN operations makes them natural candidates for sparse expert-based routing.

  • Expert capacity manages the tension between flexible routing and efficient batched computation. The capacity factor determines how much routing imbalance the system tolerates before dropping tokens.

  • Expert count involves trade-offs between specialization potential, memory requirements, and utilization efficiency. Successful models range from 8 experts (Mixtral) to thousands (Switch Transformer).

  • Expert placement typically replaces FFN layers while keeping attention layers dense. Alternating dense and MoE layers is a common pattern that balances routing overhead with expert specialization.

With the expert architecture established, the natural question becomes: how does the model decide which expert should process each token? The next chapter on gating networks addresses exactly this question, introducing the routing mechanisms that make sparse expert selection possible.

