Learn how expert parallelism distributes MoE experts across devices using all-to-all communication, enabling efficient training of trillion-parameter models.

Expert Parallelism
Mixture of Experts models achieve their remarkable efficiency by activating only a subset of parameters for each token. As we explored in previous chapters, a model with 64 experts might only use 2 experts per token, giving you the capacity of a massive model with the compute cost of a much smaller one. But this creates a fundamental systems challenge: where do all those experts physically reside, and how do tokens get to the right experts when they're spread across multiple devices?
Expert parallelism is the distributed computing strategy designed specifically for MoE architectures. Unlike data parallelism (which replicates the entire model) or tensor parallelism (which shards individual layers), expert parallelism distributes the expert networks themselves across devices while keeping tokens flowing to whichever experts they need. This chapter examines how this works in practice, from expert placement decisions to the all-to-all communication patterns that make it possible.
The Distribution Challenge
To understand why expert parallelism is necessary, consider a concrete scenario. You're training an MoE model with 64 experts, where each expert is a two-layer feed-forward network that projects a model dimension of 4096 up to a hidden dimension of 16384 and back down. This is a common configuration where each expert acts as an independent specialist. Each expert therefore contains roughly 134 million parameters. With 64 experts, the expert layers alone require storing 8.6 billion parameters, not counting the attention layers, embeddings, and other components.
This parameter count comes from a simple calculation. Each expert consists of two linear transformations: one that projects from the model dimension to the feed-forward hidden dimension, and another that projects back. When you multiply the model dimension by the expanded hidden dimension for each of these two layers, and then multiply by the number of experts, the total parameter count grows rapidly. The mathematics is simple, but the implications for hardware are significant.
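As a quick sanity check (assuming the usual 4x feed-forward expansion, so a hidden dimension of 16384 for a model dimension of 4096), the arithmetic looks like this:

```python
# Rough parameter count for the expert layers of a 64-expert MoE.
d_model = 4096
d_ff = 4 * d_model          # 16384, assumed 4x expansion
num_experts = 64

params_per_expert = 2 * d_model * d_ff            # up-projection + down-projection
total_expert_params = num_experts * params_per_expert

print(f"Per expert:  {params_per_expert / 1e6:.0f}M parameters")    # ~134M
print(f"All experts: {total_expert_params / 1e9:.1f}B parameters")  # ~8.6B
```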
No single GPU can efficiently hold all of this. Even if memory weren't a constraint, we'd want to leverage multiple devices to parallelize computation. The question becomes: how do we distribute experts across devices while maintaining efficient training?
With 64 experts and 8 GPUs, expert parallelism places 8 experts on each device. This is the core principle: partition the expert pool across the available devices rather than replicating it. Instead of asking each device to maintain a copy of every expert (which would be memory-prohibitive), we ask each device to host a specific subset of experts. Tokens that need those experts must travel to that device, but the memory burden is distributed evenly across the cluster.
Expert Placement Strategies
Once we commit to distributing experts across devices, we face a design decision: which experts should live on which devices? This seemingly simple question has meaningful implications for system performance, communication patterns, and load balance.
The simplest and most common strategy is uniform placement, where experts are evenly divided across devices. With $E$ experts and $D$ devices, each device holds $E/D$ experts. This approach requires no prior knowledge about how experts will be used and creates predictable, symmetric communication patterns.
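A small sketch of this mapping for 64 experts over 8 devices (the helper name is illustrative):

```python
# Uniform contiguous placement: experts 0-7 on device 0, 8-15 on device 1, ...
num_experts = 64
num_devices = 8
experts_per_device = num_experts // num_devices  # 8

def expert_to_device(expert_id: int) -> int:
    """Which device hosts a given expert under contiguous placement."""
    return expert_id // experts_per_device

placement = {d: list(range(d * experts_per_device, (d + 1) * experts_per_device))
             for d in range(num_devices)}

print(placement[0])          # [0, 1, 2, 3, 4, 5, 6, 7]
print(expert_to_device(42))  # expert 42 lives on device 5
```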
The placement mapping shown above illustrates the contiguous assignment strategy: experts 0 through 7 reside on device 0, experts 8 through 15 on device 1, and so forth. This contiguous assignment simplifies the mental model and makes it easy to compute which device holds any given expert through simple integer division.
Uniform placement works well when load balancing is effective and all experts receive roughly equal traffic. However, as we discussed in the chapters on load balancing and auxiliary losses, achieving perfect balance is challenging. Some experts inevitably become more popular than others, leading to computational hotspots where certain devices must process more tokens than their peers.
Capacity-Aware Placement
An alternative strategy accounts for expected expert utilization. If certain experts consistently receive more tokens, placing them on separate devices can better balance computation. The intuition is simple: if experts 0 and 1 are both heavily used, placing them on the same device would overload it while leaving others underused. By spreading popular experts across different devices, we can achieve better computational balance.
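One way to sketch such a greedy heuristic (the per-expert loads below are synthetic, drawn only to illustrate the variance reduction):

```python
import numpy as np

rng = np.random.default_rng(0)
num_experts, num_devices = 64, 8
experts_per_device = num_experts // num_devices

# Synthetic per-expert load (e.g., expected tokens per batch), deliberately skewed.
expected_load = rng.zipf(1.5, size=num_experts).astype(float)

def device_loads(assignment):
    loads = np.zeros(num_devices)
    for expert, device in enumerate(assignment):
        loads[device] += expected_load[expert]
    return loads

# Uniform contiguous placement for comparison.
uniform = [e // experts_per_device for e in range(num_experts)]

# Greedy bin packing: visit experts from heaviest to lightest and place each on
# the least-loaded device that still has room (so memory stays balanced too).
greedy = [None] * num_experts
loads = np.zeros(num_devices)
counts = np.zeros(num_devices, dtype=int)
for e in np.argsort(-expected_load):
    candidates = [d for d in range(num_devices) if counts[d] < experts_per_device]
    d = min(candidates, key=lambda dev: loads[dev])
    greedy[e] = d
    loads[d] += expected_load[e]
    counts[d] += 1

print("uniform load variance:", device_loads(uniform).var())
print("greedy  load variance:", device_loads(greedy).var())
```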
The greedy algorithm presented here processes experts in order of their expected load, always assigning each expert to the device with the lowest current total load. This bin-packing approach, while not optimal in all cases, achieves a meaningful reduction in load variance compared to uniform placement. The variance reduction shown in the output demonstrates that even this simple heuristic can significantly improve balance when expert utilization patterns are non-uniform.
Capacity-aware placement reduces load imbalance but requires knowledge of expert utilization patterns, which may change during training. This creates a practical challenge: the routing patterns that determine expert popularity evolve as the model learns, meaning that a placement decision made at initialization may become suboptimal later in training. In practice, most implementations use uniform placement combined with strong load balancing losses, preferring to shape the routing behavior rather than adapt the physical placement.
All-to-All Communication
Expert parallelism relies on a communication primitive called all-to-all. This primitive is fundamentally different from the collective operations commonly used in dense model training, and understanding its behavior is essential for reasoning about MoE system performance.
Unlike all-reduce (which aggregates values across devices) or all-gather (which collects values from all devices), all-to-all performs a complete exchange where each device sends different data to each other device. Think of it as a postal system where every post office simultaneously sends letters to every other post office, and all the letters arrive at their destinations at roughly the same time. Each device has a different message for each recipient, and all messages are exchanged in a single coordinated operation.
An all-to-all operation takes input tensors partitioned across devices and redistributes them so that each device receives the portions destined for it from all other devices. If device $i$ has data for device $j$, that data moves from $i$ to $j$, and this happens simultaneously for all device pairs. The total volume of data in the system remains constant; only its distribution changes.
Here's how all-to-all fits into the MoE forward pass. The process unfolds in four distinct phases, each essential to the overall computation:
- Routing decision: Each device computes which experts each of its tokens should visit. The gating network produces probabilities over all experts, and the top-k selection determines which experts will process each token.
- First all-to-all (dispatch): Tokens are sent to the devices holding their target experts. A token that was generated on device 0 but needs expert 42 (located on device 5) must travel across the network to device 5.
- Expert computation: Each device processes tokens using its local experts. At this point, each device has received all tokens that need its experts, regardless of where those tokens originated.
- Second all-to-all (combine): Results are sent back to the original devices. The processed token representations must return to the devices where they originated so that the model can continue processing them through subsequent layers.
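To make the dispatch pattern concrete, here is a small simulation under uniform random routing (8 devices, 8 experts per device, 1024 tokens per device; all of these numbers are illustrative):

```python
import torch

torch.manual_seed(0)
num_devices, experts_per_device, top_k = 8, 8, 2
num_experts = num_devices * experts_per_device
tokens_per_device = 1024

# Build the dispatch matrix: entry (i, j) counts token-expert pairs sent i -> j.
dispatch = torch.zeros(num_devices, num_devices, dtype=torch.long)
for src in range(num_devices):
    # Uniform random top-k expert choices for the tokens on this source device.
    experts = torch.randint(num_experts, (tokens_per_device, top_k))
    dst = experts // experts_per_device  # device holding each chosen expert
    dispatch[src] = torch.bincount(dst.flatten(), minlength=num_devices)

print(dispatch)             # roughly equal traffic from every source to every destination
print(dispatch.diagonal())  # "local" routings that never cross the network
```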
The dispatch matrix reveals the communication pattern in concrete terms. Each row represents a source device, and each column represents a destination device. The value at position (i, j) tells us how many token-expert pairs device i sends to device j during the dispatch phase. With uniform random routing and 8 devices, each device sends roughly equal traffic to all devices (including itself). The values along the diagonal are particularly interesting: these represent tokens that happen to route to experts on their original device, requiring no network communication. These "local" routings are essentially free from a communication perspective, though they still require computation.
The All-to-All Operation
PyTorch's distributed library provides all_to_all for this communication pattern. The operation coordinates all devices to simultaneously exchange data according to their individual send and receive specifications. This coordination is complex at scale, as it requires synchronization across all participating devices.
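A minimal sketch of the bookkeeping involved (4 tokens, top-2 routing over 8 experts split across 2 devices; the routing indices are hard-coded for illustration, and the collective call only runs when a process group is initialized with one rank per device):

```python
import torch
import torch.distributed as dist

num_devices, experts_per_device, hidden = 2, 4, 16
tokens = torch.randn(4, hidden)

# Hard-coded top-2 expert choices for the 4 tokens (illustrative).
topk_indices = torch.tensor([[0, 4], [1, 5], [2, 6], [3, 7]])

# Destination device for every token-expert pair, then counts per device.
dst = topk_indices // experts_per_device                  # shape (4, 2)
send_counts = torch.bincount(dst.flatten(), minlength=num_devices)
print("token-expert pairs per destination device:", send_counts.tolist())  # [4, 4]

# Order the expanded tokens by destination so each device's chunk is contiguous.
order = torch.argsort(dst.flatten(), stable=True)
send_buffer = tokens.repeat_interleave(2, dim=0)[order]

# With an initialized process group, the exchange itself would look like this:
if dist.is_initialized():
    recv_counts = torch.empty_like(send_counts)
    dist.all_to_all_single(recv_counts, send_counts)      # exchange sizes first
    recv_buffer = send_buffer.new_empty(int(recv_counts.sum()), hidden)
    dist.all_to_all_single(recv_buffer, send_buffer,
                           output_split_sizes=recv_counts.tolist(),
                           input_split_sizes=send_counts.tolist())
```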
The dispatch counts demonstrate the core accounting that drives the all-to-all operation. In this small example with 4 tokens and 2 experts per token, we see that 4 token-expert pairs are destined for device 0 and 4 for device 1, reflecting a balanced routing pattern across the two devices.
The critical insight is that all-to-all enables dynamic routing: each token goes exactly where it needs to go based on the gating network's decision. This differs fundamentally from static parallelism strategies where data movement patterns are fixed. In tensor parallelism, for example, the communication pattern is determined entirely by the model architecture and is identical for every input. In expert parallelism, the communication pattern depends on the content of the input itself, as mediated by the learned gating network. This dynamic behavior is both the source of MoE's flexibility and the root of its systems complexity.
Communication Overhead Analysis
Expert parallelism introduces communication costs that don't exist in dense models. Understanding these costs is essential for system design and performance optimization. While dense model training also involves communication (for gradient synchronization, for example), the all-to-all operations in MoE occur in the forward pass itself, directly adding latency to every training step.
Communication Volume
For each MoE layer, we perform two all-to-all operations. The first operation dispatches tokens from their originating devices to the devices holding their target experts. The second operation returns the processed results back to the originating devices. Each operation moves the same amount of data, as every token that goes out must come back.
The total communication volume (in elements) per MoE layer is:

$$V_{\text{MoE layer}} = 2 \cdot B \cdot S \cdot k \cdot d_{\text{model}}$$

Let's examine each component of this formula to understand what drives communication cost:
- $B$: the batch size (number of sequences). Larger batches increase throughput but also increase the total number of tokens that must be routed.
- $S$: the sequence length (number of tokens per sequence). Longer sequences mean more tokens per batch, directly scaling communication.
- $k$: the top-$k$ parameter (number of experts per token). When $k = 2$, each token creates two communication events rather than one.
- $d_{\text{model}}$: the hidden dimension (size of each token vector). Larger models with wider hidden dimensions must send more data per token.
- $2$: accounts for the two all-to-all operations (dispatch and combine). This factor is fixed by the algorithm structure.
The multiplicative relationship between these factors means that communication volume can grow quickly. Doubling any single factor doubles the communication cost. Doubling all of them together would increase communication by a factor of 16.
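For concreteness, here is the formula evaluated at one plausible setting (a batch of 128 sequences of length 8192, top-2 routing, hidden dimension 4096, bf16 activations, a 200 GB/s interconnect; all of these values are assumptions for illustration):

```python
# Communication volume for one MoE layer: V = 2 * B * S * k * d elements.
B, S, k, d = 128, 8192, 2, 4096     # assumed batch, sequence length, top-k, hidden dim
bytes_per_element = 2               # bf16
bandwidth = 200e9                   # assumed 200 GB/s interconnect

elements = 2 * B * S * k * d                       # dispatch + combine
gibibytes = elements * bytes_per_element / 2**30   # ~32 GiB
seconds = elements * bytes_per_element / bandwidth # if transfers were purely sequential

print(f"~{gibibytes:.0f} GiB moved per MoE layer, ~{seconds * 1e3:.0f} ms at 200 GB/s")
```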
This substantial volume of data movement, roughly 32 GB per forward pass for a single batch, highlights why network bandwidth is often the bottleneck in MoE training. To put this in perspective, even with high-bandwidth interconnects capable of 200 GB/s, transferring 32 GB takes around 160 milliseconds if the transfers were purely sequential. The communication cost scales linearly with the number of MoE layers, making efficient routing essential. This is one reason why some architectures use MoE layers only in alternating positions rather than in every layer.
Communication vs Computation Trade-off
The key metric for understanding expert parallelism efficiency is the ratio of communication time to computation time. If communication dominates, adding more devices provides diminishing returns because each device spends more time waiting for data than actually computing. Conversely, if computation dominates, the system can efficiently utilize additional devices.
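A rough way to see the scaling effect discussed below: under uniform routing, only about $1/D$ of token-expert pairs stay on their source device, so the fraction that must cross the network grows with the device count (a simplified model that ignores overlap and non-uniform routing):

```python
# Fraction of token-expert pairs crossing the network under uniform routing.
for num_devices in (2, 4, 8, 16, 32, 64):
    local = 1 / num_devices   # pairs whose chosen expert happens to be local
    remote = 1 - local        # pairs that require an all-to-all transfer
    print(f"{num_devices:>2} devices: {local:6.1%} local, {remote:6.1%} remote")
```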
Several insights emerge from this analysis. First, communication overhead grows with the number of devices because more tokens must cross device boundaries. With 2 devices, half the tokens (on average) route to local experts and require no network transfer. With 64 devices, only 1/64 of tokens stay local, meaning the vast majority must travel across the network.
Second, efficiency remains high when computation dominates, but degrades as we scale to many devices. The efficiency percentages in the rightmost column show this degradation: at 2 devices we might maintain 95% efficiency, but at 64 devices this drops significantly. This suggests a practical limit to how far expert parallelism alone can scale.
Third, high-bandwidth interconnects like NVLink are essential for maintaining efficiency at scale. The analysis above assumes 200 GB/s bandwidth, which is achievable with modern GPU interconnects. Slower connections, such as those between nodes in a cluster connected by InfiniBand or Ethernet, would shift these efficiency curves dramatically downward.
Expert Parallelism Implementation
A complete expert parallelism implementation requires coordinating routing decisions, all-to-all communication, and expert computation. The implementation must handle several subtle challenges: variable-length messages between device pairs, preservation of token ordering for correct gradient computation, and efficient batching of expert computations. Here's a simplified but functional implementation that demonstrates the core concepts:
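(The sketch below is a minimal single-device version of the idea: names such as ExpertParallelMoE are illustrative, and the dispatch/combine all-to-alls are replaced by a local loop over experts so the example runs without a process group.)

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExpertParallelMoE(nn.Module):
    """Single-device sketch of an expert-parallel MoE layer (illustrative)."""

    def __init__(self, d_model=256, d_ff=1024, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts, bias=False)
        # In a real deployment only num_experts / world_size of these live here;
        # the rest are reached through the dispatch/combine all-to-alls.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):  # x: (batch, seq, d_model)
        router_probs = F.softmax(self.router(x), dim=-1)          # (batch, seq, num_experts)
        topk_probs, topk_idx = router_probs.topk(self.top_k, dim=-1)

        output = torch.zeros_like(x)
        flat_x = x.reshape(-1, x.size(-1))
        flat_out = output.reshape(-1, x.size(-1))
        flat_idx = topk_idx.reshape(-1, self.top_k)
        flat_w = topk_probs.reshape(-1, self.top_k)

        # Locally, this loop stands in for the dispatch -> expert -> combine round trip.
        for e, expert in enumerate(self.experts):
            token_ids, slot = (flat_idx == e).nonzero(as_tuple=True)
            if token_ids.numel():
                weight = flat_w[token_ids, slot].unsqueeze(1)
                flat_out[token_ids] += weight * expert(flat_x[token_ids])

        return output, router_probs, topk_idx


moe = ExpertParallelMoE()
x = torch.randn(2, 16, 256)
out, probs, idx = moe(x)
print(out.shape, probs.shape, idx.shape)  # (2, 16, 256), (2, 16, 8), (2, 16, 2)
```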
The output shapes confirm that the model processes the input sequence and produces an output of the same dimensionality. The router probabilities tensor has shape corresponding to the batch size, sequence length, and number of experts, providing a complete probability distribution over experts for each token. The selected indices tensor (top-k) has the same batch and sequence dimensions but with the last dimension equal to k, indicating which experts were chosen for each token. These auxiliary outputs provide visibility into the gating mechanism, allowing us to verify routing behavior, compute load balancing metrics, and debug potential issues.
Handling Variable-Length Communication
A subtlety of expert parallelism is that all-to-all messages have variable lengths. One device might send 1000 tokens to device 0 but only 500 to device 1, depending on routing decisions. This variability arises naturally from the learned gating function: different inputs activate different experts, and the distribution of selected experts changes from batch to batch. This requires careful buffer management to ensure that receiving devices allocate sufficient space without wasting memory:
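(A sketch of that bookkeeping follows; the function and variable names are illustrative, and in a multi-process setting the per-device counts would be exchanged first so each receiver can size its buffer.)

```python
import torch

def prepare_dispatch(hidden_states, topk_indices, num_devices, experts_per_device):
    """Group token copies by destination device and remember how to undo it.

    hidden_states: (num_tokens, d_model), topk_indices: (num_tokens, top_k)
    """
    top_k = topk_indices.size(1)
    # One copy of each token per selected expert.
    expanded = hidden_states.repeat_interleave(top_k, dim=0)
    dest_device = (topk_indices // experts_per_device).flatten()

    # Sort copies so each destination's chunk is contiguous in the send buffer.
    perm = torch.argsort(dest_device, stable=True)
    send_buffer = expanded[perm]
    send_counts = torch.bincount(dest_device, minlength=num_devices)

    # Inverse permutation: after the combine all-to-all, restores original order.
    inverse_perm = torch.empty_like(perm)
    inverse_perm[perm] = torch.arange(perm.numel())
    return send_buffer, send_counts, inverse_perm


tokens = torch.randn(4, 8)
topk = torch.tensor([[0, 5], [1, 4], [6, 2], [7, 3]])  # illustrative routing
buf, counts, inv = prepare_dispatch(tokens, topk, num_devices=2, experts_per_device=4)
print(counts.tolist())                                  # token copies per destination device
restored = buf[inv]                                     # undoes the destination sort
print(torch.equal(restored, tokens.repeat_interleave(2, dim=0)))  # True
```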
The buffer preparation function performs several important transformations. First, it expands the hidden states to account for the top-k selection, creating a separate copy of each token for each of its selected experts. Then, it sorts these expanded tokens by their destination device, grouping all tokens bound for device 0 together, followed by all tokens bound for device 1, and so on. This sorting is essential because the all-to-all operation expects contiguous chunks of data for each destination.
The permutation indices are crucial: after receiving results back via the second all-to-all, we use them to restore the original token ordering. Without this restoration step, the processed representations would be scrambled, breaking the correspondence between tokens and their positions in the sequence. The inverse permutation ensures that after the round-trip through expert computation, each token's result ends up in the correct position for subsequent layers.
Key Parameters
The key parameters for the Expert Parallel MoE implementation are:
- num_experts: The total number of experts in the model. This determines the potential for specialization and affects both model capacity and communication patterns. Common values range from 8 (as in Mixtral) to thousands (as in Switch Transformer).
- num_local_experts: The number of experts stored on the current device (typically num_experts / world_size). This value determines the memory footprint per device and must divide evenly into the total expert count for uniform placement.
- top_k: The number of experts selected for each token (usually 1 or 2). Higher values increase computational cost but may improve model quality by allowing more experts to contribute to each token's representation.
- capacity_factor: A multiplier to reserve extra buffer space for load imbalances (e.g., 1.25x expected load). This parameter provides headroom for the natural variation in routing patterns, preventing token dropping when some experts receive more traffic than expected.
Combining Expert Parallelism with Other Strategies
In practice, expert parallelism is combined with other parallelism strategies to scale MoE models to hundreds of GPUs. No single parallelism strategy can address all the challenges of training massive models: expert parallelism handles the distribution of experts, but other strategies are needed to handle large batch sizes, oversized individual layers, and the sequential dependencies between layers.
- Expert + Data Parallelism: Replicate the expert parallel setup across multiple groups of devices. Each group has a complete set of experts, and different groups process different batches. This combination multiplies throughput by the number of data parallel replicas while keeping the communication complexity of expert parallelism contained within each replica group.
- Expert + Tensor Parallelism: Shard individual experts across devices in addition to distributing experts. This handles cases where individual experts are too large for a single device. For example, if each expert has 7 billion parameters (as in Mixtral-scale models), tensor parallelism can split each expert across 2 or more devices, with expert parallelism then distributing these sharded experts across additional device groups.
- Expert + Pipeline Parallelism: Distribute different layers across different pipeline stages, with expert parallelism within each MoE layer. This approach addresses the sequential nature of transformer computation, where each layer depends on the output of the previous layer. Pipeline parallelism overlaps computation across stages, while expert parallelism handles the within-layer distribution of experts.
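As a back-of-the-envelope check consistent with the Mixtral-scale example above (8 experts of roughly 7 billion parameters each, bf16 weights, 80 GB devices, and a rough 3x multiplier for training state; all of these are assumptions):

```python
import math

num_experts = 8
params_per_expert = 7e9   # assumed Mixtral-scale expert size
bytes_per_param = 2       # bf16 weights
device_memory_gb = 80

expert_gb = num_experts * params_per_expert * bytes_per_param / 1e9  # ~112 GB of weights
training_gb = 3 * expert_gb  # rough allowance for optimizer states and activations

print(f"expert weights: {expert_gb:.0f} GB "
      f"-> at least {math.ceil(expert_gb / device_memory_gb)} devices just to store them")
print(f"with training overhead (~3x): {training_gb:.0f} GB "
      f"-> at least {math.ceil(training_gb / device_memory_gb)} devices")
```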
The analysis shows that for a model of this scale, the expert parameters alone (112 GB) exceed the memory of a single 80GB device. We require at least 2 devices just to store the experts, and significantly more when training due to optimizer states and activations. The 3x multiplier for training memory accounts for the Adam optimizer (which maintains two additional tensors per parameter) and the activation checkpointing overhead. In practice, teams training models at this scale often use 8, 16, or more devices, combining expert parallelism with data parallelism to achieve both memory capacity and training throughput.
Limitations and Impact
Expert parallelism enables scaling MoE models that would otherwise be impossible to train, but it comes with significant challenges that you must understand and address.
The communication overhead from all-to-all operations can bottleneck training, especially when scaling to many devices or using slower interconnects. Models like Switch Transformer (which we'll explore in the next chapter) sometimes use expert parallelism across hundreds of devices, where communication costs become substantial. Techniques like capacity factors and dropping tokens help manage this, but at the cost of some model quality. The dropped tokens represent information that the model never processes, creating a subtle but real degradation in the model's ability to learn from its training data.
Load imbalance creates another persistent challenge. Even with auxiliary losses encouraging balanced routing, some experts inevitably receive more traffic than others. This means some devices finish their computation earlier and must wait, reducing overall efficiency. The problem worsens with more devices, because each step proceeds at the pace of the most heavily loaded device, and the expected worst-case load grows as more devices participate. A system with 64 devices has 64 opportunities for one device to become the bottleneck, compared to just 2 with 2 devices.
Despite these challenges, expert parallelism has been transformative for scaling language models. It allows models to have many more parameters than they could with dense architectures while maintaining reasonable compute costs. The Switch Transformer demonstrated training models with trillions of parameters using expert parallelism. More recent models like Mixtral use expert parallelism more conservatively, with 8 experts providing a balance between capacity and communication efficiency.
The impact extends beyond just scale. Expert parallelism enables a form of model specialization that dense models cannot achieve. Different experts can learn different capabilities, and the routing mechanism determines which capabilities to apply for each input. This architectural choice influences not just training efficiency but the kinds of representations these models learn. Experts tend to specialize in different ways: some focus on particular syntactic patterns, others on specific domains or languages. This emergent specialization is only possible because expert parallelism makes it practical to maintain many separate expert networks within a single model.
Summary
Expert parallelism is the distributed computing strategy that makes large-scale MoE models practical. Rather than replicating all experts on every device (expensive) or sharding experts arbitrarily (complex), expert parallelism assigns complete experts to devices and uses all-to-all communication to route tokens to wherever their selected experts reside.
The core concepts include:
- Expert placement assigns experts to devices, typically uniformly, though capacity-aware placement can improve load balance when expert utilization patterns are known or can be estimated
- All-to-all communication is the primitive that enables dynamic routing, where each device sends different data to each other device based on the gating network's decisions
- Communication overhead grows with device count and can become a bottleneck at scale, making high-bandwidth interconnects essential for maintaining training efficiency
- Implementation requires careful buffer management for variable-length messages and coordination with routing decisions to preserve token ordering
Expert parallelism combines naturally with other parallelism strategies. Production systems typically use expert parallelism for the MoE layers alongside data parallelism for throughput and potentially tensor parallelism for very large expert dimensions. The choice of which strategies to combine, and in what proportions, depends on the specific model architecture, available hardware, and performance requirements.