A comprehensive guide to Mixture of Experts (MoE) architectures, including routing mechanisms, load balancing, emergent specialization, and how sparse activation enabled models to scale to trillions of parameters while maintaining practical computational costs.

2021: Mixture of Experts
By 2021, the field of large language models had reached an inflection point. Models like GPT-3, with 175 billion parameters, had demonstrated unprecedented capabilities, but they also exposed fundamental limitations in how neural networks were being scaled. Every forward pass through these massive dense models required activating all 175 billion parameters, making both training and inference computationally expensive. Researchers at Google and other leading institutions recognized that this uniform activation pattern was inefficient. Not every input required the full model capacity, yet dense architectures had no way to selectively activate only relevant parameters. This insight would lead to the widespread adoption of Mixture of Experts (MoE) architectures, a paradigm shift that would enable models to scale to trillions of parameters while maintaining practical computational costs.
The concept of Mixture of Experts wasn't entirely new. Work in the early 1990s had explored using multiple expert networks with gating mechanisms for routing inputs, and Shazeer and colleagues had revived the idea in 2017 with sparsely gated MoE layers in recurrent language models. However, these earlier approaches had struggled with practical challenges, particularly around training stability and load balancing. By 2021, several factors converged to make MoE architectures viable at scale. The transformer architecture provided a stable foundation for expert networks. Advances in distributed training made it possible to shard a single model across many devices. Perhaps most importantly, researchers developed techniques for load balancing and routing that prevented the common failure mode where MoE models would collapse to using only a few experts.
The breakthrough came from researchers at Google Brain and Google Research, who brought MoE into the transformer era with the GShard model (published in 2020 and presented in 2021) and the Switch Transformer (2021). These models demonstrated that MoE architectures could achieve better performance than dense models of similar computational cost while scaling to sizes that would be impractical with dense architectures. The key innovation was embedding expert networks inside transformer layers as parallel feed-forward blocks and developing routing mechanisms that learned to distribute work across experts during training. This work would establish MoE as a fundamental architectural pattern for scaling language models.
The significance of MoE architectures extended beyond computational efficiency. These models enabled a form of emergent specialization, where different experts naturally learned to handle different types of inputs or tasks without explicit supervision. Experts might specialize in different domains, languages, or reasoning patterns, creating a form of modular intelligence within a single model. This specialization capability would prove valuable for applications requiring both broad knowledge and specialized expertise, opening new possibilities for how language models could be deployed and used.
The Problem
As language models grew larger throughout 2020 and early 2021, researchers encountered a fundamental scaling problem. Dense neural network architectures, where every parameter is activated for every forward pass, became increasingly inefficient as model sizes approached hundreds of billions of parameters. GPT-3, released in 2020 with 175 billion parameters, required activating all 175 billion parameters even for simple tasks that might only need a fraction of the model's total capacity. This uniform activation pattern meant that computational cost scaled linearly with model size, making it economically difficult to scale models beyond a certain point.
The memory requirements for dense models were also becoming prohibitive. Every parameter needed to be stored in memory, and every forward pass required loading and computing with all parameters. For models with hundreds of billions of parameters, this meant that even storing the model weights required substantial hardware resources. Training these models required expensive GPU clusters with hundreds or thousands of devices, putting them out of reach for most research organizations. The computational and memory constraints were creating a practical ceiling on how large dense models could become.
Dense architectures also suffered from diminishing returns as models grew larger. Each additional parameter contributed less to model capability than earlier scaling had suggested, so the cost of scaling up grew faster than the benefits. Although larger dense models continued to improve, the shrinking gains per unit of compute suggested that architectural changes, not just more parameters, would be needed to keep improving performance at an acceptable cost.
Another fundamental limitation was that dense models treated all inputs equally, using the full model capacity for every computation regardless of complexity. A simple question requiring basic factual recall would activate all parameters, just like a complex reasoning task requiring sophisticated analysis. This uniform activation was wasteful, consuming computational resources for tasks that didn't need them while simultaneously limiting the total model size that could be practically deployed. The inability to selectively activate only relevant parameters based on input characteristics represented a fundamental inefficiency in dense architectures.
The training dynamics of very large dense models also presented challenges. As models grew to hundreds of billions of parameters, training became increasingly unstable and expensive. The memory requirements for storing gradients and optimizer states during training scaled with model size, making it difficult to use effective batch sizes or learning rate schedules. These training challenges further limited the practical scalability of dense architectures, creating additional pressure to find alternative approaches that could achieve similar or better performance with more efficient resource usage.
The Solution
Mixture of Experts architectures addressed these limitations by introducing sparsity into neural network computation. Instead of using all model parameters for every input, MoE models divide the model into multiple expert networks, each a complete neural network capable of processing inputs independently. A learned gating or routing mechanism then selects which experts should handle each input, typically activating only 1-2 experts out of many. This sparse activation pattern means that while the total number of parameters in the model can be very large, the computational cost of each forward pass is much smaller, as only the active experts need to be computed.
The key innovation that made MoE architectures viable at scale in 2021 was the integration of expert networks into transformer layers. Researchers at Google Brain designed expert networks as feed-forward layers within the transformer architecture, replacing the standard dense feed-forward layer with multiple expert feed-forward networks. Each expert was a complete two-layer feed-forward network, and a routing mechanism would select which experts to use for each token or group of tokens. This design maintained the transformer architecture's proven structure while introducing the efficiency benefits of sparse activation.
Routing Mechanism
The routing mechanism is the core component that determines which experts process each input. In the MoE architectures introduced in 2021, routing typically worked at the token level, where each token position in the sequence could be routed to different experts. The routing network, also called the gating network, takes the token representation as input and produces a probability distribution over all available experts. The mechanism computes scores for each expert, applies a softmax to create probabilities, and then selects the top-k experts (typically k=1 or k=2) based on these probabilities.
The mathematical formulation of routing begins with computing expert scores. For a token representation $x$, the gating network computes a score for each expert $i$:

$$s_i = w_i^\top x + b_i$$

where $w_i$ and $b_i$ are learned parameters for expert $i$. These scores are then normalized with a softmax to form a probability distribution over experts:

$$p_i = \frac{\exp(s_i)}{\sum_{j=1}^{N} \exp(s_j)}$$

where $N$ is the total number of experts. The routing mechanism then selects the top-k experts with the highest probabilities, and the layer's output is the sum of the selected experts' outputs, each weighted by its (typically renormalized) routing probability; the exact combination rule varies across MoE variants.
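To make this concrete, here is a minimal sketch of a top-k gating network in PyTorch, following the formulation above. The names (`TopKGate`, `d_model`, `num_experts`, `k`) are illustrative rather than taken from the original papers, and real routers add noise, capacity limits, and other details omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKGate(nn.Module):
    """Minimal top-k gating network: scores, softmax, expert selection."""

    def __init__(self, d_model: int, num_experts: int, k: int = 2):
        super().__init__()
        # One score per expert: s_i = w_i^T x + b_i for every token.
        self.scorer = nn.Linear(d_model, num_experts)
        self.k = k

    def forward(self, x: torch.Tensor):
        # x: (num_tokens, d_model)
        scores = self.scorer(x)                       # (num_tokens, num_experts)
        probs = F.softmax(scores, dim=-1)             # routing probabilities p_i
        topk_probs, topk_idx = probs.topk(self.k, dim=-1)
        # Renormalize the selected probabilities so the combination weights sum to 1.
        weights = topk_probs / topk_probs.sum(dim=-1, keepdim=True)
        return topk_idx, weights, probs               # probs kept for a balancing loss


# Example: route 10 tokens of width 512 across 8 experts, keeping the top 2.
gate = TopKGate(d_model=512, num_experts=8, k=2)
idx, w, p = gate(torch.randn(10, 512))
print(idx.shape, w.shape)  # torch.Size([10, 2]) torch.Size([10, 2])
```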
Load Balancing
One of the critical challenges with MoE architectures is ensuring that work is distributed evenly across experts during training. If one expert receives most of the inputs while others remain idle, the model effectively reduces to a smaller dense model, losing the efficiency benefits. Early MoE models often suffered from this collapse, where the routing mechanism would learn to always route to the same few experts, defeating the purpose of having multiple experts.
The 2021 MoE architectures combined several techniques to address load balancing. The GShard model used capacity constraints that limited how many tokens could be routed to each expert, forcing a more even distribution. The Switch Transformer used a simplified auxiliary load balancing loss that penalized uneven expert usage: the loss encouraged both the fraction of tokens dispatched to each expert and the router probability mass assigned to it to stay close to uniform across a batch. Added to the main training loss, this term pushed the routing mechanism to distribute work more evenly while still making quality routing decisions.
The load balancing objective typically measures how evenly experts are used. One common approach computes the coefficient of variation of expert usage, which penalizes cases where some experts receive many more inputs than others. By including this term in the training loss, the model learns to route inputs to appropriate experts while maintaining roughly equal usage across all experts. This balance between quality routing and load balancing was crucial for making MoE architectures work effectively at scale.
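As a rough sketch of this idea, the snippet below computes a coefficient-of-variation style balancing penalty from the router probabilities produced by a gate such as the one above. It assumes per-expert "importance" is measured as the total routing probability an expert receives over a batch; this is one common formulation, not the exact loss used in GShard or the Switch Transformer.

```python
import torch

def load_balancing_loss(router_probs: torch.Tensor) -> torch.Tensor:
    """Penalize uneven expert usage via the squared coefficient of variation.

    router_probs: (num_tokens, num_experts) softmax outputs of the gate.
    """
    # "Importance" of each expert: total routing probability it receives.
    importance = router_probs.sum(dim=0)                       # (num_experts,)
    cv_squared = importance.var() / (importance.mean() ** 2 + 1e-9)
    return cv_squared


# Added to the main objective with a small coefficient, e.g.:
# loss = lm_loss + 0.01 * load_balancing_loss(probs)
```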
Expert Architecture
In the MoE architectures from 2021, each expert was typically implemented as a feed-forward neural network within the transformer architecture. A standard transformer layer contains a multi-head self-attention mechanism followed by a feed-forward network. In MoE variants, the feed-forward network was replaced with multiple expert feed-forward networks and a routing mechanism. Each expert was a complete two-layer feed-forward network with the same architecture, typically expanding the input dimension, applying a non-linear activation, and then projecting back to the original dimension.
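The following sketch shows how the dense feed-forward block of a transformer layer might be replaced by several expert feed-forward networks plus a router, using a simple top-1 dispatch in the spirit of the Switch Transformer. Names like `MoEFeedForward` and `d_ff` are assumptions for illustration, and the per-expert loop stands in for the batched, capacity-constrained dispatch that real implementations use.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExpertFFN(nn.Module):
    """A standard two-layer transformer feed-forward block."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

    def forward(self, x):
        return self.net(x)


class MoEFeedForward(nn.Module):
    """Replaces the dense FFN with num_experts expert FFNs and a top-1 router."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int):
        super().__init__()
        self.experts = nn.ModuleList([ExpertFFN(d_model, d_ff) for _ in range(num_experts)])
        self.router = nn.Linear(d_model, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)
        gate, expert_idx = probs.max(dim=-1)          # top-1 expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i
            if mask.any():
                # Scale each expert's output by its gate value (Switch-style).
                out[mask] = gate[mask].unsqueeze(-1) * expert(x[mask])
        return out


# Example: 16 tokens, model width 512, 4 experts with hidden width 2048.
layer = MoEFeedForward(d_model=512, d_ff=2048, num_experts=4)
y = layer(torch.randn(16, 512))
print(y.shape)  # torch.Size([16, 512])
```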
The expert networks learned distinct specializations during training, even though they started with identical architectures. Through the routing mechanism, different experts would process different types of inputs, leading them to develop specialized knowledge or patterns. Researchers observed that experts would naturally specialize in different domains, languages, or reasoning patterns without explicit supervision. This emergent specialization was one of the most intriguing aspects of MoE architectures, demonstrating how the routing mechanism could create a form of modular intelligence within a single model.
Scaling Advantages
The computational advantages of MoE architectures became clear when scaling model size. In a dense model with $P$ parameters, every forward pass requires computing with all $P$ parameters. In an MoE model with $E$ experts, each holding $P_e$ parameters, the total expert parameter count is $E \cdot P_e$, but each forward pass only activates roughly $k \cdot P_e$ of those parameters, where $k$ is the number of active experts (typically 1-2). This means that an MoE model could have significantly more total parameters than a dense model while maintaining similar computational cost per forward pass.
For example, a dense model with 175 billion parameters requires computing with all 175 billion parameters for every input. An MoE model with 8 experts each of size 20 billion parameters would have 160 billion total parameters, but only compute with 20-40 billion parameters per input (depending on whether 1 or 2 experts are active). This sparse activation pattern enabled researchers to scale models to sizes that would be impractical with dense architectures, while maintaining reasonable computational costs for training and inference.
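The parameter arithmetic in this example is easy to verify directly; the short snippet below computes total versus active parameter counts for the hypothetical 8-expert configuration described above, ignoring the attention and router parameters that real models also carry.

```python
# Hypothetical MoE configuration from the example above.
num_experts = 8
params_per_expert = 20e9   # 20 billion parameters per expert
active_experts = 2         # top-2 routing

total_params = num_experts * params_per_expert
active_params = active_experts * params_per_expert

print(f"Total parameters: {total_params / 1e9:.0f}B")   # 160B
print(f"Active per token: {active_params / 1e9:.0f}B")  # 40B
```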
One of the remarkable aspects of MoE architectures is how experts naturally develop specializations during training without explicit supervision. Researchers analyzing trained MoE models have found that different experts specialize in distinct patterns. Some experts might specialize in scientific terminology, others in conversational language, and still others in specific types of reasoning. This emergent modularity suggests that MoE architectures can learn to organize knowledge in ways that dense models cannot, potentially making them more interpretable and useful for specialized applications.
Applications and Impact
The MoE architectures introduced in 2021, particularly the Switch Transformer and GShard models, demonstrated significant improvements over dense models of similar computational cost. Google's Switch Transformer, whose largest configuration scaled to roughly 1.6 trillion parameters, showed that MoE models could achieve better performance per unit of compute than dense models. Configurations with 128 experts per layer achieved substantial improvements in language modeling perplexity while using compute per token comparable to much smaller dense models. This demonstrated the practical viability of MoE architectures for scaling language models.
The efficiency gains from MoE architectures made it possible for organizations with moderate computational resources to train and deploy very large models. While dense models like GPT-3 required massive computational infrastructure, MoE models with similar or better capabilities could be trained more efficiently. This democratization of access to large language models opened new possibilities for research and deployment across different organizations and use cases. Smaller research groups could now experiment with architectures that would have been impossible with dense models.
MoE models proved particularly effective for tasks requiring both broad knowledge and specialized expertise. The emergent specialization of experts meant that a single model could handle diverse inputs effectively. Different experts might specialize in different domains, allowing the model to excel at both general language tasks and specialized applications. This capability made MoE models valuable for applications that needed to process varied content types, from scientific literature to conversational text to code.
The routing mechanism in MoE models also provided a form of interpretability that dense models lacked. By examining which experts were activated for different types of inputs, researchers could gain insights into how the model processed information. This interpretability helped researchers understand model behavior and debug issues more effectively than with dense architectures, where all parameters were always active and it was difficult to understand which parts of the model contributed to specific predictions.
The successful scaling of MoE models in 2021 influenced subsequent development of large language models. The architectural principles demonstrated in Switch Transformer and GShard would be adopted and refined in later models. The efficiency gains from sparse activation would become increasingly important as models continued to scale, making MoE architectures a fundamental approach for training very large language models.
Limitations
Despite their advantages, MoE architectures introduced several new challenges that researchers had to address. One significant limitation was the complexity of training MoE models compared to dense models. The routing mechanism added an extra component that needed to be trained, and the load balancing objectives added complexity to the training process. Training MoE models required careful tuning of hyperparameters related to routing, load balancing, and capacity constraints, making them more difficult to train effectively than dense models.
The dynamic routing in MoE models could also introduce training instability. The routing decisions depended on learned parameters, and the routing patterns could change during training as the model learned. This dynamic nature could lead to inconsistent training dynamics, where routing patterns would shift suddenly and cause training to become unstable. Researchers had to develop techniques to stabilize training, such as using auxiliary losses and carefully designed capacity constraints.
Memory requirements for MoE models were also more complex than for dense models. While MoE models could achieve better computational efficiency through sparse activation, they still needed to store all expert parameters in memory, which could be substantial for models with many experts. Additionally, the routing mechanism and load balancing computations added some memory overhead. For very large MoE models with hundreds of experts, the memory requirements could still be significant, though typically more manageable than equivalent dense models.
The routing mechanism itself introduced some computational overhead. Computing routing scores for each token and selecting top-k experts added computation that dense models didn't require. While this overhead was typically small compared to the savings from sparse activation, it did reduce some of the efficiency gains. The routing overhead became more significant for models with many experts, where computing routing scores for all experts could add noticeable computational cost.
Another limitation was that MoE models could be less predictable than dense models. Because different experts were activated for different inputs, the computational cost of inference could vary depending on which experts were selected. This variability made it more difficult to predict inference time and could complicate deployment scenarios where consistent latency was important. Dense models, by contrast, had consistent computational cost for all inputs.
Legacy and Looking Forward
The MoE architectures introduced in 2021 established sparse activation as a fundamental approach for scaling large language models. The success of Switch Transformer and GShard demonstrated that architectural innovation could be as important as simply scaling model size, opening new directions for improving model efficiency and capability. This insight would influence subsequent model development, with MoE principles being adopted and refined in later models.
The emergent specialization observed in MoE models suggested new possibilities for how neural networks could organize knowledge. The ability of experts to naturally develop specializations without explicit supervision demonstrated a form of learned modularity that dense models couldn't achieve. This capability would prove valuable for applications requiring specialized knowledge across different domains, making MoE architectures particularly relevant for multilingual models, multimodal systems, and applications with diverse input types.
The efficiency gains from MoE architectures became increasingly important as models continued to scale in subsequent years. The ability to train models with trillions of parameters while maintaining practical computational costs enabled new capabilities that wouldn't have been feasible with dense architectures. Models like Google's GLaM, which used sparsely activated MoE layers at trillion-parameter scale, would demonstrate the continued value of sparse activation patterns for very large models.
The routing mechanisms developed for MoE models also influenced research in other areas of neural network architecture. The principles of learned routing and sparse activation would be explored in other contexts, such as conditional computation and dynamic neural networks. The techniques for load balancing and capacity management developed for MoE models would also inform research in distributed systems and resource allocation.
The limitations of MoE architectures would drive further research into improving training stability, reducing routing overhead, and developing more sophisticated routing mechanisms. Subsequent work would explore alternative routing strategies, better load balancing techniques, and methods for making MoE models more predictable and easier to train. These improvements would make MoE architectures even more practical and effective.
The impact of MoE architectures extended beyond just computational efficiency. By demonstrating that models could achieve better performance through architectural innovation rather than simply scaling parameter count, MoE architectures helped shift the field's focus toward more sophisticated model designs. This shift would influence how researchers approached model development, encouraging exploration of alternative architectures that could achieve better efficiency and capability. The principles demonstrated in MoE architectures would become fundamental to understanding how to scale language models effectively while maintaining practical computational requirements.