Discover how sparse models decouple capacity from compute using conditional computation and mixture of experts to achieve efficient scaling.

Sparse Models
The scaling laws we explored in Part XXI revealed a fundamental insight: model performance improves predictably with more compute, more data, and more parameters. This created an arms race toward ever-larger models, with GPT-3's 175 billion parameters quickly becoming a stepping stone rather than a ceiling. But this scaling approach faces a harsh economic reality: every parameter must be processed for every token, making the cost of inference grow linearly with model size.
What if we could have the capacity of a massive model while only paying the computational cost of a much smaller one? This is the promise of sparse models. They break the assumption that bigger models must be proportionally more expensive to run. Instead of activating all parameters for every input, sparse models selectively engage different subsets of their parameters based on what the input requires.
This chapter introduces the foundational concepts behind sparse computation in neural networks. We'll examine why dense models hit efficiency walls, how conditional computation offers an escape, and what challenges arise when we abandon the all-parameters-all-the-time paradigm.
Dense Models and Their Limitations
Every transformer architecture we've studied so far follows the same pattern. When a token enters the model, it flows through every layer, every attention head, and every feed-forward network. The model's full parameter count participates in every computation. This uniformity of computation, where the same operations apply regardless of the input's content, has been both a strength and a limitation of the transformer architecture.
Dense model: A model where all parameters are activated and participate in the forward pass for every input. The computational cost scales directly with the total parameter count.
The "dense" property means every parameter contributes to every prediction, creating a direct relationship between model size and computational cost. If we double the parameters, we double the compute required per token. This tight coupling between capacity and computation may seem inevitable, but as we'll see, it's actually a design choice rather than a fundamental constraint of neural networks.
Consider the feed-forward network we examined in Part XII. For a model with hidden dimension $d_{model}$ and FFN intermediate dimension $d_{ff}$, each token passes through two linear transformations. The feed-forward network is where sparse models make their intervention.
The FFN expands each token to the larger intermediate dimension, applies a non-linearity, then projects back to the original dimension. By first projecting into a higher-dimensional space where patterns are easier to separate and then contracting back to $d_{model}$, this two-stage transformation learns complex non-linear mappings and is remarkably effective at capturing relationships in the data. Given an input token $x$, the FFN computes:
$$\text{FFN}(x) = W_2 \, \sigma(W_1 x + b_1) + b_2$$
where:
- $x \in \mathbb{R}^{d_{model}}$: input token representation with dimension $d_{model}$
- $W_1 \in \mathbb{R}^{d_{ff} \times d_{model}}$: first weight matrix that projects to the intermediate dimension $d_{ff}$
- $b_1 \in \mathbb{R}^{d_{ff}}$: bias vector for the first transformation
- $\sigma$: activation function (typically ReLU or GELU) applied element-wise
- $W_2 \in \mathbb{R}^{d_{model} \times d_{ff}}$: second weight matrix that projects back to $d_{model}$
- $b_2 \in \mathbb{R}^{d_{model}}$: bias vector for the output transformation
The activation function $\sigma$ serves two purposes: it makes the transformation non-linear, and it lets the network learn complex patterns by shaping the higher-dimensional space before the projection back to the model dimension. Without this non-linearity, the composition of two linear transformations would collapse into a single linear transformation, severely limiting the network's expressive power.
We can derive the total parameter count for these two weight matrices as follows. This derivation helps us understand exactly where the computational burden lies in a transformer layer:
$$\text{params}_{\text{FFN}} = \underbrace{d_{ff} \cdot d_{model}}_{W_1} + \underbrace{d_{model} \cdot d_{ff}}_{W_2} = 2 \cdot d_{model} \cdot d_{ff}$$
In a typical transformer where $d_{ff} = 4 \cdot d_{model}$, substituting into our formula gives $8 \cdot d_{model}^2$ parameters per FFN layer. To put this in concrete terms, for GPT-3 with $d_{model} = 12{,}288$, that's over 1.2 billion parameters per layer, and every single parameter gets used for every single token. This enormous parameter count per layer, multiplied across all layers, explains why large language models require such substantial hardware resources.
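As a quick sanity check, this arithmetic fits in a few lines of Python (a rough sketch; GPT-3's hidden size is public, the variable names are ours):

```python
d_model = 12288                      # GPT-3 hidden dimension
d_ff = 4 * d_model                   # standard 4x expansion
ffn_params = 2 * d_model * d_ff      # W_1 and W_2, ignoring biases

print(f"{ffn_params:,} parameters per FFN layer")   # 1,207,959,552 -> ~1.2B
```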
This creates a coupling between three quantities that we might prefer to scale independently:
- Model capacity: The ability to store and process complex patterns, which is tied to parameter count
- Training compute: The FLOPs required to update parameters during training
- Inference compute: The FLOPs required to process each token at runtime
In dense models, these three quantities move together in lockstep. Want more capacity? Add parameters. More parameters means more compute for both training and inference. The Chinchilla scaling laws help us balance compute and data efficiently, but they don't break this fundamental coupling. This coupling represents a significant constraint: we cannot increase the model's knowledge capacity without simultaneously increasing the cost of using that knowledge.
The practical impact is severe. A 175 billion parameter dense model requires roughly 350 billion FLOPs per token (using the approximation that FLOPs = 2 times parameters per forward pass). At scale, this determines your hardware requirements, inference latency, and operating costs. For organizations deploying these models to millions of users, even small inefficiencies translate into substantial infrastructure expenses.
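A rough sizing sketch makes this concrete. The GPU peak, utilization, and throughput figures below are illustrative assumptions rather than measurements:

```python
import math

PEAK_TFLOPS = 312       # assumed A100 bf16 peak throughput
UTILIZATION = 0.40      # assumed sustained utilization
TOKENS_PER_SEC = 1000   # target serving throughput

def gpus_needed(params_billions: float) -> int:
    flops_per_token = 2 * params_billions * 1e9               # ~2 FLOPs per parameter
    required_tflops = flops_per_token * TOKENS_PER_SEC / 1e12
    return math.ceil(required_tflops / (PEAK_TFLOPS * UTILIZATION))

for size in (7, 70, 175, 540):
    print(f"{size:>4}B dense model: ~{gpus_needed(size)} GPU(s) for {TOKENS_PER_SEC} tok/s")
```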
These back-of-the-envelope numbers reveal the scaling problem. As model size increases from 7B to 540B parameters, computational requirements grow proportionally. The 7B model can serve 1000 tokens per second on roughly one A100 GPU, while the 540B model requires nearly 9 GPUs for the same throughput. This linear scaling between parameters and compute creates a fundamental efficiency barrier: even with optimized hardware running at 40% utilization, infrastructure costs become prohibitive at large scales. This growth in hardware requirements motivates the search for sparse architectures that can decouple model capacity from computational cost.
The Conditional Computation Paradigm
The insight behind sparse models is elegant: not all parameters need to contribute to every prediction. Different inputs may benefit from different computational patterns. A question about chemistry might activate different knowledge than a question about history, even within the same model. This observation suggests that the dense model's uniform treatment of all inputs might be wasteful: we're applying computational resources indiscriminately rather than directing them where they're most needed.
Conditional computation: A computational paradigm where the operations performed depend on the input. Different inputs follow different computational paths through the network, activating different subsets of parameters based on the input's characteristics.
Human experts provide an intuitive analogy. When a hospital faces a complex case, they don't have every specialist examine the patient. Instead, a routing process directs the case to relevant experts: perhaps a cardiologist and neurologist for certain symptoms, or an oncologist and radiologist for others. Each expert has deep knowledge in their domain, but only the relevant experts participate in each case. The hospital maintains a large total capacity (many specialists) while each patient receives focused attention from only the most relevant subset.
Sparse models implement this principle mathematically. Instead of a single large FFN layer, imagine dividing those parameters into multiple separate "expert" networks. A learned routing mechanism examines each input and decides which experts should process it. This routing decision is the key innovation: rather than treating all parameters uniformly, we learn to selectively activate the most relevant subset for each input.
The sparse layer computes its output as a weighted combination of expert outputs. This combination allows the model to blend specialized knowledge from different experts based on the input's characteristics. The weighting scheme ensures that more relevant experts contribute more strongly to the final output, while less relevant experts contribute proportionally less or not at all. For an input $x$, the output $y$ is computed as:
$$y = \sum_{i=1}^{N} g_i(x) \, E_i(x)$$
where:
- $y$: the output of the sparse layer for input $x$
- $N$: the total number of expert networks available in the layer
- $E_i(x)$: the output of expert network $i$ when processing input $x$
- $g_i(x)$: the gating weight for expert $i$ given input $x$, which determines how much expert $i$ contributes to the final output
- $\sum_{i=1}^{N}$: summation operator that sums over all experts (from expert $1$ to expert $N$)
- $g_i(x) \, E_i(x)$: the weighted contribution of expert $i$, scaling the expert's output by how relevant it is to this input
This weighted sum allows the model to combine specialized knowledge from different experts in a flexible, input-dependent manner. The key property is that $g(x)$ is sparse: most $g_i(x)$ values are zero, meaning most experts don't participate for any given input. This sparsity is what makes the computation efficient: we only compute $E_i(x)$ for experts where $g_i(x) > 0$. Typically, only the top-k experts with the highest gating weights are activated, with $g_i(x) = 0$ for all others. This selection mechanism transforms a potentially expensive computation over all experts into a much cheaper computation over just a few.
To ensure computational efficiency while maintaining proper probability distributions, we impose two constraints on the gating weights. The first constraint enforces sparsity, ensuring that only k experts are active for any given input. The second ensures the weights form a valid probability distribution, so the output is a proper weighted average. These constraints can be formalized as:
$$\big|\{\, i : g_i(x) \neq 0 \,\}\big| \leq k \qquad \text{and} \qquad \sum_{i=1}^{N} g_i(x) = 1, \quad g_i(x) \geq 0$$
These two constraints work together to enable efficient sparse computation. The first constraint (sparsity) ensures that most $g_i(x)$ values are exactly zero, so we can skip computing $E_i(x)$ for those experts entirely. This is where the computational savings originate: if only 2 out of 64 experts are active, we perform approximately 1/32 of the computation we would need for a dense layer with equivalent total parameters. The second constraint (normalization) ensures that the non-zero weights form a proper probability distribution, making the output a valid weighted average of expert outputs. This normalization preserves the mathematical properties that allow the model to be trained effectively with standard gradient-based optimization.
The routing mechanism implements these constraints through a three-step process that transforms continuous router scores into sparse, normalized gating weights. First, compute router scores $s_i(x)$ for all experts, typically via a learned linear projection followed by a softmax. This linear projection learns which features of the input are relevant for selecting each expert. Second, select the top-k experts with the highest scores and set all other weights to zero, creating a sparse mask. This hard selection step is what enables the computational efficiency of sparse models. Third, renormalize only the selected weights so they sum to 1:
$$g_i(x) = \begin{cases} \frac{s_i(x)}{\sum_{j \in \text{TopK}(x)} s_j(x)} & \text{if } i \in \text{TopK}(x) \\ 0 & \text{otherwise} \end{cases}$$
where:
- $s_i(x)$: the router score for expert $i$ on input $x$, before renormalization over the selected set
- $\text{TopK}(x)$: the set of $k$ experts with the highest router scores for input $x$
- $\sum_{j \in \text{TopK}(x)} s_j(x)$: normalization constant that ensures the selected weights sum to 1
This process ensures that only k experts perform forward passes, dramatically reducing computation while maintaining the mathematical properties of a weighted mixture.
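A minimal NumPy sketch of these three steps for a single token (the function and variable names are ours, not taken from any particular library):

```python
import numpy as np

def topk_gating(router_logits: np.ndarray, k: int) -> np.ndarray:
    """Turn raw router scores into sparse, renormalized gating weights g_i(x)."""
    # Step 1: softmax over all experts to get router scores s_i(x).
    shifted = router_logits - router_logits.max()
    scores = np.exp(shifted) / np.exp(shifted).sum()
    # Step 2: keep the top-k experts, zero out the rest.
    topk = np.argsort(scores)[-k:]
    gates = np.zeros_like(scores)
    gates[topk] = scores[topk]
    # Step 3: renormalize the surviving weights so they sum to 1.
    return gates / gates.sum()

logits = np.random.randn(8)        # stand-in for a learned linear projection of the token
g = topk_gating(logits, k=2)
print(np.count_nonzero(g), round(g.sum(), 6))   # -> 2 1.0
```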
This mechanism fundamentally breaks the dense model's coupling between capacity and compute. We can have many experts (large total parameter count), but only activate a few per token (small computational cost). The model gains capacity to store diverse knowledge while maintaining efficient inference. The key insight is that this doesn't sacrifice quality: different tokens genuinely benefit from different experts, so selective activation isn't just an approximation but rather a better match to how knowledge should be applied.
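To make the capacity/compute split concrete, the sketch below routes a tiny batch through a hypothetical 8-expert, top-2 layer using random stand-in weights; the dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, top_k, d_model = 8, 2, 16
tokens = rng.normal(size=(4, d_model))             # a small batch of 4 token vectors
router_w = rng.normal(size=(d_model, n_experts))   # stand-in for learned router weights

scores = tokens @ router_w                         # router logits, shape (4, 8)
chosen = np.argsort(scores, axis=-1)[:, -top_k:]   # top-2 expert indices per token

for t, experts in enumerate(chosen):
    print(f"token {t}: routed to experts {sorted(experts.tolist())}")
print(f"active capacity per token: {top_k}/{n_experts} = {top_k / n_experts:.0%}")
```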
The configuration demonstrates the core efficiency principle of sparse models. Each token activates only 2 out of 8 experts, using just 25% of available capacity. The routing decisions show that different tokens select different experts, with each making independent routing choices based on the learned router weights. While the layer contains 8 experts worth of parameters, any single token only triggers computation in 2 of them. This achieves the key sparse model trade-off: maintaining high total capacity (all 8 experts available) while keeping per-token compute cost low (only 2 experts activate per forward pass).
Efficiency Analysis: Decoupling Parameters from Compute
The power of sparse models becomes clear when we quantify the efficiency gains. We can compare a dense model to a sparse model with equivalent computational cost to see the capacity advantage. This analysis reveals the mathematical foundation for why sparse models can achieve better performance per unit of compute than their dense counterparts.
Consider a dense FFN with intermediate dimension $d_{ff}$. This layer has two weight matrices: $W_1 \in \mathbb{R}^{d_{ff} \times d_{model}}$ and $W_2 \in \mathbb{R}^{d_{model} \times d_{ff}}$. These matrices define the expansion into a higher-dimensional space and the subsequent projection back to the model dimension.
The dense FFN has the following computational characteristics:
- Total parameters: $2 \cdot d_{model} \cdot d_{ff}$. This count comes from two weight matrices: $W_1$ (containing $d_{ff} \cdot d_{model}$ parameters) and $W_2$ (containing $d_{model} \cdot d_{ff}$ parameters)
- FLOPs per token: $4 \cdot d_{model} \cdot d_{ff}$, because each of the two matrix multiplications requires approximately $2 \cdot d_{model} \cdot d_{ff}$ FLOPs (the standard approximation that multiplying an $m \times n$ matrix by a length-$n$ vector requires roughly $2mn$ FLOPs).
where:
- $d_{model}$: the dimension of the model's hidden representations (the size of token embeddings throughout the network)
- $d_{ff}$: the intermediate dimension of the feed-forward network, typically set to $4 \cdot d_{model}$
- $2 \cdot d_{model} \cdot d_{ff}$: the FLOP count for each matrix multiplication, which comes from computing $d_{ff}$ output elements, each requiring $d_{model}$ multiplications and additions, giving approximately $2 \cdot d_{model} \cdot d_{ff}$ FLOPs
Now consider a sparse layer with $N$ experts, where each expert has intermediate dimension $d_{ff}/N$, using top-2 routing. This configuration divides the total parameters across multiple smaller experts, each specialized for different types of inputs. The top-2 routing means each token will be processed by exactly two experts, combining their outputs with learned weights.
For this sparse configuration, we can compute the total parameter count, active parameters per token, and FLOPs per token. These derivations illuminate exactly where the efficiency gains originate.
Total parameters: Each expert is a smaller FFN with intermediate dimension $d_{ff}/N$ instead of the full $d_{ff}$. Since each expert has two weight matrices, it contains $2 \cdot d_{model} \cdot d_{ff}/N$ parameters. The expert's reduced size means it specializes in a narrower range of computations, but collectively the experts cover the same representational space as the original dense layer. Summing across all $N$ experts:
$$\text{params}_{\text{total}} = N \cdot 2 \cdot d_{model} \cdot \frac{d_{ff}}{N} = 2 \cdot d_{model} \cdot d_{ff}$$
This matches the dense model exactly, since the $N$ terms cancel. This result is important because it shows that we haven't reduced capacity: the total number of learnable parameters remains the same. The parameters are simply distributed across multiple experts rather than concentrated in a single network.
Active parameters per token: The key efficiency gain comes from activating only a subset of experts. With top-2 routing, only 2 out of $N$ experts activate for any given token, and each active expert contributes its $2 \cdot d_{model} \cdot d_{ff}/N$ parameters to the computation. This selective activation is the heart of sparse efficiency:
$$\text{params}_{\text{active}} = 2 \cdot 2 \cdot d_{model} \cdot \frac{d_{ff}}{N} = \frac{4 \cdot d_{model} \cdot d_{ff}}{N}$$
FLOPs per token: The computational cost follows the active parameters, since we only perform computations for the experts that are activated. Each active expert performs two matrix multiplications (expansion and contraction), costing $4 \cdot d_{model} \cdot d_{ff}/N$ FLOPs. With 2 active experts:
$$\text{FLOPs}_{\text{sparse}} = 2 \cdot \frac{4 \cdot d_{model} \cdot d_{ff}}{N} = \frac{8 \cdot d_{model} \cdot d_{ff}}{N}$$
With $N$ experts, the sparse model uses only $2/N$ of the dense model's FLOPs per token while maintaining the same total parameter count. This is a remarkable result: we get the same capacity at a fraction of the computational cost. We can verify this by computing the ratio:
$$\frac{\text{FLOPs}_{\text{sparse}}}{\text{FLOPs}_{\text{dense}}} = \frac{8 \cdot d_{model} \cdot d_{ff} / N}{4 \cdot d_{model} \cdot d_{ff}} = \frac{2}{N}$$
This ratio reveals the key insight: with 8 experts and top-2 routing, we use only 25% of the dense model's compute per token. More generally, at a fixed total parameter count the per-token compute shrinks as $2/N$, and conversely, if we hold the per-expert size fixed, total capacity grows linearly with $N$ at roughly constant per-token compute. This decoupling of expert count from compute cost is the mathematical foundation for sparse model efficiency.
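The ratio is easy to check numerically. This sketch compares a dense FFN against an equal-parameter sparse layer for several expert counts (the dimensions are assumed for illustration):

```python
d_model, d_ff, top_k = 4096, 16384, 2

dense_flops = 4 * d_model * d_ff                   # two matmuls at 2*d_model*d_ff each

for n_experts in (4, 8, 16, 64):
    expert_dff = d_ff // n_experts                 # split d_ff evenly across experts
    sparse_flops = top_k * 4 * d_model * expert_dff
    print(f"N={n_experts:>2}: sparse/dense FLOPs = {sparse_flops / dense_flops:.3f}"
          f"  (2/N = {top_k / n_experts:.3f})")
```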
A significant advantage of sparse models is the ability to increase total parameters without increasing compute. Rather than keeping the same total parameter count as the dense model, we can make each expert the same size as the original dense FFN, giving us $N$ times more parameters while only paying twice the dense compute under top-2 routing. This is the real power of sparse architectures: they let us scale capacity independently of computational cost.
The sparse model achieves significantly higher parameter count while maintaining reasonable computational cost. With 8 experts and top-2 routing, the total parameter count is 8x larger than the dense baseline, but FLOPs per token only double because each token activates just 2 of the 8 experts. This creates a 4x improvement in capacity per unit of compute. The dense model's single FFN processes every token, while the sparse model distributes capacity across multiple experts and activates only a subset per token. This fundamental trade-off allows sparse models to scale capacity beyond what would be computationally feasible with dense architectures.
In practice, this enables a new training and deployment paradigm:
| Metric | Dense 7B | Sparse 47B (8 experts, top-2) |
|---|---|---|
| Total Parameters | 7B | 47B |
| Active Parameters | 7B | ~12B |
| FLOPs per Token | ~14 GFLOP | ~24 GFLOP |
| Effective Capacity | 1x | ~4-6x |
The sparse model processes each token with similar cost to a ~12B dense model but has the knowledge capacity of a much larger model. This has significant implications for the scaling laws we discussed in Part XXI: we can push the Pareto frontier of performance vs. compute. The sparse architecture essentially lets us move along a different efficiency curve, one where we can trade increased memory (for storing more expert parameters) for improved quality at fixed compute.
Why Replace the Feed-Forward Network?
You might wonder why sparse architectures typically replace the FFN rather than the attention mechanism. As we explored in Part X, attention's computational cost grows quadratically with sequence length. But there's another consideration: the role each component plays in the model. Understanding these distinct roles helps explain why the FFN is the natural candidate for sparsification.
Attention mechanisms handle information routing, determining which tokens should attend to which other tokens. This function is inherently about relationships within the input and benefits from seeing all tokens simultaneously. Making attention sparse (as in Longformer and BigBird from Part XIV) requires careful design to ensure important relationships aren't missed. The attention mechanism needs to maintain a global view of the sequence to effectively aggregate information across positions.
Feed-forward networks, by contrast, operate on each token position independently. They're often interpreted as key-value memories that store factual knowledge and learned transformations. This independence is crucial: since each token's FFN computation doesn't depend on other tokens in the sequence, we can freely route different tokens to different experts without disrupting information flow. This makes them natural candidates for specialization: different experts can store different types of knowledge without interfering with the information routing that attention provides.
The FFN also dominates the parameter count in typical transformers. In a standard architecture with $d_{ff} = 4 \cdot d_{model}$, the FFN contains roughly two-thirds of each layer's parameters. This makes the FFN the natural target for sparsification. Converting it to a mixture of experts yields the largest efficiency gains. To see why the FFN dominates, let's calculate the parameter counts step by step, starting with the FFN:
$$\text{params}_{\text{FFN}} = 2 \cdot d_{model} \cdot d_{ff} = 2 \cdot d_{model} \cdot 4 \cdot d_{model} = 8 \cdot d_{model}^2$$
For comparison, the attention mechanism contains four projection matrices (Q, K, V, and output projection), each with dimension $d_{model} \times d_{model}$. These projections transform the input into queries, keys, and values for the attention computation, then project the attention output back to the model dimension. This gives:
$$\text{params}_{\text{attn}} = 4 \cdot d_{model}^2$$
where:
- $4$: the number of projection matrices in multi-head attention (query, key, value, and output projections)
- $d_{model} \times d_{model}$: the dimensions of each projection matrix, mapping from the model dimension to itself
To compute the FFN's share of total layer parameters, we divide FFN parameters by the sum of FFN and attention parameters. This ratio tells us what fraction of each layer's capacity resides in the feed-forward network:
$$\frac{\text{params}_{\text{FFN}}}{\text{params}_{\text{FFN}} + \text{params}_{\text{attn}}} = \frac{8 \cdot d_{model}^2}{8 \cdot d_{model}^2 + 4 \cdot d_{model}^2} = \frac{2}{3} \approx 67\%$$
This means the FFN accounts for approximately 67% of each layer's parameters. This two-thirds dominance makes the FFN the natural target for sparsification: converting it to a mixture of experts allows us to scale the parameter count (by adding more experts) while keeping computational cost proportional to the number of active experts rather than total experts. Since the FFN already holds most of the parameters, this is where we get the biggest capacity gains from sparse computation. If we were to sparsify the attention mechanism instead, we would affect only one-third of the layer's parameters, yielding proportionally smaller efficiency gains.
In other words, the FFN contains roughly twice as many parameters as the attention mechanism, so adding experts to the FFN scales the parameter count substantially while keeping computational cost proportional to the number of active experts rather than total experts. Replacing the dense FFN with a sparse mixture of experts while keeping attention dense preserves the model's ability to route information effectively while dramatically increasing its knowledge capacity.
Challenges of Sparse Architectures
Sparse models introduce complexity that dense models don't face. Decoupling capacity from compute brings several engineering challenges along with it.
Load Imbalance
The most immediate challenge is load imbalance. If the router learns to send most tokens to a small subset of experts, the model degrades to an expensive dense network where popular experts become bottlenecks while unused experts contribute nothing. This scenario represents a failure to achieve the sparse model's promise: we pay for all the expert parameters in memory but only use a fraction of them effectively.
This happens naturally during training without intervention. Early in training, some experts may randomly produce slightly better outputs for common inputs. The router learns to prefer these experts, which then receive more gradient updates and improve further. Meanwhile, neglected experts fall behind, creating a rich-get-richer dynamic that can collapse the model to using just one or two experts. This collapse is self-reinforcing: once an expert dominates, it receives most training signal and becomes even more dominant, while underused experts stagnate.
The load balance coefficient quantifies this imbalance mathematically, providing a single number that summarizes how evenly tokens are distributed across experts. For $N$ experts and $T$ total tokens, perfect balance would send $T/N$ tokens to each expert. The load balance coefficient is defined as:
$$\text{balance} = \frac{\min_i \, \text{tokens}_i}{T / N}$$
where:
- $N$: the total number of experts in the layer
- $T$: the total number of tokens being processed
- $\min_i \, \text{tokens}_i$: the minimum number of tokens sent to any single expert (the most underutilized expert)
- $T/N$: the ideal load per expert under perfect balance
This coefficient equals 1.0 when all experts receive exactly $T/N$ tokens (perfect balance), and approaches 0 as the distribution becomes more skewed. A coefficient near 0 indicates severe imbalance where some experts are heavily overused while others sit idle. The focus on the minimum load is intentional: it identifies the weakest link, the expert receiving the least training signal, which determines how effectively the model uses its full capacity.
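The coefficient translates directly into code; the token counts below are invented for illustration:

```python
def load_balance_coefficient(tokens_per_expert: list[int]) -> float:
    """Minimum per-expert load divided by the ideal load T/N."""
    ideal = sum(tokens_per_expert) / len(tokens_per_expert)
    return min(tokens_per_expert) / ideal

balanced   = [125, 130, 120, 128, 122, 126, 124, 125]   # roughly even split of 1,000 tokens
imbalanced = [700,  90,  60,  50,  40,  30,  20,  10]   # one expert dominates

print(round(load_balance_coefficient(balanced), 2))      # 0.96 -> near perfect balance
print(round(load_balance_coefficient(imbalanced), 2))    # 0.08 -> severe imbalance
```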
These two scenarios illustrate the extremes. In the imbalanced case, the coefficient is near 0: one expert receives the majority of tokens while the others are severely underutilized, creating a bottleneck where a single expert must process most inputs. In the balanced case, the coefficient is near 1.0, with every expert receiving a load close to the ideal of 125 tokens per expert, which enables true parallel processing across all experts. Real sparse models require auxiliary loss functions during training to prevent collapse toward the imbalanced state, as the routing mechanism naturally tends toward specialization that can create bottlenecks.
Communication Overhead in Distributed Settings
Sparse models shine for large-scale systems, but they introduce unique distributed computing challenges. In a dense model, each GPU holds a copy of the full layer and processes its assigned batch of tokens independently. The communication pattern is regular and predictable: gradients aggregate across devices during backward passes, but forward passes require no cross-device communication within a layer. In a sparse model with experts distributed across GPUs, tokens must travel to whichever GPU holds their assigned expert, fundamentally changing the communication pattern.
This creates an all-to-all communication pattern: every GPU may need to send tokens to every other GPU and receive tokens back. The communication volume depends on how tokens route, which depends on the input data. This is fundamentally different from the predictable, structured communication patterns in tensor parallelism or pipeline parallelism for dense models. Network bandwidth becomes a potential bottleneck, and communication latency adds to the overall processing time.
This creates an important trade-off in sparse model deployment. Configurations with fewer GPUs relative to experts incur lower communication costs because multiple experts can reside on the same GPU, reducing cross-GPU traffic. Conversely, configurations with one expert per GPU, or with experts spread across many GPUs, incur higher communication overhead as tokens must be routed across the network. The communication cost per layer can range from negligible to several gigabytes depending on the deployment configuration. In short, sparse models trade reduced computation for increased communication, and the trade-off is favorable only when the computational savings outweigh the communication overhead.
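A back-of-the-envelope estimate of the all-to-all volume shows why the layout matters. The batch size, hidden size, precision, and uniform-routing assumption below are all illustrative simplifications:

```python
def all_to_all_gb(tokens_per_gpu: int, d_model: int, top_k: int,
                  experts_per_gpu: int, n_experts: int, bytes_per_value: int = 2) -> float:
    """Rough GB each GPU exchanges per MoE layer (dispatch plus combine)."""
    # Under uniform routing, this fraction of expert assignments lands on another GPU.
    remote_fraction = 1 - experts_per_gpu / n_experts
    one_way = tokens_per_gpu * top_k * remote_fraction * d_model * bytes_per_value
    return 2 * one_way / 1e9          # activations go out, expert outputs come back

# 8 experts packed onto 2 GPUs vs. spread across 8 GPUs, bf16 activations.
for experts_per_gpu in (4, 1):
    gb = all_to_all_gb(tokens_per_gpu=65_536, d_model=8_192, top_k=2,
                       experts_per_gpu=experts_per_gpu, n_experts=8)
    print(f"{experts_per_gpu} expert(s) per GPU: ~{gb:.1f} GB per GPU per layer")
```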
Training Instability
Sparse models introduce non-differentiable routing decisions into the computational graph. The choice of which experts to activate is discrete: tokens either go to an expert or they don't. This discreteness complicates gradient computation because the standard backpropagation algorithm requires differentiable operations to compute gradients.
Several techniques address this challenge, each with its own trade-offs:
- Soft routing: Instead of hard selection, use weighted combinations of experts (though this partially defeats the computational savings)
- Straight-through estimators: Approximate gradients through discrete choices
- Auxiliary losses: Add terms to the training objective that encourage desirable routing behavior
The router also creates a circular dependency: the router should learn to send tokens to experts that process them well, but experts can only learn to process tokens they receive. Early training can be unstable as this chicken-and-egg problem resolves. This interplay between router learning and expert learning requires careful initialization and sometimes curriculum-based training strategies.
Inference Complexity
Dense model inference is straightforward: run the forward pass, and you're done. Sparse model inference requires additional bookkeeping that adds engineering complexity. For each layer, the system must:
- Run the router to determine expert assignments
- Group tokens by their assigned experts
- Process each expert's assigned tokens
- Reassemble results in the correct order
When serving requests with different inputs, the routing patterns vary, making batching less efficient than in dense models. Some inputs might activate experts heavily while others need different experts entirely. This variability complicates capacity planning and can lead to uneven GPU utilization across a serving cluster.
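A small sketch of the padding problem, using invented queue lengths:

```python
# Hypothetical token counts routed to each of 8 experts in one batch.
queue_lengths = [250, 170, 150, 120, 100, 90, 70, 50]

max_len = max(queue_lengths)
useful = sum(queue_lengths)
padded = max_len * len(queue_lengths)      # every queue padded up to the longest

print(f"busiest expert: {max_len} tokens vs. average {useful / len(queue_lengths):.0f}")
print(f"compute utilization: {useful / padded:.0%} (the rest is spent on padding)")
```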
An analysis of per-expert queue lengths like the sketch above reveals significant inefficiency from uneven expert loads. When the busiest expert receives roughly twice as many tokens as the average, batching all experts in parallel forces us to pad the shorter queues to match the longest one, wasting computational cycles. The compute utilization metric quantifies this waste: it measures the fraction of hardware cycles that perform useful computation rather than processing padding. This is why load balancing is critical in production systems. Without it, the theoretical efficiency gains from sparsity disappear as compute is wasted matching uneven loads across experts.
The Historical Arc
The idea of conditional computation predates the transformer era. Early work in the 1990s explored mixture of experts for various machine learning tasks. However, these early systems used small numbers of experts and faced challenges with training stability and load balancing.
The key enabler for modern sparse models was scale. Small models don't benefit much from sparsity because the routing overhead dominates. But as models grew to billions of parameters, the economics shifted. Google's research on sparsely activated transformers, including the Switch Transformer, demonstrated that sparse models could match dense model quality while using a fraction of the compute.
Recent architectures like Mixtral, which we'll examine later in this part, have brought sparse models into the mainstream. By carefully engineering the routing mechanism and balancing losses, these models achieve compelling quality-to-compute ratios that make them practical for both training and deployment.
Summary
Sparse models represent a fundamental shift in how we think about neural network scaling. Rather than accepting that more parameters always means more compute, sparse architectures decouple these quantities through conditional computation.
The core concepts we've covered lay the groundwork for understanding mixture of experts:
- Dense models activate all parameters for every input, creating a tight coupling between capacity and cost
- Conditional computation activates different parameter subsets based on the input, breaking this coupling
- Sparse efficiency comes from having many experts (high capacity) while using few per token (low compute)
- Feed-forward networks are natural targets for sparsification due to their per-token independence and large parameter share
The challenges of sparse models, including load imbalance, communication overhead, training instability, and inference complexity, have driven a rich body of research into routing mechanisms and training techniques. The next chapter introduces expert networks in detail, followed by chapters on gating mechanisms, load balancing strategies, and landmark architectures that have made sparse models practical.
Sparse models don't replace dense models entirely. They represent a new point on the Pareto frontier of capability vs. compute, particularly valuable when you need more capacity than you can afford to run densely. Understanding when and how to leverage sparsity is becoming an essential skill as models continue to scale.
Key Parameters
The key parameters for sparse models are:
- n_experts: Number of expert networks in the sparse layer. More experts increase total model capacity but also increase memory requirements and communication overhead.
- top_k: Number of experts to activate per token. Lower values reduce compute cost but may limit the model's ability to combine diverse knowledge.
- d_ff (or expert_dim): Hidden dimension of each expert's feed-forward network. Larger dimensions increase expert capacity but also increase per-expert compute cost.
- load_balance_coefficient: Metric measuring how evenly tokens distribute across experts. Values near 1.0 indicate good balance, while values near 0 suggest severe imbalance requiring intervention.