Learn how GLUs transform feed-forward networks through multiplicative gating. Understand SwiGLU, GeGLU, and the parameter trade-offs that power LLaMA, Mistral, and other state-of-the-art language models.

This article is part of the free-to-read Language AI Handbook
Gated Linear Units
The standard feed-forward network in transformers applies a simple formula: expand the dimension, apply an activation function, contract back. This works, but researchers discovered something better. By adding a gating mechanism, where one pathway controls how much of another pathway passes through, the FFN becomes significantly more expressive. This insight, borrowed from recurrent neural networks, has become the default choice in modern large language models.
Gated Linear Units (GLUs) replace the single nonlinearity in the FFN with a product of two linear projections, one of which passes through an activation function to produce "gate" values between 0 and 1. These gates control information flow: they can fully pass, partially attenuate, or completely block the signal from the other pathway. This multiplicative interaction creates richer representations than a simple pointwise nonlinearity.
The impact has been dramatic. LLaMA, PaLM, Mistral, and virtually every state-of-the-art LLM now uses a GLU variant. The SwiGLU formulation has become particularly dominant, offering improved training dynamics and better downstream performance at comparable parameter counts. This chapter explains why gating works, how to implement it, and the trade-offs that come with this more sophisticated architecture.
The Gating Mechanism
Before diving into the GLU formulation, let's understand what gating means. A gate is a learned mechanism that controls information flow. The idea comes from LSTM networks, where gates solve the vanishing gradient problem by allowing gradients to flow unimpeded through time. In the context of feed-forward layers, gates serve a different purpose: they create multiplicative interactions that increase expressiveness.
A gate is a learned function that produces values in a bounded range (typically 0 to 1) that multiply another signal. When the gate outputs 0, the signal is blocked; when it outputs 1, the signal passes through unchanged. Values between 0 and 1 attenuate the signal proportionally.
Consider the difference between additive and multiplicative interactions. A standard FFN uses purely additive interactions: the output is a weighted sum of transformed features. Multiplication introduces second-order terms. When you multiply two linear functions of the input, you get cross terms like $x_i x_j$, which allows the network to model interactions between features that addition alone cannot capture.
The power of gating becomes clear when you consider what a gate can learn. A gate can learn to be "always open" (outputting 1), in which case the GLU behaves like a standard linear layer. It can learn to be "always closed" (outputting 0), effectively pruning that dimension. Or it can learn input-dependent behavior, opening for some patterns and closing for others. This flexibility subsumes the standard FFN as a special case while enabling much richer transformations.
The GLU Formulation
Now that we understand what gating does, let's formalize how it works mathematically. The Gated Linear Unit was introduced by Dauphin et al. in 2017 for language modeling, and its elegance lies in a simple but powerful insight: instead of applying an activation function to transform features, let the network learn which features to keep and which to suppress.
The Core Idea: Two Pathways, One Output
Standard neural network layers compute a single transformation of their input and apply an activation function. GLU takes a fundamentally different approach: it computes two separate linear projections of the same input and combines them through multiplication. One projection produces the "values" we want to transmit, while the other produces "gates" that control how much of each value passes through.
Think of it like a mixing console in a recording studio. Each channel has both a signal (the audio content) and a fader (the volume control). The final output for each channel is the signal multiplied by the fader position. The GLU works the same way, but learns both the signals and the fader positions from data.
The Mathematical Formulation
The original GLU formula captures this two-pathway structure:

$$\mathrm{GLU}(x) = (xW + b) \odot \sigma(xV + c)$$
Let's unpack each component to understand what it contributes:
The Value Pathway: $xW + b$
The first term computes a linear transformation of the input. This is the "content" that we might want to pass through to the next layer. Without the gate, this would just be a standard linear layer:
- $x$: the input vector (a token's representation)
- $W$: the "value" projection weights that transform the input
- $b$: the value projection bias
The Gate Pathway: $\sigma(xV + c)$
The second term computes another linear transformation of the same input, but passes it through the sigmoid function to produce gate values:
- $V$: the "gate" projection weights (separate from $W$)
- $c$: the gate projection bias
- $\sigma$: the sigmoid activation function, which squashes values to the range $(0, 1)$
- $d_{\text{out}}$: the output dimension of each projection
The Combination: $\odot$ (Element-wise Multiplication)
The symbol $\odot$ denotes the Hadamard product, which simply multiplies corresponding elements. If the value pathway produces $a = xW + b$ and the gate pathway produces $g = \sigma(xV + c)$, the output is $a \odot g$, whose $i$-th element is $a_i \, g_i$.
Why Sigmoid for Gating?
The sigmoid function produces values between 0 and 1, making it a natural choice for gating:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

where $z$ is the input value (applied element-wise when the input is a vector). The sigmoid's behavior at extreme values is what makes it effective as a gate:
- As $z \to -\infty$, $\sigma(z) \to 0$ (gate closes completely)
- As $z \to +\infty$, $\sigma(z) \to 1$ (gate opens fully)
- At $z = 0$, $\sigma(z) = 0.5$ (gate is half-open)
When $\sigma(xV + c)_i$ is close to 0, the corresponding element of $xW + b$ is blocked, meaning that dimension contributes nothing to the output. When it is close to 1, the element passes through unchanged. Values between 0 and 1 allow partial transmission, giving the network fine-grained control over information flow.
From Theory to Code
Let's implement the basic GLU and see these components in action:
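Here's a minimal sketch in PyTorch; the dimensions ($d_{\text{in}} = 8$, $d_{\text{out}} = 16$), the random seed, and the variable names are illustrative choices rather than prescribed values:

```python
# Minimal GLU in PyTorch. The dimensions (d_in=8, d_out=16), seed, and random
# inputs are illustrative assumptions rather than values from a specific model.
import torch
import torch.nn as nn

class GLU(nn.Module):
    """Original GLU (Dauphin et al., 2017): (xW + b) ⊙ sigmoid(xV + c)."""

    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.value_proj = nn.Linear(d_in, d_out)  # value pathway: xW + b
        self.gate_proj = nn.Linear(d_in, d_out)   # gate pathway: xV + c (pre-sigmoid)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        value = self.value_proj(x)                # the "content" to transmit
        gate = torch.sigmoid(self.gate_proj(x))   # gate values in (0, 1)
        return value * gate                       # element-wise (Hadamard) product

torch.manual_seed(0)
glu = GLU(d_in=8, d_out=16)
x = torch.randn(4, 8)                             # a small batch of 4 input vectors
out = glu(x)

print("output shape:", tuple(out.shape))          # (4, 16)
print(f"output range: [{out.min().item():.3f}, {out.max().item():.3f}]")
```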
The output range shows that GLU produces both positive and negative values, unlike ReLU which only produces non-negative outputs. This is because the sigmoid gate modulates but doesn't constrain the sign of the value pathway.
The Dimension Trade-off
Notice something important about GLU's output dimension. If we project from $d_{\text{in}}$ to $d_{\text{out}}$, we get $d_{\text{out}}$ outputs, not $2\,d_{\text{out}}$. This is because we use two projections (value and gate), each producing $d_{\text{out}}$ dimensions, and multiply them together element-wise to get $d_{\text{out}}$ outputs. The multiplication combines the two pathways rather than concatenating them.
This has significant implications for parameter efficiency that we'll explore shortly. First, let's see how GLU fits into the larger transformer architecture.
GLU in the Feed-Forward Network
With the GLU mechanism understood, our next step is integrating it into the transformer's feed-forward network. This requires adapting the standard two-layer FFN architecture, and understanding this integration reveals why GLU is such a natural fit for transformers.
Recalling the Standard FFN
The standard FFN applies a simple pattern: expand the dimension, apply nonlinearity, contract back:

$$\mathrm{FFN}(x) = f(xW_1 + b_1)\,W_2 + b_2$$

where $x \in \mathbb{R}^{d_{\text{model}}}$ is the input, $W_1 \in \mathbb{R}^{d_{\text{model}} \times d_{\text{ff}}}$ and $W_2 \in \mathbb{R}^{d_{\text{ff}} \times d_{\text{model}}}$ are weight matrices, $b_1$ and $b_2$ are bias vectors, and $f$ is a nonlinear activation function (e.g., ReLU or GELU).
The key insight is that the nonlinearity is applied after the expansion. This creates a high-dimensional feature space where the nonlinearity can carve out complex decision boundaries, before projecting back to the original dimension.
Replacing Activation with Gating
GLU replaces the simple activation function with something more sophisticated: a learned gating mechanism. Instead of uniformly transforming all features with the same nonlinearity, GLU lets the network selectively amplify or suppress features based on the input itself:

$$\mathrm{FFN}_{\mathrm{GLU}}(x) = \big((xW + b) \odot \sigma(xV + c)\big)\,W_2 + b_2$$
Let's trace through what each component does in this context:
- Expansion to Hidden Space: Both $W$ and $V$ project the input from $d_{\text{model}}$ to the larger hidden dimension $d_{\text{ff}}$
- Gated Combination: The sigmoid-gated multiplication produces a $d_{\text{ff}}$-dimensional hidden representation
- Contraction Back: $W_2$ projects back to $d_{\text{model}}$ for the residual connection
The variables are:
- $x \in \mathbb{R}^{d_{\text{model}}}$: input from attention/normalization
- $W, V \in \mathbb{R}^{d_{\text{model}} \times d_{\text{ff}}}$: the two projection matrices for value and gate
- $b, c \in \mathbb{R}^{d_{\text{ff}}}$: corresponding bias vectors
- $W_2 \in \mathbb{R}^{d_{\text{ff}} \times d_{\text{model}}}$: output projection back to model dimension
- $b_2 \in \mathbb{R}^{d_{\text{model}}}$: output bias
The Parameter Cost
Notice that the GLU FFN requires three weight matrices ($W$, $V$, and $W_2$) instead of the standard FFN's two ($W_1$ and $W_2$). This means GLU-based FFNs have 50% more parameters for the same hidden dimension. We'll analyze this trade-off in detail later and see how modern models address it.
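Here's a sketch of the full GLU-based FFN in PyTorch; the dimensions ($d_{\text{model}} = 512$, $d_{\text{ff}} = 2048$) and the random batch are illustrative assumptions:

```python
# Sketch of a sigmoid-gated GLU feed-forward block. d_model=512 and d_ff=2048
# are illustrative choices, as is the random (batch, sequence, d_model) input.
import torch
import torch.nn as nn

class GLUFeedForward(nn.Module):
    """FFN_GLU(x) = ((xW + b) ⊙ sigmoid(xV + c)) W2 + b2."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.value_proj = nn.Linear(d_model, d_ff)  # W, b
        self.gate_proj = nn.Linear(d_model, d_ff)   # V, c
        self.out_proj = nn.Linear(d_ff, d_model)    # W2, b2

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        hidden = self.value_proj(x) * torch.sigmoid(self.gate_proj(x))  # d_ff-dim gated hidden
        return self.out_proj(hidden)                                    # contract back to d_model

torch.manual_seed(0)
ffn = GLUFeedForward(d_model=512, d_ff=2048)
x = torch.randn(2, 16, 512)                         # (batch, sequence, d_model)
out = ffn(x)
print("input shape: ", tuple(x.shape))
print("output shape:", tuple(out.shape))            # same as the input shape
```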
The output shape matches the input shape, confirming that the GLU FFN correctly projects back to the model dimension. This dimensional consistency is essential for the residual connection that adds the FFN output back to its input in a transformer block.
SwiGLU: The Modern Standard
The original GLU with sigmoid gating works, but researchers discovered an even better formulation. SwiGLU, introduced by Shazeer in 2020, has become the dominant choice in modern LLMs, powering LLaMA, Mistral, and virtually every frontier model. Understanding why SwiGLU outperforms the original GLU reveals important insights about activation function design.
The Limitation of Sigmoid Gating
Recall that the original GLU uses sigmoid to produce gate values between 0 and 1. While this works well for controlling information flow, sigmoid has a subtle limitation: it saturates at both ends. For very large positive inputs, the gate is always ~1; for very large negative inputs, always ~0. This saturation can limit expressiveness and cause gradient flow issues in deep networks.
From Sigmoid to Swish: A Self-Gating Activation
Swish (also called SiLU, for Sigmoid Linear Unit) offers an elegant solution. Instead of using sigmoid as a separate gate, Swish embeds gating directly into the activation function:

$$\mathrm{Swish}(z) = z \cdot \sigma(z)$$

where $z$ is the input value and $\sigma$ is the sigmoid function.
This formula is deceptively simple but powerful. Let's understand what it does:
- For large positive $z$: $\sigma(z) \approx 1$, so $\mathrm{Swish}(z) \approx z$ (passes through like a linear function)
- For large negative $z$: $\sigma(z) \approx 0$, so $\mathrm{Swish}(z) \approx 0$ (suppressed like ReLU)
- For small $z$ near zero: both $z$ and $\sigma(z)$ contribute, creating a smooth transition
The key insight is that Swish is self-gating: the input determines its own gate value. This creates an implicit "soft attention" where each value decides how much of itself to pass through.
The SwiGLU Formula
SwiGLU combines Swish with the two-pathway GLU structure:

$$\mathrm{SwiGLU}(x) = \mathrm{Swish}(xW) \odot (xV)$$

where:
- $x$: the input vector
- $W$: the first projection matrix (Swish-activated pathway)
- $V$: the second projection matrix (linear pathway)
- $\odot$: element-wise multiplication (Hadamard product)
Expanding Swish reveals the full structure:

$$\mathrm{SwiGLU}(x) = \big(xW \cdot \sigma(xW)\big) \odot (xV)$$
The Asymmetry Explained
Notice something different from the original GLU: the asymmetry. In SwiGLU:
- The first pathway ($xW$) passes through Swish, which includes self-gating
- The second pathway ($xV$) remains completely linear
This differs from the original GLU, where sigmoid was applied specifically to the "gate" pathway and the "value" pathway was linear. In SwiGLU, the Swish-activated pathway acts as a kind of self-gated value, while the linear pathway provides additional expressiveness through the multiplication.
The combination creates a rich interaction: you have two different views of the input (through $W$ and $V$), one filtered through a self-gating nonlinearity, combined multiplicatively. This gives the network more degrees of freedom than either a simple activation or the original GLU.
Why SwiGLU Outperforms the Original
Several factors contribute to SwiGLU's success:
- Unbounded positive values: Unlike sigmoid (which saturates at 1), Swish can output arbitrarily large positive values, matching the range of ReLU-family activations
- Smooth gradients: Swish is differentiable everywhere with smooth gradients, avoiding the sharp transitions at zero that ReLU has
- Self-gating property: The formula $z \cdot \sigma(z)$ means each dimension gates itself, creating input-dependent behavior without explicit gate parameters
- Better gradient flow: The linear term in Swish ensures gradients can flow even for extreme inputs, helping train deep networks
The gradient behavior is particularly important for training deep networks. Let's visualize how gradients flow differently through each activation:
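A small sketch using autograd makes the comparison concrete; the probe points are arbitrary, and printed derivative values stand in for the plot:

```python
# Gradient comparison via autograd. The probe points are arbitrary; printed
# derivative values stand in for the original visualization.
import torch
import torch.nn.functional as F

z = torch.tensor([-4.0, -1.0, 0.0, 1.0, 4.0], requires_grad=True)

for name, fn in [("ReLU", F.relu), ("Sigmoid", torch.sigmoid), ("Swish/SiLU", F.silu)]:
    (grad,) = torch.autograd.grad(fn(z).sum(), z)   # d activation / d z at each probe point
    print(f"{name:>10}: " + "  ".join(f"{g:+.3f}" for g in grad.tolist()))

# ReLU's gradient is exactly 0 for negative inputs, sigmoid's gradient vanishes at
# both extremes, while Swish keeps a small nonzero gradient for moderately negative
# inputs and approaches 1 for large positive inputs.
```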
Implementation
Let's implement SwiGLU and see these properties in action:
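Below is a sketch of a bias-free SwiGLU block in the LLaMA style; the dimensions, seed, and the Xavier initialization choice are illustrative, and the reduced hidden size is picked for rough parameter parity:

```python
# Bias-free SwiGLU block. Dimensions, seed, and Xavier initialization are
# illustrative assumptions, not a prescribed configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """SwiGLU FFN: W2( Swish(xW) ⊙ (xV) )."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w = nn.Linear(d_model, d_ff, bias=False)    # Swish-activated pathway
        self.v = nn.Linear(d_model, d_ff, bias=False)    # linear pathway
        self.w2 = nn.Linear(d_ff, d_model, bias=False)   # output projection
        for lin in (self.w, self.v, self.w2):
            nn.init.xavier_uniform_(lin.weight)          # Xavier/Glorot initialization

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w(x)) * self.v(x))    # F.silu is Swish: z * sigmoid(z)

torch.manual_seed(0)
swiglu = SwiGLU(d_model=512, d_ff=1365)                  # d_ff ≈ (2/3) · 4 · d_model
x = torch.randn(512)
out = swiglu(x)
print(f"input norm:  {x.norm().item():.2f}")
print(f"output norm: {out.norm().item():.2f}")
```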
The output norm provides a rough measure of the magnitude of the transformation. With Xavier initialization, we expect the output norm to be similar to the input norm, indicating stable signal propagation through the network.
Let's examine the self-gating behavior of Swish more closely by looking at how the implicit gate values (the sigmoid component) distribute for random inputs:
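A quick sketch of those gate statistics, using standard-normal pre-activations as a stand-in for real hidden states:

```python
# Distribution of the implicit Swish gate sigma(z) for standard-normal inputs.
# Printed statistics stand in for a histogram; inputs are synthetic.
import torch

torch.manual_seed(0)
z = torch.randn(100_000)
gate = torch.sigmoid(z)                     # the gate inside Swish(z) = z * sigma(z)

print(f"mean gate value:        {gate.mean().item():.3f}")                     # ≈ 0.5 by symmetry
print(f"fraction < 0.1:         {(gate < 0.1).float().mean().item():.3f}")     # strongly suppressed
print(f"fraction > 0.9:         {(gate > 0.9).float().mean().item():.3f}")     # nearly fully open
print(f"fraction in (0.3, 0.7): {((gate > 0.3) & (gate < 0.7)).float().mean().item():.3f}")
```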
Let's also visualize how Swish compares to other activation functions and how this affects the gating behavior:
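The following sketch prints the activations at a few arbitrary probe points instead of plotting them:

```python
# Activation values at a few arbitrary probe points, as a stand-in for the plot.
import torch
import torch.nn.functional as F

z = torch.tensor([-3.0, -1.0, -0.5, 0.0, 0.5, 1.0, 3.0])
rows = {
    "ReLU": F.relu(z),
    "GELU": F.gelu(z),
    "Swish/SiLU": F.silu(z),
    "sigmoid gate": torch.sigmoid(z),        # the separate gate used by the original GLU
}
print("z:            " + "  ".join(f"{v:+.2f}" for v in z.tolist()))
for name, vals in rows.items():
    print(f"{name:<13} " + "  ".join(f"{v:+.2f}" for v in vals.tolist()))
```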
GeGLU: The GELU Variant
Another popular variant is GeGLU, which uses GELU instead of Swish for the gating function. The GeGLU formula follows the same pattern as SwiGLU, replacing the Swish activation with GELU:

$$\mathrm{GeGLU}(x) = \mathrm{GELU}(xW) \odot (xV)$$

where:
- $x$: the input vector
- $W$: the first projection matrix (GELU-activated pathway)
- $V$: the second projection matrix (linear pathway)
- $\odot$: element-wise multiplication (Hadamard product)
GELU (Gaussian Error Linear Unit) is defined as the product of the input and the cumulative distribution function (CDF) of the standard normal distribution:

$$\mathrm{GELU}(z) = z \cdot \Phi(z)$$

where:
- $z$: the input value (a scalar, applied element-wise to vectors)
- $\Phi(z)$: the cumulative distribution function of the standard normal distribution, i.e., $\Phi(z) = P(Z \le z)$ for $Z \sim \mathcal{N}(0, 1)$
Computing the exact CDF is expensive, so in practice a fast approximation is used:

$$\mathrm{GELU}(z) \approx 0.5\,z\left(1 + \tanh\!\left[\sqrt{2/\pi}\,\big(z + 0.044715\,z^{3}\big)\right]\right)$$

This approximation uses the hyperbolic tangent function to approximate the error function that underlies the normal CDF. The constant $0.044715$ was chosen empirically to minimize approximation error across the typical input range.
GeGLU has similar properties to SwiGLU but with slightly different gradient characteristics. Some practitioners prefer GeGLU for encoder models (following BERT's use of GELU), while SwiGLU has become more common in decoder-only LLMs.
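To compare the two variants side by side, here's a sketch that runs the same input and weights through a Swish-gated and a GELU-gated FFN; the plain-matrix formulation and the dimensions are illustrative assumptions:

```python
# SwiGLU vs. GeGLU on identical random weights and input. The dimensions
# (d_model=256, d_ff=683) and 1/sqrt(fan-in) scaling are illustrative.
import torch
import torch.nn.functional as F

def gated_ffn(x, w, v, w2, activation):
    """Generic gated FFN: activation(xW) ⊙ (xV), projected back with W2."""
    return (activation(x @ w) * (x @ v)) @ w2

torch.manual_seed(0)
d_model, d_ff = 256, 683                      # 683 ≈ (2/3) · 4 · 256
w = torch.randn(d_model, d_ff) / d_model**0.5
v = torch.randn(d_model, d_ff) / d_model**0.5
w2 = torch.randn(d_ff, d_model) / d_ff**0.5
x = torch.randn(d_model)

out_swiglu = gated_ffn(x, w, v, w2, F.silu)   # Swish-gated pathway
out_geglu = gated_ffn(x, w, v, w2, F.gelu)    # GELU-gated pathway

print(f"SwiGLU output norm: {out_swiglu.norm().item():.2f}")
print(f"GeGLU  output norm: {out_geglu.norm().item():.2f}")
print(f"difference norm:    {(out_swiglu - out_geglu).norm().item():.2f}")
```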
The output norms are similar between SwiGLU and GeGLU, indicating that both variants produce representations of comparable magnitude. The non-zero difference norm confirms that while the outputs are similar, they're not identical. These differences are subtle in terms of output statistics but can compound over many layers and matter for training dynamics and final model quality at scale.
Parameter Efficiency Analysis
A crucial consideration when adopting GLU variants is the parameter cost. The standard FFN has two weight matrices, while GLU-based FFNs have three. Let's analyze this trade-off carefully.
For a standard FFN with hidden dimension $d_{\text{ff}}$, the parameter count (ignoring biases) is:

$$P_{\text{standard}} = 2 \cdot d_{\text{model}} \cdot d_{\text{ff}}$$

This comes from two weight matrices:
- $W_1 \in \mathbb{R}^{d_{\text{model}} \times d_{\text{ff}}}$: $d_{\text{model}} \cdot d_{\text{ff}}$ parameters
- $W_2 \in \mathbb{R}^{d_{\text{ff}} \times d_{\text{model}}}$: $d_{\text{ff}} \cdot d_{\text{model}}$ parameters
For a GLU-based FFN with the same hidden dimension $d_{\text{ff}}$, the parameter count is:

$$P_{\text{GLU}} = 3 \cdot d_{\text{model}} \cdot d_{\text{ff}}$$

This comes from three weight matrices:
- $W \in \mathbb{R}^{d_{\text{model}} \times d_{\text{ff}}}$: $d_{\text{model}} \cdot d_{\text{ff}}$ parameters (value/Swish projection)
- $V \in \mathbb{R}^{d_{\text{model}} \times d_{\text{ff}}}$: $d_{\text{model}} \cdot d_{\text{ff}}$ parameters (gate/linear projection)
- $W_2 \in \mathbb{R}^{d_{\text{ff}} \times d_{\text{model}}}$: $d_{\text{ff}} \cdot d_{\text{model}}$ parameters (output projection)
The GLU variant has 50% more parameters for the same hidden dimension. To maintain the same parameter count as a standard FFN, we can solve for the reduced hidden dimension $d_{\text{ff}}'$:

$$3 \cdot d_{\text{model}} \cdot d_{\text{ff}}' = 2 \cdot d_{\text{model}} \cdot d_{\text{ff}}$$

Solving for $d_{\text{ff}}'$:

$$d_{\text{ff}}' = \frac{2}{3}\,d_{\text{ff}}$$

This means we need to scale $d_{\text{ff}}$ by a factor of $\tfrac{2}{3}$ to match the parameter count of a standard FFN.
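The arithmetic is easy to check with a short sketch; $d_{\text{model}} = 768$ with a standard 4x hidden size of 3072 is an illustrative configuration:

```python
# Parameter arithmetic (biases ignored). d_model=768 with a standard 4x hidden
# size of 3072 is an illustrative configuration.
d_model, d_ff = 768, 3072

standard = 2 * d_model * d_ff            # W1 and W2
glu_same_dff = 3 * d_model * d_ff        # W, V, W2 with the same hidden size
d_ff_reduced = int(2 * d_ff / 3)         # 2048: reduced hidden size for parameter parity
glu_reduced = 3 * d_model * d_ff_reduced

print(f"standard FFN (d_ff={d_ff}):     {standard:,} params")
print(f"GLU FFN, same d_ff:             {glu_same_dff:,} params (+50%)")
print(f"GLU FFN, d_ff reduced to {d_ff_reduced}: {glu_reduced:,} params")
```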
The comparison shows that keeping the same hidden dimension increases parameters by 50%, while reducing $d_{\text{ff}}$ to 2048 brings the parameter count back to that of the standard FFN. This demonstrates the practical trade-off: you can either accept more parameters for the same hidden capacity, or reduce hidden capacity to match parameters while gaining the expressiveness of gating.
LLaMA and other modern models typically use the reduced hidden dimension approach. For example, LLaMA-7B uses $d_{\text{model}} = 4096$ with $d_{\text{ff}} = 11008$, which is approximately $2.7 \times d_{\text{model}}$ rather than the standard $4 \times d_{\text{model}} = 16384$. This gives roughly the same parameter count as a standard FFN with 4x expansion while gaining the benefits of gated activation.
Hidden Dimension Choices in Practice
Different models make different choices about how to handle the parameter trade-off. Let's examine the configurations used by several prominent models:
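The sketch below tabulates commonly reported configurations; PaLM-8B's hidden size is inferred from its published 4x expansion ratio, so treat the exact numbers as approximate reference values rather than official specifications:

```python
# Commonly reported FFN configurations; PaLM-8B's hidden size is inferred from
# its published 4x expansion ratio. Treat the numbers as approximate.
configs = {
    "GPT-2 (small)": dict(d_model=768, d_ff=3072, variant="standard (GELU)"),
    "BERT-base": dict(d_model=768, d_ff=3072, variant="standard (GELU)"),
    "LLaMA-7B": dict(d_model=4096, d_ff=11008, variant="SwiGLU"),
    "Mistral-7B": dict(d_model=4096, d_ff=14336, variant="SwiGLU"),
    "PaLM-8B": dict(d_model=4096, d_ff=16384, variant="SwiGLU"),
}
print(f"{'model':<14}{'d_model':>8}{'d_ff':>8}{'ratio':>8}  variant")
for name, c in configs.items():
    ratio = c["d_ff"] / c["d_model"]
    print(f"{name:<14}{c['d_model']:>8}{c['d_ff']:>8}{ratio:>7.2f}x  {c['variant']}")
```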
The comparison reveals interesting design choices across different models. Standard-FFN models like GPT-2 and BERT use a consistent 4x ratio. SwiGLU models show more variation: LLaMA-7B uses 2.69x (close to parameter parity with a standard 4x FFN), while Mistral-7B and PaLM-8B use higher ratios, trading additional parameters for greater expressiveness.
Notice the pattern: models using SwiGLU tend to use ratios around 2.7x to 4x, depending on whether they prioritize matching the parameter count of standard FFNs or pushing expressiveness further. Mistral-7B, for example, uses a higher ratio (3.5x), accepting more parameters in exchange for potentially better quality.
Visualizing GLU Representations
To understand how GLU transforms representations differently from standard activations, let's visualize the hidden activations for the same inputs processed through different FFN variants:
Let's also compare the distribution of activation values:
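The sketch below pushes the same random inputs through a ReLU hidden layer and a SwiGLU hidden layer (untrained, randomly initialized weights) and summarizes how the activations distribute; the numbers only illustrate the qualitative difference:

```python
# Hidden-activation sparsity: ReLU FFN vs. SwiGLU FFN with random (untrained)
# weights. Dimensions and scaling are illustrative.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_model, d_ff = 256, 1024
x = torch.randn(512, d_model)                    # 512 random token representations
w1 = torch.randn(d_model, d_ff) / d_model**0.5   # first projection (shared for comparison)
v = torch.randn(d_model, d_ff) / d_model**0.5    # extra linear pathway used by SwiGLU

relu_hidden = F.relu(x @ w1)                     # standard FFN hidden activations
swiglu_hidden = F.silu(x @ w1) * (x @ v)         # gated hidden activations

for name, h in [("ReLU", relu_hidden), ("SwiGLU", swiglu_hidden)]:
    print(f"{name:<7} exactly zero: {(h == 0).float().mean().item():.1%}   "
          f"|value| < 1e-3: {(h.abs() < 1e-3).float().mean().item():.1%}   "
          f"negative: {(h < 0).float().mean().item():.1%}")
```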
The key observation is that SwiGLU produces more continuous, less sparse activations. While ReLU creates hard sparsity (many exactly-zero values), SwiGLU allows for small negative values and smoother transitions. This can improve gradient flow during training and allows the network to represent more nuanced information.
Let's quantify this sparsity difference more precisely by looking at how activation values distribute across magnitude ranges:
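A short sketch bucketing the same kind of hidden activations by magnitude; the weights are again random, so the percentages are illustrative only:

```python
# Magnitude distribution of hidden activations, same random-weight setup as above.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_model, d_ff = 256, 1024
x = torch.randn(512, d_model)
w1 = torch.randn(d_model, d_ff) / d_model**0.5
v = torch.randn(d_model, d_ff) / d_model**0.5

hidden = {
    "ReLU": F.relu(x @ w1),
    "SwiGLU": F.silu(x @ w1) * (x @ v),
}
edges = [0.0, 1e-3, 0.1, 1.0, float("inf")]
labels = ["< 1e-3", "1e-3-0.1", "0.1-1.0", "> 1.0"]

for name, h in hidden.items():
    mags = h.abs().flatten()
    fracs = [((mags >= lo) & (mags < hi)).float().mean().item()
             for lo, hi in zip(edges[:-1], edges[1:])]
    print(f"{name:<7} " + "  ".join(f"{lab}: {f:.1%}" for lab, f in zip(labels, fracs)))
```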
A Complete Worked Example
The formulas we've covered capture the essence of SwiGLU, but seeing actual numbers flow through the computation solidifies understanding. Let's trace through a concrete SwiGLU computation step by step, using deliberately small dimensions so you can follow every multiplication and understand exactly how the gating mechanism transforms an input vector.
Our goal is threefold: to see how each of the two projections transforms the input, to watch the Swish activation reshape the first pathway, and to understand how the element-wise multiplication combines the two pathways into the gated hidden representation.
We have a 4-dimensional input vector $x$ that we'll project into a 6-dimensional hidden space. The weight matrices $W$ and $V$ are initialized with small values for readability, and $W_2$ projects back from hidden to input dimension.
Now let's trace through each step of the SwiGLU computation:
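Here's a sketch of that trace with $d_{\text{model}} = 4$ and $d_{\text{ff}} = 6$; the specific input and weight values are arbitrary small random numbers chosen for readability:

```python
# Tiny SwiGLU trace with d_model=4 and d_ff=6. The input and the small random
# weights are arbitrary illustrative values.
import torch
import torch.nn.functional as F

torch.manual_seed(42)
d_model, d_ff = 4, 6

x = torch.tensor([1.0, -0.5, 0.8, 0.2])      # the 4-dimensional input
W = torch.randn(d_model, d_ff) * 0.3          # Swish-activated pathway weights
V = torch.randn(d_model, d_ff) * 0.3          # linear pathway weights
W2 = torch.randn(d_ff, d_model) * 0.3         # output projection back to d_model

xW = x @ W                                    # Step 1: first projection
swish_xW = F.silu(xW)                         # Step 2: apply Swish = z * sigmoid(z)
xV = x @ V                                    # Step 3: second (linear) projection
hidden = swish_xW * xV                        # Step 4: element-wise gated combination
output = hidden @ W2                          # final projection back to d_model

for name, t in [("Step 1  xW        ", xW), ("Step 2  Swish(xW) ", swish_xW),
                ("Step 3  xV        ", xV), ("Step 4  hidden    ", hidden)]:
    print(f"{name}: " + "  ".join(f"{v:+.3f}" for v in t.tolist()))
print("Output (d_model)  : " + "  ".join(f"{v:+.3f}" for v in output.tolist()))
```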
The step-by-step output reveals several important patterns:
Step 1 vs Step 2 (Swish effect): Compare $xW$ with $\mathrm{Swish}(xW)$. Notice how Swish preserves the sign but modulates the magnitude. Larger positive values shrink only slightly (the sigmoid factor stays just below 1), while negative values get pushed toward zero but can remain slightly negative (unlike ReLU, which would zero them completely).
Step 3 (Linear pathway): The values in $xV$ are computed independently of $\mathrm{Swish}(xW)$, giving the network a second "view" of the same input through different learned weights.
Step 4 (Multiplicative combination): The final hidden representation emerges from element-wise multiplication. When both pathways have the same sign, the product is positive; opposite signs yield negative products. Near-zero values in either pathway suppress that dimension in the output.
Let's visualize how each step transforms the representation:
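A minimal plotting sketch, assuming the tensors from the previous snippet (xW, swish_xW, xV, hidden) are still in scope:

```python
# Assumes xW, swish_xW, xV, and hidden from the previous snippet are still defined.
import matplotlib.pyplot as plt

steps = {"xW": xW, "Swish(xW)": swish_xW, "xV": xV, "hidden": hidden}
fig, axes = plt.subplots(1, len(steps), figsize=(12, 2.5), sharey=True)
for ax, (name, t) in zip(axes, steps.items()):
    ax.bar(range(len(t)), t.tolist())    # one bar per hidden dimension
    ax.axhline(0.0, linewidth=0.5)
    ax.set_title(name)
axes[0].set_ylabel("value")
plt.tight_layout()
plt.show()
```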
Implementation: A Complete SwiGLU Module
Having traced through the mathematics by hand, we can now consolidate everything into a reusable implementation. This module follows the patterns used in production transformer libraries: it initializes weights using proper scaling (Xavier/Glorot initialization), supports optional biases, handles both single vectors and batched sequences, and provides a clean interface for integration into larger architectures.
The implementation below captures the complete SwiGLU FFN as used in LLaMA, Mistral, and other modern LLMs:
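The sketch below is one way to write such a module; the demo hyperparameters ($d_{\text{model}} = 768$, multiple_of = 256, no biases) are illustrative choices:

```python
# Production-style SwiGLU FFN sketch. Demo hyperparameters (d_model=768,
# multiple_of=256, bias=False) are illustrative assumptions.
from typing import Optional

import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """SwiGLU FFN: W2( Swish(xW) * (xV) ), with d_ff defaulting to 2/3 of 4*d_model."""

    def __init__(self, d_model: int, d_ff: Optional[int] = None,
                 multiple_of: int = 256, bias: bool = False):
        super().__init__()
        if d_ff is None:
            d_ff = int(2 * (4 * d_model) / 3)                                # parameter-parity size
            d_ff = multiple_of * ((d_ff + multiple_of - 1) // multiple_of)   # round up for alignment
        self.w = nn.Linear(d_model, d_ff, bias=bias)    # Swish-activated pathway
        self.v = nn.Linear(d_model, d_ff, bias=bias)    # linear pathway
        self.w2 = nn.Linear(d_ff, d_model, bias=bias)   # output projection
        for lin in (self.w, self.v, self.w2):
            nn.init.xavier_uniform_(lin.weight)          # Xavier/Glorot initialization

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w(x)) * self.v(x))

torch.manual_seed(0)
d_model = 768
ffn = SwiGLUFeedForward(d_model)                        # d_ff resolves to 2048 here
swiglu_params = sum(p.numel() for p in ffn.parameters())
standard_params = 2 * d_model * (4 * d_model)           # standard FFN with 4x expansion, no biases

print(f"SwiGLU params:       {swiglu_params:,}")
print(f"standard FFN params: {standard_params:,}")
print("single vector output:", tuple(ffn(torch.randn(d_model)).shape))
print("batched output:      ", tuple(ffn(torch.randn(2, 16, d_model)).shape))
```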
The SwiGLU module with a reduced $d_{\text{ff}}$ achieves a comparable parameter count to a standard FFN with $d_{\text{ff}} = 4 \cdot d_{\text{model}}$. The module correctly handles both single vectors and batched sequences, producing outputs of the expected shapes. This implementation follows the pattern used in production LLMs like LLaMA.
GLU in Modern Architectures
Gated Linear Units have become the de facto standard in modern large language models. Here's how different architectures implement GLU variants:
- LLaMA (Meta, 2023): Uses SwiGLU with $d_{\text{ff}} \approx \tfrac{8}{3}\,d_{\text{model}}$, rounded to the nearest multiple of 256 for hardware efficiency. The rounding ensures tensor dimensions align well with GPU architectures. Biases are completely removed from the FFN.
- Mistral (2023): Also uses SwiGLU but with a higher ratio ($d_{\text{ff}} = 3.5 \times d_{\text{model}}$), trading parameters for expressiveness. Like LLaMA, no biases are used.
- PaLM (Google, 2022): Uses SwiGLU with a 4x expansion ratio, accepting the higher parameter count in exchange for expressiveness. The model shares its embedding and softmax matrices and applies RMS normalization.
- Falcon (TII, 2023): Uses GELU rather than Swish in its GLU variant, following the BERT tradition while still benefiting from the gating mechanism.
Limitations and Impact
Gated Linear Units represent a significant improvement over standard feed-forward networks, but they come with trade-offs that practitioners should understand.
The primary limitation is computational overhead. While GLU variants match the parameter count of standard FFNs when using reduced hidden dimensions, they require three matrix multiplications instead of two (for $W$, $V$, and $W_2$). The element-wise multiplication between the two pathways adds minimal compute but requires storing both intermediate results, increasing memory pressure during training. On modern GPU hardware, this overhead is typically modest (10-15% additional compute), but it can matter for latency-critical applications or when operating near memory limits.
The reduced hidden dimension that maintains parameter parity also has implications. A standard FFN with $d_{\text{ff}} = 4\,d_{\text{model}}$ provides 4x expansion before the nonlinearity. A SwiGLU FFN with matched parameters uses only $\tfrac{8}{3} \approx 2.67\times$ expansion. While the gating mechanism compensates by providing richer interactions, some researchers hypothesize that very high expansion ratios may be beneficial for certain tasks, and GLU's parameter efficiency gains come at the cost of this expansion.
Despite these considerations, the impact of GLU on language model quality has been substantial. Empirical studies consistently show that SwiGLU improves perplexity and downstream task performance compared to standard FFNs with equivalent parameter counts. The improvement is robust across model scales, from small models with hundreds of millions of parameters to frontier models with hundreds of billions. This consistent benefit, combined with straightforward implementation, explains why virtually every state-of-the-art LLM released since 2022 uses some form of gated activation.
The gating mechanism also provides interpretability benefits. Researchers have found that individual dimensions in the gated hidden layer correspond more cleanly to semantic concepts than in standard FFNs. The gate values provide a natural measure of which features are "active" for a given input, enabling analysis techniques that probe what the model has learned. This interpretability, while not the primary motivation for GLU adoption, offers valuable tools for understanding model behavior.
Summary
Gated Linear Units transform the standard feed-forward network by introducing multiplicative interactions through learned gates. Rather than applying a simple activation function, GLU computes two parallel projections and multiplies them together, with one pathway controlling how much of the other passes through.
Key takeaways:
- The gating mechanism: GLU computes $\mathrm{GLU}(x) = (xW + b) \odot \sigma(xV + c)$, where $x$ is the input, $W$ and $V$ are learned projection matrices, $\sigma$ is the sigmoid function producing gate values in $(0, 1)$, and $\odot$ denotes element-wise multiplication. This multiplicative interaction enables richer representations than additive transformations alone, capturing feature interactions that standard activations cannot express.
- SwiGLU dominance: Modern LLMs predominantly use SwiGLU, defined as $\mathrm{SwiGLU}(x) = \mathrm{Swish}(xW) \odot (xV)$, where $\mathrm{Swish}(z) = z \cdot \sigma(z)$ is a self-gating activation that multiplies the input by its own sigmoid. The self-gating property of Swish, combined with the linear pathway, provides smooth gradients and unbounded positive outputs.
- GeGLU alternative: GeGLU replaces Swish with GELU, offering similar benefits with slightly different gradient characteristics. It's sometimes preferred for encoder models following BERT's GELU tradition.
- Parameter trade-offs: GLU requires three weight matrices ($W$, $V$, $W_2$) instead of two, adding 50% more parameters for the same hidden dimension. To maintain parameter parity, models reduce $d_{\text{ff}}$ to approximately $\tfrac{2}{3}$ of the standard value. LLaMA-7B uses $d_{\text{ff}} = 11008 \approx \tfrac{8}{3} \cdot 4096$ instead of the standard $4 \cdot 4096 = 16384$.
- Universal adoption: Since 2022, virtually every state-of-the-art LLM uses SwiGLU or a similar gated variant. LLaMA, Mistral, PaLM, and their derivatives all employ gated activations, demonstrating consistent improvements in perplexity and downstream tasks.
- Smoother activations: Unlike ReLU's hard sparsity, SwiGLU produces smoother activation distributions with preserved negative values. This improves gradient flow during training and allows for more nuanced representations.
The next chapter examines how all the components we've covered, including residual connections, layer normalization, attention, and gated feed-forward networks, assemble into complete transformer blocks. We'll see how the ordering and connections of these elements create the powerful architecture that underlies modern language models.
Key Parameters
When implementing or configuring GLU-based feed-forward networks:
- d_model (input/output dimension): The embedding dimension, unchanged from standard FFNs. Must match the attention layer output for proper residual connections.
- d_ff (hidden dimension): The dimension of the gated intermediate representation. For parameter parity with standard FFNs, use $d_{\text{ff}} \approx \tfrac{2}{3} \cdot 4\,d_{\text{model}} = \tfrac{8}{3}\,d_{\text{model}}$. LLaMA rounds this to the nearest multiple of 256 for hardware efficiency. Higher ratios (3.5x or 4x) trade parameters for expressiveness.
- glu_variant: The choice of activation function for gating. SwiGLU ($\mathrm{Swish}(xW) \odot xV$) is the dominant choice. GeGLU uses GELU instead. ReGLU uses ReLU but is less common due to ReLU's sharp transitions.
- use_bias: Whether to include bias terms. Modern architectures (LLaMA, Mistral, PaLM) typically omit biases entirely, reducing parameters and simplifying quantization. The impact on model quality is minimal.
- multiple_of (LLaMA-specific): Round $d_{\text{ff}}$ to a multiple of this value (typically 256) for GPU memory alignment. The formula used is

$$d_{\text{ff}} = m \cdot \left\lceil \frac{\tfrac{8}{3}\,d_{\text{model}}}{m} \right\rceil$$

where $m$ is the multiple_of value and $\lceil \cdot \rceil$ denotes the ceiling function. For example, with $d_{\text{model}} = 4096$ and $m = 256$, the raw value is $\tfrac{8}{3} \cdot 4096 \approx 10923$, and rounding up to the nearest multiple of 256 gives $11008$ (see the short sketch after this list).
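A short sketch of this rounding rule (the function name is illustrative):

```python
# LLaMA-style rounding of the SwiGLU hidden dimension; round_d_ff is a
# hypothetical helper name.
def round_d_ff(d_model: int, multiple_of: int = 256) -> int:
    raw = int(8 * d_model / 3)                                     # 2/3 of the 4x expansion
    return multiple_of * ((raw + multiple_of - 1) // multiple_of)  # ceil to a multiple

print(round_d_ff(4096))   # 11008, the LLaMA-7B hidden dimension
```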