Deep dive into LLaMA's core architectural components: pre-norm with RMSNorm for stable training, SwiGLU feed-forward networks for expressive computation, and RoPE for relative position encoding. Learn how these pieces fit together.

This article is part of the free-to-read Language AI Handbook
LLaMA Components
LLaMA's success stems not from revolutionary new ideas, but from carefully selecting and combining the best components available. Each architectural choice, from normalization to positional encoding, was picked based on empirical evidence of what works at scale. The result is a reference design that has influenced virtually every major open-source language model since.
This chapter examines LLaMA's core components in detail: RMSNorm for efficient normalization, SwiGLU for expressive feed-forward computation, and RoPE for relative position encoding. We'll see how these pieces fit together into a coherent architecture, implement a complete LLaMA block from scratch, and understand the design decisions that make LLaMA both powerful and efficient to train.
Pre-Norm with RMSNorm
LLaMA places normalization before each sublayer rather than after. This pre-norm configuration, combined with the simpler RMSNorm instead of LayerNorm, defines the normalization strategy throughout the architecture.
Why Pre-Norm?
The original transformer used post-norm: normalize after adding the residual connection. This creates a sequence where the residual and sublayer output are summed first, then normalized together. While mathematically elegant, post-norm creates training instabilities in deep networks. Gradients must pass through the normalization layer before reaching the residual path, which can attenuate or distort the gradient signal.
Pre-norm flips this order: normalize the input before the sublayer, then add the unnormalized input via the residual connection. The key benefit is that gradients can flow directly through the residual path without encountering normalization. For a stack of blocks, this means gradients from the output reach early layers with minimal degradation.
The pre-norm equations for a single transformer block are:
$$h = x + \text{Attention}(\text{Norm}(x))$$
$$y = h + \text{FFN}(\text{Norm}(h))$$
where:
- $x \in \mathbb{R}^{n \times d}$: input to the block with $n$ tokens and dimension $d$
- $\text{Norm}$: normalization function (RMSNorm in LLaMA)
- $\text{Attention}$: multi-head self-attention
- $\text{FFN}$: feed-forward network with SwiGLU
- $h$: intermediate representation after the attention sublayer
- $y$: block output
Notice that $x$ is added directly to the attention output without normalization, and similarly $h$ is added directly to the FFN output. This creates an unimpeded gradient highway through the model.
Why RMSNorm?
To understand RMSNorm, let's first consider what LayerNorm does. LayerNorm performs two operations on each token's representation: centering (subtracting the mean to shift values toward zero) and scaling (dividing by the standard deviation to make the spread consistent). Both operations seem intuitively important for stabilizing neural network training.
RMSNorm challenges this intuition by asking: do we really need both? The answer, surprisingly, is no. Empirical studies found that centering contributes little to training stability, while scaling does the heavy lifting. RMSNorm removes the centering step entirely, normalizing only by the root mean square:
$$\text{RMSNorm}(x) = \frac{x}{\sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2 + \epsilon}} \odot g$$
where:
- $x \in \mathbb{R}^{d}$: input vector (one token's representation)
- $g \in \mathbb{R}^{d}$: learnable scale parameters
- $\epsilon$: small constant for numerical stability (typically $10^{-6}$)
- $\odot$: element-wise multiplication
Why does removing centering not hurt performance? The key insight is that the learnable scale parameter $g$ gives the network flexibility to adjust the distribution as needed. If centering were truly essential, the model could learn to approximate it through subsequent layers. In practice, models converge just as well without explicit centering.
The practical benefit is speed. RMSNorm saves one reduction operation (computing the mean) and one element-wise operation (subtracting the mean) compared to LayerNorm. These savings seem minor for a single forward pass, but normalization runs at every layer, for every token, at every training step. Over billions of operations, the efficiency gains compound significantly.
Let's implement both normalizations and compare their behavior:
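The chapter's original listing isn't reproduced here, so the following is a minimal PyTorch sketch: it applies the built-in `nn.LayerNorm` and a hand-rolled `RMSNorm` to the same randomly shifted activations and compares the output statistics.

```python
import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    """Root-mean-square normalization: scale only, no mean centering."""

    def __init__(self, d_model: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(d_model))  # learnable scale g

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize by the root mean square over the feature dimension.
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return (x / rms) * self.weight


torch.manual_seed(0)
d_model = 512
x = torch.randn(4, 16, d_model) * 2.0 + 0.5  # activations with a nonzero mean

layer_norm = nn.LayerNorm(d_model)
rms_norm = RMSNorm(d_model)

with torch.no_grad():
    ln_out = layer_norm(x)
    rms_out = rms_norm(x)

print(f"LayerNorm  mean: {ln_out.mean().item():+.4f}  std: {ln_out.std().item():.4f}")
print(f"RMSNorm    mean: {rms_out.mean().item():+.4f}  std: {rms_out.std().item():.4f}")
```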
LayerNorm outputs are precisely centered at zero by construction. RMSNorm outputs cluster near zero but aren't exactly centered, since the mean wasn't explicitly subtracted. In practice, this difference rarely matters: the learnable scale parameter can adjust the distribution as needed, and the model learns to work with either normalization scheme.
The SwiGLU Feed-Forward Network
The feed-forward network (FFN) in each transformer block provides the model's primary source of nonlinearity and parameter capacity. While attention handles communication between tokens, the FFN handles computation within each token's representation. This is where the model stores and applies learned knowledge, from factual information to linguistic patterns.
LLaMA uses SwiGLU, a gated variant that has become the standard in modern language models. To understand why SwiGLU works better than simpler alternatives, we'll trace the evolution from basic FFNs to gated architectures, seeing how each improvement addresses specific limitations.
From Standard FFN to Gated Units
The original transformer FFN follows a simple recipe: project each token to a higher dimension, apply a nonlinearity, then project back. This "expand and contract" pattern allows the network to compute complex functions of the input through the nonlinearity in the high-dimensional space:
$$\text{FFN}(x) = W_2\,\text{ReLU}(W_1 x)$$
where:
- $x \in \mathbb{R}^{d_{\text{model}}}$: input token representation with dimension $d_{\text{model}}$
- $W_1 \in \mathbb{R}^{d_{\text{ff}} \times d_{\text{model}}}$: first projection, expanding to hidden dimension $d_{\text{ff}}$ (typically $4 d_{\text{model}}$)
- $W_2 \in \mathbb{R}^{d_{\text{model}} \times d_{\text{ff}}}$: second projection, compressing back to model dimension
- $\text{ReLU}$: rectified linear unit activation
This architecture works, but it has a limitation: every hidden dimension is processed identically by the ReLU. The network cannot selectively emphasize some features while suppressing others based on the input content.
Gated linear units (GLUs) address this limitation by introducing a learned control mechanism. Instead of applying a fixed activation to all hidden dimensions, a GLU learns which dimensions to activate and which to suppress. It does this by splitting the hidden layer into two parallel paths: one computes a "gate" signal that controls information flow, while the other computes the actual content. These paths are multiplied element-wise:
$$\text{GLU}(x) = W_3\,\big((W_1 x) \odot \sigma(W_2 x)\big)$$
where:
- $x$: input token representation
- $W_1$: linear path projection
- $W_2$: gate path projection
- $W_3$: output projection
- $\sigma$: sigmoid activation function
- $\odot$: element-wise (Hadamard) multiplication
The sigmoid-gated path acts as a learned filter, controlling how much of the linear path's information passes through. When $\sigma(W_2 x)$ is near zero for some hidden dimension, that dimension's contribution is blocked; when near one, it passes through fully.
SwiGLU: Swish-Gated Linear Units
While the sigmoid-gated GLU works well, researchers found an even better gating function: Swish (also called SiLU). The Swish function has an elegant property: it is "self-gating," meaning the input gates itself. Rather than having a separate sigmoid control the gate, Swish multiplies the input by its own sigmoid, creating a smooth, adaptive activation.
SwiGLU applies this insight to the GLU architecture:
$$\text{SwiGLU}(x) = W_{\text{down}}\,\big(\text{Swish}(W_{\text{gate}}\, x) \odot (W_{\text{up}}\, x)\big)$$
where:
- $x \in \mathbb{R}^{d_{\text{model}}}$: input token representation with model dimension $d_{\text{model}}$
- $W_{\text{gate}} \in \mathbb{R}^{d_{\text{ff}} \times d_{\text{model}}}$: gate projection matrix
- $W_{\text{up}} \in \mathbb{R}^{d_{\text{ff}} \times d_{\text{model}}}$: up projection matrix (the "linear path")
- $W_{\text{down}} \in \mathbb{R}^{d_{\text{model}} \times d_{\text{ff}}}$: down projection, mapping back to model dimension
- $\odot$: element-wise multiplication between gate and up projections
The Swish activation function is defined as:
$$\text{Swish}(x) = x \cdot \sigma(x)$$
where $\sigma(x) = \frac{1}{1 + e^{-x}}$ is the sigmoid function. Swish is smooth everywhere (unlike ReLU's sharp corner at zero) and self-gated: the input is multiplied by its own sigmoid, meaning large positive values pass through nearly unchanged while negative values are attenuated. This combination has been shown to improve training dynamics and final model quality compared to both ReLU and GELU-based FFNs.
SwiGLU requires three weight matrices ($W_{\text{gate}}$, $W_{\text{up}}$, $W_{\text{down}}$) instead of two ($W_1$, $W_2$). To keep the total parameter count similar, LLaMA reduces the hidden dimension from $4 d_{\text{model}}$ to approximately $\tfrac{8}{3} d_{\text{model}}$. The exact value is often rounded to be divisible by hardware-friendly numbers.
With the theory in place, let's implement SwiGLU and see how the gating mechanism works in practice. We'll first define the Swish activation, then build the complete SwiGLU module:
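A sketch of both pieces in PyTorch follows (the names `swish`, `SwiGLU`, and the toy dimensions are this sketch's choices, not the chapter's original listing). The plotting code at the end reproduces the two panels discussed below: Swish versus ReLU, and how the gate value scales the up projection.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import matplotlib.pyplot as plt


def swish(x: torch.Tensor) -> torch.Tensor:
    """Swish / SiLU: the input gated by its own sigmoid."""
    return x * torch.sigmoid(x)


class SwiGLU(nn.Module):
    """Gated feed-forward network: down(swish(gate(x)) * up(x))."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(swish(self.w_gate(x)) * self.w_up(x))


# Quick shape check: SwiGLU maps d_model -> d_model, like any FFN.
ffn = SwiGLU(d_model=512, d_ff=1376)  # ~8/3 * 512, rounded
out = ffn(torch.randn(2, 16, 512))
print("output shape:", tuple(out.shape))

# Left panel: Swish vs. ReLU. Right panel: the gated output swish(g) * u
# for a fixed up-projection value u = 1, as the gate input g varies.
z = torch.linspace(-6, 6, 200)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(z, F.relu(z), label="ReLU")
ax1.plot(z, swish(z), label="Swish")
ax1.set_title("Swish vs. ReLU")
ax1.legend()
ax2.plot(z, swish(z) * 1.0, label="gated output (u = 1)")
ax2.set_title("Gate input controls how much passes through")
ax2.set_xlabel("gate input")
ax2.legend()
plt.tight_layout()
plt.show()
```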
The left plot shows why Swish is preferred over ReLU. Unlike ReLU's hard cutoff at zero, Swish has a smooth curve with non-zero gradients for negative inputs. This helps gradient flow during training, particularly for inputs that happen to fall in negative regions.
The right plot illustrates gating. When the gate input is negative, the gate output is near zero, blocking information. As the gate input increases, more of the up projection passes through. This learned gating allows the network to selectively process different aspects of the input.
Rotary Position Embedding (RoPE)
Position information in LLaMA comes from Rotary Position Embedding, applied directly to query and key vectors before attention computation. Unlike additive position encodings that modify token embeddings once at the input, RoPE encodes position through rotation at every attention layer.
To understand why LLaMA chose RoPE over simpler alternatives, we need to appreciate a fundamental tension in position encoding: we want the model to understand both where tokens are in a sequence and how far apart they are from each other. Absolute position matters for tasks like "what's the first word?", but relative position matters for language understanding: the relationship between a verb and its subject depends on their distance, not their absolute positions.
Previous approaches added position information to token embeddings at the input layer. This works but has drawbacks: the position signal can degrade through deep networks, and the model must learn to extract relative position from absolute encodings. RoPE solves this elegantly by encoding position in a way that naturally produces relative position information when computing attention scores.
The Rotation Principle
The key insight behind RoPE comes from a beautiful property of rotations. Imagine two arrows (vectors) on a piece of paper. If you rotate both arrows by different amounts and then measure the angle between them, you'll find something remarkable: the angle between them depends only on how much more you rotated one than the other. The absolute rotation amounts don't matter, only their difference.
This is exactly what we want for position encoding. If we rotate the query vector by an amount proportional to its position $m$, and rotate the key vector by an amount proportional to its position $n$, then when we compute their dot product (which measures similarity), the result depends only on the difference $m - n$. The absolute positions cancel out, leaving pure relative position information.
For a $d$-dimensional embedding, RoPE treats it as $d/2$ independent 2D pairs. Think of each pair as a separate 2D plane where we can apply rotation. Each pair rotates at a different frequency, creating a multi-scale representation: some dimensions capture fine-grained local position differences, while others track global position across the entire sequence.
The rotation angle $\theta_{i,m}$ for dimension pair $i$ at position $m$ determines how much to rotate that pair. Different dimension pairs rotate at exponentially different rates, following a formula inspired by the original sinusoidal position encodings:
$$\theta_{i,m} = m \cdot \omega_i, \qquad \omega_i = 10000^{-2i/d}$$
where:
- $\theta_{i,m}$: the rotation angle (in radians) for dimension pair $i$ at sequence position $m$
- $m$: sequence position (0, 1, 2, ...), with $m = 0$ for the first token
- $i$: dimension pair index (0, 1, ..., $d/2 - 1$), grouping dimensions into pairs
- $d$: total embedding dimension (must be even for RoPE)
- $\omega_i$: base frequency for dimension pair $i$, which decreases exponentially as $i$ increases
- $10000$: base constant chosen empirically (same as in sinusoidal position encodings)
The $-2i/d$ term in the exponent creates the multi-scale effect: when $i = 0$, $\omega_i = 1$ (fast rotation, one radian per position); when $i$ approaches $d/2$, $\omega_i$ approaches $1/10000$ (very slow rotation). This exponential spacing ensures that the model can distinguish positions at multiple scales simultaneously.
Now we can write the complete RoPE transformation. Given a vector $x$ at position $m$, we rotate each pair of dimensions $(x_{2i}, x_{2i+1})$ by angle $\theta_{i,m}$. The rotation uses the standard 2D rotation formula you might remember from linear algebra: to rotate a point $(a, b)$ by angle $\theta$, you compute $(a\cos\theta - b\sin\theta,\; a\sin\theta + b\cos\theta)$. Applying this to all dimension pairs gives:
$$\text{RoPE}(x, m) = \begin{pmatrix} x_0 \cos\theta_{0,m} - x_1 \sin\theta_{0,m} \\ x_0 \sin\theta_{0,m} + x_1 \cos\theta_{0,m} \\ x_2 \cos\theta_{1,m} - x_3 \sin\theta_{1,m} \\ x_2 \sin\theta_{1,m} + x_3 \cos\theta_{1,m} \\ \vdots \end{pmatrix}$$
where:
- $x$: the input query or key vector
- $(x_0, x_1)$: the first dimension pair, rotated by angle $\theta_{0,m}$
- $(x_2, x_3)$: the second dimension pair, rotated by angle $\theta_{1,m}$
- $\cos\theta_{i,m}, \sin\theta_{i,m}$: rotation components for pair $i$ at position $m$
Each pair is rotated independently using the standard 2D rotation formulas: $x'_{2i} = x_{2i}\cos\theta_{i,m} - x_{2i+1}\sin\theta_{i,m}$ and $x'_{2i+1} = x_{2i}\sin\theta_{i,m} + x_{2i+1}\cos\theta_{i,m}$. This preserves the vector's magnitude while encoding position through angle. The preservation of magnitude is crucial: we're adding position information without distorting the semantic content of the embedding.
In practice, we don't need to construct explicit rotation matrices. Instead, we can compute the rotations efficiently using element-wise operations. The key observation is that for each dimension pair, we just need to:
- Compute the angle $\theta_{i,m}$ for that pair at position $m$
- Calculate the cosine and sine of the angle
- Apply the rotation formula to the pair
Let's implement this step by step:
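A minimal PyTorch sketch of these steps follows; the helper names (`rope_angles`, `apply_rope`) are this sketch's own rather than a standard API.

```python
import torch


def rope_angles(seq_len: int, d: int, base: float = 10000.0) -> torch.Tensor:
    """Angle theta_{i,m} for every position m and pair i: shape (seq_len, d/2)."""
    assert d % 2 == 0, "RoPE requires an even embedding dimension"
    i = torch.arange(d // 2, dtype=torch.float32)           # pair index
    freqs = base ** (-2.0 * i / d)                          # omega_i
    positions = torch.arange(seq_len, dtype=torch.float32)  # m
    return torch.outer(positions, freqs)                    # (seq_len, d/2)


def apply_rope(x: torch.Tensor, angles: torch.Tensor) -> torch.Tensor:
    """Rotate each dimension pair of x (shape ..., seq_len, d) by its angle."""
    x_even, x_odd = x[..., 0::2], x[..., 1::2]              # split into pairs
    cos, sin = angles.cos(), angles.sin()
    rot_even = x_even * cos - x_odd * sin                   # first coordinate of the 2D rotation
    rot_odd = x_even * sin + x_odd * cos                    # second coordinate
    out = torch.empty_like(x)
    out[..., 0::2], out[..., 1::2] = rot_even, rot_odd      # interleave back
    return out


# Sanity check: rotation preserves the norm of each vector.
q = torch.randn(1, 8, 64)                                   # (batch, seq_len, d)
angles = rope_angles(seq_len=8, d=64)
q_rot = apply_rope(q, angles)
print("norms preserved:", torch.allclose(q.norm(dim=-1), q_rot.norm(dim=-1), atol=1e-5))
```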
Let's visualize how RoPE creates position-dependent rotations across different frequencies:
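One way to produce the heatmap described below, reusing the `rope_angles` helper from the previous sketch: rows are dimension pairs, columns are positions, and the color shows the cosine of each rotation angle.

```python
import matplotlib.pyplot as plt

seq_len, d = 128, 64
angles = rope_angles(seq_len, d)                        # (seq_len, d/2)

plt.figure(figsize=(10, 4))
plt.imshow(angles.cos().T, aspect="auto", cmap="RdBu")  # rows: pairs, cols: positions
plt.colorbar(label="cos(angle)")
plt.xlabel("position")
plt.ylabel("dimension pair index")
plt.title("RoPE rotation components across positions and frequencies")
plt.show()
```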
The heatmap reveals RoPE's multi-scale nature. Dimension pairs near the top (low index) oscillate rapidly: they complete many cycles across the sequence and can distinguish nearby positions precisely. Dimension pairs near the bottom (high index) change slowly: they differentiate positions that are far apart. Together, they create a rich position encoding that works at all scales.
RoPE in Attention
RoPE is applied to query and key vectors before computing attention scores. Critically, it is not applied to values. To understand why, consider what happens when we compute the attention score between a query $q$ at position $m$ and a key $k$ at position $n$.
Without RoPE, the dot product between query and key gives:
$$\text{score} = q^\top k$$
which contains no position information at all. With RoPE, both vectors are rotated before the dot product. For a single 2D pair, if we rotate $q$ by angle $m\theta$ and $k$ by angle $n\theta$, the dot product becomes:
$$\big(R(m\theta)\, q\big)^\top \big(R(n\theta)\, k\big) = q^\top R(m\theta)^\top R(n\theta)\, k = q^\top R\big((n - m)\theta\big)\, k$$
where:
- $q, k$: query and key vectors for a single dimension pair
- $R(\theta)$: the 2D rotation matrix
- $R(\theta)^\top = R(-\theta)$: the transpose of a rotation matrix is its inverse (rotation by the negative angle)
- $m\theta, n\theta$: rotation angles at positions $m$ and $n$ respectively
The key insight: $R(m\theta)^\top R(n\theta) = R\big((n - m)\theta\big)$. The absolute positions $m$ and $n$ cancel out, leaving only their difference $n - m$. This is how RoPE encodes relative position: the attention score between two positions depends only on how far apart they are, not where they appear in the sequence.
Values don't participate in this relative position mechanism; they simply carry the information to be aggregated based on the position-aware attention weights.
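A quick numeric sanity check of this cancellation for a single dimension pair (the vectors, positions, and rotation rate here are arbitrary): the score is identical whenever the position offset is the same.

```python
import torch


def rotate(v: torch.Tensor, angle: float) -> torch.Tensor:
    """Rotate a 2D vector by `angle` radians."""
    a = torch.tensor(angle)
    rot = torch.stack([
        torch.stack([a.cos(), -a.sin()]),
        torch.stack([a.sin(), a.cos()]),
    ])
    return rot @ v


q, k = torch.tensor([0.7, -1.2]), torch.tensor([0.3, 0.9])
theta = 0.5  # per-position rotation rate for this pair

# Same relative offset (n - m = 3) at two different absolute positions.
score_a = rotate(q, 2 * theta) @ rotate(k, 5 * theta)
score_b = rotate(q, 10 * theta) @ rotate(k, 13 * theta)
print(f"offset 3 at positions (2, 5):   {score_a.item():.6f}")
print(f"offset 3 at positions (10, 13): {score_b.item():.6f}")  # identical score
```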
Component Interactions
Understanding each component in isolation is valuable, but the real insight comes from seeing how they interact. In LLaMA, the components form a carefully orchestrated pipeline where each element's design choices complement the others.
The Information Flow
A single LLaMA block processes input through the following sequence:
- Pre-attention RMSNorm: Stabilizes activations before attention
- RoPE-enhanced attention: Mixes information across positions with relative position awareness
- Residual connection: Adds attention output to original input
- Pre-FFN RMSNorm: Stabilizes activations before the feed-forward network
- SwiGLU FFN: Transforms each position independently through gated nonlinearity
- Residual connection: Adds FFN output to post-attention representation
Let's visualize this flow:
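The original figure isn't reproduced here; the following schematic sketches the same flow.

```
x ──┬─► RMSNorm ─► Attention (RoPE on Q, K) ─► (+) ─► h
    └──────────────── residual ───────────────► (+)

h ──┬─► RMSNorm ─► SwiGLU FFN ─► (+) ─► y
    └──────────── residual ─────► (+)
```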
Why This Combination Works
The components weren't selected arbitrarily. Each choice addresses specific challenges in training and inference:
RMSNorm + Pre-Norm provides stable gradients for deep networks. The pre-norm configuration creates direct gradient paths through residuals, while RMSNorm's efficiency matters when normalization runs at every layer.
RoPE enables long-context extension through interpolation. Since RoPE applies rotation at inference time rather than fixed position embeddings, the model can extrapolate to longer sequences than seen during training (with appropriate scaling techniques).
SwiGLU provides expressive nonlinearity without the dead neuron problem of ReLU. The gating mechanism learns to selectively process information, and the smooth Swish activation ensures gradients flow through all neurons.
Together, these choices create an architecture that trains stably at scale, infers efficiently, and generalizes well to new contexts.
Implementation: A Complete LLaMA Block
Let's assemble all components into a complete LLaMA transformer block:
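Here is one way to assemble the pieces, reusing the `RMSNorm`, `SwiGLU`, `rope_angles`, and `apply_rope` sketches from earlier in the chapter. The attention is a plain causal multi-head implementation built on PyTorch 2's `scaled_dot_product_attention`; it is a teaching sketch, not LLaMA's production code (no KV cache, no grouped-query attention).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RoPEAttention(nn.Module):
    """Multi-head self-attention with RoPE applied to queries and keys."""

    def __init__(self, d_model: int, n_heads: int, rope_base: float = 10000.0):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.rope_base = rope_base
        self.wq = nn.Linear(d_model, d_model, bias=False)
        self.wk = nn.Linear(d_model, d_model, bias=False)
        self.wv = nn.Linear(d_model, d_model, bias=False)
        self.wo = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch, seq_len, _ = x.shape
        # Project and split into heads: (batch, n_heads, seq_len, d_head).
        q = self.wq(x).view(batch, seq_len, self.n_heads, self.d_head).transpose(1, 2)
        k = self.wk(x).view(batch, seq_len, self.n_heads, self.d_head).transpose(1, 2)
        v = self.wv(x).view(batch, seq_len, self.n_heads, self.d_head).transpose(1, 2)
        # Rotate queries and keys; values are left untouched.
        angles = rope_angles(seq_len, self.d_head, self.rope_base).to(x.device)
        q, k = apply_rope(q, angles), apply_rope(k, angles)
        # Causal scaled dot-product attention (requires PyTorch 2.x).
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        out = out.transpose(1, 2).reshape(batch, seq_len, -1)
        return self.wo(out)


class LlamaBlock(nn.Module):
    """Pre-norm block: RMSNorm -> attention -> residual, RMSNorm -> SwiGLU -> residual."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.attn_norm = RMSNorm(d_model)
        self.attn = RoPEAttention(d_model, n_heads)
        self.ffn_norm = RMSNorm(d_model)
        self.ffn = SwiGLU(d_model, d_ff)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.attn(self.attn_norm(x))   # attention sublayer with residual
        x = x + self.ffn(self.ffn_norm(x))     # feed-forward sublayer with residual
        return x
```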
Let's test our implementation with a small example:
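A quick smoke test of the block sketch above, with toy dimensions chosen for this example:

```python
torch.manual_seed(0)
block = LlamaBlock(d_model=128, n_heads=4, d_ff=344)   # 344 ~ 8/3 * 128
x = torch.randn(1, 8, 128)                             # (batch, seq_len, d_model)

with torch.no_grad():
    y = block(x)

print("input shape: ", tuple(x.shape))
print("output shape:", tuple(y.shape))
print(f"input  mean {x.mean().item():+.3f}, std {x.std().item():.3f}")
print(f"output mean {y.mean().item():+.3f}, std {y.std().item():.3f}")
```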
The output maintains the same shape as the input, as expected for a transformer block. The mean and standard deviation of the output remain in a reasonable range, demonstrating that RMSNorm successfully stabilizes activations. Without normalization, activations could grow or shrink exponentially through many layers, making training unstable. The fact that output statistics stay close to the input statistics is a good sign that the block is well-behaved.
Stacking Blocks into a Model
A complete LLaMA model stacks many blocks, adds embeddings at the input, and applies a final normalization before the output projection:
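A toy-scale sketch of that structure, built from the `LlamaBlock` and `RMSNorm` above (the class name `TinyLlama` and the small dimensions are this sketch's own):

```python
class TinyLlama(nn.Module):
    """Token embedding -> stack of blocks -> final RMSNorm -> vocabulary projection."""

    def __init__(self, vocab_size: int, d_model: int, n_heads: int,
                 d_ff: int, n_layers: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)   # no position embeddings: RoPE handles position
        self.blocks = nn.ModuleList(
            LlamaBlock(d_model, n_heads, d_ff) for _ in range(n_layers)
        )
        self.final_norm = RMSNorm(d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        h = self.embed(token_ids)                        # (batch, seq_len, d_model)
        for block in self.blocks:
            h = block(h)
        return self.lm_head(self.final_norm(h))          # logits over the vocabulary


model = TinyLlama(vocab_size=1000, d_model=128, n_heads=4, d_ff=344, n_layers=4)
tokens = torch.randint(0, 1000, (1, 8))                  # one 8-token sequence
logits = model(tokens)
print("logits shape:", tuple(logits.shape[1:]))           # (8, 1000)
top = logits[0, -1].topk(5)
print("top-5 next-token logits:", [round(v, 3) for v in top.values.tolist()])
```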
This simplified model demonstrates the complete LLaMA architecture. The output logits have shape (8, 1000), meaning for each of the 8 input tokens, we get a probability distribution over the 1000-word vocabulary. The top predictions for the next token show varying logit values, which would become probabilities after applying softmax. Since the model uses random weights, these predictions are meaningless, but the structure shows how a trained model would generate text token by token.
Note the absence of learned position embeddings at the input, since RoPE encodes position directly in the attention mechanism. This is a key architectural difference from GPT-style models, where position embeddings are added to token embeddings at the input layer.
Parameter Distribution
Understanding where parameters reside helps explain why certain design choices matter. Let's analyze the parameter distribution in a LLaMA block:
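The arithmetic can be done analytically. The sketch below assumes the full multi-head attention of the block above (no grouped-query attention) at 7B-like dimensions, and ignores embeddings:

```python
# Per-block parameter counts at LLaMA-7B-like dimensions.
d_model, d_ff = 4096, 11008

attn_params = 4 * d_model * d_model          # W_q, W_k, W_v, W_o
ffn_params = 3 * d_model * d_ff              # W_gate, W_up, W_down
norm_params = 2 * d_model                    # two RMSNorm scale vectors
total = attn_params + ffn_params + norm_params

for name, n in [("attention", attn_params), ("SwiGLU FFN", ffn_params),
                ("RMSNorm", norm_params)]:
    print(f"{name:<12} {n:>12,}  ({100 * n / total:.1f}%)")
```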
The FFN dominates parameter count despite SwiGLU's efficiency improvements. This is because:
- The hidden dimension $d_{\text{ff}}$ is large (about 2.7x $d_{\text{model}}$)
- SwiGLU uses three matrices instead of two
However, this investment pays off. The FFN provides the model's primary capacity for storing learned patterns and facts. The attention mechanism, while containing fewer parameters, handles the crucial task of routing information between positions.
Limitations and Trade-offs
LLaMA's component choices, while effective, come with trade-offs worth understanding.
RMSNorm's simplification has theoretical gaps. By removing mean centering, RMSNorm assumes that the mean of activations is either unimportant or learnable through other means. For most language modeling tasks, this assumption holds. However, tasks requiring precise tracking of activation magnitudes across layers might benefit from full LayerNorm. The empirical success of RMSNorm suggests these cases are rare in practice.
SwiGLU's parameter overhead. The three-matrix design of SwiGLU means that for a fixed parameter budget, the hidden dimension must be smaller than with a two-matrix FFN. This trades width for gating capability. Whether this is optimal depends on the model scale: very large models might benefit from simpler, wider FFNs, while smaller models gain from SwiGLU's expressiveness.
RoPE's computational cost. Applying rotations at every attention layer adds overhead compared to one-time positional embeddings. For short sequences, this overhead is negligible. For very long contexts, the repeated rotation computation becomes more significant. Modern implementations optimize this through precomputation and efficient kernels.
The pre-norm stability assumption. Pre-norm works because it assumes the residual stream is the "main highway" for information. This can make it harder for deep layers to fundamentally transform representations, since they always add to the existing stream. Some architectures experiment with adaptive residual weights to address this.
Despite these limitations, LLaMA's component choices represent a well-tuned balance. The architecture trains stably, infers efficiently, and achieves strong performance across diverse tasks. The design decisions have proven robust across model scales from 7B to 405B parameters, validating the choices made at the component level.
Key Parameters
When implementing or configuring LLaMA-style models, several parameters significantly impact model behavior and performance:
Model Dimension (d_model)
- Controls the width of the model and the size of token representations
- Typical values: 4096 (7B), 5120 (13B), 8192 (70B)
- Larger values increase capacity but also memory and compute requirements
- Must be divisible by the number of attention heads
Number of Attention Heads (n_heads)
- Determines how many parallel attention patterns the model can learn
- Typical values: 32 (7B), 40 (13B), 64 (70B)
- More heads allow diverse attention patterns but increase complexity
- Head dimension (d_model / n_heads) should typically be 64-128 for efficiency
FFN Hidden Dimension (d_ff)
- Sets the expansion factor in the feed-forward network
- For SwiGLU, typically $\tfrac{8}{3} d_{\text{model}}$ (compared to $4 d_{\text{model}}$ for standard FFN)
- Often rounded to be divisible by 256 or 1024 for hardware efficiency
- Example: LLaMA 7B uses 11008 (vs. naive 10923 from $\tfrac{8}{3} \times 4096$)
RoPE Base (rope_base)
- Controls the frequency range of position encodings
- Default: 10000 (inherited from original sinusoidal encodings)
- Higher values extend effective context length but may reduce precision for nearby positions
- Some models use 500000 or higher for very long contexts
RMSNorm Epsilon (eps)
- Small constant for numerical stability in normalization
- Typical value: $10^{-5}$ or $10^{-6}$
- Rarely needs tuning unless encountering numerical issues
Number of Layers (n_layers)
- Determines model depth
- Typical values: 32 (7B), 40 (13B), 80 (70B)
- More layers increase capacity and sequential dependency but slow inference
- Deeper models require more careful initialization and learning rate tuning
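To make these values concrete, here is a hypothetical configuration object collecting typical 7B-scale settings from the lists above; the dataclass and field names are illustrative, not an official API.

```python
from dataclasses import dataclass


@dataclass
class LlamaConfig:
    d_model: int = 4096         # model width
    n_heads: int = 32           # attention heads (head dim = 4096 / 32 = 128)
    d_ff: int = 11008           # SwiGLU hidden dimension, ~8/3 * d_model, rounded
    n_layers: int = 32          # transformer blocks
    rope_base: float = 10000.0  # RoPE frequency base
    norm_eps: float = 1e-5      # RMSNorm epsilon (1e-6 in some model versions)
    vocab_size: int = 32000     # LLaMA's SentencePiece vocabulary


print(LlamaConfig())
```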
Summary
LLaMA's architecture demonstrates how careful component selection creates a coherent, efficient design. The key components and their roles:
- Pre-norm with RMSNorm provides stable gradient flow through deep networks. Normalization before each sublayer creates direct residual paths, while RMSNorm's efficiency (skipping mean centering) reduces computational overhead without sacrificing quality.
- SwiGLU FFN delivers expressive nonlinearity through gated linear units. The Swish activation provides smooth gradients, and the gating mechanism learns to selectively process information. The trade-off is three weight matrices instead of two, compensated by a reduced hidden dimension.
- RoPE encodes relative position through rotation. Applied to queries and keys at every layer, it enables position-aware attention without additive embeddings. The multi-frequency design captures both local and global position information.
- Component interaction matters as much as individual choices. Pre-norm + residuals create gradient highways. RoPE enables context extension through interpolation. SwiGLU provides the nonlinearity that attention alone cannot.
The complete LLaMA block follows a consistent pattern: normalize, transform, add residual. This pattern repeats for attention and FFN sublayers, creating a modular, stackable unit that scales to hundreds of billions of parameters.
Understanding these components prepares you for the next chapters, where we'll examine variations like grouped-query attention that further optimize the LLaMA design for inference efficiency.