LLaMA Architecture: Design Philosophy and Training Efficiency

Michael Brenndoerfer · Updated August 8, 2025 · 29 min read

A complete guide to LLaMA's architectural choices including RMSNorm, SwiGLU, and RoPE, plus training data strategies that enabled competitive performance at smaller model sizes.


LLaMA Architecture

In February 2023, Meta AI released LLaMA (Large Language Model Meta AI), a collection of foundation language models that fundamentally changed the landscape of open-source AI research. While GPT-3 and similar models had demonstrated remarkable capabilities, they remained locked behind API walls, limiting what researchers and practitioners could study and build. LLaMA offered something different: openly available model weights across multiple sizes, trained on publicly available data, achieving performance competitive with much larger proprietary models.

The significance of LLaMA extends beyond just releasing weights. The model family demonstrated that careful attention to training data quality and established architectural improvements could produce models that punch well above their weight class. LLaMA-13B matched GPT-3's performance on many benchmarks despite having fewer than 10% of the parameters. This efficiency came not from revolutionary new techniques, but from thoughtful combination of proven innovations: RMSNorm for stable training, Rotary Position Embeddings for better position encoding, SwiGLU activations for improved expressiveness, and pre-normalization for training stability.

This chapter examines the LLaMA architecture in depth. We'll dissect each design choice, understand why it was made, and see how the pieces fit together. By the end, you'll understand the blueprint that influenced nearly every open-source language model that followed.

Design Philosophy

LLaMA emerged from a specific research question: how well can we train language models using only publicly available data? Large models like PaLM and Chinchilla had shown impressive results, but their training data included proprietary sources. Meta's researchers wanted to demonstrate that public data, properly curated and processed, could produce competitive models.

This constraint led to a focus on efficiency. If you're limited to public data, you need to extract maximum value from it. The LLaMA team applied insights from scaling laws research, particularly the Chinchilla findings which showed that many models were undertrained relative to their size. Rather than training larger models for fewer steps, they trained smaller models for longer on more tokens.

The architectural choices reflect this efficiency focus. Each component was selected not for novelty, but for proven effectiveness:

  • RMSNorm instead of LayerNorm: Eliminates unnecessary mean centering, saving computation at every layer
  • Rotary Position Embeddings (RoPE) instead of learned positions: Provides better extrapolation to longer sequences without additional parameters
  • SwiGLU activation instead of ReLU or GELU: Improves model expressiveness with modest parameter increase
  • Pre-normalization instead of post-normalization: Stabilizes training of deep networks
  • No bias terms in linear layers: Reduces parameters without hurting performance

These weren't new inventions. Each had been introduced and validated in prior work. LLaMA's contribution was combining them into a coherent, well-tuned architecture and demonstrating their effectiveness at scale.

Architecture Overview

LLaMA follows the decoder-only transformer architecture introduced by GPT. Each input sequence passes through an embedding layer, multiple transformer blocks, and a final output projection. The key innovations lie in the details of each component.

Figure: High-level LLaMA architecture. Each transformer block applies RMSNorm before attention and before the feed-forward network (pre-norm architecture). Rotary position embeddings are applied to queries and keys within the attention mechanism. SwiGLU replaces the standard feed-forward activation.

Let's walk through a forward pass. Given input token IDs, the model first converts them to continuous embeddings. Unlike some architectures, LLaMA adds no position embeddings at this stage. Position information enters only through RoPE, applied inside the attention mechanism.

Each transformer block processes these embeddings through two sub-layers:

  1. Attention sub-layer: RMSNorm normalizes the input, multi-head attention computes contextual representations (with RoPE applied to queries and keys), and a residual connection adds the result back to the input.

  2. Feed-forward sub-layer: Another RMSNorm normalizes the attention output, the SwiGLU feed-forward network processes each position independently, and another residual connection preserves information flow.

After all transformer blocks, a final RMSNorm stabilizes the representations before the output projection maps them to vocabulary logits.

In[4]:
Code
import torch
import torch.nn as nn
import torch.nn.functional as F


class LLaMAConfig:
    """Configuration for LLaMA model."""

    def __init__(
        self,
        vocab_size: int = 32000,
        hidden_dim: int = 4096,
        intermediate_dim: int = 11008,
        n_layers: int = 32,
        n_heads: int = 32,
        n_kv_heads: int = None,  # For grouped-query attention
        max_seq_len: int = 2048,
        norm_eps: float = 1e-5,
        rope_theta: float = 10000.0,
    ):
        self.vocab_size = vocab_size
        self.hidden_dim = hidden_dim
        self.intermediate_dim = intermediate_dim
        self.n_layers = n_layers
        self.n_heads = n_heads
        self.n_kv_heads = n_kv_heads if n_kv_heads else n_heads
        self.max_seq_len = max_seq_len
        self.norm_eps = norm_eps
        self.rope_theta = rope_theta
        self.head_dim = hidden_dim // n_heads

The configuration captures the key hyperparameters. Notice n_kv_heads, which allows the number of key-value heads to differ from query heads. This enables Grouped-Query Attention (GQA), used in LLaMA 2 and later versions to reduce memory bandwidth during inference.

RMSNorm: Efficient Normalization

Layer normalization appears throughout transformer architectures, stabilizing training by normalizing activations. Standard LayerNorm centers the data around zero (subtracting the mean) and scales it to unit variance. RMSNorm simplifies this by removing the centering step, normalizing only by the root mean square.

$$\text{RMSNorm}(\mathbf{x}) = \gamma \odot \frac{\mathbf{x}}{\sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2 + \epsilon}}$$

where:

  • $\mathbf{x} \in \mathbb{R}^d$: the input vector to normalize
  • $\gamma \in \mathbb{R}^d$: learnable scale parameters
  • $\epsilon$: small constant for numerical stability
  • $d$: the hidden dimension

The simplification works because, in well-initialized transformers with residual connections, activations tend to be approximately centered already. Explicitly computing and subtracting the mean adds computational overhead without meaningful benefit.

In[5]:
Code
class RMSNorm(nn.Module):
    """RMSNorm as used in LLaMA."""

    def __init__(self, dim: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Compute in float32 for numerical stability
        x_float = x.float()
        rms = torch.sqrt(
            torch.mean(x_float**2, dim=-1, keepdim=True) + self.eps
        )
        x_normed = x_float / rms
        return (self.weight * x_normed).type_as(x)

The implementation computes normalization in float32 even when the model uses lower precision (float16 or bfloat16). This prevents numerical instability when the RMS value is very small, which could cause division issues in reduced precision.

Let's compare RMSNorm and LayerNorm on typical transformer activations:

In[10]:
Code
def compare_normalizations():
    """Compare RMSNorm and LayerNorm behavior."""
    torch.manual_seed(42)

    # Simulate transformer activations
    batch, seq_len, dim = 4, 128, 512
    x = torch.randn(batch, seq_len, dim) * 0.5 + 0.1  # Slight positive bias

    # Apply both normalizations
    rmsnorm = RMSNorm(dim)
    layernorm = nn.LayerNorm(dim)

    rms_out = rmsnorm(x)
    ln_out = layernorm(x)

    return x, rms_out, ln_out


x, rms_out, ln_out = compare_normalizations()
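
A quick look at the summary statistics makes the comparison concrete (a small sketch; the exact values depend on the random seed):

Code
# Summary statistics for the normalization comparison above
print(f"Input:     mean {x.mean().item():+.4f}, std {x.std().item():.4f}")
print(f"RMSNorm:   mean {rms_out.mean().item():+.4f}, std {rms_out.std().item():.4f}")
print(f"LayerNorm: mean {ln_out.mean().item():+.4f}, std {ln_out.std().item():.4f}")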

The distributions are nearly identical. LayerNorm's output mean is exactly zero by construction, while RMSNorm's mean is very close to zero because the input was already approximately centered. For training stability, this small difference has no practical impact.

Rotary Position Embeddings (RoPE)

Position information is essential for language understanding. Unlike recurrent networks that process tokens sequentially, transformers see all tokens simultaneously. They need explicit position signals to distinguish "the cat sat on the mat" from "the mat sat on the cat."

Early transformers used additive position embeddings: learned or sinusoidal vectors added to token embeddings before attention. RoPE takes a different approach, encoding position through rotation of query and key vectors rather than addition.

The core insight is geometric. When you rotate two vectors and compute their dot product, the result depends on the angle difference, not the absolute angles. If we associate each position with a rotation angle, the attention score between positions $m$ and $n$ naturally depends on their difference $m - n$.

For a query or key vector, RoPE groups dimensions into pairs and rotates each pair by a position-dependent angle:

$$\begin{pmatrix} x_1' \\ x_2' \end{pmatrix} = \begin{pmatrix} \cos(m\theta) & -\sin(m\theta) \\ \sin(m\theta) & \cos(m\theta) \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$$

where:

  • $(x_1, x_2)$: a pair of adjacent dimensions from the query or key vector
  • $m$: the position index
  • $\theta$: the base rotation frequency for that pair
  • $(x_1', x_2')$: the rotated pair

Different dimension pairs use different base frequencies, creating a multi-scale representation similar to sinusoidal position encodings. Lower-frequency rotations capture long-range position relationships; higher-frequency rotations capture local structure.

In[13]:
Code
def precompute_rope_cache(
    head_dim: int, max_seq_len: int, theta: float = 10000.0, device: str = "cpu"
) -> tuple[torch.Tensor, torch.Tensor]:
    """
    Precompute the cos and sin values for RoPE.

    Returns cached values to avoid recomputation during forward pass.
    """
    # Compute inverse frequencies for each dimension pair
    # Shape: (head_dim // 2,)
    inv_freq = 1.0 / (
        theta
        ** (torch.arange(0, head_dim, 2, device=device).float() / head_dim)
    )

    # Position indices
    # Shape: (max_seq_len,)
    positions = torch.arange(max_seq_len, device=device).float()

    # Outer product gives angles for each position and frequency
    # Shape: (max_seq_len, head_dim // 2)
    angles = torch.outer(positions, inv_freq)

    # Duplicate each angle for both dimensions in the pair
    # Shape: (max_seq_len, head_dim)
    angles = torch.cat([angles, angles], dim=-1)

    return torch.cos(angles), torch.sin(angles)
In[14]:
Code
def apply_rope(
    x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor
) -> torch.Tensor:
    """
    Apply rotary position embedding to query or key tensor.

    Args:
        x: Query or key tensor of shape (batch, n_heads, seq_len, head_dim)
        cos: Precomputed cosines of shape (seq_len, head_dim)
        sin: Precomputed sines of shape (seq_len, head_dim)

    Returns:
        Rotated tensor of same shape as x
    """
    seq_len = x.shape[2]
    head_dim = x.shape[3]

    # Reshape x to separate even and odd dimensions
    x_reshape = x.view(*x.shape[:-1], head_dim // 2, 2)
    x_even = x_reshape[..., 0]  # First element of each pair
    x_odd = x_reshape[..., 1]  # Second element of each pair

    # Get cos/sin for current sequence length
    cos = cos[:seq_len].unsqueeze(0).unsqueeze(0)  # (1, 1, seq_len, head_dim)
    sin = sin[:seq_len].unsqueeze(0).unsqueeze(0)

    # Separate cos/sin for even and odd
    cos_even = cos[..., : head_dim // 2]
    sin_even = sin[..., : head_dim // 2]

    # Apply rotation: rotate each pair
    # (x_even, x_odd) -> (x_even * cos - x_odd * sin, x_even * sin + x_odd * cos)
    x_even_new = x_even * cos_even - x_odd * sin_even
    x_odd_new = x_even * sin_even + x_odd * cos_even

    # Interleave back
    x_out = torch.stack([x_even_new, x_odd_new], dim=-1)
    return x_out.view(*x.shape)


The beauty of RoPE lies in how attention scores naturally capture relative position. When a query at position $m$ attends to a key at position $n$, their rotated dot product depends only on the offset $m - n$. Nearby tokens receive similar rotations, while distant tokens are rotated very differently, which tends to attenuate attention between far-apart positions.
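
We can check this relative-position property numerically with the functions defined above: place the same query and key vectors at every position, apply RoPE, and confirm that the score for a given offset is identical early and late in the sequence. A small sanity check, not part of the reference implementation:

Code
# Check that RoPE scores depend only on the position offset, not absolute position
torch.manual_seed(0)
head_dim, max_len = 64, 128
cos, sin = precompute_rope_cache(head_dim, max_len)

# The same query vector and key vector repeated at every position
q = torch.randn(head_dim).repeat(1, 1, max_len, 1)  # (1, 1, max_len, head_dim)
k = torch.randn(head_dim).repeat(1, 1, max_len, 1)

q_rot = apply_rope(q, cos, sin)[0, 0]  # (max_len, head_dim)
k_rot = apply_rope(k, cos, sin)[0, 0]

# Offset of 3 at two very different absolute positions
score_early = q_rot[5] @ k_rot[2]
score_late = q_rot[55] @ k_rot[52]
print(f"score(5, 2)   = {score_early.item():.6f}")
print(f"score(55, 52) = {score_late.item():.6f}")  # matches score(5, 2) up to float error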

SwiGLU Activation Function

The feed-forward network (FFN) in each transformer block expands the hidden dimension, applies a nonlinearity, and projects back. Original transformers used ReLU, $\text{ReLU}(x) = \max(0, x)$. GPT-2 switched to GELU, a smoother activation. LLaMA uses SwiGLU, which combines a gating mechanism with the Swish activation.

SwiGLU is defined as:

$$\text{SwiGLU}(\mathbf{x}) = \left( \sigma(\mathbf{x} W_{\text{gate}}) \odot (\mathbf{x} W_1) \right) W_2$$

where:

  • $\mathbf{x}$: input from the previous layer
  • $W_1 \in \mathbb{R}^{d \times d_{\text{ff}}}$: the "up" projection weight
  • $W_{\text{gate}} \in \mathbb{R}^{d \times d_{\text{ff}}}$: the gate projection weight
  • $\sigma(z) = z \cdot \text{sigmoid}(z)$: the Swish activation function
  • $\odot$: element-wise multiplication
  • $W_2 \in \mathbb{R}^{d_{\text{ff}} \times d}$: the "down" projection weight

The Swish function $\sigma(z) = z \cdot \text{sigmoid}(z)$ is smooth, non-monotonic, and self-gated. It allows small negative values through (unlike ReLU) while still providing nonlinearity.

In[18]:
Code
class SwiGLU(nn.Module):
    """SwiGLU feed-forward network as used in LLaMA."""

    def __init__(self, hidden_dim: int, intermediate_dim: int):
        super().__init__()
        # Up projection (creates the value to be gated)
        self.w1 = nn.Linear(hidden_dim, intermediate_dim, bias=False)
        # Gate projection (creates the gate values)
        self.w_gate = nn.Linear(hidden_dim, intermediate_dim, bias=False)
        # Down projection
        self.w2 = nn.Linear(intermediate_dim, hidden_dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Swish activation: x * sigmoid(x)
        gate = F.silu(self.w_gate(x))  # F.silu is Swish
        # Element-wise product with up-projected input
        hidden = gate * self.w1(x)
        # Project back to hidden dimension
        return self.w2(hidden)

Notice that SwiGLU requires three weight matrices instead of the usual two. To maintain a similar parameter count, LLaMA reduces the intermediate dimension: where a standard FFN might use a $4d$ intermediate size, LLaMA uses approximately $\frac{2}{3} \times 4d = \frac{8}{3}d \approx 2.67d$, rounded up to a multiple of 256.
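
A short helper reproduces the published intermediate sizes from this rule (a sketch of the rounding logic; the multiple-of-256 constant is assumed from the released configuration sizes):

Code
def llama_intermediate_dim(hidden_dim: int, multiple_of: int = 256) -> int:
    """FFN intermediate size: two thirds of 4*hidden_dim, rounded up to a multiple of 256."""
    dim = int(2 * (4 * hidden_dim) / 3)
    return multiple_of * ((dim + multiple_of - 1) // multiple_of)


for d in (4096, 5120, 6656, 8192):
    print(f"hidden {d} -> intermediate {llama_intermediate_dim(d)}")
# hidden 4096 -> intermediate 11008, ..., hidden 8192 -> intermediate 22016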

Let's compare the activation functions:
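
A few sample values make the differences concrete (a minimal numeric sketch using PyTorch's built-in activations):

Code
# Evaluate ReLU, GELU, and Swish (SiLU) at a handful of inputs
z = torch.tensor([-3.0, -1.0, -0.5, 0.0, 0.5, 1.0, 3.0])
print("input:", z.tolist())
print("ReLU: ", [round(v, 3) for v in F.relu(z).tolist()])
print("GELU: ", [round(v, 3) for v in F.gelu(z).tolist()])
print("Swish:", [round(v, 3) for v in F.silu(z).tolist()])

Both GELU and Swish pass small negative values through, which is the behavior the SwiGLU gate builds on.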

The gating mechanism in SwiGLU gives the model more control over information flow. Rather than applying a fixed nonlinearity to all values, the gate can learn to pass certain features through more strongly than others. Research has shown this improves model performance, particularly for larger models.

Multi-Head Attention with RoPE

LLaMA's attention mechanism follows the standard multi-head attention pattern, with RoPE applied to queries and keys before computing attention scores.

In[22]:
Code
class LLaMAAttention(nn.Module):
    """Multi-head attention with RoPE as used in LLaMA."""

    def __init__(self, config: LLaMAConfig):
        super().__init__()
        self.n_heads = config.n_heads
        self.n_kv_heads = config.n_kv_heads
        self.head_dim = config.head_dim
        self.hidden_dim = config.hidden_dim

        # Query, Key, Value projections (no bias)
        self.wq = nn.Linear(
            config.hidden_dim, config.n_heads * config.head_dim, bias=False
        )
        self.wk = nn.Linear(
            config.hidden_dim, config.n_kv_heads * config.head_dim, bias=False
        )
        self.wv = nn.Linear(
            config.hidden_dim, config.n_kv_heads * config.head_dim, bias=False
        )
        self.wo = nn.Linear(
            config.n_heads * config.head_dim, config.hidden_dim, bias=False
        )

        # Precompute RoPE cache
        cos, sin = precompute_rope_cache(
            config.head_dim, config.max_seq_len, config.rope_theta
        )
        self.register_buffer("rope_cos", cos)
        self.register_buffer("rope_sin", sin)

    def forward(
        self, x: torch.Tensor, mask: torch.Tensor = None
    ) -> torch.Tensor:
        batch, seq_len, _ = x.shape

        # Compute Q, K, V
        q = self.wq(x).view(batch, seq_len, self.n_heads, self.head_dim)
        k = self.wk(x).view(batch, seq_len, self.n_kv_heads, self.head_dim)
        v = self.wv(x).view(batch, seq_len, self.n_kv_heads, self.head_dim)

        # Transpose for attention: (batch, n_heads, seq_len, head_dim)
        q = q.transpose(1, 2)
        k = k.transpose(1, 2)
        v = v.transpose(1, 2)

        # Apply RoPE to Q and K
        q = apply_rope(q, self.rope_cos, self.rope_sin)
        k = apply_rope(k, self.rope_cos, self.rope_sin)

        # If using GQA, expand K and V to match number of query heads
        if self.n_kv_heads < self.n_heads:
            n_rep = self.n_heads // self.n_kv_heads
            k = k.repeat_interleave(n_rep, dim=1)
            v = v.repeat_interleave(n_rep, dim=1)

        # Scaled dot-product attention
        scale = self.head_dim**-0.5
        scores = torch.matmul(q, k.transpose(-2, -1)) * scale

        if mask is not None:
            scores = scores + mask

        attn = F.softmax(scores, dim=-1)
        out = torch.matmul(attn, v)

        # Reshape and project output
        out = out.transpose(1, 2).contiguous().view(batch, seq_len, -1)
        return self.wo(out)

Key implementation details:

  • No bias terms: All linear projections have bias=False, reducing parameters
  • RoPE buffers: Cosine and sine values are precomputed and stored as buffers, not parameters
  • GQA support: When n_kv_heads < n_heads, keys and values are repeated to match query count

Transformer Block

Each transformer block combines attention and feed-forward processing with residual connections and pre-normalization:

In[24]:
Code
class LLaMABlock(nn.Module):
    """A single transformer block in LLaMA."""

    def __init__(self, config: LLaMAConfig):
        super().__init__()
        self.attention_norm = RMSNorm(config.hidden_dim, config.norm_eps)
        self.attention = LLaMAAttention(config)
        self.ffn_norm = RMSNorm(config.hidden_dim, config.norm_eps)
        self.ffn = SwiGLU(config.hidden_dim, config.intermediate_dim)

    def forward(
        self, x: torch.Tensor, mask: torch.Tensor = None
    ) -> torch.Tensor:
        # Attention with pre-norm and residual
        x = x + self.attention(self.attention_norm(x), mask)
        # FFN with pre-norm and residual
        x = x + self.ffn(self.ffn_norm(x))
        return x

The pre-normalization pattern (applying normalization before each sub-layer rather than after) has become standard in modern LLMs. Research showed it provides more stable gradients during training, especially for very deep networks. The original "post-norm" pattern could cause gradient issues at initialization, requiring careful learning rate warmup.

Complete LLaMA Model

Assembling all components gives us the complete model:

In[26]:
Code
class LLaMA(nn.Module):
    """Complete LLaMA model."""

    def __init__(self, config: LLaMAConfig):
        super().__init__()
        self.config = config

        # Token embeddings
        self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_dim)

        # Transformer blocks
        self.layers = nn.ModuleList(
            [LLaMABlock(config) for _ in range(config.n_layers)]
        )

        # Final normalization
        self.norm = RMSNorm(config.hidden_dim, config.norm_eps)

        # Output projection (shares weights with embedding in some variants)
        self.lm_head = nn.Linear(
            config.hidden_dim, config.vocab_size, bias=False
        )

        # Initialize weights
        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(
        self, input_ids: torch.Tensor, return_hidden: bool = False
    ) -> torch.Tensor:
        batch, seq_len = input_ids.shape

        # Create causal mask
        mask = torch.triu(
            torch.full(
                (seq_len, seq_len), float("-inf"), device=input_ids.device
            ),
            diagonal=1,
        )

        # Token embeddings
        h = self.embed_tokens(input_ids)

        # Apply transformer blocks
        for layer in self.layers:
            h = layer(h, mask)

        # Final normalization
        h = self.norm(h)

        if return_hidden:
            return h

        # Project to vocabulary
        logits = self.lm_head(h)
        return logits

Let's instantiate a small version and verify it works:

In[28]:
Code
# Create a small LLaMA for testing
small_config = LLaMAConfig(
    vocab_size=1000,
    hidden_dim=256,
    intermediate_dim=688,  # ~2.67 * 256
    n_layers=4,
    n_heads=4,
    max_seq_len=512,
)

model = LLaMA(small_config)

# Test forward pass
test_input = torch.randint(0, 1000, (2, 64))  # batch=2, seq_len=64
output = model(test_input)
print(f"Output shape: {tuple(output.shape)}")  # (batch, seq_len, vocab_size) = (2, 64, 1000)

Model Configurations

LLaMA was released in multiple sizes, each tuned for different compute budgets and use cases:

In[31]:
Code
LLAMA_CONFIGS = {
    "7B": {
        "hidden_dim": 4096,
        "intermediate_dim": 11008,
        "n_layers": 32,
        "n_heads": 32,
    },
    "13B": {
        "hidden_dim": 5120,
        "intermediate_dim": 13824,
        "n_layers": 40,
        "n_heads": 40,
    },
    "33B": {
        "hidden_dim": 6656,
        "intermediate_dim": 17920,
        "n_layers": 60,
        "n_heads": 52,
    },
    "65B": {
        "hidden_dim": 8192,
        "intermediate_dim": 22016,
        "n_layers": 80,
        "n_heads": 64,
    },
}


def estimate_params(config: dict) -> int:
    """Estimate total parameters for a LLaMA configuration."""
    d = config["hidden_dim"]
    d_ff = config["intermediate_dim"]
    n_layers = config["n_layers"]
    n_heads = config["n_heads"]
    vocab_size = 32000

    # Embedding + output: 2 * vocab * d (if weight-tied, divide by 2)
    embed_params = vocab_size * d  # Typically not weight-tied in LLaMA
    output_params = vocab_size * d

    # Per layer:
    # - Attention: Q, K, V, O projections = 4 * d * d
    # - FFN: up, gate, down = 3 * d * d_ff
    # - Norms: 2 * d (RMSNorm only has gamma)
    attn_params = 4 * d * d
    ffn_params = 3 * d * d_ff
    norm_params = 2 * d
    layer_params = attn_params + ffn_params + norm_params

    # Final norm
    final_norm = d

    total = (
        embed_params + output_params + (layer_params * n_layers) + final_norm
    )
    return total
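
Applying the estimator to each configuration gives totals close to the nominal model sizes (the official parameter counts differ slightly, since this is a back-of-the-envelope estimate):

Code
for name, cfg in LLAMA_CONFIGS.items():
    print(f"LLaMA-{name}: ~{estimate_params(cfg) / 1e9:.2f}B parameters")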

The configurations reveal the scaling strategy. Moving from 7B to 65B, all dimensions grow:

  • Hidden dimension scales from 4096 to 8192 (2×)
  • Number of layers scales from 32 to 80 (2.5×)
  • FFN intermediate dimension scales proportionally with hidden dimension

This balanced scaling follows insights from research showing that making models both wider and deeper works better than extreme depth or width alone.

Training Data and Approach

LLaMA's training data consisted exclusively of publicly available sources:

  • CommonCrawl (67%): Web text filtered for quality using a classifier trained on Wikipedia and curated reference data
  • C4 (15%): Cleaned version of CommonCrawl with additional filtering
  • GitHub (4.5%): Public code repositories filtered by license and quality
  • Wikipedia (4.5%): Multiple languages
  • Books (4.5%): Public domain books from Project Gutenberg and Books3
  • ArXiv (2.5%): Scientific papers
  • StackExchange (2%): Question-answer pairs

The total training set contained approximately 1.4 trillion tokens. Most sources were seen roughly once during training; only the Wikipedia and books data were repeated for about two epochs. LLaMA-7B and 13B trained on approximately 1.0 trillion tokens, while the 33B and 65B models trained on the full 1.4 trillion tokens.

This approach contrasted with the prevailing wisdom of "more data is always better." By training longer on high-quality data, LLaMA achieved strong performance without requiring access to proprietary datasets.

Efficiency Considerations

LLaMA's design prioritizes inference efficiency alongside training performance. Several choices contribute to this:

Reduced parameters per layer: RMSNorm eliminates the bias term, and all linear projections are bias-free. This reduces both memory footprint and computation.

RoPE efficiency: Position embeddings are computed from cosine and sine functions applied to cached values, not stored as learned parameters. This reduces memory while enabling arbitrary sequence length extension.

Attention optimization: The architecture is compatible with FlashAttention and other fused kernels. No architectural quirks prevent these optimizations.

Let's measure the computational profile:

In[37]:
Code
def count_flops_per_token(config: dict, seq_len: int) -> dict:
    """Estimate FLOPs per token for a LLaMA configuration."""
    d = config["hidden_dim"]
    d_ff = config["intermediate_dim"]
    n_layers = config["n_layers"]
    vocab_size = 32000

    flops = {}

    # Embedding lookup (negligible)
    flops["embedding"] = 0

    # Per layer attention:
    # - QKV projection: 3 * 2 * d * d (MatMul = 2 * m * n * k)
    # - Attention scores: 2 * d * seq_len
    # - Attention output: 2 * d * seq_len
    # - Output projection: 2 * d * d
    flops["attention"] = n_layers * (
        3 * 2 * d * d  # Q, K, V projections
        + 2 * d * seq_len  # Attention scores
        + 2 * d * seq_len  # Attention @ V
        + 2 * d * d  # Output projection
    )

    # Per layer FFN:
    # - Up projection: 2 * d * d_ff
    # - Gate projection: 2 * d * d_ff
    # - Down projection: 2 * d_ff * d
    # Plus element-wise operations (negligible)
    flops["ffn"] = n_layers * (
        2 * d * d_ff  # Up
        + 2 * d * d_ff  # Gate
        + 2 * d_ff * d  # Down
    )

    # Output projection
    flops["output"] = 2 * d * vocab_size

    flops["total"] = sum(flops.values())

    return flops
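
Applying the estimate to the 7B configuration at its 2048-token training context shows the split (a rough sketch; the attention share grows somewhat with longer sequences):

Code
flops_7b = count_flops_per_token(LLAMA_CONFIGS["7B"], seq_len=2048)
for component in ("attention", "ffn", "output"):
    share = flops_7b[component] / flops_7b["total"]
    print(f"{component:>9}: {share:6.1%} of per-token FLOPs")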

The FFN dominates computation for all model sizes, consistent with the 2/3 FFN to 1/3 attention ratio typical of transformers. This is why efficient FFN implementations (like SwiGLU with its fused operations) matter for inference speed.

LLaMA 2 and Beyond

Following the original release, Meta released LLaMA 2 with several improvements:

  • Longer context: Extended from 2048 to 4096 tokens
  • Grouped-Query Attention (GQA): Reduced key-value heads for better inference efficiency
  • More training data: Increased from 1.4T to 2T tokens
  • RLHF-tuned variants: LLaMA 2-Chat models fine-tuned for dialogue

The core architecture remained the same, validating that the original design choices were sound. GQA was the main architectural change, reducing memory bandwidth during inference by sharing key-value heads across multiple query heads.

In[40]:
Code
# LLaMA 2 uses grouped-query attention
LLAMA2_CONFIGS = {
    "7B": {"n_heads": 32, "n_kv_heads": 32},  # Full MHA
    "13B": {"n_heads": 40, "n_kv_heads": 40},  # Full MHA
    "70B": {"n_heads": 64, "n_kv_heads": 8},  # GQA with 8:1 ratio
}

The 70B model uses GQA with an 8:1 ratio, meaning 8 query heads share each key-value head. This reduces the KV cache size by 8× during inference, dramatically improving throughput for long sequences.
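
To see where that factor comes from, here is a quick estimate of KV cache size per token (a sketch assuming float16 storage and the published 70B dimensions: 80 layers, 64 query heads, 128-dimensional heads):

Code
def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int, bytes_per_elem: int = 2) -> int:
    """KV cache per token: one key and one value vector per KV head, per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


mha = kv_cache_bytes_per_token(n_layers=80, n_kv_heads=64, head_dim=128)  # full multi-head attention
gqa = kv_cache_bytes_per_token(n_layers=80, n_kv_heads=8, head_dim=128)   # LLaMA 2 70B with GQA
print(f"MHA: {mha // 1024} KiB/token, GQA: {gqa // 1024} KiB/token, saving: {mha // gqa}x")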

Limitations and Impact

LLaMA's open release sparked a wave of derivative models and research. Within weeks, fine-tuned variants like Alpaca, Vicuna, and Koala demonstrated that instruction-following could be added with modest additional training. Quantized versions enabled running 7B and 13B models on consumer hardware. The architecture became the de facto standard for open-source LLMs.

However, several limitations deserve attention:

Context length constraints. The original LLaMA was trained with 2048-token context. While RoPE theoretically enables extension to longer sequences, practical performance degraded beyond the training length without additional fine-tuning. LLaMA 2 addressed this partially with 4096 tokens, but modern applications often demand much longer contexts.

Tokenizer limitations. LLaMA used a SentencePiece tokenizer with 32,000 vocabulary size, trained primarily on English. This led to inefficient tokenization for other languages and code, where more tokens are needed to represent the same content.

Training compute requirements. Despite being more efficient than comparably-performing models, training LLaMA still required substantial compute. The 65B model used 2048 A100 GPUs for approximately 21 days. This remains out of reach for most research groups, though inference and fine-tuning are much more accessible.

Safety considerations. As a base model without RLHF alignment, the original LLaMA could generate harmful content. The open release required navigating the tension between research access and potential misuse. Meta's approach of releasing weights under a research license attempted to balance these concerns.

Despite these limitations, LLaMA's impact on the field was profound. It demonstrated that architectural innovation alone isn't necessary for strong performance. Careful data curation, appropriate training duration, and thoughtful combination of established techniques can produce excellent models. The architecture influenced virtually every open-source LLM that followed, from Mistral to Qwen to Gemma.

Summary

LLaMA established a blueprint for modern open-source language models through careful combination of proven architectural improvements rather than novel techniques.

Key architectural choices that define LLaMA:

  • RMSNorm instead of LayerNorm: Removes the mean-centering step, trimming computation at every normalization layer without sacrificing training stability.

  • Rotary Position Embeddings (RoPE): Encodes position through rotation of query and key vectors, naturally producing relative position sensitivity in attention scores. No additional parameters needed.

  • SwiGLU activation: Combines gating with Swish activation for improved expressiveness in the feed-forward network. Uses three projections but with reduced intermediate dimension to maintain parameter count.

  • Pre-normalization: Applies RMSNorm before each sub-layer rather than after, providing more stable gradients for deep networks.

  • No bias terms: Eliminates bias parameters from all linear projections, reducing memory and computation.

The training approach emphasized quality over quantity:

  • Training on curated public data only, demonstrating that proprietary datasets aren't strictly necessary
  • Training smaller models for longer, on more tokens than compute-optimal scaling suggested, rather than simply growing parameter counts
  • Careful data mixing across sources including web text, code, books, and academic papers

LLaMA's model configurations span from 7B to 65B parameters, with the 13B model notably matching GPT-3's performance on many benchmarks despite having fewer than 10% of the parameters. This efficiency came from the confluence of architectural choices and training methodology.

The release catalyzed the open-source LLM ecosystem. By providing both weights and a clean architecture, LLaMA enabled rapid experimentation with fine-tuning, quantization, and deployment. Nearly every subsequent open-source model has built upon or been influenced by its design choices.

Understanding LLaMA's architecture provides a foundation for comprehending modern language models. The techniques it popularized, from RMSNorm to RoPE to SwiGLU, appear throughout the current generation of models. Whether you're studying model internals, fine-tuning for specific applications, or building new architectures, LLaMA's design decisions offer valuable lessons in practical deep learning engineering.
