Learn how to design transformer architectures by understanding the key hyperparameters: model depth, width, attention heads, and FFN dimensions. Complete guide with parameter calculations and design principles.

Architecture Hyperparameters
Building a transformer is like designing a building: you must decide how tall to make it, how wide each floor should be, and how to allocate space between different rooms. These architectural decisions, the hyperparameters of transformer design, determine everything from model capacity to training dynamics to inference speed. Get them wrong, and your model might train slowly, underperform, or consume excessive memory. Get them right, and you unlock the performance that makes transformers so powerful.
This chapter explores the key architectural hyperparameters: model depth versus width trade-offs, the number of attention heads, feed-forward network dimensions, and how these choices interact. We'll derive formulas for calculating total parameters, examine how successful models like GPT, BERT, and LLaMA made these decisions, and develop intuition for designing architectures tailored to specific constraints.
The Core Hyperparameters
Every transformer architecture is defined by a small set of core hyperparameters that determine its capacity and computational requirements. Understanding these parameters and their interactions is essential for both implementing existing architectures and designing new ones.
A transformer's architecture is primarily determined by six numbers: vocabulary size ($V$), number of layers ($L$), model dimension ($d_{\text{model}}$), number of attention heads ($n_{\text{heads}}$), head dimension ($d_{\text{head}}$), and feed-forward dimension ($d_{\text{ff}}$). Everything else, including parameter count, memory usage, and computational cost, derives from these choices.
Let's define each parameter precisely:
- $V$ (vocabulary size): The number of unique tokens the model can process. Typical values range from 30,000 (BERT) to 100,000 (GPT-4 class models). Larger vocabularies reduce sequence length but increase embedding memory.
- $L$ (number of layers): How many transformer blocks are stacked. Also called model depth. Values range from 6 (small models) to 96+ (frontier models). More layers provide more representational capacity but increase compute and memory linearly.
- $d_{\text{model}}$ (model dimension): The hidden dimension that flows through the entire model. Also called embedding dimension or hidden size. This dimension is preserved throughout the transformer, from embeddings through attention and FFN layers. Common values: 768 (BERT-base), 1024 (BERT-large), 4096 (LLaMA-7B), 8192 (LLaMA-70B).
- $n_{\text{heads}}$ (number of attention heads): How many parallel attention operations run in each layer. The model dimension must be evenly divisible by the number of heads. More heads allow the model to attend to different types of relationships simultaneously.
- $d_{\text{head}}$ (head dimension): The dimension of queries, keys, and values within each attention head. Computed as $d_{\text{model}} / n_{\text{heads}}$ in standard multi-head attention. Modern architectures sometimes set this independently, often to 64 or 128.
- $d_{\text{ff}}$ (feed-forward dimension): The hidden dimension of the feed-forward network within each block. Traditionally set to $4 \cdot d_{\text{model}}$. For SwiGLU-based FFNs, typically $\tfrac{8}{3} \cdot d_{\text{model}}$ to maintain a similar parameter count.
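To make these numbers concrete, here are a few published configurations collected into a small Python dictionary. The values are the publicly documented ones for each model; the dictionary itself is just an illustrative way to compare them:

```python
# Published configurations for a few well-known models (public values).
# d_head = d_model / n_heads; ffn_ratio = d_ff / d_model.
CONFIGS = {
    "BERT-base":   dict(vocab=30522, layers=12, d_model=768,  heads=12, d_ff=3072),
    "GPT-2 small": dict(vocab=50257, layers=12, d_model=768,  heads=12, d_ff=3072),
    "GPT-2 XL":    dict(vocab=50257, layers=48, d_model=1600, heads=25, d_ff=6400),
    "LLaMA-7B":    dict(vocab=32000, layers=32, d_model=4096, heads=32, d_ff=11008),
    "LLaMA-13B":   dict(vocab=32000, layers=40, d_model=5120, heads=40, d_ff=13824),
}

for name, c in CONFIGS.items():
    d_head = c["d_model"] // c["heads"]
    ffn_ratio = c["d_ff"] / c["d_model"]
    print(f"{name:12s}  d_head={d_head:4d}  ffn_ratio={ffn_ratio:.2f}")
```

Running this prints head dimensions of 64 or 128 and FFN ratios of 4.0 (BERT, GPT-2) or roughly 2.7 (LLaMA), which leads directly to the patterns discussed next.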
Notice the patterns across these models. Head dimensions cluster around 64 to 128, suggesting this range provides a good balance between per-head expressiveness and enabling multiple heads. The FFN ratio is consistently around 4x for traditional architectures (BERT, GPT-2) but closer to 2.7x for SwiGLU-based models (LLaMA) because the gating mechanism effectively doubles the transformation, requiring less explicit expansion.
This scatter plot reveals that successful architectures don't maximize just one dimension. Instead, they follow a roughly linear relationship between depth and width, scaling both together. The diagonal trend suggests an empirical consensus about optimal proportions.
Depth vs Width Trade-offs
Given a fixed parameter budget, should you build a deep, narrow model or a shallow, wide one? This fundamental question shapes every transformer design decision. Think of it like allocating floors and square footage in a building: you can build a tall skyscraper with modest floor plans, or a sprawling low-rise complex. Both might have similar total space, but they serve different purposes.
Depth, the number of transformer layers, determines how many sequential processing stages information passes through. Each layer can refine, combine, and abstract the representations from the previous layer. This sequential refinement is crucial for tasks requiring multi-step reasoning: understanding that "the trophy wouldn't fit in the suitcase because it was too big" requires first parsing the sentence, then resolving "it" to either trophy or suitcase, then applying physical reasoning. Deeper models have more opportunities for such compositional processing.
Width, primarily governed by $d_{\text{model}}$, determines how rich each representation can be. A wider model can encode more nuanced distinctions at each layer: more semantic dimensions, finer-grained syntactic features, more factual associations. Think of width as the vocabulary of concepts available at each processing stage.
Research has revealed important patterns in how these two dimensions contribute to capability. Deeper models tend to develop hierarchical representations, with early layers capturing surface patterns (word shapes, common phrases) and later layers capturing abstractions (sentiment, intent, reasoning chains). Wider models have more capacity per layer but may not develop the same hierarchical structure, instead learning richer but flatter representations.
Empirical studies suggest that depth is more important for tasks requiring multi-step reasoning or hierarchical understanding. Width matters more for tasks requiring rich representations at a single level of abstraction. Most practical tasks benefit from a balance of both.
Understanding this trade-off requires knowing how each dimension affects parameter count. Let's derive the relationship from first principles. For a simplified transformer (ignoring biases and layer norms for clarity), the main parameter contributors are:
- Embedding layer: $V \cdot d_{\text{model}}$ parameters
- Attention (per layer): $4 \cdot d_{\text{model}}^2$ parameters (Q, K, V, and output projections)
- FFN (per layer): $2 \cdot d_{\text{model}} \cdot d_{\text{ff}}$ parameters (two weight matrices)
- Output projection: Often tied to embeddings, so 0 additional parameters
Summing these components gives us the total parameter count for a transformer model:

$$N_{\text{params}} \approx V \cdot d_{\text{model}} + L \cdot \left( 4 \, d_{\text{model}}^2 + 2 \, d_{\text{model}} \, d_{\text{ff}} \right)$$

where:
- $V$: vocabulary size (number of unique tokens)
- $d_{\text{model}}$: model dimension (hidden size)
- $L$: number of transformer layers
- $d_{\text{ff}}$: feed-forward hidden dimension
- $4 \, d_{\text{model}}^2$: attention parameters per layer (four projection matrices: Q, K, V, O)
- $2 \, d_{\text{model}} \, d_{\text{ff}}$: FFN parameters per layer (two weight matrices: up-projection and down-projection)
For large models, the embedding term becomes negligible compared to the layer terms. With the standard expansion ratio $d_{\text{ff}} = 4 \, d_{\text{model}}$, we can simplify by substituting this relationship into the per-layer FFN term:

$$2 \, d_{\text{model}} \, d_{\text{ff}} = 2 \, d_{\text{model}} \cdot 4 \, d_{\text{model}} = 8 \, d_{\text{model}}^2$$

Adding the attention term gives us the simplified total:

$$N_{\text{params}} \approx L \cdot \left( 4 \, d_{\text{model}}^2 + 8 \, d_{\text{model}}^2 \right) = 12 \, L \, d_{\text{model}}^2$$

This formula reveals a key insight about scaling behavior: parameters scale linearly with depth ($L$) but quadratically with width ($d_{\text{model}}$). Doubling the depth doubles the parameters; doubling the width quadruples them.
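As a quick illustration using the simplified $12 \, L \, d_{\text{model}}^2$ approximation (the specific layer counts and widths below are just an example configuration pair, not a recommendation):

```python
def approx_params(layers: int, d_model: int) -> int:
    """Approximate non-embedding parameters: 4*d^2 attention + 8*d^2 FFN per layer."""
    return 12 * layers * d_model ** 2

deep_narrow = approx_params(layers=48, d_model=1024)   # 48 stages of refinement
shallow_wide = approx_params(layers=12, d_model=2048)  # 12 stages, twice as wide

print(f"deep & narrow  (48 x 1024): {deep_narrow / 1e6:.0f}M parameters")
print(f"shallow & wide (12 x 2048): {shallow_wide / 1e6:.0f}M parameters")
# Both land at ~604M: quartering the depth while doubling the width keeps the
# count constant, because parameters grow linearly in L but quadratically in d_model.
```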
Although both configurations have similar total parameters, their characteristics differ significantly. The deep model processes information through 48 stages of refinement. The wide model has only 12 stages but can represent much richer information at each stage. Research by Tay et al. (2021) and others suggests that for language modeling, depth tends to be more important, but the optimal ratio depends on the specific task and compute constraints.
The visualization shows why width is so costly: moving from $d_{\text{model}}$ to $2 \, d_{\text{model}}$ quadruples the parameters at every depth level. In contrast, adding layers adds parameters linearly. This explains why many successful architectures (like GPT-3 and LLaMA) achieve large parameter counts primarily through depth, keeping width moderate.
The Scaling Laws Perspective
The depth-width trade-off isn't just theoretical. Large-scale empirical studies have quantified how each dimension affects model capability as a function of compute budget. Scaling laws research, notably by Hoffmann et al. (2022) in the Chinchilla paper, provides rigorous guidance on optimal model shapes.
The Chinchilla findings were striking: most large language models were undertrained. Given a fixed compute budget, models achieved better performance with fewer parameters and more training data than the prevailing GPT-3 style of maximizing model size. But within the model size constraint, how should parameters be distributed between depth and width?
Studies have found that:
- Depth should scale roughly as the square root of the parameter count
- Width should scale similarly to depth
- The width-to-depth ratio ($d_{\text{model}} / L$) tends to stay fairly constant across scales
Modern frontier models follow these principles. LLaMA-7B has 32 layers and $d_{\text{model}} = 4096$, giving a ratio of about 128. LLaMA-70B has 80 layers and $d_{\text{model}} = 8192$, maintaining a similar ratio of about 100. This consistency suggests that the field has converged on roughly optimal proportions.
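A quick sanity check of this width-to-depth ratio across the LLaMA family, using the published layer counts and widths (the script is only an illustration, not part of any reference implementation):

```python
# (layers, d_model) as published for each LLaMA model
llama = {"LLaMA-7B": (32, 4096), "LLaMA-13B": (40, 5120), "LLaMA-70B": (80, 8192)}

for name, (layers, d_model) in llama.items():
    print(f"{name:10s}  d_model / L = {d_model / layers:.0f}")
# 7B -> 128, 13B -> 128, 70B -> 102: the ratio stays in a narrow band as scale grows.
```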
Number of Attention Heads
Multi-head attention works by splitting the model dimension into parallel "heads," each computing its own attention pattern independently. One head might attend to syntactic relationships (subject-verb agreement), another to semantic similarity (synonyms and related concepts), and yet another to positional patterns (nearby words). This parallel structure allows the model to capture multiple types of relationships simultaneously.
But how many heads should you use? The answer involves a fundamental trade-off. More heads mean more parallel attention patterns, enabling the model to track more distinct relationships. Fewer heads mean each head has more dimensions to work with, enabling richer attention computations per head. Since the model dimension must be divided among heads, you can't maximize both simultaneously.
The key constraint is that $d_{\text{model}}$ must be evenly divisible by $n_{\text{heads}}$. This divisibility requirement determines the per-head dimension:

$$d_{\text{head}} = \frac{d_{\text{model}}}{n_{\text{heads}}}$$

where:
- $d_{\text{head}}$: the dimension of queries, keys, and values within each attention head
- $d_{\text{model}}$: the model's hidden dimension
- $n_{\text{heads}}$: the number of parallel attention heads

More heads mean smaller per-head dimensions. With $d_{\text{model}} = 768$ and 12 heads, each head has 64 dimensions. With 24 heads, each would have only 32 dimensions. The trade-off is between having many specialized heads versus having fewer but more expressive heads.
Empirically, head dimensions between 64 and 128 work best. Smaller heads lack the capacity to compute meaningful attention patterns. Larger heads provide diminishing returns while reducing the number of parallel attention mechanisms.
The number of valid configurations varies with model dimension. Highly composite numbers like 768 (which factors as $2^8 \cdot 3$) offer many choices, while prime-heavy dimensions constrain options. Model designers often choose dimensions specifically for their factorization properties.
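The minimal sketch below enumerates the valid (heads, per-head dimension) pairs for a given model dimension, which is a handy way to see why highly composite widths are popular. The 32-256 bounds on the head dimension are illustrative, not a hard rule:

```python
def valid_head_configs(d_model: int, d_head_min: int = 32, d_head_max: int = 256):
    """Return all (n_heads, d_head) pairs where d_model splits evenly into heads."""
    configs = []
    for n_heads in range(1, d_model + 1):
        if d_model % n_heads == 0:
            d_head = d_model // n_heads
            if d_head_min <= d_head <= d_head_max:
                configs.append((n_heads, d_head))
    return configs

print(valid_head_configs(768))   # includes (12, 64), used by BERT-base and GPT-2 small
print(valid_head_configs(4096))  # includes (32, 128), used by LLaMA-7B
```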
The curves illustrate the fundamental constraint: for any model dimension, increasing heads necessarily decreases per-head dimension. The shaded zone shows where most successful architectures operate. Too few heads (right side of curves, high per-head dimension) means limited parallel attention patterns. Too many heads (left side, low per-head dimension) means each head lacks capacity for meaningful computation.
Multi-Query and Grouped-Query Attention
During autoregressive generation, transformers face a memory bottleneck: they must store the key and value vectors for every previous token in the sequence. This "KV cache" grows linearly with sequence length and can dominate GPU memory for long contexts. With 32 heads and 128-dimensional vectors, a 4096-token context requires storing over 33 million floating-point values per layer, just for the cache.
Modern architectures address this through an elegant asymmetry: reduce the number of key-value heads while keeping query heads high. The insight is that queries need fine-grained distinctions (each position asks different questions), but keys and values can be shared across groups of queries without significant quality loss. This design, called grouped-query attention (GQA), dramatically reduces cache size while preserving most of the model's expressiveness.
To understand the parameter and memory implications, let's examine how projections work in standard multi-head attention. Each head has its own Q, K, and V projections, with parameter counts computed as the product of the number of heads, the per-head dimension, and the model dimension:
- Q projections: $n_{\text{heads}} \cdot d_{\text{head}} \cdot d_{\text{model}}$ parameters
- K projections: $n_{\text{heads}} \cdot d_{\text{head}} \cdot d_{\text{model}}$ parameters
- V projections: $n_{\text{heads}} \cdot d_{\text{head}} \cdot d_{\text{model}}$ parameters
- Output projection: $d_{\text{model}} \cdot d_{\text{model}}$ parameters
where $n_{\text{heads}}$ is the number of attention heads, $d_{\text{head}}$ is the dimension per head, and $d_{\text{model}}$ is the model dimension. Since $n_{\text{heads}} \cdot d_{\text{head}} = d_{\text{model}}$, each projection contributes $d_{\text{model}}^2$ parameters.
In grouped-query attention (GQA) with $n_{\text{kv}}$ key-value heads, the K and V projections are shared across groups of query heads:
- Q projections: unchanged ($n_{\text{heads}} \cdot d_{\text{head}} \cdot d_{\text{model}} = d_{\text{model}}^2$)
- K projections: reduced ($n_{\text{kv}} \cdot d_{\text{head}} \cdot d_{\text{model}}$)
- V projections: reduced ($n_{\text{kv}} \cdot d_{\text{head}} \cdot d_{\text{model}}$)
- Output projection: unchanged
Each group of $n_{\text{heads}} / n_{\text{kv}}$ query heads shares the same K and V projections, reducing parameters while maintaining the full expressiveness of query computations.
The parameter savings from GQA are modest (the reduction applies only to the K and V projections), but the real benefit is during inference. The KV cache, which stores key and value vectors for all previous tokens during autoregressive generation, shrinks proportionally with $n_{\text{kv}} / n_{\text{heads}}$. For long sequences, this can be the difference between fitting a model in GPU memory or not.
LLaMA-2 70B uses GQA with 8 KV heads and 64 query heads, reducing KV cache memory by 8x compared to standard MHA while maintaining quality. This design choice enables longer context windows with the same hardware.
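A back-of-the-envelope sketch of the KV cache size for this shape, assuming FP16 storage and batch size 1 (exact numbers depend on the serving setup):

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, d_head, bytes_per_value=2):
    """KV cache size: keys + values for every token, layer, and KV head."""
    return 2 * seq_len * n_layers * n_kv_heads * d_head * bytes_per_value

# LLaMA-2 70B-like shape: 80 layers, 64 query heads, d_head = 128
for n_kv in (64, 8):  # 64 = standard MHA, 8 = GQA as used by LLaMA-2 70B
    gib = kv_cache_bytes(seq_len=4096, n_layers=80, n_kv_heads=n_kv, d_head=128) / 2**30
    label = "MHA" if n_kv == 64 else "GQA(8)"
    print(f"{label:7s} 4096-token cache: {gib:.2f} GiB per sequence")
# MHA: ~10 GiB vs GQA(8): ~1.25 GiB -- the 8x reduction noted above.
```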
The logarithmic scales reveal the dramatic impact of GQA. At 128K tokens, standard MHA would require over 1TB of KV cache memory (summed across all layers), while GQA with 8 KV heads needs only about 130GB. This difference is why long-context models universally adopt GQA or MQA.
Feed-Forward Network Dimensions
While attention layers decide which tokens to combine, the feed-forward network (FFN) performs the actual transformation of each token's representation. This division of labor has a surprising consequence: the FFN contains far more parameters than attention, making it the dominant factor in model size.
Understanding why requires examining what each component does. Attention computes weighted averages of value vectors, a fundamentally linear operation (the only nonlinearity is softmax for computing weights). The FFN, by contrast, applies a nonlinear transformation that can reshape the representation space, activate or suppress features, and encode factual associations. Research has shown that specific FFN neurons activate for particular concepts: "Eiffel Tower," "Python programming," or "past tense verbs." The FFN is where the model stores much of its learned knowledge.
This knowledge-storage role explains why FFNs need substantial capacity. The original transformer paper established the convention $d_{\text{ff}} = 4 \cdot d_{\text{model}}$, providing the FFN with four times more hidden dimensions than the model's representation size. This 4x expansion gives each layer a wide intermediate space for computation before projecting back to $d_{\text{model}}$ dimensions.
The Standard Expansion Ratio
Let's derive exactly how this expansion affects parameter counts and understand why FFNs dominate model size.
For a standard two-layer FFN, the computation applies an up-projection, activation, and down-projection:

$$\text{FFN}(x) = W_2 \, \text{GELU}(W_1 x + b_1) + b_2$$

where:
- $x \in \mathbb{R}^{d_{\text{model}}}$: the input vector (one token's representation)
- $W_1 \in \mathbb{R}^{d_{\text{ff}} \times d_{\text{model}}}$: the up-projection weight matrix
- $W_2 \in \mathbb{R}^{d_{\text{model}} \times d_{\text{ff}}}$: the down-projection weight matrix
- $b_1 \in \mathbb{R}^{d_{\text{ff}}}$, $b_2 \in \mathbb{R}^{d_{\text{model}}}$: bias vectors
- $\text{GELU}$: the Gaussian Error Linear Unit activation function

The total parameter count includes both weight matrices and bias vectors:

$$P_{\text{FFN}} = 2 \cdot d_{\text{model}} \cdot d_{\text{ff}} + d_{\text{ff}} + d_{\text{model}}$$

The first term accounts for the two weight matrices (each with $d_{\text{model}} \cdot d_{\text{ff}}$ parameters), while the remaining terms count the bias vectors. With the standard expansion ratio $d_{\text{ff}} = 4 \, d_{\text{model}}$ and ignoring the relatively small bias terms:

$$P_{\text{FFN}} \approx 2 \cdot d_{\text{model}} \cdot 4 \, d_{\text{model}} = 8 \, d_{\text{model}}^2$$

Compare this to attention parameters for standard multi-head attention (MHA), which has four projection matrices (Q, K, V, and output), each of size $d_{\text{model}} \times d_{\text{model}}$:

$$P_{\text{attn}} = 4 \, d_{\text{model}}^2$$
The FFN has twice as many parameters as attention. This 2:1 ratio is consistent across most transformer architectures and explains why FFN optimization is so important for efficiency.
The visualization confirms that regardless of architecture choice, FFN parameters dominate. Even with SwiGLU's reduced expansion ratio, the FFN still accounts for roughly twice the parameters of attention. This consistent 2:1 ratio has important implications for optimization techniques like mixture-of-experts, which specifically target FFN layers.
SwiGLU and Adjusted Ratios
The standard FFN applies a simple pattern: project up, apply nonlinearity, project down. But research has shown that adding a gating mechanism can improve performance significantly. The gating idea comes from LSTMs and GRUs: instead of passing all information through uniformly, learn which dimensions to emphasize and which to suppress.
SwiGLU (SiLU-Gated Linear Unit) implements this by computing two parallel projections of the input. One projection becomes the "content" that might pass through. The other projection, after applying the SiLU activation, becomes the "gate" that controls how much of each dimension actually passes. The element-wise product of content and gate creates a selective filtering effect.
This gating mechanism requires three weight matrices instead of two:

$$\text{SwiGLU}(x) = W_3 \left( \text{SiLU}(W_1 x) \odot W_2 x \right)$$

where:
- $x \in \mathbb{R}^{d_{\text{model}}}$: the input vector
- $W_1 \in \mathbb{R}^{d_{\text{ff}} \times d_{\text{model}}}$: the gate projection matrix
- $W_2 \in \mathbb{R}^{d_{\text{ff}} \times d_{\text{model}}}$: the up-projection matrix
- $W_3 \in \mathbb{R}^{d_{\text{model}} \times d_{\text{ff}}}$: the down-projection matrix
- $\text{SiLU}$: the Sigmoid Linear Unit activation (also called Swish)
- $\odot$: element-wise multiplication

With three matrices each containing $d_{\text{model}} \cdot d_{\text{ff}}$ parameters, the total count is:

$$P_{\text{SwiGLU}} = 3 \cdot d_{\text{model}} \cdot d_{\text{ff}}$$

To maintain the same parameter count as a standard FFN with 4x expansion, we can solve for the required SwiGLU expansion ratio. Setting the SwiGLU parameter count equal to the standard FFN count:

$$3 \cdot d_{\text{model}} \cdot d_{\text{ff}} = 8 \, d_{\text{model}}^2$$

Dividing both sides by $3 \, d_{\text{model}}$ gives the required FFN dimension:

$$d_{\text{ff}} = \frac{8}{3} \, d_{\text{model}} \approx 2.67 \, d_{\text{model}}$$

In practice, $d_{\text{ff}}$ is then rounded to a multiple of 256 for efficient GPU computation. LLaMA uses $d_{\text{ff}} \approx \tfrac{8}{3} \, d_{\text{model}}$ (specifically 11008 for $d_{\text{model}} = 4096$), which is close to the theoretical value of $\tfrac{8}{3} \cdot 4096 \approx 10923$.
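A small sketch of this sizing rule (the round-up-to-256 step mirrors what LLaMA-style configurations do, though specific codebases differ in exactly how they round):

```python
def swiglu_ffn_dim(d_model: int, multiple_of: int = 256) -> int:
    """SwiGLU hidden size: 8/3 * d_model, rounded up to a hardware-friendly multiple."""
    raw = int(8 * d_model / 3)
    return multiple_of * ((raw + multiple_of - 1) // multiple_of)

d_model = 4096
d_ff_rounded = swiglu_ffn_dim(d_model)              # 11008, matching LLaMA-7B
standard_4x   = 2 * d_model * (4 * d_model)          # standard FFN, 4x expansion
swiglu_exact  = 3 * d_model * int(8 * d_model / 3)   # SwiGLU at exactly 8/3
swiglu_round  = 3 * d_model * d_ff_rounded           # SwiGLU rounded to 11008

for label, p in [("standard 4x", standard_4x), ("SwiGLU 8/3", swiglu_exact),
                 ("SwiGLU rounded", swiglu_round)]:
    print(f"{label:15s} {p / 1e6:.1f}M parameters per layer")
```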
The three configurations have nearly identical parameter counts, demonstrating that SwiGLU's third matrix is compensated by the smaller expansion ratio. This equivalence is intentional: it allows fair comparisons between architectures and predictable parameter budgets.
Total Parameter Calculation
We've now examined each component of a transformer: embeddings that map tokens to vectors, attention mechanisms that mix information across positions, and feed-forward networks that transform each position independently. To design architectures or understand published models, we need to combine these pieces into a complete parameter count.
This matters for several practical reasons:
- Hardware planning: GPU memory limits how many parameters you can train. A 70B parameter model in FP16 requires 140GB just for weights, before accounting for optimizer states, activations, or gradients.
- Cost estimation: Training compute scales with parameter count. Knowing your model's size helps estimate training time and cost.
- Architecture verification: When implementing published models, parameter counting helps verify your implementation matches the original.
- Design iteration: When exploring architectures, quick parameter estimates help evaluate trade-offs without running experiments.
Let's build a comprehensive parameter calculator that breaks down contributions from each component.
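Here is one way such a calculator might look. This is a sketch, not a reference implementation: it ignores small terms like layer-norm gains and biases, treats the FFN as either standard (two matrices) or SwiGLU (three matrices), and lets you toggle GQA and embedding tying:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TransformerConfig:
    vocab_size: int
    n_layers: int
    d_model: int
    n_heads: int
    d_ff: int
    n_kv_heads: Optional[int] = None  # None => standard multi-head attention
    swiglu: bool = False              # True => gated FFN with three weight matrices
    tie_embeddings: bool = True       # share input embedding and output projection

def count_parameters(cfg: TransformerConfig) -> dict:
    """Approximate parameter counts per component (ignores norms and biases)."""
    d_head = cfg.d_model // cfg.n_heads
    n_kv = cfg.n_kv_heads or cfg.n_heads

    embedding = cfg.vocab_size * cfg.d_model
    output_proj = 0 if cfg.tie_embeddings else cfg.vocab_size * cfg.d_model

    # Attention: Q and output projections stay full-size; K/V shrink under GQA.
    attn_per_layer = (2 * cfg.d_model * cfg.d_model        # Q and output
                      + 2 * n_kv * d_head * cfg.d_model)   # K and V
    # FFN: two matrices for the standard design, three for SwiGLU.
    ffn_per_layer = (3 if cfg.swiglu else 2) * cfg.d_model * cfg.d_ff

    return {
        "embedding": embedding,
        "output_projection": output_proj,
        "attention": cfg.n_layers * attn_per_layer,
        "ffn": cfg.n_layers * ffn_per_layer,
        "total": embedding + output_proj
                 + cfg.n_layers * (attn_per_layer + ffn_per_layer),
    }
```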
Let's verify our calculator against known model sizes:
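The configurations below are the published ones for each model (GPT-2 Small ties its output projection to the embedding; LLaMA does not), so any residual gap against the official counts comes from the small terms the sketch ignores:

```python
gpt2_small = TransformerConfig(vocab_size=50257, n_layers=12, d_model=768,
                               n_heads=12, d_ff=3072, tie_embeddings=True)
llama_7b   = TransformerConfig(vocab_size=32000, n_layers=32, d_model=4096,
                               n_heads=32, d_ff=11008, swiglu=True,
                               tie_embeddings=False)
llama2_70b = TransformerConfig(vocab_size=32000, n_layers=80, d_model=8192,
                               n_heads=64, d_ff=28672, n_kv_heads=8, swiglu=True,
                               tie_embeddings=False)

for name, cfg in [("GPT-2 Small", gpt2_small), ("LLaMA-7B", llama_7b),
                  ("LLaMA-2 70B", llama2_70b)]:
    counts = count_parameters(cfg)
    emb_frac = counts["embedding"] / counts["total"]
    print(f"{name:12s} ~{counts['total'] / 1e9:.2f}B params "
          f"(embedding fraction: {emb_frac:.1%})")
# Roughly 0.12B, 6.74B, and 69B -- close to the published 124M, 6.7B, and 70B.
```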
Our calculations align with published model sizes. Note how the embedding fraction decreases dramatically as models scale: for GPT-2 Small, embeddings represent a significant fraction of parameters, but for LLaMA-70B, the transformer layers dominate overwhelmingly. This is why larger models are more parameter-efficient: the fixed embedding cost amortizes over more layer parameters.
The visualization confirms what the numbers suggested: as models grow, FFN parameters increasingly dominate. The FFN is the workhorse of parameter storage, containing roughly twice the parameters of attention in each layer. This explains why techniques like mixture-of-experts (MoE), which primarily target FFN layers, can dramatically increase effective capacity.
Architecture Design Principles
We've now covered the individual hyperparameters: depth, width, heads, FFN dimensions, and their variants. But knowing what each parameter does is different from knowing how to choose values. Architecture design is both an art and a science, combining theoretical understanding with empirical intuition and practical constraints.
The good news is that the field has converged on patterns that work. Successful open models like LLaMA, Mistral, and Qwen share similar proportions, and these patterns provide reliable starting points for new designs. The following principles distill this collective wisdom into actionable guidance.
Principle 1: Start with Proven Ratios
Rather than searching the vast space of possible architectures, begin with ratios that have proven successful across many models and tasks:
- $d_{\text{ff}} = 4 \, d_{\text{model}}$ for a standard FFN, or $d_{\text{ff}} \approx \tfrac{8}{3} \, d_{\text{model}}$ for SwiGLU
- $d_{\text{head}} = 64$ to $128$, which determines $n_{\text{heads}} = d_{\text{model}} / d_{\text{head}}$
- Depth/width ratio: $d_{\text{model}} / L \approx 100$ to $128$ (very rough guideline)
These suggestions provide reasonable starting points, but optimal architectures depend on hardware constraints, training budget, and intended use case. The suggestions follow the pattern of successful open models, scaling both depth and width together while maintaining consistent per-head dimensions.
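As a rough heuristic sketch of such starting points (the particular breakpoints below are illustrative, not prescriptive; real designs should be validated with the calculator above and against your hardware and training budget):

```python
def suggest_config(target_params_b: float) -> dict:
    """Very rough starting-point shapes, loosely following open-model proportions."""
    # (target size in billions, layers, d_model) -- illustrative breakpoints only
    presets = [(1, 16, 2048), (3, 26, 3072), (7, 32, 4096),
               (13, 40, 5120), (70, 80, 8192)]
    _, layers, d_model = min(presets, key=lambda p: abs(p[0] - target_params_b))
    d_head = 128
    return {
        "n_layers": layers,
        "d_model": d_model,
        "n_heads": d_model // d_head,
        "d_ff": 256 * ((int(8 * d_model / 3) + 255) // 256),  # SwiGLU sizing
    }

print(suggest_config(7))  # roughly a LLaMA-7B shape: 32 layers, 4096 wide, 32 heads
```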
Principle 2: Hardware Awareness
The best architecture on paper can be the worst in practice if it doesn't align with hardware constraints. Modern GPUs have specific requirements for efficient matrix operations, and dimensions that seem arbitrary (like 11008 instead of 11000) often reflect careful optimization for these constraints.
Efficient computation requires dimensions that align with hardware characteristics. Key considerations include:
- Tensor core alignment: NVIDIA tensor cores work most efficiently with dimensions that are multiples of 8 (for FP16) or 16 (for INT8). Multiples of 64 or 128 are even better.
- Memory alignment: Dimensions that are powers of 2 or multiples of 256 often achieve better memory bandwidth utilization.
- Parallelism: The number of heads should divide evenly across GPUs in tensor-parallel setups. For 8-GPU training, 32 or 64 heads work well.
LLaMA-7B's configuration is carefully chosen for hardware efficiency. The model dimension of 4096 divides evenly into 32 heads of 128 dimensions, and the FFN dimension of 11008 balances parameter equivalence with computational efficiency (it is close to the theoretical 10923 but rounds up to a multiple of 256).
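A small sanity-check sketch for these alignment rules (the thresholds mirror the bullet points above and are rules of thumb, not hard requirements):

```python
def hardware_check(d_model: int, n_heads: int, d_ff: int, n_gpus: int = 8) -> list:
    """Flag dimensions likely to waste tensor-core or tensor-parallel efficiency."""
    warnings = []
    if d_model % 128:
        warnings.append(f"d_model={d_model} is not a multiple of 128")
    if d_ff % 256:
        warnings.append(f"d_ff={d_ff} is not a multiple of 256")
    if n_heads % n_gpus:
        warnings.append(f"n_heads={n_heads} does not split evenly across {n_gpus} GPUs")
    return warnings or ["looks hardware-friendly"]

print(hardware_check(d_model=4096, n_heads=32, d_ff=11008))  # LLaMA-7B: OK
print(hardware_check(d_model=1000, n_heads=10, d_ff=4000))   # misaligned example
```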
Principle 3: Consider the Full Pipeline
Architecture design doesn't happen in isolation. The optimal model shape depends on how you'll train it, how long your sequences are, and where you'll deploy it. A model optimized for training efficiency might be suboptimal for inference, and vice versa. Consider the full pipeline from training to deployment.
Key interactions to consider:
- Sequence length: Longer sequences increase memory for attention (quadratic in sequence length) and KV cache. GQA becomes more valuable for long-context models.
- Batch size: Larger batches favor wider models that can better utilize parallel computation.
- Training tokens: Scaling laws suggest parameter count and training data should scale together. A 7B model typically needs 1-2 trillion tokens for optimal performance.
- Inference hardware: If deploying to specific hardware, optimize dimensions for that target rather than for training efficiency.
Summary
Transformer architecture hyperparameters determine model capacity, computational requirements, and ultimately performance. The key takeaways from this chapter are:
- Six core hyperparameters define a transformer: vocabulary size ($V$), layers ($L$), model dimension ($d_{\text{model}}$), attention heads ($n_{\text{heads}}$), head dimension ($d_{\text{head}}$), and FFN dimension ($d_{\text{ff}}$). Everything else derives from these choices.
- Depth scales linearly, width quadratically. Doubling the number of layers doubles parameters; doubling the model dimension quadruples them. This makes depth a more parameter-efficient way to add capacity, but both contribute to model expressiveness in different ways.
- Head dimensions cluster around 64-128. This empirical sweet spot balances per-head capacity with the benefits of multiple parallel attention mechanisms. The number of heads is then determined by $n_{\text{heads}} = d_{\text{model}} / d_{\text{head}}$.
- FFN expansion follows predictable ratios. Standard FFNs use $d_{\text{ff}} = 4 \, d_{\text{model}}$, while SwiGLU-based FFNs use approximately $\tfrac{8}{3} \, d_{\text{model}}$ to maintain equivalent parameter counts.
- Grouped-query attention trades parameters for memory. GQA reduces KV cache size during inference by sharing key-value projections across multiple query heads, enabling longer context windows without proportional memory increases.
- FFN dominates parameter counts. At larger scales, feed-forward networks contain the majority of parameters, making them the primary target for efficiency innovations like mixture-of-experts.
- Hardware alignment matters. Dimensions should be multiples of 8, 64, or 128 for efficient tensor core utilization. The number of heads should divide evenly across GPUs for tensor parallelism.
When designing a new architecture, start with proven configurations from similar-scale models, ensure hardware-friendly dimensions, and adjust based on your specific constraints and empirical results. The art of architecture design lies in balancing these considerations while respecting the fundamental trade-offs between capacity, compute, and memory.
Key Parameters
When designing or analyzing transformer architectures, the following parameters have the most significant impact:
- $L$ (number of layers): Controls model depth and hierarchical abstraction capability. Start with 12-24 layers for smaller models (< 1B parameters) and scale to 32-80+ for larger models. Deeper models generally improve reasoning and multi-step tasks but require more careful initialization and training stability measures.
- $d_{\text{model}}$ (model dimension): The primary driver of per-layer capacity. Should be a multiple of 128 for optimal hardware utilization. Common values range from 768 (small models) to 8192 (frontier models). Quadratic impact on parameters means width is expensive.
- $n_{\text{heads}}$ (attention heads): Determines how many parallel attention patterns the model can learn. Set based on the desired $d_{\text{head}}$: for $d_{\text{head}} = 128$, use $n_{\text{heads}} = d_{\text{model}} / 128$. More heads allow finer-grained attention but increase overhead.
- $d_{\text{head}}$ (head dimension): Target 64-128 for most applications. Smaller values allow more heads but reduce per-head expressiveness. Larger values (256+) rarely improve performance and reduce head count.
- $d_{\text{ff}}$ (FFN dimension): Set to $4 \, d_{\text{model}}$ for a standard FFN or approximately $\tfrac{8}{3} \, d_{\text{model}}$ for SwiGLU. Round to multiples of 256 for efficient GPU computation. This parameter dominates total parameter count.
- $n_{\text{kv}}$ (GQA heads): For inference-optimized models, reduce KV heads to 8-16 regardless of query head count. Reduces KV cache memory proportionally with minimal quality loss. Essential for long-context models.
- $V$ (vocabulary size): Typically 32K-100K tokens. Larger vocabularies improve compression but increase embedding memory. Choose based on target languages and tokenization strategy.