Feed-Forward Networks in Transformers: Architecture, Parameters & Efficiency

Michael Brenndoerfer · Updated June 10, 2025 · 37 min read

Learn how feed-forward networks provide nonlinearity in transformers, with 2-layer architecture, 4x dimension expansion, parameter analysis, and computational cost comparisons with attention.

Feed-Forward Networks

Self-attention lets tokens gather contextual information from across the sequence. But attention alone is limited: it computes weighted averages of value vectors, an operation that is linear in those values. After softmax normalization, each output position is a convex combination of value vectors. To learn complex functions of language, transformers need nonlinearity. The feed-forward network (FFN) provides exactly this, applying the same pointwise nonlinear transformation to every position independently.

Every transformer block contains two main components: a self-attention sublayer and a feed-forward sublayer. While attention handles inter-token communication, the FFN handles per-token computation. This division of labor is elegant: attention routes information between positions, and the FFN transforms information at each position. Together, they enable transformers to learn rich, hierarchical representations of language.

This chapter examines the FFN in detail. You'll learn its two-layer architecture, understand why the hidden dimension is expanded, see how the computation is independent across positions, and calculate the substantial parameter count that makes FFNs the largest component of most transformer models.

The Position-Wise Feed-Forward Network

To understand why transformers need the feed-forward network, consider what attention alone provides. Self-attention computes weighted averages of value vectors, where the weights come from query-key similarities. This is powerful for gathering contextual information, but the mapping from values to outputs is linear: after softmax normalization, each output is just a convex combination of the value vectors, which are themselves linear projections of the inputs. However sophisticated the attention weights, the representations are only ever scaled, mixed, and recombined linearly.

Linear functions have severe limitations. They can only rotate, scale, and translate the input space. They cannot learn the curved decision boundaries that separate "cat" from "car," or the complex feature interactions that distinguish sarcasm from sincerity. For a model to approximate arbitrary functions of language, it needs nonlinearity.
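
To see the distinction concretely, here is a small NumPy sketch (the dimensions and matrices are arbitrary, chosen purely for illustration): stacking two linear maps always collapses into a single linear map, while putting an element-wise nonlinearity between them breaks that collapse.

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 16))   # first linear map
B = rng.standard_normal((16, 8))   # second linear map
x = rng.standard_normal(8)
y = rng.standard_normal(8)

# Two stacked linear maps are just one linear map with matrix A @ B.
print(np.allclose((x @ A) @ B, x @ (A @ B)))  # True

# With an element-wise ReLU in between, additivity (a hallmark of linearity) fails:
f = lambda v: np.maximum(0, v @ A) @ B
print(np.allclose(f(x + y), f(x) + f(y)))  # almost surely False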

The feed-forward network (FFN) provides exactly this missing ingredient. It applies a nonlinear transformation to each token's representation, injecting the expressiveness that attention lacks. But there's a key design choice: instead of mixing information across positions like attention does, the FFN processes each position independently. If your sequence has 100 tokens, the FFN applies the same transformation to each of those 100 representations separately, using identical weights for all positions.

Position-Wise Transformation

A position-wise operation applies the same function to each position in a sequence independently. The function's parameters are shared across positions, but the inputs and outputs at each position don't interact with each other.

This division of labor is elegant and deliberate. Attention handles inter-position communication: it routes information between tokens, allowing the model to build representations that depend on context. The FFN handles intra-position computation: it transforms each position's representation using a learned nonlinear function. By separating these concerns, the transformer achieves both contextual awareness (from attention) and expressive power (from the FFN).

Why share the same network across all positions? Two reasons justify this choice. First, language exhibits translation invariance: the grammatical patterns that help understand "the cat" are equally useful whether those words appear at the beginning, middle, or end of a sentence. A useful transformation should work wherever it's needed. Second, parameter sharing dramatically reduces model size. Instead of learning separate networks for each of the potentially thousands of positions in a sequence, we learn one network that generalizes across all positions.
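
As a rough back-of-the-envelope check (using the original transformer's dimensions and the parameter formula derived later in this chapter; the sequence length is just an example), sharing one FFN across positions avoids an enormous blow-up:

d_model, d_ff, seq_len = 512, 2048, 1024   # original transformer sizes, example sequence length

shared = 2 * d_model * d_ff + d_ff + d_model   # one FFN, reused at every position
per_position = seq_len * shared                # hypothetical: a separate FFN for each position

print(f"Shared FFN:           {shared:,} parameters")
print(f"One FFN per position: {per_position:,} parameters")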

The Two-Layer Architecture

With this motivation in place, let's examine what the FFN actually computes. The architecture is surprisingly simple: two linear transformations with a nonlinear activation function sandwiched between them. But this simplicity is deceptive: as we'll see, this structure enables remarkably expressive transformations.

Consider a single token's representation, a vector $x$ with $d_{\text{model}}$ dimensions. The FFN transforms this vector in three stages:

  1. Expand: Project $x$ into a higher-dimensional space
  2. Transform: Apply nonlinearity to enable learning of curved boundaries
  3. Contract: Project back to the original dimension

The complete formula captures all three stages in one expression:

$$\text{FFN}(x) = \sigma(xW_1 + b_1)W_2 + b_2$$

To understand this formula, let's read it from the inside out, following the order of operations:

Stage 1: The Expansion ($xW_1 + b_1$)

The input vector $x$ is multiplied by a weight matrix $W_1$ and shifted by a bias vector $b_1$. This is a standard linear transformation, but with a twist: the output dimension is larger than the input dimension. If $x$ has 512 dimensions, the result might have 2048 dimensions. We're projecting into a higher-dimensional space where the data is easier to manipulate.

Stage 2: The Nonlinearity ($\sigma(\cdot)$)

The activation function $\sigma$ is applied element-wise to the expanded representation. Common choices include ReLU (which zeros out negative values) and GELU (a smoother alternative). This is where the magic happens: the nonlinearity allows the network to learn curved decision boundaries that would be impossible with linear transformations alone.

Stage 3: The Contraction ($(\cdot)W_2 + b_2$)

Finally, the transformed representation is projected back to the original dimension via weight matrix $W_2$ and bias $b_2$. The output has the same dimensionality as the input, which is essential for the residual connection that adds the FFN output back to its input.

Here's the complete specification of all variables:

  • $x \in \mathbb{R}^{d_{\text{model}}}$: the input vector for a single position (the token's representation from the previous layer)
  • $W_1 \in \mathbb{R}^{d_{\text{model}} \times d_{ff}}$: the first projection matrix, which expands the dimension from $d_{\text{model}}$ to $d_{ff}$
  • $b_1 \in \mathbb{R}^{d_{ff}}$: the first bias vector, added after the first linear transformation
  • $\sigma$: a nonlinear activation function (e.g., ReLU, GELU) applied element-wise to the hidden representation
  • $W_2 \in \mathbb{R}^{d_{ff} \times d_{\text{model}}}$: the second projection matrix, which contracts the dimension back from $d_{ff}$ to $d_{\text{model}}$
  • $b_2 \in \mathbb{R}^{d_{\text{model}}}$: the second bias vector, added to produce the final output
  • $d_{ff}$: the hidden (or intermediate) dimension, typically set to $4 \times d_{\text{model}}$

The expansion ratio $d_{ff} / d_{\text{model}}$ is a key hyperparameter. The original transformer used a 4x expansion: for $d_{\text{model}} = 512$, the hidden dimension was $d_{ff} = 2048$. This ratio has become a de facto standard, though modern architectures sometimes use different values (especially when combined with gated variants, as we'll see in a later chapter).

Implementation

With the formula understood, translating it to code reveals its simplicity. The entire FFN is just two matrix multiplications with a nonlinearity in between:

In[2]:
Code
import numpy as np

np.random.seed(42)


def relu(x):
    """ReLU activation function."""
    return np.maximum(0, x)


def ffn(x, W1, b1, W2, b2, activation=relu):
    """
    Position-wise feed-forward network.

    Args:
        x: Input tensor, shape (n, d_model) or (d_model,)
        W1: First layer weights, shape (d_model, d_ff)
        b1: First layer bias, shape (d_ff,)
        W2: Second layer weights, shape (d_ff, d_model)
        b2: Second layer bias, shape (d_model,)
        activation: Nonlinear activation function

    Returns:
        Output tensor, same shape as x
    """
    hidden = activation(x @ W1 + b1)
    output = hidden @ W2 + b2
    return output


# Example dimensions (from original transformer)
d_model = 512  # Model dimension
d_ff = 2048  # Hidden dimension (4x expansion)

# Initialize weights with Xavier/Glorot initialization
W1 = np.random.randn(d_model, d_ff) * np.sqrt(2.0 / (d_model + d_ff))
b1 = np.zeros(d_ff)
W2 = np.random.randn(d_ff, d_model) * np.sqrt(2.0 / (d_ff + d_model))
b2 = np.zeros(d_model)

# Process a single position
x_single = np.random.randn(d_model)
y_single = ffn(x_single, W1, b1, W2, b2)
Out[3]:
Console
Single position FFN:
  Input shape:  (512,)
  Output shape: (512,)
  Input norm:   22.2545
  Output norm:  13.1094

The FFN preserves the dimensionality of its input: a 512-dimensional vector goes in, and a 512-dimensional vector comes out. This is essential for the residual connection that adds the FFN output back to its input.
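
For context, here is a minimal sketch of how that residual connection is typically wired, reusing the variables from the code above (layer normalization, which real transformer blocks also apply around these sublayers, is omitted here):

# Residual connection around the FFN: output and input must have the same shape.
y_residual = x_single + ffn(x_single, W1, b1, W2, b2)
print(y_residual.shape)  # (512,)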

The Hidden Dimension Expansion

The most striking aspect of the FFN architecture is the dimension expansion. The first linear layer projects from $d_{\text{model}}$ to $d_{ff}$, typically with $d_{ff} = 4 \times d_{\text{model}}$. For a model with $d_{\text{model}} = 512$, this means expanding to 2048 dimensions before projecting back down.

Why expand the dimension? The answer lies in the expressiveness of the network. A fundamental result from neural network theory shows that wider hidden layers can approximate more complex functions. Consider what happens without expansion: if $d_{ff} = d_{\text{model}}$, the hidden layer has the same dimensionality as the input. While the formula $\sigma(xW_1 + b_1)W_2 + b_2$ still applies, the network has limited capacity to decompose and recombine features.

With a larger hidden dimension, the network can decompose the input into more components, transform them independently, and recombine them in complex ways. Think of it as temporarily working in a higher-dimensional space where the data is easier to manipulate, then projecting back to the original space.

In[4]:
Code
# Visualize the dimension flow through the FFN
def analyze_ffn_dimensions(d_model, expansion_factor):
    """Analyze dimensions through FFN layers."""
    d_ff = d_model * expansion_factor

    return {
        "input": d_model,
        "after_W1": d_ff,
        "expansion_ratio": d_ff / d_model,
        "after_W2": d_model,
    }


# Common configurations
configs = [
    ("GPT-2 Small", 768, 4),
    ("GPT-2 Medium", 1024, 4),
    ("GPT-2 Large", 1280, 4),
    ("GPT-3 (175B)", 12288, 4),
    ("LLaMA-7B", 4096, 2.6875),  # Uses 11008 hidden dim
]
Out[5]:
Console
FFN dimension expansion across models:

Model               d_model     d_ff    Ratio
----------------------------------------------
GPT-2 Small             768     3072     4.00x
GPT-2 Medium           1024     4096     4.00x
GPT-2 Large            1280     5120     4.00x
GPT-3 (175B)          12288    49152     4.00x
LLaMA-7B               4096    11008     2.69x

The 4x expansion factor was established in the original "Attention Is All You Need" paper and has become a standard choice. Modern models like LLaMA use slightly different ratios (around 2.7x) when using gated linear units (GLUs), which effectively increase the hidden dimension through gating.
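
As a rough sketch of what such a gated variant looks like (this anticipates a later chapter; the SwiGLU-style formulation and the helper names silu and gated_ffn below are assumptions, not taken from this article):

def silu(z):
    """SiLU/Swish activation: z * sigmoid(z)."""
    return z / (1.0 + np.exp(-z))


def gated_ffn(x, W_gate, W_up, W_down):
    """SwiGLU-style gated FFN: two expansion matrices (gate and up) instead of one.

    Because there are three weight matrices rather than two, a smaller hidden
    dimension (e.g., 11008 for d_model = 4096, about 2.7x) keeps the total
    parameter count close to a standard 4x FFN.
    """
    return (silu(x @ W_gate) * (x @ W_up)) @ W_down


# Tiny shape check with roughly the same ~2.7x ratio LLaMA uses.
rng = np.random.default_rng(0)
d_m, d_f = 8, 22
out = gated_ffn(
    rng.standard_normal(d_m),
    rng.standard_normal((d_m, d_f)),
    rng.standard_normal((d_m, d_f)),
    rng.standard_normal((d_f, d_m)),
)
print(out.shape)  # (8,)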

Let's visualize how the actual activation values change through each stage of the FFN:

Out[6]:
Visualization
Histogram of input values showing normal distribution centered at zero.
Input values are normally distributed around zero (512 dimensions).
Histogram of hidden activations after ReLU showing only positive values and many zeros.
After ReLU, many values are zero (shown as red line) and surviving values are positive only.
Histogram of output values showing distribution around zero.
Output values are again distributed around zero but with different statistics (512 dimensions).

Now let's visualize the dimension sizes at each stage with a schematic:

Out[7]:
Visualization
Diagram showing dimension sizes at each FFN layer, with expansion from 512 to 2048 then contraction back to 512.
Dimension flow through the FFN. The first layer expands from d_model to d_ff (typically 4x larger), applies nonlinearity, then the second layer contracts back to d_model. This expansion provides a larger representational space for the nonlinear transformation.

Position Independence

A critical property of the FFN is that it processes each position independently. Unlike attention, where every position can influence every other position, the FFN applies an identical transformation to each position in isolation. This has important implications for both computation and interpretation.

Let's verify this independence empirically:

In[8]:
Code
# Demonstrate position independence
seq_len = 5
X = np.random.randn(seq_len, d_model)

# Process all positions at once (batch processing)
Y_batch = ffn(X, W1, b1, W2, b2)

# Process each position individually
Y_individual = np.zeros_like(X)
for i in range(seq_len):
    Y_individual[i] = ffn(X[i], W1, b1, W2, b2)

# Check they're identical
difference = np.abs(Y_batch - Y_individual).max()
Out[9]:
Console
Position independence verification:
  Maximum difference between batch and individual processing: 4.44e-15
  Outputs are identical: True

The batch and individual processing produce identical results (up to floating-point precision). This confirms that positions don't interact within the FFN.

Why is this important? Position independence means the FFN can be computed in parallel across all positions. On a GPU, this is extremely efficient: instead of processing tokens sequentially, we process the entire sequence simultaneously. The FFN is embarrassingly parallel.
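
A quick way to feel this is to compare one vectorized call over a whole sequence against a Python loop over positions, reusing the FFN weights defined earlier. The absolute numbers depend entirely on your hardware and BLAS library, so treat this purely as a sketch:

import time

X_long = np.random.randn(2048, d_model)   # a 2048-token "sequence"

start = time.perf_counter()
_ = ffn(X_long, W1, b1, W2, b2)           # all positions in one batched matrix multiply
batched = time.perf_counter() - start

start = time.perf_counter()
for i in range(X_long.shape[0]):          # position-by-position loop
    _ = ffn(X_long[i], W1, b1, W2, b2)
looped = time.perf_counter() - start

print(f"Batched: {batched:.4f}s, looped: {looped:.4f}s")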

Position independence also clarifies the division of labor in a transformer block. Attention handles inter-position communication: it routes information between tokens, allowing the model to build representations that depend on context. The FFN handles intra-position transformation: it transforms each position's representation using the same learned function, adding nonlinearity and processing capacity to what would otherwise be a purely linear attention mechanism.

Out[10]:
Visualization
Diagram showing three input positions being processed independently through the same FFN to produce three output positions.
Position independence in the FFN. Each position is processed by the same network with shared weights, but positions don't communicate with each other. This enables parallel computation and separates the FFN's role (per-position transformation) from attention's role (inter-position communication).

Interpreting the FFN as Key-Value Memory

Recent research has revealed a fascinating interpretation of feed-forward layers: they function as associative memories, storing key-value pairs in their weights. This perspective, developed by researchers at Tel Aviv University and other institutions, helps explain how transformers store and retrieve factual knowledge.

The idea is elegant. Consider the first layer's weight matrix $W_1$. Each column of $W_1$ (which we can denote as $\mathbf{k}_i$ for the $i$-th column) can be thought of as a "key" that matches certain input patterns. The pre-activation for hidden dimension $i$ is computed as:

$$h_i = x \cdot \mathbf{k}_i + b_{1,i}$$

where:

  • $h_i$: the pre-activation value for hidden dimension $i$
  • $x$: the input vector ($d_{\text{model}}$-dimensional)
  • $\mathbf{k}_i$: the $i$-th column of $W_1$, acting as a "key" pattern
  • $b_{1,i}$: the $i$-th element of the bias vector $b_1$

When the input $x$ aligns well with key $\mathbf{k}_i$ (high dot product), the corresponding hidden dimension activates strongly. After ReLU, only positive activations survive.

The second layer's weight matrix $W_2$ contains the "values" associated with each key. Each row of $W_2$ (denoted $\mathbf{v}_i$ for the $i$-th row) is the value vector that gets added to the output when hidden dimension $i$ is active. The final output is a weighted sum of these values, where the weights are the hidden activations.

FFN as Key-Value Memory

The feed-forward layer can be interpreted as an associative memory where $W_1$ columns are keys, hidden activations are match scores, and $W_2$ rows are values. Input patterns that match certain keys retrieve their associated values.

Let's visualize this interpretation:

In[11]:
Code
# Interpret FFN as key-value memory
d_model_small = 8
d_ff_small = 16

np.random.seed(42)
W1_small = np.random.randn(d_model_small, d_ff_small) * 0.5
b1_small = np.zeros(d_ff_small)
W2_small = np.random.randn(d_ff_small, d_model_small) * 0.5
b2_small = np.zeros(d_model_small)

# Create an input that strongly activates certain hidden dimensions
x_test = np.random.randn(d_model_small)

# Compute hidden activations (before and after ReLU)
pre_activation = x_test @ W1_small + b1_small
hidden_activations = relu(pre_activation)

# See which "keys" were matched (high activation)
active_dims = hidden_activations > 0.5
Out[12]:
Console
FFN as Key-Value Memory:
  Input dimension: 8
  Number of 'keys' (hidden dim): 16

Pre-activation (dot product with keys):
  [-0.08 -3.44  1.12  0.82  1.29  1.33  0.86  1.59 -0.38 -0.71 -2.35 -0.09
 -0.89 -2.19  2.18  0.06]

After ReLU (matched keys):
  [0.   0.   1.12 0.82 1.29 1.33 0.86 1.59 0.   0.   0.   0.   0.   0.
 2.18 0.06]

Strongly activated dimensions: [ 2  3  4  5  6  7 14]

Let's visualize this sparsity pattern across multiple inputs to see how different inputs activate different subsets of hidden dimensions:

Out[13]:
Visualization
Heatmap showing hidden activations across inputs and dimensions, with many zero values demonstrating sparsity.
Hidden activation patterns for 10 different random inputs. Each row shows the activation values across 16 hidden dimensions for one input. Note the sparsity: many cells are zero (white) because ReLU zeroed out negative pre-activations. Different inputs activate different subsets of dimensions.

This memory interpretation explains several empirical observations about transformers. Researchers have found that specific neurons in FFN layers activate for particular concepts: there are neurons that activate for "the Eiffel Tower," "programming languages," or "past tense verbs." These neurons encode factual associations in their weights, and the FFN retrieves them when relevant patterns appear in the input.

Parameter Count Analysis

Feed-forward networks are the largest component of transformer models by parameter count. In a standard transformer, the FFN contains significantly more parameters than the attention mechanism. Understanding this is crucial for model scaling and efficiency optimization.

For a single FFN layer, we can count the total number of learnable parameters by summing the sizes of all weight matrices and bias vectors:

$$\text{FFN parameters} = 2 \times d_{\text{model}} \times d_{ff} + d_{ff} + d_{\text{model}}$$

where:

  • $d_{\text{model}} \times d_{ff}$: the number of parameters in weight matrix $W_1$
  • $d_{ff} \times d_{\text{model}}$: the number of parameters in weight matrix $W_2$ (the same count as $W_1$, just with transposed dimensions)
  • $d_{ff}$: the number of parameters in bias vector $b_1$
  • $d_{\text{model}}$: the number of parameters in bias vector $b_2$

The factor of 2 in front of $d_{\text{model}} \times d_{ff}$ accounts for both weight matrices $W_1$ and $W_2$.

With the standard expansion factor $d_{ff} = 4 \times d_{\text{model}}$, we can simplify this expression. Substituting:

$$\text{FFN parameters} = 2 \times d_{\text{model}} \times (4 \times d_{\text{model}}) + (4 \times d_{\text{model}}) + d_{\text{model}} = 8 \times d_{\text{model}}^2 + 5 \times d_{\text{model}}$$

For large models where $d_{\text{model}}$ is in the hundreds or thousands, the quadratic term dominates and the linear bias terms become negligible, giving approximately $8 \times d_{\text{model}}^2$ parameters per FFN layer.

In[14]:
Code
def count_ffn_params(d_model, d_ff, include_bias=True):
    """Count parameters in a feed-forward network."""
    weight_params = 2 * d_model * d_ff  # W1 and W2
    bias_params = d_ff + d_model if include_bias else 0
    return weight_params + bias_params


def count_attention_params(d_model, num_heads, include_bias=True):
    """Count parameters in multi-head attention.

    Note: num_heads doesn't change the total, since the Q, K, V, and output
    projections together are d_model x d_model regardless of how they are
    split across heads.
    """
    # Q, K, V projections and output projection
    weight_params = 4 * d_model * d_model  # W_Q, W_K, W_V, W_O
    bias_params = 4 * d_model if include_bias else 0
    return weight_params + bias_params


# Compare for different model sizes
model_configs = [
    ("GPT-2 Small", 768, 3072, 12),
    ("GPT-2 Medium", 1024, 4096, 16),
    ("BERT-Base", 768, 3072, 12),
    ("GPT-3 (175B)", 12288, 49152, 96),
]
Out[15]:
Console
Parameter count comparison: FFN vs Attention (per layer)

Model            d_model     d_ff     FFN Params    Attn Params   FFN/Attn
--------------------------------------------------------------------------------
GPT-2 Small          768     3072      4,718,592      2,359,296        2.0x
GPT-2 Medium        1024     4096      8,388,608      4,194,304        2.0x
BERT-Base            768     3072      4,718,592      2,359,296        2.0x
GPT-3 (175B)       12288    49152  1,207,959,552    603,979,776        2.0x

The FFN consistently contains about twice as many parameters as the attention mechanism per layer. For large models like GPT-3, each transformer layer has roughly 1.2 billion parameters in the FFN alone. This makes FFN optimization critical for model efficiency.

Let's visualize how FFN parameters scale with model dimension:

Out[16]:
Visualization
Line plot showing FFN parameters increasing quadratically from millions to billions as d_model increases from 256 to 12288.
FFN parameter count scales quadratically with model dimension. The blue line shows actual FFN parameters (with 4x expansion), while the dashed line shows the approximation 8 * d_model^2. The quadratic growth means doubling d_model quadruples the parameter count.

Let's also visualize the parameter distribution in a transformer block:

Out[17]:
Visualization
Pie chart showing FFN with ~67% of parameters, attention with ~33%, and layer norm with <1%.
Parameter distribution in a transformer block. The feed-forward network contains roughly two-thirds of the parameters, with the remaining third split between the attention mechanism and layer normalization.

Computational Cost

Beyond parameter count, we should consider computational cost, measured in floating-point operations (FLOPs). Understanding FFN compute requirements helps explain why these layers dominate inference time for short sequences.

For each token, the FFN performs two matrix-vector multiplications:

  1. First layer: Computing $xW_1$ (where $x$ is a $d_{\text{model}}$-dimensional vector and $W_1$ is $d_{\text{model}} \times d_{ff}$) requires $d_{\text{model}} \times d_{ff}$ multiply-add operations
  2. Second layer: Computing $hW_2$ (where $h$ is a $d_{ff}$-dimensional hidden vector and $W_2$ is $d_{ff} \times d_{\text{model}}$) requires $d_{ff} \times d_{\text{model}}$ multiply-add operations

Each multiply-add consists of one multiplication and one addition, so it counts as 2 floating-point operations. The total FLOPs per token for the FFN is:

$$\text{FLOPs}_{\text{FFN}} = 2 \times (d_{\text{model}} \times d_{ff}) + 2 \times (d_{ff} \times d_{\text{model}}) = 4 \times d_{\text{model}} \times d_{ff}$$

where:

  • $d_{\text{model}}$: the input and output dimension of the FFN
  • $d_{ff}$: the hidden dimension (typically $4 \times d_{\text{model}}$)
  • The factor of 4 comes from: 2 layers $\times$ 2 operations per multiply-add

For a sequence of $n$ tokens, each token is processed independently, so the total FFN cost scales linearly with sequence length:

$$\text{Total FLOPs}_{\text{FFN}} = 4 \times n \times d_{\text{model}} \times d_{ff}$$

where $n$ is the number of tokens in the sequence.

This linear scaling contrasts sharply with attention, which has quadratic complexity $O(n^2 \cdot d_k)$ due to the $n \times n$ attention matrix computation. As sequences grow longer, attention's quadratic term eventually dominates.

In[18]:
Code
def ffn_flops(n, d_model, d_ff):
    """Calculate FFN FLOPs for sequence length n."""
    return 4 * n * d_model * d_ff


def attention_flops(n, d_model):
    """Calculate attention FLOPs (simplified)."""
    # Q, K, V projections: 3 * 2 * n * d_model^2
    # QK^T: 2 * n^2 * d_model
    # Attention @ V: 2 * n^2 * d_model
    # Output projection: 2 * n * d_model^2
    projection_flops = 4 * 2 * n * d_model * d_model
    attention_matrix_flops = 4 * n * n * d_model
    return projection_flops + attention_matrix_flops


# Compare across sequence lengths
sequence_lengths = [128, 512, 1024, 2048, 4096, 8192]
d_model_test = 768
d_ff_test = 3072
Out[19]:
Console
FLOPs comparison: FFN vs Attention (GPT-2 Small)

  Seq Length       FFN FLOPs      Attn FLOPs     FFN/Attn
------------------------------------------------------------
         128   1,207,959,552     654,311,424         1.85x
         512   4,831,838,208   3,221,225,472         1.50x
        1024   9,663,676,416   8,053,063,680         1.20x
        2048  19,327,352,832  22,548,578,304         0.86x
        4096  38,654,705,664  70,866,960,384         0.55x
        8192  77,309,411,328 244,813,135,872         0.32x

For short sequences, the FFN dominates computational cost (due to its large weight matrices). As sequences grow longer, attention's quadratic complexity catches up. At around 1024-2048 tokens, attention and FFN have comparable costs. Beyond that, attention becomes the bottleneck.

Out[20]:
Visualization
Line plot showing FFN FLOPs as a straight line and attention FLOPs as a curve that crosses over and grows faster at longer sequences.
Computational cost (FLOPs) comparison between FFN and attention as sequence length increases. FFN cost grows linearly while attention cost grows quadratically, with the crossover point around 1000-2000 tokens for typical model sizes.

This crossover point explains why long-context models focus heavily on attention efficiency (sparse attention, linear attention, etc.) while short-context applications often emphasize FFN optimization (quantization, pruning).
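
Under the simplified FLOP formulas above (which ignore softmax, layer normalization, and other small terms), the crossover can even be solved in closed form; this sketch is only as accurate as those approximations:

def crossover_length(d_model, expansion=4):
    """Sequence length at which the simplified FFN and attention FLOP counts are equal.

    Setting 4*n*d_model*(expansion*d_model) = 8*n*d_model**2 + 4*n**2*d_model
    and solving for n gives n = (expansion - 2) * d_model.
    """
    return (expansion - 2) * d_model


print(crossover_length(768))    # 1536 tokens for GPT-2 Small (matches the table above)
print(crossover_length(12288))  # 24576 tokens at GPT-3 scale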

A Complete Worked Example

The formula $\text{FFN}(x) = \sigma(xW_1 + b_1)W_2 + b_2$ is compact, but its compactness can obscure what's actually happening. To truly understand the FFN, let's trace through a concrete computation with actual numbers. We'll use intentionally small dimensions, $d_{\text{model}} = 3$ and $d_{ff} = 4$, so you can follow every multiplication and addition by hand.

The goal of this example is threefold: (1) see exactly how the input vector is transformed at each stage, (2) observe how ReLU zeros out negative activations to introduce nonlinearity, and (3) verify that the output dimension matches the input dimension, ready for the residual connection.

In[21]:
Code
# Small example for hand-traceable computation
d_model_tiny = 3
d_ff_tiny = 4

np.random.seed(123)

# Initialize weights with simple values
W1_tiny = np.array(
    [
        [0.5, -0.3, 0.8, 0.2],
        [-0.2, 0.6, 0.1, -0.4],
        [0.3, 0.1, -0.5, 0.7],
    ]
)  # Shape: (3, 4)

b1_tiny = np.array([0.1, -0.1, 0.2, 0.0])  # Shape: (4,)

W2_tiny = np.array(
    [
        [0.4, -0.2, 0.3],
        [0.1, 0.5, -0.1],
        [-0.3, 0.2, 0.4],
        [0.2, -0.4, 0.1],
    ]
)  # Shape: (4, 3)

b2_tiny = np.array([0.05, -0.05, 0.1])  # Shape: (3,)

# Input vector
x_tiny = np.array([1.0, -0.5, 0.8])
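
The console traces in the steps that follow come from cells not shown here; a minimal reconstruction that reproduces those numbers would look roughly like this (rounding chosen to match the printed output):

# Reproduce the step-by-step trace shown below.
pre_activation_tiny = x_tiny @ W1_tiny + b1_tiny   # Stage 1: expand from 3 to 4 dims
hidden_tiny = relu(pre_activation_tiny)            # Stage 2: ReLU zeroes the negatives
output_tiny = hidden_tiny @ W2_tiny + b2_tiny      # Stage 3: contract back to 3 dims

print("x @ W1 + b1 =", np.round(pre_activation_tiny, 2))   # [ 0.94 -0.62  0.55  0.96]
print("After ReLU:  ", np.round(hidden_tiny, 2))           # [0.94 0.   0.55 0.96]
print("Output:      ", np.round(output_tiny, 3))           # [ 0.453 -0.512  0.698]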

Stage 1: Expansion via the First Linear Layer

The first operation computes $xW_1 + b_1$. Our 3-dimensional input vector gets multiplied by a $3 \times 4$ weight matrix, producing a 4-dimensional hidden representation. Each element of this output is a weighted sum of the input elements plus a bias term.

Out[22]:
Console
Step 1: First linear transformation (x @ W1 + b1)

Input x: [ 1.  -0.5  0.8]

W1:
[[ 0.5 -0.3  0.8  0.2]
 [-0.2  0.6  0.1 -0.4]
 [ 0.3  0.1 -0.5  0.7]]

b1: [ 0.1 -0.1  0.2  0. ]

x @ W1 + b1 = [ 0.94 -0.62  0.55  0.96]

Notice the dimension change: a 3-dimensional vector goes in, a 4-dimensional vector comes out. This expansion is the "higher-dimensional space" we discussed earlier, where the network has more room to manipulate the representation. The result is called the pre-activation because we haven't applied the nonlinearity yet.

Stage 2: Nonlinearity via ReLU

Here's where the FFN gains its expressive power. The ReLU activation function applies a simple rule: keep positive values unchanged, but set negative values to zero. Mathematically, $\text{ReLU}(z) = \max(0, z)$.

Out[23]:
Console

Step 2: Apply ReLU activation

Pre-activation: [ 0.94 -0.62  0.55  0.96]
After ReLU:     [0.94 0.   0.55 0.96]

Negative values become zero, positive values pass through unchanged.

This seemingly simple operation has profound consequences. By zeroing out some dimensions, ReLU creates sparse hidden representations, meaning only a subset of hidden dimensions are active for any given input. Different inputs activate different subsets, allowing the network to learn piece-wise linear functions that approximate arbitrary curves. This is the source of the FFN's nonlinear expressive power.
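
A one-dimensional toy makes this concrete: with ReLU, even two hidden units compute a genuinely nonlinear (piecewise-linear) function, here the absolute value, which no purely linear layer could produce.

# f(x) = relu(x) + relu(-x) = |x|: a piecewise-linear function built from two hidden units.
W1_toy = np.array([[1.0, -1.0]])    # expand 1 -> 2
W2_toy = np.array([[1.0], [1.0]])   # contract 2 -> 1

xs = np.linspace(-2.0, 2.0, 9).reshape(-1, 1)
ys = relu(xs @ W1_toy) @ W2_toy
print(np.hstack([xs, ys]))          # second column equals |x|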

Out[24]:
Visualization
Bar chart showing pre-activation values with some negative (red) and some positive (blue).
Pre-activation values before ReLU. Negative values (red) will be zeroed out.
Bar chart showing values after ReLU with negative values now at zero.
After ReLU: negative values become zero (red bars at 0), positive values (blue) pass through unchanged.

Stage 3: Contraction via the Second Linear Layer

Finally, we project back to the original dimension. The 4-dimensional hidden vector is multiplied by a $4 \times 3$ weight matrix, and a 3-dimensional bias is added.

Out[25]:
Console

Step 3: Second linear transformation (hidden @ W2 + b2)

Hidden: [0.94 0.   0.55 0.96]

W2:
[[ 0.4 -0.2  0.3]
 [ 0.1  0.5 -0.1]
 [-0.3  0.2  0.4]
 [ 0.2 -0.4  0.1]]

b2: [ 0.05 -0.05  0.1 ]

hidden @ W2 + b2 = [ 0.453 -0.512  0.698]

The output has the same dimension as the input, exactly 3 elements. This is essential: in the full transformer, this output will be added to the original input via a residual connection, and both must have the same shape.

Verification and Summary

Let's verify that our step-by-step calculation matches the complete FFN function, and summarize what we've learned:

Out[26]:
Console

Verification using ffn() function:
  Output: [ 0.453 -0.512  0.698]
  Match: True

Summary:
  Input:  [ 1.  -0.5  0.8] (dimension 3)
  Output: [ 0.453 -0.512  0.698] (dimension 3)

The step-by-step and function-based computations match exactly. This worked example demonstrates the complete journey: a 3-dimensional input expands to 4 dimensions, passes through nonlinearity (with some dimensions zeroed out), and contracts back to 3 dimensions. The FFN has transformed the input representation while preserving its dimensionality, exactly as the formula $\text{FFN}(x) = \sigma(xW_1 + b_1)W_2 + b_2$ describes.

Implementation: A Complete FFN Module

Having traced through the mathematics by hand, we can now build a reusable FFN module that encapsulates everything we've learned. This implementation follows patterns used in production transformer libraries: it initializes weights using proper scaling (Xavier/Glorot initialization), supports optional biases, and handles both single vectors and batched sequences.

In[27]:
Code
class FeedForwardNetwork:
    """
    Position-wise feed-forward network for transformer blocks.

    Implements: FFN(x) = activation(x @ W1 + b1) @ W2 + b2
    """

    def __init__(self, d_model, d_ff, activation=relu, use_bias=True):
        """
        Initialize the feed-forward network.

        Args:
            d_model: Input and output dimension
            d_ff: Hidden dimension (typically 4 * d_model)
            activation: Nonlinear activation function
            use_bias: Whether to include bias terms
        """
        self.d_model = d_model
        self.d_ff = d_ff
        self.activation = activation
        self.use_bias = use_bias

        # Xavier/Glorot initialization
        self.W1 = np.random.randn(d_model, d_ff) * np.sqrt(
            2.0 / (d_model + d_ff)
        )
        self.W2 = np.random.randn(d_ff, d_model) * np.sqrt(
            2.0 / (d_ff + d_model)
        )

        if use_bias:
            self.b1 = np.zeros(d_ff)
            self.b2 = np.zeros(d_model)
        else:
            self.b1 = None
            self.b2 = None

    def __call__(self, x):
        """
        Apply the feed-forward transformation.

        Args:
            x: Input tensor of shape (..., d_model)

        Returns:
            Output tensor of shape (..., d_model)
        """
        # First linear layer
        hidden = x @ self.W1
        if self.use_bias:
            hidden = hidden + self.b1

        # Activation
        hidden = self.activation(hidden)

        # Second linear layer
        output = hidden @ self.W2
        if self.use_bias:
            output = output + self.b2

        return output

    def num_parameters(self):
        """Return total parameter count."""
        params = self.d_model * self.d_ff + self.d_ff * self.d_model
        if self.use_bias:
            params += self.d_ff + self.d_model
        return params
In[28]:
Code
# Test the module
np.random.seed(42)
ffn_module = FeedForwardNetwork(d_model=512, d_ff=2048)

# Single vector
x_single_test = np.random.randn(512)
y_single_test = ffn_module(x_single_test)

# Batch of vectors (sequence)
x_batch_test = np.random.randn(16, 512)  # 16 tokens
y_batch_test = ffn_module(x_batch_test)
Out[29]:
Console
FeedForwardNetwork module test:

Configuration:
  d_model: 512
  d_ff: 2048
  Parameters: 2,099,712

Single vector:
  Input shape: (512,)
  Output shape: (512,)

Batch (sequence):
  Input shape: (16, 512)
  Output shape: (16, 512)

Limitations and Impact

The feed-forward network is conceptually simple: two linear layers with a nonlinearity in between. Yet this simplicity masks significant computational cost. The FFN accounts for roughly two-thirds of a transformer block's parameters and dominates compute for short sequences. This has driven extensive research into FFN efficiency.

Sparsity offers one path forward. The ReLU activation naturally creates sparse hidden representations, as negative pre-activations become zero. Researchers have exploited this by identifying which hidden dimensions will be active for a given input and computing only those, skipping computation for dimensions that would be zeroed anyway. The Mixture of Experts (MoE) architecture takes this further, replacing the single FFN with multiple "expert" FFNs and routing each token to only a subset of experts. This allows models to scale parameters without proportionally scaling compute.

Quantization provides another avenue for efficiency. Since FFN weights are static after training, they can be compressed to lower precision (8-bit, 4-bit, or even 2-bit integers) with careful calibration. This reduces memory bandwidth, often the bottleneck in FFN computation, and enables faster inference.
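
To make the idea tangible, here is a toy symmetric int8 weight quantization of the FFN weights from earlier in the chapter. Real schemes use per-channel scales, calibration data, and optimized integer kernels; this sketch only illustrates the principle.

def quantize_int8(W):
    """Toy symmetric per-tensor quantization: int8 weights plus one float scale."""
    scale = np.abs(W).max() / 127.0
    return np.round(W / scale).astype(np.int8), scale


W1_q, s1 = quantize_int8(W1)
W2_q, s2 = quantize_int8(W2)

# Dequantize and compare against the full-precision FFN output.
y_fp = ffn(x_single, W1, b1, W2, b2)
y_q = ffn(x_single, W1_q.astype(np.float64) * s1, b1, W2_q.astype(np.float64) * s2, b2)
print("Relative error from int8 weights:", np.linalg.norm(y_fp - y_q) / np.linalg.norm(y_fp))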

Despite its computational weight, the FFN's role is essential. It provides the nonlinearity that transforms attention's linear weighted averages into expressive function approximation. It serves as the model's "memory," storing factual associations in its weights that attention retrieves based on context. And its position-wise nature enables massive parallelization that makes transformer training tractable.

The interplay between attention and FFN defines transformer expressiveness. Attention routes information between positions, creating context-dependent representations. The FFN transforms those representations position-by-position, adding computational depth. Together, they enable the hierarchical, compositional language understanding that powers modern NLP.

Summary

The feed-forward network is the workhorse of transformer computation, applying identical transformations to each position independently. This chapter explored its architecture, efficiency characteristics, and role in the broader transformer block.

Key takeaways:

  • Two-layer architecture: The FFN formula $\text{FFN}(x) = \sigma(xW_1 + b_1)W_2 + b_2$ consists of an expansion layer ($d_{\text{model}} \to d_{ff}$), a nonlinear activation $\sigma$ (ReLU, GELU, etc.), and a contraction layer ($d_{ff} \to d_{\text{model}}$). The standard expansion factor is $d_{ff} = 4 \times d_{\text{model}}$.

  • Position independence: Each position is processed separately with shared weights. This enables parallel computation and separates the FFN's role (per-position transformation) from attention's role (inter-position communication).

  • Key-value memory interpretation: The FFN can be viewed as an associative memory where $W_1$ columns are keys, hidden activations are match scores, and $W_2$ rows are values. This perspective helps explain how transformers store factual knowledge.

  • Dominant parameter count: The FFN contains approximately two-thirds of a transformer block's parameters. With 4x expansion, the total is approximately $8 \times d_{\text{model}}^2$ parameters per layer (from the formula $2 \times d_{\text{model}} \times d_{ff} + d_{ff} + d_{\text{model}}$). This makes FFN optimization critical for model efficiency.

  • Linear computational scaling: FFN compute cost is $4 \times n \times d_{\text{model}} \times d_{ff}$ FLOPs, scaling linearly with sequence length $n$. This contrasts with attention's quadratic $O(n^2)$ scaling. For short sequences, FFN dominates compute; for long sequences, attention becomes the bottleneck.

  • Crossover point: Around 1000-2000 tokens, FFN and attention have comparable computational costs. This crossover influences optimization strategies for different sequence lengths.

The next chapter examines activation functions used in FFNs, comparing ReLU, GELU, SiLU/Swish, and understanding why modern models have moved beyond the original ReLU choice.

Key Parameters

When implementing or configuring feed-forward networks in transformers, these parameters control capacity and efficiency:

  • d_model (input/output dimension): The embedding dimension that the FFN preserves. Typical values range from 256 (small models) to 12288 (GPT-3 scale). This dimension must match the attention layer output and determines the FFN's interface with the rest of the transformer.

  • d_ff (hidden dimension): The expanded dimension of the intermediate representation. The standard choice is $d_{ff} = 4 \times d_{\text{model}}$, though modern architectures like LLaMA use ratios around 2.7x when combined with gated linear units. Larger values increase expressiveness but proportionally increase parameters and compute.

  • activation: The nonlinear function applied element-wise after the first linear layer. ReLU was used in the original transformer; GELU later became standard in models such as BERT, RoBERTa, and GPT-2, while SiLU/Swish (usually inside gated variants) is common in recent models like LLaMA. The choice affects gradient flow and sparsity patterns.

  • use_bias: Whether to include bias terms $b_1$ and $b_2$. Some modern architectures (LLaMA, PaLM) omit biases entirely to reduce parameters and simplify quantization. The impact on model quality is typically minimal.

  • dropout (not shown in our implementation): Dropout rate applied to the hidden representation after activation. Values of 0.1-0.2 are common during training to prevent overfitting. Set to 0 during inference; a minimal sketch appears after this list.

  • initialization: Weight initialization scale affects training stability. Xavier/Glorot initialization (scaling by $\sqrt{2/(d_{in} + d_{out})}$) is standard. Some architectures use scaled initialization for residual paths to maintain signal magnitude through deep networks.
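
Since dropout is mentioned above but not included in the FeedForwardNetwork class, here is a minimal sketch of where it would go (inverted dropout on the hidden activations; the 0.1 rate is just an example):

def ffn_with_dropout(x, W1, b1, W2, b2, dropout_rate=0.1, training=True):
    """FFN with (inverted) dropout applied to the hidden activations during training."""
    hidden = relu(x @ W1 + b1)
    if training and dropout_rate > 0:
        keep = np.random.rand(*hidden.shape) >= dropout_rate
        hidden = hidden * keep / (1.0 - dropout_rate)   # rescale so the expected value is unchanged
    return hidden @ W2 + b2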
