FFN Activation Functions: ReLU, GELU, and SiLU for Transformer Models

Michael Brenndoerfer · Updated June 14, 2025 · 36 min read

Compare activation functions in transformer feed-forward networks: ReLU's simplicity and dead neuron problem, GELU's smooth probabilistic gating for BERT, and SiLU/Swish for modern LLMs like LLaMA.

FFN Activation Functions

The feed-forward network's power comes from its nonlinear activation function. Without nonlinearity, stacking multiple linear layers would collapse into a single linear transformation, no matter how deep the network. The activation function is what enables FFNs to approximate arbitrarily complex functions of language.
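To make the collapse concrete, here is a quick sketch (with arbitrary small dimensions, not tied to any real model) showing that two stacked linear layers without an activation are exactly equivalent to a single linear layer with a merged weight matrix:

Code
import numpy as np

rng = np.random.default_rng(0)

# Two stacked linear layers with no activation in between
W1 = rng.normal(size=(8, 16))   # first projection
W2 = rng.normal(size=(16, 8))   # second projection
x = rng.normal(size=(4, 8))     # a small batch of inputs

two_layers = (x @ W1) @ W2      # "deep" linear stack
one_layer = x @ (W1 @ W2)       # single linear layer with merged weights

print(np.allclose(two_layers, one_layer))  # True: the stack collapses to one linear map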

The original transformer used ReLU, the workhorse of deep learning since 2012. But as researchers scaled transformers to billions of parameters and trained them on trillions of tokens, subtle differences between activation functions became significant. GELU emerged as the standard for encoder models like BERT. SiLU (also called Swish) now dominates decoder models like LLaMA and GPT-NeoX. Understanding why these activations differ, and when each excels, is essential for building modern language models.

This chapter traces the evolution from ReLU to GELU to SiLU/Swish, examining the mathematical properties that make each function suitable for different contexts. You'll implement each activation, visualize their differences, and understand the practical trade-offs that guide model design choices.

ReLU: The Original Choice

The Rectified Linear Unit (ReLU) is the simplest nonlinear activation function in modern deep learning. It applies a trivial rule: keep positive values unchanged, set negative values to zero. Mathematically:

\text{ReLU}(x) = \max(0, x)

where:

  • x: the input value (a single element of the pre-activation vector)
  • max(0, x): returns x if x > 0, otherwise returns 0

This simplicity is deceptive. ReLU revolutionized deep learning when introduced in AlexNet (2012), enabling training of much deeper networks than the sigmoid and tanh activations that preceded it. The original transformer adopted ReLU for its feed-forward layers, inheriting this proven workhorse.

Rectified Linear Unit (ReLU)

A piecewise linear activation function that outputs the input directly if positive, otherwise outputs zero. Its simplicity enables fast computation and its non-saturating positive region prevents vanishing gradients during backpropagation.

Let's implement ReLU and examine its behavior:

In[2]:
Code
import numpy as np

np.random.seed(42)


def relu(x):
    """Rectified Linear Unit activation."""
    return np.maximum(0, x)


def relu_derivative(x):
    """Derivative of ReLU."""
    return (x > 0).astype(float)


# Generate input range
x = np.linspace(-3, 3, 1000)
Out[3]:
Visualization
Plot showing ReLU function as a bent line at the origin.
ReLU activation function. The function is zero for negative inputs and linear for positive inputs.
Plot showing ReLU derivative as a step function.
ReLU derivative. The derivative is a step function: 0 for negative inputs, 1 for positive inputs. The sharp corner at zero is both ReLU's strength (computational simplicity) and weakness (non-smooth gradients).

ReLU's advantages are clear. Its computation is trivial: a single comparison operation. The derivative is equally simple:

\text{ReLU}'(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x \leq 0 \end{cases}

where:

  • The derivative is 1 for positive inputs, meaning gradients flow through unchanged
  • The derivative is 0 for negative inputs, meaning gradients are blocked entirely
  • The derivative is technically undefined at exactly x = 0, but implementations typically use 0 or 1

Unlike sigmoid, whose gradient shrinks toward zero for large inputs, ReLU's gradient never vanishes for positive inputs. ReLU also creates sparse activations: roughly half of all hidden units output zero for typical input distributions, which can improve efficiency and interpretability.
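A quick numeric check of that sparsity claim, using the relu function defined above on synthetic standard-normal pre-activations:

Code
# Zero-mean inputs land below zero about half the time,
# so ReLU zeroes out roughly half of the hidden units.
sample = np.random.randn(100_000)
zero_fraction = (relu(sample) == 0).mean()
print(f"Fraction of exactly-zero ReLU outputs: {zero_fraction:.3f}")  # ~0.5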

The Dead ReLU Problem

But ReLU has a critical weakness. When a neuron's pre-activation is consistently negative, its gradient is always zero. The neuron stops learning entirely. This "dying ReLU" problem becomes more severe as networks deepen or learning rates increase.

Consider what happens during training. If a weight update pushes a neuron's bias too negative, that neuron's output becomes zero for all inputs. With zero output, the gradient flowing through that neuron is zero. With zero gradient, the weights never update. The neuron is dead.

In[4]:
Code
# Simulate dying ReLU phenomenon
def simulate_dead_neurons(d_ff, n_samples, bias_shift=-2.0):
    """
    Simulate FFN hidden layer activations with shifted bias.

    Returns the fraction of neurons that are 'dead' (always zero).
    """
    # Random input activations (post-attention representations)
    inputs = np.random.randn(n_samples, d_ff)

    # Simulate pre-activations with negative bias shift
    pre_activations = inputs + bias_shift

    # Apply ReLU
    activations = relu(pre_activations)

    # Count neurons that are zero for all samples
    neuron_max = activations.max(axis=0)
    dead_fraction = (neuron_max == 0).mean()

    return dead_fraction, activations


# Test different bias shifts
bias_shifts = np.linspace(0, -3, 7)
dead_fractions = []

for shift in bias_shifts:
    dead_frac, _ = simulate_dead_neurons(1000, 500, shift)
    dead_fractions.append(dead_frac)
Out[5]:
Visualization
Line plot showing dead neuron fraction increasing from 0% at bias shift 0 to nearly 100% at bias shift -3.
Dead neuron fraction as a function of bias shift. As biases become more negative, an increasing fraction of neurons output zero for all inputs, effectively disconnecting from the network. At a bias shift of -3.0, nearly all neurons are dead.

The plot shows how quickly neurons die as bias values shift negative. In real training, this can happen in patches of the network, reducing effective capacity without obvious symptoms. The model trains, but parts of it have effectively disconnected from the computation.
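To see why a dead neuron can never recover, here is a minimal sketch of the gradient it would receive, using the relu_derivative helper from above (the shapes and bias value are made up for illustration):

Code
# One hidden neuron: h = ReLU(x @ w + b), with the bias pushed far negative
x_batch = np.random.randn(256, 16)      # batch of inputs
w = np.random.randn(16) * 0.1
b = -10.0                               # bias shifted far negative

pre = x_batch @ w + b                   # every pre-activation is negative
upstream = np.random.randn(256)         # gradient arriving from the next layer

# Chain rule: dL/dw = x^T (upstream * ReLU'(pre))
grad_w = x_batch.T @ (upstream * relu_derivative(pre))

print("largest pre-activation:", pre.max())        # still below zero
print("gradient norm:", np.linalg.norm(grad_w))    # 0.0 -> the weights never move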

GELU: The Smooth Alternative

Hendrycks and Gimpel introduced the Gaussian Error Linear Unit (GELU) in 2016, though it didn't gain widespread adoption until BERT popularized it in 2018. GELU addresses ReLU's sharp corner by introducing smooth, probabilistic gating.

The intuition behind GELU is elegant: instead of deterministically zeroing negative values, gate each input by its probability of being positive under a standard normal distribution. Inputs that are clearly positive (large positive values) pass through nearly unchanged. Inputs that are clearly negative (large negative values) are nearly zeroed. Inputs near zero, where classification is uncertain, are partially attenuated.

The mathematical formulation follows this intuition:

\text{GELU}(x) = x \cdot \Phi(x)

where:

  • x: the input value to the activation function
  • Φ(x): the cumulative distribution function (CDF) of the standard normal distribution, representing the probability that a standard normal random variable is less than or equal to x

The CDF itself is computed using the error function:

\Phi(x) = P(X \leq x) = \frac{1}{2}\left[1 + \text{erf}\left(\frac{x}{\sqrt{2}}\right)\right]

where:

  • P(X ≤ x): the probability that a standard normal random variable X takes a value less than or equal to x
  • erf(·): the Gauss error function, a special function that arises in probability and statistics
  • √2: a scaling factor that converts from the standard normal to the error function's parameterization

The CDF Φ(x) ranges from 0 to 1. For large positive x, Φ(x) ≈ 1, so GELU(x) ≈ x. For large negative x, Φ(x) ≈ 0, so GELU(x) ≈ 0. The transition is smooth, governed by the bell curve of the normal distribution.

Gaussian Error Linear Unit (GELU)

An activation function that gates inputs by their Gaussian CDF values, creating a smooth, non-monotonic function that approximates stochastic regularization. GELU is the standard activation for encoder-style transformers like BERT and RoBERTa.

In[6]:
Code
from scipy.special import erf


def gelu_exact(x):
    """GELU activation using exact computation."""
    return x * 0.5 * (1 + erf(x / np.sqrt(2)))


def gelu_approximate(x):
    """GELU approximation using tanh (faster, commonly used)."""
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))


def gelu_derivative(x):
    """Derivative of GELU (exact)."""
    phi = 0.5 * (1 + erf(x / np.sqrt(2)))
    pdf = np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)
    return phi + x * pdf

GELU has a distinctive shape: it's nearly linear for positive values, transitions smoothly through zero, and dips slightly into negative territory before approaching zero asymptotically. For moderately negative inputs, GELU outputs slightly negative values, with the deepest point of the dip around x ≈ -0.75. This non-monotonic behavior distinguishes it from ReLU variants.

The derivative of GELU, which determines gradient flow during backpropagation, is:

\text{GELU}'(x) = \Phi(x) + x \cdot \phi(x)

where:

  • Φ(x): the standard normal CDF (as defined above)
  • φ(x) = e^{-x²/2} / √(2π): the standard normal probability density function (PDF)
  • The first term Φ(x) contributes the "identity-like" gradient for positive inputs
  • The second term x · φ(x) creates a smooth transition region and ensures the derivative is continuous everywhere
Out[7]:
Visualization
Plot showing GELU as a smooth S-shaped curve with slight negative dip.
GELU activation function. Unlike ReLU, GELU is smooth everywhere and non-monotonic, with a slight negative dip around x = -0.75.
Plot showing GELU derivative as a smooth sigmoid-like curve.
GELU derivative. The derivative transitions smoothly from 0 to 1, avoiding ReLU's discontinuity at the origin.

GELU Approximations

Computing the exact error function is expensive. In practice, transformers use fast approximations. The most common is the tanh approximation:

\text{GELU}_{\text{approx}}(x) = 0.5 \cdot x \cdot \left(1 + \tanh\left(\sqrt{\frac{2}{\pi}}\left(x + 0.044715 x^3\right)\right)\right)

where:

  • x: the input value
  • tanh(·): the hyperbolic tangent function, which outputs values in the range (-1, 1)
  • √(2/π) ≈ 0.7979: a scaling constant derived from properties of the normal distribution
  • 0.044715: an empirically fitted coefficient that improves the approximation accuracy
  • x + 0.044715x³: a polynomial that approximates the argument to the error function

This formula looks complex, but it avoids the expensive erf computation while matching GELU closely. The constants were determined by fitting to the exact GELU curve, minimizing the maximum error across typical input ranges.

Let's compare the exact and approximate versions:

In[8]:
Code
# Compare exact and approximate GELU
x_test = np.linspace(-4, 4, 1000)
gelu_exact_vals = gelu_exact(x_test)
gelu_approx_vals = gelu_approximate(x_test)

max_error = np.max(np.abs(gelu_exact_vals - gelu_approx_vals))
mean_error = np.mean(np.abs(gelu_exact_vals - gelu_approx_vals))
Out[9]:
Console
GELU approximation accuracy:
  Maximum absolute error: 0.000473
  Mean absolute error:    0.000196

The approximation error is negligible for practical purposes. Most deep learning frameworks offer both versions, with the tanh approximation being faster on hardware that lacks specialized erf instructions.

Out[10]:
Visualization
Plot showing exact and approximate GELU curves overlapping almost perfectly, with a small inset showing the magnified error.
Comparison of exact GELU and its tanh approximation. The two curves are visually indistinguishable. The approximation error (shown magnified in the inset) peaks around x = 1 but remains below 0.005 across the entire range.
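For reference, deep learning frameworks expose both variants directly. The sketch below uses PyTorch (assuming a reasonably recent version, where nn.GELU accepts an approximate argument); the NumPy implementations above mirror these modules:

Code
import torch
import torch.nn as nn

x = torch.linspace(-3, 3, 7)

gelu_exact_mod = nn.GELU()                    # erf-based GELU
gelu_tanh_mod = nn.GELU(approximate="tanh")   # tanh approximation
silu_mod = nn.SiLU()                          # SiLU/Swish
relu_mod = nn.ReLU()

print(gelu_exact_mod(x))
print(gelu_tanh_mod(x))
print(silu_mod(x))
print(relu_mod(x))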

SiLU/Swish: The Modern Standard

While GELU became the activation of choice for encoder models like BERT, a different function emerged for large decoder models: SiLU, also known as Swish. Google researchers introduced Swish in 2017 through neural architecture search, discovering that this simple formula consistently outperformed ReLU across tasks.

\text{SiLU}(x) = x \cdot \sigma(x) = \frac{x}{1 + e^{-x}}

where:

  • x: the input value
  • σ(x): the logistic sigmoid function (defined below)
  • e^{-x}: the exponential function with negative argument, which rapidly approaches 0 for large positive x and infinity for large negative x

The sigmoid function that gates the input is:

\sigma(x) = \frac{1}{1 + e^{-x}}

where:

  • σ(x): outputs a value in the range (0, 1), interpretable as a "gate" or probability
  • For large positive x: σ(x) → 1, so SiLU(x) → x (input passes through)
  • For large negative x: σ(x) → 0, so SiLU(x) → 0 (input is suppressed)
  • At x = 0: σ(0) = 0.5, so SiLU(0) = 0

The name "SiLU" stands for Sigmoid Linear Unit, emphasizing its structure: the input multiplied by its sigmoid. "Swish" was Google's original branding. The two names refer to the same function, though the literature isn't always consistent.

SiLU / Swish

An activation function defined as SiLU(x) = x · σ(x), where σ is the sigmoid function. SiLU is smooth, non-monotonic, and unbounded above. It has become the standard activation for decoder-style transformers like LLaMA, Mistral, and GPT-NeoX.

In[11]:
Code
def sigmoid(x):
    """Logistic sigmoid function."""
    return 1 / (1 + np.exp(-np.clip(x, -500, 500)))


def silu(x):
    """SiLU/Swish activation."""
    return x * sigmoid(x)


def silu_derivative(x):
    """Derivative of SiLU."""
    sig = sigmoid(x)
    return sig + x * sig * (1 - sig)

SiLU shares GELU's smooth, non-monotonic character. Both have a slight negative dip, both transition smoothly through zero, and both asymptotically approach the identity for large positive values. But the functions differ subtly: SiLU's dip is slightly deeper (minimum around -0.28, compared to GELU's -0.17), and its transition through zero is slightly more gradual.
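A quick numeric check of both dips, using the gelu_exact and silu implementations above on a dense grid:

Code
grid = np.linspace(-5, 5, 100_001)

for name, fn in [("GELU", gelu_exact), ("SiLU", silu)]:
    vals = fn(grid)
    i = np.argmin(vals)
    print(f"{name}: minimum {vals[i]:.3f} at x = {grid[i]:.2f}")
# GELU: minimum -0.170 at x = -0.75
# SiLU: minimum -0.278 at x = -1.28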

The derivative of SiLU, derived using the product rule, is:

\text{SiLU}'(x) = \sigma(x) + x \cdot \sigma(x) \cdot (1 - \sigma(x))

where:

  • σ(x): the sigmoid function
  • σ(x)(1 - σ(x)): the derivative of the sigmoid, which has a bell-shaped curve centered at x = 0
  • The first term σ(x) approaches 1 for large positive x, giving gradient 1
  • The second term x · σ(x)(1 - σ(x)) adds a positive contribution peaking near x ≈ 1.5, allowing SiLU's derivative to exceed 1
Out[12]:
Visualization
Plot showing SiLU as a smooth curve with negative dip.
SiLU/Swish activation function. Like GELU, SiLU is smooth and non-monotonic with a negative dip. The dip is slightly deeper than GELU's, reaching approximately -0.28 at x ≈ -1.28.
Plot showing SiLU derivative as a smooth curve peaking above 1.
SiLU derivative. The derivative peaks above 1.0, meaning it can amplify gradients for certain input values.

Why Decoder Models Prefer SiLU

The preference for SiLU in decoder models like LLaMA, Mistral, and Falcon isn't fully understood theoretically, but several factors contribute:

  1. Simpler computation: SiLU requires only sigmoid and multiplication, avoiding the error function or its approximations. On modern hardware with fast sigmoid implementations, this can be faster than GELU.

  2. Slightly stronger gradients: SiLU's derivative can exceed 1 for moderately large positive inputs, peaking at roughly 1.1 near x ≈ 2.4 (the short check after this list confirms these numbers), which may help gradient flow in very deep networks.

  3. Empirical performance: In large-scale experiments at the scale of modern LLMs, SiLU consistently matches or slightly outperforms GELU for autoregressive language modeling. The differences are small but measurable.

  4. Pairing with gated linear units: Modern architectures like LLaMA use SiLU specifically within gated linear unit (GLU) variants, where the activation's properties interact with the gating mechanism. We'll explore this in the next chapter.
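The numbers in point 2 can be verified with a short check against the silu_derivative helper defined above:

Code
grid = np.linspace(-5, 10, 200_001)
d_vals = silu_derivative(grid)
peak = np.argmax(d_vals)

print(f"Peak derivative: {d_vals[peak]:.3f} at x = {grid[peak]:.2f}")
print(f"Derivative exceeds 1 for x > {grid[d_vals > 1].min():.2f}")
# Peak derivative: 1.100 at x = 2.40
# Derivative exceeds 1 for x > 1.28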

Activation Function Comparison

With all three activations implemented, let's visualize them together to understand their relationships and differences:

Out[13]:
Visualization
Plot showing three activation functions overlaid with ReLU as a bent line and GELU and SiLU as smooth curves with negative dips.
Comparison of ReLU, GELU, and SiLU activation functions. All three share the property of being approximately linear for large positive inputs and approximately zero for large negative inputs. The key differences are in the transition region near zero: ReLU has a sharp corner, while GELU and SiLU are smooth with slight negative dips.

The three functions converge for large positive inputs, approaching the identity function. They also converge for large negative inputs, approaching zero. The critical differences are in the transition region around zero, where:

  • ReLU has a sharp corner with discontinuous derivative
  • GELU has a smooth transition with a slight negative dip (minimum ≈ -0.17 at x ≈ -0.75)
  • SiLU has a smooth transition with a deeper negative dip (minimum ≈ -0.28 at x ≈ -1.28)

Let's also compare the derivatives, which directly affect gradient flow during training:

Out[14]:
Visualization
Plot showing three derivative curves with ReLU as a step function and GELU and SiLU as smooth sigmoid-like curves.
Derivatives of ReLU, GELU, and SiLU. ReLU's derivative is a step function with a discontinuity at zero. GELU and SiLU have smooth derivatives that transition from 0 to 1. Notably, SiLU's derivative can exceed 1, potentially amplifying gradients for certain inputs.

The derivative plots also show that SiLU's derivative can exceed 1.0, reaching approximately 1.1 near x ≈ 2.4. This means SiLU can slightly amplify gradients for certain input values, potentially helping gradient flow in very deep networks. Let's visualize where this gradient amplification occurs:

Out[15]:
Visualization
Line plot of SiLU derivative with shaded region showing where derivative exceeds 1.0.
Gradient amplification regions for SiLU. The shaded region shows where SiLU's derivative exceeds 1.0, meaning gradients are amplified rather than attenuated. This occurs for inputs above roughly x ≈ 1.28, with a peak of about 1.1 near x ≈ 2.4.

Quantitative Comparison

Let's compute specific properties that affect neural network behavior:

In[16]:
Code
def analyze_activation(name, func, x_range):
    """Compute key properties of an activation function."""
    y = func(x_range)

    # Find minimum value and location
    min_idx = np.argmin(y)
    min_val = y[min_idx]
    min_x = x_range[min_idx]

    # Sparsity: fraction of outputs that are zero or near-zero
    sparsity = (np.abs(y) < 0.01).mean()

    # Linearity for positive inputs (correlation with y=x)
    pos_mask = x_range > 0.5
    correlation = np.corrcoef(x_range[pos_mask], y[pos_mask])[0, 1]

    return {
        "name": name,
        "min_value": min_val,
        "min_location": min_x,
        "sparsity": sparsity,
        "linearity": correlation,
    }


x_analysis = np.linspace(-5, 5, 10000)
activations = [
    ("ReLU", relu),
    ("GELU", gelu_exact),
    ("SiLU", silu),
]

results = [
    analyze_activation(name, func, x_analysis) for name, func in activations
]
Out[17]:
Console
Activation function properties:

Function      Min Value   Min Location     Sparsity    Linearity
-----------------------------------------------------------------
ReLU             0.0000          -5.00       50.1%     1.000000
GELU            -0.1700          -0.75       23.6%     0.999769
SiLU            -0.2785          -1.28        0.4%     0.999823

Key observations from this analysis:

  • Minimum value: ReLU never goes negative (by definition). GELU's dip is mild (-0.17), while SiLU dips deeper (-0.28). These negative values can help with regularization but also introduce potential instabilities.

  • Sparsity: All three create some degree of sparsity (outputs near zero), but ReLU is the most sparse for normally distributed inputs. This sparsity can aid interpretability and efficiency.

  • Linearity: All three are highly linear for positive inputs (correlation > 0.999 with y = x), which helps preserve signal magnitude through deep networks.

The Negative Dip: Feature or Bug?

GELU and SiLU's negative regions deserve closer examination. For certain negative inputs, these activations produce negative outputs rather than zero. This seems counterintuitive: why would we want activations to sometimes invert their input's sign?

The negative dip serves as a form of implicit regularization. Consider what happens during training: inputs that land in the dip region (roughly -2 < x < 0 for GELU) get attenuated but not eliminated. This "soft gating" provides a richer gradient signal than ReLU's hard zero, potentially helping the network escape local minima.

Out[18]:
Visualization
Zoomed plot of the negative dip region showing GELU and SiLU curves dipping below zero before asymptoting to zero for large negative inputs.
Detailed view of the negative dip region. GELU reaches its minimum of approximately -0.17 at x ≈ -0.75. SiLU reaches its deeper minimum of approximately -0.28 at x ≈ -1.28. The dashed line shows y = 0 for reference. These negative outputs provide non-zero gradients even for negative inputs, potentially aiding optimization.

The magnitude of the dip matters for training dynamics. A deeper dip (SiLU) means stronger negative signals for moderately negative inputs, which could either help or hurt depending on the task. Empirically, both work well, with the choice often coming down to which pairs better with other architectural decisions (like gated linear units).
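To make the "richer gradient signal" point concrete, here is a small comparison of the gradients each activation passes back for moderately negative inputs, using the derivative helpers defined earlier:

Code
neg_inputs = np.array([-2.0, -1.0, -0.5])

print("input   ReLU'    GELU'    SiLU'")
for v in neg_inputs:
    print(f"{v:5.1f}  {relu_derivative(v):7.3f}  "
          f"{gelu_derivative(v):7.3f}  {silu_derivative(v):7.3f}")
# ReLU blocks the gradient entirely (0.0) for every negative input,
# while GELU and SiLU still pass back small (sometimes negative) gradients.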

Activation Functions in Practice

Let's simulate how these activations behave in an actual FFN layer, processing typical transformer hidden states:

In[19]:
Code
# Simulate FFN behavior with different activations
np.random.seed(42)

d_model = 768
d_ff = 3072
batch_size = 1000

# Initialize FFN weights (shared across activations)
W1 = np.random.randn(d_model, d_ff) * np.sqrt(2.0 / (d_model + d_ff))
b1 = np.zeros(d_ff)
W2 = np.random.randn(d_ff, d_model) * np.sqrt(2.0 / (d_ff + d_model))
b2 = np.zeros(d_model)

# Generate input batch (simulating post-attention representations)
X = np.random.randn(batch_size, d_model) * 0.5

# Compute pre-activations
pre_act = X @ W1 + b1

# Apply each activation
activations_dict = {
    "ReLU": relu(pre_act),
    "GELU": gelu_exact(pre_act),
    "SiLU": silu(pre_act),
}


# Compute statistics
def compute_stats(hidden):
    """Compute activation statistics."""
    return {
        "mean": hidden.mean(),
        "std": hidden.std(),
        "sparsity": (np.abs(hidden) < 0.01).mean(),
        "negative_fraction": (hidden < 0).mean(),
    }


stats = {name: compute_stats(h) for name, h in activations_dict.items()}
Out[20]:
Console
Hidden layer statistics by activation function:

Activation       Mean        Std     Sparsity     Negative
----------------------------------------------------------
ReLU           0.1262     0.1848       51.3%        0.0%
GELU           0.0380     0.1664        5.1%       50.0%
SiLU           0.0244     0.1618        5.1%       50.0%

The statistics reveal important differences. ReLU creates the sparsest representations (about half of all values are exactly zero) and has no negative outputs. GELU and SiLU are far less sparse, and roughly half of their outputs are negative, though those negatives are small in magnitude. ReLU's mean activation is also the highest, because negative pre-activations are clipped to zero instead of being mapped to small negative values that pull the mean down.

Out[21]:
Visualization
Histogram of ReLU activations showing large spike at zero and positive tail.
ReLU hidden activations are sparse: a large spike at zero from zeroed negative inputs, and positive values following a truncated distribution.
Histogram of GELU activations showing smooth distribution with small negative tail.
GELU hidden activations are smoother, with a smaller peak near zero and a slight left tail of negative values.
Histogram of SiLU activations showing smooth distribution with moderate negative tail.
SiLU hidden activations are similar to GELU but with a slightly more pronounced negative tail.

Computational Efficiency

Activation function speed matters at scale. When processing billions of tokens through trillion-parameter models, even small differences in activation computation time accumulate. Let's benchmark our implementations:

In[22]:
Code
import time


def benchmark_activation(func, x, n_iterations=100):
    """Benchmark activation function speed."""
    # Warmup
    for _ in range(10):
        _ = func(x)

    # Timed runs
    start = time.perf_counter()
    for _ in range(n_iterations):
        _ = func(x)
    elapsed = time.perf_counter() - start

    return elapsed / n_iterations * 1000  # ms per call


# Large tensor for realistic benchmarking
x_bench = np.random.randn(1024, 4096).astype(np.float32)

benchmarks = {
    "ReLU": benchmark_activation(relu, x_bench),
    "GELU (exact)": benchmark_activation(gelu_exact, x_bench),
    "GELU (approx)": benchmark_activation(gelu_approximate, x_bench),
    "SiLU": benchmark_activation(silu, x_bench),
}
Out[23]:
Visualization
Horizontal bar chart showing relative computation times with ReLU fastest at 1x and exact GELU slowest.
Computational benchmarks for activation functions on a 1024x4096 tensor. ReLU is fastest due to its simple comparison operation. SiLU is faster than GELU variants because sigmoid has efficient hardware implementations. The exact GELU using the error function is slowest.

ReLU is fastest because it's just a comparison operation. The exact GELU using erf is slowest. The GELU tanh approximation is faster than exact GELU but still slower than SiLU. SiLU is relatively fast because sigmoid has efficient implementations on most hardware.

In practice, these differences are often dwarfed by memory bandwidth limitations and matrix multiplication costs. The activation function typically accounts for less than 1% of total FFN compute time. Still, at sufficient scale, even small improvements matter.
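A rough back-of-the-envelope count, using BERT-base-like FFN dimensions and counting only multiply-adds, shows why the activation is such a small slice of the cost:

Code
d_model, d_ff = 768, 3072

# Per token: two weight matrices of size d_model x d_ff vs. one elementwise activation
matmul_flops = 2 * (2 * d_model * d_ff)   # multiply-adds for W1 and W2
activation_ops = d_ff                     # one activation evaluation per hidden unit

print(f"Matmul FLOPs per token:   {matmul_flops:,}")
print(f"Activation ops per token: {activation_ops:,}")
print(f"Activation share:         {activation_ops / (matmul_flops + activation_ops):.4%}")
# Even if each activation evaluation costs ~10 ops, its share stays well under 1%.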

Which Activation Should You Use?

The choice of activation function depends on your model architecture and use case. Here's a practical guide:

Use ReLU when:

  • You need maximum computational efficiency and simplicity
  • You're building a custom architecture and want a well-understood baseline
  • Your model is small enough that the dead neuron problem is manageable
  • You're working with older codebases or frameworks that expect ReLU

Use GELU when:

  • You're building an encoder model (BERT, RoBERTa, ELECTRA style)
  • You want compatibility with pretrained encoder models
  • You prioritize smooth gradients over computational efficiency
  • You're fine-tuning an existing GELU-based model

Use SiLU when:

  • You're building a decoder model (GPT, LLaMA, Mistral style)
  • You're using gated linear units (SwiGLU, GeGLU) in your FFN
  • You want the slight performance edge that modern LLMs have demonstrated
  • You need a smooth activation but prefer simpler computation than GELU

The empirical differences between GELU and SiLU are often small. Unless you're training at massive scale where every fraction of a percent matters, either smooth activation will likely work well. The more important choice is between ReLU (with its dead neuron risk) and the smooth alternatives.

Out[24]:
Visualization
Three-column table visualization showing ReLU, GELU, and SiLU with their key properties and typical use cases.
Summary comparison of activation functions used in transformer FFNs. ReLU is the simplest but has a dead neuron problem. GELU is smooth and became standard for encoders (BERT). SiLU is smooth and became standard for decoders (LLaMA, GPT). The choice often follows the pretrained model you're building on.

Implementation: Configurable Activation Module

Let's create a configurable activation function module that supports all three options, following patterns used in modern transformer libraries:

In[25]:
Code
class Activation:
    """
    Configurable activation function for FFN layers.

    Supports ReLU, GELU (exact and approximate), and SiLU.
    """

    ACTIVATIONS = {
        "relu": lambda x: np.maximum(0, x),
        "gelu": lambda x: x * 0.5 * (1 + erf(x / np.sqrt(2))),
        "gelu_approximate": lambda x: 0.5
        * x
        * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3))),
        "silu": lambda x: x * (1 / (1 + np.exp(-np.clip(x, -500, 500)))),
    }

    def __init__(self, activation_type="gelu"):
        """
        Initialize the activation function.

        Args:
            activation_type: One of "relu", "gelu", "gelu_approximate", "silu"
        """
        if activation_type not in self.ACTIVATIONS:
            raise ValueError(
                f"Unknown activation: {activation_type}. "
                f"Choose from {list(self.ACTIVATIONS.keys())}"
            )

        self.activation_type = activation_type
        self._func = self.ACTIVATIONS[activation_type]

    def __call__(self, x):
        """Apply the activation function."""
        return self._func(x)

    def __repr__(self):
        return f"Activation({self.activation_type})"


# Test the module
test_input = np.array([-2, -1, 0, 1, 2])
Out[26]:
Console
Activation module test:

Input: [-2 -1  0  1  2]

Activation(relu)               → [0 0 0 1 2]
Activation(gelu)               → [-0.046 -0.159  0.     0.841  1.954]
Activation(gelu_approximate)   → [-0.045 -0.159  0.     0.841  1.955]
Activation(silu)               → [-0.238 -0.269  0.     0.731  1.762]

Limitations and Impact

The choice of activation function in transformer FFNs has subtle but measurable effects on model behavior. While the differences between GELU and SiLU are often small in benchmarks, they compound across billions of forward passes during training.

One underappreciated limitation is numerical stability. GELU's error function and SiLU's exponential can both produce numerical issues at extreme input values. Production implementations include clipping (as we did for sigmoid) and sometimes use mixed-precision computation to balance speed and stability.

Another consideration is hardware optimization. Modern GPUs and TPUs have specialized circuits for certain operations. The sigmoid function in SiLU benefits from this optimization on many accelerators. GELU's error function is less universally optimized, which partly explains the trend toward SiLU in recent models despite GELU's theoretical elegance.

The impact of activation functions extends beyond raw performance. The smooth gradients of GELU and SiLU allow for more stable training at large batch sizes and high learning rates. This was crucial for scaling transformers to billions of parameters, where training instability becomes a serious concern.

Looking forward, activation functions continue to evolve. Gated linear units (covered in the next chapter) combine activation functions with multiplicative gating, creating even richer nonlinearities. The field hasn't settled on a final answer, and future architectures may introduce entirely new activation functions that further improve on the current options.

Summary

Activation functions inject nonlinearity into the feed-forward network, enabling transformers to learn complex functions of language. This chapter traced the evolution from ReLU to GELU to SiLU, examining the mathematical properties that make each suitable for different contexts.

Key takeaways:

  • ReLU (max(0, x)) is the simplest activation, with zero computation cost beyond a comparison. Its sharp corner creates sparse representations but risks "dead neurons" that stop learning entirely. ReLU was used in the original transformer but has been largely superseded in modern architectures.

  • GELU (x · Φ(x)) smoothly gates inputs by their Gaussian CDF values. The result is a differentiable, non-monotonic function with a slight negative dip around x ≈ -0.75. GELU became the standard for encoder models (BERT, RoBERTa) and remains widely used.

  • SiLU/Swish (x · σ(x)) multiplies inputs by their sigmoid values. Like GELU, it's smooth and non-monotonic, but with a deeper negative dip and slightly simpler computation. SiLU has become the standard for decoder models (LLaMA, Mistral, GPT-NeoX).

  • Practical choice: The differences between GELU and SiLU are often small in practice. Choose based on your model family (encoder vs. decoder), compatibility with pretrained models, or specific architectural requirements like gated linear units.

  • Computational efficiency: ReLU is fastest, followed by SiLU, then GELU approximation, then exact GELU. However, activation computation is typically less than 1% of total FFN time, so speed rarely drives the choice.

The next chapter examines gated linear units (GLUs), which combine activation functions with multiplicative gating to create even more expressive FFN architectures. Variants like SwiGLU and GeGLU have become standard in state-of-the-art models like LLaMA and PaLM.

Key Parameters

When configuring activation functions for transformer FFNs, these parameters and choices affect model behavior:

  • activation_type: The activation function to use. Common options are "relu", "gelu", "gelu_approximate", and "silu". Choose based on your model architecture: GELU for encoder models (BERT-style), SiLU for decoder models (LLaMA-style), or ReLU for maximum simplicity and speed.

  • approximate (for GELU): Whether to use the tanh approximation instead of the exact error function. The approximation is faster on most hardware and introduces negligible error (< 0.005 maximum). Most production systems use the approximation.

  • inplace (framework-specific): Some frameworks allow in-place activation to reduce memory usage. This modifies the input tensor directly rather than creating a new output tensor. Use with caution, as it can cause issues with gradient computation if the original values are needed.

  • numerical clipping (for SiLU/sigmoid): Input values should be clipped to prevent overflow in the exponential function. A range of [-500, 500] is typical. Most deep learning frameworks handle this automatically, but custom implementations should include explicit clipping.

  • dtype considerations: Activation functions behave differently at different precisions. At float16 or bfloat16, the dynamic range is limited, which can cause issues with the exponential in SiLU or the error function in GELU. Mixed-precision training typically keeps activations in higher precision for stability.

