Master neural network activation functions including sigmoid, tanh, ReLU variants, GELU, Swish, and Mish. Learn when to use each and why.

This article is part of the free-to-read Language AI Handbook
Activation Functions
Neural networks derive their power from non-linearity. Without activation functions, even the deepest network would collapse into a simple linear transformation, no matter how many layers you stack. Activation functions are the secret ingredient that allows neural networks to learn complex, non-linear patterns in data.
In this chapter, we explore the evolution of activation functions, from the biologically-inspired sigmoid to modern innovations like GELU and Swish. You will understand not just what each function computes, but why it was designed that way and when to use it.
Why Non-Linearity Matters
Consider stacking two linear transformations. If the first layer computes $h = W_1 x + b_1$ and the second computes $y = W_2 h + b_2$, we can substitute and simplify:

$$y = W_2(W_1 x + b_1) + b_2 = (W_2 W_1)x + (W_2 b_1 + b_2)$$

where:
- $x$: the input vector
- $W_1, W_2$: weight matrices for layers 1 and 2
- $b_1, b_2$: bias vectors for layers 1 and 2
- $h$: the hidden representation after the first layer
- $y$: the final output

The result is equivalent to a single linear layer with weight matrix $W_2 W_1$ and bias $W_2 b_1 + b_2$. No matter how many layers we add, without non-linearity, the network can only learn linear decision boundaries.
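To make this concrete, here is a minimal PyTorch sketch (the layer sizes are arbitrary) showing that two stacked `nn.Linear` layers with no activation in between produce the same outputs as a single collapsed linear layer:

```python
# Two linear layers without an activation collapse into one linear layer.
import torch
import torch.nn as nn

torch.manual_seed(0)
layer1 = nn.Linear(4, 8)
layer2 = nn.Linear(8, 3)

x = torch.randn(5, 4)
stacked = layer2(layer1(x))  # two linear layers, no non-linearity

# Collapse: W = W2 @ W1, b = W2 @ b1 + b2
W = layer2.weight @ layer1.weight
b = layer2.weight @ layer1.bias + layer2.bias
collapsed = x @ W.T + b

print(torch.allclose(stacked, collapsed, atol=1e-6))  # True
```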
Activation functions break this limitation. By applying a non-linear function after each linear transformation, we enable networks to approximate arbitrarily complex functions.
The Sigmoid Function
The sigmoid function was one of the first activation functions used in neural networks. Its S-shaped curve smoothly maps any real number to a value between 0 and 1, making it interpretable as a probability.
Mathematical Definition
The sigmoid function is defined as:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

where:
- $z$: the input value (pre-activation), which can be any real number
- $e$: Euler's number (approximately 2.718)
- $\sigma(z)$: the output, constrained to the range $(0, 1)$

The function works by exponentiating the negative input. When $z$ is large and positive, $e^{-z}$ approaches 0, so $\sigma(z) \approx 1$. When $z$ is large and negative, $e^{-z}$ becomes very large, pushing $\sigma(z)$ toward 0. At $z = 0$, we get $\sigma(0) = 0.5$.
The Derivative
The sigmoid has an elegant derivative that can be expressed in terms of itself:

$$\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$$

where:
- $\sigma'(z)$: the derivative of sigmoid with respect to $z$
- $\sigma(z)$: the sigmoid function evaluated at $z$

This property makes gradient computation efficient. Once we compute the forward pass value $\sigma(z)$, we can immediately compute the gradient without re-evaluating the exponential.
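A small sketch of this idea: the gradient is computed directly from the cached forward value, with no second call to the exponential.

```python
# Reuse the sigmoid forward value to get its gradient.
import torch

def sigmoid(z):
    return 1.0 / (1.0 + torch.exp(-z))

z = torch.linspace(-6, 6, 7)
s = sigmoid(z)
grad = s * (1 - s)         # sigma'(z) = sigma(z) * (1 - sigma(z))
print(grad.max())          # peaks at 0.25, reached at z = 0
```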
The Saturation Problem
The maximum value of the derivative occurs at $z = 0$, where $\sigma'(0) = 0.25$. This means the gradient is at most 0.25, and it rapidly approaches 0 as $|z|$ increases. This phenomenon is called saturation.
When inputs to sigmoid neurons are very large or very small, the gradient becomes negligibly small. During backpropagation, these tiny gradients multiply together across layers, causing gradients to "vanish" in deep networks. This makes training deep networks with sigmoid activations extremely difficult.
Let's visualize the sigmoid function and its derivative to understand saturation:
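Here is a minimal matplotlib sketch of such a plot; the choice of $|z| > 4$ as the shaded saturation zone is an illustrative assumption, not a hard threshold.

```python
# Plot sigmoid and its derivative, shading the saturation regions.
import numpy as np
import matplotlib.pyplot as plt

z = np.linspace(-8, 8, 400)
sig = 1 / (1 + np.exp(-z))
dsig = sig * (1 - sig)

fig, ax = plt.subplots(figsize=(7, 4))
ax.plot(z, sig, label="sigmoid")
ax.plot(z, dsig, label="derivative")
ax.axvspan(-8, -4, alpha=0.15, color="red")  # left saturation zone
ax.axvspan(4, 8, alpha=0.15, color="red")    # right saturation zone
ax.set_xlabel("z")
ax.legend()
plt.show()
```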
The shaded regions highlight where saturation occurs. In these zones, the gradient is nearly zero, making learning extremely slow or impossible.
The Hyperbolic Tangent (tanh)
The hyperbolic tangent function addresses one limitation of sigmoid: it is zero-centered, meaning its outputs are symmetric around zero. This property helps with gradient flow during training.
Mathematical Definition
$$\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$$

where:
- $z$: the input value
- $e^{z}, e^{-z}$: exponentials of the positive and negative input
- $\tanh(z)$: the output, constrained to the range $(-1, 1)$

An equivalent formulation relates tanh to sigmoid:

$$\tanh(z) = 2\sigma(2z) - 1$$

This shows that tanh is essentially a rescaled and shifted version of sigmoid, mapping to $(-1, 1)$ instead of $(0, 1)$.
The Derivative
$$\tanh'(z) = 1 - \tanh^2(z)$$

where:
- $\tanh'(z)$: the derivative of tanh with respect to $z$
- $\tanh^2(z)$: the square of the tanh output

At $z = 0$, the derivative equals 1, which is four times larger than sigmoid's maximum gradient of 0.25. This stronger gradient signal helps mitigate, though not eliminate, the vanishing gradient problem.
Comparing Sigmoid and Tanh
Both sigmoid and tanh suffer from saturation for large $|z|$. However, tanh's zero-centered output makes it preferable for hidden layers, while sigmoid remains useful for output layers when you need probabilities.
Rectified Linear Unit (ReLU)
ReLU revolutionized deep learning. Its simplicity, computational efficiency, and resistance to vanishing gradients made training deep networks practical for the first time.
Mathematical Definition
$$\text{ReLU}(z) = \max(0, z)$$

where:
- $z$: the input value
- $\max(0, z)$: returns $z$ if positive, otherwise 0
The function is piecewise linear: it passes positive values unchanged and clips negative values to zero.
The Derivative
$$\text{ReLU}'(z) = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{if } z < 0 \end{cases}$$

At exactly $z = 0$, the derivative is technically undefined, but in practice, we typically set it to 0 or 1.
The key insight is that for positive inputs, the gradient is exactly 1. This means gradients flow through ReLU layers without diminishing, solving the vanishing gradient problem that plagued sigmoid and tanh.
The Dying ReLU Problem
While ReLU solves vanishing gradients, it introduces a new issue: neurons can "die" during training.
If a ReLU neuron's weights are updated such that its pre-activation becomes negative for all training examples, the neuron outputs zero for every input. Since the gradient is also zero for negative inputs, the neuron can never recover. It becomes permanently inactive, effectively reducing the network's capacity.
This typically happens when:
- The learning rate is too high, causing large weight updates
- Poor weight initialization pushes many neurons into negative territory
- Strong negative gradients shift the bias terms too far negative
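A small PyTorch sketch illustrates the effect; the large negative bias is set by hand purely to force the neuron into the dead regime.

```python
# Once a ReLU unit's pre-activation is negative for every input,
# its output and its gradient are both zero, so it stops learning.
import torch
import torch.nn as nn

neuron = nn.Linear(3, 1)
with torch.no_grad():
    neuron.bias.fill_(-100.0)   # push the pre-activation far negative (illustrative)

x = torch.randn(32, 3)
out = torch.relu(neuron(x))
out.sum().backward()

print(float(out.abs().max()))                # 0.0 -- the unit is silent
print(float(neuron.weight.grad.abs().max())) # 0.0 -- no gradient, no recovery
```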
Leaky ReLU and Parametric ReLU
Leaky ReLU addresses the dying ReLU problem by allowing a small, non-zero gradient when the input is negative.
Mathematical Definition
$$\text{LeakyReLU}(z) = \begin{cases} z & \text{if } z > 0 \\ \alpha z & \text{if } z \le 0 \end{cases}$$

where:
- $z$: the input value
- $\alpha$: a small positive constant, typically 0.01
- $\alpha z$: the scaled negative input, giving a small but non-zero output

The function can also be written compactly as:

$$\text{LeakyReLU}(z) = \max(\alpha z, z)$$
Parametric ReLU (PReLU)
PReLU takes this further by making $\alpha$ a learnable parameter:

$$\text{PReLU}(z) = \begin{cases} z & \text{if } z > 0 \\ \alpha z & \text{if } z \le 0 \end{cases}$$

The key difference is that $\alpha$ is learned during training via backpropagation, allowing the network to determine the optimal slope for negative inputs.
Derivative
For both Leaky ReLU and PReLU:

$$f'(z) = \begin{cases} 1 & \text{if } z > 0 \\ \alpha & \text{if } z < 0 \end{cases}$$
The small but non-zero gradient for negative inputs allows the neuron to continue learning even when it receives negative pre-activations.
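For reference, PyTorch ships both variants as built-in modules. The sketch below compares them with plain ReLU on a few inputs; PReLU starts from PyTorch's default initial $\alpha$ of 0.25, which is then learned during training.

```python
# Compare ReLU, Leaky ReLU, and PReLU on negative and positive inputs.
import torch
import torch.nn as nn

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])

leaky = nn.LeakyReLU(negative_slope=0.01)
prelu = nn.PReLU(init=0.25)        # alpha is a learnable parameter

print(torch.relu(x))               # negatives clipped to 0
print(leaky(x))                    # negatives scaled by 0.01
print(prelu(x).detach())           # negatives scaled by the learnable alpha
```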
Exponential Linear Unit (ELU)
ELU provides smooth, differentiable transitions at zero and pushes mean activations closer to zero, which can accelerate learning.
Mathematical Definition
$$\text{ELU}(z) = \begin{cases} z & \text{if } z > 0 \\ \alpha (e^{z} - 1) & \text{if } z \le 0 \end{cases}$$

where:
- $z$: the input value
- $\alpha$: a hyperparameter controlling the saturation value for negative inputs (typically 1.0)
- $e^{z} - 1$: an exponential term that approaches $-1$ as $z \to -\infty$

For large negative inputs, ELU saturates at $-\alpha$. This saturation provides noise robustness by limiting the influence of large negative values.
The Derivative
$$\text{ELU}'(z) = \begin{cases} 1 & \text{if } z > 0 \\ \alpha e^{z} = \text{ELU}(z) + \alpha & \text{if } z \le 0 \end{cases}$$

Note that for $z < 0$, the derivative can be computed from the function value itself as $\text{ELU}(z) + \alpha$, similar to sigmoid's self-referential gradient.
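A short sketch (assuming $\alpha = 1$) confirms that the autograd gradient for negative inputs matches $\text{ELU}(z) + \alpha$ computed from the forward value:

```python
# For negative inputs, ELU'(z) = ELU(z) + alpha, so the forward value
# can be reused for the gradient.
import torch
import torch.nn.functional as F

alpha = 1.0
x = torch.tensor([-3.0, -1.0, -0.1], requires_grad=True)
y = F.elu(x, alpha=alpha)
y.sum().backward()

print(x.grad)                 # gradient from autograd
print((y + alpha).detach())   # same values, from the forward output
```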
Scaled Exponential Linear Unit (SELU)
SELU is a self-normalizing variant of ELU. When used with proper weight initialization, SELU activations automatically converge to zero mean and unit variance, eliminating the need for batch normalization.
$$\text{SELU}(z) = \lambda \begin{cases} z & \text{if } z > 0 \\ \alpha (e^{z} - 1) & \text{if } z \le 0 \end{cases}$$

where the specific values are:

$$\lambda \approx 1.0507, \qquad \alpha \approx 1.6733$$
These constants were derived analytically to ensure self-normalizing properties.
Gaussian Error Linear Unit (GELU)
GELU has become the default activation function in transformer architectures like BERT and GPT. Unlike ReLU, which deterministically zeroes out negative values, GELU applies a smooth, probabilistic gating.
Intuition
GELU can be understood through a stochastic regularization lens. Imagine each neuron is randomly multiplied by either 0 or 1, where the probability of being "on" depends on how positive the input is. GELU computes the expected value of this stochastic process:

$$\text{GELU}(z) = z \cdot P(Z \le z) = z \cdot \Phi(z)$$

where $Z$ is a standard normal random variable.
Mathematical Definition
The formal definition uses the cumulative distribution function (CDF) of the standard normal distribution:

$$\text{GELU}(z) = z \cdot \Phi(z)$$

where:
- $z$: the input value
- $\Phi(z)$: the CDF of the standard normal distribution, i.e., $P(Z \le z)$ for $Z \sim \mathcal{N}(0, 1)$
- $z \cdot \Phi(z)$: the input scaled by the probability that a standard normal is less than $z$

The CDF is defined as:

$$\Phi(z) = \frac{1}{2}\left(1 + \text{erf}\!\left(\frac{z}{\sqrt{2}}\right)\right)$$

where $\text{erf}$ is the error function.
Practical Approximation
Computing the error function can be expensive. A commonly used approximation is:

$$\text{GELU}(z) \approx 0.5\, z \left(1 + \tanh\!\left(\sqrt{\tfrac{2}{\pi}}\left(z + 0.044715\, z^{3}\right)\right)\right)$$
This approximation is accurate to within a few percent and is computationally efficient.
The Derivative
The derivative of GELU involves both the CDF and the probability density function (PDF) of the standard normal:

$$\text{GELU}'(z) = \Phi(z) + z\,\phi(z)$$

where:
- $\Phi(z)$: the standard normal CDF
- $\phi(z)$: the standard normal PDF, $\phi(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^{2}/2}$

This derivative is always positive for large positive $z$, and smoothly transitions through zero.
The key difference from ReLU is visible near zero: GELU smoothly curves through the origin, allowing small negative values to pass through with reduced magnitude. This smooth gating is believed to help with gradient flow and model expressiveness.
Swish and Mish
Swish and Mish are modern activation functions discovered through automated search and designed to combine the benefits of ReLU-like behavior with smooth gradients.
Swish
Swish was discovered by Google researchers using automated search techniques:

$$\text{Swish}(z) = z \cdot \sigma(\beta z)$$

where:
- $z$: the input value
- $\sigma$: the sigmoid function
- $\beta$: a learnable or fixed parameter (often $\beta = 1$)
- $z \cdot \sigma(\beta z)$: the input scaled by the sigmoid of itself

When $\beta = 1$, this simplifies to:

$$\text{Swish}(z) = z \cdot \sigma(z) = \frac{z}{1 + e^{-z}}$$
Swish is non-monotonic: it dips below zero for negative inputs before rising again. This allows small negative gradients to flow backward, potentially helping escape local minima.
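A quick sketch (the probe points are arbitrary) shows the $\beta = 1$ definition and its equivalence to PyTorch's built-in SiLU:

```python
# Swish with beta = 1, also known as SiLU; note the dip below zero for
# moderately negative inputs (the minimum is about -0.278 near z = -1.28).
import torch
import torch.nn.functional as F

x = torch.linspace(-5, 0, 6)
swish = x * torch.sigmoid(x)   # definition with beta = 1
print(swish)
print(F.silu(x))               # identical built-in
```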
Mish
Mish uses the softplus function and tanh for an even smoother curve:

$$\text{Mish}(z) = z \cdot \tanh(\text{softplus}(z)) = z \cdot \tanh\!\left(\ln(1 + e^{z})\right)$$

where:
- $\text{softplus}(z) = \ln(1 + e^{z})$: a smooth approximation to ReLU
- $\tanh$: the hyperbolic tangent function
Mish has continuous derivatives of all orders, making it particularly smooth. Like Swish, it is non-monotonic with a small negative region.
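Mish is straightforward to express with softplus and tanh; the sketch below checks a hand-written version against PyTorch's built-in `F.mish`.

```python
# Hand-written Mish versus PyTorch's built-in implementation.
import torch
import torch.nn.functional as F

def mish(x):
    return x * torch.tanh(F.softplus(x))   # softplus(x) = ln(1 + e^x)

x = torch.linspace(-4, 4, 9)
print(torch.allclose(mish(x), F.mish(x)))  # True
```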
Comparing All Activation Functions
Let's visualize all the activation functions we've covered in a single comprehensive comparison. This 2×4 grid allows direct comparison of each function's behavior, particularly around zero where their differences are most apparent.
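One way to produce such a grid is sketched below; the input range, layout, and styling are assumptions rather than the exact figure from this chapter.

```python
# A 2x4 grid comparing eight activation functions using PyTorch built-ins.
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt

x = torch.linspace(-4, 4, 400)
functions = {
    "Sigmoid": torch.sigmoid,
    "Tanh": torch.tanh,
    "ReLU": F.relu,
    "Leaky ReLU": lambda t: F.leaky_relu(t, 0.01),
    "ELU": F.elu,
    "SELU": F.selu,
    "GELU": F.gelu,
    "Swish (SiLU)": F.silu,
}

fig, axes = plt.subplots(2, 4, figsize=(16, 7), sharex=True)
for ax, (name, fn) in zip(axes.flat, functions.items()):
    ax.plot(x.numpy(), fn(x).numpy())
    ax.axhline(0, lw=0.5, color="gray")
    ax.axvline(0, lw=0.5, color="gray")
    ax.set_title(name)
plt.tight_layout()
plt.show()
```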
Implementation in PyTorch
Let's implement and visualize these activation functions using PyTorch. Modern deep learning frameworks provide optimized implementations of all common activations.
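The snippet below is a minimal sketch of this comparison; the probe points -2, 0, and 2 are arbitrary choices.

```python
# Evaluate several built-in PyTorch activations at a few probe points.
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, 0.0, 2.0])

activations = {
    "ReLU": F.relu,
    "LeakyReLU": lambda t: F.leaky_relu(t, negative_slope=0.01),
    "Tanh": torch.tanh,
    "Sigmoid": torch.sigmoid,
    "GELU": F.gelu,
    "Swish (SiLU)": F.silu,
}

for name, fn in activations.items():
    print(f"{name:>14}: {fn(x).tolist()}")
```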
The output shows each activation's behavior at three key points: a negative value, zero, and a positive value. Notice how ReLU completely zeros out negative inputs, while Leaky ReLU preserves a scaled version. GELU and Swish (SiLU) show smooth, non-zero responses even for negative inputs.
Custom Activation Implementation
You can also implement custom activations as simple functions:
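For example, a hand-written GELU using the tanh approximation can be checked against `F.gelu`, which computes the exact erf-based form by default. This is a sketch for illustration, not a drop-in replacement for the built-in.

```python
# Custom GELU via the tanh approximation, compared to the exact erf-based form.
import math
import torch
import torch.nn.functional as F

def gelu_tanh(x):
    # 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
    return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)))

x = torch.linspace(-4, 4, 9)
exact = F.gelu(x)                      # erf-based GELU
approx = gelu_tanh(x)
print((exact - approx).abs().max())    # the largest gap is tiny (well below 1e-2)
```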
The tanh approximation closely matches the exact GELU computation while being computationally efficient. The small differences are negligible in practice.
Choosing Activation Functions
Selecting the right activation function depends on your architecture, task, and computational constraints. Here is a practical guide based on current best practices.
For Hidden Layers in Feedforward Networks
| Activation | Best Use Case | Considerations |
|---|---|---|
| ReLU | Default choice for most networks | Fast, effective, but watch for dying neurons |
| Leaky ReLU | When dying ReLU is a concern | Small computational overhead |
| ELU | When faster convergence is needed | More expensive than ReLU |
| SELU | Self-normalizing networks | Requires specific initialization |
For Transformer Architectures
GELU has become the de facto standard for transformers. Its smooth gating properties align well with the attention mechanism's probabilistic nature. Major models like BERT, GPT, and their variants all use GELU.
For Output Layers
| Task | Activation | Output Range |
|---|---|---|
| Binary classification | Sigmoid | (0, 1) |
| Multi-class classification | Softmax | (0, 1) per class, sum = 1 |
| Regression | None (linear) | Unbounded |
| Bounded regression | Sigmoid or Tanh | Scaled to target range |
For Recurrent Networks (RNNs, LSTMs)
Tanh remains common for recurrent connections because:
- Zero-centered outputs help maintain stable hidden states
- The bounded range prevents activations from exploding
- Gate mechanisms (in LSTMs/GRUs) specifically rely on sigmoid's (0, 1) output
Practical Considerations
When implementing activation functions in production, consider these factors:
Computational Cost: ReLU is the fastest, requiring only a comparison and conditional. Exponential-based functions (sigmoid, tanh, ELU, SELU) are more expensive. GELU requires either the expensive error function or a polynomial approximation.
Numerical Stability: Sigmoid and softmax can overflow for very large inputs. Use numerically stable implementations that subtract the maximum value before exponentiating.
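As a sketch of the stabilization trick (the logit values are deliberately extreme):

```python
# A numerically stable softmax subtracts the maximum before exponentiating.
import torch

logits = torch.tensor([1000.0, 1001.0, 1002.0])

naive = torch.exp(logits) / torch.exp(logits).sum()   # overflows to nan
stable = torch.exp(logits - logits.max())
stable = stable / stable.sum()

print(naive)                         # tensor([nan, nan, nan])
print(stable)                        # tensor([0.0900, 0.2447, 0.6652])
print(torch.softmax(logits, dim=0))  # built-in is already stabilized
```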
Initialization Compatibility: Some activations work best with specific initialization schemes. He initialization pairs well with ReLU variants, while Xavier/Glorot initialization suits tanh and sigmoid.
Gradient Flow: For very deep networks, favor activations with gradients that neither vanish nor explode. ReLU and its variants generally provide good gradient flow, while sigmoid and tanh can cause vanishing gradients.
Summary
Activation functions are the non-linear transformations that give neural networks their expressive power. The field has evolved from biologically-inspired functions like sigmoid to modern innovations designed for specific architectural needs.
Key takeaways:
- Sigmoid and tanh were early standards but suffer from vanishing gradients in deep networks. Sigmoid remains useful for output layers when probabilities are needed.
- ReLU revolutionized deep learning with its simple, sparse, and gradient-preserving properties. The dying ReLU problem led to variants like Leaky ReLU and ELU.
- GELU provides smooth, probabilistic gating and has become the standard for transformer architectures. Its stochastic interpretation aligns well with the attention mechanism.
- Swish and Mish offer smooth, non-monotonic alternatives that can outperform ReLU on certain tasks while maintaining good gradient flow.
- For practical applications: Start with ReLU for general feedforward networks, GELU for transformers, and task-appropriate functions for output layers.
The choice of activation function can significantly impact training dynamics and final performance. While the differences may seem subtle, they compound across millions of neurons and billions of parameters in modern networks.