Activation Functions: From Sigmoid to GELU and Beyond

Michael Brenndoerfer · December 15, 2025 · 20 min read

Master neural network activation functions including sigmoid, tanh, ReLU variants, GELU, Swish, and Mish. Learn when to use each and why.


Activation Functions

Neural networks derive their power from non-linearity. Without activation functions, even the deepest network would collapse into a simple linear transformation, no matter how many layers you stack. Activation functions are the secret ingredient that allows neural networks to learn complex, non-linear patterns in data.

In this chapter, we explore the evolution of activation functions, from the biologically-inspired sigmoid to modern innovations like GELU and Swish. You will understand not just what each function computes, but why it was designed that way and when to use it.

Why Non-Linearity Matters

Consider stacking two linear transformations. If the first layer computes $\mathbf{h} = \mathbf{W}_1 \mathbf{x} + \mathbf{b}_1$ and the second computes $\mathbf{y} = \mathbf{W}_2 \mathbf{h} + \mathbf{b}_2$, we can substitute and simplify:

$$\mathbf{y} = \mathbf{W}_2 (\mathbf{W}_1 \mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2 = (\mathbf{W}_2 \mathbf{W}_1) \mathbf{x} + (\mathbf{W}_2 \mathbf{b}_1 + \mathbf{b}_2)$$

where:

  • $\mathbf{x}$: the input vector
  • $\mathbf{W}_1, \mathbf{W}_2$: weight matrices for layers 1 and 2
  • $\mathbf{b}_1, \mathbf{b}_2$: bias vectors for layers 1 and 2
  • $\mathbf{h}$: the hidden representation after the first layer
  • $\mathbf{y}$: the final output

The result is equivalent to a single linear layer with weight matrix $\mathbf{W}' = \mathbf{W}_2 \mathbf{W}_1$ and bias $\mathbf{b}' = \mathbf{W}_2 \mathbf{b}_1 + \mathbf{b}_2$. No matter how many layers we add, without non-linearity, the network can only learn linear decision boundaries.

Activation functions break this limitation. By applying a non-linear function after each linear transformation, we enable networks to approximate arbitrarily complex functions.
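
To see the collapse concretely, here is a minimal PyTorch sketch (the shapes and seed are arbitrary illustration choices) that stacks two nn.Linear layers without an activation and checks that a single combined linear layer produces the same output:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Two linear layers stacked with no activation in between
layer1 = nn.Linear(4, 8)
layer2 = nn.Linear(8, 3)

# Collapse them into one: W' = W2 @ W1, b' = W2 @ b1 + b2
W_combined = layer2.weight @ layer1.weight
b_combined = layer2.weight @ layer1.bias + layer2.bias

x = torch.randn(5, 4)
y_stacked = layer2(layer1(x))
y_single = x @ W_combined.T + b_combined

print(torch.allclose(y_stacked, y_single, atol=1e-5))  # True
```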

The Sigmoid Function

The sigmoid function was one of the first activation functions used in neural networks. Its S-shaped curve smoothly maps any real number to a value between 0 and 1, making it interpretable as a probability.

Mathematical Definition

The sigmoid function is defined as:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

where:

  • $z$: the input value (pre-activation), which can be any real number
  • $e$: Euler's number (approximately 2.718)
  • $\sigma(z)$: the output, constrained to the range $(0, 1)$

The function works by exponentiating the negative input. When $z$ is large and positive, $e^{-z}$ approaches 0, so $\sigma(z) \approx 1$. When $z$ is large and negative, $e^{-z}$ becomes very large, pushing $\sigma(z)$ toward 0. At $z = 0$, we get $\sigma(0) = 0.5$.

The Derivative

The sigmoid has an elegant derivative that can be expressed in terms of itself:

$$\sigma'(z) = \sigma(z) \cdot (1 - \sigma(z))$$

where:

  • $\sigma'(z)$: the derivative of sigmoid with respect to $z$
  • $\sigma(z)$: the sigmoid function evaluated at $z$

This property makes gradient computation efficient. Once we compute the forward-pass value $\sigma(z)$, we can immediately compute the gradient without re-evaluating the exponential.
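
As a small illustration (the helper names here are our own, not from any library), a sketch that computes sigmoid in PyTorch and reuses the forward-pass output for the gradient:

```python
import torch

def sigmoid(z):
    return 1.0 / (1.0 + torch.exp(-z))

def sigmoid_grad(s):
    # s is the already-computed forward value sigmoid(z);
    # no extra exponential is needed for the gradient
    return s * (1.0 - s)

z = torch.tensor([-2.0, 0.0, 2.0])
s = sigmoid(z)
print(s)                # approximately [0.1192, 0.5000, 0.8808]
print(sigmoid_grad(s))  # approximately [0.1050, 0.2500, 0.1050]
```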

The Saturation Problem

The maximum value of the derivative occurs at $z = 0$, where $\sigma'(0) = 0.25$. This means the gradient is at most 0.25, and it rapidly approaches 0 as $|z|$ increases. This phenomenon is called saturation.

Vanishing Gradients

When inputs to sigmoid neurons are very large or very small, the gradient becomes negligibly small. During backpropagation, these tiny gradients multiply together across layers, causing gradients to "vanish" in deep networks. This makes training deep networks with sigmoid activations extremely difficult.
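
A rough back-of-the-envelope sketch shows the scale of the problem: even at sigmoid's best-case gradient of 0.25 per layer, the gradient product shrinks geometrically with depth.

```python
# Each sigmoid layer contributes at most 0.25 to the gradient product,
# so even in the best case the signal shrinks geometrically with depth.
best_case = 0.25
for depth in (2, 5, 10, 20):
    print(f"{depth:2d} layers: gradient factor <= {best_case ** depth:.2e}")
```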

Let's visualize the sigmoid function and its derivative to understand saturation:

Out[2]:
[Figure] The sigmoid function squashes inputs to the range (0, 1); the shaded regions mark saturation zones where the function is nearly flat. The derivative peaks at 0.25 and rapidly approaches zero for large |z|, causing the vanishing gradient problem.

The shaded regions highlight where saturation occurs. In these zones, the gradient is nearly zero, making learning extremely slow or impossible.

The Hyperbolic Tangent (tanh)

The hyperbolic tangent function addresses one limitation of sigmoid: it is zero-centered, meaning its outputs are symmetric around zero. This property helps with gradient flow during training.

Mathematical Definition

$$\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$$

where:

  • $z$: the input value
  • $e^z, e^{-z}$: exponentials of the positive and negative input
  • $\tanh(z)$: the output, constrained to the range $(-1, 1)$

An equivalent formulation relates tanh to sigmoid:

$$\tanh(z) = 2\sigma(2z) - 1$$

This shows that tanh is essentially a rescaled and shifted version of sigmoid, mapping to $(-1, 1)$ instead of $(0, 1)$.
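
This identity is easy to verify numerically; a quick PyTorch check:

```python
import torch

z = torch.linspace(-4, 4, 9)
lhs = torch.tanh(z)
rhs = 2 * torch.sigmoid(2 * z) - 1
print(torch.allclose(lhs, rhs, atol=1e-6))  # True
```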

The Derivative

$$\tanh'(z) = 1 - \tanh^2(z)$$

where:

  • $\tanh'(z)$: the derivative of tanh with respect to $z$
  • $\tanh^2(z)$: the square of the tanh output

At $z = 0$, the derivative equals 1, four times larger than sigmoid's maximum gradient of 0.25. This stronger gradient signal helps mitigate, though not eliminate, the vanishing gradient problem.

Comparing Sigmoid and Tanh

Out[3]:
[Figure] Comparison of sigmoid and tanh. Tanh is zero-centered with range (-1, 1) and has stronger gradients near the origin, making it generally preferable to sigmoid for hidden layers.

Both sigmoid and tanh suffer from saturation for large $|z|$ values. However, tanh's zero-centered output makes it preferable for hidden layers, while sigmoid remains useful for output layers when you need probabilities.

Rectified Linear Unit (ReLU)

ReLU revolutionized deep learning. Its simplicity, computational efficiency, and resistance to vanishing gradients made training deep networks practical for the first time.

Mathematical Definition

$$\text{ReLU}(z) = \max(0, z)$$

where:

  • $z$: the input value
  • $\max(0, z)$: returns $z$ if positive, otherwise 0

The function is piecewise linear: it passes positive values unchanged and clips negative values to zero.

The Derivative

$$\text{ReLU}'(z) = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{if } z < 0 \end{cases}$$

At exactly $z = 0$, the derivative is technically undefined, but in practice we typically set it to 0 or 1.

The key insight is that for positive inputs, the gradient is exactly 1. This means gradients flow through ReLU layers without diminishing, solving the vanishing gradient problem that plagued sigmoid and tanh.

The Dying ReLU Problem

While ReLU solves vanishing gradients, it introduces a new issue: neurons can "die" during training.

Dying ReLU

If a ReLU neuron's weights are updated such that its pre-activation $z$ becomes negative for all training examples, the neuron outputs zero for every input. Since the gradient is also zero for negative inputs, the neuron can never recover. It becomes permanently inactive, effectively reducing the network's capacity.

This typically happens when:

  1. The learning rate is too high, causing large weight updates
  2. Poor weight initialization pushes many neurons into negative territory
  3. Strong negative gradients shift the bias terms too far negative

Out[4]:
[Figure] ReLU activation and its derivative. The shaded region shows the dead zone where the output is always zero; the derivative is a constant 1 for positive inputs and 0 for negative inputs.
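
The sketch below (a deliberately contrived setup) reproduces a dead neuron in PyTorch: with the bias pushed far negative, the pre-activation is negative for every input, the output is always zero, and the weight gradient is zero as well, so no gradient step can revive it.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A single ReLU neuron whose bias has been pushed far into negative territory
neuron = nn.Linear(10, 1)
with torch.no_grad():
    neuron.bias.fill_(-100.0)

x = torch.randn(64, 10)
out = torch.relu(neuron(x))  # pre-activations are all negative, so output is all zeros
out.sum().backward()

print(out.abs().max().item())                 # 0.0 -- the neuron never fires
print(neuron.weight.grad.abs().max().item())  # 0.0 -- and it receives no gradient
```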

Leaky ReLU and Parametric ReLU

Leaky ReLU addresses the dying ReLU problem by allowing a small, non-zero gradient when the input is negative.

Mathematical Definition

$$\text{LeakyReLU}(z) = \begin{cases} z & \text{if } z > 0 \\ \alpha z & \text{if } z \leq 0 \end{cases}$$

where:

  • $z$: the input value
  • $\alpha$: a small positive constant, typically 0.01
  • $\alpha z$: the scaled negative input, giving a small but non-zero output

The function can also be written compactly as:

$$\text{LeakyReLU}(z) = \max(\alpha z, z)$$

Parametric ReLU (PReLU)

PReLU takes this further by making $\alpha$ a learnable parameter:

$$\text{PReLU}(z) = \begin{cases} z & \text{if } z > 0 \\ \alpha z & \text{if } z \leq 0 \end{cases}$$

The key difference is that $\alpha$ is learned during training via backpropagation, allowing the network to determine the optimal slope for negative inputs.

Derivative

For both Leaky ReLU and PReLU:

$$\frac{\partial}{\partial z}\text{LeakyReLU}(z) = \begin{cases} 1 & \text{if } z > 0 \\ \alpha & \text{if } z \leq 0 \end{cases}$$

The small but non-zero gradient $\alpha$ for negative inputs allows the neuron to continue learning even when it receives negative pre-activations.
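
In PyTorch, the fixed-slope and learnable-slope variants are available as F.leaky_relu and nn.PReLU; a brief sketch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

z = torch.tensor([-3.0, -1.0, 0.0, 2.0])

# Fixed slope: alpha = 0.01 is the common default
print(F.leaky_relu(z, negative_slope=0.01))  # [-0.03, -0.01, 0.00, 2.00]

# Learnable slope: nn.PReLU initializes alpha to 0.25 and updates it via backprop
prelu = nn.PReLU(num_parameters=1)
print(prelu(z))
print(prelu.weight)  # the learnable alpha parameter
```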

Out[5]:
[Figure] ReLU vs Leaky ReLU. The small negative slope in Leaky ReLU (α = 0.1 shown for visibility) prevents neurons from dying while preserving the computational benefits of ReLU.

Exponential Linear Unit (ELU)

ELU provides smooth, differentiable transitions at zero and pushes mean activations closer to zero, which can accelerate learning.

Mathematical Definition

$$\text{ELU}(z) = \begin{cases} z & \text{if } z > 0 \\ \alpha (e^z - 1) & \text{if } z \leq 0 \end{cases}$$

where:

  • $z$: the input value
  • $\alpha$: a hyperparameter controlling the saturation value for negative inputs (typically 1.0)
  • $e^z - 1$: an exponential term that approaches $-1$ as $z \to -\infty$

For large negative inputs, ELU saturates at $-\alpha$. This saturation provides noise robustness by limiting the influence of large negative values.

The Derivative

$$\text{ELU}'(z) = \begin{cases} 1 & \text{if } z > 0 \\ \text{ELU}(z) + \alpha & \text{if } z \leq 0 \end{cases}$$

Note that for $z \leq 0$, the derivative can be computed from the function value itself, similar to sigmoid's self-referential gradient.

Scaled Exponential Linear Unit (SELU)

SELU is a self-normalizing variant of ELU. When used with proper weight initialization, SELU activations automatically converge to zero mean and unit variance, eliminating the need for batch normalization.

$$\text{SELU}(z) = \lambda \begin{cases} z & \text{if } z > 0 \\ \alpha (e^z - 1) & \text{if } z \leq 0 \end{cases}$$

where the specific values are:

  • $\lambda \approx 1.0507$
  • $\alpha \approx 1.6733$

These constants were derived analytically to ensure self-normalizing properties.
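
A quick way to see the self-normalizing behavior is to push random data through a deep stack of SELU layers with LeCun-normal weights (which the self-normalizing property assumes); the width and depth below are arbitrary choices for illustration:

```python
import torch

torch.manual_seed(0)

dim, depth = 256, 20
h = torch.randn(1024, dim)  # input with roughly zero mean, unit variance

for _ in range(depth):
    w = torch.randn(dim, dim) / dim**0.5  # LeCun-normal initialization
    h = torch.selu(h @ w)

# Activations should remain close to zero mean and unit variance
print(f"mean={h.mean().item():.3f}, std={h.std().item():.3f}")
```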

Out[6]:
[Figure] ELU and SELU compared to ReLU. ELU provides smooth negative outputs that saturate at -α, while SELU includes scaling factors for self-normalization.

Gaussian Error Linear Unit (GELU)

GELU has become the default activation function in transformer architectures like BERT and GPT. Unlike ReLU, which deterministically zeroes out negative values, GELU applies a smooth, probabilistic gating.

Intuition

GELU can be understood through a stochastic regularization lens. Imagine each neuron is randomly multiplied by either 0 or 1, where the probability of being "on" depends on how positive the input is. GELU computes the expected value of this stochastic process:

$$\text{GELU}(z) = z \cdot P(Z \leq z)$$

where $Z \sim \mathcal{N}(0, 1)$ is a standard normal random variable.

Mathematical Definition

The formal definition uses the cumulative distribution function (CDF) of the standard normal distribution:

$$\text{GELU}(z) = z \cdot \Phi(z)$$

where:

  • $z$: the input value
  • $\Phi(z)$: the CDF of the standard normal distribution, i.e., $\Phi(z) = P(Z \leq z)$
  • $z \cdot \Phi(z)$: the input scaled by the probability that a standard normal is less than $z$

The CDF $\Phi(z)$ is defined as:

$$\Phi(z) = \frac{1}{2}\left[1 + \text{erf}\left(\frac{z}{\sqrt{2}}\right)\right]$$

where $\text{erf}$ is the error function.

Practical Approximation

Computing the error function can be expensive. A commonly used approximation is:

$$\text{GELU}(z) \approx 0.5z\left(1 + \tanh\left[\sqrt{\frac{2}{\pi}}\left(z + 0.044715z^3\right)\right]\right)$$

This approximation is accurate to within a few percent and is computationally efficient.
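
For reference, here is a sketch of the exact erf-based form next to PyTorch's built-in (the approximate="tanh" argument assumes a reasonably recent PyTorch release):

```python
import torch
import torch.nn.functional as F

def gelu_exact(z):
    # GELU(z) = z * Phi(z), with Phi written in terms of the error function
    return z * 0.5 * (1.0 + torch.erf(z / 2**0.5))

z = torch.tensor([-2.0, -1.0, 0.0, 1.0, 2.0])
print(gelu_exact(z))
print(F.gelu(z))                      # exact erf-based GELU
print(F.gelu(z, approximate="tanh"))  # the tanh approximation
```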

The Derivative

The derivative of GELU involves both the CDF and the probability density function (PDF) of the standard normal:

$$\text{GELU}'(z) = \Phi(z) + z \cdot \phi(z)$$

where:

  • $\Phi(z)$: the standard normal CDF
  • $\phi(z) = \frac{1}{\sqrt{2\pi}}e^{-z^2/2}$: the standard normal PDF

For large positive $z$ the derivative approaches 1, for large negative $z$ it approaches 0, and it transitions smoothly in between rather than jumping at zero the way ReLU's derivative does.

Out[7]:
[Figure] GELU compared to ReLU. GELU provides a smooth, probabilistic gating mechanism where small negative values are gradually suppressed rather than hard-clipped to zero; the shaded region highlights where GELU allows some negative values to pass.

The key difference is visible near zero: GELU smoothly curves through the origin, allowing small negative values to pass through with reduced magnitude. This smooth gating is believed to help with gradient flow and model expressiveness.

Swish and Mish

Swish and Mish are modern activation functions discovered through automated search and designed to combine the benefits of ReLU-like behavior with smooth gradients.

Swish

Swish was discovered by Google researchers using automated search techniques:

$$\text{Swish}(z) = z \cdot \sigma(\beta z)$$

where:

  • $z$: the input value
  • $\sigma$: the sigmoid function
  • $\beta$: a learnable or fixed parameter (often $\beta = 1$)
  • $z \cdot \sigma(\beta z)$: the input scaled by the sigmoid of itself

When $\beta = 1$, this simplifies to:

$$\text{Swish}(z) = \frac{z}{1 + e^{-z}}$$

Swish is non-monotonic: it dips below zero for negative inputs before rising again. This allows small negative gradients to flow backward, potentially helping escape local minima.

Mish

Mish uses the softplus function and tanh for an even smoother curve:

$$\text{Mish}(z) = z \cdot \tanh(\text{softplus}(z))$$

where:

  • $\text{softplus}(z) = \ln(1 + e^z)$: a smooth approximation to ReLU
  • $\tanh$: the hyperbolic tangent function

Mish has continuous derivatives of all orders, making it particularly smooth. Like Swish, it is non-monotonic with a small negative region.
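
Mish is a one-liner from its definition (Swish appears again in the PyTorch section below); a sketch checked against PyTorch's built-in nn.Mish:

```python
import torch
import torch.nn.functional as F

def mish(z):
    # Mish(z) = z * tanh(softplus(z)) = z * tanh(ln(1 + e^z))
    return z * torch.tanh(F.softplus(z))

z = torch.tensor([-2.0, 0.0, 2.0])
print(mish(z))             # approximately [-0.2525, 0.0000, 1.9440]
print(torch.nn.Mish()(z))  # the built-in gives the same values
```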

Out[8]:
[Figure] Swish and Mish compared to ReLU. Both are smooth and non-monotonic, with a small dip below zero for negative inputs that can help gradient flow.

Comparing All Activation Functions

Let's visualize all the activation functions we've covered in a single comprehensive comparison. This 2×4 grid allows direct comparison of each function's behavior, particularly around zero where their differences are most apparent.

Out[9]:
[Figure] Comprehensive 2×4 comparison of sigmoid, tanh, ReLU, Leaky ReLU, ELU, GELU, Swish, and Mish. ReLU and its variants dominate modern deep learning, while GELU has become the standard for transformer architectures. Each panel uses the same x-axis range for direct comparison.

Implementation in PyTorch

Let's implement and visualize these activation functions using PyTorch. Modern deep learning frameworks provide optimized implementations of all common activations.

In[10]:
Code
import torch
import torch.nn as nn

# Create input tensor
z = torch.linspace(-4, 4, 200)

# Built-in activations
activations = {
    "Sigmoid": nn.Sigmoid(),
    "Tanh": nn.Tanh(),
    "ReLU": nn.ReLU(),
    "LeakyReLU": nn.LeakyReLU(negative_slope=0.1),
    "ELU": nn.ELU(alpha=1.0),
    "SELU": nn.SELU(),
    "GELU": nn.GELU(),
    "SiLU (Swish)": nn.SiLU(),
    "Mish": nn.Mish(),
}
Out[11]:
Console
PyTorch Activation Functions
========================================

Sigmoid:
  f(-2) = 0.1192
  f(0)  = 0.5000
  f(2)  = 0.8808

Tanh:
  f(-2) = -0.9640
  f(0)  = 0.0000
  f(2)  = 0.9640

ReLU:
  f(-2) = 0.0000
  f(0)  = 0.0000
  f(2)  = 2.0000

LeakyReLU:
  f(-2) = -0.2000
  f(0)  = 0.0000
  f(2)  = 2.0000

ELU:
  f(-2) = -0.8647
  f(0)  = 0.0000
  f(2)  = 2.0000

SELU:
  f(-2) = -1.5202
  f(0)  = 0.0000
  f(2)  = 2.1014

GELU:
  f(-2) = -0.0455
  f(0)  = 0.0000
  f(2)  = 1.9545

SiLU (Swish):
  f(-2) = -0.2384
  f(0)  = 0.0000
  f(2)  = 1.7616

Mish:
  f(-2) = -0.2525
  f(0)  = 0.0000
  f(2)  = 1.9440

The output shows each activation's behavior at three key points: a negative value, zero, and a positive value. Notice how ReLU completely zeros out negative inputs, while Leaky ReLU preserves a scaled version. GELU and Swish (SiLU) show smooth, non-zero responses even for negative inputs.

Custom Activation Implementation

You can also implement custom activations as simple functions:

In[12]:
Code
def custom_gelu_approx(z):
    """GELU using the tanh approximation."""
    c = torch.sqrt(torch.tensor(2.0 / torch.pi))
    return 0.5 * z * (1 + torch.tanh(c * (z + 0.044715 * z**3)))


def parametric_swish(z, beta=1.0):
    """Swish with adjustable beta parameter."""
    return z * torch.sigmoid(beta * z)
Out[13]:
Console
GELU Approximation vs Built-in:
-----------------------------------
Input    Built-in     Approx      
-1.0     -0.1587      -0.1588     
0.0      0.0000       0.0000      
1.0      0.8413       0.8412      
2.0      1.9545       1.9546      

The tanh approximation closely matches the exact GELU computation while being computationally efficient. The small differences are negligible in practice.

Choosing Activation Functions

Selecting the right activation function depends on your architecture, task, and computational constraints. Here is a practical guide based on current best practices.

For Hidden Layers in Feedforward Networks

Activation function recommendations for feedforward hidden layers:

| Activation | Best Use Case | Considerations |
| --- | --- | --- |
| ReLU | Default choice for most networks | Fast, effective, but watch for dying neurons |
| Leaky ReLU | When dying ReLU is a concern | Small computational overhead |
| ELU | When faster convergence is needed | More expensive than ReLU |
| SELU | Self-normalizing networks | Requires specific initialization |

For Transformer Architectures

GELU has become the de facto standard for transformers. Its smooth gating properties align well with the attention mechanism's probabilistic nature. Major models like BERT, GPT, and their variants all use GELU.

For Output Layers

Activation function recommendations for output layers by task type:

| Task | Activation | Output Range |
| --- | --- | --- |
| Binary classification | Sigmoid | (0, 1) |
| Multi-class classification | Softmax | (0, 1) per class, sum = 1 |
| Regression | None (linear) | Unbounded |
| Bounded regression | Sigmoid or Tanh | Scaled to target range |
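
As a sketch of how these choices appear in code (the shapes and losses here are illustrative assumptions, not a fixed recipe), note that PyTorch's BCEWithLogitsLoss and CrossEntropyLoss expect raw logits and apply the sigmoid or softmax internally for numerical stability:

```python
import torch

logits = torch.randn(4, 3)  # a batch of 4 examples, 3 outputs each

# Binary / multi-label classification: sigmoid per output
# (training typically uses nn.BCEWithLogitsLoss on the raw logits)
probs_binary = torch.sigmoid(logits)

# Multi-class classification: softmax across classes
# (training typically uses nn.CrossEntropyLoss on the raw logits)
probs_multiclass = torch.softmax(logits, dim=-1)
print(probs_multiclass.sum(dim=-1))  # each row sums to 1

# Regression: no activation, the raw outputs are the predictions
predictions = logits
```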

For Recurrent Networks (RNNs, LSTMs)

Tanh remains common for recurrent connections because:

  1. Zero-centered outputs help maintain stable hidden states
  2. The bounded range prevents activations from exploding
  3. Gate mechanisms (in LSTMs/GRUs) specifically rely on sigmoid's (0, 1) output

Practical Considerations

When implementing activation functions in production, consider these factors:

Computational Cost: ReLU is the fastest, requiring only a comparison and conditional. Exponential-based functions (sigmoid, tanh, ELU, SELU) are more expensive. GELU requires either the expensive error function or a polynomial approximation.

Numerical Stability: Sigmoid and softmax can overflow for very large inputs. Use numerically stable implementations that subtract the maximum value before exponentiating.
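
A sketch of the standard max-subtraction trick for softmax (PyTorch's built-in torch.softmax already applies it internally):

```python
import torch

def stable_softmax(z, dim=-1):
    # Subtracting the row-wise maximum before exponentiating prevents overflow
    # and leaves the result unchanged
    z_shifted = z - z.max(dim=dim, keepdim=True).values
    exp_z = torch.exp(z_shifted)
    return exp_z / exp_z.sum(dim=dim, keepdim=True)

z = torch.tensor([[1000.0, 1001.0, 1002.0]])  # naive exp() would overflow here
print(stable_softmax(z))         # approximately [[0.0900, 0.2447, 0.6652]]
print(torch.softmax(z, dim=-1))  # matches the built-in
```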

Initialization Compatibility: Some activations work best with specific initialization schemes. He initialization pairs well with ReLU variants, while Xavier/Glorot initialization suits tanh and sigmoid.

Gradient Flow: For very deep networks, favor activations with gradients that neither vanish nor explode. ReLU and its variants generally provide good gradient flow, while sigmoid and tanh can cause vanishing gradients.

Summary

Activation functions are the non-linear transformations that give neural networks their expressive power. The field has evolved from biologically-inspired functions like sigmoid to modern innovations designed for specific architectural needs.

Key takeaways:

  • Sigmoid and tanh were early standards but suffer from vanishing gradients in deep networks. Sigmoid remains useful for output layers when probabilities are needed.

  • ReLU revolutionized deep learning with its simple, sparse, and gradient-preserving properties. The dying ReLU problem led to variants like Leaky ReLU and ELU.

  • GELU provides smooth, probabilistic gating and has become the standard for transformer architectures. Its stochastic interpretation aligns well with the attention mechanism.

  • Swish and Mish offer smooth, non-monotonic alternatives that can outperform ReLU on certain tasks while maintaining good gradient flow.

  • For practical applications: Start with ReLU for general feedforward networks, GELU for transformers, and task-appropriate functions for output layers.

The choice of activation function can significantly impact training dynamics and final performance. While the differences may seem subtle, they compound across millions of neurons and billions of parameters in modern networks.
