Weight Initialization: Xavier, He & Variance Preservation for Deep Networks

Michael Brenndoerfer

Data, Analytics & AI Machine Learning Language AI Handbook

Learn why weight initialization matters for training neural networks. Covers Xavier and He initialization, variance propagation analysis, and practical PyTorch implementation.

Reading Level

Choose your expertise level to adjust how many terms are explained. Beginners see more tooltips, experts see fewer to maintain reading flow. Hover over underlined terms for instant definitions.

Weight InitializationLink Copied

Training a neural network means finding good values for its weights, but where do we start? The answer matters more than you might expect. If we initialize weights poorly, gradients can explode to infinity or vanish to zero, making learning impossible before it even begins. The right initialization sets the stage for stable, efficient training.

This chapter explores why weight initialization matters and how techniques like Xavier and He initialization solve the problems that plagued early deep networks. We'll derive these methods from first principles, implement them from scratch, and see how they enable training networks that would otherwise fail to learn.

The Need for Careful InitializationLink Copied

Consider what happens when we initialize all weights to zero. Each neuron in a layer receives the same input and produces the same output. During backpropagation, every neuron receives identical gradients. They all update identically, remaining the same as each other forever. The network has many neurons but effectively behaves as if it has one. This is the symmetry problem: if neurons start identical, they stay identical.

Symmetry Breaking

The requirement that neurons in the same layer be initialized differently so they can learn different features. Without symmetry breaking, all neurons compute the same function regardless of network width.

The following visualization contrasts zero initialization (where all weights look identical) with random initialization (where each weight has a distinct value):

Out[3]:

Visualization

Heatmap showing a uniformly gray 16x16 weight matrix with all zeros. — Zero initialization: all weights are identical, shown as a uniform gray. Every neuron in this layer will compute the same function, making the additional neurons useless.

Heatmap showing a colorful 16x16 weight matrix with varied values. — Random initialization: each weight has a distinct value. This diversity allows neurons to specialize during training and learn different features.

We need randomness to break symmetry. But random initialization introduces a new challenge: how do we choose the scale of our random values? Too large, and activations explode. Too small, and gradients vanish.

Let's see this problem in action:

In[4]:

Code

import numpy as np

np.random.seed(42)


def forward_pass(x, weights, activation="tanh"):
    """Propagate input through multiple layers."""
    activations = [x]

    for W in weights:
        x = x @ W
        if activation == "tanh":
            x = np.tanh(x)
        elif activation == "relu":
            x = np.maximum(0, x)
        activations.append(x)

    return activations


# Create a deep network with 10 layers
n_layers = 10
layer_size = 256
batch_size = 100

# Initialize with different scales
scales = [0.01, 0.1, 1.0, 2.0]

import numpy as np

np.random.seed(42)


def forward_pass(x, weights, activation="tanh"):
    """Propagate input through multiple layers."""
    activations = [x]

    for W in weights:
        x = x @ W
        if activation == "tanh":
            x = np.tanh(x)
        elif activation == "relu":
            x = np.maximum(0, x)
        activations.append(x)

    return activations


# Create a deep network with 10 layers
n_layers = 10
layer_size = 256
batch_size = 100

# Initialize with different scales
scales = [0.01, 0.1, 1.0, 2.0]

Out[5]:

Console

Effect of initialization scale on activations (tanh network):
------------------------------------------------------------
Scale 0.01: Final layer std = 0.000000

Scale 0.10: Final layer std = 0.633119
Scale 1.00: Final layer std = 0.973512

Scale 2.00: Final layer std = 0.988212

With scale 0.01, activations shrink to nearly zero. With scale 2.0, they saturate at the extremes of tanh ( $\pm 1$ ). Neither is good for learning. Scale 0.1 and 1.0 preserve more reasonable activation magnitudes, but which is optimal?

Out[6]:

Visualization

Violin plot showing activation distributions narrowing toward zero in deeper layers. — Activation distributions across layers with small initialization (scale=0.01). Activations progressively collapse toward zero, making gradients vanish and preventing learning in deeper layers.

Violin plot showing bimodal activation distributions concentrated at extreme values. — Activation distributions with large initialization (scale=2.0). Activations saturate at ±1 (tanh limits), where gradients are nearly zero. This also prevents effective learning.

The visualizations reveal two failure modes. With small initialization, activations progressively shrink layer by layer, eventually collapsing to near-zero values. The network loses the ability to distinguish different inputs. With large initialization, activations immediately saturate at the extremes of the activation function. Since the derivative of tanh near $\pm 1$ is close to zero, gradients vanish here too.

We need an initialization that preserves the variance of activations across layers, keeping them in the useful range where gradients flow effectively.

Variance Analysis of Forward PropagationLink Copied

The experiments above reveal that initialization scale matters, but they don't tell us what the right scale is. To find it, we need to shift from empirical observation to mathematical analysis. The key question is: how does the variance of activations change as we pass through each layer?

If we can answer this question, we can work backward to determine what weight variance will preserve activation variance across layers. This is the central insight behind principled initialization: treat variance preservation as a design constraint, then solve for the weights that satisfy it.

The Pre-Activation EquationLink Copied

Consider a single layer with $n_{\text{in}}$ input neurons and $n_{\text{out}}$ output neurons. Each output neuron computes a weighted sum of its inputs before applying the activation function. For a single output neuron, this pre-activation is:

z = \sum_{i=1}^{n_{\text{in}}} w_i x_i

where:

$z$ : the pre-activation value (before applying the activation function)
$w_i$ : the weight connecting input neuron $i$ to this output neuron
$x_i$ : the activation from input neuron $i$
$n_{\text{in}}$ : the number of input neurons (fan-in)

This equation is the starting point for our analysis. The pre-activation $z$ is a sum of many random terms, and we want to understand how its variance relates to the variances of the weights and inputs.

Deriving the Variance RelationshipLink Copied

Assuming $w_i$ and $x_i$ are independent random variables with zero mean (a reasonable assumption when weights are initialized symmetrically around zero and inputs are centered), we can derive the variance of the pre-activation step by step.

Step 1: Variance of a sum. Since variance is additive for independent random variables:

\text{Var}(z) = \text{Var}\left(\sum_{i=1}^{n_{\text{in}}} w_i x_i\right) = \sum_{i=1}^{n_{\text{in}}} \text{Var}(w_i x_i)

Step 2: Variance of a product. For two independent, zero-mean random variables $A$ and $B$ , we have $\text{Var}(AB) = \text{Var}(A) \cdot \text{Var}(B)$ . This follows because $\mathbb{E}[AB] = \mathbb{E}[A]\mathbb{E}[B] = 0$ and $\mathbb{E}[(AB)^2] = \mathbb{E}[A^2]\mathbb{E}[B^2] = \text{Var}(A) \cdot \text{Var}(B)$ .

Step 3: Combine. Assuming all weights have the same variance $\text{Var}(w)$ and all inputs have variance $\text{Var}(x)$ :

\text{Var}(z) = \sum_{i=1}^{n_{\text{in}}} \text{Var}(w_i) \cdot \text{Var}(x_i) = n_{\text{in}} \cdot \text{Var}(w) \cdot \text{Var}(x)

where:

$\text{Var}(z)$ : variance of the pre-activation output
$n_{\text{in}}$ : number of input neurons (fan-in)
$\text{Var}(w)$ : variance of each weight (assumed identical)
$\text{Var}(x)$ : variance of each input activation (assumed identical)

This result tells us something useful: the output variance equals the input variance multiplied by both the number of inputs and the weight variance. It immediately suggests why naive initialization fails. With 256 input neurons and unit-variance weights, the output variance would be 256 times larger than the input. Each layer amplifies the signal by a factor of $n_{\text{in}}$ , leading to exponential growth.

Fan-in and Fan-out

For a weight matrix connecting two layers, fan-in ( $n_{\text{in}}$ ) is the number of input connections per output neuron, and fan-out ( $n_{\text{out}}$ ) is the number of output connections per input neuron. For a fully connected layer, fan-in equals the input dimension and fan-out equals the output dimension.

Solving for the Optimal Weight VarianceLink Copied

Now we can solve for the weight variance that preserves signal magnitude. To maintain stable variance across layers, we want $\text{Var}(z) = \text{Var}(x)$ . Starting from our variance equation:

\text{Var}(z) = n_{\text{in}} \cdot \text{Var}(w) \cdot \text{Var}(x)

Setting $\text{Var}(z) = \text{Var}(x)$ and dividing both sides by $\text{Var}(x)$ :

1 = n_{\text{in}} \cdot \text{Var}(w)

Solving for the weight variance:

\text{Var}(w) = \frac{1}{n_{\text{in}}}

This is the key insight: the weight variance should scale inversely with the number of input connections. Intuitively, when we sum $n_{\text{in}}$ random terms, the total variance grows proportionally to $n_{\text{in}}$ . To counteract this growth and keep the output variance equal to the input variance, we must shrink each weight's variance by the factor $1/n_{\text{in}}$ .

Empirical VerificationLink Copied

Theory is only useful if it matches reality. Let's verify our analysis by comparing different weight variances and observing how activation variance propagates through a deep network:

In[7]:

Code

def analyze_variance_propagation(n_layers, layer_size, weight_variance):
    """Track variance across layers with specified weight variance."""
    weights = [
        np.random.randn(layer_size, layer_size) * np.sqrt(weight_variance)
        for _ in range(n_layers)
    ]

    x = np.random.randn(1000, layer_size)
    x = x / np.std(x)  # Normalize input to unit variance

    variances = [np.var(x)]

    for W in weights:
        x = x @ W  # Linear transformation only
        variances.append(np.var(x))

    return variances


# Compare different variance choices
variance_choices = {
    "Too small (1/n²)": 1 / (layer_size**2),
    "Just right (1/n)": 1 / layer_size,
    "Too large (1)": 1.0,
}

def analyze_variance_propagation(n_layers, layer_size, weight_variance):
    """Track variance across layers with specified weight variance."""
    weights = [
        np.random.randn(layer_size, layer_size) * np.sqrt(weight_variance)
        for _ in range(n_layers)
    ]

    x = np.random.randn(1000, layer_size)
    x = x / np.std(x)  # Normalize input to unit variance

    variances = [np.var(x)]

    for W in weights:
        x = x @ W  # Linear transformation only
        variances.append(np.var(x))

    return variances


# Compare different variance choices
variance_choices = {
    "Too small (1/n²)": 1 / (layer_size**2),
    "Just right (1/n)": 1 / layer_size,
    "Too large (1)": 1.0,
}

Out[8]:

Console

Variance propagation through 10 layers (no activation):
Layer size: 256
------------------------------------------------------------


Too small (1/n²):
  Initial variance: 1.0000
  Final variance:   8.1907e-25
  Ratio (final/initial): 8.1907e-25


Just right (1/n):
  Initial variance: 1.0000
  Final variance:   1.0744e+00
  Ratio (final/initial): 1.0744e+00


Too large (1):
  Initial variance: 1.0000
  Final variance:   1.2690e+24
  Ratio (final/initial): 1.2690e+24

With weight variance of $1/n$ , the output variance remains close to the input variance, confirming our analysis. The other choices cause variance to explode or collapse exponentially with depth.

Out[9]:

Visualization

Line plot showing variance across 10 layers for three initialization strategies, with 1/n remaining stable while others diverge. — Variance propagation through layers with different weight initialization scales. Only the 1/n variance (green) maintains stable variance across all layers. Too small variance causes exponential decay; too large causes exponential growth.

Xavier/Glorot InitializationLink Copied

Our variance analysis derived the optimal weight variance for the forward pass. But training a neural network involves both forward and backward propagation. Xavier Glorot and Yoshua Bengio, in their influential 2010 paper, recognized that we must consider both directions of signal flow.

Their insight was that gradients during backpropagation face the same variance amplification problem as activations during forward propagation, but in reverse. If we only optimize for forward variance, gradients might explode or vanish. The challenge is finding a weight variance that works reasonably well for both directions.

Forward Pass RequirementLink Copied

As we derived above, to maintain variance during forward propagation:

\text{Var}(w) = \frac{1}{n_{\text{in}}}

where $n_{\text{in}}$ is the fan-in (number of input connections to each neuron).

Backward Pass RequirementLink Copied

During backpropagation, gradients flow in the opposite direction. Each input neuron receives gradient contributions from all $n_{\text{out}}$ output neurons. By similar reasoning to the forward pass, to maintain gradient variance:

\text{Var}(w) = \frac{1}{n_{\text{out}}}

where $n_{\text{out}}$ is the fan-out (number of output neurons that receive input from each input neuron).

The CompromiseLink Copied

We now face a dilemma. The forward pass demands $\text{Var}(w) = 1/n_{\text{in}}$ . The backward pass demands $\text{Var}(w) = 1/n_{\text{out}}$ . These two requirements conflict unless $n_{\text{in}} = n_{\text{out}}$ , which is rare in practice.

Since we cannot satisfy both simultaneously, Glorot and Bengio proposed a compromise that balances both requirements. Rather than favoring one direction over the other, they take the average of the two fan values:

\text{Var}(w) = \frac{2}{n_{\text{in}} + n_{\text{out}}}

where:

$\text{Var}(w)$ : the variance of each weight in the layer
$n_{\text{in}}$ : fan-in (input dimension)
$n_{\text{out}}$ : fan-out (output dimension)

This is Xavier initialization (also called Glorot initialization). To sample weights from this distribution, we need to convert the variance to distribution parameters.

For a uniform distribution $U[-a, a]$ , the variance is $a^2/3$ . Setting this equal to our target variance:

\frac{a^2}{3} = \frac{2}{n_{\text{in}} + n_{\text{out}}}

Solving for $a$ :

a = \sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}}

So we sample:

w \sim U\left[-\sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}}, \sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}}\right]

For a normal distribution $\mathcal{N}(0, \sigma)$ , the variance is $\sigma^2$ . Setting this equal to our target variance:

\sigma^2 = \frac{2}{n_{\text{in}} + n_{\text{out}}}

Taking the square root:

\sigma = \sqrt{\frac{2}{n_{\text{in}} + n_{\text{out}}}}

So we sample:

w \sim \mathcal{N}\left(0, \sqrt{\frac{2}{n_{\text{in}} + n_{\text{out}}}}\right)

Xavier/Glorot Initialization

A weight initialization scheme where weights are drawn from a distribution with variance $2/(n_{\text{in}} + n_{\text{out}})$ . Designed to maintain variance of activations and gradients across layers for networks using tanh or sigmoid activations.

ImplementationLink Copied

With the formulas in hand, implementing Xavier initialization is straightforward. We need to sample weights from a distribution with the correct variance, choosing either uniform or normal based on preference.

The following visualization shows the resulting weight distributions for both uniform and normal variants:

Out[10]:

Visualization

Histogram showing uniform weight distribution with sharp cutoffs at the bounds. — Xavier uniform weight distribution. Weights are drawn from U[-0.108, 0.108] for a layer with fan_in=256 and fan_out=256, giving variance 2/(256+256) ≈ 0.0039.

Histogram showing bell-shaped normal weight distribution with smooth tails. — Xavier normal weight distribution. Weights are drawn from N(0, 0.0625) with the same target variance, but with unbounded tails allowing occasional larger weights.

In[11]:

Code

def xavier_uniform(shape):
    """Xavier initialization with uniform distribution."""
    fan_in, fan_out = shape
    limit = np.sqrt(6 / (fan_in + fan_out))
    return np.random.uniform(-limit, limit, shape)


def xavier_normal(shape):
    """Xavier initialization with normal distribution."""
    fan_in, fan_out = shape
    std = np.sqrt(2 / (fan_in + fan_out))
    return np.random.randn(*shape) * std


# Compare Xavier to naive initialization
def compare_initializations(n_layers, layer_size, init_func, activation="tanh"):
    """Track activations through network with given initialization."""
    weights = [init_func((layer_size, layer_size)) for _ in range(n_layers)]

    x = np.random.randn(1000, layer_size)
    activations = forward_pass(x, weights, activation=activation)

    return [np.std(act) for act in activations]

def xavier_uniform(shape):
    """Xavier initialization with uniform distribution."""
    fan_in, fan_out = shape
    limit = np.sqrt(6 / (fan_in + fan_out))
    return np.random.uniform(-limit, limit, shape)


def xavier_normal(shape):
    """Xavier initialization with normal distribution."""
    fan_in, fan_out = shape
    std = np.sqrt(2 / (fan_in + fan_out))
    return np.random.randn(*shape) * std


# Compare Xavier to naive initialization
def compare_initializations(n_layers, layer_size, init_func, activation="tanh"):
    """Track activations through network with given initialization."""
    weights = [init_func((layer_size, layer_size)) for _ in range(n_layers)]

    x = np.random.randn(1000, layer_size)
    activations = forward_pass(x, weights, activation=activation)

    return [np.std(act) for act in activations]

Out[12]:

Console

Activation standard deviations across layers (tanh network):
------------------------------------------------------------


Naive (std=1.0):
  Layer  0: std = 0.9993
  Layer  2: std = 0.9743
  Layer  4: std = 0.9740
  Layer  6: std = 0.9740
  Layer  8: std = 0.9737
  Layer 10: std = 0.9741


Small (std=0.01):
  Layer  0: std = 1.0012
  Layer  2: std = 0.0254
  Layer  4: std = 0.0007
  Layer  6: std = 0.0000
  Layer  8: std = 0.0000
  Layer 10: std = 0.0000


Xavier normal:
  Layer  0: std = 0.9989
  Layer  2: std = 0.4869
  Layer  4: std = 0.3565
  Layer  6: std = 0.2913
  Layer  8: std = 0.2521
  Layer 10: std = 0.2265


Xavier uniform:
  Layer  0: std = 0.9989
  Layer  2: std = 0.4850
  Layer  4: std = 0.3558
  Layer  6: std = 0.2927
  Layer  8: std = 0.2532
  Layer 10: std = 0.2250

Xavier initialization maintains activation variance much better than naive approaches. The standard deviation stays relatively stable across layers, preventing both vanishing and exploding activations.

Out[13]:

Visualization

Line plot comparing activation standard deviations across 10 layers for four initialization methods. — Activation standard deviation across layers for different initialization strategies in a tanh network. Xavier initialization (both uniform and normal variants) maintains stable activation magnitudes, while naive initialization causes saturation and small initialization causes vanishing activations.

He Initialization for ReLU NetworksLink Copied

Xavier initialization was a breakthrough for networks using tanh or sigmoid activations. But by the mid-2010s, ReLU (Rectified Linear Unit) had become the dominant activation function due to its simplicity and effectiveness in avoiding vanishing gradients. Unfortunately, Xavier initialization doesn't work well for ReLU networks.

The problem is that Xavier's derivation assumes the activation function is approximately linear around zero. For tanh and sigmoid, this is reasonable: near zero, they behave almost like the identity function. But ReLU is decidedly non-linear: it sets all negative values to zero, keeping only the positive half of the distribution.

Understanding ReLU's Variance HalvingLink Copied

Consider a pre-activation $z$ drawn from a symmetric distribution centered at zero with variance $\sigma^2$ . ReLU sets all negative values to zero while keeping positive values unchanged. Since the distribution is symmetric, approximately half the values are negative, so half become zero.

The variance of the output can be computed as:

\text{Var}(\text{ReLU}(z)) = \mathbb{E}[\text{ReLU}(z)^2] - \mathbb{E}[\text{ReLU}(z)]^2

For a symmetric zero-mean distribution, the expected value of the positive half is small, and the key term is $\mathbb{E}[\text{ReLU}(z)^2]$ . Since only positive values contribute, this equals approximately half of $\mathbb{E}[z^2] = \sigma^2$ :

\text{Var}(\text{ReLU}(z)) \approx \frac{\sigma^2}{2}

where:

$\text{ReLU}(z) = \max(0, z)$ : the rectified linear unit function
$\sigma^2$ : variance of the input pre-activation $z$

This variance halving has severe consequences for deep networks. After $L$ layers, the variance becomes approximately $\sigma^2 / 2^L$ , causing activations to shrink exponentially. A 10-layer network would see activations reduced by a factor of $2^{10} = 1024$ . A 20-layer network would see a reduction of over a million.

The following visualization makes this concrete by showing how ReLU transforms a symmetric input distribution:

Out[14]:

Visualization

Histogram showing symmetric Gaussian distribution centered at zero. — Input distribution: a symmetric Gaussian with variance 1.0. Half of the values are negative (left of red line).

Histogram showing only positive values after ReLU transformation. — After ReLU: all negative values become zero, reducing variance to approximately 0.5. The 50% of mass at zero is not shown in this density plot.

Demonstrating the ProblemLink Copied

Let's see this variance collapse in action by comparing Xavier initialization on tanh versus ReLU networks:

In[15]:

Code

# Compare Xavier on ReLU vs tanh
activation_comparison = {}
for activation in ["tanh", "relu"]:
    stds = compare_initializations(
        10, 256, xavier_normal, activation=activation
    )
    activation_comparison[activation] = stds

# Compare Xavier on ReLU vs tanh
activation_comparison = {}
for activation in ["tanh", "relu"]:
    stds = compare_initializations(
        10, 256, xavier_normal, activation=activation
    )
    activation_comparison[activation] = stds

Out[16]:

Console

Xavier initialization on different activations:
--------------------------------------------------

TANH:
  Initial std: 1.0004
  Final std:   0.228390
  Decay ratio: 0.2283

RELU:
  Initial std: 1.0006
  Final std:   0.025703
  Decay ratio: 0.0257

With tanh, Xavier initialization maintains activations reasonably well across 10 layers. With ReLU, however, activations decay dramatically. The decay ratio shows that ReLU activations shrink to a tiny fraction of their initial magnitude, confirming that Xavier is unsuitable for ReLU networks.

Deriving He InitializationLink Copied

Kaiming He and colleagues addressed this problem in their 2015 paper by modifying the variance analysis to account for ReLU's behavior. The derivation follows the same logic as Xavier, but incorporates the factor of 1/2 from ReLU's variance reduction.

Starting from our original variance equation and incorporating the ReLU effect:

\text{Var}(a) = \frac{1}{2} \cdot n_{\text{in}} \cdot \text{Var}(w) \cdot \text{Var}(x)

where:

$\text{Var}(a)$ : variance of the activation output (after ReLU)
$\frac{1}{2}$ : the variance reduction factor from ReLU
$n_{\text{in}}$ : fan-in (number of input connections)
$\text{Var}(w)$ : variance of the weights
$\text{Var}(x)$ : variance of the input activations

To maintain $\text{Var}(a) = \text{Var}(x)$ , we set:

\text{Var}(x) = \frac{1}{2} \cdot n_{\text{in}} \cdot \text{Var}(w) \cdot \text{Var}(x)

Dividing both sides by $\text{Var}(x)$ and solving for $\text{Var}(w)$ :

1 = \frac{1}{2} \cdot n_{\text{in}} \cdot \text{Var}(w)

\text{Var}(w) = \frac{2}{n_{\text{in}}}

This is He initialization (also called Kaiming initialization). The factor of 2 in the numerator compensates exactly for ReLU's variance-halving effect.

He/Kaiming Initialization

A weight initialization scheme where weights are drawn from a distribution with variance $2/n_{\text{in}}$ . Designed specifically for ReLU networks, accounting for the variance reduction caused by setting negative activations to zero.

Converting Variance to Distribution ParametersLink Copied

Just as with Xavier, we need to convert our target variance into the parameters of either a normal or uniform distribution.

For a normal distribution, we set the standard deviation to the square root of the target variance:

w \sim \mathcal{N}\left(0, \sqrt{\frac{2}{n_{\text{in}}}}\right)

where $\mathcal{N}(\mu, \sigma)$ denotes a normal distribution with mean $\mu$ and standard deviation $\sigma$ .

For a uniform distribution $U[-a, a]$ , using the same derivation as Xavier (variance of uniform is $a^2/3$ ), we solve $a^2/3 = 2/n_{\text{in}}$ to get:

w \sim U\left[-\sqrt{\frac{6}{n_{\text{in}}}}, \sqrt{\frac{6}{n_{\text{in}}}}\right]

Notice that He initialization uses only $n_{\text{in}}$ (fan-in) rather than averaging $n_{\text{in}}$ and $n_{\text{out}}$ like Xavier. This is because He et al. found that matching forward pass variance was more important for training deep ReLU networks. The backward pass still works reasonably well because ReLU's simple gradient (either 0 or 1) doesn't introduce the same variance distortion as its forward computation.

Implementation and ComparisonLink Copied

Let's implement He initialization and compare it directly to Xavier on a ReLU network:

In[17]:

Code

def he_uniform(shape):
    """He initialization with uniform distribution."""
    fan_in, fan_out = shape
    limit = np.sqrt(6 / fan_in)
    return np.random.uniform(-limit, limit, shape)


def he_normal(shape):
    """He initialization with normal distribution."""
    fan_in, fan_out = shape
    std = np.sqrt(2 / fan_in)
    return np.random.randn(*shape) * std

def he_uniform(shape):
    """He initialization with uniform distribution."""
    fan_in, fan_out = shape
    limit = np.sqrt(6 / fan_in)
    return np.random.uniform(-limit, limit, shape)


def he_normal(shape):
    """He initialization with normal distribution."""
    fan_in, fan_out = shape
    std = np.sqrt(2 / fan_in)
    return np.random.randn(*shape) * std

Out[18]:

Console

Comparing Xavier and He initialization on ReLU network:
------------------------------------------------------------


Xavier normal:
  Layer  0: std = 0.9984
  Layer  1: std = 0.5849
  Layer  2: std = 0.4156
  Layer  3: std = 0.2957
  Layer  4: std = 0.2027
  Layer  5: std = 0.1463
  Layer  6: std = 0.1108
  Layer  7: std = 0.0793
  Layer  8: std = 0.0602
  Layer  9: std = 0.0412
  Layer 10: std = 0.0265


He normal:
  Layer  0: std = 1.0005
  Layer  1: std = 0.8285
  Layer  2: std = 0.8354
  Layer  3: std = 0.8172
  Layer  4: std = 0.8400
  Layer  5: std = 0.8352
  Layer  6: std = 0.8401
  Layer  7: std = 0.7601
  Layer  8: std = 0.7407
  Layer  9: std = 0.6754
  Layer 10: std = 0.6912

He initialization maintains much more stable activation magnitudes in ReLU networks compared to Xavier.

Out[19]:

Visualization

Line plot showing activation standard deviation across layers, with He initialization maintaining stability while Xavier decays. — Comparison of Xavier and He initialization in a ReLU network. Xavier initialization causes activations to decay across layers because it doesn't account for ReLU's variance-halving effect. He initialization compensates for this and maintains stable activation magnitudes.

Initialization for Different ActivationsLink Copied

We've now seen how Xavier initialization works for symmetric activations and He initialization works for ReLU. But the deep learning toolkit includes many more activation functions, each with its own variance characteristics. The general principle remains the same: analyze how the activation affects variance and adjust the initialization accordingly.

The following table summarizes the recommended initialization for common activations:

Recommended initialization schemes for common activation functions. The variance formulas assume you're sampling from a normal distribution; adjust by factor of 3 for uniform.

Activation	Recommended Initialization	Weight Variance
Linear	Xavier	$\frac{2}{n_{\text{in}} + n_{\text{out}}}$
Tanh	Xavier	$\frac{2}{n_{\text{in}} + n_{\text{out}}}$
Sigmoid	Xavier	$\frac{2}{n_{\text{in}} + n_{\text{out}}}$
ReLU	He	$\frac{2}{n_{\text{in}}}$
Leaky ReLU	He (adjusted)	$\frac{2}{(1 + \alpha^2) n_{\text{in}}}$
SELU	LeCun	$\frac{1}{n_{\text{in}}}$

The Leaky ReLU CaseLink Copied

For Leaky ReLU with negative slope $\alpha$ , the situation is more nuanced than standard ReLU. Instead of zeroing negative inputs, Leaky ReLU scales them by a small factor $\alpha$ (typically 0.01 or 0.2). This preserves some information from the negative half of the distribution, meaning the variance reduction is less severe than standard ReLU.

The adjusted variance formula is:

\text{Var}(w) = \frac{2}{(1 + \alpha^2) \cdot n_{\text{in}}}

where:

$\alpha$ : the negative slope of Leaky ReLU (typically 0.01 or 0.2)
$n_{\text{in}}$ : fan-in (number of input connections)
$(1 + \alpha^2)$ : a correction factor that accounts for variance from both positive inputs (coefficient 1) and negative inputs (coefficient $\alpha^2$ )

The denominator $(1 + \alpha^2)$ arises because variance scales with the square of the coefficient. When $\alpha = 0$ (standard ReLU), the factor becomes $(1 + 0) = 1$ , giving us $\text{Var}(w) = 2/n_{\text{in}}$ , which is He initialization. When $\alpha = 1$ (linear activation), the factor becomes $(1 + 1) = 2$ , giving $\text{Var}(w) = 1/n_{\text{in}}$ , which approaches Xavier initialization for the fan-in-only case.

A General Initialization FunctionLink Copied

Rather than implementing a separate function for each activation, we can create a unified initialization function that computes the appropriate variance based on the activation type:

In[20]:

Code

def init_weights(
    shape, activation="relu", mode="fan_in", distribution="normal"
):
    """
    Initialize weights based on activation function.

    Parameters:
    - shape: (fan_in, fan_out)
    - activation: 'linear', 'tanh', 'sigmoid', 'relu', 'leaky_relu', 'selu'
    - mode: 'fan_in', 'fan_out', or 'fan_avg'
    - distribution: 'normal' or 'uniform'
    """
    fan_in, fan_out = shape

    # Determine fan value based on mode
    if mode == "fan_in":
        fan = fan_in
    elif mode == "fan_out":
        fan = fan_out
    else:  # fan_avg
        fan = (fan_in + fan_out) / 2

    # Determine gain based on activation
    gain = {
        "linear": 1.0,
        "tanh": 5 / 3,  # Approximate gain for tanh
        "sigmoid": 1.0,
        "relu": np.sqrt(2),
        "leaky_relu": np.sqrt(2 / (1 + 0.01**2)),  # Default negative slope 0.01
        "selu": 1.0,
    }.get(activation, 1.0)

    std = gain / np.sqrt(fan)

    if distribution == "normal":
        return np.random.randn(*shape) * std
    else:  # uniform
        limit = np.sqrt(3) * std
        return np.random.uniform(-limit, limit, shape)

def init_weights(
    shape, activation="relu", mode="fan_in", distribution="normal"
):
    """
    Initialize weights based on activation function.

    Parameters:
    - shape: (fan_in, fan_out)
    - activation: 'linear', 'tanh', 'sigmoid', 'relu', 'leaky_relu', 'selu'
    - mode: 'fan_in', 'fan_out', or 'fan_avg'
    - distribution: 'normal' or 'uniform'
    """
    fan_in, fan_out = shape

    # Determine fan value based on mode
    if mode == "fan_in":
        fan = fan_in
    elif mode == "fan_out":
        fan = fan_out
    else:  # fan_avg
        fan = (fan_in + fan_out) / 2

    # Determine gain based on activation
    gain = {
        "linear": 1.0,
        "tanh": 5 / 3,  # Approximate gain for tanh
        "sigmoid": 1.0,
        "relu": np.sqrt(2),
        "leaky_relu": np.sqrt(2 / (1 + 0.01**2)),  # Default negative slope 0.01
        "selu": 1.0,
    }.get(activation, 1.0)

    std = gain / np.sqrt(fan)

    if distribution == "normal":
        return np.random.randn(*shape) * std
    else:  # uniform
        limit = np.sqrt(3) * std
        return np.random.uniform(-limit, limit, shape)

Out[21]:

Console

Testing general initialization function:
--------------------------------------------------
relu        : std = 0.0889, range = [-0.349, 0.379]
tanh        : std = 0.1040, range = [-0.431, 0.423]
leaky_relu  : std = 0.0883, range = [-0.393, 0.363]
selu        : std = 0.0623, range = [-0.271, 0.273]

The ReLU initialization has the largest standard deviation because it uses the factor of $\sqrt{2}$ to compensate for variance halving. Tanh uses a gain of approximately 5/3 to account for its non-linearity. SELU uses a standard LeCun initialization with variance $1/n_{\text{in}}$ , resulting in the smallest standard deviation.

Gradient AnalysisLink Copied

So far, we've focused on how activations propagate forward. But neural network training depends equally on how gradients propagate backward. A good initialization must preserve gradient magnitudes during backpropagation; otherwise, learning signals won't reach the early layers.

Let's analyze gradient flow and verify that our initializations maintain stable gradients across network depth:

In[22]:

Code

def compute_gradients(x, weights, activation="tanh"):
    """Compute gradients via backpropagation."""
    # Forward pass storing pre-activations
    activations = [x]
    pre_activations = []

    for W in weights:
        z = activations[-1] @ W
        pre_activations.append(z)
        if activation == "tanh":
            a = np.tanh(z)
        elif activation == "relu":
            a = np.maximum(0, z)
        activations.append(a)

    # Backward pass
    gradients = []
    delta = np.ones_like(
        activations[-1]
    )  # Start with ones as upstream gradient

    for i in range(len(weights) - 1, -1, -1):
        # Gradient through activation
        if activation == "tanh":
            grad_act = 1 - np.tanh(pre_activations[i]) ** 2
        elif activation == "relu":
            grad_act = (pre_activations[i] > 0).astype(float)

        delta = delta * grad_act

        # Gradient w.r.t. weights
        grad_w = activations[i].T @ delta
        gradients.insert(0, grad_w)

        # Gradient for next layer
        delta = delta @ weights[i].T

    return gradients

def compute_gradients(x, weights, activation="tanh"):
    """Compute gradients via backpropagation."""
    # Forward pass storing pre-activations
    activations = [x]
    pre_activations = []

    for W in weights:
        z = activations[-1] @ W
        pre_activations.append(z)
        if activation == "tanh":
            a = np.tanh(z)
        elif activation == "relu":
            a = np.maximum(0, z)
        activations.append(a)

    # Backward pass
    gradients = []
    delta = np.ones_like(
        activations[-1]
    )  # Start with ones as upstream gradient

    for i in range(len(weights) - 1, -1, -1):
        # Gradient through activation
        if activation == "tanh":
            grad_act = 1 - np.tanh(pre_activations[i]) ** 2
        elif activation == "relu":
            grad_act = (pre_activations[i] > 0).astype(float)

        delta = delta * grad_act

        # Gradient w.r.t. weights
        grad_w = activations[i].T @ delta
        gradients.insert(0, grad_w)

        # Gradient for next layer
        delta = delta @ weights[i].T

    return gradients

Out[23]:

Visualization

Bar chart showing gradient magnitudes across 10 layers remaining relatively stable. — Gradient magnitudes through a tanh network with Xavier initialization. Gradients remain relatively stable across layers, enabling effective learning even in deeper layers.

Line plot comparing gradient norms for Xavier and He initialization in ReLU network. — Gradient magnitudes through a ReLU network comparing Xavier and He initialization. He initialization maintains more stable gradient flow, while Xavier causes gradients to grow or shrink depending on network depth.

The gradient analysis confirms what we learned from activation analysis: Xavier initialization works well for tanh, maintaining reasonable gradient magnitudes across layers. For ReLU networks, He initialization provides more stable gradient flow.

Practical Implementation with PyTorchLink Copied

Modern deep learning frameworks provide built-in initialization functions. Let's see how to use PyTorch's initialization utilities:

In[24]:

Code

import torch.nn as nn


def create_network_with_init(hidden_sizes, activation="relu", init_method="he"):
    """Create a network with specified initialization."""
    layers = []

    for i in range(len(hidden_sizes) - 1):
        linear = nn.Linear(hidden_sizes[i], hidden_sizes[i + 1])

        # Apply initialization
        if init_method == "xavier_uniform":
            nn.init.xavier_uniform_(linear.weight)
        elif init_method == "xavier_normal":
            nn.init.xavier_normal_(linear.weight)
        elif init_method == "he_uniform":
            nn.init.kaiming_uniform_(linear.weight, nonlinearity="relu")
        elif init_method == "he_normal":
            nn.init.kaiming_normal_(linear.weight, nonlinearity="relu")
        elif init_method == "zeros":
            nn.init.zeros_(linear.weight)
        elif init_method == "ones":
            nn.init.ones_(linear.weight)

        # Initialize biases to zero (common practice)
        nn.init.zeros_(linear.bias)

        layers.append(linear)

        # Add activation
        if activation == "relu" and i < len(hidden_sizes) - 2:
            layers.append(nn.ReLU())
        elif activation == "tanh" and i < len(hidden_sizes) - 2:
            layers.append(nn.Tanh())

    return nn.Sequential(*layers)


# Create networks with different initializations
hidden_sizes = [784, 256, 256, 256, 10]

import torch.nn as nn


def create_network_with_init(hidden_sizes, activation="relu", init_method="he"):
    """Create a network with specified initialization."""
    layers = []

    for i in range(len(hidden_sizes) - 1):
        linear = nn.Linear(hidden_sizes[i], hidden_sizes[i + 1])

        # Apply initialization
        if init_method == "xavier_uniform":
            nn.init.xavier_uniform_(linear.weight)
        elif init_method == "xavier_normal":
            nn.init.xavier_normal_(linear.weight)
        elif init_method == "he_uniform":
            nn.init.kaiming_uniform_(linear.weight, nonlinearity="relu")
        elif init_method == "he_normal":
            nn.init.kaiming_normal_(linear.weight, nonlinearity="relu")
        elif init_method == "zeros":
            nn.init.zeros_(linear.weight)
        elif init_method == "ones":
            nn.init.ones_(linear.weight)

        # Initialize biases to zero (common practice)
        nn.init.zeros_(linear.bias)

        layers.append(linear)

        # Add activation
        if activation == "relu" and i < len(hidden_sizes) - 2:
            layers.append(nn.ReLU())
        elif activation == "tanh" and i < len(hidden_sizes) - 2:
            layers.append(nn.Tanh())

    return nn.Sequential(*layers)


# Create networks with different initializations
hidden_sizes = [784, 256, 256, 256, 10]

Out[25]:

Console

PyTorch weight initialization examples:
--------------------------------------------------

xavier_uniform:
  Shape: (256, 784)
  Mean:  -0.000082
  Std:   0.0438
  Range: [-0.0760, 0.0760]

xavier_normal:
  Shape: (256, 784)
  Mean:  0.000045
  Std:   0.0439
  Range: [-0.1979, 0.2097]

he_uniform:
  Shape: (256, 784)
  Mean:  0.000135
  Std:   0.0505
  Range: [-0.0875, 0.0875]

he_normal:
  Shape: (256, 784)
  Mean:  0.000129
  Std:   0.0503
  Range: [-0.2260, 0.2389]

All initialization methods produce weights centered at zero, as expected. The He methods have slightly larger standard deviations than Xavier because they use only fan-in rather than averaging fan-in and fan-out. The uniform variants show tighter ranges compared to normal variants because the uniform distribution has bounded support.

Training ComparisonLink Copied

Let's compare training dynamics with different initializations on a simple classification task:

In[26]:

Code

import torch
from torch.utils.data import DataLoader, TensorDataset
import torch.optim as optim


def train_model(model, train_loader, epochs=50, lr=0.01):
    """Train model and return loss history."""
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(model.parameters(), lr=lr)

    losses = []

    for epoch in range(epochs):
        epoch_loss = 0
        for X, y in train_loader:
            optimizer.zero_grad()
            outputs = model(X)
            loss = criterion(outputs, y)
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()

        losses.append(epoch_loss / len(train_loader))

    return losses


# Create synthetic dataset
torch.manual_seed(42)
n_samples = 1000
n_features = 100
n_classes = 10

X = torch.randn(n_samples, n_features)
y = torch.randint(0, n_classes, (n_samples,))

dataset = TensorDataset(X, y)
train_loader = DataLoader(dataset, batch_size=32, shuffle=True)

# Train with different initializations
hidden_sizes = [100, 64, 64, 10]
init_methods = ["xavier_normal", "he_normal", "zeros"]

import torch
from torch.utils.data import DataLoader, TensorDataset
import torch.optim as optim


def train_model(model, train_loader, epochs=50, lr=0.01):
    """Train model and return loss history."""
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(model.parameters(), lr=lr)

    losses = []

    for epoch in range(epochs):
        epoch_loss = 0
        for X, y in train_loader:
            optimizer.zero_grad()
            outputs = model(X)
            loss = criterion(outputs, y)
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()

        losses.append(epoch_loss / len(train_loader))

    return losses


# Create synthetic dataset
torch.manual_seed(42)
n_samples = 1000
n_features = 100
n_classes = 10

X = torch.randn(n_samples, n_features)
y = torch.randint(0, n_classes, (n_samples,))

dataset = TensorDataset(X, y)
train_loader = DataLoader(dataset, batch_size=32, shuffle=True)

# Train with different initializations
hidden_sizes = [100, 64, 64, 10]
init_methods = ["xavier_normal", "he_normal", "zeros"]

Out[27]:

Visualization

Line plot showing training loss over 50 epochs for three initialization methods, with zero init remaining high and flat. — Training loss curves for networks with different weight initializations. Xavier and He initialization enable quick convergence, while zero initialization fails completely (flat loss curve) because all neurons learn identically due to the symmetry problem.

The zero initialization fails to learn at all, stuck at the initial loss. Xavier and He initialization both enable learning, with similar convergence curves for this shallow network. The difference between Xavier and He becomes more pronounced in deeper networks with ReLU activations.

Bias InitializationLink Copied

While we've focused on weight initialization, biases also need consideration. The common practice is to initialize biases to zero:

In[28]:

Code

# Bias initialization options
def init_biases(layer, method="zeros"):
    """Initialize biases with specified method."""
    if method == "zeros":
        nn.init.zeros_(layer.bias)
    elif method == "small_positive":
        nn.init.constant_(layer.bias, 0.01)
    elif method == "normal":
        nn.init.normal_(layer.bias, mean=0, std=0.01)

# Bias initialization options
def init_biases(layer, method="zeros"):
    """Initialize biases with specified method."""
    if method == "zeros":
        nn.init.zeros_(layer.bias)
    elif method == "small_positive":
        nn.init.constant_(layer.bias, 0.01)
    elif method == "normal":
        nn.init.normal_(layer.bias, mean=0, std=0.01)

Zero biases are almost always a safe choice. Unlike weights, initializing biases to the same value doesn't cause symmetry problems because the gradients for biases depend on the (different) input-weight products.

For ReLU networks, some practitioners use small positive biases (like 0.01) to ensure neurons are active initially. This prevents "dead neurons" that never activate because their inputs are always negative. However, modern techniques like batch normalization and careful learning rate scheduling have reduced the importance of this trick.

The following visualization shows how initialization scale affects the fraction of dead neurons in a ReLU network:

Out[29]:

Visualization

Line plot showing dead neuron fraction vs initialization scale, with high death rate for small scales. — Fraction of dead neurons (neurons that output zero for all inputs in a batch) as a function of initialization scale. Very small scales cause most pre-activations to be negative, killing the majority of neurons. The optimal range keeps dead neurons below 50%, which matches ReLU's expected behavior.

Diagnosing Initialization ProblemsLink Copied

When a network fails to train, initialization is a common culprit. Here are diagnostic techniques:

In[30]:

Code

def diagnose_initialization(model, sample_input):
    """Diagnose potential initialization problems."""
    model.eval()

    activations = {}
    gradients = {}

    # Register hooks to capture activations and gradients
    def save_activation(name):
        def hook(module, input, output):
            activations[name] = output.detach()

        return hook

    def save_gradient(name):
        def hook(module, grad_input, grad_output):
            gradients[name] = grad_output[0].detach()

        return hook

    # Register hooks on all layers
    handles = []
    for name, module in model.named_modules():
        if isinstance(module, (nn.Linear, nn.Conv2d)):
            handles.append(module.register_forward_hook(save_activation(name)))
            handles.append(
                module.register_full_backward_hook(save_gradient(name))
            )

    # Forward and backward pass
    model.train()
    output = model(sample_input)
    loss = output.sum()
    loss.backward()

    # Clean up hooks
    for h in handles:
        h.remove()

    return activations, gradients


# Diagnose a model
model = create_network_with_init([100, 64, 64, 10], init_method="he_normal")
sample = torch.randn(32, 100)
acts, grads = diagnose_initialization(model, sample)

def diagnose_initialization(model, sample_input):
    """Diagnose potential initialization problems."""
    model.eval()

    activations = {}
    gradients = {}

    # Register hooks to capture activations and gradients
    def save_activation(name):
        def hook(module, input, output):
            activations[name] = output.detach()

        return hook

    def save_gradient(name):
        def hook(module, grad_input, grad_output):
            gradients[name] = grad_output[0].detach()

        return hook

    # Register hooks on all layers
    handles = []
    for name, module in model.named_modules():
        if isinstance(module, (nn.Linear, nn.Conv2d)):
            handles.append(module.register_forward_hook(save_activation(name)))
            handles.append(
                module.register_full_backward_hook(save_gradient(name))
            )

    # Forward and backward pass
    model.train()
    output = model(sample_input)
    loss = output.sum()
    loss.backward()

    # Clean up hooks
    for h in handles:
        h.remove()

    return activations, gradients


# Diagnose a model
model = create_network_with_init([100, 64, 64, 10], init_method="he_normal")
sample = torch.randn(32, 100)
acts, grads = diagnose_initialization(model, sample)

Out[31]:

Console

Initialization Diagnostics:
--------------------------------------------------

Activation statistics:
  0: mean=-0.0232, std=1.4136, dead=0.0%
  2: mean=-0.1378, std=1.3990, dead=0.0%
  4: mean=-0.4532, std=1.2005, dead=0.0%

Gradient statistics:
  4: mean=1.000000, std=0.000000
  2: mean=-0.041939, std=0.348282
  0: mean=-0.043756, std=0.303204

Key warning signs include:

Activations near zero or saturated: Indicates vanishing activations
Large gradient variance: May cause unstable training
High percentage of dead neurons: Common in ReLU networks with poor initialization
Extreme activation values: Risk of numerical overflow

Out[32]:

Visualization

Bar chart showing activation mean and standard deviation across network layers. — Activation statistics across layers in a network with He initialization. The bar heights show mean activation values with error bars indicating standard deviation. Consistent magnitudes indicate healthy initialization.

Bar chart showing gradient standard deviation across network layers on log scale. — Gradient statistics across layers. The y-axis (log scale) shows gradient standard deviation. Stable gradients indicate proper signal flow during backpropagation.

Modern Practices and AlternativesLink Copied

While Xavier and He initialization remain the defaults, modern deep learning has introduced techniques that reduce sensitivity to initialization:

Batch NormalizationLink Copied

Batch normalization normalizes activations during training, which provides some robustness to initialization:

In[33]:

Code

class NetworkWithBatchNorm(nn.Module):
    def __init__(self, sizes):
        super().__init__()
        layers = []
        for i in range(len(sizes) - 1):
            layers.append(nn.Linear(sizes[i], sizes[i + 1]))
            if i < len(sizes) - 2:  # No batchnorm on last layer
                layers.append(nn.BatchNorm1d(sizes[i + 1]))
                layers.append(nn.ReLU())
        self.network = nn.Sequential(*layers)

        # Use intentionally bad initialization to show BatchNorm robustness
        for m in self.modules():
            if isinstance(m, nn.Linear):
                nn.init.normal_(m.weight, std=0.5)  # Large std

    def forward(self, x):
        return self.network(x)

class NetworkWithBatchNorm(nn.Module):
    def __init__(self, sizes):
        super().__init__()
        layers = []
        for i in range(len(sizes) - 1):
            layers.append(nn.Linear(sizes[i], sizes[i + 1]))
            if i < len(sizes) - 2:  # No batchnorm on last layer
                layers.append(nn.BatchNorm1d(sizes[i + 1]))
                layers.append(nn.ReLU())
        self.network = nn.Sequential(*layers)

        # Use intentionally bad initialization to show BatchNorm robustness
        for m in self.modules():
            if isinstance(m, nn.Linear):
                nn.init.normal_(m.weight, std=0.5)  # Large std

    def forward(self, x):
        return self.network(x)

Layer NormalizationLink Copied

Common in transformers, layer normalization similarly helps with training stability:

In[34]:

Code

class TransformerBlock(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

        # Special initialization for residual connections
        # Scale down the last layer to preserve residual signal
        nn.init.zeros_(self.ff[-1].weight)

    def forward(self, x):
        x = x + self.ff(self.norm1(x))
        return x

class TransformerBlock(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

        # Special initialization for residual connections
        # Scale down the last layer to preserve residual signal
        nn.init.zeros_(self.ff[-1].weight)

    def forward(self, x):
        x = x + self.ff(self.norm1(x))
        return x

Residual NetworksLink Copied

For networks with skip connections, careful initialization of residual branches is important:

In[35]:

Code

def init_residual_block(block, scale=1.0):
    """Initialize residual block with scaling for stable training."""
    for m in block.modules():
        if isinstance(m, nn.Linear):
            nn.init.kaiming_normal_(m.weight)
            if m.bias is not None:
                nn.init.zeros_(m.bias)

    # Scale down the last layer's weights
    # This ensures residual starts close to identity
    if hasattr(block, "fc2"):
        block.fc2.weight.data *= scale

def init_residual_block(block, scale=1.0):
    """Initialize residual block with scaling for stable training."""
    for m in block.modules():
        if isinstance(m, nn.Linear):
            nn.init.kaiming_normal_(m.weight)
            if m.bias is not None:
                nn.init.zeros_(m.bias)

    # Scale down the last layer's weights
    # This ensures residual starts close to identity
    if hasattr(block, "fc2"):
        block.fc2.weight.data *= scale

The scaling factor (often $1/\sqrt{n}$ where $n$ is the number of residual blocks) prevents the variance from growing linearly with depth.

Worked Example: Initializing a Language ModelLink Copied

Let's walk through initializing a small transformer-style language model:

In[36]:

Code

class SmallLanguageModel(nn.Module):
    def __init__(self, vocab_size, d_model, n_heads, n_layers):
        super().__init__()

        # Token embeddings - special initialization
        self.embedding = nn.Embedding(vocab_size, d_model)

        # Transformer layers
        self.layers = nn.ModuleList(
            [TransformerLayer(d_model, n_heads) for _ in range(n_layers)]
        )

        # Output projection - tie with embedding
        self.output = nn.Linear(d_model, vocab_size, bias=False)
        self.output.weight = self.embedding.weight  # Weight tying

        # Apply initialization
        self._init_weights()

    def _init_weights(self):
        """Initialize all weights following GPT-2 conventions."""
        for module in self.modules():
            if isinstance(module, nn.Linear):
                # Normal initialization with small std
                torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
                if module.bias is not None:
                    torch.nn.init.zeros_(module.bias)
            elif isinstance(module, nn.Embedding):
                # Embeddings also use small std
                torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            elif isinstance(module, nn.LayerNorm):
                torch.nn.init.ones_(module.weight)
                torch.nn.init.zeros_(module.bias)

    def forward(self, x):
        x = self.embedding(x)
        for layer in self.layers:
            x = layer(x)
        return self.output(x)


class TransformerLayer(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        # Pre-norm architecture
        x_norm = self.norm1(x)
        attn_out, _ = self.attn(x_norm, x_norm, x_norm)
        x = x + attn_out
        x = x + self.ff(self.norm2(x))
        return x

class SmallLanguageModel(nn.Module):
    def __init__(self, vocab_size, d_model, n_heads, n_layers):
        super().__init__()

        # Token embeddings - special initialization
        self.embedding = nn.Embedding(vocab_size, d_model)

        # Transformer layers
        self.layers = nn.ModuleList(
            [TransformerLayer(d_model, n_heads) for _ in range(n_layers)]
        )

        # Output projection - tie with embedding
        self.output = nn.Linear(d_model, vocab_size, bias=False)
        self.output.weight = self.embedding.weight  # Weight tying

        # Apply initialization
        self._init_weights()

    def _init_weights(self):
        """Initialize all weights following GPT-2 conventions."""
        for module in self.modules():
            if isinstance(module, nn.Linear):
                # Normal initialization with small std
                torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
                if module.bias is not None:
                    torch.nn.init.zeros_(module.bias)
            elif isinstance(module, nn.Embedding):
                # Embeddings also use small std
                torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            elif isinstance(module, nn.LayerNorm):
                torch.nn.init.ones_(module.weight)
                torch.nn.init.zeros_(module.bias)

    def forward(self, x):
        x = self.embedding(x)
        for layer in self.layers:
            x = layer(x)
        return self.output(x)


class TransformerLayer(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        # Pre-norm architecture
        x_norm = self.norm1(x)
        attn_out, _ = self.attn(x_norm, x_norm, x_norm)
        x = x + attn_out
        x = x + self.ff(self.norm2(x))
        return x

Out[37]:

Console

Language Model Initialization Statistics:
------------------------------------------------------------
embedding.weight                        : mean=-0.000006, std=0.0200
layers.0.attn.in_proj_weight            : mean=-0.000091, std=0.0441
layers.0.attn.out_proj.weight           : mean=-0.000004, std=0.0200
layers.0.ff.0.weight                    : mean=-0.000021, std=0.0200
layers.0.ff.2.weight                    : mean=-0.000039, std=0.0200
layers.1.attn.in_proj_weight            : mean=-0.000295, std=0.0442
layers.1.attn.out_proj.weight           : mean=-0.000083, std=0.0200
layers.1.ff.0.weight                    : mean=-0.000057, std=0.0200
layers.1.ff.2.weight                    : mean=0.000006, std=0.0200
layers.2.attn.in_proj_weight            : mean=-0.000009, std=0.0442
layers.2.attn.out_proj.weight           : mean=-0.000128, std=0.0200
layers.2.ff.0.weight                    : mean=-0.000020, std=0.0200
layers.2.ff.2.weight                    : mean=0.000053, std=0.0200
layers.3.attn.in_proj_weight            : mean=-0.000057, std=0.0443
layers.3.attn.out_proj.weight           : mean=0.000127, std=0.0199
layers.3.ff.0.weight                    : mean=-0.000018, std=0.0200
layers.3.ff.2.weight                    : mean=-0.000031, std=0.0200

The GPT-2 initialization uses a small standard deviation (0.02) for all linear layers. This relatively uniform approach works because layer normalization handles the variance normalization during training.

Limitations and ImpactLink Copied

Weight initialization, while important, is not a magic solution to all training difficulties.

Initialization is just the starting point. A good initialization sets up favorable conditions for learning, but it cannot compensate for basic problems like inappropriate architectures, poor data quality, or incorrect hyperparameters. A network with He initialization will still fail if the learning rate is orders of magnitude too large.

Normalization techniques reduce sensitivity. Batch normalization and layer normalization have reduced the importance of precise initialization. Networks with these components can often train successfully with a wider range of initialization scales. This is one reason modern transformer architectures use a simple standard deviation of 0.02 rather than layer-specific calculations.

Very deep networks remain challenging. For networks with hundreds of layers (like some vision transformers), even careful initialization may not prevent training instabilities. Additional techniques like learning rate warmup, gradient clipping, and careful residual scaling become necessary.

Despite these limitations, understanding initialization principles remains valuable. When a network fails to train, checking initialization is often a productive first step. The mathematical framework we developed, analyzing how variance propagates through layers, provides insight into network behavior that extends beyond just initialization.

Key ParametersLink Copied

The following parameters control weight initialization behavior in PyTorch:

PyTorch initialization parameters. Themode parameter controls fan calculation, while nonlinearitysets the gain factor for the activation function.

Parameter	Values	Effect
`mode`	`'fan_in'`, `'fan_out'`, `'fan_avg'`	Determines whether to use input dimensions, output dimensions, or their average for variance calculation. Use `fan_in` for forward pass stability, `fan_out` for backward pass stability.
`nonlinearity`	`'relu'`, `'leaky_relu'`, `'tanh'`, etc.	Specifies the activation function to compute the appropriate gain factor. Must match the activation used after the layer.
`a` (for Leaky ReLU)	0.01 (default), 0.2, etc.	The negative slope parameter for Leaky ReLU. Affects the variance correction factor $(1 + a^2)$ .

For nn.init.kaiming_normal_ and nn.init.kaiming_uniform_:

Use mode='fan_in' (default) to preserve forward pass variance
Use mode='fan_out' to preserve backward pass variance
Set nonlinearity='relu' for standard ReLU or specify 'leaky_relu' with the a parameter

For nn.init.xavier_normal_ and nn.init.xavier_uniform_:

These use fan_avg mode internally (no mode parameter)
The gain parameter multiplies the standard deviation (default 1.0)
Use gain=nn.init.calculate_gain('tanh') for tanh activations

SummaryLink Copied

Weight initialization determines whether a neural network can learn effectively from the start. The core principle is variance preservation: weights should be scaled so that activations and gradients maintain reasonable magnitudes as they propagate through layers.

Key takeaways:

Zero initialization causes symmetry: All neurons learn the same thing, wasting network capacity
Xavier initialization uses variance $\frac{2}{n_{\text{in}} + n_{\text{out}}}$ , designed for tanh and sigmoid activations
He initialization uses variance $\frac{2}{n_{\text{in}}}$ , accounting for ReLU's variance-halving effect
Biases are typically initialized to zero, which doesn't cause symmetry problems
Modern architectures with batch or layer normalization are more robust to initialization, but still benefit from reasonable starting values
Diagnostic tools can identify initialization problems by examining activation and gradient statistics

The next chapter covers batch normalization, a technique that normalizes activations during training and further reduces sensitivity to initialization while enabling training of even deeper networks.

QuizLink Copied

Ready to test your understanding? Take this quick quiz to reinforce what you've learned about weight initialization in neural networks.

Loading component...

Comments

Back to Language AI Handbook

Reference

BIBTEXAcademic

@misc{weightinitializationxavierhevariancepreservationfordeepnetworks, author = {Michael Brenndoerfer}, title = {Weight Initialization: Xavier, He & Variance Preservation for Deep Networks}, year = {2025}, url = {https://mbrenndoerfer.com/writing/weight-initialization-neural-networks-xavier-he}, organization = {mbrenndoerfer.com}, note = {Accessed: 2025-01-01} }

APAAcademic

Michael Brenndoerfer (2025). Weight Initialization: Xavier, He & Variance Preservation for Deep Networks. Retrieved from https://mbrenndoerfer.com/writing/weight-initialization-neural-networks-xavier-he

MLAAcademic

Michael Brenndoerfer. "Weight Initialization: Xavier, He & Variance Preservation for Deep Networks." 2026. Web. today. <https://mbrenndoerfer.com/writing/weight-initialization-neural-networks-xavier-he>.

CHICAGOAcademic

Michael Brenndoerfer. "Weight Initialization: Xavier, He & Variance Preservation for Deep Networks." Accessed today. https://mbrenndoerfer.com/writing/weight-initialization-neural-networks-xavier-he.

HARVARDAcademic

Michael Brenndoerfer (2025) 'Weight Initialization: Xavier, He & Variance Preservation for Deep Networks'. Available at: https://mbrenndoerfer.com/writing/weight-initialization-neural-networks-xavier-he (Accessed: today).

SimpleBasic

Michael Brenndoerfer (2025). Weight Initialization: Xavier, He & Variance Preservation for Deep Networks. https://mbrenndoerfer.com/writing/weight-initialization-neural-networks-xavier-he

Direct link:

https://mbrenndoerfer.com/writing/weight-initialization-neural-networks-xavier-he

About the author: Michael Brenndoerfer

All opinions expressed here are my own and do not reflect the views of my employer.

Michael currently works as an Associate Director of Data Science at EQT Partners in Singapore, leading AI and data initiatives across private capital investments.

With over a decade of experience spanning private equity, management consulting, and software engineering, he specializes in building and scaling analytics capabilities from the ground up. He has published research in leading AI conferences and holds expertise in machine learning, natural language processing, and value creation through data.

View Full Resume Publications Contact Books

Weight Initialization: Xavier, He & Variance Preservation for Deep Networks

Weight InitializationLink Copied

The Need for Careful InitializationLink Copied

Variance Analysis of Forward PropagationLink Copied

The Pre-Activation EquationLink Copied

Deriving the Variance RelationshipLink Copied

Solving for the Optimal Weight VarianceLink Copied

Empirical VerificationLink Copied

Xavier/Glorot InitializationLink Copied

Forward Pass RequirementLink Copied

Backward Pass RequirementLink Copied

The CompromiseLink Copied

ImplementationLink Copied

He Initialization for ReLU NetworksLink Copied

Understanding ReLU's Variance HalvingLink Copied

Demonstrating the ProblemLink Copied

Deriving He InitializationLink Copied

Converting Variance to Distribution ParametersLink Copied

Implementation and ComparisonLink Copied

Initialization for Different ActivationsLink Copied

The Leaky ReLU CaseLink Copied

A General Initialization FunctionLink Copied

Gradient AnalysisLink Copied

Practical Implementation with PyTorchLink Copied

Training ComparisonLink Copied

Bias InitializationLink Copied

Diagnosing Initialization ProblemsLink Copied

Modern Practices and AlternativesLink Copied

Batch NormalizationLink Copied

Layer NormalizationLink Copied

Residual NetworksLink Copied

Worked Example: Initializing a Language ModelLink Copied

Limitations and ImpactLink Copied

Key ParametersLink Copied

SummaryLink Copied

QuizLink Copied

Comments

Reference

About the author: Michael Brenndoerfer

Related Content

Backpropagation: The Algorithm That Makes Deep Learning Possible

Loss Functions: MSE, Cross-Entropy, Focal Loss & Custom Implementations

Stochastic Gradient Descent: From Batch to Minibatch Optimization

Stay updated

Comments

About the author: Michael Brenndoerfer

Related Content

Backpropagation: The Algorithm That Makes Deep Learning Possible

Loss Functions: MSE, Cross-Entropy, Focal Loss & Custom Implementations

Stochastic Gradient Descent: From Batch to Minibatch Optimization

Stay updated