Multilayer Perceptrons: Architecture, Forward Pass & Implementation

Michael Brenndoerfer · December 15, 2025 · 33 min read

Learn how MLPs stack neurons into layers to solve complex problems. Covers hidden layers, weight matrices, batch processing, and classification/regression tasks.


Multilayer Perceptrons

In the previous chapters, we explored linear classifiers and activation functions as separate building blocks. Linear classifiers can find decision boundaries, but only straight ones. Activation functions introduce non-linearity, but a single neuron with an activation still has limited representational power. The breakthrough comes when we stack these components together into layers, creating what we call a multilayer perceptron (MLP).

MLPs are the workhorses of deep learning. They can approximate virtually any function given enough neurons and proper training. From sentiment analysis to language modeling, understanding MLPs is essential because they form the building blocks of more complex architectures like transformers. This chapter shows you how to construct, understand, and implement MLPs from the ground up.

From Single Neurons to Hidden Layers

A single neuron computes a weighted sum of its inputs, adds a bias, and passes the result through an activation function. Given an input vector $\mathbf{x} = [x_1, x_2, \ldots, x_n]$ with $n$ features, the neuron computes:

$$y = \sigma(\mathbf{w}^\top \mathbf{x} + b)$$

where:

  • $\mathbf{x}$: the input vector containing $n$ features
  • $\mathbf{w}$: the weight vector, where each $w_i$ controls how much input $x_i$ influences the output
  • $\mathbf{w}^\top \mathbf{x}$: the dot product $\sum_{i=1}^{n} w_i x_i$, computing a weighted sum of inputs
  • $b$: the bias term, which shifts the decision boundary
  • $\sigma$: the activation function (e.g., ReLU, sigmoid), which introduces non-linearity
  • $y$: the scalar output of the neuron

A single neuron like this can only learn a linear decision boundary: $\sigma$ makes the output non-linear, but the boundary itself remains a hyperplane. It cannot solve problems requiring more complex boundaries.

Hidden Layer

A hidden layer is a collection of neurons that sits between the input and output of a neural network. Each neuron in a hidden layer receives the full input (or the output of the previous layer), applies its own weights and bias, and produces one scalar output. The term "hidden" reflects that these intermediate computations are not directly observed, only the final output layer is.

Consider the classic XOR problem: given two binary inputs, output 1 if exactly one input is 1, and 0 otherwise. No single linear boundary can separate the positive from negative examples. But with a hidden layer, we can first transform the inputs into a new representation where the classes become linearly separable.

In[2]:
Code
import numpy as np

# XOR inputs and outputs
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])
Out[3]:
Visualization
Scatter plot showing four points in 2D space with XOR labels, illustrating non-linear separability.
The XOR problem visualized. Blue circles represent class 0 and orange crosses represent class 1. No single straight line can separate the two classes, demonstrating the limitation of linear classifiers.

The magic happens when we add a hidden layer. Each hidden neuron learns to detect a different feature or pattern in the input. The output layer then combines these learned features to make the final prediction.

To see this in action, let's manually construct a hidden layer that solves XOR and visualize how it transforms the input space:
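The code that produced the figure below is not shown, but a hand-built hidden layer along these lines does the job. The specific weights here are illustrative choices, not the ones used for the figure:

# A hand-constructed hidden layer for XOR (illustrative weights)
W_hidden = np.array([[1.0, 1.0],   # neuron 1: AND-like detector
                     [1.0, 1.0]])  # neuron 2: OR-like detector
b_hidden = np.array([-1.0, 0.0])   # bias -1 makes neuron 1 AND-like, bias 0 makes neuron 2 OR-like

# Transform each XOR input into the hidden representation with ReLU
H = np.maximum(0, X @ W_hidden.T + b_hidden)
print(H)  # hidden representations: (0,0), (0,1), (0,1), (1,2)

# In the new space a single linear rule separates the classes
scores = H @ np.array([-2.0, 1.0])
print(scores)  # [0, 1, 1, 0] -> matches the XOR labels y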

Out[4]:
Visualization
Two-panel plot showing XOR points before and after hidden layer transformation, with a separating line in the transformed space.
Original input space where XOR classes cannot be separated by a single straight line.
After the hidden layer transformation, points are remapped to a new space where they become linearly separable.

The hidden layer has transformed the input space so that a simple linear classifier can now separate the classes. The first hidden neuron activates only when both inputs are high (AND-like behavior), while the second activates when either input is high (OR-like behavior). In this new coordinate system, the XOR pattern becomes trivially separable.

Network Architecture and Notation

An MLP consists of an input layer, one or more hidden layers, and an output layer. We describe the architecture by the number of units in each layer. For example, a network with 4 inputs, two hidden layers of 8 and 4 units, and 2 outputs would be written as 4-8-4-2.

Let's establish notation that will serve us throughout this chapter and beyond:

  • $L$: Total number of layers (excluding input)
  • $n^{[l]}$: Number of neurons in layer $l$
  • $\mathbf{W}^{[l]}$: Weight matrix for layer $l$, with shape $(n^{[l]}, n^{[l-1]})$
  • $\mathbf{b}^{[l]}$: Bias vector for layer $l$, with shape $(n^{[l]}, 1)$
  • $\mathbf{z}^{[l]}$: Pre-activation values at layer $l$
  • $\mathbf{a}^{[l]}$: Activations (post-activation values) at layer $l$
  • $\mathbf{a}^{[0]} = \mathbf{x}$: The input is treated as the activation of layer 0

The weight matrix $\mathbf{W}^{[l]}$ connects layer $l-1$ to layer $l$. Each row of $\mathbf{W}^{[l]}$ contains the weights for one neuron in layer $l$. The element $W^{[l]}_{ij}$ represents the weight connecting neuron $j$ in layer $l-1$ to neuron $i$ in layer $l$.

In[5]:
Code
# Define a simple 3-layer MLP: 2 inputs -> 4 hidden -> 3 hidden -> 1 output
layer_sizes = [2, 4, 3, 1]

# Initialize weight matrices and bias vectors
np.random.seed(42)
weights = []
biases = []

for l in range(1, len(layer_sizes)):
    W = np.random.randn(layer_sizes[l], layer_sizes[l - 1]) * 0.5
    b = np.zeros((layer_sizes[l], 1))
    weights.append(W)
    biases.append(b)
Out[6]:
Console
Network architecture: 2 -> 4 -> 3 -> 1

Weight matrix shapes:
  W[1]: (4, 2) (connects 2 neurons to 4 neurons)
  W[2]: (3, 4) (connects 4 neurons to 3 neurons)
  W[3]: (1, 3) (connects 3 neurons to 1 neurons)

Bias vector shapes:
  b[1]: (4, 1)
  b[2]: (3, 1)
  b[3]: (1, 1)

Total parameters: 31 (23 weights + 8 biases)

This 3-layer network has a modest parameter count, but the numbers grow quickly. Let's visualize what these weight matrices actually look like:

Out[7]:
Visualization
Three heatmaps showing weight matrices of shapes 4x2, 3x4, and 1x3, with color intensity indicating weight magnitude.
$W^{[1]}$: Weight matrix connecting input (2 neurons) to first hidden layer (4 neurons).
$W^{[2]}$: Weight matrix connecting first hidden (4 neurons) to second hidden layer (3 neurons).
$W^{[3]}$: Weight matrix connecting second hidden (3 neurons) to output layer (1 neuron).

The weight matrices grow with the product of consecutive layer sizes. A layer connecting 512 neurons to 256 neurons requires $512 \times 256 = 131{,}072$ parameters just for the weights, plus 256 bias terms. This rapid growth in parameters is why network architecture design requires careful consideration.
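As a quick sanity check on that arithmetic, the parameter count of a single fully connected layer is just weights plus biases. This small helper is not part of the chapter's code, only an illustration:

# Parameters in one fully connected layer: n_out * n_in weights plus n_out biases
def dense_layer_params(n_in, n_out):
    return n_out * n_in + n_out

print(dense_layer_params(512, 256))  # 131328 = 131,072 weights + 256 biases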

Forward Pass Computation

With our notation established, we can now understand how an MLP transforms an input into an output. This process, called the forward pass, is the heart of neural network computation. Think of it as a pipeline: data flows in one direction, from input through hidden layers to output, with each layer transforming the representation along the way.

The Two-Step Layer Computation

At each layer, the network performs two distinct operations that work together to create expressive transformations:

Step 1: Linear Transformation. First, we compute a weighted combination of inputs from the previous layer. This is where the learnable parameters (weights and biases) come into play:

$$\mathbf{z}^{[l]} = \mathbf{W}^{[l]} \mathbf{a}^{[l-1]} + \mathbf{b}^{[l]}$$

The weight matrix $\mathbf{W}^{[l]}$ determines how strongly each input neuron influences each output neuron. The bias vector $\mathbf{b}^{[l]}$ shifts the output, allowing neurons to activate even when inputs are zero. Together, they define a linear transformation that can rotate, scale, and translate the input space.

Step 2: Non-linear Activation. Next, we apply an activation function element-wise to introduce non-linearity:

$$\mathbf{a}^{[l]} = \sigma(\mathbf{z}^{[l]})$$

This step is crucial. Without activation functions, stacking multiple linear transformations would collapse into a single linear transformation, no matter how many layers we add. The activation function $\sigma$ bends and warps the space, allowing the network to model complex, non-linear relationships.
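You can verify the collapse claim directly. The quick check below is not part of the chapter's code, just a demonstration that two linear layers without an activation are equivalent to one:

# Composing two linear layers without an activation collapses into one linear layer
np.random.seed(0)
W1, b1 = np.random.randn(4, 2), np.random.randn(4, 1)
W2, b2 = np.random.randn(3, 4), np.random.randn(3, 1)
x_demo = np.random.randn(2, 1)

# Two "layers" applied in sequence, with no non-linearity in between
two_layer = W2 @ (W1 @ x_demo + b1) + b2

# The equivalent single linear layer
W_eff, b_eff = W2 @ W1, W2 @ b1 + b2
one_layer = W_eff @ x_demo + b_eff

print(np.allclose(two_layer, one_layer))  # True: depth adds nothing without activations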

The Complete Forward Pass

For a network with $L$ layers, the forward pass chains these two-step computations together. Starting with the input $\mathbf{x}$ (which we treat as $\mathbf{a}^{[0]}$), we propagate through each layer:

$$\begin{aligned} \mathbf{z}^{[1]} &= \mathbf{W}^{[1]} \mathbf{x} + \mathbf{b}^{[1]}, \quad \mathbf{a}^{[1]} = \sigma(\mathbf{z}^{[1]}) \\ \mathbf{z}^{[2]} &= \mathbf{W}^{[2]} \mathbf{a}^{[1]} + \mathbf{b}^{[2]}, \quad \mathbf{a}^{[2]} = \sigma(\mathbf{z}^{[2]}) \\ &\vdots \\ \mathbf{z}^{[L]} &= \mathbf{W}^{[L]} \mathbf{a}^{[L-1]} + \mathbf{b}^{[L]}, \quad \hat{\mathbf{y}} = \mathbf{a}^{[L]} \end{aligned}$$

where:

  • $\mathbf{z}^{[l]}$: the pre-activation vector at layer $l$, the raw output of the linear transformation before any non-linearity is applied
  • $\mathbf{a}^{[l]}$: the activation vector at layer $l$, the output after applying the activation function, which becomes input to the next layer
  • $\mathbf{W}^{[l]}$: the weight matrix connecting layer $l-1$ to layer $l$, containing $n^{[l]} \times n^{[l-1]}$ learnable parameters
  • $\mathbf{b}^{[l]}$: the bias vector for layer $l$, containing $n^{[l]}$ learnable parameters
  • $\sigma$: the activation function applied element-wise (e.g., ReLU, sigmoid, tanh)
  • $\hat{\mathbf{y}}$: the network's final output, our prediction

Why This Architecture Works

The power of this layered structure comes from composition. Each layer learns to detect increasingly abstract features:

  1. Early layers learn simple patterns directly from the input (edges, basic shapes, common word fragments)
  2. Middle layers combine these simple patterns into more complex features (textures, object parts, phrases)
  3. Later layers assemble these features into high-level concepts (objects, categories, meanings)

The output layer typically uses a different activation than hidden layers, chosen based on the task. For binary classification, sigmoid squashes output to $(0, 1)$ for probability interpretation. For multiclass classification, softmax produces a probability distribution across classes. For regression, we often use no activation (identity function) to allow unbounded predictions.

Implementing the Forward Pass

Let's translate these mathematical concepts into code. We'll build a forward pass function from scratch using NumPy to see exactly how the computation flows.

First, we define our activation functions. ReLU (Rectified Linear Unit) is the most common choice for hidden layers because it's simple, computationally efficient, and helps avoid the vanishing gradient problem. Sigmoid is useful for the output layer in binary classification, where we need probabilities between 0 and 1.

In[8]:
Code
def relu(z):
    """ReLU activation function: max(0, z)"""
    return np.maximum(0, z)


def sigmoid(z):
    """Sigmoid activation function: 1 / (1 + e^(-z))"""
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))
Out[9]:
Visualization
Two-panel line plot comparing ReLU and Sigmoid functions, showing ReLU as a kinked line and Sigmoid as an S-curve.
ReLU activation passes positive values unchanged and zeros out negatives, creating sparse activations.
Sigmoid smoothly squashes all values to the (0, 1) range, making it suitable for probability outputs.

Now we implement the forward pass itself. The function iterates through each layer, applying the two-step computation we described: linear transformation followed by activation.

In[10]:
Code
def forward_pass(
    x, weights, biases, hidden_activation=relu, output_activation=sigmoid
):
    """
    Compute forward pass through the network.

    Args:
        x: Input vector of shape (n_features, 1) or (n_features,)
        weights: List of weight matrices
        biases: List of bias vectors
        hidden_activation: Activation function for hidden layers
        output_activation: Activation function for output layer

    Returns:
        activations: List of activations for each layer (including input)
        pre_activations: List of pre-activation values for each layer
    """
    # Ensure x is a column vector
    a = x.reshape(-1, 1) if x.ndim == 1 else x

    activations = [a]
    pre_activations = []

    for l in range(len(weights)):
        # Step 1: Linear transformation
        z = weights[l] @ a + biases[l]
        pre_activations.append(z)

        # Step 2: Apply activation function
        # Use output activation for last layer, hidden activation otherwise
        if l == len(weights) - 1:
            a = output_activation(z)
        else:
            a = hidden_activation(z)

        activations.append(a)

    return activations, pre_activations

Tracing Through a Concrete Example

To solidify our understanding, let's trace through the forward pass with actual numbers. We'll use the 2-4-3-1 network we defined earlier and pass a single input through it.

In[11]:
Code
# Single input example
x = np.array([[0.5], [0.8]])

# Run forward pass
activations, pre_activations = forward_pass(x, weights, biases)
Out[12]:
Console
Input x:
  Shape: (2, 1), values: [0.5 0.8]

Layer 1:
  Pre-activation z[1] = W[1] @ a[0] + b[1]
  z[1]: shape (4, 1), values: [0.0689, 0.7711, -0.1522, 0.7018]
  a[1]: shape (4, 1), values: [0.0689, 0.7711, 0.0000, 0.7018]

Layer 2:
  Pre-activation z[2] = W[2] @ a[1] + b[2]
  z[2]: shape (3, 1), values: [0.0296, -0.9267, -0.4093]
  a[2]: shape (3, 1), values: [0.0296, 0.0000, 0.0000]

Layer 3:
  Pre-activation z[3] = W[3] @ a[2] + b[3]
  z[3]: shape (1, 1), values: [0.0217]
  a[3]: shape (1, 1), values: [0.5054]

Final output: 0.5054

The forward pass transforms our 2D input through three layers, ultimately producing a single scalar output. Several key observations emerge from this trace:

  1. ReLU's effect on hidden layers: Notice how negative pre-activation values become zero after ReLU. This sparsity (many zeros) is actually beneficial, making the network more computationally efficient and helping prevent overfitting.

  2. Dimensional changes: The representation changes size at each layer, from 2 dimensions (input) to 4, then 3, then finally 1 (output). The network progressively compresses information toward the final prediction.

  3. Sigmoid's bounded output: The final layer's sigmoid activation squashes the output to a probability between 0 and 1, suitable for binary classification.

  4. Composition creates complexity: Although each individual step is simple (matrix multiply, add bias, apply activation), the composition of many such steps creates a highly non-linear function capable of modeling complex patterns.

Batch Processing with Matrix Operations

The forward pass we implemented processes one example at a time, but this is inefficient in practice. Modern hardware, especially GPUs, is designed for parallel computation. By processing multiple examples simultaneously, we can achieve dramatic speedups.

The key insight is that we can stack multiple input vectors into a matrix, where each column represents one example. Instead of looping through examples one by one, we perform a single matrix multiplication that processes all examples at once.

For a batch of $m$ examples, the input becomes a matrix $\mathbf{X}$ of shape $(n^{[0]}, m)$, where each column represents one example. The forward pass equations generalize elegantly to matrix form:

$$\mathbf{Z}^{[l]} = \mathbf{W}^{[l]} \mathbf{A}^{[l-1]} + \mathbf{b}^{[l]}$$

where:

  • $\mathbf{A}^{[l-1]}$: the activation matrix from the previous layer, with shape $(n^{[l-1]}, m)$
  • $\mathbf{W}^{[l]}$: the weight matrix with shape $(n^{[l]}, n^{[l-1]})$
  • $\mathbf{Z}^{[l]}$: the pre-activation matrix with shape $(n^{[l]}, m)$
  • $\mathbf{b}^{[l]}$: the bias vector with shape $(n^{[l]}, 1)$, broadcast across all $m$ columns
  • $m$: the batch size (number of examples processed simultaneously)

The matrix multiplication $\mathbf{W}^{[l]} \mathbf{A}^{[l-1]}$ computes the linear transformation for all examples at once. The bias vector $\mathbf{b}^{[l]}$ is broadcast (replicated) across all $m$ columns, adding the same bias to each example. The activation function is then applied element-wise to produce $\mathbf{A}^{[l]}$.

In[13]:
Code
def forward_pass_batch(
    X, weights, biases, hidden_activation=relu, output_activation=sigmoid
):
    """
    Compute forward pass for a batch of inputs.

    Args:
        X: Input matrix of shape (n_features, batch_size)
        weights: List of weight matrices
        biases: List of bias vectors
        hidden_activation: Activation function for hidden layers
        output_activation: Activation function for output layer

    Returns:
        activations: List of activation matrices for each layer
        pre_activations: List of pre-activation matrices for each layer
    """
    A = X if X.ndim == 2 else X.reshape(-1, 1)

    activations = [A]
    pre_activations = []

    for l in range(len(weights)):
        Z = (
            weights[l] @ A + biases[l]
        )  # Broadcasting handles the batch dimension
        pre_activations.append(Z)

        if l == len(weights) - 1:
            A = output_activation(Z)
        else:
            A = hidden_activation(Z)

        activations.append(A)

    return activations, pre_activations


# Create a batch of 5 examples
X_batch = np.random.randn(2, 5)
activations_batch, _ = forward_pass_batch(X_batch, weights, biases)
Out[14]:
Console
Batch input shape: (2, 5) (2 features, 5 examples)
Batch output shape: (1, 5) (1 output, 5 examples)

Outputs for each example:
  Example 1: 0.5000
  Example 2: 0.5000
  Example 3: 0.5000
  Example 4: 0.5528
  Example 5: 0.5000

All five examples are processed in a single matrix operation, producing five outputs simultaneously. With these random, untrained weights, ReLU happens to zero out the entire last hidden layer for four of the examples, so their pre-activations reduce to the zero output bias and sigmoid maps them to exactly 0.5; only example 4 keeps a nonzero hidden activation. Batch processing is not just about efficiency. It also provides more stable gradient estimates during training, as we will see in the backpropagation chapter.

Representational Capacity and the Universal Approximation Theorem

One of the most remarkable properties of MLPs is their ability to approximate any continuous function. The Universal Approximation Theorem states that a feedforward network with a single hidden layer containing a finite number of neurons can approximate any continuous function on a compact subset of $\mathbb{R}^n$, given appropriate weights.

Universal Approximation Theorem

A neural network with one hidden layer of sufficient width can approximate any continuous function to arbitrary accuracy. This does not guarantee that gradient descent will find such an approximation, nor does it specify how many neurons are needed.

The theorem tells us that MLPs are expressive enough to represent complex functions. However, it says nothing about:

  • How many neurons are needed (could be exponentially many)
  • Whether training will find good weights
  • How well the network will generalize to new data

In practice, we find that deeper networks (more layers) often work better than wider networks (more neurons per layer) for the same total number of parameters. Depth enables hierarchical feature learning, where early layers detect simple patterns and later layers combine them into complex concepts.

In[15]:
Code
# Demonstrate function approximation with varying network sizes
def create_mlp(layer_sizes):
    """Create an MLP with given layer sizes."""
    np.random.seed(42)
    weights = []
    biases = []
    for l in range(1, len(layer_sizes)):
        # Xavier initialization
        scale = np.sqrt(2.0 / (layer_sizes[l - 1] + layer_sizes[l]))
        W = np.random.randn(layer_sizes[l], layer_sizes[l - 1]) * scale
        b = np.zeros((layer_sizes[l], 1))
        weights.append(W)
        biases.append(b)
    return weights, biases


# Create networks of varying depth
shallow = create_mlp([1, 64, 1])  # 1 hidden layer, 64 neurons
medium = create_mlp([1, 32, 32, 1])  # 2 hidden layers, 32 neurons each
deep = create_mlp([1, 16, 16, 16, 1])  # 3 hidden layers, 16 neurons each


# Count parameters
def count_params(weights, biases):
    return sum(W.size for W in weights) + sum(b.size for b in biases)
Out[16]:
Console
Network architectures and parameter counts:
  Shallow (1-64-1):         193 parameters  (1 hidden layer)
  Medium  (1-32-32-1):    1,153 parameters  (2 hidden layers)
  Deep    (1-16-16-16-1):   593 parameters  (3 hidden layers)

The medium network has the most parameters: its single 32-to-32 connection alone contributes over a thousand of them. The deep network gets three hidden layers for roughly half that total because each of its layers is smaller. Deeper networks distribute their capacity across more layers with fewer neurons each. Let's visualize this trade-off:

Out[17]:
Visualization
Stacked bar chart comparing parameter counts in shallow, medium, and deep networks.
Parameter distribution across different network architectures. Despite having the most layers, the deep network has fewer total parameters than the medium one because it uses smaller hidden layers; the medium network's wide 32-to-32 connection dominates its count.

This trade-off matters: deeper networks can represent more complex compositional functions. Think of it like building with LEGO: a few large blocks can cover a lot of area, but many small blocks arranged hierarchically can create intricate structures.

MLP for Classification

Classification is one of the most common tasks for MLPs. The network takes features as input and produces probabilities for each class. For binary classification, we use a single output neuron with sigmoid activation. For multiclass, we use multiple output neurons with softmax activation.

Binary Classification

In binary classification, the output $\hat{y} \in (0, 1)$ represents the probability that the input belongs to class 1. We train the network by minimizing the binary cross-entropy loss, which measures how well the predicted probabilities match the true labels:

$$\mathcal{L} = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \right]$$

where:

  • $m$: the number of training examples in the batch
  • $y^{(i)}$: the true label for example $i$ (either 0 or 1)
  • $\hat{y}^{(i)}$: the predicted probability that example $i$ belongs to class 1
  • $\log$: the natural logarithm

This loss function has an intuitive interpretation. When the true label is $y^{(i)} = 1$, only the first term $-\log(\hat{y}^{(i)})$ is active, which penalizes low predicted probabilities. When $y^{(i)} = 0$, only the second term $-\log(1 - \hat{y}^{(i)})$ is active, penalizing high predicted probabilities. The loss approaches zero when predictions are confident and correct, and grows large when predictions are wrong.
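The formula translates directly into NumPy. This is just an illustration of the definition; later in the chapter we rely on PyTorch's nn.BCELoss instead:

# Binary cross-entropy, straight from the formula (clipped to avoid log(0))
def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1, 0, 1, 0])
print(binary_cross_entropy(y_true, np.array([0.95, 0.05, 0.9, 0.1])))  # ~0.08, confident and correct
print(binary_cross_entropy(y_true, np.array([0.05, 0.95, 0.1, 0.9])))  # ~2.65, confident and wrong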

Out[18]:
Visualization
Line plot showing binary cross-entropy loss curves for y=1 and y=0 cases across predicted probabilities.
Binary cross-entropy loss for different true labels. When y=1 (blue), the loss decreases as predicted probability increases. When y=0 (orange), the loss decreases as predicted probability decreases. The steep rise near wrong predictions creates strong gradients for learning.

Let's build a complete binary classifier using PyTorch:

In[19]:
Code
import torch
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Create a non-linearly separable dataset
X, y = make_moons(n_samples=1000, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Standardize features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Convert to PyTorch tensors
X_train_t = torch.FloatTensor(X_train)
y_train_t = torch.FloatTensor(y_train).reshape(-1, 1)
X_test_t = torch.FloatTensor(X_test)
y_test_t = torch.FloatTensor(y_test).reshape(-1, 1)
In[20]:
Code
import torch.nn as nn
import torch.optim as optim


# Define the MLP architecture
class BinaryClassifierMLP(nn.Module):
    def __init__(self, input_size, hidden_sizes, dropout_rate=0.2):
        super().__init__()

        layers = []
        prev_size = input_size

        for hidden_size in hidden_sizes:
            layers.append(nn.Linear(prev_size, hidden_size))
            layers.append(nn.ReLU())
            layers.append(nn.Dropout(dropout_rate))
            prev_size = hidden_size

        # Output layer with sigmoid for binary classification
        layers.append(nn.Linear(prev_size, 1))
        layers.append(nn.Sigmoid())

        self.network = nn.Sequential(*layers)

    def forward(self, x):
        return self.network(x)


# Create model: 2 inputs -> 16 -> 8 -> 1 output
model = BinaryClassifierMLP(input_size=2, hidden_sizes=[16, 8])
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)
In[21]:
Code
# Training loop
train_losses = []
test_accuracies = []

for epoch in range(200):
    # Forward pass
    model.train()
    y_pred = model(X_train_t)
    loss = criterion(y_pred, y_train_t)

    # Backward pass
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Track metrics
    train_losses.append(loss.item())

    # Evaluate on test set
    model.eval()
    with torch.no_grad():
        y_test_pred = model(X_test_t)
        accuracy = ((y_test_pred > 0.5) == y_test_t).float().mean().item()
        test_accuracies.append(accuracy)
Out[22]:
Console
Training Results:
  Final training loss: 0.1479
  Final test accuracy: 97.50%
  Initial loss:        0.7094
  Loss reduction:      79.2%

The model reaches 97.5% test accuracy after 200 epochs of training, and the training loss falls by roughly 80% from its initial value, indicating successful learning. Accuracy this high on the two moons dataset shows the network has learned the non-linear decision boundary effectively.

Let's visualize the decision boundary learned by our classifier:
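The plotting cell itself is not shown; a minimal sketch of how such a boundary plot can be produced looks roughly like this, assuming a 200-by-200 evaluation grid and matplotlib for rendering:

import matplotlib.pyplot as plt

# Evaluate the trained model on a grid covering the (standardized) feature space
xx, yy = np.meshgrid(
    np.linspace(X_train[:, 0].min() - 1, X_train[:, 0].max() + 1, 200),
    np.linspace(X_train[:, 1].min() - 1, X_train[:, 1].max() + 1, 200),
)
grid = torch.FloatTensor(np.c_[xx.ravel(), yy.ravel()])

model.eval()
with torch.no_grad():
    probs = model(grid).reshape(xx.shape).numpy()

# Contour of predicted probabilities plus the training points
plt.contourf(xx, yy, probs, levels=20, cmap="RdBu", alpha=0.6)
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap="RdBu", edgecolors="k")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()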

Out[23]:
Visualization
Contour plot showing a curved decision boundary separating two moon-shaped clusters of points.
Decision boundary learned by the MLP on the two moons dataset. The curved boundary separates the two classes.
Training progress showing loss decrease and accuracy improvement over epochs.

The MLP learns a curved decision boundary that cleanly separates the two moon-shaped clusters. A linear classifier would fail miserably on this task, but the hidden layer transforms the input into a representation where the classes become separable.

Multiclass Classification

For problems with more than two classes, we use softmax activation in the output layer. Softmax converts a vector of raw scores (called logits) into a probability distribution. Given an output vector $\mathbf{z} = [z_1, z_2, \ldots, z_K]$ from the final layer, softmax computes the probability for each class:

$$\text{softmax}(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$

where:

  • $K$: the total number of classes
  • $z_i$: the logit (raw score) for class $i$
  • $e^{z_i}$: the exponential of $z_i$, which ensures all values become positive
  • $\sum_{j=1}^{K} e^{z_j}$: the sum of exponentials across all classes, serving as a normalization constant

The exponential function amplifies differences between logits: if one class has a much higher score than others, it will dominate the probability distribution. The denominator ensures all probabilities sum to 1, creating a valid probability distribution where each output represents the predicted probability of the corresponding class.
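In code, softmax is a few lines of NumPy. The max-subtraction trick below is a standard guard against overflow; the example logits are arbitrary and not the ones used in the figure that follows:

# Softmax with the usual max-subtraction trick for numerical stability
def softmax(z):
    exp_z = np.exp(z - np.max(z))  # subtracting a constant leaves the result unchanged
    return exp_z / exp_z.sum()

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs)        # approx [0.659, 0.242, 0.099]
print(probs.sum())  # 1.0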

Out[24]:
Visualization
Two bar charts side by side showing logits and their corresponding softmax probabilities for three classes.
Raw logits (unnormalized scores) from the network's output layer.
After softmax, the scores become a valid probability distribution that sums to 1.
In[25]:
Code
from sklearn.datasets import load_iris

# Load iris dataset (3 classes)
iris = load_iris()
X_iris, y_iris = iris.data, iris.target
X_train_iris, X_test_iris, y_train_iris, y_test_iris = train_test_split(
    X_iris, y_iris, test_size=0.2, random_state=42
)

# Standardize
scaler_iris = StandardScaler()
X_train_iris = scaler_iris.fit_transform(X_train_iris)
X_test_iris = scaler_iris.transform(X_test_iris)

# Convert to tensors
X_train_iris_t = torch.FloatTensor(X_train_iris)
y_train_iris_t = torch.LongTensor(y_train_iris)
X_test_iris_t = torch.FloatTensor(X_test_iris)
y_test_iris_t = torch.LongTensor(y_test_iris)
In[26]:
Code
class MulticlassClassifierMLP(nn.Module):
    def __init__(self, input_size, hidden_sizes, num_classes):
        super().__init__()

        layers = []
        prev_size = input_size

        for hidden_size in hidden_sizes:
            layers.append(nn.Linear(prev_size, hidden_size))
            layers.append(nn.ReLU())
            layers.append(nn.Dropout(0.2))
            prev_size = hidden_size

        # Output layer (no softmax, CrossEntropyLoss includes it)
        layers.append(nn.Linear(prev_size, num_classes))

        self.network = nn.Sequential(*layers)

    def forward(self, x):
        return self.network(x)


# Create and train multiclass model
model_mc = MulticlassClassifierMLP(
    input_size=4, hidden_sizes=[16, 8], num_classes=3
)
criterion_mc = nn.CrossEntropyLoss()
optimizer_mc = optim.Adam(model_mc.parameters(), lr=0.01)

for epoch in range(200):
    model_mc.train()
    logits = model_mc(X_train_iris_t)
    loss = criterion_mc(logits, y_train_iris_t)

    optimizer_mc.zero_grad()
    loss.backward()
    optimizer_mc.step()
Out[27]:
Console
Test Results:
  Accuracy: 100.00% (30/30 correct)

Sample predictions (first 5 test examples):
  Classes: ['setosa', 'versicolor', 'virginica']

  [✓] True: versicolor   | Predicted: versicolor   | Probs: [0.000, 0.992, 0.008]
  [✓] True: setosa       | Predicted: setosa       | Probs: [1.000, 0.000, 0.000]
  [✓] True: virginica    | Predicted: virginica    | Probs: [0.000, 0.000, 1.000]
  [✓] True: versicolor   | Predicted: versicolor   | Probs: [0.000, 0.969, 0.031]
  [✓] True: versicolor   | Predicted: versicolor   | Probs: [0.000, 0.989, 0.011]

The multiclass model achieves 100% accuracy on the iris test set, correctly classifying all 30 test samples. Notice how the softmax outputs sum to 1.0 for each example, creating valid probability distributions. When the model is confident, one class dominates with probability close to 1.0 while others are near zero. The cross-entropy loss encourages this behavior by rewarding confident correct predictions and heavily penalizing confident incorrect ones.
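The evaluation cell that produced those results is not shown above; it likely resembles the following sketch, which applies softmax to the raw logits to recover the printed probabilities:

# Evaluate the trained multiclass model (sketch; variable names beyond those defined earlier are assumptions)
model_mc.eval()
with torch.no_grad():
    test_logits = model_mc(X_test_iris_t)
    test_probs = torch.softmax(test_logits, dim=1)  # per-class probabilities
    test_preds = test_probs.argmax(dim=1)           # predicted class indices

accuracy = (test_preds == y_test_iris_t).float().mean().item()
print(f"Accuracy: {accuracy:.2%}")
for i in range(5):
    true_name = iris.target_names[y_test_iris_t[i].item()]
    pred_name = iris.target_names[test_preds[i].item()]
    print(true_name, pred_name, test_probs[i].numpy().round(3))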

MLP for Regression

Regression tasks require predicting continuous values rather than class labels. The key differences from classification are:

  • No activation function on the output layer (identity function)
  • Mean squared error (MSE) or mean absolute error (MAE) as the loss function
  • Output layer has one neuron per target variable
In[28]:
Code
# Create a regression dataset
np.random.seed(42)
X_reg = np.random.uniform(-3, 3, (500, 1))
y_reg = np.sin(X_reg) + 0.1 * X_reg**2 + np.random.normal(0, 0.1, X_reg.shape)

X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42
)

X_train_reg_t = torch.FloatTensor(X_train_reg)
y_train_reg_t = torch.FloatTensor(y_train_reg)
X_test_reg_t = torch.FloatTensor(X_test_reg)
y_test_reg_t = torch.FloatTensor(y_test_reg)
In[29]:
Code
class RegressionMLP(nn.Module):
    def __init__(self, input_size, hidden_sizes):
        super().__init__()

        layers = []
        prev_size = input_size

        for hidden_size in hidden_sizes:
            layers.append(nn.Linear(prev_size, hidden_size))
            layers.append(nn.ReLU())
            prev_size = hidden_size

        # Output layer with no activation for regression
        layers.append(nn.Linear(prev_size, 1))

        self.network = nn.Sequential(*layers)

    def forward(self, x):
        return self.network(x)


# Create and train regression model
model_reg = RegressionMLP(input_size=1, hidden_sizes=[32, 16])
criterion_reg = nn.MSELoss()
optimizer_reg = optim.Adam(model_reg.parameters(), lr=0.01)

train_losses_reg = []
for epoch in range(300):
    model_reg.train()
    y_pred = model_reg(X_train_reg_t)
    loss = criterion_reg(y_pred, y_train_reg_t)

    optimizer_reg.zero_grad()
    loss.backward()
    optimizer_reg.step()

    train_losses_reg.append(loss.item())
Out[30]:
Visualization
Two-panel figure showing regression fit curve and training loss over epochs.
MLP learns the underlying non-linear function. Blue line shows MLP prediction, dashed orange shows true function.
Training loss decreasing over epochs on logarithmic scale.
Out[31]:
Console
Regression Results:
  Test MSE:  0.0111
  Test RMSE: 0.1055

Training Progress:
  Initial MSE: 0.6658
  Final MSE:   0.0094
  Improvement: 98.6%

The low test MSE and RMSE indicate the model fits the underlying function well. The RMSE of around 0.1 means predictions are typically within 0.1 units of the true value on average, which matches the noise level we added to the data. The MLP learns to approximate the underlying non-linear function. Unlike polynomial regression where you must choose the degree, the MLP automatically discovers the appropriate level of complexity through its hidden representations.
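The test metrics above can be computed with a short evaluation pass; that cell is not shown, but it presumably looks something like this sketch:

# Evaluate the regression model on the held-out set (sketch)
model_reg.eval()
with torch.no_grad():
    y_test_pred = model_reg(X_test_reg_t)
    test_mse = criterion_reg(y_test_pred, y_test_reg_t).item()

print(f"Test MSE:  {test_mse:.4f}")
print(f"Test RMSE: {test_mse ** 0.5:.4f}")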

Architecture Design Guidelines

Designing an MLP architecture involves choosing the number of layers, neurons per layer, activation functions, and regularization techniques. While there is no universal formula, several principles guide these decisions.

Depth vs. Width

Deeper networks can represent more complex hierarchical features. However, they are harder to train due to vanishing or exploding gradients. Wider networks have more parameters per layer but may struggle to learn compositional patterns.

As a starting point:

  • For simple problems, 1-2 hidden layers often suffice
  • For complex patterns, 3-5 hidden layers may be needed
  • Very deep networks (10+ layers) typically require residual connections

Layer Sizes

Common patterns for hidden layer sizes include:

  • Funnel: Decreasing sizes (e.g., 256-128-64) that progressively compress information
  • Constant: Same size throughout (e.g., 128-128-128) for uniform capacity
  • Bottleneck: Narrow middle layer (e.g., 256-32-256) to force compressed representations
Out[32]:
Visualization
Side-by-side comparison of funnel, constant, and bottleneck network architectures showing layer widths.
Funnel architecture: progressively compresses the representation from 256 to 64 neurons.
Constant architecture: maintains uniform capacity (128 neurons) throughout hidden layers.
Bottleneck architecture: forces the network to learn compressed representations (32 neurons) in the middle.

The input and output sizes are determined by your problem. Hidden sizes are hyperparameters to tune.

Activation Functions

ReLU is the default choice for hidden layers due to its simplicity and effectiveness. Alternatives like Leaky ReLU or GELU can help in specific situations:

  • ReLU: Fast, simple, works well in most cases
  • Leaky ReLU/PReLU: Addresses dying ReLU problem
  • GELU: Smooth approximation, popular in transformers
  • Tanh/Sigmoid: Rarely used in hidden layers now due to saturation

For output layers:

  • Sigmoid: Binary classification (outputs probability)
  • Softmax: Multiclass classification (outputs probability distribution)
  • Identity (none): Regression (outputs unbounded values)

Regularization

To prevent overfitting, we use regularization techniques:

  • Dropout: Randomly zeros neurons during training (typical rates: 0.2-0.5)
  • Weight decay: L2 penalty on weights (typical values: 1e-4 to 1e-2)
  • Batch normalization: Normalizes layer inputs, also acts as regularizer
In[33]:
Code
class WellDesignedMLP(nn.Module):
    """An MLP following best practices for architecture design."""

    def __init__(
        self,
        input_size,
        num_classes,
        hidden_sizes=[256, 128, 64],
        dropout_rate=0.3,
        use_batch_norm=True,
    ):
        super().__init__()

        layers = []
        prev_size = input_size

        for hidden_size in hidden_sizes:
            layers.append(nn.Linear(prev_size, hidden_size))

            if use_batch_norm:
                layers.append(nn.BatchNorm1d(hidden_size))

            layers.append(nn.ReLU())
            layers.append(nn.Dropout(dropout_rate))
            prev_size = hidden_size

        layers.append(nn.Linear(prev_size, num_classes))
        self.network = nn.Sequential(*layers)

    def forward(self, x):
        return self.network(x)
Out[34]:
Console
Well-designed MLP architecture:
WellDesignedMLP(
  (network): Sequential(
    (0): Linear(in_features=100, out_features=256, bias=True)
    (1): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU()
    (3): Dropout(p=0.3, inplace=False)
    (4): Linear(in_features=256, out_features=128, bias=True)
    (5): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (6): ReLU()
    (7): Dropout(p=0.3, inplace=False)
    (8): Linear(in_features=128, out_features=64, bias=True)
    (9): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (10): ReLU()
    (11): Dropout(p=0.3, inplace=False)
    (12): Linear(in_features=64, out_features=10, bias=True)
  )
)

Parameter count:
  Total parameters:     68,554
  Trainable parameters: 68,554

This architecture applies batch normalization after each linear layer, followed by ReLU activation and dropout. The pattern Linear $\to$ BatchNorm $\to$ ReLU $\to$ Dropout repeats for each hidden layer, providing a consistent structure that's easy to extend. The funnel shape (256 $\to$ 128 $\to$ 64) progressively compresses the representation, forcing the network to distill the most important features as information flows toward the output.
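Of the regularization techniques listed earlier, dropout and batch normalization live in the model definition, while weight decay is configured on the optimizer. A typical pairing might look like this sketch (the variable names and values are illustrative):

# Weight decay is set on the optimizer, not in the model itself
model_wd = WellDesignedMLP(input_size=100, num_classes=10)
optimizer_wd = optim.Adam(model_wd.parameters(), lr=1e-3, weight_decay=1e-4)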

Limitations and Impact

Despite their power, MLPs have significant limitations that motivated the development of more specialized architectures.

The most fundamental limitation is their treatment of input as a flat vector. When processing images, MLPs ignore spatial structure, treating each pixel as an independent feature. A small shift in the image produces a completely different input vector, yet the semantic content remains the same. This lack of translation invariance means MLPs need to learn the same pattern multiple times for different positions. Convolutional neural networks address this by sharing weights across spatial locations.

Similarly, for sequential data like text, MLPs treat each position independently. The sentence "The cat sat on the mat" becomes a fixed-size vector where position 1 and position 5 have no structural relationship. This makes learning long-range dependencies extremely difficult. The meaning of "it" in "The trophy doesn't fit in the suitcase because it is too big" depends on understanding the full context, something MLPs struggle with. Recurrent networks and transformers were developed specifically to handle sequential dependencies.

MLPs also struggle with variable-length inputs. Every MLP has a fixed input size determined at architecture design time. Processing sentences of different lengths requires padding to a maximum length or using techniques like bag-of-words that lose positional information entirely. This rigidity limits their applicability to many real-world problems.

Despite these limitations, MLPs remain foundational. The feed-forward layers within transformers are MLPs. The classification heads on top of pre-trained language models are MLPs. Understanding how information flows through layers, how weights connect neurons, and how activations transform representations is essential knowledge for working with any modern neural architecture.

The representational power of MLPs demonstrated that neural networks could, in principle, learn complex functions. The universal approximation theorem provided theoretical justification. The practical challenge became not representation but optimization: finding the right weights among billions of possibilities. The techniques developed to train MLPs (gradient descent, backpropagation, and regularization) form the foundation of all deep learning.

Summary

Multilayer perceptrons extend single neurons into networks capable of learning complex, non-linear functions. By stacking layers with non-linear activations, MLPs can approximate virtually any function, a property formalized in the universal approximation theorem.

Key takeaways from this chapter:

  • Hidden layers transform inputs into new representations where patterns become easier to detect
  • Weight matrices connect layers, with shape $(n^{[l]}, n^{[l-1]})$ for the matrix connecting layer $l-1$ to layer $l$
  • Forward pass propagates information through linear transformations and activations: $\mathbf{a}^{[l]} = \sigma(\mathbf{W}^{[l]} \mathbf{a}^{[l-1]} + \mathbf{b}^{[l]})$
  • Batch processing uses matrix operations for efficiency, stacking examples as columns
  • Classification uses sigmoid (binary) or softmax (multiclass) output activations with cross-entropy loss
  • Regression uses identity (no) output activation with MSE loss
  • Architecture design involves balancing depth, width, activation functions, and regularization

The next chapter explores loss functions in detail, examining how different choices affect learning and what happens when we optimize various objectives. Understanding loss functions is crucial because they define what "good" means for our network.

Key Parameters

When building MLPs in PyTorch, several parameters significantly impact model performance:

Architecture Parameters:

  • hidden_sizes: List of neurons per hidden layer (e.g., [256, 128, 64]). Larger sizes increase capacity but also parameter count and risk of overfitting. Start with powers of 2 for computational efficiency.
  • input_size: Number of input features, determined by your data.
  • num_classes / output_size: Number of output neurons. Use 1 for binary classification or regression, $K$ for $K$-class classification.

Regularization Parameters:

  • dropout_rate: Probability of zeroing neurons during training (typically 0.1-0.5). Higher values provide stronger regularization but may slow convergence. Use 0.2-0.3 as a starting point.
  • weight_decay: L2 regularization strength in the optimizer (typically 1e-4 to 1e-2). Penalizes large weights to reduce overfitting.

Training Parameters:

  • lr (learning rate): Step size for gradient updates (typically 1e-4 to 1e-1). Too high causes instability; too low causes slow convergence. Adam optimizer often works well with lr=0.001.
  • epochs: Number of complete passes through the training data. Monitor validation loss to avoid overfitting.
  • batch_size: Number of examples per gradient update (typically 32-256). Larger batches provide more stable gradients but require more memory; see the mini-batch sketch after this list.
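The training loops earlier in this chapter pass the full dataset through the model at every step. For larger datasets, a mini-batch loop with a DataLoader is the usual pattern; the sketch below reuses the two-moons tensors and binary classifier defined earlier, with illustrative hyperparameters:

from torch.utils.data import DataLoader, TensorDataset

# Wrap the training tensors in a DataLoader that yields shuffled mini-batches
train_loader = DataLoader(
    TensorDataset(X_train_t, y_train_t), batch_size=64, shuffle=True
)

model_mb = BinaryClassifierMLP(input_size=2, hidden_sizes=[16, 8])
optimizer_mb = optim.Adam(model_mb.parameters(), lr=0.001, weight_decay=1e-4)
criterion_mb = nn.BCELoss()

for epoch in range(20):
    model_mb.train()
    for X_batch, y_batch in train_loader:
        optimizer_mb.zero_grad()
        loss = criterion_mb(model_mb(X_batch), y_batch)
        loss.backward()
        optimizer_mb.step()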

Activation Choices:

  • Hidden layers: ReLU is the default choice. Consider Leaky ReLU if many neurons "die" (output constant zero).
  • Output layer: Sigmoid for binary classification, Softmax (via CrossEntropyLoss) for multiclass, Identity (none) for regression.

Loss Functions:

  • nn.BCELoss(): Binary cross-entropy for binary classification with sigmoid output.
  • nn.CrossEntropyLoss(): Combines softmax and negative log-likelihood for multiclass classification. Expects raw logits, not probabilities.
  • nn.MSELoss(): Mean squared error for regression tasks.
