Multilayer Perceptrons: Architecture, Forward Pass & Implementation

Michael Brenndoerfer · December 15, 2025 · 33 min read

Learn how MLPs stack neurons into layers to solve complex problems. Covers hidden layers, weight matrices, batch processing, and classification/regression tasks.


Multilayer Perceptrons

In the previous chapters, we explored linear classifiers and activation functions as separate building blocks. Linear classifiers can find decision boundaries, but only straight ones. Activation functions introduce non-linearity, but a single neuron with an activation still has limited representational power. The breakthrough comes when we stack these components together into layers, creating what we call a multilayer perceptron (MLP).

MLPs are the workhorses of deep learning. They can approximate virtually any function given enough neurons and proper training. From sentiment analysis to language modeling, understanding MLPs is essential because they form the building blocks of more complex architectures like transformers. This chapter shows you how to construct, understand, and implement MLPs from the ground up.

From Single Neurons to Hidden Layers

A single neuron computes a weighted sum of its inputs, adds a bias, and passes the result through an activation function. Given an input vector $\mathbf{x} = [x_1, x_2, \ldots, x_n]$ with $n$ features, the neuron computes:

$$y = \sigma(\mathbf{w}^\top \mathbf{x} + b)$$

where:

  • $\mathbf{x}$: the input vector containing $n$ features
  • $\mathbf{w}$: the weight vector, where each $w_i$ controls how much input $x_i$ influences the output
  • $\mathbf{w}^\top \mathbf{x}$: the dot product $\sum_{i=1}^{n} w_i x_i$, computing a weighted sum of inputs
  • $b$: the bias term, which shifts the decision boundary
  • $\sigma$: the activation function (e.g., ReLU, sigmoid), which introduces non-linearity
  • $y$: the scalar output of the neuron

A single neuron like this can only learn a linear decision boundary: $\sigma$ makes the output non-linear, but the boundary itself remains a hyperplane. It cannot solve problems requiring more complex boundaries.

Hidden Layer

A hidden layer is a collection of neurons that sits between the input and output of a neural network. Each neuron in a hidden layer receives the full input (or the output of the previous layer), applies its own weights and bias, and produces one scalar output. The term "hidden" reflects that these intermediate computations are not directly observed, only the final output layer is.

Consider the classic XOR problem: given two binary inputs, output 1 if exactly one input is 1, and 0 otherwise. No single linear boundary can separate the positive from negative examples. But with a hidden layer, we can first transform the inputs into a new representation where the classes become linearly separable.

In[2]:
Code
import numpy as np

# XOR inputs and outputs
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])
Out[3]:
Visualization
Scatter plot showing four points in 2D space with XOR labels, illustrating non-linear separability.
The XOR problem visualized. Blue circles represent class 0 and orange crosses represent class 1. No single straight line can separate the two classes, demonstrating the limitation of linear classifiers.

The magic happens when we add a hidden layer. Each hidden neuron learns to detect a different feature or pattern in the input. The output layer then combines these learned features to make the final prediction.

To see this in action, let's manually construct a hidden layer that solves XOR and visualize how it transforms the input space:
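The code that produced the figure below is not shown, but a hand-built hidden layer along these lines does the job. The specific weights here are illustrative choices, not the ones used for the figure:

# A hand-constructed hidden layer for XOR (illustrative weights)
W_hidden = np.array([[1.0, 1.0],   # neuron 1: AND-like detector
                     [1.0, 1.0]])  # neuron 2: OR-like detector
b_hidden = np.array([-1.0, 0.0])   # bias -1 makes neuron 1 AND-like, bias 0 makes neuron 2 OR-like

# Transform each XOR input into the hidden representation with ReLU
H = np.maximum(0, X @ W_hidden.T + b_hidden)
print(H)  # hidden representations: (0,0), (0,1), (0,1), (1,2)

# In the new space a single linear rule separates the classes
scores = H @ np.array([-2.0, 1.0])
print(scores)  # [0, 1, 1, 0] -> matches the XOR labels y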

Out[4]:
Visualization
Two-panel plot showing XOR points before and after hidden layer transformation, with a separating line in the transformed space.
Original input space where XOR classes cannot be separated by a single straight line.
After the hidden layer transformation, points are remapped to a new space where they become linearly separable.

The hidden layer has transformed the input space so that a simple linear classifier can now separate the classes. The first hidden neuron activates only when both inputs are high (AND-like behavior), while the second activates when either input is high (OR-like behavior). In this new coordinate system, the XOR pattern becomes trivially separable.

Network Architecture and Notation

An MLP consists of an input layer, one or more hidden layers, and an output layer. We describe the architecture by the number of units in each layer. For example, a network with 4 inputs, two hidden layers of 8 and 4 units, and 2 outputs would be written as 4-8-4-2.

Let's establish notation that will serve us throughout this chapter and beyond:

  • $L$: Total number of layers (excluding input)
  • $n^{[l]}$: Number of neurons in layer $l$
  • $\mathbf{W}^{[l]}$: Weight matrix for layer $l$, with shape $(n^{[l]}, n^{[l-1]})$
  • $\mathbf{b}^{[l]}$: Bias vector for layer $l$, with shape $(n^{[l]}, 1)$
  • $\mathbf{z}^{[l]}$: Pre-activation values at layer $l$
  • $\mathbf{a}^{[l]}$: Activations (post-activation values) at layer $l$
  • $\mathbf{a}^{[0]} = \mathbf{x}$: The input is treated as the activation of layer 0

The weight matrix $\mathbf{W}^{[l]}$ connects layer $l-1$ to layer $l$. Each row of $\mathbf{W}^{[l]}$ contains the weights for one neuron in layer $l$. The element $W^{[l]}_{ij}$ represents the weight connecting neuron $j$ in layer $l-1$ to neuron $i$ in layer $l$.

In[5]:
Code
# Define a simple 3-layer MLP: 2 inputs -> 4 hidden -> 3 hidden -> 1 output
layer_sizes = [2, 4, 3, 1]

# Initialize weight matrices and bias vectors
np.random.seed(42)
weights = []
biases = []

for l in range(1, len(layer_sizes)):
    W = np.random.randn(layer_sizes[l], layer_sizes[l - 1]) * 0.5
    b = np.zeros((layer_sizes[l], 1))
    weights.append(W)
    biases.append(b)
Out[6]:
Console
Network architecture: 2 -> 4 -> 3 -> 1

Weight matrix shapes:
  W[1]: (4, 2) (connects 2 neurons to 4 neurons)
  W[2]: (3, 4) (connects 4 neurons to 3 neurons)
  W[3]: (1, 3) (connects 3 neurons to 1 neurons)

Bias vector shapes:
  b[1]: (4, 1)
  b[2]: (3, 1)
  b[3]: (1, 1)

Total parameters: 31 (23 weights + 8 biases)

This 3-layer network has a modest parameter count, but the numbers grow quickly. Let's visualize what these weight matrices actually look like:

Out[7]:
Visualization
Three heatmaps showing weight matrices of shapes 4x2, 3x4, and 1x3, with color intensity indicating weight magnitude.
$W^{[1]}$: Weight matrix connecting input (2 neurons) to first hidden layer (4 neurons).
$W^{[2]}$: Weight matrix connecting first hidden (4 neurons) to second hidden layer (3 neurons).
$W^{[3]}$: Weight matrix connecting second hidden (3 neurons) to output layer (1 neuron).

The weight matrices grow with the product of consecutive layer sizes. A layer connecting 512 neurons to 256 neurons requires $512 \times 256 = 131{,}072$ parameters just for the weights, plus 256 bias terms. This rapid growth in parameters is why network architecture design requires careful consideration.
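As a quick sanity check on that arithmetic, the parameter count of a single fully connected layer is just weights plus biases. This small helper is not part of the chapter's code, only an illustration:

# Parameters in one fully connected layer: n_out * n_in weights plus n_out biases
def dense_layer_params(n_in, n_out):
    return n_out * n_in + n_out

print(dense_layer_params(512, 256))  # 131328 = 131,072 weights + 256 biases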

Forward Pass Computation

With our notation established, we can now understand how an MLP transforms an input into an output. This process, called the forward pass, is the heart of neural network computation. Think of it as a pipeline: data flows in one direction, from input through hidden layers to output, with each layer transforming the representation along the way.

The Two-Step Layer Computation

At each layer, the network performs two distinct operations that work together to create expressive transformations:

Step 1: Linear Transformation. First, we compute a weighted combination of inputs from the previous layer. This is where the learnable parameters (weights and biases) come into play:

$$\mathbf{z}^{[l]} = \mathbf{W}^{[l]} \mathbf{a}^{[l-1]} + \mathbf{b}^{[l]}$$

The weight matrix $\mathbf{W}^{[l]}$ determines how strongly each input neuron influences each output neuron. The bias vector $\mathbf{b}^{[l]}$ shifts the output, allowing neurons to activate even when inputs are zero. Together, they define a linear transformation that can rotate, scale, and translate the input space.

Step 2: Non-linear Activation. Next, we apply an activation function element-wise to introduce non-linearity:

$$\mathbf{a}^{[l]} = \sigma(\mathbf{z}^{[l]})$$

This step is crucial. Without activation functions, stacking multiple linear transformations would collapse into a single linear transformation, no matter how many layers we add. The activation function $\sigma$ bends and warps the space, allowing the network to model complex, non-linear relationships.
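You can verify the collapse claim directly. The quick check below is not part of the chapter's code, just a demonstration that two linear layers without an activation are equivalent to one:

# Composing two linear layers without an activation collapses into one linear layer
np.random.seed(0)
W1, b1 = np.random.randn(4, 2), np.random.randn(4, 1)
W2, b2 = np.random.randn(3, 4), np.random.randn(3, 1)
x_demo = np.random.randn(2, 1)

# Two "layers" applied in sequence, with no non-linearity in between
two_layer = W2 @ (W1 @ x_demo + b1) + b2

# The equivalent single linear layer
W_eff, b_eff = W2 @ W1, W2 @ b1 + b2
one_layer = W_eff @ x_demo + b_eff

print(np.allclose(two_layer, one_layer))  # True: depth adds nothing without activations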

The Complete Forward Pass

For a network with $L$ layers, the forward pass chains these two-step computations together. Starting with the input $\mathbf{x}$ (which we treat as $\mathbf{a}^{[0]}$), we propagate through each layer:

$$\begin{aligned} \mathbf{z}^{[1]} &= \mathbf{W}^{[1]} \mathbf{x} + \mathbf{b}^{[1]}, \quad \mathbf{a}^{[1]} = \sigma(\mathbf{z}^{[1]}) \\ \mathbf{z}^{[2]} &= \mathbf{W}^{[2]} \mathbf{a}^{[1]} + \mathbf{b}^{[2]}, \quad \mathbf{a}^{[2]} = \sigma(\mathbf{z}^{[2]}) \\ &\vdots \\ \mathbf{z}^{[L]} &= \mathbf{W}^{[L]} \mathbf{a}^{[L-1]} + \mathbf{b}^{[L]}, \quad \hat{\mathbf{y}} = \mathbf{a}^{[L]} \end{aligned}$$

where:

  • $\mathbf{z}^{[l]}$: the pre-activation vector at layer $l$, the raw output of the linear transformation before any non-linearity is applied
  • $\mathbf{a}^{[l]}$: the activation vector at layer $l$, the output after applying the activation function, which becomes input to the next layer
  • $\mathbf{W}^{[l]}$: the weight matrix connecting layer $l-1$ to layer $l$, containing $n^{[l]} \times n^{[l-1]}$ learnable parameters
  • $\mathbf{b}^{[l]}$: the bias vector for layer $l$, containing $n^{[l]}$ learnable parameters
  • $\sigma$: the activation function applied element-wise (e.g., ReLU, sigmoid, tanh)
  • $\hat{\mathbf{y}}$: the network's final output, our prediction

Why This Architecture Works

The power of this layered structure comes from composition. Each layer learns to detect increasingly abstract features:

  1. Early layers learn simple patterns directly from the input (edges, basic shapes, common word fragments)
  2. Middle layers combine these simple patterns into more complex features (textures, object parts, phrases)
  3. Later layers assemble these features into high-level concepts (objects, categories, meanings)

The output layer typically uses a different activation than hidden layers, chosen based on the task. For binary classification, sigmoid squashes output to $(0, 1)$ for probability interpretation. For multiclass classification, softmax produces a probability distribution across classes. For regression, we often use no activation (identity function) to allow unbounded predictions.

Implementing the Forward Pass

Let's translate these mathematical concepts into code. We'll build a forward pass function from scratch using NumPy to see exactly how the computation flows.

First, we define our activation functions. ReLU (Rectified Linear Unit) is the most common choice for hidden layers because it's simple, computationally efficient, and helps avoid the vanishing gradient problem. Sigmoid is useful for the output layer in binary classification, where we need probabilities between 0 and 1.

In[8]:
Code
def relu(z):
    """ReLU activation function: max(0, z)"""
    return np.maximum(0, z)


def sigmoid(z):
    """Sigmoid activation function: 1 / (1 + e^(-z))"""
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))
Out[9]:
Visualization
Two-panel line plot comparing ReLU and Sigmoid functions, showing ReLU as a kinked line and Sigmoid as an S-curve.
ReLU activation passes positive values unchanged and zeros out negatives, creating sparse activations.
Sigmoid smoothly squashes all values to the (0, 1) range, making it suitable for probability outputs.

Now we implement the forward pass itself. The function iterates through each layer, applying the two-step computation we described: linear transformation followed by activation.

In[10]:
Code
def forward_pass(
    x, weights, biases, hidden_activation=relu, output_activation=sigmoid
):
    """
    Compute forward pass through the network.

    Args:
        x: Input vector of shape (n_features, 1) or (n_features,)
        weights: List of weight matrices
        biases: List of bias vectors
        hidden_activation: Activation function for hidden layers
        output_activation: Activation function for output layer

    Returns:
        activations: List of activations for each layer (including input)
        pre_activations: List of pre-activation values for each layer
    """
    # Ensure x is a column vector
    a = x.reshape(-1, 1) if x.ndim == 1 else x

    activations = [a]
    pre_activations = []

    for l in range(len(weights)):
        # Step 1: Linear transformation
        z = weights[l] @ a + biases[l]
        pre_activations.append(z)

        # Step 2: Apply activation function
        # Use output activation for last layer, hidden activation otherwise
        if l == len(weights) - 1:
            a = output_activation(z)
        else:
            a = hidden_activation(z)

        activations.append(a)

    return activations, pre_activations

Tracing Through a Concrete Example

To solidify our understanding, let's trace through the forward pass with actual numbers. We'll use the 2-4-3-1 network we defined earlier and pass a single input through it.

In[11]:
Code
# Single input example
x = np.array([[0.5], [0.8]])

# Run forward pass
activations, pre_activations = forward_pass(x, weights, biases)
Out[12]:
Console
Input x:
  Shape: (2, 1), values: [0.5 0.8]

Layer 1:
  Pre-activation z[1] = W[1] @ a[0] + b[1]
  z[1]: shape (4, 1), values: [0.0689, 0.7711, -0.1522, 0.7018]
  a[1]: shape (4, 1), values: [0.0689, 0.7711, 0.0000, 0.7018]

Layer 2:
  Pre-activation z[2] = W[2] @ a[1] + b[2]
  z[2]: shape (3, 1), values: [0.0296, -0.9267, -0.4093]
  a[2]: shape (3, 1), values: [0.0296, 0.0000, 0.0000]

Layer 3:
  Pre-activation z[3] = W[3] @ a[2] + b[3]
  z[3]: shape (1, 1), values: [0.0217]
  a[3]: shape (1, 1), values: [0.5054]

Final output: 0.5054

The forward pass transforms our 2D input through three layers, ultimately producing a single scalar output. Several key observations emerge from this trace:

  1. ReLU's effect on hidden layers: Notice how negative pre-activation values become zero after ReLU. This sparsity (many zeros) is actually beneficial, making the network more computationally efficient and helping prevent overfitting.

  2. Dimensional changes: The representation changes size at each layer, from 2 dimensions (input) to 4, then 3, then finally 1 (output). The network progressively compresses information toward the final prediction.

  3. Sigmoid's bounded output: The final layer's sigmoid activation squashes the output to a probability between 0 and 1, suitable for binary classification.

  4. Composition creates complexity: Although each individual step is simple (matrix multiply, add bias, apply activation), the composition of many such steps creates a highly non-linear function capable of modeling complex patterns.

Batch Processing with Matrix Operations

The forward pass we implemented processes one example at a time, but this is inefficient in practice. Modern hardware, especially GPUs, is designed for parallel computation. By processing multiple examples simultaneously, we can achieve dramatic speedups.

The key insight is that we can stack multiple input vectors into a matrix, where each column represents one example. Instead of looping through examples one by one, we perform a single matrix multiplication that processes all examples at once.

For a batch of $m$ examples, the input becomes a matrix $\mathbf{X}$ of shape $(n^{[0]}, m)$, where each column represents one example. The forward pass equations generalize elegantly to matrix form:

$$\mathbf{Z}^{[l]} = \mathbf{W}^{[l]} \mathbf{A}^{[l-1]} + \mathbf{b}^{[l]}$$

where:

  • $\mathbf{A}^{[l-1]}$: the activation matrix from the previous layer, with shape $(n^{[l-1]}, m)$
  • $\mathbf{W}^{[l]}$: the weight matrix with shape $(n^{[l]}, n^{[l-1]})$
  • $\mathbf{Z}^{[l]}$: the pre-activation matrix with shape $(n^{[l]}, m)$
  • $\mathbf{b}^{[l]}$: the bias vector with shape $(n^{[l]}, 1)$, broadcast across all $m$ columns
  • $m$: the batch size (number of examples processed simultaneously)

The matrix multiplication $\mathbf{W}^{[l]} \mathbf{A}^{[l-1]}$ computes the linear transformation for all examples at once. The bias vector $\mathbf{b}^{[l]}$ is broadcast (replicated) across all $m$ columns, adding the same bias to each example. The activation function is then applied element-wise to produce $\mathbf{A}^{[l]}$.

In[13]:
Code
def forward_pass_batch(
    X, weights, biases, hidden_activation=relu, output_activation=sigmoid
):
    """
    Compute forward pass for a batch of inputs.

    Args:
        X: Input matrix of shape (n_features, batch_size)
        weights: List of weight matrices
        biases: List of bias vectors
        hidden_activation: Activation function for hidden layers
        output_activation: Activation function for output layer

    Returns:
        activations: List of activation matrices for each layer
        pre_activations: List of pre-activation matrices for each layer
    """
    A = X if X.ndim == 2 else X.reshape(-1, 1)

    activations = [A]
    pre_activations = []

    for l in range(len(weights)):
        Z = (
            weights[l] @ A + biases[l]
        )  # Broadcasting handles the batch dimension
        pre_activations.append(Z)

        if l == len(weights) - 1:
            A = output_activation(Z)
        else:
            A = hidden_activation(Z)

        activations.append(A)

    return activations, pre_activations


# Create a batch of 5 examples
X_batch = np.random.randn(2, 5)
activations_batch, _ = forward_pass_batch(X_batch, weights, biases)
Out[14]:
Console
Batch input shape: (2, 5) (2 features, 5 examples)
Batch output shape: (1, 5) (1 output, 5 examples)

Outputs for each example:
  Example 1: 0.5000
  Example 2: 0.5000
  Example 3: 0.5000
  Example 4: 0.5528
  Example 5: 0.5000

All five examples are processed in a single matrix operation, producing five outputs simultaneously. With these random, untrained weights, ReLU happens to zero out the entire last hidden layer for four of the examples, so their pre-activations reduce to the zero output bias and sigmoid maps them to exactly 0.5; only example 4 keeps a nonzero hidden activation. Batch processing is not just about efficiency. It also provides more stable gradient estimates during training, as we will see in the backpropagation chapter.

Representational Capacity and the Universal Approximation Theorem

One of the most remarkable properties of MLPs is their ability to approximate any continuous function. The Universal Approximation Theorem states that a feedforward network with a single hidden layer containing a finite number of neurons can approximate any continuous function on a compact subset of $\mathbb{R}^n$, given appropriate weights.

Universal Approximation Theorem

A neural network with one hidden layer of sufficient width can approximate any continuous function to arbitrary accuracy. This does not guarantee that gradient descent will find such an approximation, nor does it specify how many neurons are needed.

The theorem tells us that MLPs are expressive enough to represent complex functions. However, it says nothing about:

  • How many neurons are needed (could be exponentially many)
  • Whether training will find good weights
  • How well the network will generalize to new data

In practice, we find that deeper networks (more layers) often work better than wider networks (more neurons per layer) for the same total number of parameters. Depth enables hierarchical feature learning, where early layers detect simple patterns and later layers combine them into complex concepts.

In[15]:
Code
# Demonstrate function approximation with varying network sizes
def create_mlp(layer_sizes):
    """Create an MLP with given layer sizes."""
    np.random.seed(42)
    weights = []
    biases = []
    for l in range(1, len(layer_sizes)):
        # Xavier initialization
        scale = np.sqrt(2.0 / (layer_sizes[l - 1] + layer_sizes[l]))
        W = np.random.randn(layer_sizes[l], layer_sizes[l - 1]) * scale
        b = np.zeros((layer_sizes[l], 1))
        weights.append(W)
        biases.append(b)
    return weights, biases


# Create networks of varying depth
shallow = create_mlp([1, 64, 1])  # 1 hidden layer, 64 neurons
medium = create_mlp([1, 32, 32, 1])  # 2 hidden layers, 32 neurons each
deep = create_mlp([1, 16, 16, 16, 1])  # 3 hidden layers, 16 neurons each


# Count parameters
def count_params(weights, biases):
    return sum(W.size for W in weights) + sum(b.size for b in biases)
Out[16]:
Console
Network architectures and parameter counts:
  Shallow (1-64-1):         193 parameters  (1 hidden layer)
  Medium  (1-32-32-1):    1,153 parameters  (2 hidden layers)
  Deep    (1-16-16-16-1):   593 parameters  (3 hidden layers)

The medium network has the most parameters: its single 32-to-32 connection alone contributes over a thousand of them. The deep network gets three hidden layers for roughly half that total because each of its layers is smaller. Deeper networks distribute their capacity across more layers with fewer neurons each. Let's visualize this trade-off:

Out[17]:
Visualization
Stacked bar chart comparing parameter counts in shallow, medium, and deep networks.
Parameter distribution across different network architectures. Despite having the most layers, the deep network has fewer total parameters than the medium one because it uses smaller hidden layers; the medium network's wide 32-to-32 connection dominates its count.

This trade-off matters: deeper networks can represent more complex compositional functions. Think of it like building with LEGO: a few large blocks can cover a lot of area, but many small blocks arranged hierarchically can create intricate structures.

MLP for Classification

Classification is one of the most common tasks for MLPs. The network takes features as input and produces probabilities for each class. For binary classification, we use a single output neuron with sigmoid activation. For multiclass, we use multiple output neurons with softmax activation.

Binary Classification

In binary classification, the output $\hat{y} \in (0, 1)$ represents the probability that the input belongs to class 1. We train the network by minimizing the binary cross-entropy loss, which measures how well the predicted probabilities match the true labels:

$$\mathcal{L} = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \right]$$

where:

  • $m$: the number of training examples in the batch
  • $y^{(i)}$: the true label for example $i$ (either 0 or 1)
  • $\hat{y}^{(i)}$: the predicted probability that example $i$ belongs to class 1
  • $\log$: the natural logarithm

This loss function has an intuitive interpretation. When the true label is $y^{(i)} = 1$, only the first term $-\log(\hat{y}^{(i)})$ is active, which penalizes low predicted probabilities. When $y^{(i)} = 0$, only the second term $-\log(1 - \hat{y}^{(i)})$ is active, penalizing high predicted probabilities. The loss approaches zero when predictions are confident and correct, and grows large when predictions are wrong.
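The formula translates directly into NumPy. This is just an illustration of the definition; later in the chapter we rely on PyTorch's nn.BCELoss instead:

# Binary cross-entropy, straight from the formula (clipped to avoid log(0))
def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1, 0, 1, 0])
print(binary_cross_entropy(y_true, np.array([0.95, 0.05, 0.9, 0.1])))  # ~0.08, confident and correct
print(binary_cross_entropy(y_true, np.array([0.05, 0.95, 0.1, 0.9])))  # ~2.65, confident and wrong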

Out[18]:
Visualization
Line plot showing binary cross-entropy loss curves for y=1 and y=0 cases across predicted probabilities.
Binary cross-entropy loss for different true labels. When y=1 (blue), the loss decreases as predicted probability increases. When y=0 (orange), the loss decreases as predicted probability decreases. The steep rise near wrong predictions creates strong gradients for learning.

Let's build a complete binary classifier using PyTorch:

In[19]:
Code
import torch
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Create a non-linearly separable dataset
X, y = make_moons(n_samples=1000, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Standardize features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Convert to PyTorch tensors
X_train_t = torch.FloatTensor(X_train)
y_train_t = torch.FloatTensor(y_train).reshape(-1, 1)
X_test_t = torch.FloatTensor(X_test)
y_test_t = torch.FloatTensor(y_test).reshape(-1, 1)
In[20]:
Code
import torch.nn as nn
import torch.optim as optim


# Define the MLP architecture
class BinaryClassifierMLP(nn.Module):
    def __init__(self, input_size, hidden_sizes, dropout_rate=0.2):
        super().__init__()

        layers = []
        prev_size = input_size

        for hidden_size in hidden_sizes:
            layers.append(nn.Linear(prev_size, hidden_size))
            layers.append(nn.ReLU())
            layers.append(nn.Dropout(dropout_rate))
            prev_size = hidden_size

        # Output layer with sigmoid for binary classification
        layers.append(nn.Linear(prev_size, 1))
        layers.append(nn.Sigmoid())

        self.network = nn.Sequential(*layers)

    def forward(self, x):
        return self.network(x)


# Create model: 2 inputs -> 16 -> 8 -> 1 output
model = BinaryClassifierMLP(input_size=2, hidden_sizes=[16, 8])
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)
In[21]:
Code
# Training loop
train_losses = []
test_accuracies = []

for epoch in range(200):
    # Forward pass
    model.train()
    y_pred = model(X_train_t)
    loss = criterion(y_pred, y_train_t)

    # Backward pass
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Track metrics
    train_losses.append(loss.item())

    # Evaluate on test set
    model.eval()
    with torch.no_grad():
        y_test_pred = model(X_test_t)
        accuracy = ((y_test_pred > 0.5) == y_test_t).float().mean().item()
        test_accuracies.append(accuracy)
Out[22]:
Console
Training Results:
  Final training loss: 0.1479
  Final test accuracy: 97.50%
  Initial loss:        0.7094
  Loss reduction:      79.2%

The model reaches 97.5% test accuracy after 200 epochs of training, and the training loss falls by roughly 80% from its initial value, indicating successful learning. Accuracy this high on the two moons dataset shows the network has learned the non-linear decision boundary effectively.

Let's visualize the decision boundary learned by our classifier:
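The plotting cell itself is not shown; a minimal sketch of how such a boundary plot can be produced looks roughly like this, assuming a 200-by-200 evaluation grid and matplotlib for rendering:

import matplotlib.pyplot as plt

# Evaluate the trained model on a grid covering the (standardized) feature space
xx, yy = np.meshgrid(
    np.linspace(X_train[:, 0].min() - 1, X_train[:, 0].max() + 1, 200),
    np.linspace(X_train[:, 1].min() - 1, X_train[:, 1].max() + 1, 200),
)
grid = torch.FloatTensor(np.c_[xx.ravel(), yy.ravel()])

model.eval()
with torch.no_grad():
    probs = model(grid).reshape(xx.shape).numpy()

# Contour of predicted probabilities plus the training points
plt.contourf(xx, yy, probs, levels=20, cmap="RdBu", alpha=0.6)
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap="RdBu", edgecolors="k")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()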

Out[23]:
Visualization
Contour plot showing a curved decision boundary separating two moon-shaped clusters of points.
Decision boundary learned by the MLP on the two moons dataset. The curved boundary separates the two classes.
Training progress showing loss decrease and accuracy improvement over epochs.

The MLP learns a curved decision boundary that cleanly separates the two moon-shaped clusters. A linear classifier would fail miserably on this task, but the hidden layer transforms the input into a representation where the classes become separable.

Multiclass Classification

For problems with more than two classes, we use softmax activation in the output layer. Softmax converts a vector of raw scores (called logits) into a probability distribution. Given an output vector $\mathbf{z} = [z_1, z_2, \ldots, z_K]$ from the final layer, softmax computes the probability for each class:

$$\text{softmax}(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$

where:

  • $K$: the total number of classes
  • $z_i$: the logit (raw score) for class $i$
  • $e^{z_i}$: the exponential of $z_i$, which ensures all values become positive
  • $\sum_{j=1}^{K} e^{z_j}$: the sum of exponentials across all classes, serving as a normalization constant

The exponential function amplifies differences between logits: if one class has a much higher score than others, it will dominate the probability distribution. The denominator ensures all probabilities sum to 1, creating a valid probability distribution where each output represents the predicted probability of the corresponding class.
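In code, softmax is a few lines of NumPy. The max-subtraction trick below is a standard guard against overflow; the example logits are arbitrary and not the ones used in the figure that follows:

# Softmax with the usual max-subtraction trick for numerical stability
def softmax(z):
    exp_z = np.exp(z - np.max(z))  # subtracting a constant leaves the result unchanged
    return exp_z / exp_z.sum()

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs)        # approx [0.659, 0.242, 0.099]
print(probs.sum())  # 1.0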

Out[24]:
Visualization
Two bar charts side by side showing logits and their corresponding softmax probabilities for three classes.
Raw logits (unnormalized scores) from the network's output layer.
After softmax, the scores become a valid probability distribution that sums to 1.
In[25]:
Code
from sklearn.datasets import load_iris

# Load iris dataset (3 classes)
iris = load_iris()
X_iris, y_iris = iris.data, iris.target
X_train_iris, X_test_iris, y_train_iris, y_test_iris = train_test_split(
    X_iris, y_iris, test_size=0.2, random_state=42
)

# Standardize
scaler_iris = StandardScaler()
X_train_iris = scaler_iris.fit_transform(X_train_iris)
X_test_iris = scaler_iris.transform(X_test_iris)

# Convert to tensors
X_train_iris_t = torch.FloatTensor(X_train_iris)
y_train_iris_t = torch.LongTensor(y_train_iris)
X_test_iris_t = torch.FloatTensor(X_test_iris)
y_test_iris_t = torch.LongTensor(y_test_iris)
In[26]:
Code
class MulticlassClassifierMLP(nn.Module):
    def __init__(self, input_size, hidden_sizes, num_classes):
        super().__init__()

        layers = []
        prev_size = input_size

        for hidden_size in hidden_sizes:
            layers.append(nn.Linear(prev_size, hidden_size))
            layers.append(nn.ReLU())
            layers.append(nn.Dropout(0.2))
            prev_size = hidden_size

        # Output layer (no softmax, CrossEntropyLoss includes it)
        layers.append(nn.Linear(prev_size, num_classes))

        self.network = nn.Sequential(*layers)

    def forward(self, x):
        return self.network(x)


# Create and train multiclass model
model_mc = MulticlassClassifierMLP(
    input_size=4, hidden_sizes=[16, 8], num_classes=3
)
criterion_mc = nn.CrossEntropyLoss()
optimizer_mc = optim.Adam(model_mc.parameters(), lr=0.01)

for epoch in range(200):
    model_mc.train()
    logits = model_mc(X_train_iris_t)
    loss = criterion_mc(logits, y_train_iris_t)

    optimizer_mc.zero_grad()
    loss.backward()
    optimizer_mc.step()
Out[27]:
Console
Test Results:
  Accuracy: 100.00% (30/30 correct)

Sample predictions (first 5 test examples):
  Classes: ['setosa', 'versicolor', 'virginica']

  [✓] True: versicolor   | Predicted: versicolor   | Probs: [0.000, 0.992, 0.008]
  [✓] True: setosa       | Predicted: setosa       | Probs: [1.000, 0.000, 0.000]
  [✓] True: virginica    | Predicted: virginica    | Probs: [0.000, 0.000, 1.000]
  [✓] True: versicolor   | Predicted: versicolor   | Probs: [0.000, 0.969, 0.031]
  [✓] True: versicolor   | Predicted: versicolor   | Probs: [0.000, 0.989, 0.011]

The multiclass model achieves 100% accuracy on the iris test set, correctly classifying all 30 test samples. Notice how the softmax outputs sum to 1.0 for each example, creating valid probability distributions. When the model is confident, one class dominates with probability close to 1.0 while others are near zero. The cross-entropy loss encourages this behavior by rewarding confident correct predictions and heavily penalizing confident incorrect ones.
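The evaluation cell that produced those results is not shown above; it likely resembles the following sketch, which applies softmax to the raw logits to recover the printed probabilities:

# Evaluate the trained multiclass model (sketch; variable names beyond those defined earlier are assumptions)
model_mc.eval()
with torch.no_grad():
    test_logits = model_mc(X_test_iris_t)
    test_probs = torch.softmax(test_logits, dim=1)  # per-class probabilities
    test_preds = test_probs.argmax(dim=1)           # predicted class indices

accuracy = (test_preds == y_test_iris_t).float().mean().item()
print(f"Accuracy: {accuracy:.2%}")
for i in range(5):
    true_name = iris.target_names[y_test_iris_t[i].item()]
    pred_name = iris.target_names[test_preds[i].item()]
    print(true_name, pred_name, test_probs[i].numpy().round(3))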

MLP for Regression

Regression tasks require predicting continuous values rather than class labels. The key differences from classification are:

  • No activation function on the output layer (identity function)
  • Mean squared error (MSE) or mean absolute error (MAE) as the loss function
  • Output layer has one neuron per target variable
In[28]:
Code
# Create a regression dataset
np.random.seed(42)
X_reg = np.random.uniform(-3, 3, (500, 1))
y_reg = np.sin(X_reg) + 0.1 * X_reg**2 + np.random.normal(0, 0.1, X_reg.shape)

X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42
)

X_train_reg_t = torch.FloatTensor(X_train_reg)
y_train_reg_t = torch.FloatTensor(y_train_reg)
X_test_reg_t = torch.FloatTensor(X_test_reg)
y_test_reg_t = torch.FloatTensor(y_test_reg)
In[29]:
Code
class RegressionMLP(nn.Module):
    def __init__(self, input_size, hidden_sizes):
        super().__init__()

        layers = []
        prev_size = input_size

        for hidden_size in hidden_sizes:
            layers.append(nn.Linear(prev_size, hidden_size))
            layers.append(nn.ReLU())
            prev_size = hidden_size

        # Output layer with no activation for regression
        layers.append(nn.Linear(prev_size, 1))

        self.network = nn.Sequential(*layers)

    def forward(self, x):
        return self.network(x)


# Create and train regression model
model_reg = RegressionMLP(input_size=1, hidden_sizes=[32, 16])
criterion_reg = nn.MSELoss()
optimizer_reg = optim.Adam(model_reg.parameters(), lr=0.01)

train_losses_reg = []
for epoch in range(300):
    model_reg.train()
    y_pred = model_reg(X_train_reg_t)
    loss = criterion_reg(y_pred, y_train_reg_t)

    optimizer_reg.zero_grad()
    loss.backward()
    optimizer_reg.step()

    train_losses_reg.append(loss.item())
Out[30]:
Visualization
Two-panel figure showing regression fit curve and training loss over epochs.
MLP learns the underlying non-linear function. Blue line shows MLP prediction, dashed orange shows true function.
Training loss decreasing over epochs on logarithmic scale.
Out[31]:
Console
Regression Results:
  Test MSE:  0.0111
  Test RMSE: 0.1055

Training Progress:
  Initial MSE: 0.6658
  Final MSE:   0.0094
  Improvement: 98.6%

The low test MSE and RMSE indicate the model fits the underlying function well. The RMSE of around 0.1 means predictions are typically within 0.1 units of the true value on average, which matches the noise level we added to the data. The MLP learns to approximate the underlying non-linear function. Unlike polynomial regression where you must choose the degree, the MLP automatically discovers the appropriate level of complexity through its hidden representations.
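The test metrics above can be computed with a short evaluation pass; that cell is not shown, but it presumably looks something like this sketch:

# Evaluate the regression model on the held-out set (sketch)
model_reg.eval()
with torch.no_grad():
    y_test_pred = model_reg(X_test_reg_t)
    test_mse = criterion_reg(y_test_pred, y_test_reg_t).item()

print(f"Test MSE:  {test_mse:.4f}")
print(f"Test RMSE: {test_mse ** 0.5:.4f}")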

Architecture Design Guidelines

Designing an MLP architecture involves choosing the number of layers, neurons per layer, activation functions, and regularization techniques. While there is no universal formula, several principles guide these decisions.

Depth vs. Width

Deeper networks can represent more complex hierarchical features. However, they are harder to train due to vanishing or exploding gradients. Wider networks have more parameters per layer but may struggle to learn compositional patterns.

As a starting point:

  • For simple problems, 1-2 hidden layers often suffice
  • For complex patterns, 3-5 hidden layers may be needed
  • Very deep networks (10+ layers) typically require residual connections

Layer Sizes

Common patterns for hidden layer sizes include:

  • Funnel: Decreasing sizes (e.g., 256-128-64) that progressively compress information
  • Constant: Same size throughout (e.g., 128-128-128) for uniform capacity
  • Bottleneck: Narrow middle layer (e.g., 256-32-256) to force compressed representations
Out[32]:
Visualization
Side-by-side comparison of funnel, constant, and bottleneck network architectures showing layer widths.
Funnel architecture: progressively compresses the representation from 256 to 64 neurons.
Constant architecture: maintains uniform capacity (128 neurons) throughout hidden layers.
Bottleneck architecture: forces the network to learn compressed representations (32 neurons) in the middle.

The input and output sizes are determined by your problem. Hidden sizes are hyperparameters to tune.

Activation Functions

ReLU is the default choice for hidden layers due to its simplicity and effectiveness. Alternatives like Leaky ReLU or GELU can help in specific situations:

  • ReLU: Fast, simple, works well in most cases
  • Leaky ReLU/PReLU: Addresses dying ReLU problem
  • GELU: Smooth approximation, popular in transformers
  • Tanh/Sigmoid: Rarely used in hidden layers now due to saturation

For output layers:

  • Sigmoid: Binary classification (outputs probability)
  • Softmax: Multiclass classification (outputs probability distribution)
  • Identity (none): Regression (outputs unbounded values)

Regularization

To prevent overfitting, we use regularization techniques:

  • Dropout: Randomly zeros neurons during training (typical rates: 0.2-0.5)
  • Weight decay: L2 penalty on weights (typical values: 1e-4 to 1e-2)
  • Batch normalization: Normalizes layer inputs, also acts as regularizer
In[33]:
Code
class WellDesignedMLP(nn.Module):
    """An MLP following best practices for architecture design."""

    def __init__(
        self,
        input_size,
        num_classes,
        hidden_sizes=[256, 128, 64],
        dropout_rate=0.3,
        use_batch_norm=True,
    ):
        super().__init__()

        layers = []
        prev_size = input_size

        for hidden_size in hidden_sizes:
            layers.append(nn.Linear(prev_size, hidden_size))

            if use_batch_norm:
                layers.append(nn.BatchNorm1d(hidden_size))

            layers.append(nn.ReLU())
            layers.append(nn.Dropout(dropout_rate))
            prev_size = hidden_size

        layers.append(nn.Linear(prev_size, num_classes))
        self.network = nn.Sequential(*layers)

    def forward(self, x):
        return self.network(x)
Out[34]:
Console
Well-designed MLP architecture:
WellDesignedMLP(
  (network): Sequential(
    (0): Linear(in_features=100, out_features=256, bias=True)
    (1): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU()
    (3): Dropout(p=0.3, inplace=False)
    (4): Linear(in_features=256, out_features=128, bias=True)
    (5): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (6): ReLU()
    (7): Dropout(p=0.3, inplace=False)
    (8): Linear(in_features=128, out_features=64, bias=True)
    (9): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (10): ReLU()
    (11): Dropout(p=0.3, inplace=False)
    (12): Linear(in_features=64, out_features=10, bias=True)
  )
)

Parameter count:
  Total parameters:     68,554
  Trainable parameters: 68,554

This architecture applies batch normalization after each linear layer, followed by ReLU activation and dropout. The pattern Linear $\to$ BatchNorm $\to$ ReLU $\to$ Dropout repeats for each hidden layer, providing a consistent structure that's easy to extend. The funnel shape (256 $\to$ 128 $\to$ 64) progressively compresses the representation, forcing the network to distill the most important features as information flows toward the output.
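Of the regularization techniques listed earlier, dropout and batch normalization live in the model definition, while weight decay is configured on the optimizer. A typical pairing might look like this sketch (the variable names and values are illustrative):

# Weight decay is set on the optimizer, not in the model itself
model_wd = WellDesignedMLP(input_size=100, num_classes=10)
optimizer_wd = optim.Adam(model_wd.parameters(), lr=1e-3, weight_decay=1e-4)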

Limitations and Impact

Despite their power, MLPs have significant limitations that motivated the development of more specialized architectures.

The most fundamental limitation is their treatment of input as a flat vector. When processing images, MLPs ignore spatial structure, treating each pixel as an independent feature. A small shift in the image produces a completely different input vector, yet the semantic content remains the same. This lack of translation invariance means MLPs need to learn the same pattern multiple times for different positions. Convolutional neural networks address this by sharing weights across spatial locations.

Similarly, for sequential data like text, MLPs treat each position independently. The sentence "The cat sat on the mat" becomes a fixed-size vector where position 1 and position 5 have no structural relationship. This makes learning long-range dependencies extremely difficult. The meaning of "it" in "The trophy doesn't fit in the suitcase because it is too big" depends on understanding the full context, something MLPs struggle with. Recurrent networks and transformers were developed specifically to handle sequential dependencies.

MLPs also struggle with variable-length inputs. Every MLP has a fixed input size determined at architecture design time. Processing sentences of different lengths requires padding to a maximum length or using techniques like bag-of-words that lose positional information entirely. This rigidity limits their applicability to many real-world problems.

Despite these limitations, MLPs remain foundational. The feed-forward layers within transformers are MLPs. The classification heads on top of pre-trained language models are MLPs. Understanding how information flows through layers, how weights connect neurons, and how activations transform representations is essential knowledge for working with any modern neural architecture.

The representational power of MLPs demonstrated that neural networks could, in principle, learn complex functions. The universal approximation theorem provided theoretical justification. The practical challenge became not representation but optimization: finding the right weights among billions of possibilities. The techniques developed to train MLPs (gradient descent, backpropagation, and regularization) form the foundation of all deep learning.

Summary

Multilayer perceptrons extend single neurons into networks capable of learning complex, non-linear functions. By stacking layers with non-linear activations, MLPs can approximate virtually any function, a property formalized in the universal approximation theorem.

Key takeaways from this chapter:

  • Hidden layers transform inputs into new representations where patterns become easier to detect
  • Weight matrices connect layers, with shape $(n^{[l]}, n^{[l-1]})$ for the matrix connecting layer $l-1$ to layer $l$
  • Forward pass propagates information through linear transformations and activations: $\mathbf{a}^{[l]} = \sigma(\mathbf{W}^{[l]} \mathbf{a}^{[l-1]} + \mathbf{b}^{[l]})$
  • Batch processing uses matrix operations for efficiency, stacking examples as columns
  • Classification uses sigmoid (binary) or softmax (multiclass) output activations with cross-entropy loss
  • Regression uses identity (no) output activation with MSE loss
  • Architecture design involves balancing depth, width, activation functions, and regularization

The next chapter explores loss functions in detail, examining how different choices affect learning and what happens when we optimize various objectives. Understanding loss functions is crucial because they define what "good" means for our network.

Key Parameters

When building MLPs in PyTorch, several parameters significantly impact model performance:

Architecture Parameters:

  • hidden_sizes: List of neurons per hidden layer (e.g., [256, 128, 64]). Larger sizes increase capacity but also parameter count and risk of overfitting. Start with powers of 2 for computational efficiency.
  • input_size: Number of input features, determined by your data.
  • num_classes / output_size: Number of output neurons. Use 1 for binary classification or regression, $K$ for $K$-class classification.

Regularization Parameters:

  • dropout_rate: Probability of zeroing neurons during training (typically 0.1-0.5). Higher values provide stronger regularization but may slow convergence. Use 0.2-0.3 as a starting point.
  • weight_decay: L2 regularization strength in the optimizer (typically 1e-4 to 1e-2). Penalizes large weights to reduce overfitting.

Training Parameters:

  • lr (learning rate): Step size for gradient updates (typically 1e-4 to 1e-1). Too high causes instability; too low causes slow convergence. Adam optimizer often works well with lr=0.001.
  • epochs: Number of complete passes through the training data. Monitor validation loss to avoid overfitting.
  • batch_size: Number of examples per gradient update (typically 32-256). Larger batches provide more stable gradients but require more memory; see the mini-batch sketch after this list.
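The training loops earlier in this chapter pass the full dataset through the model at every step. For larger datasets, a mini-batch loop with a DataLoader is the usual pattern; the sketch below reuses the two-moons tensors and binary classifier defined earlier, with illustrative hyperparameters:

from torch.utils.data import DataLoader, TensorDataset

# Wrap the training tensors in a DataLoader that yields shuffled mini-batches
train_loader = DataLoader(
    TensorDataset(X_train_t, y_train_t), batch_size=64, shuffle=True
)

model_mb = BinaryClassifierMLP(input_size=2, hidden_sizes=[16, 8])
optimizer_mb = optim.Adam(model_mb.parameters(), lr=0.001, weight_decay=1e-4)
criterion_mb = nn.BCELoss()

for epoch in range(20):
    model_mb.train()
    for X_batch, y_batch in train_loader:
        optimizer_mb.zero_grad()
        loss = criterion_mb(model_mb(X_batch), y_batch)
        loss.backward()
        optimizer_mb.step()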

Activation Choices:

  • Hidden layers: ReLU is the default choice. Consider Leaky ReLU if many neurons "die" (output constant zero).
  • Output layer: Sigmoid for binary classification, Softmax (via CrossEntropyLoss) for multiclass, Identity (none) for regression.

Loss Functions:

  • nn.BCELoss(): Binary cross-entropy for binary classification with sigmoid output.
  • nn.CrossEntropyLoss(): Combines softmax and negative log-likelihood for multiclass classification. Expects raw logits, not probabilities.
  • nn.MSELoss(): Mean squared error for regression tasks.
