
Linear Classifiers: The Foundation of Neural Networks

Michael Brenndoerfer · December 15, 2025 · 35 min read

Master linear classifiers including weighted voting, decision boundaries, sigmoid, softmax, and gradient descent. The building blocks of every neural network.

Linear Classifiers

Before diving into the complex architectures of modern neural networks, we need to understand their fundamental building block: the linear classifier. This simple yet powerful model forms the core of every neuron in a deep network. Master this concept, and the rest of neural networks become variations on a theme.

Linear classifiers make predictions by computing a weighted combination of input features and comparing the result against a threshold. They draw straight lines (or hyperplanes in higher dimensions) to separate data into categories. While this simplicity limits what they can model, it also makes them fast, interpretable, and mathematically tractable, qualities that carry forward into the neural networks built upon them.

In this chapter, we'll build linear classifiers from the ground up. You'll learn how weights and biases define decision boundaries, why the dot product is the heart of the classification decision, and how gradient descent teaches the model to find good boundaries. We'll extend to multiclass problems with the softmax function and conclude by examining what linear classifiers cannot do, setting the stage for the neural networks that overcome these limitations.

The Core Idea: Weighted Voting

At its heart, a linear classifier is a voting system. Each input feature casts a vote, weighted by its importance, and the votes are summed to make a decision. Consider spam detection: certain words like "free" or "winner" should push the classification toward spam, while words like "meeting" or "quarterly" suggest legitimate email.

Linear Classifier

A linear classifier predicts a class based on a linear combination of input features. For input features $\mathbf{x}$, weights $\mathbf{w}$, and bias $b$, the classifier computes $z = \mathbf{w} \cdot \mathbf{x} + b$ and predicts the positive class if $z > 0$, otherwise the negative class.

Let's make this concrete with a simple example. Suppose we want to classify movie reviews as positive or negative based on word counts.

In[2]:
Code
import numpy as np

# Features: [count of "great", count of "terrible", count of "loved", count of "boring"]
# Positive words should have positive weights, negative words should have negative weights
weights = np.array([0.8, -0.9, 0.7, -0.6])
bias = -0.2

# Example reviews as feature vectors
review_1 = np.array([3, 0, 2, 0])  # "great" appears 3 times, "loved" 2 times
review_2 = np.array([0, 2, 0, 3])  # "terrible" 2 times, "boring" 3 times
review_3 = np.array([1, 1, 1, 1])  # mixed review


def classify(x, w, b):
    """Compute linear classifier score and prediction."""
    score = np.dot(w, x) + b
    prediction = "positive" if score > 0 else "negative"
    return score, prediction
Out[3]:
Console
Review 1 (enthusiastic):
  Features: [3 0 2 0]
  Score: 3.60 → positive

Review 2 (disappointed):
  Features: [0 2 0 3]
  Score: -3.80 → negative

Review 3 (mixed):
  Features: [1 1 1 1]
  Score: -0.20 → negative

The scores reflect our intuition. Review 1, full of positive words, gets a high score. Review 2, dominated by negative words, scores low. Review 3 is close to zero, the decision boundary, reflecting its mixed nature. The weights encode which features matter and in which direction.

The Geometry: Decision Boundaries

Understanding linear classifiers geometrically reveals their power and limitations. In two dimensions, the classifier defines a line. Points on one side belong to class A; points on the other side belong to class B.

The equation $\mathbf{w} \cdot \mathbf{x} + b = 0$ defines the decision boundary: the set of all points where the classifier is perfectly undecided. The weight vector $\mathbf{w}$ is perpendicular to this boundary and points toward the positive region. The bias $b$ shifts the boundary away from the origin.
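To make this geometry concrete, the short sketch below uses an arbitrary illustrative weight vector and bias (not the ones from the figure) to check two facts: a direction lying along the boundary is perpendicular to $\mathbf{w}$, and stepping off the boundary along $+\mathbf{w}$ produces a positive score.

```python
import numpy as np

# Illustrative 2D classifier; these particular values are chosen only for demonstration
w = np.array([2.0, 1.0])
b = -1.0

# Two points on the boundary w·x + b = 0: pick x1 and solve for x2 = (-b - w1*x1) / w2
p1 = np.array([0.0, (-b - w[0] * 0.0) / w[1]])
p2 = np.array([1.0, (-b - w[0] * 1.0) / w[1]])

# The direction along the boundary is perpendicular to w (dot product ~ 0)
boundary_direction = p2 - p1
print("w · boundary direction:", np.dot(w, boundary_direction))

# Stepping off the boundary in the +w direction lands in the positive region
print("score on the boundary:", np.dot(w, p1) + b)                   # ~0 by construction
print("score after a step along +w:", np.dot(w, p1 + 0.5 * w) + b)   # > 0
```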

Out[4]:
Visualization
Scatter plot showing linear classification with decision boundary and weight vector.
A linear classifier in 2D. The decision boundary (black line) separates positive examples (blue circles) from negative examples (red crosses). The weight vector w (green arrow) is perpendicular to the boundary and points toward the positive region. Points closer to the boundary have lower confidence scores.

The color gradient shows the classifier's confidence: deeper blue indicates strong positive predictions, deeper red indicates strong negative predictions, and the transition occurs at the black decision boundary. Notice how the weight vector points perpendicular to the boundary, toward the positive (blue) region.

The Mathematics: Dot Products and Projections

We've seen that linear classifiers compute a score $z = \mathbf{w} \cdot \mathbf{x} + b$ and use that score to make decisions. But what exactly does the dot product $\mathbf{w} \cdot \mathbf{x}$ compute? Why is this particular operation the right choice for classification? Understanding the dot product from both algebraic and geometric perspectives reveals why linear classifiers work and what their limitations are.

The Algebraic Perspective: Weighted Voting

Start with the simplest view. The dot product computes a weighted sum: each input feature is multiplied by its corresponding weight, and the results are added together:

$$\mathbf{w} \cdot \mathbf{x} = \sum_{i=1}^{n} w_i x_i = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n$$

where:

  • $\mathbf{w} = [w_1, w_2, \ldots, w_n]$: the weight vector with $n$ components
  • $\mathbf{x} = [x_1, x_2, \ldots, x_n]$: the input feature vector with $n$ components
  • $w_i$: the weight for the $i$-th feature (learned during training)
  • $x_i$: the value of the $i$-th input feature
  • $n$: the number of features (the dimensionality of the vectors)

Think of this as a voting system. Each feature casts a vote, and the weight determines how much that vote counts. A positive weight means "this feature provides evidence for the positive class," while a negative weight means "this feature provides evidence against." The magnitude of the weight reflects the feature's importance: a weight of 2.0 means this feature's vote counts twice as much as a feature with weight 1.0.

Consider spam detection with features like word counts. If "free" appears 5 times and has weight $-0.8$ (suspicious), it contributes $5 \times (-0.8) = -4$ to the score. If "meeting" appears twice with weight $+0.5$ (legitimate), it contributes $2 \times 0.5 = +1$. The final score aggregates all these votes into a single number that determines the classification.
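The snippet below spells out this vote-counting arithmetic for the two hypothetical features just described; the weights are the illustrative values from the paragraph above, not learned ones.

```python
import numpy as np

# Hypothetical spam-detection features: a suspicious word and a legitimate word
feature_names = ["free", "meeting"]
weights = np.array([-0.8, 0.5])   # negative = evidence for spam, positive = evidence against
counts = np.array([5, 2])         # "free" appears 5 times, "meeting" twice

# Each feature's vote is its count times its weight
contributions = weights * counts
for name, c in zip(feature_names, contributions):
    print(f"{name!r:>10} contributes {c:+.1f}")
print(f"     total score: {contributions.sum():+.1f}")
```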

The Geometric Perspective: Measuring Alignment

The algebraic view tells us how to compute the dot product, but the geometric view tells us what it means. The dot product can equivalently be expressed as:

$$\mathbf{w} \cdot \mathbf{x} = \|\mathbf{w}\| \, \|\mathbf{x}\| \cos\theta$$

where:

  • $\|\mathbf{w}\| = \sqrt{w_1^2 + w_2^2 + \cdots + w_n^2}$: the magnitude (length) of the weight vector
  • $\|\mathbf{x}\| = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2}$: the magnitude of the input vector
  • $\theta$: the angle between the two vectors in $n$-dimensional space
  • $\cos\theta$: ranges from $-1$ (opposite directions) to $+1$ (same direction)

This formulation reveals the dot product's true nature: it measures how much $\mathbf{x}$ points in the same direction as $\mathbf{w}$. The weight vector defines a "template" direction in feature space, and the dot product tells us how well each input matches that template.

When the vectors are perfectly aligned ($\theta = 0°$), $\cos\theta = 1$ and the dot product reaches its maximum positive value. When they point in opposite directions ($\theta = 180°$), $\cos\theta = -1$ and the dot product is maximally negative. When they're perpendicular ($\theta = 90°$), $\cos\theta = 0$ and the dot product is exactly zero, meaning the input lies on the decision boundary.

This geometric insight explains why linear classifiers create straight decision boundaries: the set of points where $\mathbf{w} \cdot \mathbf{x} = 0$ (the decision boundary when $b = 0$) consists of all vectors perpendicular to $\mathbf{w}$. In 2D, these points form a line; in 3D, a plane; in higher dimensions, a hyperplane. A nonzero bias shifts this hyperplane away from the origin without changing its orientation.
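A quick numerical check with two arbitrary example vectors confirms that the weighted-sum form and the $\|\mathbf{w}\| \|\mathbf{x}\| \cos\theta$ form compute the same number.

```python
import numpy as np

w = np.array([2.0, 1.0])
x = np.array([1.5, -0.5])

# Algebraic form: weighted sum of components
dot_algebraic = np.sum(w * x)

# Geometric form: |w| |x| cos(theta), with theta measured from each vector's angle
theta = np.arctan2(x[1], x[0]) - np.arctan2(w[1], w[0])
dot_geometric = np.linalg.norm(w) * np.linalg.norm(x) * np.cos(theta)

print(f"algebraic: {dot_algebraic:.4f}")  # 2.5000
print(f"geometric: {dot_geometric:.4f}")  # 2.5000
```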

Out[5]:
Visualization
Two vectors pointing in similar directions with positive dot product.
Aligned vectors produce a large positive dot product. The weight and input vectors point in similar directions.
Two perpendicular vectors with zero dot product.
Perpendicular vectors have zero dot product. The input lies exactly on the decision boundary.
Two vectors pointing in opposite directions with negative dot product.
Opposing vectors produce a negative dot product. The input points away from the positive class direction.

The visualization above makes this concrete. When the input vector aligns with the weight vector (left panel), the dot product is large and positive, indicating strong evidence for the positive class. When they're perpendicular (center), the dot product is zero, placing the input exactly on the decision boundary. When they oppose (right), the dot product is negative, indicating the negative class.

Binary Classification with Sigmoid

So far, we've treated the linear score as a direct decision: positive scores predict one class, negative scores predict another. But this binary output discards valuable information. A score of 0.01 (barely positive) and a score of 100 (strongly positive) both yield the same "positive" prediction, yet clearly we should be more confident about the second.

What we really want is a probability: a number between 0 and 1 that quantifies our confidence. This probability lets us make informed decisions: perhaps we flag emails as "suspicious" when $P(\text{spam}) > 0.3$ but only auto-delete when $P(\text{spam}) > 0.95$.

The challenge is converting an unbounded score $z \in (-\infty, +\infty)$ into a probability $p \in (0, 1)$. We need a function that:

  1. Maps any real number to the interval $(0, 1)$
  2. Preserves order (higher scores → higher probabilities)
  3. Produces 0.5 when the score is 0 (maximum uncertainty)
  4. Approaches 1 for large positive scores and 0 for large negative scores

The sigmoid function satisfies all these requirements:

Sigmoid Function

The sigmoid (or logistic) function $\sigma(z) = \frac{1}{1 + e^{-z}}$ maps any real number to the interval $(0, 1)$, making it suitable for interpreting scores as probabilities.

Why does this formula work? Consider what happens at extreme values. When $z$ is very large and positive, $e^{-z}$ becomes tiny (approaching 0), so $\sigma(z) \approx \frac{1}{1+0} = 1$. When $z$ is very large and negative, $e^{-z}$ becomes huge, so $\sigma(z) \approx \frac{1}{\infty} = 0$. At $z = 0$, we get $\sigma(0) = \frac{1}{1+1} = 0.5$, representing complete uncertainty.

The sigmoid has several useful properties, verified numerically in the short check after this list:

  • $\sigma(0) = 0.5$ (uncertain prediction)
  • As $z \to \infty$, $\sigma(z) \to 1$ (confident positive)
  • As $z \to -\infty$, $\sigma(z) \to 0$ (confident negative)
  • Symmetric: $\sigma(-z) = 1 - \sigma(z)$
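These properties take only a few lines to verify. The minimal check below applies the textbook formula directly; a numerically stable variant of the sigmoid appears in the training code later in this chapter.

```python
import numpy as np

def sigmoid(z):
    # Textbook logistic function; fine here since the inputs are modest
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))                      # 0.5: maximum uncertainty
print(sigmoid(10.0), sigmoid(-10.0))     # ~1 and ~0: confident positive and negative
print(sigmoid(-2.0), 1 - sigmoid(2.0))   # equal, illustrating sigma(-z) = 1 - sigma(z)
```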
Out[6]:
Visualization
Plot of sigmoid function showing S-shaped curve from 0 to 1.
The sigmoid function transforms linear scores into probabilities. Scores near zero map to probabilities near 0.5, while extreme positive or negative scores map to probabilities near 1 or 0 respectively.

With the sigmoid, our classifier becomes a logistic regression model:

$$P(y=1 \mid \mathbf{x}) = \sigma(\mathbf{w} \cdot \mathbf{x} + b) = \frac{1}{1 + \exp\!\big(-(\mathbf{w} \cdot \mathbf{x} + b)\big)}$$

where:

  • $P(y=1 \mid \mathbf{x})$: probability of the positive class given input $\mathbf{x}$
  • $\sigma(\cdot)$: the sigmoid function
  • $\mathbf{w}$: weight vector
  • $\mathbf{x}$: input feature vector
  • $b$: bias term
In[7]:
Code
def sigmoid(z):
    """Numerically stable sigmoid function."""
    return np.where(z >= 0, 1 / (1 + np.exp(-z)), np.exp(z) / (1 + np.exp(z)))


def logistic_regression_predict(X, w, b):
    """Predict probabilities for binary classification."""
    z = X @ w + b
    return sigmoid(z)


# Example: Classify emails based on features
# Features: [word_count, has_attachment, sender_reputation, urgency_words]
w = np.array([0.01, -0.5, 0.8, -0.6])  # Learned weights
b = -0.3

# Sample emails
emails = np.array(
    [
        [150, 0, 0.9, 1],  # Short, no attachment, good sender, some urgency
        [500, 1, 0.2, 5],  # Long, attachment, poor sender, very urgent
        [200, 0, 0.7, 0],  # Medium, clean, decent sender, no urgency
    ]
)
Out[8]:
Console
Email Classification Probabilities:
--------------------------------------------------
Email           P(legitimate)   Prediction     
--------------------------------------------------
Email 1               78.9%       legitimate     
Email 2               79.6%       legitimate     
Email 3               90.6%       legitimate     

The probabilities give us richer information than hard predictions. Email 3, with a clean profile and a strong sender reputation, earns a confident 90.6%. Emails 1 and 2 both land near 79%: still classified as legitimate, but with enough residual uncertainty that Email 2, with its attachment, poor sender reputation, and urgent language, might merit additional scrutiny.

Multiclass Classification with Softmax

Binary classification handles many problems, but what about distinguishing among three or more classes? Sentiment might be positive, negative, or neutral. A document might belong to sports, politics, technology, or entertainment. We need to extend our framework.

The natural approach: give each class its own weight vector and bias. An input gets $K$ separate scores, one per class:

$$z_k = \mathbf{w}_k \cdot \mathbf{x} + b_k \quad \text{for } k = 1, 2, \ldots, K$$

But now we face a similar problem as before. These scores are unbounded real numbers. How do we convert $K$ scores into a probability distribution over $K$ classes?

We can't just apply sigmoid independently to each score, because that wouldn't guarantee the probabilities sum to 1. If sigmoid gives us $P(\text{sports}) = 0.7$ and $P(\text{politics}) = 0.8$, we've created an impossible situation.

The solution is the softmax function, which explicitly enforces the constraint that probabilities sum to 1:

Softmax Function

The softmax function converts a vector of $K$ real-valued scores (often called logits) into a probability distribution. For an input vector $\mathbf{z} = [z_1, z_2, \ldots, z_K]$, the probability assigned to class $i$ is:

$$\text{softmax}(z_i) = \frac{\exp(z_i)}{\sum_{j=1}^{K} \exp(z_j)}$$

where:

  • $z_i$: the raw score (logit) for class $i$
  • $\exp(z_i)$: the exponential of $z_i$, ensuring the value is positive
  • $\sum_{j=1}^{K} \exp(z_j)$: the sum of exponentials over all $K$ classes (normalization constant)
  • $K$: the total number of classes

The formula has an elegant structure. The numerator $\exp(z_i)$ transforms the score into a positive number, with higher scores yielding larger values. The denominator sums these transformed scores across all classes, creating a normalization constant. Dividing by this sum guarantees the outputs sum to 1.

Why use the exponential function specifically? It has two key properties. First, it maps any real number to a positive value, ensuring valid probabilities. Second, it amplifies differences: if class A scores 3.0 and class B scores 1.0, the ratio of their softmax probabilities isn't 3:1 but rather $e^3 : e^1 \approx 20 : 2.7$, about 7:1. This amplification makes the winning class more decisive, which often matches our intuition that one class should dominate when its evidence is substantially stronger.

The table below compares three normalization approaches applied to the same raw scores. Notice how softmax amplifies the gap between classes compared to simple linear normalization:

Comparison of normalization methods applied to the same raw scores. Softmax amplifies the lead of the top-scoring class: Class A's probability rises from 57.1% under linear normalization to 62.9% under softmax, while Class B drops from 28.6% to 23.1%.

| Class | Raw Score | Uniform | Linear (score/sum) | Softmax |
|-------|-----------|---------|--------------------|---------|
| A     | 2.0       | 33.3%   | 57.1%              | 62.9%   |
| B     | 1.0       | 33.3%   | 28.6%              | 23.1%   |
| C     | 0.5       | 33.3%   | 14.3%              | 14.0%   |
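The table is easy to reproduce. The sketch below applies all three normalizations to the same raw scores, using a plain softmax without the max-subtraction trick, which is fine for scores this small.

```python
import numpy as np

scores = np.array([2.0, 1.0, 0.5])

uniform = np.full_like(scores, 1.0 / len(scores))   # ignore the scores entirely
linear = scores / scores.sum()                      # divide each score by the sum
softmax = np.exp(scores) / np.exp(scores).sum()     # exponentiate, then normalize

for name, probs in [("uniform", uniform), ("linear", linear), ("softmax", softmax)]:
    print(f"{name:8s}", np.round(probs * 100, 1))
```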
In[9]:
Code
def softmax(z):
    """Compute softmax probabilities with numerical stability."""
    z_shifted = z - np.max(z)  # Subtract max for numerical stability
    exp_z = np.exp(z_shifted)
    return exp_z / np.sum(exp_z)


# Text classification: Positive, Negative, Neutral sentiment
# Each class has its own weight vector
W = np.array(
    [
        [0.8, -0.7, 0.5, -0.3],  # Weights for Positive class
        [-0.6, 0.9, -0.4, 0.7],  # Weights for Negative class
        [0.1, -0.1, 0.1, -0.1],  # Weights for Neutral class
    ]
)
biases = np.array([0.0, -0.2, 0.1])

# Features: [positive_words, negative_words, intensity, length_normalized]
reviews = np.array(
    [
        [4, 0, 0.8, 0.5],  # Very positive review
        [0, 3, 0.6, 0.3],  # Negative review
        [2, 2, 0.3, 0.7],  # Mixed/neutral review
    ]
)
Out[10]:
Console
Multiclass Sentiment Classification:
============================================================

Review 1: features = [4.  0.  0.8 0.5]
  Scores:        {'Positive': 3.45, 'Negative': -2.57, 'Neutral': 0.53}
  Probabilities: {'Positive': 94.7, 'Negative': 0.2, 'Neutral': 5.1}
  Prediction:    Positive (94.7%)

Review 2: features = [0.  3.  0.6 0.3]
  Scores:        {'Positive': -1.89, 'Negative': 2.47, 'Neutral': -0.17}
  Probabilities: {'Positive': 1.2, 'Negative': 92.2, 'Neutral': 6.6}
  Prediction:    Negative (92.2%)

Review 3: features = [2.  2.  0.3 0.7]
  Scores:        {'Positive': 0.14, 'Negative': 0.77, 'Neutral': 0.06}
  Probabilities: {'Positive': 26.3, 'Negative': 49.4, 'Neutral': 24.3}
  Prediction:    Negative (49.4%)

Notice how softmax handles competition between classes. In Review 1, the positive class dominates with 94.7% probability. In Review 3, the probabilities are more spread out, reflecting genuine ambiguity in the input.

Out[11]:
Visualization
Bar chart showing raw scores of 2.5, 1.0, and 0.5 for three classes.
Raw class scores can be any real numbers, with no constraint on their range or sum.
Bar chart showing softmax probabilities that sum to 100 percent.
After softmax transformation, scores become probabilities that sum to 1.

Training with Gradient Descent

We've seen how linear classifiers make predictions, but we've been assuming the weights are already known. In practice, we start with random (or zero) weights and iteratively improve them based on training data. This learning process is gradient descent: computing how wrong our predictions are, then adjusting weights to reduce that error.

The key insight is treating learning as optimization. We define a loss function that measures how poorly the model performs, then systematically search for weights that minimize this loss. Gradient descent provides the search direction: at each step, we compute which direction makes the loss decrease fastest, then take a small step in that direction.

The Loss Function: Cross-Entropy

Before we can minimize anything, we need to define what "wrong" means mathematically. For classification, the standard choice is cross-entropy loss, also called log loss. This loss function has a crucial property: it penalizes confident wrong predictions far more severely than uncertain ones.

Consider what we want from a loss function. If the true label is "positive" and the model predicts 90% positive, that's good and should incur low loss. If the model predicts 10% positive for the same example, that's bad and should incur high loss. But critically, predicting 1% positive should be punished much more than predicting 40% positive, even though both are wrong. The 1% prediction is confidently wrong, while the 40% prediction at least shows appropriate uncertainty.

The cross-entropy loss achieves exactly this behavior. For a single binary example with true label $y \in \{0, 1\}$ and predicted probability $\hat{y} = \sigma(\mathbf{w} \cdot \mathbf{x} + b)$:

$$\mathcal{L} = -\left[ y \log(\hat{y}) + (1 - y) \log(1 - \hat{y}) \right]$$

where:

  • $\mathcal{L}$: the loss value (lower is better, 0 is perfect)
  • $y$: the true label (either 0 or 1)
  • $\hat{y}$: the predicted probability of class 1, output by the sigmoid function
  • $\log$: natural logarithm (base $e$)
  • $y \log(\hat{y})$: contributes to the loss only when $y = 1$ (penalizes low $\hat{y}$)
  • $(1-y) \log(1 - \hat{y})$: contributes to the loss only when $y = 0$ (penalizes high $\hat{y}$)

The formula looks complicated, but it's actually two simpler cases combined. When the true label is $y = 1$, the $(1-y)$ term vanishes, leaving $\mathcal{L} = -\log(\hat{y})$. When the true label is $y = 0$, the $y$ term vanishes, leaving $\mathcal{L} = -\log(1 - \hat{y})$.

Why the logarithm? The function $-\log(p)$ has exactly the shape we need. When $p$ is close to 1, $-\log(p)$ is close to 0, meaning low loss for correct confident predictions. As $p$ approaches 0, $-\log(p)$ explodes toward infinity, severely punishing confident wrong predictions. The logarithm creates an asymmetric penalty structure that drives the model to be confident only when it's correct.

Think of it as measuring "surprise." If the model predicts $\hat{y} = 0.9$ and the true label is $y = 1$, the loss is $-\log(0.9) \approx 0.1$: the model expected this outcome and isn't surprised. But if the model predicts $\hat{y} = 0.1$ for the same positive example, the loss is $-\log(0.1) \approx 2.3$: the model is shocked by an outcome it considered unlikely.
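The sketch below evaluates the loss for a handful of predicted probabilities against a true label of 1, making the asymmetric penalty visible as the prediction drifts away from the truth.

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred):
    # Loss for a single example; the inputs here stay safely away from 0 and 1
    return -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

for p in [0.9, 0.6, 0.4, 0.1, 0.01]:
    print(f"true label 1, predicted {p:<5} -> loss {binary_cross_entropy(1, p):.2f}")
```

The loss grows slowly for mild mistakes and explodes for confident ones: predicting 0.4 costs about 0.92, while predicting 0.01 costs about 4.61.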

Out[12]:
Visualization
Plot of cross-entropy loss showing asymptotic behavior at probability extremes.
Cross-entropy loss penalizes confident wrong predictions severely. When the true label is 1 (blue), predicting low probability incurs high loss. The loss is infinite at the extremes, ensuring the model never becomes absolutely certain of wrong answers.

Computing Gradients

With a loss function defined, we need to determine how to adjust each weight to reduce it. This is where the gradient enters: a vector pointing in the direction of steepest increase in loss. To minimize loss, we move in the opposite direction, taking steps proportional to the gradient's magnitude.

The gradient $\frac{\partial \mathcal{L}}{\partial w_i}$ tells us: "If I increase $w_i$ by a tiny amount, how much does the loss change?" A positive gradient means increasing the weight would increase the loss, so we should decrease it instead. A negative gradient means increasing the weight would help, so we should increase it.

For logistic regression, deriving the gradient requires the chain rule from calculus. The computation flows through several intermediate quantities: the loss depends on the prediction $\hat{y}$, which depends on the score $z$, which depends on the weights $w_i$. We trace this chain step by step.

Step 1: Loss with respect to prediction. Starting from the cross-entropy loss $\mathcal{L} = -[y \log(\hat{y}) + (1-y)\log(1-\hat{y})]$, we differentiate with respect to $\hat{y}$:

$$\frac{\partial \mathcal{L}}{\partial \hat{y}} = -\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}}$$

This tells us how sensitive the loss is to changes in our prediction. When $y = 1$ and $\hat{y}$ is small (wrong prediction), $-\frac{1}{\hat{y}}$ is large and negative, indicating strong pressure to increase $\hat{y}$.

Step 2: Prediction with respect to score. The sigmoid function $\hat{y} = \sigma(z)$ has a remarkably convenient derivative:

$$\frac{\partial \hat{y}}{\partial z} = \hat{y}(1 - \hat{y})$$

This derivative is always positive (since $0 < \hat{y} < 1$), meaning increasing the score always increases the probability. It's largest when $\hat{y} = 0.5$ (where the model is most uncertain) and smallest near the extremes.
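A finite-difference check at a few points confirms this formula. This minimal sketch reuses the sigmoid function defined earlier in the chapter and compares the analytic derivative against central differences.

```python
# Reuses the sigmoid function defined above
z = np.array([-2.0, 0.0, 1.5])
eps = 1e-6

numerical = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)  # central difference
analytic = sigmoid(z) * (1 - sigmoid(z))                       # the claimed formula

print(np.round(numerical, 6))
print(np.round(analytic, 6))  # matches the numerical estimate
```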

Out[13]:
Visualization
Plot showing sigmoid function and its derivative, with derivative peaking at z equals zero.
The sigmoid derivative determines how sensitive the probability is to changes in the score. The derivative peaks at z=0 (where the model is most uncertain) and approaches zero at the extremes. This means gradient updates are largest for uncertain predictions, which is exactly where learning is most beneficial.

Step 3: Combining via chain rule. The chain rule multiplies these derivatives. After algebraic simplification (which is why sigmoid and cross-entropy are such a natural pairing), the result is surprisingly simple:

$$\frac{\partial \mathcal{L}}{\partial z} = \frac{\partial \mathcal{L}}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} = \hat{y} - y$$

This is the prediction error: the difference between what the model predicted and what actually happened. When the model is correct ($\hat{y} \approx y$), the gradient is near zero and we barely update. When the model is wrong, the gradient is large and we update substantially.

Step 4: Score with respect to weights. Finally, since $z = \sum_i w_i x_i + b$, we have $\frac{\partial z}{\partial w_i} = x_i$ and $\frac{\partial z}{\partial b} = 1$. Applying the chain rule once more:

$$\frac{\partial \mathcal{L}}{\partial w_i} = (\hat{y} - y)\, x_i \qquad\qquad \frac{\partial \mathcal{L}}{\partial b} = \hat{y} - y$$

where:

  • $\frac{\partial \mathcal{L}}{\partial w_i}$: how much the loss changes when we slightly increase weight $w_i$
  • $\frac{\partial \mathcal{L}}{\partial b}$: how much the loss changes when we slightly increase the bias
  • $\hat{y} - y$: the prediction error (difference between predicted probability and true label)
  • $x_i$: the $i$-th input feature value

The final gradient has an elegant interpretation: it's the prediction error times the input feature. This makes intuitive sense. The error $(\hat{y} - y)$ determines the magnitude of the correction: large errors mean large updates. The feature value $x_i$ determines which weights are responsible: if feature $i$ had value zero, it couldn't have contributed to the error, so its weight shouldn't change. If feature $i$ had a large value, it played a big role in the prediction, so its weight deserves a proportionally larger adjustment.
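Before using these gradients for training, it is worth confirming them numerically. The sketch below compares the analytic gradient $(\hat{y} - y)\,x_i$ against central finite differences of the loss for a single example with arbitrary values, reusing the sigmoid function defined earlier.

```python
def loss_single(w, b, x, y):
    """Cross-entropy loss for one example."""
    y_hat = sigmoid(np.dot(w, x) + b)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Arbitrary example values
x = np.array([1.5, -2.0, 0.5])
y = 1.0
w = np.array([0.1, 0.3, -0.2])
b = 0.05

# Analytic gradient: prediction error times input
y_hat = sigmoid(np.dot(w, x) + b)
grad_analytic = (y_hat - y) * x

# Numerical gradient via central differences
eps = 1e-6
grad_numeric = np.zeros_like(w)
for i in range(len(w)):
    w_plus, w_minus = w.copy(), w.copy()
    w_plus[i] += eps
    w_minus[i] -= eps
    grad_numeric[i] = (loss_single(w_plus, b, x, y) - loss_single(w_minus, b, x, y)) / (2 * eps)

print(np.round(grad_analytic, 6))
print(np.round(grad_numeric, 6))  # should agree closely with the analytic values
```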

In[14]:
Code
def train_logistic_regression(X, y, learning_rate=0.1, n_epochs=100):
    """Train logistic regression with gradient descent."""
    n_samples, n_features = X.shape

    # Initialize weights and bias
    w = np.zeros(n_features)
    b = 0.0

    losses = []

    for epoch in range(n_epochs):
        # Forward pass: compute predictions
        z = X @ w + b
        y_hat = sigmoid(z)

        # Compute loss
        epsilon = 1e-15  # Avoid log(0)
        loss = -np.mean(
            y * np.log(y_hat + epsilon) + (1 - y) * np.log(1 - y_hat + epsilon)
        )
        losses.append(loss)

        # Backward pass: compute gradients
        error = y_hat - y  # Shape: (n_samples,)
        grad_w = (X.T @ error) / n_samples  # Average over samples
        grad_b = np.mean(error)

        # Update weights
        w = w - learning_rate * grad_w
        b = b - learning_rate * grad_b

    return w, b, losses

Let's train on a simple dataset and visualize how the decision boundary evolves:

In[15]:
Code
# Generate linearly separable data
np.random.seed(42)
n_samples = 100

# Class 0: centered at (-1, -1)
X0 = np.random.randn(n_samples // 2, 2) * 0.8 + np.array([-1, -1])
# Class 1: centered at (1, 1)
X1 = np.random.randn(n_samples // 2, 2) * 0.8 + np.array([1, 1])

X = np.vstack([X0, X1])
y = np.array([0] * (n_samples // 2) + [1] * (n_samples // 2))

# Train the model
w, b, losses = train_logistic_regression(X, y, learning_rate=0.5, n_epochs=200)
Out[16]:
Console
Training Results:
----------------------------------------
Final weights: w = [3.004, 2.947]
Final bias:    b = 0.054
Initial loss:  0.6931
Final loss:    0.0464
Loss reduction: 93.3%
Accuracy:      99.0%

The model converged to a solution with both weights positive and roughly equal, reflecting the symmetric nature of the data (both features contribute equally to separating the classes). The loss dropped by over 90% from its initial value, and the model reaches 99% accuracy on this nearly separable dataset.

Out[17]:
Visualization
Line plot showing decreasing loss over training epochs.
Loss decreases during training as gradient descent finds better weights. The rapid initial decrease shows the model quickly moving away from its uninformative all-zero initialization.
Scatter plot with learned linear decision boundary.
The learned decision boundary separates the two classes. The line marks where the model predicts 50% probability; points on the blue side are classified as positive.

The training demonstrates gradient descent in action. The loss drops rapidly in early epochs as the model moves away from its all-zero starting weights, then continues to decrease more slowly as it fine-tunes the boundary. The final decision boundary separates the two classes almost perfectly.

A Worked Example: Sentiment Classification

Let's bring everything together with a complete example: classifying movie reviews as positive or negative using a bag-of-words representation.

In[18]:
Code
# Sample movie reviews (simplified for demonstration)
reviews = [
    ("This movie was fantastic and thrilling", 1),
    ("Terrible film with awful acting", 0),
    ("I loved every minute of this masterpiece", 1),
    ("Boring and predictable waste of time", 0),
    ("Great performances and beautiful cinematography", 1),
    ("The worst movie I have ever seen", 0),
    ("Absolutely wonderful and heartwarming story", 1),
    ("Dull characters in a pointless plot", 0),
]

# Build vocabulary from training data
all_words = []
for text, _ in reviews:
    all_words.extend(text.lower().split())
vocab = sorted(set(all_words))
word_to_idx = {word: i for i, word in enumerate(vocab)}
Out[19]:
Console
Vocabulary size: 40
Sample words: ['a', 'absolutely', 'acting', 'and', 'awful', 'beautiful', 'boring', 'characters', 'cinematography', 'dull']
In[20]:
Code
from collections import Counter


def text_to_features(text, word_to_idx):
    """Convert text to bag-of-words feature vector."""
    words = text.lower().split()
    counts = Counter(words)
    features = np.zeros(len(word_to_idx))
    for word, count in counts.items():
        if word in word_to_idx:
            features[word_to_idx[word]] = count
    return features


# Create feature matrix and labels
X_train = np.array([text_to_features(text, word_to_idx) for text, _ in reviews])
y_train = np.array([label for _, label in reviews])

# Train the classifier
w_trained, b_trained, train_losses = train_logistic_regression(
    X_train, y_train, learning_rate=0.1, n_epochs=500
)
Out[21]:
Console
Training complete!
----------------------------------------
Final training loss: 0.0331
Training accuracy:   100.0%

Most positive words (highest weights):
  this            +1.217
  and             +0.985
  absolutely      +0.652
  beautiful       +0.652
  cinematography  +0.652

Most negative words (lowest weights):
  with            -0.645
  boring          -0.885
  predictable     -0.885
  time            -0.885
  waste           -0.885

The classifier achieves perfect accuracy on the training set. The learned weights broadly separate the vocabularies of the two classes: words that appear only in positive reviews, such as "absolutely," "beautiful," and "cinematography," receive positive weights, while words from negative reviews, such as "boring," "predictable," and "waste," receive negative weights. With only eight training examples, incidental words like "this" and "and" also pick up large positive weights simply because they mostly co-occur with positive reviews, a reminder that interpretability is only as reliable as the data behind it. Even so, being able to read the model's reasoning directly off its weights is a key advantage of linear classifiers.

Out[22]:
Visualization
Horizontal bar chart showing word weights with positive sentiment words on the right and negative sentiment words on the left.
Learned word weights from sentiment classification. Positive weights (blue) indicate words associated with positive reviews; negative weights (red) indicate words associated with negative reviews. The magnitude reflects how strongly each word influences the prediction.
In[23]:
Code
# Test on new reviews
test_reviews = [
    "This was a wonderful and fantastic experience",
    "Boring film with terrible performances",
    "The movie was okay nothing special",
]


def predict_sentiment(text, w, b, word_to_idx):
    """Predict sentiment probability for a review."""
    features = text_to_features(text, word_to_idx)
    z = np.dot(w, features) + b
    prob = sigmoid(z)
    return prob, "Positive" if prob > 0.5 else "Negative"
Out[24]:
Console
Test Predictions:
------------------------------------------------------------
"This was a wonderful and fantastic experience"
  → Positive (96.1% positive probability)

"Boring film with terrible performances"
  → Negative (8.5% positive probability)

"The movie was okay nothing special"
  → Negative (41.2% positive probability)

The classifier correctly identifies the first review as positive (high confidence due to "wonderful" and "fantastic") and the second as negative (due to "boring" and "terrible"). The third review contains no strongly weighted words from our training vocabulary, so it receives a probability near 50%, appropriately reflecting the model's uncertainty when it encounters unfamiliar language.

Limitations of Linear Classifiers

Linear classifiers are powerful, but their linearity imposes fundamental constraints. Understanding these limitations is crucial for knowing when to reach for more complex models.

The XOR Problem

The most famous limitation is the XOR problem. Consider data arranged such that positive examples are in the top-left and bottom-right quadrants, while negative examples are in the top-right and bottom-left. No single straight line can separate these classes.

Out[25]:
Visualization
Scatter plot of XOR data with dashed lines showing failed linear boundaries.
XOR data cannot be separated by any linear boundary. Dashed lines show failed attempts, and no straight line correctly classifies all points.
Scatter plot of XOR data with curved boundary correctly separating classes.
A nonlinear boundary successfully separates XOR classes. Neural networks learn such curved boundaries by composing multiple linear classifiers.

Curse of Feature Engineering

Before neural networks, practitioners addressed linear separability by manually engineering nonlinear features. For XOR, you might add a feature $x_1 \cdot x_2$ (the product of the two inputs), which makes the problem linearly separable in the expanded feature space.
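To see this trick in action, the sketch below adds the product feature to the four XOR corner points and trains on the expanded representation, reusing the train_logistic_regression and sigmoid functions defined earlier in the chapter. In the three-dimensional feature space the classes become linearly separable, and the trained model should classify all four points correctly.

```python
import numpy as np

# XOR corner points: same-sign quadrants are one class, opposite-sign quadrants the other
X_xor = np.array([[1, 1], [-1, -1], [-1, 1], [1, -1]], dtype=float)
y_xor = np.array([0, 0, 1, 1])

# Engineer a nonlinear feature: the product x1 * x2
X_aug = np.column_stack([X_xor, X_xor[:, 0] * X_xor[:, 1]])

# Train on the expanded features with the routine defined above
w_xor, b_xor, _ = train_logistic_regression(X_aug, y_xor, learning_rate=1.0, n_epochs=2000)

probs = sigmoid(X_aug @ w_xor + b_xor)
print("P(class 1):", np.round(probs, 3))
print("predicted: ", (probs > 0.5).astype(int), " true:", y_xor)
```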

But this approach doesn't scale. Real-world problems may require complex feature combinations that are impossible to discover manually. Consider image classification: which pixel products or transforms would help distinguish cats from dogs? The search space is astronomical.

What Linear Classifiers Unlock

Despite these limitations, linear classifiers remain foundational for several reasons:

  • Interpretability: The weights directly indicate feature importance. In spam detection, you can explain decisions: "This email was classified as spam because it contained 5 instances of 'free money.'"

  • Speed: Both training and inference are extremely fast. Matrix-vector multiplication is highly optimized on modern hardware.

  • Building blocks: Every neuron in a neural network is essentially a linear classifier followed by a nonlinear activation. Understanding linear classifiers is understanding the atoms of deep learning.

  • Surprising effectiveness: For many NLP tasks, especially with good features like TF-IDF or pretrained embeddings, linear classifiers achieve competitive performance. Sometimes the simplest model is good enough.

Key Parameters

When implementing linear classifiers and logistic regression, several parameters significantly impact model performance.

Out[26]:
Visualization
Line plot showing three loss curves with different learning rates demonstrating slow convergence, optimal convergence, and oscillation.
The effect of learning rate on training convergence. Too small (0.01) converges slowly. A good rate (0.5) converges quickly and smoothly. Too large (2.0) causes oscillation and instability. Choosing the right learning rate is often the most important hyperparameter decision.

The key training parameters to tune are:

  • learning_rate: Controls the step size during gradient descent. Too large causes oscillation or divergence; too small leads to slow convergence. Typical values range from 0.001 to 1.0. Start with 0.1 and adjust based on the loss curve behavior.
  • n_epochs: Number of complete passes through the training data. More epochs allow the model to converge further, but excessive training on small datasets risks overfitting. Monitor the loss curve to determine when convergence occurs.
  • Weight initialization: Initializing weights to zero (as in our implementation) works for linear classifiers since the loss landscape is convex. For neural networks, random initialization becomes essential.
  • Regularization (not shown): Adding an L2 penalty term $\lambda \|\mathbf{w}\|^2$ to the loss prevents overfitting by discouraging large weights. Common values for $\lambda$ range from 0.0001 to 0.1. A sketch of the modified gradient follows this list.
  • Numerical stability: The sigmoid and softmax functions can overflow or underflow with extreme inputs. Our implementation uses the max-subtraction trick for softmax and conditional computation for sigmoid to maintain numerical stability.
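As noted in the regularization bullet above, adding an L2 penalty changes only the gradient computation. The sketch below shows the modified gradients; lam and the function name are illustrative, and the two gradient lines would replace the corresponding lines inside train_logistic_regression.

```python
import numpy as np

def l2_regularized_gradients(X, error, w, lam=0.01):
    """Gradients of the cross-entropy loss plus an L2 penalty lam * ||w||^2."""
    n_samples = X.shape[0]
    grad_w = (X.T @ error) / n_samples + 2 * lam * w  # penalty term shrinks large weights
    grad_b = np.mean(error)                           # the bias is typically not regularized
    return grad_w, grad_b
```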

Summary

Linear classifiers form the foundation upon which neural networks are built. The key concepts from this chapter carry forward into every layer of a deep network:

Weighted voting: A linear classifier computes $z = \mathbf{w} \cdot \mathbf{x} + b$, where each feature votes according to its learned weight. This simple sum, followed by a threshold or nonlinearity, is the fundamental computation repeated billions of times in modern models.

Geometric intuition: The weight vector defines a decision boundary perpendicular to itself. Points are classified based on which side of this hyperplane they fall. The dot product measures alignment between input and weights.

Probabilistic interpretation: The sigmoid function transforms scores into probabilities, and the softmax extends this to multiple classes. These functions appear throughout neural networks wherever we need probability distributions.

Gradient descent: We train by computing gradients of the loss with respect to weights, then taking small steps in the opposite direction. The cross-entropy loss naturally pairs with sigmoid and softmax, giving simple gradient expressions.

Linear limitations: Some patterns, like XOR, cannot be separated by any linear boundary. This motivates the nonlinear activation functions and multiple layers of neural networks, which we explore in the next chapter.

You now have the vocabulary and intuition to understand what happens inside a neural network. Each layer performs linear classification, then breaks linearity with an activation function, then feeds forward to the next layer. The magic of deep learning emerges from stacking these simple operations.
