L1 Regularization (LASSO): Complete Guide with Math, Examples & Python Implementation

Michael Brenndoerfer · April 19, 2025 · 62 min read

A comprehensive guide to L1 regularization (LASSO) in machine learning, covering mathematical foundations, optimization theory, practical implementation, and real-world applications. Learn how LASSO performs automatic feature selection through sparsity.


L1 Regularization (LASSO)

Regularization in Multiple Linear Regression (MLR) is a technique used to prevent overfitting by adding a penalty term to the loss function. In standard MLR, the model tries to minimize the sum of squared errors between the predicted and actual values. However, when the model is too complex or when there are many correlated features, it can fit the training data too closely, capturing noise rather than the underlying pattern.

Regularization addresses this by adding a penalty for large coefficient values, effectively constraining the model. This encourages simpler models that generalize better to new data. Common types include LASSO (L1) and Ridge (L2), which differ in how they penalize coefficients. We cover Ridge in the next section.

LASSO, or Least Absolute Shrinkage and Selection Operator, is known as L1 regularization because it adds a penalty proportional to the L1 norm of the coefficients (the sum of their absolute values).

In simple terms, LASSO helps a regression model avoid overfitting by keeping coefficients small and can drive some to zero, which means the model ignores those features. This helps the model focus on the most useful information and makes it simpler and easier to interpret.

Advantages

L1 regularization (LASSO) offers several advantages. One of its main strengths is that it can automatically perform feature selection by shrinking some coefficients to zero. This means the model becomes simpler and more interpretable, as it focuses on the most important features. L1 regularization is also effective at preventing overfitting, especially when dealing with high-dimensional data where the number of features may be large compared to the number of observations. By encouraging sparsity, LASSO helps create models that typically generalize better to new data.

Unless noted otherwise, the intercept $\beta_0$ is not penalized, and it is good practice to standardize features so the penalty treats them comparably.

Disadvantages

However, L1 regularization also has some drawbacks. When features are highly correlated, LASSO may arbitrarily select one feature and ignore the others, which can make the model unstable and less reliable for interpretation. Additionally, if all features are actually important for predicting the target, L1 regularization might shrink some useful coefficients to zero, potentially reducing the model's predictive performance.

Unlike ordinary least squares or ridge regression, LASSO does not have a simple closed-form solution and requires iterative optimization, which can be computationally more intensive. The effectiveness of L1 regularization depends on the choice of the regularization parameter ($\lambda$), which should be carefully tuned to achieve good results.

Formula

LASSO regularization is formulated as a minimization optimization problem. The goal is to find the set of coefficients that not only fit the data well (by minimizing the sum of squared errors), but also keep the model simple by penalizing large coefficients. In other words, LASSO seeks to minimize a loss function that combines the usual error term with an additional penalty based on the absolute values of the coefficients. The formula for this optimization problem is:

$$\text{SSE} + \lambda \sum_{j=1}^p |\beta_j|$$

where:

  • $\lambda$: regularization parameter (controls the strength of the penalty)
  • $p$: number of features in the model
  • $\beta_j$: coefficient for the $j$-th feature (where $j = 1, 2, \ldots, p$)
  • $|\beta_j|$: absolute value of the $j$-th coefficient (ensures a non-negative penalty)
  • $\text{SSE}$: sum of squared errors between predicted and actual values

The regularization parameter $\lambda$ controls the strength of the regularization. A larger $\lambda$ results in more shrinkage of the coefficients, and a smaller $\lambda$ results in less shrinkage.

Written as a minimization problem, it is:

$$\min_{\beta} \left\{ \text{SSE} + \lambda \sum_{j=1}^p |\beta_j| \right\}$$

This is read as: "minimize the sum of squared errors plus the regularization penalty, with respect to the coefficients". The fully written-out formula is:

$$\min_{\beta} \left\{ \sum_{i=1}^n \left( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^p |\beta_j| \right\}$$

where:

  • $n$: number of observations in the dataset
  • $y_i$: actual target value for observation $i$
  • $\beta_0$: intercept term (typically not penalized)
  • $x_{ij}$: value of feature $j$ for observation $i$
  • $\lambda$: regularization parameter
  • $p$: number of features

The first term, the sum of squared errors (SSE), should look familiar from the previous section on ordinary least squares (OLS). The second term is the regularization penalty. Unlike OLS, however, LASSO requires us to minimize the SSE plus this penalty. OLS allows us to directly compute the optimal coefficients using a closed-form solution and vector algebra, but LASSO does not have this property.

This means we need to use optimization techniques to find the best coefficients, rather than relying on a simple formula. In other words, we are using iterative optimization methods to find the coefficients that minimize the combined objective function.
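To make the objective concrete, here is a minimal sketch (illustrative function and variable names, random toy data) that evaluates $\text{SSE} + \lambda \sum_{j=1}^p |\beta_j|$ for a given set of coefficients; an iterative solver's job is to find the $\beta$ values that make this number as small as possible:

import numpy as np

def lasso_objective(X, y, beta0, beta, lam):
    # SSE plus the L1 penalty; the intercept beta0 is not penalized
    residuals = y - (beta0 + X @ beta)       # y_i - beta_0 - sum_j beta_j * x_ij
    sse = np.sum(residuals ** 2)             # sum of squared errors
    penalty = lam * np.sum(np.abs(beta))     # lambda * sum of |beta_j|
    return sse + penalty

# Tiny illustrative evaluation
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
y = rng.normal(size=5)
print(lasso_objective(X, y, beta0=0.5, beta=np.array([1.0, -2.0]), lam=0.1))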

Don't get scared by the math!

We're about to dive deep and cover several important concepts at once, including partial derivatives, the chain rule, and function minimization. While this may seem like a lot to take in, mastering these ideas here will make understanding future topics much easier. I've taken the time to explain everything in detail to help you build a strong, intuitive understanding. Take as much time as you need to understand this section. It'll be worth it.

Math and Optimality Conditions

Let's look at how LASSO finds the optimal coefficients, starting with the optimization process. LASSO solves the following optimization problem, here defined in summation form:

$$\min_{\beta_0, \beta_1, \ldots, \beta_p} \left\{ \sum_{i=1}^n \left( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^p |\beta_j| \right\}$$

To find the minimum, we set the partial derivatives (where they exist) to zero. The absolute value term $|\beta_j|$ is not differentiable at $\beta_j = 0$, so we use subgradients there; this non-differentiability is also the main reason LASSO has no closed-form solution. We'll use the chain rule for the differentiable part.

For $\beta_j > 0$: The partial derivative with respect to $\beta_j$ is:

$$\frac{\partial}{\partial \beta_j} \left[ \sum_{i=1}^n \left( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^p |\beta_j| \right] = 0$$

Step 1: Differentiate the SSE term

The first term, $\sum_{i=1}^n \left( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \right)^2$, is the SSE. When we take the partial derivative with respect to $\beta_j$, we use the chain rule:

$$\frac{\partial}{\partial \beta_j} \left[ \sum_{i=1}^n \left( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \right)^2 \right] = \sum_{i=1}^n 2 \left( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \right) \cdot \frac{\partial}{\partial \beta_j} \left( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \right)$$

Step 2: Differentiate the inner expression

The derivative of the inner expression with respect to $\beta_j$ is:

$$\frac{\partial}{\partial \beta_j} \left( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \right) = -x_{ij}$$

This is because:

  • $y_i$ and $\beta_0$ are constants with respect to $\beta_j$ (derivative = 0)
  • $\sum_{j=1}^p \beta_j x_{ij}$ becomes $x_{ij}$ when differentiated with respect to $\beta_j$ (all other terms become 0)

Step 3: Differentiate the penalty term

Let's solve for the partial derivative step by step when $\beta_j > 0$. Start with what we need to differentiate:

$$\frac{\partial}{\partial \beta_j} \left[ \lambda \sum_{j=1}^p |\beta_j| \right]$$

Apply the constant multiple rule

Since $\lambda$ is a constant, we can pull it out:

$$= \lambda \cdot \frac{\partial}{\partial \beta_j} \left[ \sum_{j=1}^p |\beta_j| \right]$$

Focus on the relevant term. When taking the derivative with respect to $\beta_j$, only the $j$-th term in the sum matters:

$$= \lambda \cdot \frac{\partial}{\partial \beta_j} [|\beta_j|]$$

Handle the absolute value. When $\beta_j > 0$, we have $|\beta_j| = \beta_j$ (the absolute value doesn't change positive numbers):

$$= \lambda \cdot \frac{\partial}{\partial \beta_j} [\beta_j]$$

Take the derivative

The derivative of $\beta_j$ with respect to itself is 1:

$$= \lambda \cdot 1 = \lambda$$

Final result for $\beta_j > 0$:

$$\frac{\partial}{\partial \beta_j} \left[ \lambda \sum_{j=1}^p |\beta_j| \right] = \lambda$$

For $\beta_j < 0$: The partial derivative becomes:

$$-2 \sum_{i=1}^n x_{ij} \left( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \right) - \lambda = 0$$

We have $|\beta_j| = -\beta_j$ when $\beta_j < 0$, so:

$$\frac{\partial}{\partial \beta_j} \left[ \lambda \sum_{j=1}^p |\beta_j| \right] = \lambda \cdot \frac{\partial}{\partial \beta_j} [-\beta_j] = -\lambda$$

This gives us the negative sign in front of $\lambda$ in the final equation.

Step 4: Combine the results

Putting it all together:

$$\sum_{i=1}^n 2 \left( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \right) \cdot (-x_{ij}) + \lambda = 0$$

Step 5: Simplify

$$-2 \sum_{i=1}^n x_{ij} \left( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \right) + \lambda = 0$$

Subgradient at $\beta_j = 0$

At $\beta_j = 0$, the derivative of $|\beta_j|$ is not defined, but its subgradient is the interval $[-1, 1]$. The optimality condition becomes:

$$-2 \sum_{i=1}^n x_{ij} \left( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \right) + \lambda s_j = 0, \quad s_j \in [-1, 1].$$

  • The first term, $-2 \sum_{i=1}^n x_{ij} \left( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \right)$, is the gradient of the sum of squared errors (SSE) with respect to $\beta_j$. This measures how much the loss would decrease (or increase) if we changed $\beta_j$ a little bit.
  • The second term, $\lambda s_j$, comes from the L1 penalty. Here, $s_j$ is not a regular derivative, but a subgradient of the absolute value function at zero.

What is a subgradient?
For most values of $\beta_j$, the derivative of $|\beta_j|$ is either $+1$ (if $\beta_j > 0$) or $-1$ (if $\beta_j < 0$). However, at $\beta_j = 0$, the absolute value function has a "kink" and is not differentiable there. In this case, instead of a single derivative, we use the concept of a subgradient, which is any value between $-1$ and $1$. That is, when $\beta_j = 0$, $s_j$ can be any number in the interval $[-1, 1]$.

Why does this matter for LASSO?
This subgradient condition is what allows LASSO to set coefficients exactly to zero. If the (unpenalized) gradient, the first term above, is not large enough in magnitude to "overcome" the regularization strength $\lambda$, then the optimal solution is to set $\beta_j = 0$. In this case, the equation can still be satisfied by choosing an appropriate $s_j$ in $[-1, 1]$.

In other words:

  • If the data "wants" to pull $\beta_j$ away from zero, but the gradient is not large enough to overcome the penalty, then the best solution is to keep $\beta_j = 0$.
  • If the gradient is strong enough (in magnitude) to overcome $\lambda$, then $\beta_j$ will be nonzero, and the solution will be found where the gradient and penalty exactly balance.

This is the mathematical reason why LASSO can produce sparse solutions: the subgradient at zero allows the optimality condition to be satisfied with $\beta_j = 0$ for some features, effectively removing them from the model.
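This balance is often written as a soft-thresholding operation. The sketch below (a simplified single-coordinate illustration, not scikit-learn's actual implementation) shows how an estimate whose magnitude falls below the threshold is snapped exactly to zero, while larger estimates are merely shrunk:

import numpy as np

def soft_threshold(z, threshold):
    # Shrink z toward zero and snap it to exactly 0 when |z| <= threshold
    return np.sign(z) * np.maximum(np.abs(z) - threshold, 0.0)

print(soft_threshold(0.3, 0.5))   # 0.0 -> the coefficient is dropped
print(soft_threshold(1.2, 0.5))   # 0.7 -> the coefficient is kept, but shrunk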

Geometric Interpretation: Why LASSO Produces Sparse Solutions

To understand why LASSO sets coefficients exactly to zero while Ridge regression does not, we can visualize the geometry of the optimization problem. The following plot shows how the constraint regions differ between L1 (LASSO) and L2 (Ridge) regularization.

Out[2]:
Visualization
Geometric interpretation of LASSO (L1) vs Ridge (L2) regularization. The plot shows why LASSO produces sparse solutions by illustrating the constraint regions in coefficient space. The elliptical contours represent levels of constant Sum of Squared Errors (SSE), with inner contours having lower error. The diamond-shaped region (blue) is the L1 constraint |β₁| + |β₂| ≤ t, while the circular region (red) is the L2 constraint β₁² + β₂² ≤ t. The optimal solution occurs where the SSE contours first touch the constraint region. Because the L1 constraint has sharp corners at the axes, the contours typically intersect at these corners, setting one or more coefficients exactly to zero (shown by the blue star). In contrast, the smooth L2 constraint rarely intersects at the axes, so Ridge regression (red star) shrinks coefficients but doesn't eliminate them. This geometric property is the fundamental reason LASSO performs automatic feature selection.

The key insight from this visualization is that the L1 constraint region (diamond) has sharp corners at the coordinate axes. When the SSE contours expand outward from the OLS solution, they are much more likely to first touch the constraint region at one of these corners, where at least one coefficient is exactly zero. This is why LASSO naturally performs feature selection.

In contrast, the L2 constraint region (circle) is smooth everywhere, so the SSE contours typically touch it at a point where both coefficients are non-zero. Ridge regression therefore shrinks coefficients toward zero but rarely sets them exactly to zero.

This geometric property holds in higher dimensions as well: the L1 constraint in $p$-dimensional space has $2p$ corners on the coordinate axes, giving LASSO many opportunities to set coefficients to zero. The L2 constraint, being a hypersphere, remains smooth in all dimensions.

Matrix Notation

We can rewrite the LASSO optimization problem in matrix notation, which provides a more compact and mathematically elegant representation. This notation is particularly useful for understanding the relationship between LASSO and other optimization problems, and it's the form commonly used in machine learning libraries like scikit-learn.

From Summation to Matrix Form

Let's start with our familiar summation form and transform it step by step:

$$\min_{\beta} \left\{ \sum_{i=1}^n \left( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^p |\beta_j| \right\}$$

Step 1: Define the matrices and vectors

First, let's define our data in matrix form:

  • $y$: An $n \times 1$ vector containing all target values:

$$y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}$$

  • $X$: An $n \times (p+1)$ matrix containing all features (including a column of ones for the intercept):

$$X = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix}$$

  • $\beta$: A $(p+1) \times 1$ vector containing all coefficients (including the intercept):

$$\beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{bmatrix}$$

Step 2: Express predictions in matrix form

The predicted values for all observations can be written as $\hat{y} = X\beta$.

This single matrix multiplication $X\beta$ computes all predictions simultaneously. Row $i$ of $X$ multiplied by $\beta$ gives:

$$1 \cdot \beta_0 + x_{i1} \cdot \beta_1 + x_{i2} \cdot \beta_2 + \cdots + x_{ip} \cdot \beta_p = \beta_0 + \sum_{j=1}^p \beta_j x_{ij}$$

This is exactly the prediction for observation $i$:

$$\hat{y}_i = \beta_0 + \sum_{j=1}^p \beta_j x_{ij}$$

Step 3: Express residuals in matrix form

The residuals (errors) for all observations are:

$$r = y - X\beta = \begin{bmatrix} y_1 - \hat{y}_1 \\ y_2 - \hat{y}_2 \\ \vdots \\ y_n - \hat{y}_n \end{bmatrix}$$

Step 4: Express the sum of squared errors in matrix form

The sum of squared errors becomes:

$$\text{SSE} = \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \sum_{i=1}^n r_i^2 = r^T r = (y - X\beta)^T (y - X\beta)$$

Let's verify this step by step:

$$r^T r = \begin{bmatrix} r_1 & r_2 & \cdots & r_n \end{bmatrix} \begin{bmatrix} r_1 \\ r_2 \\ \vdots \\ r_n \end{bmatrix} = r_1^2 + r_2^2 + \cdots + r_n^2 = \sum_{i=1}^n r_i^2$$

Since $r_i = y_i - \hat{y}_i$, we have:

$$\sum_{i=1}^n r_i^2 = \sum_{i=1}^n (y_i - \hat{y}_i)^2$$

Step 5: Express the L1 penalty in matrix form

The L1 penalty term can be written using the L1 norm: $\lambda \sum_{j=1}^p |\beta_j| = \lambda \|\beta\|_1$

where $\|\beta\|_1 = \sum_{j=1}^p |\beta_j|$ is the L1 norm of the coefficient vector. In other words, the L1 penalty is the sum of the absolute values of the coefficients, also known as the "Manhattan distance" from the coefficient vector to the origin:

$$\|\beta\|_1 = \sum_{j=1}^p |\beta_j|$$

Final matrix form:

What we have now is a compact way of writing the LASSO optimization problem using matrix notation.

$$\min_{\beta} \left\{ (y - X\beta)^T (y - X\beta) + \lambda \|\beta\|_1 \right\}$$
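As a quick sanity check, the matrix form can be evaluated directly with NumPy. This sketch (illustrative names, random data) confirms that $(y - X\beta)^T (y - X\beta)$ agrees with the summation form of the SSE and then adds the L1 penalty:

import numpy as np

rng = np.random.default_rng(42)
n, p = 6, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, p))])   # column of ones for the intercept
y = rng.normal(size=n)
beta = rng.normal(size=p + 1)
lam = 0.5

r = y - X @ beta                          # residual vector
sse_matrix = r.T @ r                      # (y - X beta)^T (y - X beta)
sse_sum = np.sum((y - X @ beta) ** 2)     # summation form of the SSE
penalty = lam * np.sum(np.abs(beta[1:]))  # L1 norm of the coefficients, intercept excluded

print(np.isclose(sse_matrix, sse_sum))    # True
print(sse_matrix + penalty)               # LASSO objective in matrix form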

The Scikit-learn Formulation

Scikit-learn uses a slightly different but equivalent formulation that includes a normalization factor:

$$\min_{\beta} \left\{ \frac{1}{2n} \|y - X\beta\|_2^2 + \alpha \|\beta\|_1 \right\}$$

where:

  • $\alpha$: regularization parameter in scikit-learn, related to the traditional $\lambda$ by $\alpha = \frac{\lambda}{2n}$
  • $n$: number of observations in the dataset
  • $\|y - X\beta\|_2^2$: squared L2 (Euclidean) norm of the residuals, equivalent to $\sum_{i=1}^n (y_i - \hat{y}_i)^2$ (the SSE)
  • $\|\beta\|_1$: L1 (Manhattan) norm of the coefficients, equal to $\sum_{j=1}^p |\beta_j|$ (sum of absolute values, intercept not penalized)

The expression $\|y - X\beta\|_2^2$ uses the L2 norm (also called the Euclidean norm):

$$\|y - X\beta\|_2 = \sqrt{\sum_{i=1}^n (y_i - \hat{y}_i)^2} = \sqrt{\sum_{i=1}^n (y_i - (X\beta)_i)^2}$$

where:

  • $\|y - X\beta\|_2$: L2 norm of the residual vector (Euclidean distance)
  • $y_i$: actual target value for observation $i$
  • $\hat{y}_i$: predicted value for observation $i$, equal to $(X\beta)_i$
  • $n$: number of observations

When we square the L2 norm, we get:

$$\|y - X\beta\|_2^2 = \sum_{i=1}^n (y_i - \hat{y}_i)^2 = (y - X\beta)^T (y - X\beta)$$

This is exactly our sum of squared errors (SSE)! The squared L2 norm gives us the same quantity as our matrix form $(y - X\beta)^T (y - X\beta)$. The L2 or Euclidean norm is simply the square root of the sum of the squares of the elements of a vector.

We use the L2 norm notation because it is mathematically equivalent to the matrix form $(y - X\beta)^T (y - X\beta)$, but more concise and widely recognized. The L2 norm is standard in optimization and statistics, making formulas easier to read and interpret. This notation also highlights the connection to Euclidean distance, which helps build intuition about how the loss function measures the overall error.

That covers the first term of the objective; now let's look at the second. The expression $\|\beta\|_1$ is the L1 norm of the coefficient vector:

$$\|\beta\|_1 = \sum_{j=1}^p |\beta_j| = |\beta_1| + |\beta_2| + \cdots + |\beta_p|$$

where:

  • $\|\beta\|_1$: L1 norm (Manhattan distance) of the coefficient vector
  • $|\beta_j|$: absolute value of the $j$-th coefficient
  • $p$: number of features (excluding the intercept)

This is exactly our L1 penalty term! The L1 norm is simply the sum of the absolute values of the elements of the vector.

Note: The intercept $\beta_0$ is typically not included in the L1 penalty, so the sum runs from $j=1$ to $j=p$, not from $j=0$ to $j=p$.

We know that the L1 norm offers several important advantages: it encourages sparsity by driving some coefficients exactly to zero, which enables automatic feature selection. Further, it has a clear geometric interpretation: it measures the "Manhattan distance" (the sum of absolute values) from the origin in coefficient space, and is a convex function, which ensures that the optimization problem remains tractable and that efficient algorithms can find a global minimum.

The Normalization Factor: $\frac{1}{2n}$

Let's take a closer look at the normalization factor $\frac{1}{2n}$ that appears in front of the squared L2 norm in the scikit-learn formulation. In scikit-learn, the regularization strength is controlled by the parameter alpha ($\alpha$), which is related to the traditional $\lambda$ parameter by $\alpha = \frac{\lambda}{2n}$.

This means that the scikit-learn objective function is normalized by both the number of samples and a factor of 2, making the loss function consistent with the mean squared error and simplifying gradient calculations.

This formulation is sometimes referred to as the "normalized" or "scaled" LASSO objective, and is the standard approach used in most modern machine learning libraries.

The factor $\frac{1}{2n}$ serves several important purposes:

  1. Sample size normalization: Dividing by $n$ makes the loss function independent of sample size. Without this, larger datasets would have larger loss values simply because they have more observations.

  2. Gradient scaling: The factor $\frac{1}{2}$ simplifies the gradient calculation. Let's take the derivative of $\frac{1}{2n}\|y - X\beta\|_2^2$ with respect to $\beta$. To see how we arrive at this gradient, let's rewrite the squared L2 norm and take the derivative step by step:

    $$\frac{\partial}{\partial \beta} \left[ \frac{1}{2n}\|y - X\beta\|_2^2 \right] = \frac{\partial}{\partial \beta} \left[ \frac{1}{2n} (y - X\beta)^T (y - X\beta) \right]$$

    The term $(y - X\beta)^T (y - X\beta)$ is a quadratic form that represents the sum of squared errors in matrix notation. To make it easier to take derivatives and to see its structure, we expand it using the distributive property of matrix multiplication:

    $$(y - X\beta)^T (y - X\beta) = y^T y - y^T X\beta - (X\beta)^T y + (X\beta)^T (X\beta)$$

    Notice that $(X\beta)^T y$ is a scalar and is equal to its transpose $y^T X\beta$, so the two middle terms combine to $-2\beta^T X^T y$. The last term, $(X\beta)^T (X\beta)$, can be rewritten as $\beta^T X^T X \beta$. Putting it all together, we get:

    $$(y - X\beta)^T (y - X\beta) = y^T y - 2\beta^T X^T y + \beta^T X^T X \beta$$

    Taking the derivative with respect to $\beta$:

    • The derivative of $y^T y$ with respect to $\beta$ is 0 (since it does not depend on $\beta$)
    • The derivative of $-2\beta^T X^T y$ with respect to $\beta$ is $-2 X^T y$
    • The derivative of $\beta^T X^T X \beta$ with respect to $\beta$ is $2 X^T X \beta$

    So, the derivative is:

    $$\frac{\partial}{\partial \beta} (y - X\beta)^T (y - X\beta) = -2 X^T y + 2 X^T X \beta = 2 X^T (X\beta - y) = -2 X^T (y - X\beta)$$

    where:

    • $X^T$: transpose of the feature matrix
    • $y$: target vector
    • $\beta$: coefficient vector

    Including the $\frac{1}{2n}$ factor:

    $$\frac{\partial}{\partial \beta} \left[ \frac{1}{2n}(y - X\beta)^T (y - X\beta) \right] = \frac{1}{2n} \cdot 2 X^T (X\beta - y) = \frac{1}{n} X^T (X\beta - y)$$

    Or, equivalently (by factoring out the negative sign):

    $$= -\frac{1}{n} X^T (y - X\beta)$$

    where:

    • $\frac{1}{n}$: normalization factor (average over all observations)
    • $X^T (X\beta - y)$: gradient of the squared error term with respect to $\beta$

    The gradient $\frac{1}{n} X^T (X\beta - y)$ points in the direction of steepest ascent of the loss function. In gradient descent for minimization, we move in the opposite direction by subtracting this gradient from the current $\beta$. Without the $\frac{1}{2}$ factor, we would have an extra factor of 2 in the gradient. (The short numerical check after this list confirms this gradient.)

  3. Consistency with statistical literature: This normalization makes the objective function consistent with the mean squared error (MSE), since $\frac{1}{n}\|y - X\beta\|_2^2$ is exactly the MSE.
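To verify this result numerically, here is a small sketch (illustrative data and names) that compares the analytic gradient $\frac{1}{n} X^T (X\beta - y)$ of the smooth part with a finite-difference approximation:

import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 4
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
beta = rng.normal(size=p)

def smooth_loss(b):
    # The smooth part of the scikit-learn objective: (1 / 2n) * ||y - Xb||^2
    return np.sum((y - X @ b) ** 2) / (2 * n)

analytic = X.T @ (X @ beta - y) / n   # (1/n) X^T (X beta - y)

# Central finite differences along each coordinate direction
eps = 1e-6
numeric = np.array([
    (smooth_loss(beta + eps * e) - smooth_loss(beta - eps * e)) / (2 * eps)
    for e in np.eye(p)
])

print(np.allclose(analytic, numeric, atol=1e-5))  # True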

The Regularization Parameter: $\alpha$

In scikit-learn, the regularization parameter is called alpha ($\alpha$) instead of lambda ($\lambda$). The relationship between the two formulations is:

$$\alpha = \frac{\lambda}{2n}$$

where:

  • $\alpha$: scikit-learn regularization parameter
  • $\lambda$: traditional LASSO regularization parameter
  • $n$: number of observations in the dataset

This means:

  • Small $\alpha$: Less regularization, coefficients closer to the OLS solution
  • Large $\alpha$: More regularization, more coefficients driven to zero (see the short sketch below)
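The effect is easy to see directly. This short sketch (synthetic data and illustrative values) fits scikit-learn's Lasso with a small and a large alpha and counts how many coefficients survive:

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
true_coef = np.array([3.0, -2.0, 1.5] + [0.0] * 7)     # only the first 3 features matter
y = X @ true_coef + 0.5 * rng.normal(size=200)

for alpha in [0.01, 1.0]:
    model = Lasso(alpha=alpha).fit(X, y)
    n_nonzero = np.sum(model.coef_ != 0)
    print(f"alpha={alpha}: {n_nonzero} non-zero coefficients")

# With a small alpha most coefficients stay non-zero; with a large alpha
# many of them are driven to exactly zero.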

Complete Mathematical Breakdown

Let's put it all together with a concrete example. Suppose we have:

  • $n = 100$ observations
  • $p = 5$ features (plus intercept)
  • $\alpha = 0.1$

The scikit-learn objective function becomes:

$$\min_{\beta} \left\{ \frac{1}{200} \|y - X\beta\|_2^2 + 0.1 \|\beta\|_1 \right\}$$

where:

  • $\frac{1}{200}$: normalization factor ($\frac{1}{2n}$ where $n = 100$)
  • $0.1$: regularization parameter $\alpha$
  • $\|y - X\beta\|_2^2$: squared L2 norm of the residuals
  • $\|\beta\|_1$: L1 norm of the coefficients

Expanding this:

$$\min_{\beta} \left\{ \frac{1}{200} \sum_{i=1}^{100} (y_i - \hat{y}_i)^2 + 0.1 \sum_{j=1}^5 |\beta_j| \right\}$$

where:

  • $i = 1, 2, \ldots, 100$: index over observations
  • $j = 1, 2, \ldots, 5$: index over features
  • $y_i$: actual target value for observation $i$
  • $\hat{y}_i$: predicted value for observation $i$
  • $\beta_j$: coefficient for feature $j$

This is equivalent to:

$$\min_{\beta} \left\{ \frac{1}{200} \sum_{i=1}^{100} \left( y_i - \beta_0 - \sum_{j=1}^5 \beta_j x_{ij} \right)^2 + 0.1 (|\beta_1| + |\beta_2| + |\beta_3| + |\beta_4| + |\beta_5|) \right\}$$

where:

  • $\beta_0$: intercept term (not penalized)
  • $x_{ij}$: value of feature $j$ for observation $i$
  • $|\beta_j|$: absolute value of coefficient $j$

Why This Formulation Matters

Matrix notation has several important practical benefits. First, matrix operations are highly optimized in modern numerical libraries, making computations much more efficient. Second, expressing the objective in terms of norms directly connects LASSO to the broader field of convex optimization, providing a solid theoretical foundation. Third, this standardized form is what most machine learning libraries, such as scikit-learn, use in their implementations, ensuring consistency across tools and platforms. Finally, this framework is flexible: it extends naturally to other regularization techniques like Ridge regression and Elastic Net, which will be discussed in later sections.

By understanding the matrix formulation, you will be better equipped to interpret results from scikit-learn, select appropriate values for the alpha parameter, recognize the connections between different regularization methods, and even implement custom optimization algorithms if your application requires it.

Visualizing the Regularization Process

After all of this math, let's take a look at a visualization that demonstrates how L1 regularization works by showing how coefficients change as the regularization parameter increases. This is called a "coefficient path" plot.

Recall that scikit-learn's Lasso solves the objective function $\frac{1}{2n}\,\lVert y - X\beta \rVert_2^2 + \alpha\,\lVert \beta \rVert_1$, where $\alpha = \frac{\lambda}{2n}$. This objective function defines what the algorithm is optimizing, and the choice of $\alpha$ directly controls the strength of regularization. In the plot below, we vary $\alpha$ to show its effect on the coefficients. The underlying implementation typically uses an algorithm such as the Iterative Shrinkage-Thresholding Algorithm (ISTA) or coordinate descent to efficiently solve this objective, but the objective function itself is what determines the solution.
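For intuition about how such a solver works, here is a minimal ISTA-style sketch for the objective $\frac{1}{2n}\|y - X\beta\|_2^2 + \alpha\|\beta\|_1$ (no intercept, fixed step size, purely illustrative; scikit-learn's Lasso uses a more sophisticated coordinate descent implementation). Each iteration takes a gradient step on the smooth part and then applies the soft-thresholding operator from earlier:

import numpy as np

def lasso_ista(X, y, alpha, n_iter=500):
    # Minimal ISTA sketch for (1/2n)*||y - X b||^2 + alpha*||b||_1 (no intercept)
    n, p = X.shape
    beta = np.zeros(p)
    step = n / (np.linalg.norm(X, 2) ** 2)      # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / n         # gradient of the smooth part
        z = beta - step * grad                  # plain gradient step
        beta = np.sign(z) * np.maximum(np.abs(z) - step * alpha, 0.0)  # soft-thresholding
    return beta

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = X @ np.array([2.0, -1.0, 0.5, 0.0, 0.0]) + 0.1 * rng.normal(size=100)
print(np.round(lasso_ista(X, y, alpha=0.1), 3))  # the last two coefficients typically land at exactly 0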

Out[3]:
Visualization
LASSO coefficient paths demonstrating how L1 regularization performs automatic feature selection. As the regularization strength (α) increases from left to right, coefficients shrink toward zero. The plot shows that Features 4 and 5 (with true coefficients of 0) are the first to reach exactly zero, demonstrating LASSO's ability to identify and eliminate irrelevant features. Important features (1, 2, 3) maintain non-zero coefficients even under strong regularization, while less important ones are driven to zero. This visualization illustrates the core mechanism of LASSO: automatic feature selection through coefficient shrinkage.
  1. Left (small $\alpha$): All coefficients are close to their true values. There's minimal regularization.

  2. Middle: As $\alpha$ increases, coefficients start shrinking toward zero. Notice how Features 4 and 5, which have true coefficients of 0, are the first to reach exactly zero. This demonstrates automatic feature selection.

  3. Right (large $\alpha$): All coefficients are driven toward zero as the penalty dominates.

Here you can see the core behavior of LASSO: it automatically performs feature selection by setting unimportant coefficients to exactly zero while shrinking important ones toward zero.

Intuition

Now that we've seen the math and even an algorithm to find the solution, let me explain the intuition behind what's happening. Think of each coefficient $\beta_j$ as being in a tug-of-war between two forces.

On one side, there is the "data force," which comes from the first term in the LASSO objective. This force encourages $\beta_j$ to take on a value that best fits the data, much like gravity pulling the coefficient toward its optimal value based on the relationship between feature $j$ and the target variable. The stronger this relationship, the stronger the pull of the data force.

On the other side, there is the "penalty force," which is introduced by the regularization term involving $\lambda$. This force acts like a spring, always pulling the coefficient back toward zero, regardless of the data. Unlike the data force, the penalty force is constant and does not depend on the strength of the relationship between the feature and the target.

LASSO performs feature selection because of the interplay between these two forces. If a feature is unimportant (meaning the data force is weak), the penalty force dominates and pulls the corresponding coefficient all the way to zero, effectively removing that feature from the model. Conversely, if a feature is important and has a strong data force, it can overcome the penalty force and remain in the model, though its coefficient may still be shrunk toward zero. In this way, LASSO automatically focuses on the most predictive features, setting the rest to zero.

When the coefficient $\beta_j$ is exactly zero, it means that LASSO has determined that the corresponding feature does not contribute enough to the prediction to justify its inclusion in the model. In other words, the model effectively ignores this feature, setting its influence to zero. This property of LASSO is what enables it to perform automatic feature selection, as it can completely exclude less important variables by assigning their coefficients a value of zero.

Example

Let's work through a simple example to understand how LASSO regularization works. Suppose we have a dataset with 3 features and we want to predict house prices. Our model has the following coefficients:

  • $\beta_1 = 0.8$ (feature 1: square footage)
  • $\beta_2 = -0.3$ (feature 2: distance from city center)
  • $\beta_3 = 0.1$ (feature 3: number of bathrooms)

Step 1: Calculate the L1 norm

The L1 norm is the sum of absolute values of the coefficients:

$$\sum_{j=1}^3 |\beta_j| = |0.8| + |-0.3| + |0.1| = 0.8 + 0.3 + 0.1 = 1.2$$

where:

  • $j = 1, 2, 3$: index over the three features
  • $|\beta_j|$: absolute value of coefficient $j$
  • $1.2$: total L1 penalty (sum of absolute coefficient values)
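The same arithmetic in code, using the coefficient values from this example and the regularization strengths considered in the next step:

import numpy as np

beta = np.array([0.8, -0.3, 0.1])   # square footage, distance, bathrooms
l1_norm = np.sum(np.abs(beta))      # |0.8| + |-0.3| + |0.1|
print(l1_norm)                      # 1.2

# The penalty added to the SSE is simply lambda * l1_norm
for lam in [0.0, 0.5, 2.0]:
    print(lam, lam * l1_norm)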

Step 2: Apply different regularization strengths

Examples only consider the regularization term

Note that the examples below consider only the regularization term. We leave out the sum of squared errors term to keep the examples simple enough to follow.

Let's see how different values of $\lambda$ affect the model, starting with the optimization process itself. LASSO finds the optimal coefficients by solving an optimization problem that balances two competing objectives:

  1. Fit the data well: Minimize the Sum of Squared Errors (SSE)
  2. Keep the model simple: Minimize the L1 penalty term $\lambda \sum_{j=1}^p |\beta_j|$

The larger $\lambda$ is, the more the model prioritizes simplicity over fitting the training data perfectly. This trade-off is what prevents overfitting.

Think of it this way: LASSO is like having a "budget" for how large your coefficients can be. The penalty term $\lambda \sum_{j=1}^p |\beta_j|$ acts like a "cost" that increases as coefficients get larger.

  • Small $\lambda$: Low cost for large coefficients → the model can afford to keep coefficients large to fit the data well
  • Large $\lambda$: High cost for large coefficients → the model needs to make coefficients smaller to stay within "budget"

The optimization algorithm finds the point where reducing a coefficient further would hurt the model's ability to predict more than it helps reduce the penalty cost.

Case 1: $\lambda = 0$ (No regularization)

  • Penalty term: $0 \times 1.2 = 0$. When $\lambda = 0$, the objective function becomes $\min_{\beta} \left\{ \text{SSE} + 0 \cdot \sum_{j=1}^3 |\beta_j| \right\} = \min_{\beta} \left\{ \text{SSE} \right\}$. The penalty term disappears completely, so LASSO becomes identical to ordinary least squares regression. Since there's no penalty for large coefficients, the model keeps the original coefficients that minimize the SSE.
  • These are the final coefficients: $\beta_1 = 0.8$, $\beta_2 = -0.3$, $\beta_3 = 0.1$ (unchanged)

Simplified Examples

The coefficient values shown in these examples are illustrative and simplified for educational purposes. In practice, the exact values depend on the specific data and the optimization algorithm used. The key insight is understanding how different $\lambda$ values affect the balance between fitting the data and keeping coefficients small.


Case 2: $\lambda = 0.5$ (Moderate regularization)

  • Penalty term: $0.5 \times 1.2 = 0.6$, so LASSO solves $\min_{\beta} \left\{ \text{SSE} + 0.5 \sum_{j=1}^3 |\beta_j| \right\}$. The penalty affects the coefficients as follows (conceptually):
    • $\beta_1$ (square footage): original magnitude = 0.8, penalty cost = $0.5 \times 0.8 = 0.4$, so the model reduces it to 0.6 (still useful, but cheaper)
    • $\beta_2$ (distance): original magnitude = 0.3, penalty cost = $0.5 \times 0.3 = 0.15$, so the model reduces it from -0.3 to -0.2 (smaller penalty, still captures the negative relationship)
    • $\beta_3$ (bathrooms): original magnitude = 0.1, penalty cost = $0.5 \times 0.1 = 0.05$, so the model reduces it from 0.1 to 0.05 (smallest coefficient, reduced proportionally)
    This happens because the penalty creates a "cost-benefit analysis" for each coefficient. Important features get reduced less than unimportant ones. (Exact values depend on the data; numbers here are illustrative.)

Case 3: $\lambda = 2.0$ (Strong regularization)

  • Penalty term: $2.0 \times 1.2 = 2.4$, so LASSO solves $\min_{\beta} \left\{ \text{SSE} + 2.0 \sum_{j=1}^3 |\beta_j| \right\}$. The strong penalty affects each coefficient as follows (conceptually):
    • $\beta_1$ (square footage): original magnitude = 0.8, penalty cost = $2.0 \times 0.8 = 1.6$. The model asks: "Is keeping $\beta_1 = 0.8$ worth the penalty cost of 1.6?" Since square footage is crucial for house prices, the model keeps it but reduces it significantly: $0.8 \rightarrow 0.4$. The new penalty cost = $2.0 \times 0.4 = 0.8$ (much cheaper!)
    • $\beta_2$ (distance): original magnitude = 0.3, penalty cost = $2.0 \times 0.3 = 0.6$. The model asks: "Is keeping $\beta_2 = -0.3$ worth the penalty cost of 0.6?" Distance is less important than square footage, so the model sets it to exactly zero. The new penalty cost = $2.0 \times 0 = 0$ (free!)
    • $\beta_3$ (bathrooms): original magnitude = 0.1, penalty cost = $2.0 \times 0.1 = 0.2$. The model asks: "Is keeping $\beta_3 = 0.1$ worth the penalty cost of 0.2?" Bathrooms have minimal impact, so the model sets it to exactly zero. The new penalty cost = $2.0 \times 0 = 0$ (free!)
    (Exact values depend on the data; numbers here are illustrative.)

Step 3: The Mathematical Optimization Process

Let's trace through what actually happens mathematically. For each case, LASSO solves:

$$\min_{\beta_1, \beta_2, \beta_3} \left\{ \text{SSE}(\beta_1, \beta_2, \beta_3) + \lambda (|\beta_1| + |\beta_2| + |\beta_3|) \right\}$$

where:

  • $\beta_1, \beta_2, \beta_3$: coefficients for the three features
  • $\text{SSE}(\beta_1, \beta_2, \beta_3)$: sum of squared errors as a function of the coefficients
  • $\lambda$: regularization parameter
  • $|\beta_j|$: absolute value of coefficient $j$

For $\lambda = 0.5$:

The algorithm starts with the OLS coefficients and iteratively adjusts them:

  • It finds that reducing $\beta_1$ from 0.8 to 0.6 reduces the penalty by $0.5 \times 0.2 = 0.1$ while only slightly increasing the SSE
  • It finds that reducing $\beta_2$ from -0.3 to -0.2 reduces the penalty by $0.5 \times 0.1 = 0.05$ while only slightly increasing the SSE
  • It finds that reducing $\beta_3$ from 0.1 to 0.05 reduces the penalty by $0.5 \times 0.05 = 0.025$ while only slightly increasing the SSE

For $\lambda = 2.0$:

The algorithm finds that the penalty cost of keeping $\beta_2$ and $\beta_3$ is too high:

  • Setting $\beta_2 = 0$ eliminates a penalty cost of $2.0 \times 0.3 = 0.6$
  • Setting $\beta_3 = 0$ eliminates a penalty cost of $2.0 \times 0.1 = 0.2$
  • The total penalty reduction (0.8) outweighs the increase in SSE from removing these features

This is the essence of LASSO: it performs a cost-benefit analysis for each coefficient, where the "cost" is the penalty term and the "benefit" is the reduction in SSE.

| # | $\lambda$ | Penalty | What Happens | Result |
|---|-----------|---------|--------------|--------|
| 1 | 0.0 | None | No penalty, pure OLS | All coefficients unchanged |
| 2 | 0.5 | Moderate | Balanced penalty | Coefficients reduced but all kept |
| 3 | 2.0 | Strong | Heavy penalty | Some coefficients become exactly zero |

Choosing $\lambda$

The regularization parameter $\lambda$ is a hyperparameter, which is a value that is not learned from the data during model training but is set before training begins. The choice of $\lambda$ is important:

  • If $\lambda$ is too small, the penalty is weak and the model may overfit (behave like ordinary least squares).
  • If $\lambda$ is too large, the penalty is too strong and the model may underfit (set too many coefficients to zero, losing important information).

To find the best $\lambda$, we typically use a process called hyperparameter search. This is a systematic process for finding the best value of a hyperparameter by trying many values and selecting the one that gives the best performance. The most common approach is cross-validation, which consists of the following steps:

  1. Define a range of candidate $\lambda$ values (for example, from very small to quite large), e.g. [0.01, 0.05, 0.1, 0.5, 1.0, 2.0]
  2. For each $\lambda$ value:
    • Train the LASSO model on a subset of the data (the training set).
    • Evaluate its performance on a separate subset (the validation set), using a metric like mean squared error (MSE).
  3. Select the $\lambda$ that gives the best validation performance (i.e., the lowest error).

This process is often automated using tools like GridSearchCV in scikit-learn, which systematically tries many values and finds the optimal one.
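As a sketch of what this looks like in practice (synthetic data and an illustrative parameter grid), a grid search over alpha with scikit-learn might look like the following; LassoCV, used later in this guide, automates the same idea:

import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 8))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 0.3 * rng.normal(size=150)

# Candidate regularization strengths (scikit-learn's alpha, not the raw lambda)
param_grid = {"alpha": [0.01, 0.05, 0.1, 0.5, 1.0, 2.0]}

search = GridSearchCV(
    Lasso(max_iter=10000),
    param_grid,
    cv=5,
    scoring="neg_mean_squared_error",   # lower MSE is better, so scikit-learn negates it
)
search.fit(X, y)

print("Best alpha:", search.best_params_["alpha"])
print("Best CV MSE:", -search.best_score_)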

Visualizing the Cross-Validation Process

To better understand how cross-validation helps us choose the optimal $\lambda$, let's visualize how model performance changes across different regularization strengths. The following plot demonstrates the bias-variance tradeoff that occurs as we vary $\lambda$.

Out[4]:
Visualization
Cross-validation error curves showing the bias-variance tradeoff in LASSO regularization. The blue line shows training error increasing with regularization strength, while the red line shows validation error forming a U-shape. The optimal λ (green vertical line) minimizes validation error, balancing model complexity and generalization. The shaded red region represents the standard error across cross-validation folds, and three labeled zones identify the overfitting (λ too small), optimal (best generalization), and underfitting (λ too large) regions. A second panel shows the number of selected features decreasing with regularization strength, with the optimal point matching the true number of important features.

This visualization reveals several important insights about choosing $\lambda$:

  1. Training Error (Blue Line): Increases monotonically as $\lambda$ increases. This is expected because stronger regularization constrains the model more, preventing it from fitting the training data as closely.

  2. Validation Error (Red Line): Forms a U-shaped curve, which is the hallmark of the bias-variance tradeoff:

    • Left side (small $\lambda$): The model overfits the training data, leading to poor generalization and high validation error.
    • Bottom (optimal $\lambda$): The sweet spot where the model balances fitting the training data and generalizing to new data.
    • Right side (large $\lambda$): The model underfits, with too much regularization preventing it from capturing important patterns.

  3. Feature Selection (Bottom Panel): Shows how the number of selected features decreases as regularization increases. At the optimal $\lambda$, LASSO correctly identifies the 3 truly important features (marked by the horizontal dashed line).

  4. Standard Error Bands: The shaded region around the validation error shows the variability across different cross-validation folds, helping us assess the stability of our performance estimates.

The optimal $\lambda$ (marked by the green vertical line) minimizes the validation error, providing the best balance between model complexity and predictive performance. This is why cross-validation is the standard approach for hyperparameter selection in LASSO and other regularized models.

Implementation

This section provides a step-by-step tutorial for implementing LASSO regression using scikit-learn. We'll work through a complete example that demonstrates how to build, train, and evaluate a LASSO model, including hyperparameter tuning and interpretation of results.

Step 1: Data Preparation

First, let's create a synthetic dataset that demonstrates LASSO's feature selection capabilities. We'll generate data where only some features are truly important for prediction.

In[5]:
Code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso, LassoCV, LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

# Generate synthetic dataset
n_samples, n_features = 100, 10
X = np.random.randn(n_samples, n_features)

# Create true coefficients (only first 3 features are important)
true_coef = np.array([2.0, -1.5, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ true_coef + 0.1 * np.random.randn(n_samples)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Out[6]:
Console
Training set size: (80, 10)
Test set size: (20, 10)
True coefficients: [ 2.  -1.5  1.   0.   0.   0.   0.   0.   0.   0. ]

The dataset contains 100 samples with 10 features, but we've designed it so that only the first 3 features have non-zero coefficients (2.0, -1.5, and 1.0). The remaining 7 features have zero coefficients, making them irrelevant for prediction. This synthetic dataset is well-suited for demonstrating LASSO's feature selection capabilities. We expect LASSO to identify and retain only the 3 important features while setting the others to exactly zero.

Step 2: Basic LASSO Implementation

Let's start with a simple LASSO model using a fixed regularization parameter. We'll use alpha=0.1 as our initial regularization strength.

In[7]:
Code
# Create and fit LASSO model
lasso = Lasso(alpha=0.1, max_iter=10000, random_state=42)
lasso.fit(X_train, y_train)

# Make predictions
y_pred_train = lasso.predict(X_train)
y_pred_test = lasso.predict(X_test)

# Calculate performance metrics
train_mse = mean_squared_error(y_train, y_pred_train)
test_mse = mean_squared_error(y_test, y_pred_test)
train_r2 = r2_score(y_train, y_pred_train)
test_r2 = r2_score(y_test, y_pred_test)
Out[8]:
Console
Training MSE: 0.0412
Test MSE: 0.0527
Training R²: 0.9937
Test R²: 0.9936

The model demonstrates strong performance, with R² scores above 0.99 on both the training and test sets, indicating that it explains over 99% of the variance in the target variable. The Mean Squared Error (MSE) values are low (0.0412 for training, 0.0527 for test), suggesting that predictions are close to actual values. Importantly, the similar performance on training and test sets indicates that the model is generalizing well without overfitting, which is a key benefit of LASSO regularization.

Step 3: Examine Feature Selection

Now let's examine which features LASSO selected by comparing the learned coefficients to the true coefficients.

In[9]:
Code
# Get coefficients and create comparison
coefficients = lasso.coef_
feature_names = [f'Feature {i+1}' for i in range(n_features)]

# Create comparison DataFrame
comparison_df = pd.DataFrame({
    'Feature': feature_names,
    'True_Coefficient': true_coef,
    'LASSO_Coefficient': coefficients,
    'Selected': coefficients != 0
})
Out[10]:
Console
Feature Selection Results:
      Feature  True_Coefficient  LASSO_Coefficient  Selected
0   Feature 1               2.0              1.876      True
1   Feature 2              -1.5             -1.401      True
2   Feature 3               1.0              0.893      True
3   Feature 4               0.0              0.000     False
4   Feature 5               0.0             -0.000     False
5   Feature 6               0.0              0.000     False
6   Feature 7               0.0              0.000     False
7   Feature 8               0.0              0.000     False
8   Feature 9               0.0              0.000     False
9  Feature 10               0.0              0.000     False

Number of features selected: 3
Number of true features: 3

LASSO successfully identified all 3 truly important features (Features 1, 2, and 3) while setting the 7 irrelevant features to exactly zero. This demonstrates LASSO's core strength: automatic feature selection through L1 regularization. Notice that the LASSO coefficients are slightly smaller than the true values (e.g., 1.876 vs. 2.0 for Feature 1). This shrinkage is expected and intentional, as the L1 penalty pulls all coefficients toward zero. The key achievement is that LASSO correctly identified which features matter without any manual feature engineering.

Step 4: Hyperparameter Tuning with Cross-Validation

Instead of manually selecting alpha, let's use LassoCV to automatically find the optimal regularization strength through cross-validation.

In[11]:
Code
# Use LassoCV for automatic alpha selection
lasso_cv = LassoCV(cv=5, random_state=42, max_iter=10000)
lasso_cv.fit(X_train, y_train)

# Get the optimal alpha
optimal_alpha = lasso_cv.alpha_

# Make predictions with optimal model
y_pred_cv_train = lasso_cv.predict(X_train)
y_pred_cv_test = lasso_cv.predict(X_test)

# Calculate performance metrics
cv_train_mse = mean_squared_error(y_train, y_pred_cv_train)
cv_test_mse = mean_squared_error(y_test, y_pred_cv_test)
cv_train_r2 = r2_score(y_train, y_pred_cv_train)
cv_test_r2 = r2_score(y_test, y_pred_cv_test)
Out[12]:
Console
Optimal alpha: 0.0071
CV Training MSE: 0.0081
CV Test MSE: 0.0106
CV Training R²: 0.9988
CV Test R²: 0.9987

Cross-validation identified an optimal alpha of 0.0071, which is significantly smaller than our manual choice of 0.1. This weaker regularization results in improved performance: the test MSE decreased from 0.0527 to 0.0106, and the test R² increased from 0.9936 to 0.9987. The cross-validated model strikes a better balance between the bias introduced by regularization and the variance of the model. This demonstrates why cross-validation is the recommended approach for hyperparameter selection: it systematically searches for the regularization strength that maximizes generalization performance.

Step 5: Compare Feature Selection Results

Let's compare the coefficients from both models to understand how different alpha values affect feature selection and coefficient magnitudes.

In[13]:
Code
# Compare coefficients
comparison_df['LASSO_CV_Coefficient'] = lasso_cv.coef_
comparison_df['CV_Selected'] = lasso_cv.coef_ != 0
Out[14]:
Console
Comparison of Feature Selection:
      Feature  True_Coefficient  LASSO_Coefficient  Selected  \
0   Feature 1               2.0              1.876      True   
1   Feature 2              -1.5             -1.401      True   
2   Feature 3               1.0              0.893      True   
3   Feature 4               0.0              0.000     False   
4   Feature 5               0.0             -0.000     False   
5   Feature 6               0.0              0.000     False   
6   Feature 7               0.0              0.000     False   
7   Feature 8               0.0              0.000     False   
8   Feature 9               0.0              0.000     False   
9  Feature 10               0.0              0.000     False   

   LASSO_CV_Coefficient  CV_Selected  
0                 1.989         True  
1                -1.496         True  
2                 0.993         True  
3                 0.000        False  
4                -0.005         True  
5                 0.000        False  
6                -0.014         True  
7                 0.000        False  
8                 0.000        False  
9                 0.000        False  

Features selected by LASSO (α=0.1): 3
Features selected by LASSO CV (α=0.0071): 5

Both models selected the 3 truly important features, confirming LASSO's robust feature selection across different regularization strengths. The cross-validated model's coefficients (1.989, -1.496, 0.993) are closer to the true values (2.0, -1.5, 1.0) than the manual LASSO's coefficients (1.876, -1.401, 0.893), because the smaller alpha (0.0071 vs. 0.1) applies less shrinkage. The trade-off is that this weaker penalty also lets two spurious features (5 and 7) slip in with near-zero coefficients, which is why the CV model reports 5 selected features. The fact that both models agree on the important features, despite different regularization strengths, indicates that the signal from those features is strong enough to be detected across a range of alpha values.

Step 6: Visualize Coefficient Paths

Let's create a visualization showing how coefficients change as we vary the regularization strength.
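The code cell that produced the figure below is not shown; the following sketch outlines one way to generate a comparable plot by refitting Lasso over a grid of alphas (it reuses X_train, y_train, n_features, and optimal_alpha from the previous steps):

# Sketch: trace each coefficient over a grid of alphas
alphas = np.logspace(-3, 1, 50)
coef_paths = np.array([
    Lasso(alpha=a, max_iter=10000).fit(X_train, y_train).coef_
    for a in alphas
])

plt.figure(figsize=(8, 5))
for j in range(n_features):
    plt.plot(alphas, coef_paths[:, j], label=f'Feature {j + 1}')
plt.axvline(optimal_alpha, color='red', linestyle='--', label='Optimal alpha (CV)')
plt.xscale('log')
plt.xlabel('alpha (log scale)')
plt.ylabel('Coefficient value')
plt.title('LASSO coefficient paths')
plt.legend(ncol=2, fontsize=8)
plt.show()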

Out[15]:
Visualization
Detailed LASSO coefficient paths showing how each feature's coefficient changes as regularization strength (α) increases. The plot demonstrates LASSO's automatic feature selection: features with true coefficients of zero (Features 4-10) are quickly driven to exactly zero, while important features (Features 1-3) maintain non-zero coefficients even under moderate regularization. The red vertical line marks the optimal α value selected through cross-validation, showing where the model achieves the best balance between feature selection and prediction accuracy.

The coefficient path plot shows how each feature's coefficient changes as regularization increases. Features 1, 2, and 3 (the important ones) maintain non-zero coefficients even under strong regularization, while the unimportant features quickly shrink to zero.

Step 7: Model Performance Comparison

Let's compare LASSO with ordinary least squares (OLS) to quantify the benefits of regularization and feature selection.

In[17]:
Code
# Fit OLS model
ols = LinearRegression()
ols.fit(X_train, y_train)

# Make predictions
y_pred_ols_train = ols.predict(X_train)
y_pred_ols_test = ols.predict(X_test)

# Calculate performance metrics
ols_train_mse = mean_squared_error(y_train, y_pred_ols_train)
ols_test_mse = mean_squared_error(y_test, y_pred_ols_test)
ols_train_r2 = r2_score(y_train, y_pred_ols_train)
ols_test_r2 = r2_score(y_test, y_pred_ols_test)

# Create comparison table
results_df = pd.DataFrame({
    'Model': ['OLS', 'LASSO (α=0.1)', 'LASSO CV'],
    'Training MSE': [ols_train_mse, train_mse, cv_train_mse],
    'Test MSE': [ols_test_mse, test_mse, cv_test_mse],
    'Training R²': [ols_train_r2, train_r2, cv_train_r2],
    'Test R²': [ols_test_r2, test_r2, cv_test_r2],
    'Features Used': [n_features, sum(coefficients != 0), sum(lasso_cv.coef_ != 0)]
})
Out[18]:
Console
Model Performance Comparison:
           Model  Training MSE  Test MSE  Training R²  Test R²  Features Used
0            OLS        0.0077    0.0118       0.9988   0.9986             10
1  LASSO (α=0.1)        0.0412    0.0527       0.9937   0.9936              3
2       LASSO CV        0.0081    0.0106       0.9988   0.9987              5

The comparison reveals LASSO's value proposition: it achieves comparable or better test performance than OLS while using far fewer features. OLS uses all 10 features and reaches a test R² of 0.9986, while LASSO CV reaches 0.9987 with only 5 features and a lower test MSE (0.0106 vs. 0.0118). The manual LASSO (α=0.1) gives up a little accuracy (test MSE 0.0527, test R² 0.9936) but compresses the model to just 3 features. This is the trade-off LASSO offers: simpler models that are easier to interpret and explain, at little or no cost in predictive performance, and even without optimal tuning it delivers substantial benefits through automatic feature selection.

Step 8: Practical Implementation with Pipeline

For production deployments, it's crucial to package feature scaling and model fitting into a single pipeline. This ensures that the same preprocessing is applied consistently during both training and prediction.

In[19]:
Code
# Create a complete pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('lasso', LassoCV(cv=5, random_state=42, max_iter=10000))
])

# Fit the pipeline
pipeline.fit(X_train, y_train)

# Get the LASSO model from the pipeline
lasso_from_pipeline = pipeline.named_steps['lasso']

# Make predictions
y_pred_pipeline = pipeline.predict(X_test)

# Calculate final performance
final_mse = mean_squared_error(y_test, y_pred_pipeline)
final_r2 = r2_score(y_test, y_pred_pipeline)

# Report the results and the model chosen by cross-validation
print(f"Final Test MSE: {final_mse:.4f}")
print(f"Final Test R²: {final_r2:.4f}")
print(f"Selected alpha: {lasso_from_pipeline.alpha_:.4f}")
print(f"Features selected: {sum(lasso_from_pipeline.coef_ != 0)}")
Out[20]:
Console
Final Test MSE: 0.0107
Final Test R²: 0.9987
Selected alpha: 0.0069
Features selected: 6

The pipeline approach provides a production-ready implementation that automatically handles feature standardization before applying LASSO. The results (test R² of 0.9987, selected α of 0.0069, 6 features) are essentially in line with the earlier cross-validated model (test R² of 0.9987, α of 0.0071, 5 features); the small differences most likely stem from the standardization step inside the pipeline. This pattern is important for deployment because it encapsulates all transformations in a single object, ensuring that new data will be preprocessed identically to training data. You can save this pipeline using joblib or pickle and deploy it directly to production environments.

Key Parameters

Below are the main parameters that affect how LASSO works and performs.

  • alpha: Regularization parameter that controls the strength of the L1 penalty. Higher values lead to more feature selection and coefficient shrinkage. Default is 1.0.
  • max_iter: Maximum number of iterations for the coordinate descent algorithm. Increase this if convergence warnings appear. Default is 1000.
  • tol: Tolerance for the optimization algorithm. Smaller values may improve precision but increase computation time. Default is 1e-4.
  • fit_intercept: Whether to fit an intercept term. Usually set to True for most applications. Default is True.
  • normalize: Whether to normalize features before fitting. Deprecated and removed in recent scikit-learn versions; use StandardScaler in a pipeline instead.
  • selection: Algorithm to use for optimization. 'cyclic' is the default and works well for most cases. 'random' can be faster for large datasets.
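As a quick illustration of these parameters, here is a minimal configuration sketch for scikit-learn's Lasso estimator. The specific values are arbitrary and shown only to demonstrate the API, and X_train and y_train are assumed to come from the earlier steps.

from sklearn.linear_model import Lasso

# Illustrative configuration; the values are examples, not recommendations
lasso = Lasso(
    alpha=0.05,          # strength of the L1 penalty
    max_iter=10000,      # generous iteration budget to avoid convergence warnings
    tol=1e-4,            # optimization tolerance (the default)
    fit_intercept=True,  # fit an unpenalized intercept
    selection='random',  # can speed up coordinate descent on large datasets
    random_state=42,     # makes 'random' selection reproducible
)
lasso.fit(X_train, y_train)
print(lasso.coef_)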

Key Methods

The following are the most commonly used methods for interacting with LASSO models.

  • fit(X, y): Trains the LASSO model on the provided data. X should be the feature matrix and y the target vector.
  • predict(X): Makes predictions on new data using the trained model. Returns predicted target values.
  • score(X, y): Returns the R² score of the model on the given data. Higher values indicate better fit.
  • get_params(): Returns the current parameter values of the model. Useful for inspecting model configuration.
  • set_params(**params): Sets model parameters. Useful for hyperparameter tuning and model configuration.

Practical Implications

LASSO is particularly effective when working with high-dimensional datasets where the number of features approaches or exceeds the number of observations. This situation commonly arises in genomics, where researchers analyze thousands of gene expressions to predict disease outcomes, and in text analysis, where documents are represented by large vocabularies. The automatic feature selection provided by LASSO reduces both computational costs and model complexity while maintaining interpretability, which is a significant advantage when explaining results to non-technical stakeholders.

Medical research and clinical applications benefit significantly from LASSO's feature selection capabilities. When developing diagnostic models or treatment protocols, identifying which biomarkers or clinical indicators truly matter is important for regulatory approval and clinical adoption. LASSO naturally produces sparse models that highlight the most important predictive factors, making it easier for clinicians to understand and trust the model's recommendations. This interpretability is particularly valuable in healthcare, where model transparency can directly impact patient care decisions.

Financial modeling and risk assessment also leverage LASSO's strengths when building predictive models from numerous economic indicators, market signals, or customer attributes. The method can identify which factors genuinely drive outcomes while filtering out redundant or noisy variables. This sparsity simplifies model maintenance and reduces the risk of overfitting to historical patterns that may not persist. Additionally, simpler models are easier to explain to regulators, auditors, and business stakeholders who need to understand the basis for financial decisions or risk assessments.

Best Practices

Use LassoCV for hyperparameter selection rather than manually choosing alpha values. This class efficiently explores a range of regularization strengths through cross-validation and identifies the optimal balance between model complexity and predictive performance. Set cv=5 or cv=10 for reliable estimates, and specify random_state for reproducibility. The automatic alpha selection typically outperforms manual tuning and saves significant experimentation time.

Evaluate LASSO models using both predictive metrics and feature selection quality. While metrics like R² and mean squared error assess predictive accuracy, examining which features were selected provides insight into model interpretability and stability. Check whether selected features remain consistent across different cross-validation folds—unstable feature selection may indicate that multiple features contain similar information or that the signal-to-noise ratio is low. Validate that selected features align with domain knowledge; statistically significant features that lack practical relevance may signal data quality issues or spurious correlations.
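One way to check selection stability, sketched below under the assumption that X_train and y_train are the NumPy arrays from the earlier steps, is to refit LASSO on each cross-validation fold and count how often each feature survives:

import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import Lasso

kf = KFold(n_splits=5, shuffle=True, random_state=42)
selection_counts = np.zeros(X_train.shape[1])

for train_idx, _ in kf.split(X_train):
    fold_model = Lasso(alpha=0.1, max_iter=10000)
    fold_model.fit(X_train[train_idx], y_train[train_idx])
    selection_counts += (fold_model.coef_ != 0)

# Features selected in (almost) every fold are stable; the rest are questionable
for j, count in enumerate(selection_counts):
    print(f"Feature {j + 1}: selected in {int(count)}/5 folds")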

When working with correlated features, recognize that LASSO may arbitrarily select one variable from a correlated group while setting others to zero. This behavior can make interpretation challenging and results unstable across different data samples. If maintaining all relevant features is important, consider Elastic Net (which combines L1 and L2 penalties) to retain correlated predictors while still achieving regularization. Accept that LASSO coefficients are biased toward zero due to the penalty—this shrinkage is intentional and helps prevent overfitting, but it means coefficient magnitudes should not be interpreted as precise effect sizes.
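If correlated predictors are a concern, a minimal sketch of the Elastic Net alternative using scikit-learn's ElasticNetCV, which tunes both the overall penalty strength and the L1/L2 mix, might look like this (reusing X_train and y_train from the earlier steps):

from sklearn.linear_model import ElasticNetCV

# l1_ratio controls the mix: 1.0 is pure LASSO, values near 0 approach Ridge
enet = ElasticNetCV(l1_ratio=[0.2, 0.5, 0.8, 0.95], cv=5,
                    random_state=42, max_iter=10000)
enet.fit(X_train, y_train)

print("Chosen alpha:", enet.alpha_)
print("Chosen l1_ratio:", enet.l1_ratio_)
print("Non-zero coefficients:", sum(enet.coef_ != 0))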

Data Requirements and Preprocessing

LASSO requires numerical features and assumes observations are independent and identically distributed. Check for temporal dependencies, spatial clustering, or hierarchical structures in your data that might violate independence assumptions. For time series data, consider using time-based cross-validation splits rather than random splits to avoid data leakage. The target variable should be continuous; for classification problems, use logistic regression with L1 regularization instead.
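As a sketch of both points: LassoCV accepts any cross-validation splitter, so TimeSeriesSplit can replace random folds for time-ordered data, and scikit-learn's LogisticRegression supports an L1 penalty for classification. The classification inputs below (X_train_clf, y_train_clf) are hypothetical placeholders.

from sklearn.model_selection import TimeSeriesSplit
from sklearn.linear_model import LassoCV, LogisticRegression

# Time-ordered data: train on earlier observations, validate on later ones
lasso_ts = LassoCV(cv=TimeSeriesSplit(n_splits=5), max_iter=10000)
lasso_ts.fit(X_train, y_train)

# Classification with an L1 penalty (C is the inverse of regularization strength)
clf = LogisticRegression(penalty='l1', solver='liblinear', C=0.5)
# clf.fit(X_train_clf, y_train_clf)  # hypothetical classification data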

Feature scaling is important because the L1 penalty applies equally to all coefficients. Without standardization, features with larger numerical ranges will appear more important simply due to scale. Use StandardScaler to center features at zero with unit variance, which is the standard approach for LASSO. Handle missing values before fitting. Common strategies include imputation with mean/median values, forward/backward filling for time series, or using more sophisticated methods like KNN imputation depending on the missingness pattern.
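A minimal sketch of combining imputation, standardization, and LASSO in a single pipeline follows; SimpleImputer with a median strategy is just one reasonable choice, and the data are assumed to be the X_train and y_train used throughout:

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LassoCV

preprocess_and_fit = Pipeline([
    ('impute', SimpleImputer(strategy='median')),   # fill missing values
    ('scale', StandardScaler()),                    # zero mean, unit variance
    ('lasso', LassoCV(cv=5, random_state=42, max_iter=10000)),
])
preprocess_and_fit.fit(X_train, y_train)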

LASSO performs best when the true relationship is sparse, meaning only a subset of features genuinely influence the target. If you suspect many features are relevant (dense relationships), Ridge regression or Elastic Net may be more appropriate. For high-dimensional problems where the number of features exceeds observations (p > n), LASSO can still be effective but requires careful cross-validation to avoid overfitting. Consider whether the target variable's distribution suggests transformations—for example, log-transforming right-skewed targets can improve model performance and residual behavior.
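If the target is right-skewed, one option is to wrap the regressor in scikit-learn's TransformedTargetRegressor so the log transform and its inverse are applied automatically; a sketch under the assumption that the target is non-negative (y_skewed below is a hypothetical placeholder):

import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LassoCV

# Fit on log1p(y) and automatically invert predictions with expm1
log_lasso = TransformedTargetRegressor(
    regressor=LassoCV(cv=5, random_state=42, max_iter=10000),
    func=np.log1p,
    inverse_func=np.expm1,
)
# log_lasso.fit(X_train, y_skewed)  # y_skewed: hypothetical non-negative, skewed target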

Common Pitfalls

Running a single LASSO fit with a manually chosen alpha often leads to suboptimal results. The regularization strength significantly impacts both feature selection and predictive performance, and the optimal value varies widely across datasets. Without cross-validation, you may select too few features (underfitting) or too many (overfitting). Use LassoCV to systematically explore different alpha values and select the one that maximizes cross-validated performance.

Interpreting LASSO coefficients as precise effect sizes is problematic because the L1 penalty biases all coefficients toward zero. This shrinkage is intentional—it prevents overfitting—but it means coefficient magnitudes are systematically underestimated compared to their true values. Additionally, when features are highly correlated, LASSO may arbitrarily select one while setting others to zero, even if they contain similar information. This instability can lead to different features being selected across different data samples or cross-validation folds, making interpretation unreliable.

Ignoring domain knowledge when evaluating feature selection can result in models that are statistically sound but practically meaningless. A model that selects features with no plausible causal relationship to the outcome may be capturing spurious correlations that won't generalize to new data. Validate that selected features make sense from a subject-matter perspective. Similarly, applying LASSO when many features are genuinely relevant (dense relationships) often yields poor results because the method is designed for sparse problems. In such cases, Ridge regression or Elastic Net typically perform better by retaining more features with smaller coefficients rather than aggressively eliminating them.

Computational Considerations

LASSO's coordinate descent optimization scales as O(np) per iteration, where n is the number of samples and p is the number of features. For typical datasets with fewer than 100,000 samples and 10,000 features, the algorithm converges quickly—usually within seconds to minutes on modern hardware. Convergence can be slower when features are highly correlated because the algorithm must make many small adjustments to find the optimal coefficient values. If you encounter convergence warnings, increase max_iter from the default 1000 to 5000 or 10000.
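One way to handle this explicitly, sketched below, is to escalate scikit-learn's ConvergenceWarning to an error and retry with a larger iteration budget. The helper name and the alpha value are purely illustrative.

import warnings
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import Lasso

def fit_lasso_with_retry(X, y, alpha=0.01):
    """Fit LASSO; if coordinate descent does not converge, refit with more iterations."""
    with warnings.catch_warnings():
        warnings.simplefilter('error', category=ConvergenceWarning)
        try:
            return Lasso(alpha=alpha, max_iter=1000).fit(X, y)
        except ConvergenceWarning:
            return Lasso(alpha=alpha, max_iter=10000).fit(X, y)

model = fit_lasso_with_retry(X_train, y_train)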

Memory requirements scale linearly with dataset size, making LASSO suitable for datasets that fit in RAM. For extremely high-dimensional problems (p > 10,000), sparse matrix representations can significantly reduce memory usage if your feature matrix contains many zeros. The LassoCV implementation uses warm starts when evaluating different alpha values, meaning it initializes each fit with the solution from a nearby alpha value, which substantially speeds up the cross-validation process compared to fitting each alpha independently.
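For illustration, scikit-learn's Lasso can be fit directly on a SciPy sparse matrix without converting it to a dense array; the synthetic sparse data below is purely hypothetical:

import numpy as np
from scipy import sparse
from sklearn.linear_model import Lasso

rng = np.random.default_rng(42)

# Hypothetical high-dimensional feature matrix that is 99% zeros
X_sparse = sparse.random(1000, 5000, density=0.01, format='csr', random_state=42)
true_coef = np.zeros(5000)
true_coef[:5] = [3.0, -2.0, 1.5, -1.0, 0.5]
y_sparse = X_sparse @ true_coef + rng.normal(scale=0.1, size=1000)

sparse_model = Lasso(alpha=0.01, max_iter=10000)
sparse_model.fit(X_sparse, y_sparse)          # no dense conversion needed
print('Non-zero coefficients:', np.sum(sparse_model.coef_ != 0))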

For large datasets, leverage parallel processing by setting n_jobs=-1 in LassoCV to use all available CPU cores. This parallelizes the cross-validation folds, providing near-linear speedup with the number of cores. For datasets exceeding 100,000 samples, consider using stochastic gradient descent variants or sampling strategies, though these may sacrifice some accuracy for speed. Alternatively, if your problem allows it, feature selection through simpler methods (like variance thresholding or correlation analysis) before applying LASSO can reduce dimensionality and improve computational efficiency.

Performance and Deployment Considerations

Assess LASSO performance using both predictive accuracy and feature selection quality. Standard regression metrics (R², mean squared error, mean absolute error) evaluate predictive performance, but also examine which features were selected and whether they remain stable across cross-validation folds. Unstable feature selection—where different folds select different features—suggests that multiple features contain similar information or that the signal-to-noise ratio is low. Good LASSO results show clear separation between selected and eliminated features, with selected features making sense from a domain perspective.

When evaluating feature selection, check whether the number of selected features aligns with expectations. Selecting too many features (approaching the total available) may indicate that alpha is too small, providing insufficient regularization. Selecting too few features (perhaps just one or two) may indicate excessive regularization that eliminates genuinely useful predictors. The optimal model typically selects a meaningful subset—enough to capture important relationships but few enough to maintain interpretability and prevent overfitting.

LASSO models are lightweight and fast for inference, making them suitable for real-time applications. Prediction requires only a dot product between the feature vector and the coefficient vector, which executes in microseconds. The sparse nature of LASSO models (many zero coefficients) means they require less memory than dense models and can be optimized further by storing only non-zero coefficients. However, monitor both predictive performance and feature stability in production. If the data distribution shifts, the optimal features may change, requiring model retraining. Implement automated monitoring that tracks prediction errors and alerts when performance degrades beyond acceptable thresholds. For applications requiring model transparency—such as healthcare, finance, or regulatory compliance—LASSO's interpretability is particularly valuable because stakeholders can understand exactly which factors drive predictions.
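As a small illustration of how cheap inference is, the sketch below keeps only the non-zero coefficients from the fitted lasso_cv model (assumed to still be in scope from the earlier steps) and predicts with a plain dot product:

import numpy as np

# Keep only the active features and their coefficients
active_idx = np.flatnonzero(lasso_cv.coef_)
active_coef = lasso_cv.coef_[active_idx]
intercept = lasso_cv.intercept_

def predict_compact(x_row):
    """Predict from one feature vector using only the non-zero coefficients."""
    return float(x_row[active_idx] @ active_coef + intercept)

print(predict_compact(X_test[0]))        # compact manual prediction
print(lasso_cv.predict(X_test[:1])[0])   # matches the full model's prediction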

Summary

LASSO (L1) regularization is a technique for building simpler, more interpretable regression models by encouraging sparsity—shrinking some coefficients exactly to zero and thus performing feature selection. The core idea is captured in the summarized form of the LASSO objective, which adds a penalty proportional to the sum of the absolute values of the coefficients, controlled by the regularization parameter λ\lambda.

To make this precise and computationally tractable, we use the matrix form of the objective, as shown above:

minβ{12nyXβ22+αβ1}\min_{\beta} \left\{ \frac{1}{2n} \|y - X\beta\|_2^2 + \alpha \|\beta\|_1 \right\}

where:

  • β\beta: (p+1)×1(p+1) \times 1 vector of coefficients to be optimized (including intercept β0\beta_0)
  • nn: number of observations in the dataset
  • yy: n×1n \times 1 target vector containing actual values
  • XX: n×(p+1)n \times (p+1) feature matrix (includes column of ones for intercept)
  • α\alpha: regularization parameter in scikit-learn (controls penalty strength)
  • yXβ22\|y - X\beta\|_2^2: squared L2 (Euclidean) norm of residuals, equal to i=1n(yiy^i)2\sum_{i=1}^n (y_i - \hat{y}_i)^2
  • β1\|\beta\|_1: L1 (Manhattan) norm of coefficients, equal to j=1pβj\sum_{j=1}^p |\beta_j| (intercept not penalized)

This formulation enables efficient computation and connects LASSO to broader optimization theory.

However, unlike ordinary least squares, LASSO does not have a closed-form solution because the L1 penalty makes the problem non-differentiable at zero. This is what allows LASSO to set coefficients exactly to zero, a property that distinguishes it from Ridge (L2) regularization (which we will cover in the next section), which typically only shrinks coefficients but rarely eliminates them.

To actually solve the LASSO problem, we need iterative optimization algorithms. One such method is the Iterative Shrinkage-Thresholding Algorithm (ISTA), described above. ISTA separates the smooth (least squares) and non-smooth (L1 penalty) parts of the objective, alternating between a gradient step and a soft-thresholding step. This practical algorithm is how we compute the LASSO solution in practice.
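For concreteness, here is a minimal NumPy sketch of ISTA for the objective above. It ignores the intercept for brevity, uses the spectral norm of X to set the step size, and is meant as an illustration of the gradient-plus-soft-thresholding idea rather than a production solver:

import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator: shrink toward zero, clipping at zero."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ista_lasso(X, y, alpha, n_iter=500):
    n, p = X.shape
    beta = np.zeros(p)
    # Lipschitz constant of the gradient of the least-squares term (1/2n)||y - Xb||^2
    L = np.linalg.norm(X, 2) ** 2 / n
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / n                     # gradient step (smooth part)
        beta = soft_threshold(beta - grad / L, alpha / L)   # proximal step (L1 part)
    return beta

beta_ista = ista_lasso(X_train, y_train, alpha=0.1)
print(np.round(beta_ista, 3))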

In summary: the summarized form tells us the goal, the matrix form gives us a precise and efficient way to express it, and ISTA (or similar algorithms) provides the means to actually find the solution. Understanding the distinction and relationship between these is key: the forms define what LASSO is optimizing, while ISTA is how we solve it. Choosing the right value of λ\lambda or α\alpha (often via cross-validation) is important for balancing model simplicity and predictive performance.

Quiz

Ready to test your understanding of LASSO regularization? Take this quiz to reinforce what you've learned about L1 regularization for feature selection and sparsity.




About the author: Michael Brenndoerfer

All opinions expressed here are my own and do not reflect the views of my employer.

Michael currently works as an Associate Director of Data Science at EQT Partners in Singapore, leading AI and data initiatives across private capital investments.

With over a decade of experience spanning private equity, management consulting, and software engineering, he specializes in building and scaling analytics capabilities from the ground up. He has published research in leading AI conferences and holds expertise in machine learning, natural language processing, and value creation through data.
