
L1 Regularization (LASSO): Complete Guide with Math, Examples & Python Implementation

Michael Brenndoerfer · October 3, 2025 · 49 min read · 11,613 words

A comprehensive guide to L1 regularization (LASSO) in machine learning, covering mathematical foundations, optimization theory, practical implementation, and real-world applications. Learn how LASSO performs automatic feature selection through sparsity.

This article is part of the free-to-read Data Science Handbook

Out[2]:

Geometric interpretation of LASSO (L1) vs Ridge (L2) regularization. The plot shows why LASSO produces sparse solutions by illustrating the constraint regions in coefficient space. The elliptical contours represent levels of constant Sum of Squared Errors (SSE), with inner contours having lower error. The diamond-shaped region (blue) is the L1 constraint |β₁| + |β₂| ≤ t, while the circular region (red) is the L2 constraint β₁² + β₂² ≤ t. The optimal solution occurs where the SSE contours first touch the constraint region. Because the L1 constraint has sharp corners at the axes, the contours typically intersect at these corners, setting one or more coefficients exactly to zero (shown by the blue star). In contrast, the smooth L2 constraint rarely intersects at the axes, so Ridge regression (red star) shrinks coefficients but doesn't eliminate them. This geometric property is the fundamental reason LASSO performs automatic feature selection.

The key insight from this visualization is that the L1 constraint region (diamond) has sharp corners at the coordinate axes. When the SSE contours expand outward from the OLS solution, they are much more likely to first touch the constraint region at one of these corners, where at least one coefficient is exactly zero. This is why LASSO naturally performs feature selection.

In contrast, the L2 constraint region (circle) is smooth everywhere, so the SSE contours typically touch it at a point where both coefficients are non-zero. Ridge regression therefore shrinks coefficients toward zero but rarely sets them exactly to zero.

This geometric property holds in higher dimensions as well: the L1 constraint in p-dimensional space has 2p corners on the coordinate axes, giving LASSO many opportunities to set coefficients to zero. The L2 constraint, being a hypersphere, remains smooth in all dimensions.
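To see this sparsity difference numerically rather than geometrically, here is a small sketch comparing LASSO and Ridge on a toy two-feature problem where only the first feature matters. The data, alpha values, and random seed are arbitrary choices for illustration.

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n = 200
X = rng.standard_normal((n, 2))
# Only the first feature drives the target; the second is pure noise.
y = 3.0 * X[:, 0] + 0.5 * rng.standard_normal(n)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)

# The L1 penalty typically drives the second coefficient to exactly 0.0,
# while the L2 penalty leaves it small but non-zero.
print("LASSO coefficients:", lasso.coef_)
print("Ridge coefficients:", ridge.coef_)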

Matrix Notation

We can rewrite the LASSO optimization problem in matrix notation, which provides a more compact and mathematically elegant representation. This notation is particularly useful for understanding the relationship between LASSO and other optimization problems, and it's the form commonly used in machine learning libraries like scikit-learn.

From Summation to Matrix Form

Let's start with our familiar summation form and transform it step by step:

$$\min_{\beta} \left\{ \sum_{i=1}^n \left( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^p |\beta_j| \right\}$$

Step 1: Define the matrices and vectors

First, let's define our data in matrix form:

  • $y$: An $n \times 1$ vector containing all target values:

    $$y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}$$

  • $X$: An $n \times (p+1)$ matrix containing all features (including a column of ones for the intercept):

    $$X = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix}$$

  • $\beta$: A $(p+1) \times 1$ vector containing all coefficients (including the intercept):

    $$\beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{bmatrix}$$

Step 2: Express predictions in matrix form

The predicted values for all observations can be written as $\hat{y} = X\beta$.

This single matrix multiplication $X\beta$ computes all predictions simultaneously. Row $i$ of $X$ multiplied by $\beta$ gives:

$$1 \cdot \beta_0 + x_{i1} \beta_1 + x_{i2} \beta_2 + \cdots + x_{ip} \beta_p = \beta_0 + \sum_{j=1}^p \beta_j x_{ij}$$

This is exactly the prediction for observation $i$:

$$\hat{y}_i = \beta_0 + \sum_{j=1}^p \beta_j x_{ij}$$

Step 3: Express residuals in matrix form

The residuals (errors) for all observations are:

$$r = y - X\beta = \begin{bmatrix} y_1 - \hat{y}_1 \\ y_2 - \hat{y}_2 \\ \vdots \\ y_n - \hat{y}_n \end{bmatrix}$$

Step 4: Express the sum of squared errors in matrix form

The sum of squared errors becomes:

$$\text{SSE} = \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \sum_{i=1}^n r_i^2 = r^T r = (y - X\beta)^T (y - X\beta)$$

Let's verify this step by step:

$$r^T r = \begin{bmatrix} r_1 & r_2 & \cdots & r_n \end{bmatrix} \begin{bmatrix} r_1 \\ r_2 \\ \vdots \\ r_n \end{bmatrix} = r_1^2 + r_2^2 + \cdots + r_n^2 = \sum_{i=1}^n r_i^2$$

Since $r_i = y_i - \hat{y}_i$, we have $\sum_{i=1}^n r_i^2 = \sum_{i=1}^n (y_i - \hat{y}_i)^2$, which is exactly the sum of squared errors.
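As a quick sanity check of Steps 2 through 4, the following sketch uses small random arrays (purely illustrative values) to confirm numerically that the matrix expressions match the summation forms.

import numpy as np

rng = np.random.default_rng(1)
n, p = 6, 3
X = np.column_stack([np.ones(n), rng.standard_normal((n, p))])  # n x (p+1), first column of ones
beta = rng.standard_normal(p + 1)                               # (p+1) x 1 coefficient vector
y = rng.standard_normal(n)

y_hat = X @ beta                  # Step 2: all predictions in one matrix multiplication
r = y - y_hat                     # Step 3: residual vector
sse_matrix = r @ r                # Step 4: r^T r

# Row-wise prediction matches beta_0 + sum_j beta_j * x_ij
i = 2
print(np.isclose(y_hat[i], beta[0] + X[i, 1:] @ beta[1:]))   # True

# r^T r matches the explicit sum of squared residuals
sse_sum = sum((y[k] - y_hat[k]) ** 2 for k in range(n))
print(np.isclose(sse_matrix, sse_sum))                        # True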

Step 5: Express the L1 penalty in matrix form

The L1 penalty term can be written using the L1 norm: $\lambda \sum_{j=1}^p |\beta_j| = \lambda \|\beta\|_1$

where $\|\beta\|_1 = \sum_{j=1}^p |\beta_j|$ is the L1 norm of the coefficient vector: the sum of the absolute values of the coefficients, also known as the "Manhattan distance" between the coefficient vector and the origin.

Final matrix form:

What we have now is a compact way of writing the LASSO optimization problem using matrix notation.

$$\min_{\beta} \left\{ (y - X\beta)^T (y - X\beta) + \lambda \|\beta\|_1 \right\}$$

The Scikit-learn Formulation

Scikit-learn uses a slightly different but equivalent formulation that includes a normalization factor:

$$\min_{\beta} \left\{ \frac{1}{2n} \|y - X\beta\|_2^2 + \alpha \|\beta\|_1 \right\}$$

where:

  • $\alpha$: regularization parameter in scikit-learn, related to the traditional $\lambda$ by $\alpha = \frac{\lambda}{2n}$
  • $n$: number of observations in the dataset
  • $\|y - X\beta\|_2^2$: squared L2 (Euclidean) norm of residuals, equivalent to $\sum_{i=1}^n (y_i - \hat{y}_i)^2$ (the SSE)
  • $\|\beta\|_1$: L1 (Manhattan) norm of coefficients, equal to $\sum_{j=1}^p |\beta_j|$ (sum of absolute values; the intercept is not penalized)

The expression $\|y - X\beta\|_2^2$ uses the L2 norm (also called the Euclidean norm):

$$\|y - X\beta\|_2 = \sqrt{\sum_{i=1}^n (y_i - \hat{y}_i)^2} = \sqrt{\sum_{i=1}^n (y_i - (X\beta)_i)^2}$$

where:

  • $\|y - X\beta\|_2$: L2 norm of the residual vector (Euclidean distance)
  • $y_i$: actual target value for observation $i$
  • $\hat{y}_i$: predicted value for observation $i$, equal to $(X\beta)_i$
  • $n$: number of observations

When we square the L2 norm, we get:

$$\|y - X\beta\|_2^2 = \sum_{i=1}^n (y_i - \hat{y}_i)^2 = (y - X\beta)^T (y - X\beta)$$

This is exactly our sum of squared errors (SSE)! The squared L2 norm gives us the same quantity as our matrix form $(y - X\beta)^T (y - X\beta)$. The L2 or Euclidean norm is simply the square root of the sum of the squares of the elements of the vector.

We use the L2 norm notation because it is mathematically equivalent to the matrix form $(y - X\beta)^T (y - X\beta)$, but more concise and widely recognized. The L2 norm is standard in optimization and statistics, making formulas easier to read and interpret. This notation also highlights the connection to Euclidean distance, which helps build intuition about how the loss function measures the overall error.

That covers the first term of the objective; now let's look at the second. The expression $\|\beta\|_1$ is the L1 norm of the coefficient vector:

$$\|\beta\|_1 = \sum_{j=1}^p |\beta_j| = |\beta_1| + |\beta_2| + \cdots + |\beta_p|$$

where:

  • $\|\beta\|_1$: L1 norm (Manhattan distance) of the coefficient vector
  • $|\beta_j|$: absolute value of the $j$-th coefficient
  • $p$: number of features (excluding the intercept)

This is exactly our L1 penalty term! The L1 norm is simply the sum of the absolute values of the elements of the vector.

Note: The intercept $\beta_0$ is typically not included in the L1 penalty, so the sum runs from $j=1$ to $j=p$, not from $j=0$ to $j=p$.

The L1 norm offers several important advantages: it encourages sparsity by driving some coefficients exactly to zero, which enables automatic feature selection. It has a clear geometric interpretation, measuring the "Manhattan distance" (the sum of absolute values) from the origin in coefficient space. And it is a convex function, which ensures that the optimization problem remains tractable and that efficient algorithms can find a global minimum.
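As a tiny concrete example, the L1 penalty for a hypothetical coefficient vector (intercept excluded from the penalty) is one line of NumPy:

import numpy as np

beta = np.array([0.7, 2.0, -1.5, 1.0, 0.0, 0.0])   # [intercept, beta_1, ..., beta_5], made-up values
l1_penalty = np.abs(beta[1:]).sum()                 # |2.0| + |-1.5| + |1.0| = 4.5, intercept skipped
print(l1_penalty)                                   # 4.5
print(np.linalg.norm(beta[1:], ord=1))              # same value via NumPy's L1 norm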

The Normalization Factor: $\frac{1}{2n}$

Let's take a closer look at the normalization factor $\frac{1}{2n}$ that appears in front of the L2 norm in the scikit-learn formulation. In scikit-learn, the regularization strength is controlled by the parameter alpha ($\alpha$), which is related to the traditional $\lambda$ parameter by $\alpha = \frac{\lambda}{2n}$.

This means that the scikit-learn objective function is normalized by both the number of samples and a factor of 2, making the loss function consistent with the mean squared error and simplifying gradient calculations.

This formulation is sometimes referred to as the "normalized" or "scaled" LASSO objective, and is the standard approach used in most modern machine learning libraries.

The factor $\frac{1}{2n}$ serves several important purposes:

  1. Sample size normalization: Dividing by $n$ makes the loss function independent of sample size. Without this, larger datasets would have larger loss values simply because they have more observations.

  2. Gradient scaling: The factor $\frac{1}{2}$ simplifies the gradient calculation. Let's take the derivative of $\frac{1}{2n}\|y - X\beta\|_2^2$ with respect to $\beta$. To see how we arrive at this gradient, let's rewrite the squared L2 norm and take the derivative step by step:

     $$\frac{\partial}{\partial \beta} \left[ \frac{1}{2n}\|y - X\beta\|_2^2 \right] = \frac{\partial}{\partial \beta} \left[ \frac{1}{2n} (y - X\beta)^T (y - X\beta) \right]$$

     Let's see how we can rewrite the term $(y - X\beta)^T (y - X\beta)$ in a more manageable form. This term is a quadratic form that represents the sum of squared errors in matrix notation. To make it easier to take derivatives and understand the structure, we can expand it using the distributive property of matrix multiplication:

     $$(y - X\beta)^T (y - X\beta) = y^T y - y^T X\beta - (X\beta)^T y + (X\beta)^T (X\beta)$$

     Notice that $(X\beta)^T y$ is a scalar and is equal to its transpose $y^T X\beta$, so the two middle terms combine to $-2\beta^T X^T y$. The last term, $(X\beta)^T (X\beta)$, can be rewritten as $\beta^T X^T X \beta$. Putting it all together, we get:

     $$(y - X\beta)^T (y - X\beta) = y^T y - 2\beta^T X^T y + \beta^T X^T X \beta$$

     Taking the derivative with respect to $\beta$:

    • The derivative of $y^T y$ with respect to $\beta$ is 0 (since it does not depend on $\beta$)
    • The derivative of $-2\beta^T X^T y$ with respect to $\beta$ is $-2 X^T y$
    • The derivative of $\beta^T X^T X \beta$ with respect to $\beta$ is $2 X^T X \beta$

     So, the derivative is:

     $$\frac{\partial}{\partial \beta} (y - X\beta)^T (y - X\beta) = -2 X^T y + 2 X^T X \beta = 2 X^T (X\beta - y) = -2 X^T (y - X\beta)$$

    where:

    • $X^T$: transpose of the feature matrix
    • $y$: target vector
    • $\beta$: coefficient vector

     Including the $\frac{1}{2n}$ factor:

     $$\frac{\partial}{\partial \beta} \left[ \frac{1}{2n}(y - X\beta)^T (y - X\beta) \right] = \frac{1}{2n} \cdot 2 X^T (X\beta - y) = \frac{1}{n} X^T (X\beta - y)$$

    Or, equivalently (by factoring out the negative sign):

     $$= -\frac{1}{n} X^T (y - X\beta)$$

     where:

    • $\frac{1}{n}$: normalization factor (average over all observations)
    • $X^T (X\beta - y)$: gradient of the squared error term with respect to $\beta$

     The gradient $\frac{1}{n} X^T (X\beta - y)$ points in the direction of steepest ascent of the loss function. In gradient descent for minimization, we move in the opposite direction by subtracting this gradient from the current $\beta$. Without the $\frac{1}{2}$ factor, we would have an extra factor of 2 in the gradient. A quick numerical check of this gradient appears right after this list.

  3. Consistency with statistical literature: This normalization makes the objective function consistent with the mean squared error (MSE), since $\frac{1}{n}\|y - X\beta\|_2^2$ is exactly the MSE.
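To sanity-check the derivation in point 2, the following sketch compares the analytic gradient $\frac{1}{n} X^T (X\beta - y)$ with a central finite-difference approximation on random toy data; all names and values are illustrative.

import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 4
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
beta = rng.standard_normal(p)

def loss(b):
    # Smooth part of the objective: (1/(2n)) * ||y - Xb||^2
    r = y - X @ b
    return (r @ r) / (2 * n)

analytic = X.T @ (X @ beta - y) / n   # (1/n) X^T (X beta - y)

eps = 1e-6
numeric = np.array([
    (loss(beta + eps * np.eye(p)[j]) - loss(beta - eps * np.eye(p)[j])) / (2 * eps)
    for j in range(p)
])

print(np.allclose(analytic, numeric, atol=1e-6))   # True: the derivation matches the numerics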

The Regularization Parameter: $\alpha$

In scikit-learn, the regularization parameter is called alpha ($\alpha$) instead of lambda ($\lambda$). The relationship between the two formulations is:

$$\alpha = \frac{\lambda}{2n}$$

where:

  • $\alpha$: scikit-learn regularization parameter
  • $\lambda$: traditional LASSO regularization parameter
  • $n$: number of observations in the dataset

This means:

  • Small $\alpha$: less regularization, coefficients closer to the OLS solution
  • Large $\alpha$: more regularization, more coefficients driven to zero
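As a quick illustration of this effect, the sketch below (synthetic data similar to the setup used later in the Implementation section; exact counts will vary with the data and seed) fits Lasso at several alpha values and counts the surviving coefficients.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
X = rng.standard_normal((100, 10))
true_coef = np.array([2.0, -1.5, 1.0] + [0.0] * 7)
y = X @ true_coef + 0.1 * rng.standard_normal(100)

for alpha in [0.001, 0.01, 0.1, 1.0]:
    model = Lasso(alpha=alpha, max_iter=10000).fit(X, y)
    # Larger alpha should leave fewer non-zero coefficients
    print(f"alpha={alpha:<6} non-zero coefficients: {np.sum(model.coef_ != 0)}")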

Complete Mathematical Breakdown

Let's put it all together with a concrete example. Suppose we have:

  • $n = 100$ observations
  • $p = 5$ features (plus intercept)
  • $\alpha = 0.1$

The scikit-learn objective function becomes:

$$\min_{\beta} \left\{ \frac{1}{200} \|y - X\beta\|_2^2 + 0.1 \|\beta\|_1 \right\}$$

where:

  • $\frac{1}{200}$: normalization factor ($\frac{1}{2n}$ with $n = 100$)
  • $0.1$: regularization parameter $\alpha$
  • $\|y - X\beta\|_2^2$: squared L2 norm of residuals
  • $\|\beta\|_1$: L1 norm of coefficients

Expanding this:

$$\min_{\beta} \left\{ \frac{1}{200} \sum_{i=1}^{100} (y_i - \hat{y}_i)^2 + 0.1 \sum_{j=1}^5 |\beta_j| \right\}$$

where:

  • $i = 1, 2, \ldots, 100$: index over observations
  • $j = 1, 2, \ldots, 5$: index over features
  • $y_i$: actual target value for observation $i$
  • $\hat{y}_i$: predicted value for observation $i$
  • $\beta_j$: coefficient for feature $j$

This is equivalent to:

$$\min_{\beta} \left\{ \frac{1}{200} \sum_{i=1}^{100} \left( y_i - \beta_0 - \sum_{j=1}^5 \beta_j x_{ij} \right)^2 + 0.1 \left( |\beta_1| + |\beta_2| + |\beta_3| + |\beta_4| + |\beta_5| \right) \right\}$$

where:

  • $\beta_0$: intercept term (not penalized)
  • $x_{ij}$: value of feature $j$ for observation $i$
  • $|\beta_j|$: absolute value of coefficient $j$
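To make the formula concrete, here is a small sketch that writes this objective out as a plain Python function; the data and coefficient values are made up purely for illustration.

import numpy as np

def sklearn_lasso_objective(beta0, beta, X, y, alpha=0.1):
    # (1/(2n)) * ||y - (beta0 + X beta)||^2 + alpha * ||beta||_1, intercept not penalized
    n = len(y)
    residuals = y - (beta0 + X @ beta)
    return (residuals @ residuals) / (2 * n) + alpha * np.abs(beta).sum()

# Illustrative usage with n = 100 observations and p = 5 features
rng = np.random.default_rng(4)
X = rng.standard_normal((100, 5))
y = rng.standard_normal(100)
beta0, beta = 0.5, np.array([1.0, -0.5, 0.0, 0.2, 0.0])
print(sklearn_lasso_objective(beta0, beta, X, y, alpha=0.1))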

Why This Formulation Matters

Matrix notation has several important practical benefits. First, matrix operations are highly optimized in modern numerical libraries, making computations much more efficient. Second, expressing the objective in terms of norms directly connects LASSO to the broader field of convex optimization, providing a solid theoretical foundation. Third, this standardized form is what most machine learning libraries, such as scikit-learn, use in their implementations, ensuring consistency across tools and platforms. Finally, this framework is flexible: it extends naturally to other regularization techniques like Ridge regression and Elastic Net, which will be discussed in later sections.

By understanding the matrix formulation, you will be better equipped to interpret results from scikit-learn, select appropriate values for the alpha parameter, recognize the connections between different regularization methods, and even implement custom optimization algorithms if your application requires it.

Visualizing the Regularization Process

After all of this math, let's take a look at a visualization that demonstrates how L1 regularization works by showing how coefficients change as the regularization parameter increases. This is called a "coefficient path" plot.

Recall that scikit-learn's Lasso solves the objective function $\frac{1}{2n}\lVert y - X\beta \rVert_2^2 + \alpha\lVert \beta \rVert_1$, where $\alpha = \frac{\lambda}{2n}$. This objective function defines what the algorithm is optimizing, and the choice of $\alpha$ directly controls the strength of regularization. In the plot below, we vary $\alpha$ to show its effect on the coefficients. The underlying implementation typically uses an algorithm such as the Iterative Shrinkage-Thresholding Algorithm (ISTA) or coordinate descent to efficiently solve this objective, but the objective function itself is what determines the solution.
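For intuition about how such a solver works, here is a minimal, unoptimized ISTA sketch for this objective: it alternates a gradient step on the smooth squared-error term with a soft-thresholding step for the L1 penalty. This is not scikit-learn's actual implementation (which uses coordinate descent), and the toy data and the fit_intercept=False simplification are assumptions made only for this example.

import numpy as np
from sklearn.linear_model import Lasso

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ista_lasso(X, y, alpha, n_iter=2000):
    n, p = X.shape
    beta = np.zeros(p)
    step = n / (np.linalg.norm(X, ord=2) ** 2)      # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / n             # gradient of (1/(2n)) * ||y - X beta||^2
        beta = soft_threshold(beta - step * grad, step * alpha)
    return beta

rng = np.random.default_rng(5)
X = rng.standard_normal((100, 10))
y = X @ np.array([2.0, -1.5, 1.0] + [0.0] * 7) + 0.1 * rng.standard_normal(100)

# The two solutions should agree closely (small numerical differences are expected).
print(np.round(ista_lasso(X, y, alpha=0.1), 3))
print(np.round(Lasso(alpha=0.1, fit_intercept=False).fit(X, y).coef_, 3))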

Out[5]:

LASSO coefficient paths demonstrating how L1 regularization performs automatic feature selection. As the regularization strength (α) increases from left to right, coefficients shrink toward zero. The plot shows that Features 4 and 5 (with true coefficients of 0) are the first to reach exactly zero, demonstrating LASSO's ability to identify and eliminate irrelevant features. Important features (1, 2, 3) maintain non-zero coefficients even under strong regularization, while less important ones are driven to zero. This visualization illustrates the core mechanism of LASSO: automatic feature selection through coefficient shrinkage.

Out[4]:

Cross-validation error curves showing the bias-variance tradeoff in LASSO regularization. The blue line shows training error increasing with regularization strength, while the red line shows validation error forming a U-shape. The optimal λ (green vertical line) minimizes validation error, balancing model complexity and generalization. The shaded red region represents standard error across cross-validation folds. Three labeled zones identify overfitting (λ too small), optimal (best generalization), and underfitting (λ too large) regions.


Feature selection behavior across regularization strengths. As λ increases from left to right, LASSO progressively eliminates features, reducing the model from 10 features to nearly zero. At the optimal λ (green vertical line), LASSO correctly selects 3 features, matching the true number of relevant features (gray horizontal line). This demonstrates LASSO's automatic feature selection capability.

This visualization reveals several important insights about choosing $\lambda$:

  1. Training Error (Blue Line): Increases monotonically as $\lambda$ increases. This is expected because stronger regularization constrains the model more, preventing it from fitting the training data as closely.

  2. Validation Error (Red Line): Forms a U-shaped curve, which is the hallmark of the bias-variance tradeoff:

    • Left side (small $\lambda$): The model overfits the training data, leading to poor generalization and high validation error.
    • Bottom (optimal $\lambda$): The sweet spot where the model balances fitting the training data and generalizing to new data.
    • Right side (large $\lambda$): The model underfits, with too much regularization preventing it from capturing important patterns.
  3. Feature Selection (Bottom Panel): Shows how the number of selected features decreases as regularization increases. At the optimal $\lambda$, LASSO correctly identifies the 3 truly important features (marked by the horizontal dashed line).

  4. Standard Error Bands: The shaded region around the validation error shows the variability across different cross-validation folds, helping us assess the stability of our performance estimates.

The optimal $\lambda$ (marked by the green vertical line) minimizes the validation error, providing the best balance between model complexity and predictive performance. This is why cross-validation is the standard approach for hyperparameter selection in LASSO and other regularized models.

Implementation

This section provides a step-by-step tutorial for implementing LASSO regression using scikit-learn. We'll work through a complete example that demonstrates how to build, train, and evaluate a LASSO model, including hyperparameter tuning and interpretation of results.

Step 1: Data Preparation

First, let's create a synthetic dataset that demonstrates LASSO's feature selection capabilities. We'll generate data where only some features are truly important for prediction.

In[9]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso, LassoCV, LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score
import warnings

warnings.filterwarnings("ignore")

# Set random seed for reproducibility
np.random.seed(42)

# Generate synthetic dataset
n_samples, n_features = 100, 10
X = np.random.randn(n_samples, n_features)

# Create true coefficients (only first 3 features are important)
true_coef = np.array([2.0, -1.5, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ true_coef + 0.1 * np.random.randn(n_samples)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
Out[6]:
Training set size: (80, 10)
Test set size: (20, 10)
True coefficients: [ 2.  -1.5  1.   0.   0.   0.   0.   0.   0.   0. ]

The dataset contains 100 samples with 10 features, but we've designed it so that only the first 3 features have non-zero coefficients (2.0, -1.5, and 1.0). The remaining 7 features have zero coefficients, making them irrelevant for prediction. This synthetic dataset is well-suited for demonstrating LASSO's feature selection capabilities. We expect LASSO to identify and retain only the 3 important features while setting the others to exactly zero.

Step 2: Basic LASSO Implementation

Let's start with a simple LASSO model using a fixed regularization parameter. We'll use alpha=0.1 as our initial regularization strength.

In[7]:
# Create and fit LASSO model
lasso = Lasso(alpha=0.1, max_iter=10000, random_state=42)
lasso.fit(X_train, y_train)

# Make predictions
y_pred_train = lasso.predict(X_train)
y_pred_test = lasso.predict(X_test)

# Calculate performance metrics
train_mse = mean_squared_error(y_train, y_pred_train)
test_mse = mean_squared_error(y_test, y_pred_test)
train_r2 = r2_score(y_train, y_pred_train)
test_r2 = r2_score(y_test, y_pred_test)
Out[8]:
Training MSE: 0.0412
Test MSE: 0.0527
Training R²: 0.9937
Test R²: 0.9936

The model demonstrates strong performance with R² scores above 0.99 on both training and test sets, indicating that it explains over 99% of the variance in the target variable. The Mean Squared Error (MSE) values are low (0.0412 for training, 0.0527 for test), suggesting that predictions are close to actual values. Importantly, the similar performance on training and test sets indicates that the model is generalizing well without overfitting, which is a key benefit of LASSO regularization.

Step 3: Examine Feature Selection

Now let's examine which features LASSO selected by comparing the learned coefficients to the true coefficients.

In[15]:
# Get coefficients and create comparison
coefficients = lasso.coef_
feature_names = [f"Feature {i + 1}" for i in range(n_features)]

# Create comparison DataFrame
comparison_df = pd.DataFrame(
    {
        "Feature": feature_names,
        "True_Coefficient": true_coef,
        "LASSO_Coefficient": coefficients,
        "Selected": coefficients != 0,
    }
)
Out[10]:
Feature Selection Results:
      Feature  True_Coefficient  LASSO_Coefficient  Selected
0   Feature 1               2.0              1.876      True
1   Feature 2              -1.5             -1.401      True
2   Feature 3               1.0              0.893      True
3   Feature 4               0.0              0.000     False
4   Feature 5               0.0             -0.000     False
5   Feature 6               0.0              0.000     False
6   Feature 7               0.0              0.000     False
7   Feature 8               0.0              0.000     False
8   Feature 9               0.0              0.000     False
9  Feature 10               0.0              0.000     False

Number of features selected: 3
Number of true features: 3

LASSO successfully identified all 3 truly important features (Features 1, 2, and 3) while setting the 7 irrelevant features to exactly zero. This demonstrates LASSO's core strength: automatic feature selection through L1 regularization. Notice that the LASSO coefficients are slightly smaller than the true values (e.g., 1.876 vs. 2.0 for Feature 1). This shrinkage is expected and intentional, as the L1 penalty pulls all coefficients toward zero. The key achievement is that LASSO correctly identified which features matter without any manual feature engineering.

Step 4: Hyperparameter Tuning with Cross-Validation

Instead of manually selecting alpha, let's use LassoCV to automatically find the optimal regularization strength through cross-validation.

In[11]:
# Use LassoCV for automatic alpha selection
lasso_cv = LassoCV(cv=5, random_state=42, max_iter=10000)
lasso_cv.fit(X_train, y_train)

# Get the optimal alpha
optimal_alpha = lasso_cv.alpha_

# Make predictions with optimal model
y_pred_cv_train = lasso_cv.predict(X_train)
y_pred_cv_test = lasso_cv.predict(X_test)

# Calculate performance metrics
cv_train_mse = mean_squared_error(y_train, y_pred_cv_train)
cv_test_mse = mean_squared_error(y_test, y_pred_cv_test)
cv_train_r2 = r2_score(y_train, y_pred_cv_train)
cv_test_r2 = r2_score(y_test, y_pred_cv_test)
Out[12]:
Optimal alpha: 0.0071
CV Training MSE: 0.0081
CV Test MSE: 0.0106
CV Training R²: 0.9988
CV Test R²: 0.9987

Cross-validation identified an optimal alpha of 0.0071, which is significantly smaller than our manual choice of 0.1. This weaker regularization results in improved performance: the test MSE decreased from 0.0527 to 0.0106, and the test R² increased from 0.9936 to 0.9987. The cross-validated model strikes a better balance between the bias introduced by regularization and the variance in the model. This demonstrates why cross-validation is the recommended approach for hyperparameter selection: it systematically searches for the regularization strength that maximizes generalization performance.

Step 5: Compare Feature Selection Results

Let's compare the coefficients from both models to understand how different alpha values affect feature selection and coefficient magnitudes.

In[21]:
# Compare coefficients
comparison_df["LASSO_CV_Coefficient"] = lasso_cv.coef_
comparison_df["CV_Selected"] = lasso_cv.coef_ != 0
Out[22]:
Comparison of Feature Selection:
      Feature  True_Coefficient  LASSO_Coefficient  Selected  \
0   Feature 1               2.0              1.876      True   
1   Feature 2              -1.5             -1.401      True   
2   Feature 3               1.0              0.893      True   
3   Feature 4               0.0              0.000     False   
4   Feature 5               0.0             -0.000     False   
5   Feature 6               0.0              0.000     False   
6   Feature 7               0.0              0.000     False   
7   Feature 8               0.0              0.000     False   
8   Feature 9               0.0              0.000     False   
9  Feature 10               0.0              0.000     False   

   LASSO_CV_Coefficient  CV_Selected  
0                 1.989         True  
1                -1.496         True  
2                 0.993         True  
3                 0.000        False  
4                -0.005         True  
5                 0.000        False  
6                -0.014         True  
7                 0.000        False  
8                 0.000        False  
9                 0.000        False  

Features selected by LASSO (α=0.1): 3
Features selected by LASSO CV (α=0.0071): 5

The cross-validated model's coefficients for the 3 truly important features (1.989, -1.496, 0.993) are closer to the true values (2.0, -1.5, 1.0) than the manual LASSO's coefficients (1.876, -1.401, 0.893), because the smaller alpha (0.0071 vs. 0.1) applies less shrinkage. The trade-off is that this weaker regularization also lets two irrelevant features (Features 5 and 7) survive with tiny coefficients, so the cross-validated model selects 5 features rather than 3. Both models agree on the 3 important features, which indicates that their signal is strong enough to be detected across a range of alpha values, while the borderline selections are the ones sensitive to the regularization strength.

Step 6: Visualize Coefficient Paths

Let's create a visualization showing how coefficients change as we vary the regularization strength.
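The plotting code itself is not included here; a minimal sketch that traces the coefficient paths over a log-spaced grid of alpha values, reusing X_train, y_train, n_features, and the imports from Step 1, could look like the following (the exact alpha grid and styling of the published figure may differ).

# Fit a Lasso model for each alpha on a log-spaced grid and collect the coefficients
alphas = np.logspace(-3, 1, 100)
coef_paths = np.array(
    [Lasso(alpha=a, max_iter=10000).fit(X_train, y_train).coef_ for a in alphas]
)

# Plot one path per feature
plt.figure(figsize=(8, 5))
for j in range(n_features):
    plt.plot(alphas, coef_paths[:, j], label=f"Feature {j + 1}")
plt.xscale("log")
plt.xlabel("alpha (log scale)")
plt.ylabel("Coefficient value")
plt.title("LASSO coefficient paths")
plt.legend(ncol=2, fontsize=8)
plt.show()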

Out[15]:

The coefficient path plot shows how each feature's coefficient changes as regularization increases. Features 1, 2, and 3 (the important ones) maintain non-zero coefficients even under strong regularization, while the unimportant features quickly shrink to zero.

Step 7: Model Performance Comparison

Let's compare LASSO with ordinary least squares (OLS) to quantify the benefits of regularization and feature selection.

In[27]:
# Fit OLS model
ols = LinearRegression()
ols.fit(X_train, y_train)

# Make predictions
y_pred_ols_train = ols.predict(X_train)
y_pred_ols_test = ols.predict(X_test)

# Calculate performance metrics
ols_train_mse = mean_squared_error(y_train, y_pred_ols_train)
ols_test_mse = mean_squared_error(y_test, y_pred_ols_test)
ols_train_r2 = r2_score(y_train, y_pred_ols_train)
ols_test_r2 = r2_score(y_test, y_pred_ols_test)

# Create comparison table
results_df = pd.DataFrame(
    {
        "Model": ["OLS", "LASSO (α=0.1)", "LASSO CV"],
        "Training MSE": [ols_train_mse, train_mse, cv_train_mse],
        "Test MSE": [ols_test_mse, test_mse, cv_test_mse],
        "Training R²": [ols_train_r2, train_r2, cv_train_r2],
        "Test R²": [ols_test_r2, test_r2, cv_test_r2],
        "Features Used": [n_features, sum(coefficients != 0), sum(lasso_cv.coef_ != 0)],
    }
)
Out[18]:
Model Performance Comparison:
           Model  Training MSE  Test MSE  Training R²  Test R²  Features Used
0            OLS        0.0077    0.0118       0.9988   0.9986             10
1  LASSO (α=0.1)        0.0412    0.0527       0.9937   0.9936              3
2       LASSO CV        0.0081    0.0106       0.9988   0.9987              5

The comparison reveals LASSO's value proposition: it matches OLS's test performance while using far fewer features. OLS uses all 10 features and achieves a test R² of 0.9986, while LASSO CV achieves 0.9987 with only 5 features. The manual LASSO (α=0.1) gives up a small amount of accuracy (test R² of 0.9936) but relies on just the 3 truly relevant features. This demonstrates that even without optimal tuning, LASSO provides substantial benefits through automatic feature selection: simpler models that are easier to interpret and explain, with little or no loss in predictive performance.

Step 8: Practical Implementation with Pipeline

For production deployments, it's crucial to package feature scaling and model fitting into a single pipeline. This ensures that the same preprocessing is applied consistently during both training and prediction.

In[30]:
# Create a complete pipeline
pipeline = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("lasso", LassoCV(cv=5, random_state=42, max_iter=10000)),
    ]
)

# Fit the pipeline
pipeline.fit(X_train, y_train)

# Get the LASSO model from the pipeline
lasso_from_pipeline = pipeline.named_steps["lasso"]

# Make predictions
y_pred_pipeline = pipeline.predict(X_test)

# Calculate final performance
final_mse = mean_squared_error(y_test, y_pred_pipeline)
final_r2 = r2_score(y_test, y_pred_pipeline)
Out[20]:
Final Test MSE: 0.0107
Final Test R²: 0.9987
Selected alpha: 0.0069
Features selected: 6

The pipeline approach provides a production-ready implementation that automatically handles feature standardization before applying LASSO. The results (test R² of 0.9987, selected alpha of 0.0069, 6 features selected) are close to, though not identical to, our earlier cross-validated model, because standardizing the features changes the scale on which the L1 penalty acts and therefore which borderline coefficients survive. This pattern is important for deployment because it encapsulates all transformations in a single object, ensuring that new data will be preprocessed identically to training data. You can save this pipeline using joblib or pickle and deploy it directly to production environments.

Key Parameters

Below are the main parameters that affect how LASSO works and performs.

  • alpha: Regularization parameter that controls the strength of the L1 penalty. Higher values lead to more feature selection and coefficient shrinkage. Default is 1.0.
  • max_iter: Maximum number of iterations for the coordinate descent algorithm. Increase this if convergence warnings appear. Default is 1000.
  • tol: Tolerance for the optimization algorithm. Smaller values may improve precision but increase computation time. Default is 1e-4.
  • fit_intercept: Whether to fit an intercept term. Usually set to True for most applications. Default is True.
  • normalize: Whether to normalize features before fitting. Deprecated (and removed in recent scikit-learn versions) in favor of using StandardScaler in a pipeline.
  • selection: Algorithm to use for optimization. 'cyclic' is the default and works well for most cases. 'random' can be faster for large datasets.

Key Methods

The following are the most commonly used methods for interacting with LASSO models.

  • fit(X, y): Trains the LASSO model on the provided data. X should be the feature matrix and y the target vector.
  • predict(X): Makes predictions on new data using the trained model. Returns predicted target values.
  • score(X, y): Returns the R² score of the model on the given data. Higher values indicate better fit.
  • get_params(): Returns the current parameter values of the model. Useful for inspecting model configuration.
  • set_params(**params): Sets model parameters. Useful for hyperparameter tuning and model configuration.

Practical Implications

LASSO is particularly effective when working with high-dimensional datasets where the number of features approaches or exceeds the number of observations. This situation commonly arises in genomics, where researchers analyze thousands of gene expressions to predict disease outcomes, and in text analysis, where documents are represented by large vocabularies. The automatic feature selection provided by LASSO reduces both computational costs and model complexity while maintaining interpretability, which is a significant advantage when explaining results to non-technical stakeholders.

Medical research and clinical applications benefit significantly from LASSO's feature selection capabilities. When developing diagnostic models or treatment protocols, identifying which biomarkers or clinical indicators truly matter is important for regulatory approval and clinical adoption. LASSO naturally produces sparse models that highlight the most important predictive factors, making it easier for clinicians to understand and trust the model's recommendations. This interpretability is particularly valuable in healthcare, where model transparency can directly impact patient care decisions.

Financial modeling and risk assessment also leverage LASSO's strengths when building predictive models from numerous economic indicators, market signals, or customer attributes. The method can identify which factors genuinely drive outcomes while filtering out redundant or noisy variables. This sparsity simplifies model maintenance and reduces the risk of overfitting to historical patterns that may not persist. Additionally, simpler models are easier to explain to regulators, auditors, and business stakeholders who need to understand the basis for financial decisions or risk assessments.

Best Practices

Use LassoCV for hyperparameter selection rather than manually choosing alpha values. This class efficiently explores a range of regularization strengths through cross-validation and identifies the optimal balance between model complexity and predictive performance. Set cv=5 or cv=10 for reliable estimates, and specify random_state for reproducibility. The automatic alpha selection typically outperforms manual tuning and saves significant experimentation time.

Evaluate LASSO models using both predictive metrics and feature selection quality. While metrics like R² and mean squared error assess predictive accuracy, examining which features were selected provides insight into model interpretability and stability. Check whether selected features remain consistent across different cross-validation folds—unstable feature selection may indicate that multiple features contain similar information or that the signal-to-noise ratio is low. Validate that selected features align with domain knowledge; statistically significant features that lack practical relevance may signal data quality issues or spurious correlations.

When working with correlated features, recognize that LASSO may arbitrarily select one variable from a correlated group while setting others to zero. This behavior can make interpretation challenging and results unstable across different data samples. If maintaining all relevant features is important, consider Elastic Net (which combines L1 and L2 penalties) to retain correlated predictors while still achieving regularization. Accept that LASSO coefficients are biased toward zero due to the penalty—this shrinkage is intentional and helps prevent overfitting, but it means coefficient magnitudes should not be interpreted as precise effect sizes.

Data Requirements and Preprocessing

LASSO requires numerical features and assumes observations are independent and identically distributed. Check for temporal dependencies, spatial clustering, or hierarchical structures in your data that might violate independence assumptions. For time series data, consider using time-based cross-validation splits rather than random splits to avoid data leakage. The target variable should be continuous; for classification problems, use logistic regression with L1 regularization instead.
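For the classification case mentioned above, a minimal sketch of L1-penalized logistic regression might look like the following; the synthetic data and the choice of C are illustrative only (in scikit-learn, C is the inverse of the regularization strength, so smaller C means a stronger penalty).

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(6)
X_clf = rng.standard_normal((200, 10))
# Binary target driven by the first two features only
y_clf = (X_clf[:, 0] - X_clf[:, 1] + 0.5 * rng.standard_normal(200) > 0).astype(int)

clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
clf.fit(X_clf, y_clf)
print("Non-zero coefficients:", np.sum(clf.coef_ != 0))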

Feature scaling is important because the L1 penalty applies equally to all coefficients. Without standardization, features with larger numerical ranges will appear more important simply due to scale. Use StandardScaler to center features at zero with unit variance, which is the standard approach for LASSO. Handle missing values before fitting. Common strategies include imputation with mean/median values, forward/backward filling for time series, or using more sophisticated methods like KNN imputation depending on the missingness pattern.

LASSO performs best when the true relationship is sparse, meaning only a subset of features genuinely influence the target. If you suspect many features are relevant (dense relationships), Ridge regression or Elastic Net may be more appropriate. For high-dimensional problems where the number of features exceeds observations (p > n), LASSO can still be effective but requires careful cross-validation to avoid overfitting. Consider whether the target variable's distribution suggests transformations—for example, log-transforming right-skewed targets can improve model performance and residual behavior.

Common Pitfalls

Running a single LASSO fit with a manually chosen alpha often leads to suboptimal results. The regularization strength significantly impacts both feature selection and predictive performance, and the optimal value varies widely across datasets. Without cross-validation, you may select too few features (underfitting) or too many (overfitting). Use LassoCV to systematically explore different alpha values and select the one that maximizes cross-validated performance.

Interpreting LASSO coefficients as precise effect sizes is problematic because the L1 penalty biases all coefficients toward zero. This shrinkage is intentional—it prevents overfitting—but it means coefficient magnitudes are systematically underestimated compared to their true values. Additionally, when features are highly correlated, LASSO may arbitrarily select one while setting others to zero, even if they contain similar information. This instability can lead to different features being selected across different data samples or cross-validation folds, making interpretation unreliable.

Ignoring domain knowledge when evaluating feature selection can result in models that are statistically sound but practically meaningless. A model that selects features with no plausible causal relationship to the outcome may be capturing spurious correlations that won't generalize to new data. Validate that selected features make sense from a subject-matter perspective. Similarly, applying LASSO when many features are genuinely relevant (dense relationships) often yields poor results because the method is designed for sparse problems. In such cases, Ridge regression or Elastic Net typically perform better by retaining more features with smaller coefficients rather than aggressively eliminating them.

Computational Considerations

LASSO's coordinate descent optimization scales as O(np) per iteration, where n is the number of samples and p is the number of features. For typical datasets with fewer than 100,000 samples and 10,000 features, the algorithm converges quickly—usually within seconds to minutes on modern hardware. Convergence can be slower when features are highly correlated because the algorithm must make many small adjustments to find the optimal coefficient values. If you encounter convergence warnings, increase max_iter from the default 1000 to 5000 or 10000.

Memory requirements scale linearly with dataset size, making LASSO suitable for datasets that fit in RAM. For extremely high-dimensional problems (p > 10,000), sparse matrix representations can significantly reduce memory usage if your feature matrix contains many zeros. The LassoCV implementation uses warm starts when evaluating different alpha values, meaning it initializes each fit with the solution from a nearby alpha value, which substantially speeds up the cross-validation process compared to fitting each alpha independently.

For large datasets, leverage parallel processing by setting n_jobs=-1 in LassoCV to use all available CPU cores. This parallelizes the cross-validation folds, providing near-linear speedup with the number of cores. For datasets exceeding 100,000 samples, consider using stochastic gradient descent variants or sampling strategies, though these may sacrifice some accuracy for speed. Alternatively, if your problem allows it, feature selection through simpler methods (like variance thresholding or correlation analysis) before applying LASSO can reduce dimensionality and improve computational efficiency.
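A short configuration sketch of the parallel setup described above, reusing the training data from the Implementation section (the specific parameter values are illustrative):

from sklearn.linear_model import LassoCV

lasso_parallel = LassoCV(
    cv=5,            # 5-fold cross-validation
    n_jobs=-1,       # parallelize the CV folds across all available CPU cores
    max_iter=10000,  # raise the iteration cap to avoid convergence warnings
    random_state=42,
)
lasso_parallel.fit(X_train, y_train)
print("Selected alpha:", lasso_parallel.alpha_)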

Performance and Deployment Considerations

Assess LASSO performance using both predictive accuracy and feature selection quality. Standard regression metrics (R², mean squared error, mean absolute error) evaluate predictive performance, but also examine which features were selected and whether they remain stable across cross-validation folds. Unstable feature selection—where different folds select different features—suggests that multiple features contain similar information or that the signal-to-noise ratio is low. Good LASSO results show clear separation between selected and eliminated features, with selected features making sense from a domain perspective.

When evaluating feature selection, check whether the number of selected features aligns with expectations. Selecting too many features (approaching the total available) may indicate that alpha is too small, providing insufficient regularization. Selecting too few features (perhaps just one or two) may indicate excessive regularization that eliminates genuinely useful predictors. The optimal model typically selects a meaningful subset—enough to capture important relationships but few enough to maintain interpretability and prevent overfitting.

LASSO models are lightweight and fast for inference, making them suitable for real-time applications. Prediction requires only a dot product between the feature vector and the coefficient vector, which executes in microseconds. The sparse nature of LASSO models (many zero coefficients) means they require less memory than dense models and can be optimized further by storing only non-zero coefficients. However, monitor both predictive performance and feature stability in production. If the data distribution shifts, the optimal features may change, requiring model retraining. Implement automated monitoring that tracks prediction errors and alerts when performance degrades beyond acceptable thresholds. For applications requiring model transparency—such as healthcare, finance, or regulatory compliance—LASSO's interpretability is particularly valuable because stakeholders can understand exactly which factors drive predictions.
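As a small illustration of this inference pattern, the sketch below restricts the dot product to the non-zero coefficients of the fitted lasso_cv model from Step 4 and checks that it reproduces predict exactly; storing coefficients this way is an optional optimization, not a requirement.

import numpy as np

nonzero_idx = np.flatnonzero(lasso_cv.coef_)   # indices of the selected features
sparse_coef = lasso_cv.coef_[nonzero_idx]

def predict_sparse(X_new):
    # Dot product restricted to the selected features, plus the intercept
    return X_new[:, nonzero_idx] @ sparse_coef + lasso_cv.intercept_

print(np.allclose(predict_sparse(X_test), lasso_cv.predict(X_test)))   # True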

Summary

LASSO (L1) regularization is a technique for building simpler, more interpretable regression models by encouraging sparsity: shrinking some coefficients exactly to zero and thus performing feature selection. The core idea is captured in the summation form of the LASSO objective, which adds a penalty proportional to the sum of the absolute values of the coefficients, controlled by the regularization parameter $\lambda$.

To make this precise and computationally tractable, we use the matrix form of the objective, as shown above:

$$\min_{\beta} \left\{ \frac{1}{2n} \|y - X\beta\|_2^2 + \alpha \|\beta\|_1 \right\}$$

where:

  • $\beta$: $(p+1) \times 1$ vector of coefficients to be optimized (including the intercept $\beta_0$)
  • $n$: number of observations in the dataset
  • $y$: $n \times 1$ target vector containing the actual values
  • $X$: $n \times (p+1)$ feature matrix (includes a column of ones for the intercept)
  • $\alpha$: regularization parameter in scikit-learn (controls penalty strength)
  • $\|y - X\beta\|_2^2$: squared L2 (Euclidean) norm of residuals, equal to $\sum_{i=1}^n (y_i - \hat{y}_i)^2$
  • $\|\beta\|_1$: L1 (Manhattan) norm of coefficients, equal to $\sum_{j=1}^p |\beta_j|$ (intercept not penalized)

This formulation enables efficient computation and connects LASSO to broader optimization theory.

However, unlike ordinary least squares, LASSO does not have a closed-form solution because the L1 penalty makes the problem non-differentiable at zero. This is what allows LASSO to set coefficients exactly to zero, a property that distinguishes it from Ridge (L2) regularization (which we will cover in the next section), which typically only shrinks coefficients but rarely eliminates them.

To actually solve the LASSO problem, we need iterative optimization algorithms. One such method is the Iterative Shrinkage-Thresholding Algorithm (ISTA), described above. ISTA separates the smooth (least squares) and non-smooth (L1 penalty) parts of the objective, alternating between a gradient step and a soft-thresholding step. This practical algorithm is how we compute the LASSO solution in practice.

In summary: the summation form tells us the goal, the matrix form gives us a precise and efficient way to express it, and ISTA (or similar algorithms) provides the means to actually find the solution. Understanding the distinction and relationship between these is key: the forms define what LASSO is optimizing, while ISTA is how we solve it. Choosing the right value of $\lambda$ or $\alpha$ (often via cross-validation) is important for balancing model simplicity and predictive performance.
