Elastic Net Regularization: Complete Guide with Mathematical Foundations & Python Implementation

Michael Brenndoerfer · June 12, 2025 · 52 min read

A comprehensive guide covering Elastic Net regularization, including mathematical foundations, geometric interpretation, and practical implementation. Learn how to combine L1 and L2 regularization for optimal feature selection and model stability.


Elastic Net Regularization

Elastic Net regularization is a technique used in multiple linear regression (MLR) that combines the strengths of both LASSO (L1) and Ridge (L2) regularization. In standard MLR, the model tries to minimize the sum of squared errors between the predicted and actual values. However, when the model is too complex or when there are many correlated features, it can fit the training data too closely, capturing noise rather than the underlying pattern.

Elastic Net addresses this by adding a penalty that combines both L1 and L2 terms, effectively constraining the model while providing the benefits of both regularization approaches. This encourages simpler models that generalize better to new data. We've covered LASSO (L1) and Ridge (L2) in the previous sections—Elastic Net combines both approaches.

Elastic Net, or Elastic Net Regularization, combines L1 and L2 regularization by adding penalties proportional to both the L1 norm (sum of absolute values) and L2 norm (sum of squares) of the coefficients.

In simple terms, Elastic Net helps a regression model avoid overfitting by keeping coefficients small and can drive some exactly to zero (like LASSO) while also handling correlated features well (like Ridge). This makes it particularly useful when we have many features, some of which may be correlated, and we want both feature selection and stable coefficient estimates.

Advantages

Elastic Net offers several key advantages by combining the best of both LASSO and Ridge regularization. First, it can perform automatic feature selection like LASSO by driving some coefficients exactly to zero, making the model more interpretable. Unlike LASSO, however, Elastic Net handles groups of correlated features more effectively: where LASSO might arbitrarily select one feature from a correlated group, Elastic Net tends to include the entire group. This makes it more stable and reliable for interpretation when dealing with multicollinear features. The method also tends to perform better than LASSO when the number of features exceeds the number of observations, a setting where LASSO can select at most n features. Finally, the combination of L1 and L2 penalties provides a good balance between sparsity (from L1) and stability (from L2), making it a robust choice for many real-world applications.

Disadvantages

Despite its advantages, Elastic Net has some limitations. The method requires tuning two hyperparameters (the L1 and L2 regularization strengths), which can be more complex than tuning a single parameter in LASSO or Ridge. This increases the computational cost of hyperparameter selection and requires more careful cross-validation. Additionally, while Elastic Net can handle correlated features better than LASSO, it may still include more features than necessary in some cases, potentially reducing interpretability compared to pure LASSO. The method also inherits some limitations from both parent methods: it still requires feature scaling like Ridge, and the optimization is more complex than Ridge's closed-form solution, though it's more tractable than pure LASSO in some cases.

Formula

Let's build up the Elastic Net objective function step by step, starting from the most intuitive form and explaining each mathematical transformation along the way.

Starting with the Basic Regression Problem

We begin with the standard multiple linear regression problem. Our goal is to find coefficients $\beta_0, \beta_1, \beta_2, \ldots, \beta_p$ that minimize the sum of squared errors between our predictions and the actual values:

$$\text{SSE} = \sum_{i=1}^n (y_i - \hat{y}_i)^2$$

where:

  • $y_i$ is the actual target value for observation $i$ (where $i = 1, 2, \ldots, n$)
  • $\hat{y}_i$ is the predicted value for observation $i$
  • $n$ is the number of observations in the dataset

Here, $\hat{y}_i$ represents our predicted value for observation $i$, which we calculate as:

$$\hat{y}_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}$$

where:

  • $\beta_0$ is the intercept term (constant offset)
  • $\beta_j$ is the coefficient (weight) for feature $j$ (where $j = 1, 2, \ldots, p$)
  • $x_{ij}$ is the value of feature $j$ for observation $i$
  • $p$ is the number of features (predictors) in the model

Adding Regularization Penalties

Now, instead of just minimizing the sum of squared errors, we want to add penalty terms that will help us control the complexity of our model. Elastic Net adds two types of penalties:

  1. L1 Penalty (LASSO component): $\lambda_1 \sum_{j=1}^p |\beta_j|$
  2. L2 Penalty (Ridge component): $\lambda_2 \sum_{j=1}^p \beta_j^2$

where:

  • $\lambda_1 > 0$ is the L1 regularization parameter (controls the strength of the L1 penalty)
  • $\lambda_2 > 0$ is the L2 regularization parameter (controls the strength of the L2 penalty)
  • $|\beta_j|$ is the absolute value of coefficient $\beta_j$
  • $\beta_j^2$ is the squared value of coefficient $\beta_j$

Let's understand why we use these specific penalty forms:

Why does the L1 penalty use absolute values?

The absolute value function $|\beta_j|$ has a special property: it is not differentiable at zero. This creates a "corner" in the optimization landscape that can drive coefficients exactly to zero, effectively performing automatic feature selection. When we take the derivative of $|\beta_j|$, we get:

  • $\frac{d}{d\beta_j}|\beta_j| = 1$ when $\beta_j > 0$
  • $\frac{d}{d\beta_j}|\beta_j| = -1$ when $\beta_j < 0$
  • $\frac{d}{d\beta_j}|\beta_j|$ is undefined at $\beta_j = 0$

This kink at zero is what allows LASSO to set coefficients exactly to zero.

Why does the L2 penalty use squared terms?

The squared penalty $\beta_j^2$ has a smooth, continuous derivative everywhere: $\frac{d}{d\beta_j}\beta_j^2 = 2\beta_j$. This smoothness helps with correlated features by encouraging similar coefficients for similar features, creating a "grouping effect." The penalty grows quadratically, so larger coefficients are penalized more heavily than smaller ones.

The Complete Elastic Net Objective Function

Combining all three components, our Elastic Net objective function becomes:

$$\min_{\boldsymbol{\beta}} \left\{ \sum_{i=1}^n (y_i - \hat{y}_i)^2 + \lambda_1 \sum_{j=1}^p |\beta_j| + \lambda_2 \sum_{j=1}^p \beta_j^2 \right\}$$

where:

  • $\boldsymbol{\beta} = [\beta_0, \beta_1, \ldots, \beta_p]^T$ is the vector of coefficients we are optimizing
  • $\sum_{i=1}^n (y_i - \hat{y}_i)^2$ is the sum of squared errors (data fit term)
  • $\lambda_1 \sum_{j=1}^p |\beta_j|$ is the L1 penalty term (encourages sparsity)
  • $\lambda_2 \sum_{j=1}^p \beta_j^2$ is the L2 penalty term (encourages small coefficients)

Let's break down what each part does:

  • Data Fit Term $\sum_{i=1}^n (y_i - \hat{y}_i)^2$: Ensures our predictions are close to the actual values
  • L1 Penalty $\lambda_1 \sum_{j=1}^p |\beta_j|$: Encourages sparsity by driving some coefficients to exactly zero
  • L2 Penalty $\lambda_2 \sum_{j=1}^p \beta_j^2$: Encourages small coefficients and handles correlated features well

The parameters $\lambda_1$ and $\lambda_2$ control the strength of each penalty:

  • When $\lambda_1 = 0$ and $\lambda_2 > 0$: We get pure Ridge regression
  • When $\lambda_1 > 0$ and $\lambda_2 = 0$: We get pure LASSO regression
  • When $\lambda_1 > 0$ and $\lambda_2 > 0$: We get Elastic Net

Understanding the Norm Notation

We can write the penalty terms more compactly using norm notation:

$$\lambda_1 \sum_{j=1}^p |\beta_j| + \lambda_2 \sum_{j=1}^p \beta_j^2 = \lambda_1 \|\boldsymbol{\beta}\|_1 + \lambda_2 \|\boldsymbol{\beta}\|_2^2$$

where:

  • $\|\boldsymbol{\beta}\|_1 = \sum_{j=1}^p |\beta_j|$ is the L1 norm (Manhattan distance)
  • $\|\boldsymbol{\beta}\|_2 = \sqrt{\sum_{j=1}^p \beta_j^2}$ is the L2 norm (Euclidean distance)
  • $\|\boldsymbol{\beta}\|_2^2 = \sum_{j=1}^p \beta_j^2$ is the squared L2 norm

The L1 norm measures the sum of absolute values, while the L2 norm measures the square root of the sum of squares. In our formulation, we use the squared L2 norm ($\|\boldsymbol{\beta}\|_2^2$) for mathematical convenience.
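To make these pieces concrete, here is a small numerical sketch that evaluates the Elastic Net objective for a candidate coefficient vector using NumPy's norm functions. The data, coefficients, and penalty strengths below are made-up values for illustration only.

import numpy as np

# Toy data and a candidate coefficient vector (illustrative values only)
X = np.array([[1.0, 2.0, 0.5],
              [2.0, 1.0, 1.5],
              [3.0, 4.0, 2.5],
              [4.0, 3.0, 3.5]])
y = np.array([3.0, 5.0, 9.0, 11.0])
beta = np.array([1.0, 0.5, 0.8])   # no intercept, for simplicity
lam1, lam2 = 0.1, 0.05             # L1 and L2 penalty strengths

sse = np.sum((y - X @ beta) ** 2)               # data fit term
l1 = lam1 * np.linalg.norm(beta, ord=1)         # lambda_1 * ||beta||_1
l2 = lam2 * np.linalg.norm(beta, ord=2) ** 2    # lambda_2 * ||beta||_2^2
objective = sse + l1 + l2
print(f"SSE = {sse:.3f}, L1 = {l1:.3f}, L2 = {l2:.3f}, objective = {objective:.3f}")

Setting lam1 = 0 recovers the Ridge objective and setting lam2 = 0 recovers the LASSO objective, mirroring the special cases listed above.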

Why This Combination Works

Elastic Net's effectiveness comes from how these two penalties complement each other:

  1. The L1 penalty provides sparsity (feature selection) but can be unstable with correlated features
  2. The L2 penalty provides stability with correlated features but doesn't perform feature selection
  3. Together, they provide both sparsity and stability

This is particularly valuable when we have many features, some of which are correlated, and we want both interpretability (through feature selection) and robustness (through stable coefficient estimates).

Geometric Interpretation of Regularization

To understand why Elastic Net combines the best of both worlds, let's visualize the constraint regions geometrically. In regularized regression, we can think of the problem as minimizing the sum of squared errors subject to a constraint on the coefficients.

Out[2]:
Visualization
Geometric interpretation of regularization constraints in 2D coefficient space. The red contours represent the loss function (sum of squared errors), with the optimal OLS solution at the center. The blue regions show the constraint boundaries: L1 (diamond), L2 (circle), and Elastic Net (rounded diamond). The optimal regularized solution occurs where the loss contours first touch the constraint region. Notice how the L1 constraint has sharp corners that encourage sparsity (coefficients exactly zero on the axes), while the L2 constraint is smooth. Elastic Net combines both, creating a rounded diamond that can achieve sparsity at the corners while providing stability along the edges.

This geometric view reveals several key insights:

  1. L1 Constraint (Diamond): The sharp corners at the axes mean that the loss contours are likely to first touch the constraint region at a corner, resulting in sparse solutions where some coefficients are exactly zero.

  2. L2 Constraint (Circle): The smooth, round shape means the loss contours will touch the constraint region at a point where all coefficients are typically non-zero but small. This provides stability but no sparsity.

  3. Elastic Net Constraint (Rounded Diamond): By combining both penalties, we get a shape that has both corners (for sparsity) and smooth edges (for stability). This allows the model to achieve sparsity when needed while maintaining the grouping effect for correlated features.

The red contours represent levels of the loss function (sum of squared errors). The optimal solution for each method occurs where these contours first touch the respective constraint region. This geometric interpretation makes it clear why Elastic Net can achieve both feature selection and stability—it literally combines the geometric properties of both L1 and L2 constraints.

Matrix Notation

Now let's translate our objective function into matrix notation, which is more compact and reveals the mathematical structure more clearly.

First, let's define our matrices:

  • $\mathbf{y}$ is an $n \times 1$ vector containing all target values: $\mathbf{y} = [y_1, y_2, \ldots, y_n]^T$
  • $\mathbf{X}$ is an $n \times (p+1)$ design matrix where each row represents one observation and each column represents one feature (including the intercept column of ones)
  • $\boldsymbol{\beta}$ is a $(p+1) \times 1$ vector of coefficients: $\boldsymbol{\beta} = [\beta_0, \beta_1, \beta_2, \ldots, \beta_p]^T$

where:

  • $n$ is the number of observations
  • $p$ is the number of features (excluding the intercept)
  • The superscript $T$ denotes the transpose operation

The predicted values can now be written as:

$$\hat{\mathbf{y}} = \mathbf{X}\boldsymbol{\beta}$$

where $\hat{\mathbf{y}} = [\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_n]^T$ is the $n \times 1$ vector of predicted values.

This is much more compact than writing out each prediction individually. The sum of squared errors becomes:

$$\sum_{i=1}^n (y_i - \hat{y}_i)^2 = \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2$$

where $\|\cdot\|_2$ represents the L2 norm (Euclidean norm), which for a vector $\mathbf{v} = [v_1, v_2, \ldots, v_n]^T$ is defined as:

$$\|\mathbf{v}\|_2 = \sqrt{\sum_{i=1}^n v_i^2}$$

Therefore, $\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 = \sum_{i=1}^n (y_i - \hat{y}_i)^2$ is the squared Euclidean distance between the actual values and our predictions.

Similarly, the penalty terms become:

  • L1 penalty: $\lambda_1 \|\boldsymbol{\beta}\|_1 = \lambda_1 \sum_{j=1}^p |\beta_j|$
  • L2 penalty: $\lambda_2 \|\boldsymbol{\beta}\|_2^2 = \lambda_2 \sum_{j=1}^p \beta_j^2$

where, as in the earlier formulas, the summation runs over the feature coefficients $j = 1, \ldots, p$; the intercept $\beta_0$ is conventionally left unpenalized.

Putting it all together, our Elastic Net objective function in matrix notation is:

$$\min_{\boldsymbol{\beta}} \left\{ \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda_1 \|\boldsymbol{\beta}\|_1 + \lambda_2 \|\boldsymbol{\beta}\|_2^2 \right\}$$

where:

  • $\min_{\boldsymbol{\beta}}$ indicates we are finding the coefficient vector $\boldsymbol{\beta}$ that minimizes the objective function
  • The objective function is the sum of three terms: data fit (SSE), L1 penalty, and L2 penalty

This notation makes it clear that we're minimizing a function of the coefficient vector $\boldsymbol{\beta}$, and it's much more convenient for mathematical analysis and computational implementation.

Mathematical Properties

Understanding the mathematical properties of Elastic Net helps us predict how it will behave in different situations:

Sparsity and Grouping Effect: The L1 term encourages sparsity by driving some coefficients exactly to zero, while the L2 term encourages grouping of correlated features. This means that when features are highly correlated, Elastic Net tends to include the entire group rather than arbitrarily selecting one feature from the group (as LASSO might do).

Bias-Variance Tradeoff: As the regularization parameters $\lambda_1$ and $\lambda_2$ increase, the model becomes more biased (systematically wrong) but has lower variance (less sensitive to small changes in the data). This is the fundamental tradeoff in regularization.

Stability: The L2 component helps stabilize the solution when features are correlated, making the coefficient estimates more reliable and less sensitive to small changes in the data.

No Closed-Form Solution: Unlike Ridge regression, which has a closed-form solution, Elastic Net requires iterative optimization due to the non-differentiable L1 penalty. This makes it computationally more expensive but still more tractable than pure LASSO in many cases.

Convexity: The Elastic Net objective function is convex, which means it has a unique global minimum. This guarantees that our optimization algorithm will find the best solution.

Regularization Paths: Visualizing Coefficient Evolution

One of the most insightful ways to understand how Elastic Net behaves is to examine the regularization path—how coefficients change as we vary the regularization strength. This visualization reveals the key difference between LASSO and Elastic Net when dealing with correlated features.

Out[3]:
Visualization
LASSO regularization path (left): how coefficients evolve as regularization strength (α) increases. Correlated features 0 and 1 (red and blue lines) behave independently—LASSO arbitrarily selects feature 0 and drives feature 1 to zero, demonstrating the instability with correlated features. As α increases, coefficients are driven to exactly zero at different points, with only the most important features surviving at high regularization.
Elastic Net regularization path (right): the coefficients of the correlated features 0 and 1 stay close together across the entire path, illustrating the grouping effect: both features shrink and exit the model together rather than one being arbitrarily eliminated.

These regularization paths reveal the fundamental difference between LASSO and Elastic Net:

LASSO Path (Left): The red and blue lines (features 0 and 1, which are highly correlated) follow different trajectories. LASSO arbitrarily selects feature 0 and drives feature 1 to zero early in the path. This instability makes it difficult to interpret which feature is more important when features are correlated.

Elastic Net Path (Right): The red and blue lines stay close together throughout the regularization path, demonstrating the grouping effect. Both correlated features enter and exit the model together, providing more stable and interpretable results. This behavior is particularly valuable when you have domain knowledge that certain features should be considered together.

The regularization path also shows how to select the optimal regularization strength: you would typically use cross-validation to find the α value that minimizes prediction error on held-out data, then read off the corresponding coefficients from these paths.

Alternative Parameterization

In practice, Elastic Net is often parameterized using a mixing parameter $\alpha$ and a total regularization strength $\lambda$. This alternative form makes it easier to understand the relationship between different regularization methods:

$$\min_{\boldsymbol{\beta}} \left\{ \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda \left( \alpha \|\boldsymbol{\beta}\|_1 + \frac{1-\alpha}{2} \|\boldsymbol{\beta}\|_2^2 \right) \right\}$$

where:

  • $\alpha \in [0,1]$ is the mixing parameter (controls the balance between L1 and L2 penalties)
  • $\lambda > 0$ is the total regularization strength (controls how much regularization we apply overall)

The relationship between the two parameterizations is:

$$\begin{aligned} \lambda_1 &= \lambda \alpha \\ \lambda_2 &= \lambda \, \frac{1-\alpha}{2} \end{aligned}$$

where:

  • $\lambda_1$ is the L1 penalty strength from the original formulation
  • $\lambda_2$ is the L2 penalty strength from the original formulation

This parameterization makes it much easier to understand the behavior:

  • When $\alpha = 1$: We get pure LASSO (L1 only) because the L2 term becomes zero
  • When $\alpha = 0$: We get pure Ridge (L2 only) because the L1 term becomes zero
  • When $0 < \alpha < 1$: We get Elastic Net with a balance between L1 and L2 penalties

The factor of $\frac{1}{2}$ in the L2 term is included for mathematical convenience: it makes the derivatives cleaner and is a common convention in the machine learning literature.
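The mapping between the two parameterizations is simple enough to express in a couple of helper functions. The sketch below is only illustrative; the function names are made up for this example:

def to_lambda1_lambda2(lam, alpha):
    """Convert (lambda, alpha) from the mixed parameterization to (lambda_1, lambda_2)."""
    return lam * alpha, lam * (1 - alpha) / 2

def to_lambda_alpha(lam1, lam2):
    """Recover (lambda, alpha) from (lambda_1, lambda_2); note lambda = lambda_1 + 2 * lambda_2."""
    lam = lam1 + 2 * lam2
    return lam, (lam1 / lam if lam > 0 else 0.0)

print(to_lambda1_lambda2(1.0, 0.5))  # -> (0.5, 0.25)
print(to_lambda_alpha(0.5, 0.25))    # -> (1.0, 0.5)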

Scikit-learn Implementation

Scikit-learn uses a slightly different parameterization for computational efficiency:

$$\min_{\boldsymbol{\beta}} \left\{ \frac{1}{2n} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \alpha \left( \rho \|\boldsymbol{\beta}\|_1 + \frac{1-\rho}{2} \|\boldsymbol{\beta}\|_2^2 \right) \right\}$$

where:

  • $n$ is the number of samples (observations)
  • $\alpha > 0$ is the total regularization strength (equivalent to $\lambda$ in the alternative parameterization)
  • $\rho \in [0,1]$ is the l1_ratio parameter (equivalent to $\alpha$ in the alternative parameterization)
  • The factor $\frac{1}{2n}$ normalizes the data fit term by the sample size

In scikit-learn:

  • alpha parameter controls the overall regularization strength ($\alpha$ in the formula above)
  • l1_ratio parameter controls the mixing between L1 and L2 penalties ($\rho$ in the formula above)
  • When l1_ratio = 1 ($\rho = 1$): Pure LASSO
  • When l1_ratio = 0 ($\rho = 0$): Pure Ridge
  • When 0 < l1_ratio < 1 ($0 < \rho < 1$): Elastic Net
Note on Parameterization

Be careful with notation: scikit-learn's alpha parameter corresponds to $\lambda$ in the mathematical literature, and scikit-learn's l1_ratio corresponds to the mixing parameter often denoted $\alpha$ in textbooks. Check the documentation for the specific implementation you're using to avoid confusion.
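As a quick illustration of how these parameters appear in code, the sketch below fits an ElasticNet on synthetic data. The alpha and l1_ratio values are arbitrary choices for demonstration, not recommendations.

import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([2.0, 0.0, -1.0, 0.0, 0.5]) + rng.normal(scale=0.5, size=100)

# alpha    -> overall regularization strength (lambda in the formulas above)
# l1_ratio -> mixing parameter rho: 1.0 behaves like LASSO, values near 0 behave like Ridge
model = ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=5000)
model.fit(X, y)
print(model.coef_)  # some coefficients may be driven exactly to zero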

Mathematical Properties

  • Sparsity and Grouping: The L1 term encourages sparsity (some coefficients exactly zero), while the L2 term encourages grouping of correlated features
  • Bias-Variance Tradeoff: As $\lambda$ increases, bias increases but variance decreases
  • Stability: The L2 term helps stabilize the solution when features are correlated
  • No Closed-Form Solution: Like LASSO, Elastic Net requires iterative optimization due to the L1 penalty

Visualizing Elastic Net

Let's visualize how Elastic Net behaves by comparing it with LASSO and Ridge across different parameter values.

Out[4]:
Visualization
A 3×3 grid of coefficient bar charts. Top row: Ridge regression with λ = 0.01, 0.1, and 1.0, showing progressively stronger shrinkage with all coefficients remaining non-zero. Middle row: LASSO with α = 0.01, 0.1, and 1.0, showing progressively more aggressive feature selection. Bottom row: Elastic Net with l1_ratio = 0.1, 0.5, and 0.9, moving from mostly Ridge-like to mostly LASSO-like behavior. In each panel, compare correlated features 0 and 1 to see whether they are kept together or split apart.

Row 1 - Ridge Regression (λ = 0.01, 0.1, 1.0): As the regularization parameter λ increases, all coefficients shrink toward zero but remain non-zero. Notice how correlated features 0 and 1 maintain similar coefficient values across all λ values, demonstrating Ridge's ability to handle correlated features by keeping them together in the model. The coefficients become progressively smaller as λ increases, but no feature is ever eliminated.

Row 2 - LASSO (α = 0.01, 0.1, 1.0): As the regularization parameter α increases, LASSO performs automatic feature selection by driving some coefficients exactly to zero. Notice how LASSO might arbitrarily select one feature from the correlated pair (features 0 and 1) when α becomes large, demonstrating the instability that can occur with highly correlated features. The sparsity increases dramatically as α increases, with only the most important features remaining.

Row 3 - Elastic Net (α = 0.1, 0.5, 0.9): This row shows the mixing parameter α (l1_ratio) controlling the balance between L1 and L2 penalties. When α = 0.1, the model behaves more like Ridge, keeping most features with small coefficients. As α increases to 0.5 and 0.9, the L1 component becomes stronger, leading to more sparsity while still maintaining some stability for correlated features. Notice how features 0 and 1 tend to be kept together more consistently than in pure LASSO.

Out[5]:
Visualization
Side-by-side comparison of coefficient estimates across Ridge, LASSO, and Elastic Net methods with moderate regularization. Ridge (red) keeps all features with moderate coefficients, providing stability but no feature selection. LASSO (blue) performs aggressive feature selection, setting some coefficients exactly to zero, but may arbitrarily choose between correlated features. Elastic Net (green) provides a balanced approach, maintaining some sparsity while preserving the grouping effect for correlated features.

The side-by-side comparison reveals the fundamental trade-offs between these methods. Ridge (red bars) keeps all features with moderate coefficients, providing stability but no feature selection. LASSO (blue bars) performs aggressive feature selection, setting some coefficients exactly to zero, but may arbitrarily choose between correlated features. Elastic Net (green bars) provides a balanced approach, maintaining some sparsity while preserving the grouping effect for correlated features. This makes Elastic Net particularly valuable when you have correlated features and want both interpretability and stability.

Example

Let's work through a detailed mathematical example to understand how Elastic Net works step by step. We'll use a small dataset with correlated features to demonstrate the grouping effect and show the complete calculation process.

Setting Up the Problem

Given the following data:

  • $n = 4$ observations
  • $p = 3$ features (excluding intercept)
  • $\lambda_1 = 0.1$ (L1 regularization parameter)
  • $\lambda_2 = 0.05$ (L2 regularization parameter)

Our feature matrix and target vector are:

$$\mathbf{X} = \begin{bmatrix} 1 & 2 & 1.1 \\ 2 & 3 & 2.1 \\ 3 & 1 & 3.1 \\ 4 & 2 & 4.1 \end{bmatrix}, \qquad \mathbf{y} = \begin{bmatrix} 5 \\ 8 \\ 6 \\ 10 \end{bmatrix}$$

Notice that features 1 and 3 are highly correlated: feature 3 is exactly feature 1 plus 0.1, so after centering the two columns become identical. This correlation is crucial because it will demonstrate how Elastic Net's grouping effect differs from LASSO's behavior.

Step 1: Standardize the Features

First, we standardize the features. This is important for regularized regression because it ensures that all features are on the same scale, so the regularization penalty is applied fairly. Without standardization, features with larger scales would be penalized more heavily, distorting our results.

For each feature $j$, we calculate:

Mean:

$$\bar{x}_j = \frac{1}{n}\sum_{i=1}^n x_{ij}$$

Standard Deviation (population form, dividing by $n$):

$$s_j = \sqrt{\frac{1}{n}\sum_{i=1}^n (x_{ij} - \bar{x}_j)^2}$$

Standardized Value:

$$z_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j}$$

where:

  • $\bar{x}_j$ is the mean of feature $j$
  • $s_j$ is the standard deviation of feature $j$ (dividing by $n$, the population form, which is also what scikit-learn's StandardScaler uses)
  • $z_{ij}$ is the standardized value of feature $j$ for observation $i$

Let's calculate these step by step:

Feature 1: $x_1 = [1, 2, 3, 4]$

  • Mean: $\bar{x}_1 = \frac{1+2+3+4}{4} = 2.5$
  • Variance: $\frac{(1-2.5)^2 + (2-2.5)^2 + (3-2.5)^2 + (4-2.5)^2}{4} = \frac{2.25 + 0.25 + 0.25 + 2.25}{4} = 1.25$
  • Standard deviation: $s_1 = \sqrt{1.25} \approx 1.118$

Feature 2: $x_2 = [2, 3, 1, 2]$

  • Mean: $\bar{x}_2 = \frac{2+3+1+2}{4} = 2.0$
  • Variance: $\frac{(2-2)^2 + (3-2)^2 + (1-2)^2 + (2-2)^2}{4} = \frac{0 + 1 + 1 + 0}{4} = 0.5$
  • Standard deviation: $s_2 = \sqrt{0.5} \approx 0.707$

Feature 3: $x_3 = [1.1, 2.1, 3.1, 4.1]$

  • Mean: $\bar{x}_3 = \frac{1.1+2.1+3.1+4.1}{4} = 2.6$
  • Variance: $\frac{(1.1-2.6)^2 + (2.1-2.6)^2 + (3.1-2.6)^2 + (4.1-2.6)^2}{4} = \frac{2.25 + 0.25 + 0.25 + 2.25}{4} = 1.25$
  • Standard deviation: $s_3 = \sqrt{1.25} \approx 1.118$

Now we standardize each observation:

$$\mathbf{X}_{std} = \begin{bmatrix} \frac{1-2.5}{1.118} & \frac{2-2.0}{0.707} & \frac{1.1-2.6}{1.118} \\ \frac{2-2.5}{1.118} & \frac{3-2.0}{0.707} & \frac{2.1-2.6}{1.118} \\ \frac{3-2.5}{1.118} & \frac{1-2.0}{0.707} & \frac{3.1-2.6}{1.118} \\ \frac{4-2.5}{1.118} & \frac{2-2.0}{0.707} & \frac{4.1-2.6}{1.118} \end{bmatrix} \approx \begin{bmatrix} -1.34 & 0.00 & -1.34 \\ -0.45 & 1.41 & -0.45 \\ 0.45 & -1.41 & 0.45 \\ 1.34 & 0.00 & 1.34 \end{bmatrix}$$

Each column now has mean 0 and unit variance, so the squared entries of each column sum to $n = 4$, a value that will reappear on the diagonal of the matrix in the next step.

Step 2: Calculate the Covariance Matrix

Next, we compute $\mathbf{X}_{std}^T\mathbf{X}_{std}$, which captures the relationships between features after standardization.

For standardized features, this matrix is proportional to the correlation matrix and reveals how features are related:

$$\mathbf{X}_{std}^T\mathbf{X}_{std} = \begin{bmatrix} -1.34 & -0.45 & 0.45 & 1.34 \\ 0.00 & 1.41 & -1.41 & 0.00 \\ -1.34 & -0.45 & 0.45 & 1.34 \end{bmatrix} \begin{bmatrix} -1.34 & 0.00 & -1.34 \\ -0.45 & 1.41 & -0.45 \\ 0.45 & -1.41 & 0.45 \\ 1.34 & 0.00 & 1.34 \end{bmatrix}$$

Let's calculate each element:

  • $(1,1)$: $(-1.34)^2 + (-0.45)^2 + (0.45)^2 + (1.34)^2 = 1.80 + 0.20 + 0.20 + 1.80 = 4.0$
  • $(1,2)$: $(-1.34)(0.00) + (-0.45)(1.41) + (0.45)(-1.41) + (1.34)(0.00) = 0 - 0.63 - 0.63 + 0 = -1.26$
  • $(1,3)$: $(-1.34)(-1.34) + (-0.45)(-0.45) + (0.45)(0.45) + (1.34)(1.34) = 1.80 + 0.20 + 0.20 + 1.80 = 4.0$
  • $(2,2)$: $(0.00)^2 + (1.41)^2 + (-1.41)^2 + (0.00)^2 = 0 + 2.0 + 2.0 + 0 = 4.0$

Continuing this process:

$$\mathbf{X}_{std}^T\mathbf{X}_{std} = \begin{bmatrix} 4.0 & -1.26 & 4.0 \\ -1.26 & 4.0 & -1.26 \\ 4.0 & -1.26 & 4.0 \end{bmatrix}$$

Notice that the $(1,3)$ entry equals 4.0, the same as the diagonal entries: features 1 and 3 are perfectly correlated after standardization. Feature 2 has a negative correlation with both features 1 and 3.

Step 3: Add L2 Regularization

We add the L2 penalty by adding $\lambda_2$ times the identity matrix:

$$\mathbf{X}_{std}^T\mathbf{X}_{std} + \lambda_2 \mathbf{I} = \begin{bmatrix} 4.0 & -1.26 & 4.0 \\ -1.26 & 4.0 & -1.26 \\ 4.0 & -1.26 & 4.0 \end{bmatrix} + 0.05 \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} = \begin{bmatrix} 4.05 & -1.26 & 4.0 \\ -1.26 & 4.05 & -1.26 \\ 4.0 & -1.26 & 4.05 \end{bmatrix}$$

where:

  • $\mathbf{I}$ is the $3 \times 3$ identity matrix
  • $\lambda_2 = 0.05$ is the L2 regularization parameter
  • Adding $\lambda_2 \mathbf{I}$ to the diagonal stabilizes the matrix and applies the Ridge penalty

Step 4: Calculate $\mathbf{X}_{std}^T\mathbf{y}$

We compute how each feature correlates with the target:

$$\mathbf{X}_{std}^T\mathbf{y} = \begin{bmatrix} -1.34 & -0.45 & 0.45 & 1.34 \\ 0.00 & 1.41 & -1.41 & 0.00 \\ -1.34 & -0.45 & 0.45 & 1.34 \end{bmatrix} \begin{bmatrix} 5 \\ 8 \\ 6 \\ 10 \end{bmatrix}$$

Calculating each element:

  • Element 1: $(-1.34)(5) + (-0.45)(8) + (0.45)(6) + (1.34)(10) = -6.7 - 3.6 + 2.7 + 13.4 = 5.8$
  • Element 2: $(0.00)(5) + (1.41)(8) + (-1.41)(6) + (0.00)(10) = 0 + 11.28 - 8.46 + 0 = 2.82$
  • Element 3: $(-1.34)(5) + (-0.45)(8) + (0.45)(6) + (1.34)(10) = -6.7 - 3.6 + 2.7 + 13.4 = 5.8$

Note that elements 1 and 3 are identical because features 1 and 3 become identical columns after standardization.

$$\mathbf{X}_{std}^T\mathbf{y} = \begin{bmatrix} 5.8 \\ 2.82 \\ 5.8 \end{bmatrix}$$
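All of the quantities in Steps 1 through 4 can be checked with a few lines of NumPy. This is only a verification sketch of the hand calculations above (NumPy's std defaults to the population form used here):

import numpy as np

X = np.array([[1, 2, 1.1],
              [2, 3, 2.1],
              [3, 1, 3.1],
              [4, 2, 4.1]])
y = np.array([5, 8, 6, 10])
lam2 = 0.05

# Step 1: standardize (population standard deviation, i.e. divide by n)
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: Gram matrix of the standardized features
gram = X_std.T @ X_std

# Step 3: add the L2 (Ridge) term to the diagonal
gram_ridge = gram + lam2 * np.eye(X.shape[1])

# Step 4: correlation of each standardized feature with the target
xty = X_std.T @ y

print(np.round(X_std, 2))
print(np.round(gram, 2))
print(np.round(gram_ridge, 2))
print(np.round(xty, 2))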

Step 5: Understanding the Elastic Net Solution

The Elastic Net optimization problem is:

$$\min_{\boldsymbol{\beta}} \left\{ \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + \lambda_1 \|\boldsymbol{\beta}\|_1 + \lambda_2 \|\boldsymbol{\beta}\|_2^2 \right\}$$

Substituting our values $\lambda_1 = 0.1$ and $\lambda_2 = 0.05$:

$$\min_{\boldsymbol{\beta}} \left\{ \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|_2^2 + 0.1 \|\boldsymbol{\beta}\|_1 + 0.05 \|\boldsymbol{\beta}\|_2^2 \right\}$$

This requires iterative optimization due to the non-differentiable L1 penalty. The solution process involves coordinate descent, where we update one coefficient at a time while holding others fixed.

For each coefficient $\beta_j$, the update rule involves:

  1. Computing the partial derivative of the smooth part (Ridge + data fit)
  2. Applying the soft thresholding operator for the L1 penalty
  3. Iterating until convergence

The soft thresholding operator is defined as:

$$S(z, \gamma) = \begin{cases} z - \gamma & \text{if } z > \gamma \\ 0 & \text{if } |z| \leq \gamma \\ z + \gamma & \text{if } z < -\gamma \end{cases}$$

where $z$ is the intermediate coefficient value and $\gamma$ is the threshold determined by $\lambda_1$.
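To make the update rule concrete, here is a minimal coordinate-descent sketch for the objective above. It assumes the columns of X are already standardized and y is centered (so no intercept is fit); the function and variable names are just for illustration, and library solvers add many refinements (convergence checks, warm starts, screening rules) that are omitted here.

import numpy as np

def soft_threshold(z, gamma):
    """Soft thresholding operator S(z, gamma)."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0

def elastic_net_coordinate_descent(X, y, lam1, lam2, n_iter=200):
    """Minimize ||y - X b||^2 + lam1 * ||b||_1 + lam2 * ||b||_2^2 by cyclic coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)  # x_j . x_j for each column
    for _ in range(n_iter):
        for j in range(p):
            # partial residual with feature j's current contribution removed
            r = y - X @ beta + X[:, j] * beta[j]
            z = X[:, j] @ r
            # single-coordinate minimizer: soft thresholding plus Ridge-style shrinkage
            beta[j] = soft_threshold(z, lam1 / 2) / (col_sq[j] + lam2)
    return beta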

The final Elastic Net coefficients (after convergence) are approximately:

$$\hat{\boldsymbol{\beta}}_{\text{Elastic Net}} \approx \begin{bmatrix} 1.8 \\ 0.0 \\ 1.7 \end{bmatrix}$$
Note: Approximate Solution

The Elastic Net coefficients shown above are approximate values obtained through iterative optimization. The exact values depend on the convergence criteria and optimization algorithm used. In practice, we would use computational tools like scikit-learn to obtain precise solutions.

Comparison with Other Methods

Let's compare with other regularization approaches:

Ordinary Least Squares (no regularization):

$$\hat{\boldsymbol{\beta}}_{\text{OLS}} = \begin{bmatrix} 2.1 \\ 0.0 \\ 2.0 \end{bmatrix}$$

Ridge Regression (L2 only, $\lambda_2 = 0.05$):

$$\hat{\boldsymbol{\beta}}_{\text{Ridge}} = \begin{bmatrix} 1.9 \\ 0.0 \\ 1.8 \end{bmatrix}$$

LASSO (L1 only, $\lambda_1 = 0.1$):

$$\hat{\boldsymbol{\beta}}_{\text{LASSO}} = \begin{bmatrix} 2.0 \\ 0.0 \\ 0.0 \end{bmatrix}$$
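If you want to run a comparison like this yourself, the sketch below fits the four models on the toy dataset with scikit-learn. Because scikit-learn scales its penalties differently from the textbook objective, and because the coefficient values quoted above are only illustrative approximations, the printed numbers may differ from them; the qualitative pattern (shrinkage, sparsity, and the grouping of the correlated features) is what to look for.

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.preprocessing import StandardScaler

X = np.array([[1, 2, 1.1], [2, 3, 2.1], [3, 1, 3.1], [4, 2, 4.1]])
y = np.array([5, 8, 6, 10])
X_std = StandardScaler().fit_transform(X)

# Rough conversion of lambda_1 = 0.1, lambda_2 = 0.05 into scikit-learn's scaling
# (lambda_1 = 2*n*alpha*l1_ratio and lambda_2 = n*alpha*(1 - l1_ratio), with n = 4)
models = {
    "OLS": LinearRegression(),
    "Ridge": Ridge(alpha=0.05),
    "LASSO": Lasso(alpha=0.1 / (2 * 4)),
    "Elastic Net": ElasticNet(alpha=0.025, l1_ratio=0.5),
}
for name, model in models.items():
    model.fit(X_std, y)
    print(f"{name:12s}", np.round(model.coef_, 2))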

Key Observations

  1. Grouping Effect: Elastic Net keeps both correlated features 1 and 3 with similar coefficients (1.8 and 1.7), while LASSO arbitrarily selects only feature 1.

  2. Sparsity: All methods correctly identify that feature 2 is not important (coefficient ≈ 0).

  3. Shrinkage: All regularized methods shrink coefficients compared to OLS, with Elastic Net providing a balanced approach.

  4. Stability: Elastic Net's solution is more stable than LASSO when features are correlated, making it more reliable for interpretation.

Implementation in Scikit-learn

We'll implement Elastic Net using scikit-learn to demonstrate how to apply this technique in practice. This tutorial walks through the complete workflow: data preparation, model training with automatic hyperparameter tuning, and evaluation. We'll use ElasticNetCV which automatically finds the best combination of regularization parameters through cross-validation.

Step 1: Import Libraries and Generate Data

First, we'll import the necessary libraries and create a synthetic dataset with correlated features to demonstrate Elastic Net's grouping effect.

In[6]:
Code
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import ElasticNetCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score

# Set random seed for reproducibility
np.random.seed(42)

# Generate synthetic data with correlated features
X, y = make_regression(
    n_samples=1000, 
    n_features=20, 
    noise=50.0, 
    random_state=42
)

# Create correlated features to demonstrate grouping effect
X[:, 1] = X[:, 0] + 0.3 * np.random.randn(1000)  # Features 0 & 1 are correlated
X[:, 3] = 0.7 * X[:, 2] + 0.3 * np.random.randn(1000)  # Features 2 & 3 are correlated

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
Out[7]:
Console
Dataset: 1000 samples, 20 features
Training set: 800 samples
Test set: 200 samples

We've created a dataset with 1,000 samples and 20 features, where some features are intentionally correlated. Features 0 and 1 are highly correlated, as are features 2 and 3. This correlation structure will allow us to demonstrate Elastic Net's grouping effect—its ability to keep correlated features together in the model.

Step 2: Configure and Train the Model

Next, we'll set up an Elastic Net model with automatic hyperparameter tuning using ElasticNetCV. This approach tests multiple combinations of regularization parameters and selects the best one through cross-validation.

In[8]:
Code
# Create Elastic Net with automatic hyperparameter selection
elastic_net = ElasticNetCV(
    l1_ratio=[0.1, 0.3, 0.5, 0.7, 0.9],  # Mixing parameter: balance between L1 and L2
    alphas=np.logspace(-3, 1, 20),  # Regularization strength: from 0.001 to 10
    cv=5,  # 5-fold cross-validation for parameter selection
    max_iter=2000,  # Maximum iterations for convergence
    random_state=42,  # For reproducible results
    n_jobs=-1  # Use all available CPU cores
)

# Create pipeline with standardization and Elastic Net
pipeline = Pipeline([
    ("scaler", StandardScaler()),  # Important: standardize features before regularization
    ("elastic_net", elastic_net)
])

# Fit the model on training data
pipeline.fit(X_train, y_train)

The ElasticNetCV model will test 5 different l1_ratio values (controlling the L1/L2 balance) against 20 different alpha values (controlling overall regularization strength), resulting in 100 different parameter combinations. For each combination, it performs 5-fold cross-validation to estimate performance, then selects the best parameters.

The Pipeline ensures that feature standardization is applied consistently to both training and test data, which is important for Elastic Net since the regularization penalty is sensitive to feature scales.

Step 3: Evaluate Model Performance

Now let's evaluate how well the model performs on the test set and examine which hyperparameters were selected.

In[9]:
Code
# Make predictions on test data
y_pred = pipeline.predict(X_test)

# Calculate performance metrics
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Extract the best Elastic Net model from the pipeline
best_elastic_net = pipeline.named_steps["elastic_net"]
Out[10]:
Console
Performance Metrics:
Mean Squared Error: 12234.5750
R² Score: 0.7076

The R² score indicates how much of the variance in the target variable our model explains. A value close to 1.0 suggests excellent predictive performance, while values closer to 0 indicate poor performance. The MSE provides the average squared difference between predictions and actual values—lower values indicate better fit.

Out[11]:
Console

Optimal Hyperparameters:
Best l1_ratio: 0.900
Best alpha: 0.1274

The selected l1_ratio tells us the balance between L1 and L2 regularization that worked best for this dataset. A value closer to 1.0 means the model favored LASSO-like behavior (more sparsity), while values closer to 0 indicate Ridge-like behavior (keeping more features with small coefficients). The alpha value controls the overall strength of regularization—higher values mean more aggressive regularization.

Step 4: Examine Feature Selection

Let's look at which features the model selected and their coefficients to understand the sparsity pattern.

In[12]:
Code
# Count and display non-zero coefficients
non_zero_coefs = 0
selected_features = []
for i, coef in enumerate(best_elastic_net.coef_):
    if abs(coef) > 1e-6:  # Non-zero coefficient threshold
        non_zero_coefs += 1
        selected_features.append((i, coef))
Out[13]:
Console
Selected Features and Coefficients:
  Feature 0: 81.9982
  Feature 1: 4.0983
  Feature 2: 6.4044
  Feature 4: 83.1459
  Feature 5: -0.5909
  Feature 6: 69.7527
  Feature 7: 7.5425
  Feature 8: -1.2363
  Feature 9: 3.9866
  Feature 10: 19.8315
  Feature 11: 44.8573
  Feature 12: 3.7041
  Feature 13: 2.3340
  Feature 14: -3.9652
  Feature 15: 24.8593
  Feature 16: -0.3934
  Feature 17: 83.3584
  Feature 18: 2.5290
  Feature 19: 1.7641

Sparsity Summary:
Features selected: 19/20
Sparsity ratio: 95.0%

The sparsity ratio shows what percentage of features the model retained. Elastic Net's automatic feature selection has eliminated features that don't contribute meaningfully to predictions, simplifying the model and potentially improving interpretability. In this run, both of the correlated features 0 and 1 were retained, while feature 3 (the one constructed to correlate with feature 2) is the single feature that was dropped. With the selected l1_ratio of 0.9 the penalty is strongly LASSO-like, so the grouping effect is relatively weak here; lower l1_ratio values would push correlated features to be kept or dropped together more consistently.

Alternative: Manual Hyperparameter Tuning with GridSearchCV

While ElasticNetCV is recommended for most cases, you can also use GridSearchCV if you need more control over the search process or want to tune additional parameters.

In[14]:
Code
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import ElasticNet

# Define parameter grid for manual search
param_grid = {
    "elastic_net__l1_ratio": [0.1, 0.5, 0.9],  # Mixing parameter values
    "elastic_net__alpha": [0.01, 0.1, 1.0],    # Regularization strength values
}

# Create pipeline with manual Elastic Net
manual_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("elastic_net", ElasticNet(max_iter=2000, random_state=42))
])

# Perform grid search with cross-validation
grid_search = GridSearchCV(
    manual_pipeline,
    param_grid,
    cv=3,  # 3-fold cross-validation
    scoring="neg_mean_squared_error",  # Minimize MSE
    n_jobs=-1  # Use all CPU cores
)

# Fit the grid search
grid_search.fit(X_train, y_train)
Out[15]:
Console
GridSearchCV Results:
Best parameters: {'elastic_net__alpha': 0.1, 'elastic_net__l1_ratio': 0.9}
Best cross-validation MSE: 12067.6621

Comparison with ElasticNetCV:
ElasticNetCV l1_ratio: 0.900
ElasticNetCV alpha: 0.1274
GridSearchCV l1_ratio: 0.9
GridSearchCV alpha: 0.1

Both approaches should yield similar results, though ElasticNetCV typically explores a finer grid of parameters and is optimized specifically for Elastic Net. The GridSearchCV approach offers more flexibility if you want to tune additional hyperparameters or use custom scoring functions.

Important: Feature Scaling Required

Elastic Net is highly sensitive to feature scales. Like Ridge regression, Elastic Net requires all features to be on the same scale. Features with larger scales will dominate the regularization penalty, leading to biased results. Use StandardScaler or MinMaxScaler before applying Elastic Net to ensure fair treatment of all features.

Key Parameters

Below are the main parameters that affect how Elastic Net works and performs.

  • alpha: Overall regularization strength (default: 1.0). Higher values apply stronger regularization, leading to smaller coefficients and more sparsity. Start with values between 0.01 and 10, using cross-validation to find the optimal value for your dataset.

  • l1_ratio: Mixing parameter controlling the balance between L1 and L2 penalties (default: 0.5). Values range from 0 (pure Ridge) to 1 (pure LASSO). Use 0.5 for a balanced approach, or tune this parameter if you know whether you need more sparsity (higher values) or more grouping (lower values).

  • max_iter: Maximum number of iterations for the optimization algorithm (default: 1000). Increase to 2000 or higher if you encounter convergence warnings, especially with large datasets or strong regularization.

  • tol: Tolerance for optimization convergence (default: 1e-4). Smaller values lead to more precise solutions but longer training times. The default works well for most applications.

  • random_state: Seed for reproducibility (default: None). Set to an integer to ensure consistent results across runs, especially important when comparing different models.

  • selection: Method for coefficient updates during coordinate descent (default: 'cyclic'). Options are 'cyclic' (updates coefficients in order) or 'random' (updates in random order). Random selection can be faster for large datasets.

Key Methods

The following are the most commonly used methods for interacting with Elastic Net models.

  • fit(X, y): Trains the Elastic Net model on the training data X and target values y. Performs coordinate descent optimization to find optimal coefficients.

  • predict(X): Returns predicted values for input data X using the learned coefficients.

  • score(X, y): Returns the R² score (coefficient of determination) on the given test data. Values closer to 1.0 indicate better model performance.

  • get_params(): Returns a dictionary of all model parameters. Useful for inspecting the current configuration.

  • set_params(**params): Sets model parameters. Useful for updating parameters without creating a new model instance.

Practical Implications

Elastic Net is particularly effective when working with high-dimensional data where features exhibit correlation. In genomics and bioinformatics, gene expression data often contains thousands of features with complex interdependencies. Elastic Net's grouping effect ensures that related genes are selected or eliminated together, providing more stable and interpretable results than LASSO, which might arbitrarily select one gene from a correlated group. This stability is crucial when the goal is to identify biological pathways or gene networks rather than individual markers.

In finance and economics, Elastic Net excels at building predictive models with many correlated economic indicators. For example, when predicting stock returns using multiple technical indicators or macroeconomic variables, many features naturally correlate with each other. Elastic Net maintains model interpretability through feature selection while avoiding the instability that LASSO exhibits when faced with multicollinearity. This makes it valuable for risk modeling, portfolio optimization, and economic forecasting where both prediction accuracy and model transparency are important.

The method is also well-suited for situations where the number of features exceeds the number of observations (p > n), a common scenario in text mining, image analysis, and high-frequency trading. Unlike LASSO, which can select at most n features when p > n, Elastic Net does not have this limitation and can leverage the L2 penalty to handle the ill-posed nature of the problem more effectively. When you're uncertain about the correlation structure in your data, Elastic Net provides a robust middle ground that adapts to the data characteristics without requiring you to choose between LASSO and Ridge a priori.

Best Practices

To achieve optimal results with Elastic Net, start by using ElasticNetCV with a comprehensive grid of l1_ratio values (e.g., [0.1, 0.3, 0.5, 0.7, 0.9]) and regularization strengths spanning several orders of magnitude (e.g., np.logspace(-3, 1, 20)). This automated approach is more reliable than manual tuning and ensures you explore the full spectrum from Ridge-like to LASSO-like behavior. Set cv=5 or higher for cross-validation to get stable parameter estimates, and use n_jobs=-1 to parallelize the computation across all available CPU cores.

When evaluating model performance, look beyond a single metric. Examine the R² score for overall predictive power, but also inspect the selected features and their coefficients to ensure they make domain sense. If correlated features are being selected inconsistently across different train-test splits, consider increasing the L2 component by favoring lower l1_ratio values. Pay attention to the sparsity level—if too many features are eliminated, you might be over-regularizing; if too few are eliminated, you might benefit from stronger regularization. The optimal balance depends on your specific goals: prioritize sparsity for interpretability or retain more features for predictive accuracy.

Set max_iter=2000 or higher to avoid convergence warnings, especially with large datasets or strong regularization. Use random_state for reproducibility when comparing different models or parameter settings. When working with time series or panel data, ensure your cross-validation strategy respects the temporal or hierarchical structure—use TimeSeriesSplit or grouped cross-validation rather than standard k-fold splitting. Finally, validate your model on a held-out test set that was not used during hyperparameter tuning to get an unbiased estimate of generalization performance.

Data Requirements and Preprocessing

Feature standardization is important for Elastic Net because the regularization penalty treats all coefficients equally in the optimization objective. Without standardization, features with larger scales will have larger coefficients, and the penalty will disproportionately shrink these coefficients, leading to biased feature selection. Use StandardScaler when features are approximately normally distributed, or MinMaxScaler when features have known bounds or when you want to preserve zero values in sparse data. Fit the scaler on the training data only and apply the same transformation to validation and test sets to avoid data leakage.

While Elastic Net can handle situations where the number of features exceeds the number of observations, performance generally improves with larger sample sizes. As a guideline, aim for at least 10-20 observations per feature when possible, though the method can work with fewer samples if regularization is appropriately tuned. Missing data must be addressed before fitting the model, as scikit-learn's implementation does not handle missing values internally. Consider imputation strategies that preserve feature relationships—mean or median imputation for simple cases, or more sophisticated methods like k-NN imputation or iterative imputation for datasets where feature correlations are important.

Elastic Net is relatively robust to moderate outliers due to the regularization penalty, but extreme outliers can still influence the coefficient estimates and feature selection. If outliers are present, consider using robust scaling methods like RobustScaler, which uses the median and interquartile range instead of mean and standard deviation. For datasets with many outliers or heavy-tailed distributions, you might also consider transforming features (e.g., log transformation for right-skewed data) before standardization. However, be cautious with transformations that change the interpretation of coefficients, especially if model interpretability is a primary goal.
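A preprocessing pipeline keeps these steps consistent between training and inference. The sketch below chains median imputation, robust scaling, and Elastic Net so that everything is fit on the training data only; the parameter values are placeholders, and X_train/X_test are assumed to exist as in the earlier examples.

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler
from sklearn.linear_model import ElasticNet

robust_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # fill missing values before scaling
    ("scale", RobustScaler()),                     # median/IQR scaling, less sensitive to outliers
    ("model", ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=5000)),
])

# robust_pipeline.fit(X_train, y_train)   # imputer and scaler are fit on training data only
# robust_pipeline.predict(X_test)         # the same transformations are applied automatically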

Common Pitfalls

One of the most frequent mistakes when using Elastic Net is neglecting feature standardization. Since the regularization penalty is applied equally to all coefficients in the objective function, features with larger scales will naturally have larger coefficients, and the penalty will disproportionately shrink these coefficients. This creates a bias where the model's feature selection depends on the arbitrary units of measurement rather than the actual importance of features. The solution is straightforward: standardize features before fitting Elastic Net, using StandardScaler or MinMaxScaler as appropriate for your data distribution.

Another common issue is insufficient hyperparameter exploration. Using default parameters or testing only a narrow range of alpha and l1_ratio values often leads to suboptimal performance. The optimal regularization strength can vary by several orders of magnitude depending on the dataset, and the optimal L1/L2 balance depends on the correlation structure of your features. Use ElasticNetCV with a comprehensive grid of parameters, or if you need more control, use GridSearchCV with a logarithmic spacing of alpha values (e.g., np.logspace(-3, 1, 20)) and multiple l1_ratio values spanning from 0.1 to 0.9.

A subtle but important pitfall is overfitting during hyperparameter selection. If you tune parameters using cross-validation on your training set and then evaluate the final model on a test set, this is appropriate. However, if you repeatedly adjust parameters based on test set performance, you're effectively using the test set for model selection, which leads to overly optimistic performance estimates. The proper approach is to use nested cross-validation (an outer loop for model evaluation and an inner loop for hyperparameter tuning) or to maintain a completely separate validation set for hyperparameter selection and reserve the test set solely for final evaluation. Additionally, when features are correlated, avoid interpreting individual coefficient magnitudes as feature importance rankings. Elastic Net's grouping effect means that correlated features share their predictive power, so you should interpret coefficient groups together rather than treating each coefficient independently.
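A nested cross-validation sketch, reusing the manual_pipeline and param_grid defined in the GridSearchCV example above, might look like the following; the fold counts are arbitrary choices.

from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

inner_cv = KFold(n_splits=3, shuffle=True, random_state=42)   # hyperparameter search
outer_cv = KFold(n_splits=5, shuffle=True, random_state=42)   # unbiased performance estimate

inner_search = GridSearchCV(
    manual_pipeline,
    param_grid,
    cv=inner_cv,
    scoring="neg_mean_squared_error",
    n_jobs=-1,
)

nested_scores = cross_val_score(
    inner_search, X_train, y_train,
    cv=outer_cv, scoring="neg_mean_squared_error",
)
print("Nested CV MSE:", -nested_scores.mean())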

Computational Considerations

Elastic Net's computational complexity is dominated by the coordinate descent optimization algorithm, which iteratively updates each coefficient while holding others fixed. For a dataset with n observations and p features, each iteration of coordinate descent requires O(np) operations, and the algorithm typically requires multiple iterations to converge. The total complexity is approximately O(np × k), where k is the number of iterations needed for convergence. In practice, k is usually modest (10-100 iterations) for well-conditioned problems, but can increase significantly with strong regularization or when features are highly correlated.

When using ElasticNetCV for automatic hyperparameter selection, the computational cost multiplies by the number of parameter combinations tested and the number of cross-validation folds. For example, testing 5 l1_ratio values against 20 alpha values with 5-fold cross-validation requires fitting 500 models. This can be time-consuming for large datasets, but the process is embarrassingly parallel—set n_jobs=-1 to utilize all available CPU cores and reduce wall-clock time substantially. For datasets with more than 100,000 observations or 10,000 features, consider using a coarser parameter grid initially to identify promising regions, then refine the search around the best parameters.

Memory requirements for Elastic Net are generally modest, scaling linearly with the number of features and observations. The algorithm stores the design matrix, target vector, and coefficient vector, requiring approximately 8np + 8p + 8n bytes for double-precision floating-point numbers. For very large datasets that don't fit in memory, consider using stochastic gradient descent variants (like SGDRegressor with penalty='elasticnet') which process data in mini-batches. However, note that SGD-based approaches may require more careful tuning of learning rates and may not converge as reliably as coordinate descent for small to medium-sized datasets.
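
For the out-of-core case, a minimal sketch of the SGD route (with simulated chunks standing in for data streamed from disk, and illustrative hyperparameters) might look like this:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

# Simulated mini-batches standing in for chunks streamed from disk
rng = np.random.default_rng(0)
w_true = rng.normal(size=20)

def next_chunk(n=1_000):
    X = rng.normal(size=(n, 20))
    return X, X @ w_true + rng.normal(scale=0.5, size=n)

scaler = StandardScaler()
model = SGDRegressor(penalty="elasticnet", alpha=0.01, l1_ratio=0.5, random_state=0)

for _ in range(10):                       # process the data one chunk at a time
    X_chunk, y_chunk = next_chunk()
    X_scaled = scaler.partial_fit(X_chunk).transform(X_chunk)
    model.partial_fit(X_scaled, y_chunk)  # incremental update, no full dataset in memory

print(model.coef_[:5])
```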

Performance and Deployment Considerations

Evaluating Elastic Net performance requires examining multiple aspects beyond simple prediction accuracy. The R² score provides a measure of explained variance and should be your primary metric for regression tasks, with values above 0.7 generally indicating good predictive power, though this threshold depends heavily on the domain and noise level in your data. Mean squared error (MSE) or root mean squared error (RMSE) give you error magnitudes in the original units of the target variable, making them more interpretable for stakeholders. However, also examine the distribution of residuals—systematic patterns in residual plots may indicate that important nonlinear relationships or interactions are being missed.
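
A short evaluation sketch on synthetic data (the model settings are placeholders) shows how these pieces fit together: R² for explained variance, RMSE in the target's units, and a crude residual check for leftover structure.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data; replace with your own
X, y = make_regression(n_samples=500, n_features=20, noise=15.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = make_pipeline(StandardScaler(), ElasticNet(alpha=0.5, l1_ratio=0.5))
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))   # error in the target's original units
residuals = y_test - y_pred

print(f"R^2:  {r2:.3f}")
print(f"RMSE: {rmse:.3f}")
# Residuals should show no obvious trend against the predictions;
# a strong correlation here hints at missed nonlinearity or interactions.
print(f"corr(residuals, y_pred): {np.corrcoef(residuals, y_pred)[0, 1]:.3f}")
```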

Feature selection quality is equally important, especially when interpretability is a goal. Count the number of selected features and verify that they make domain sense. If many features are eliminated, check whether important predictors are being excluded due to over-regularization. If few features are eliminated, consider whether you're under-regularizing and missing opportunities for model simplification. Examine the stability of feature selection across different train-test splits or cross-validation folds—if the selected features vary substantially, this suggests the model is sensitive to small changes in the data, which can be problematic for interpretation and deployment.
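
One way to probe that stability, sketched here with synthetic data and arbitrary regularization settings, is to refit across cross-validation folds and compare which coefficients remain nonzero:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data with a handful of truly informative features
X, y = make_regression(n_samples=200, n_features=30, n_informative=8,
                       noise=10.0, random_state=0)

selected_per_fold = []
for train_idx, _ in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = make_pipeline(StandardScaler(), ElasticNet(alpha=1.0, l1_ratio=0.7))
    model.fit(X[train_idx], y[train_idx])
    coef = model.named_steps["elasticnet"].coef_
    selected_per_fold.append(set(np.flatnonzero(coef)))

# Features that survive in every fold are the most trustworthy for interpretation
stable = set.intersection(*selected_per_fold)
print("selected per fold:", [len(s) for s in selected_per_fold])
print("stable across all folds:", sorted(stable))
```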

When deploying Elastic Net models in production, the primary considerations are prediction speed and model maintenance. Prediction is computationally inexpensive—it's simply a linear combination of features, requiring O(p) operations per prediction. This makes Elastic Net suitable for real-time applications and high-throughput scenarios. Store the learned coefficients and intercept, along with the scaling parameters from your StandardScaler, to ensure consistent preprocessing of new data. Monitor prediction performance over time, as model degradation can occur if the relationship between features and target changes (concept drift). Consider retraining periodically with recent data, but be aware that feature selection may change between model versions, which can complicate model interpretation and comparison. For applications requiring strict model governance, maintain documentation of which features were selected and why, along with the hyperparameters used and the validation performance achieved.
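
A minimal persistence sketch (file name and data are hypothetical) saves the fitted scaler and model together, so serving-time preprocessing cannot drift from training:

```python
import joblib
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical training data; replace with your own
X, y = make_regression(n_samples=300, n_features=15, noise=10.0, random_state=0)

# Bundling the scaler and the model guarantees that new data is
# preprocessed exactly as the training data was.
pipeline = make_pipeline(StandardScaler(), ElasticNet(alpha=0.3, l1_ratio=0.7))
pipeline.fit(X, y)

# Persist coefficients, intercept, and scaling parameters in one artifact
joblib.dump(pipeline, "elastic_net_pipeline.joblib")   # hypothetical file name

# At serving time: load once, then predict in O(p) per row
loaded = joblib.load("elastic_net_pipeline.joblib")
print(loaded.predict(X[:5]))
```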

Summary

Elastic Net regularization is a powerful technique that combines the strengths of both LASSO and Ridge regularization by adding penalties proportional to both the L1 norm (sum of absolute values) and L2 norm (sum of squares) of the coefficients. This hybrid approach provides automatic feature selection like LASSO while maintaining the stability and grouping effect of Ridge regression for correlated features.

The method is particularly valuable in high-dimensional settings where we have many features, some of which may be correlated, and we need both interpretability through feature selection and robustness through stable coefficient estimates. However, Elastic Net requires tuning two hyperparameters (the L1 and L2 regularization strengths), making it more computationally intensive than its parent methods. It's an excellent choice when we're uncertain about the correlation structure of our features or when we need a robust method that balances sparsity with stability.

We've seen how Elastic Net provides a balanced approach to regularization that addresses the limitations of both LASSO and Ridge regression. By combining L1 and L2 penalties, it offers the best of both worlds: the sparsity and interpretability of LASSO with the stability and grouping effect of Ridge. This makes it particularly valuable for real-world applications where we often face high-dimensional data with correlated features and need both model interpretability and robust predictions.

Quiz

Ready to test your understanding of Elastic Net regularization? Take this quiz to reinforce what you've learned about combining L1 and L2 penalties for regression.

