A comprehensive guide to standardization in machine learning, covering mathematical foundations, practical implementation, and Python examples. Learn how to properly standardize features for fair comparison across different scales and units.
Standardization: Normalizing Features for Fair Comparison
Standardization is a crucial preprocessing technique that rescales features to have mean 0 and variance 1, ensuring that machine learning algorithms treat all features fairly regardless of their original units or scales. This process is essential for many algorithms, particularly those that rely on distance calculations or regularization penalties.
Introduction
In real-world datasets, features often have vastly different scales and units. For example, a dataset might contain house prices (in thousands of dollars), square footage (in hundreds of square feet), and number of bedrooms (single digits). Without standardization, algorithms like LASSO regression or k-means clustering would be dominated by features with larger numeric values, leading to biased results and poor model performance.
Standardization transforms each feature to have a mean of 0 and standard deviation of 1, putting all features on the same scale. This ensures that:
- Regularization penalties treat all features equally
- Distance-based algorithms work correctly
- Gradient-based optimization converges more efficiently
- Model coefficients become comparable across features
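To see the distance issue concretely, here is a minimal sketch with made-up numbers for two houses; the scale factors are purely illustrative, not fitted statistics:

```python
import numpy as np

# Two hypothetical houses: (square footage, bedrooms)
a = np.array([1000.0, 2.0])
b = np.array([2000.0, 4.0])

# Raw Euclidean distance: the bedroom difference is numerically invisible
print(np.linalg.norm(a - b))            # ~1000.002, driven almost entirely by square footage

# The same comparison after dividing each feature by an illustrative scale factor
scale = np.array([500.0, 1.0])
print(np.linalg.norm((a - b) / scale))  # ~2.83, both features now contribute
```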
Mathematical Foundation
The Standardization Formula
For a feature $x_j$ with $n$ observations, standardization transforms each value $x_{ij}$ to $z_{ij}$ using:

$$z_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j}$$

where:
- $z_{ij}$ is the standardized value of feature $j$ for observation $i$
- $\mu_j$ is the mean of feature $j$
- $\sigma_j$ is the standard deviation of feature $j$
Key Properties
After standardization, each feature in your dataset is transformed so that it has a mean of zero and a standard deviation of one. This transformation ensures that all features, regardless of their original scale or units, are directly comparable and contribute equally to the analysis. In practical terms, this means that no single feature will dominate the learning process simply because it has larger numeric values. Instead, every feature is centered and scaled, allowing algorithms—especially those sensitive to feature magnitude, such as regularized regression or clustering—to perform optimally and fairly.
- Mean: $\mu_{z_j} = 0$ for all features
- Standard deviation: $\sigma_{z_j} = 1$ for all features
- Variance: $\sigma_{z_j}^2 = 1$ for all features
This ensures that all features contribute equally to distance calculations and regularization penalties.
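A quick way to confirm these properties is to standardize a small synthetic feature matrix by hand; the values below are illustrative, not from the article:

```python
import numpy as np

rng = np.random.default_rng(42)

# Two hypothetical features on very different scales
X = np.column_stack([
    rng.normal(1500, 350, size=100),   # e.g. square footage
    rng.normal(3, 1, size=100),        # e.g. number of bedrooms
])

# z_ij = (x_ij - mu_j) / sigma_j, applied column by column
mu = X.mean(axis=0)
sigma = X.std(axis=0)
Z = (X - mu) / sigma

print("Means after standardization:", Z.mean(axis=0).round(10))  # ~[0, 0]
print("Stds after standardization: ", Z.std(axis=0).round(10))   # [1, 1]
```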
Visual Example
This example demonstrates how standardization transforms features with different scales into a common scale:
Original features with vastly different scales. House size ranges from 1000-2000 square feet, while number of bedrooms ranges from 2-4. Without standardization, algorithms would be dominated by the house size feature due to its larger numeric values.
Standardized features on the same scale. Both features now have mean 0 and standard deviation 1, ensuring fair treatment by machine learning algorithms. The relative relationships within each feature are preserved while making them comparable.
```
Original data:
  House Size - Mean: 1516.67, Std: 338.71
  Bedrooms   - Mean: 3.0,     Std: 0.82
Standardized data:
  House Size - Mean: -0.0, Std: 1.0
  Bedrooms   - Mean: 0.0,  Std: 1.0
```
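If you want to reproduce a comparison like this yourself, a sketch along the following lines should work. The house sizes and bedroom counts here are hypothetical stand-ins, since the exact values behind the figures are not listed:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

# Hypothetical houses (size in sq ft, bedrooms)
house_size = np.array([1000, 1200, 1400, 1600, 1900, 2000], dtype=float)
bedrooms = np.array([2, 2, 3, 3, 4, 4], dtype=float)
X = np.column_stack([house_size, bedrooms])

# Standardize both features
Z = StandardScaler().fit_transform(X)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(X[:, 0], X[:, 1])
axes[0].set(title="Original scales", xlabel="House size (sq ft)", ylabel="Bedrooms")
axes[1].scatter(Z[:, 0], Z[:, 1])
axes[1].set(title="Standardized (mean 0, std 1)", xlabel="House size (z-score)", ylabel="Bedrooms (z-score)")
plt.tight_layout()
plt.show()

print("Standardized means:", Z.mean(axis=0).round(2))  # ~[0, 0]
print("Standardized stds: ", Z.std(axis=0).round(2))   # [1, 1]
```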
Example: Step-by-Step Calculation
Let's work through a detailed example with two features:
- $x_1$: house size in square feet → [1000, 1500, 2000]
- $x_2$: number of bedrooms → [2, 3, 4]
Step 1: Calculate means

$$\mu_1 = \frac{1000 + 1500 + 2000}{3} = 1500, \qquad \mu_2 = \frac{2 + 3 + 4}{3} = 3$$

Step 2: Calculate standard deviations

$$\sigma_1 = \sqrt{\frac{(1000-1500)^2 + (1500-1500)^2 + (2000-1500)^2}{3}} \approx 408.25$$

$$\sigma_2 = \sqrt{\frac{(2-3)^2 + (3-3)^2 + (4-3)^2}{3}} \approx 0.816$$

Step 3: Apply standardization formula

For the first feature ($x_1$):

$$z_{11} = \frac{1000 - 1500}{408.25} \approx -1.225, \quad z_{21} = \frac{1500 - 1500}{408.25} = 0, \quad z_{31} = \frac{2000 - 1500}{408.25} \approx 1.225$$

For the second feature ($x_2$):

$$z_{12} = \frac{2 - 3}{0.816} \approx -1.225, \quad z_{22} = \frac{3 - 3}{0.816} = 0, \quad z_{32} = \frac{4 - 3}{0.816} \approx 1.225$$
Result: Both features are now on the same scale:
- $x_1$: [1000, 1500, 2000] → [-1.225, 0.000, 1.225]
- $x_2$: [2, 3, 4] → [-1.225, 0.000, 1.225]
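The same result can be checked in a few lines with scikit-learn. Note that `StandardScaler` uses the population standard deviation (dividing by $n$), which matches the hand calculation above:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1000, 2],
              [1500, 3],
              [2000, 4]], dtype=float)

# StandardScaler divides by the population standard deviation,
# matching the hand calculation above
Z = StandardScaler().fit_transform(X)
print(Z.round(3))
# [[-1.225 -1.225]
#  [ 0.     0.   ]
#  [ 1.225  1.225]]
```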
Practical Implementation
Proper Train-Test Split with Standardization
This example demonstrates the correct way to apply standardization in a machine learning pipeline:
```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Set random seed for reproducibility
np.random.seed(42)

# Create sample dataset
X = np.random.randn(100, 3)
X[:, 0] *= 1000  # Scale first feature to be much larger
X[:, 1] *= 10    # Scale second feature moderately
# Third feature remains small scale

# Create target variable
y = 2 * X[:, 0] + 3 * X[:, 1] + 0.5 * X[:, 2] + np.random.randn(100) * 0.1

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# CORRECT: Fit scaler only on training data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train model
model = Lasso(alpha=0.1)
model.fit(X_train_scaled, y_train)

# Make predictions on the test data, transformed with the scaler fitted on the training data
y_pred = model.predict(X_test_scaled)
mse = mean_squared_error(y_test, y_pred)

print("Model coefficients:", model.coef_)
print("Test MSE:", mse)
print("Training data mean:", np.mean(X_train_scaled, axis=0))
print("Training data std:", np.std(X_train_scaled, axis=0))
```
```
Model coefficients: [1.68083062e+03 2.67552877e+01 4.19099894e-01]
Test MSE: 0.04795636619638803
Training data mean: [-2.67147415e-17 3.46944695e-17 -5.68989300e-17]
Training data std: [1. 1. 1.]
```
Key Implementation Guidelines
Standardization is a simple but crucial step in the machine learning workflow. Here are the most important guidelines to follow:
- Fit the scaler only on the training data. This ensures that information from the test set does not leak into the model during training. Fitting on the entire dataset (including the test set) can lead to overly optimistic performance estimates and poor generalization.
- Transform both training and test data using the same scaler. After fitting the scaler on the training data, use it to transform both the training and test sets. This guarantees that the scaling parameters (mean and standard deviation) are consistent and based solely on the training data.
- Never fit the scaler on the entire dataset before splitting. Doing so introduces data leakage, as the test set statistics influence the scaling of the training data.
- Use pipelines to automate and safeguard the process. Scikit-learn pipelines help ensure that standardization and modeling steps are applied correctly and in the right order, reducing the risk of data leakage and making your workflow more reproducible.
By following these guidelines, you ensure that your model evaluation is fair and that your results will generalize well to new, unseen data.
```python
from sklearn.pipeline import Pipeline

# Create pipeline with standardization and model
pipeline = Pipeline([("scaler", StandardScaler()), ("model", Lasso(alpha=0.1))])

# Fit pipeline on training data
pipeline.fit(X_train, y_train)

# Make predictions on test data
y_pred_pipeline = pipeline.predict(X_test)
mse_pipeline = mean_squared_error(y_test, y_pred_pipeline)

print("Pipeline MSE:", mse_pipeline)
```
Pipeline MSE: 0.04795636619638803
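A further benefit of the pipeline approach is that it composes safely with cross-validation: the scaler is re-fit on each training fold, so no fold's validation data influences its own scaling. Here is a minimal sketch, reusing the `X` and `y` created earlier in this example:

```python
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso

pipeline = Pipeline([("scaler", StandardScaler()), ("model", Lasso(alpha=0.1))])

# Each fold re-fits the scaler on its own training portion only,
# so the cross-validation estimate stays free of scaling leakage
scores = cross_val_score(pipeline, X, y, cv=5, scoring="neg_mean_squared_error")
print("Cross-validated MSE:", -scores.mean())
```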
When to Use Standardization
Standardization is especially important for certain types of machine learning algorithms, while for others it is less critical.
Algorithms that require standardization include:
- LASSO and Ridge regression: These methods use regularization penalties that assume all features are on the same scale. Without standardization, features with larger numeric ranges can dominate the penalty and skew the model.
- k-means clustering: Since this algorithm relies on distance calculations, features must be on comparable scales to ensure that no single variable dominates the clustering process (see the sketch after this list).
- Support Vector Machines: The kernel functions used in SVMs are distance-based, so standardized features are essential for fair and effective separation.
- Neural networks: Gradient-based optimization in neural networks converges more efficiently when inputs are normalized.
- Principal Component Analysis (PCA): As PCA is based on variance, standardizing features ensures that each variable contributes appropriately to the dimensionality reduction.
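To make the scale sensitivity concrete, the following sketch clusters synthetic house-style data (invented for illustration) with k-means, once on the raw features and once after standardization:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)

# Two true groups that differ only in the small-scale feature (bedrooms),
# while the large-scale feature (square footage) is uninformative noise
labels = np.repeat([0, 1], 100)
sqft = rng.normal(1500, 400, size=200)                       # large scale, no cluster signal
bedrooms = np.where(labels == 0, 2.0, 4.0) + rng.normal(0, 0.3, size=200)
X = np.column_stack([sqft, bedrooms])

raw_clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
scaled_clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    StandardScaler().fit_transform(X)
)

# Agreement with the true grouping (1.0 = perfect, ~0.0 = no better than chance)
print("Without scaling:", adjusted_rand_score(labels, raw_clusters))     # typically near 0
print("With scaling:   ", adjusted_rand_score(labels, scaled_clusters))  # typically near 1
```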
When standardization is less critical:
- Decision trees: These algorithms are scale-invariant, meaning their splitting criteria are unaffected by the magnitude of input variables.
- Random Forest: As an ensemble of decision trees, Random Forests inherit this scale invariance.
- Logistic regression: While standardization is not strictly required, it becomes important when regularization is used, as the penalty terms are sensitive to feature scale.
Limitations and Considerations
When applying standardization, keep in mind several important limitations:
- Sparse data: For sparse matrices, standardizing by subtracting the mean can convert the data into a dense format, which is inefficient and memory-intensive.
- Tip: Use `StandardScaler(with_mean=False)` to avoid densifying the matrix (see the sketch after this list).
- Outliers: Standardization is sensitive to outliers, as extreme values can disproportionately affect the mean and standard deviation, potentially distorting the transformation.
- Categorical variables: Do not standardize one-hot encoded or ordinal categorical variables, as this can destroy their intended meaning.
- Target variable: In most cases, avoid standardizing the target variable unless there is a specific reason to do so.
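For the sparse-data case above, a minimal sketch of the `with_mean=False` workaround might look like this; the matrix here is random and purely illustrative:

```python
from scipy import sparse
from sklearn.preprocessing import StandardScaler

# A mostly-zero feature matrix in CSR format
X_sparse = sparse.random(1000, 50, density=0.01, format="csr", random_state=0)

# with_mean=False skips centering and only divides by each column's standard
# deviation, so zeros stay zero and the matrix stays sparse
scaler = StandardScaler(with_mean=False)
X_scaled = scaler.fit_transform(X_sparse)

print(type(X_scaled))                # still a scipy sparse matrix
print(X_scaled.nnz == X_sparse.nnz)  # sparsity pattern is preserved
```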
Be aware of common pitfalls:
- Data leakage: Fitting the scaler on the entire dataset (including the test set) introduces information from the test data into the training process, leading to overly optimistic performance estimates.
- Best practice: Always fit the scaler only on the training data, then use it to transform both the training and test sets.
- Inconsistent scaling: Using different scalers for training and test data can result in mismatched feature distributions.
- Over-standardization: Standardizing features that are already normalized can be unnecessary or even harmful.
- Categorical confusion: Standardizing categorical variables that should remain discrete can undermine their interpretability and utility.
Practical Applications
Standardization is essential in many real-world scenarios, such as:
- Financial modeling: Combining features like stock prices, trading volumes, and economic indicators, which may be on vastly different scales.
- Image processing: Normalizing pixel values across different image formats.
- Natural language processing: Integrating word counts, document lengths, and TF-IDF scores.
- Healthcare analytics: Combining lab values, vital signs, and demographic data, all of which may have different units and ranges.
- Recommendation systems: Merging user ratings, item features, and interaction counts.
Summary
Standardization is a fundamental preprocessing step that ensures features are treated fairly in machine learning algorithms. By transforming features to have a mean of zero and a standard deviation of one, standardization:
- Enables fair comparison across variables with different scales
- Improves the performance of distance-based and regularization methods
- Prevents bias toward features with larger numeric values
- Stabilizes optimization by normalizing gradients
The key to successful standardization is proper implementation: always fit the scaler on training data only, then transform both training and test sets using the fitted scaler. This prevents data leakage and ensures realistic model evaluation.
About the author: Michael Brenndoerfer
All opinions expressed here are my own and do not reflect the views of my employer.
Michael currently works as an Associate Director of Data Science at EQT Partners in Singapore, where he drives AI and data initiatives across private capital investments.
With over a decade of experience spanning private equity, management consulting, and software engineering, he specializes in building and scaling analytics capabilities from the ground up. He has published research in leading AI conferences and holds expertise in machine learning, natural language processing, and value creation through data.