A comprehensive guide to normalization in machine learning, covering min-max scaling, proper train-test split implementation, when to use normalization vs standardization, and practical applications for neural networks and distance-based algorithms.

Part of Data Science Handbook
This article is part of the free-to-read Data Science Handbook
Reading Level
Choose your expertise level to adjust how many terms are explained. Beginners see more tooltips, experts see fewer to maintain reading flow. Hover over underlined terms for instant definitions.
Normalization: Scaling Features to a Common Range
Normalization is a fundamental preprocessing technique that rescales features to a specific range, typically [0, 1], ensuring that all features contribute equally to machine learning algorithms regardless of their original scale. This process is essential for algorithms that are sensitive to the magnitude of input values and helps prevent features with larger ranges from dominating the learning process.
Introduction
In real-world datasets, features often have vastly different scales and ranges. For example, a dataset might contain age (0-100), income (thousands to millions), and binary flags (0 or 1). Without normalization, algorithms like neural networks or k-nearest neighbors would be dominated by features with larger numeric ranges, leading to biased results and poor model performance.
Normalization transforms each feature to a common scale, typically between 0 and 1, ensuring that:
- All features contribute equally to distance calculations
- Gradient-based optimization converges more efficiently
- Model coefficients become comparable across features
- Neural networks train more effectively
Mathematical Foundation
The Min-Max Normalization Formula
For a feature with observations, min-max normalization transforms each value to using:
where:
- is the normalized value of feature for observation
- is the minimum value of feature
- is the maximum value of feature
Key Properties
After normalization, each feature in your dataset is transformed so that it has a minimum value of 0 and a maximum value of 1. This transformation ensures that all features, regardless of their original range or units, are directly comparable and contribute equally to the analysis. In practical terms, this means that no single feature will dominate the learning process simply because it has a larger numeric range. Instead, every feature is scaled to the same [0, 1] interval, allowing algorithms—especially those sensitive to feature magnitude, such as neural networks or distance-based methods—to perform optimally and fairly.
- Minimum: for all features
- Maximum: for all features
- Range: for all features
This ensures that all features contribute equally to distance calculations and optimization processes.
Visual Example
This example demonstrates how normalization transforms features with different ranges into a common [0, 1] scale:
Original features with vastly different ranges. Age ranges from 20-80 years, while income ranges from 30,000-150,000 dollars. Without normalization, algorithms would be dominated by the income feature due to its larger numeric range.
Normalized features on the same [0, 1] scale. Both features now have minimum 0 and maximum 1, ensuring fair treatment by machine learning algorithms. The relative relationships within each feature are preserved while making them comparable.
Original data: Age - Min: 25 Max: 75 Income - Min: 30000 Max: 130000 Normalized data: Age - Min: 0.0 Max: 1.0 Income - Min: 0.0 Max: 1.0
Example: Step-by-Step Calculation
Let's work through a detailed example with two features:
x1
: age in years → [25, 35, 45, 55, 65, 75]x2
: income in thousands → [30, 50, 70, 90, 110, 130]
Step 1: Calculate min and max values
- ,
- ,
Step 2: Apply normalization formula
For the first feature ():
For the second feature ():
Result: Both features are now on the same [0, 1] scale:
x1
: [25, 35, 45, 55, 65, 75] → [0.000, 0.200, 0.400, 0.600, 0.800, 1.000]x2
: [30, 50, 70, 90, 110, 130] → [0.000, 0.200, 0.400, 0.600, 0.800, 1.000]
Practical Implementation
Proper Train-Test Split with Normalization
This example demonstrates the correct way to apply normalization in a machine learning pipeline:
1import numpy as np
2from sklearn.preprocessing import MinMaxScaler
3from sklearn.neural_network import MLPRegressor
4from sklearn.model_selection import train_test_split
5from sklearn.metrics import mean_squared_error
6
7# Set random seed for reproducibility
8np.random.seed(42)
9
10# Create sample dataset with different scales
11X = np.random.randn(100, 3)
12X[:, 0] *= 1000 # Scale first feature to be much larger
13X[:, 1] *= 10 # Scale second feature moderately
14# Third feature remains small scale
15
16# Create target variable
17y = 2 * X[:, 0] + 3 * X[:, 1] + 0.5 * X[:, 2] + np.random.randn(100) * 0.1
18
19# Split data
20X_train, X_test, y_train, y_test = train_test_split(
21 X, y, test_size=0.2, random_state=42
22)
23
24# CORRECT: Fit scaler only on training data
25scaler = MinMaxScaler()
26X_train_normalized = scaler.fit_transform(X_train)
27X_test_normalized = scaler.transform(X_test)
28
29# Train model
30model = MLPRegressor(hidden_layer_sizes=(50,), max_iter=1000, random_state=42)
31model.fit(X_train_normalized, y_train)
32
33# Make predictions
34y_pred = model.predict(X_test_normalized)
35mse = mean_squared_error(y_test, y_pred)
36
37print("Model performance:")
38print("Test MSE:", mse)
39print("Training data min:", np.min(X_train_normalized, axis=0))
40print("Training data max:", np.max(X_train_normalized, axis=0))
1import numpy as np
2from sklearn.preprocessing import MinMaxScaler
3from sklearn.neural_network import MLPRegressor
4from sklearn.model_selection import train_test_split
5from sklearn.metrics import mean_squared_error
6
7# Set random seed for reproducibility
8np.random.seed(42)
9
10# Create sample dataset with different scales
11X = np.random.randn(100, 3)
12X[:, 0] *= 1000 # Scale first feature to be much larger
13X[:, 1] *= 10 # Scale second feature moderately
14# Third feature remains small scale
15
16# Create target variable
17y = 2 * X[:, 0] + 3 * X[:, 1] + 0.5 * X[:, 2] + np.random.randn(100) * 0.1
18
19# Split data
20X_train, X_test, y_train, y_test = train_test_split(
21 X, y, test_size=0.2, random_state=42
22)
23
24# CORRECT: Fit scaler only on training data
25scaler = MinMaxScaler()
26X_train_normalized = scaler.fit_transform(X_train)
27X_test_normalized = scaler.transform(X_test)
28
29# Train model
30model = MLPRegressor(hidden_layer_sizes=(50,), max_iter=1000, random_state=42)
31model.fit(X_train_normalized, y_train)
32
33# Make predictions
34y_pred = model.predict(X_test_normalized)
35mse = mean_squared_error(y_test, y_pred)
36
37print("Model performance:")
38print("Test MSE:", mse)
39print("Training data min:", np.min(X_train_normalized, axis=0))
40print("Training data max:", np.max(X_train_normalized, axis=0))
Model performance: Test MSE: 2116358.4104419583 Training data min: [0. 0. 0.] Training data max: [1. 1. 1.]
Key Implementation Guidelines
Normalization is a simple but crucial step in the machine learning workflow. Here are the most important guidelines to follow:
-
Fit the scaler only on the training data.
This ensures that information from the test set does not leak into the model during training. Fitting on the entire dataset (including the test set) can lead to overly optimistic performance estimates and poor generalization. -
Transform both training and test data using the same scaler.
After fitting the scaler on the training data, use it to transform both the training and test sets. This guarantees that the scaling parameters (min and max values) are consistent and based solely on the training data. -
Never fit the scaler on the entire dataset before splitting.
Doing so introduces data leakage, as the test set statistics influence the scaling of the training data. -
Use pipelines to automate and safeguard the process.
Scikit-learn pipelines help ensure that normalization and modeling steps are applied correctly and in the right order, reducing the risk of data leakage and making your workflow more reproducible.
By following these guidelines, you ensure that your model evaluation is fair and that your results will generalize well to new, unseen data.
1from sklearn.pipeline import Pipeline
2
3# Create pipeline with normalization and model
4pipeline = Pipeline(
5 [
6 ("scaler", MinMaxScaler()),
7 (
8 "model",
9 MLPRegressor(hidden_layer_sizes=(50,), max_iter=1000, random_state=42),
10 ),
11 ]
12)
13
14# Fit pipeline on training data
15pipeline.fit(X_train, y_train)
16
17# Make predictions on test data
18y_pred_pipeline = pipeline.predict(X_test)
19mse_pipeline = mean_squared_error(y_test, y_pred_pipeline)
20
21print("Pipeline MSE:", mse_pipeline)
1from sklearn.pipeline import Pipeline
2
3# Create pipeline with normalization and model
4pipeline = Pipeline(
5 [
6 ("scaler", MinMaxScaler()),
7 (
8 "model",
9 MLPRegressor(hidden_layer_sizes=(50,), max_iter=1000, random_state=42),
10 ),
11 ]
12)
13
14# Fit pipeline on training data
15pipeline.fit(X_train, y_train)
16
17# Make predictions on test data
18y_pred_pipeline = pipeline.predict(X_test)
19mse_pipeline = mean_squared_error(y_test, y_pred_pipeline)
20
21print("Pipeline MSE:", mse_pipeline)
Pipeline MSE: 2116358.4104419583
When to Use Normalization vs Standardization
Understanding when to use normalization versus standardization is crucial for optimal model performance:
Use Normalization (Min-Max Scaling) when:
- Neural Networks: Most neural network algorithms expect inputs in the range [0, 1] or [-1, 1]. Normalization ensures optimal activation function behavior.
- Image Processing: Pixel values are naturally bounded between 0 and 255, making normalization to [0, 1] intuitive.
- Distance-based algorithms: When you need to preserve the original distribution shape and want all features on the same bounded scale.
- Sparse data: When working with sparse matrices where standardization might create dense representations.
Use Standardization (Z-score scaling) when:
- Regularized regression: LASSO and Ridge regression assume features are on the same scale with mean 0.
- Principal Component Analysis: PCA is based on variance, so standardization ensures equal contribution from all features.
- Outlier presence: Standardization is more robust to outliers than normalization.
- Gaussian assumptions: When your algorithm assumes normally distributed features.
Key Differences: To clarify the differences between normalization and standardization, here's a summary table and some explanatory notes:
Aspect | Normalization | Standardization |
---|---|---|
Range | [0, 1] | Unbounded (can be any value, positive or negative) |
Mean | Varies | 0 |
Std Dev | Varies | 1 |
Outlier Sensitivity | High | Lower (less sensitive) |
Distribution Shape | Preserved | May change |
What does this mean in practice?
-
Normalization rescales your feature values so that everything fits between 0 and 1. Think of stretching or compressing the values to fit inside a box between 0 and 1, but the original shape and distribution of your data is preserved (unless you have extreme outliers, which can shrink everything else unnaturally). Use this when you want all features to be on the same scale, especially for neural nets or algorithms sensitive to value magnitudes.
-
Standardization transforms your data so that each feature has a mean of zero and a standard deviation of one. The values are spread out according to how they deviate from the average. This can make your data look more like a "bell curve" (Gaussian), but may actually change the distribution shape—especially for skewed data.
-
Outlier Sensitivity: Normalization is more affected by outliers because just one very large or very small value can set the min or max, squishing the rest of your data. Standardization handles outliers a bit better since it centers on the mean and spreads based on standard deviation.
When to use which?
- Use Normalization (Min-Max scaling) for image data, neural networks, or when features naturally fit in a bounded range.
- Use Standardization (Z-score scaling) for algorithms that assume data is centered (mean = 0), or when the data has outliers or Gaussian-like distributions.
If you're unsure, try both and see which works better for your model!
Limitations and Considerations
When applying normalization, keep in mind several important limitations:
- Outlier sensitivity: Normalization is highly sensitive to outliers, as extreme values can compress the majority of data into a small range.
- Tip: Consider using robust scaling methods or removing outliers before normalization.
- Sparse data: For sparse matrices, normalization can convert the data into a dense format, which is inefficient and memory-intensive.
- Tip: Use
MinMaxScaler(feature_range=(0, 1))
with sparse data carefully.
- Tip: Use
- Categorical variables: Do not normalize one-hot encoded or ordinal categorical variables, as this can destroy their intended meaning.
- Target variable: In most cases, avoid normalizing the target variable unless there is a specific reason to do so.
When applying normalization to your data, it's important to be aware of several common pitfalls that can affect the quality of your results and potentially introduce bias into your models. Understanding these issues in advance can help you avoid costly mistakes and ensure that your preprocessing pipeline is robust and reliable. Be aware of these common pitfalls:
- Data leakage: Fitting the scaler on the entire dataset (including the test set) introduces information from the test data into the training process, leading to overly optimistic performance estimates.
- Best practice: Always fit the scaler only on the training data, then use it to transform both the training and test sets.
- Inconsistent scaling: Using different scalers for training and test data can result in mismatched feature distributions.
- Over-normalization: Normalizing features that are already in the desired range can be unnecessary or even harmful.
- Categorical confusion: Normalizing categorical variables that should remain discrete can undermine their interpretability and utility.
Practical Applications
Normalization plays a crucial role in a wide variety of real-world applications. For example, in computer vision tasks, it is common practice to rescale pixel values from their original range of [0, 255] down to [0, 1], which helps neural networks learn more effectively. Recommendation systems also benefit from normalization by bringing user ratings, item features, or interaction counts to a common scale, allowing algorithms to make fairer comparisons. In time series analysis, sensor readings that may be recorded in different units and scales are often normalized so that patterns can be more easily detected across multiple sensors. Natural language processing tasks involve working with features such as TF-IDF scores, document lengths, or word counts, all of which may be normalized to make the data more comparable across samples. Similarly, in financial modeling, normalization ensures that diverse features like stock prices, trading volumes, and economic indicators—all of which can have widely varying magnitudes—can be analyzed together on an equal footing.
Summary
Normalization is a fundamental preprocessing step that ensures features are treated fairly in machine learning algorithms. By transforming features to a common range, typically [0, 1], normalization:
- Enables fair comparison across variables with different ranges
- Improves the performance of neural networks and distance-based methods
- Prevents bias toward features with larger numeric ranges
- Stabilizes optimization by providing consistent input scales
The key to successful normalization is proper implementation: always fit the scaler on training data only, then transform both training and test sets using the fitted scaler. This prevents data leakage and ensures realistic model evaluation.
Choose normalization over standardization when working with neural networks, image data, or when you need to preserve the original distribution shape while ensuring all features contribute equally to the learning process.
Quiz
Ready to test your understanding of normalization? Challenge yourself with this quiz covering key concepts about feature scaling, min-max normalization, and when to use different scaling techniques. Good luck!
Reference

About the author: Michael Brenndoerfer
All opinions expressed here are my own and do not reflect the views of my employer.
Michael currently works as an Associate Director of Data Science at EQT Partners in Singapore, where he drives AI and data initiatives across private capital investments.
With over a decade of experience spanning private equity, management consulting, and software engineering, he specializes in building and scaling analytics capabilities from the ground up. He has published research in leading AI conferences and holds expertise in machine learning, natural language processing, and value creation through data.
Related Content

Data Quality & Outliers: Complete Guide to Measurement Error, Missing Data & Detection Methods
A comprehensive guide covering data quality fundamentals, including measurement error, systematic bias, missing data mechanisms, and outlier detection. Learn how to assess, diagnose, and improve data quality for reliable statistical analysis and machine learning.

Statistical Modeling Guide: Model Fit, Overfitting vs Underfitting & Cross-Validation
A comprehensive guide covering statistical modeling fundamentals, including measuring model fit with R-squared and RMSE, understanding the bias-variance tradeoff between overfitting and underfitting, and implementing cross-validation for robust model evaluation.

Variable Relationships: Complete Guide to Covariance, Correlation & Regression Analysis
A comprehensive guide covering relationships between variables, including covariance, correlation, simple and multiple regression. Learn how to measure, model, and interpret variable associations while understanding the crucial distinction between correlation and causation.
Stay updated
Get notified when I publish new articles on data and AI, private equity, technology, and more.