R-squared: Measuring Model Fit
R-squared, also known as the coefficient of determination, is a key metric used to evaluate how well a regression model explains the variability of the dependent variable. This section provides an overview of R-squared, its formula, interpretation, and visual intuition.
Introduction
When building regression models, it's important to assess how well the model fits the data. R-squared quantifies the proportion of the variance in the dependent variable that is predictable from the independent variables.
The R-squared Formula
R-squared is defined as:

$$R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}} = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$

Where:
- $SS_{\text{res}} = \sum_i (y_i - \hat{y}_i)^2$: Residual sum of squares (unexplained variance)
- $SS_{\text{tot}} = \sum_i (y_i - \bar{y})^2$: Total sum of squares (total variance)
- $y_i$: Actual value
- $\hat{y}_i$: Predicted value from the model
- $\bar{y}$: Mean of the actual values
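To make the formula concrete, here is a minimal sketch in Python that computes R-squared directly from the two sums of squares; the data arrays are made up purely for illustration:

```python
import numpy as np

# Hypothetical actual and predicted values, for illustration only
y_actual = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
y_pred = np.array([2.8, 5.3, 6.9, 9.4, 10.6])

# Residual sum of squares: variance left unexplained by the model
ss_res = np.sum((y_actual - y_pred) ** 2)

# Total sum of squares: total variance around the mean
ss_tot = np.sum((y_actual - np.mean(y_actual)) ** 2)

r_squared = 1 - ss_res / ss_tot
print(f"R-squared: {r_squared:.3f}")
```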
Mathematical Intuition
The formula can be understood through variance decomposition. The total variance in the dependent variable can be split into explained variance (how much the model explains) and unexplained variance (how much remains). R-squared represents the proportion of total variance that is explained by the model. For simple linear regression, R-squared equals the square of the Pearson correlation coefficient between the predictor and response variables ($R^2 = r^2$).
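As a quick numerical check of this identity, the following sketch fits a simple linear regression with NumPy on synthetic data (the slope, intercept, and noise level are arbitrary choices) and compares R-squared against the squared correlation:

```python
import numpy as np

rng = np.random.default_rng(42)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=2.0, size=x.size)

# Fit simple linear regression y = a*x + b via least squares
a, b = np.polyfit(x, y, deg=1)
y_pred = a * x + b

# R-squared from the variance decomposition
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

# Square of the Pearson correlation coefficient
r = np.corrcoef(x, y)[0, 1]

print(f"R-squared:     {r_squared:.6f}")
print(f"Correlation^2: {r**2:.6f}")  # matches for simple linear regression
```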
Interpretation
- $R^2 = 1$: The model explains all the variability of the response data around its mean (perfect fit).
- $R^2 = 0$: The model explains none of the variability (no better than using the mean).
- $R^2 < 0$: The model performs worse than simply using the mean as a predictor (rare but possible with poor models).
- Between 0 and 1: Indicates the proportion of variance explained by the model.
Example:
If $R^2 = 0.85$, then 85% of the variance in the dependent variable is explained by the model.
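These boundary cases can be verified with scikit-learn's `r2_score`; the arrays below are made-up values chosen to hit each case, including a predictor that does worse than the mean:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Perfect predictions -> R^2 = 1
print(r2_score(y_true, y_true))                               # 1.0

# Predicting the mean (3.0) for every point -> R^2 = 0
print(r2_score(y_true, np.full_like(y_true, 3.0)))            # 0.0

# Predictions worse than the mean -> negative R^2
print(r2_score(y_true, np.array([5.0, 4.0, 3.0, 2.0, 1.0])))  # -3.0
```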
Practical Guidelines
While R-squared interpretation depends on context, here are general guidelines: $R^2 > 0.9$ indicates excellent fit (though be cautious of overfitting), $0.7$ to $0.9$ suggests good fit, $0.5$ to $0.7$ shows moderate fit, and below $0.5$ indicates poor fit. However, these are rough guidelines: in some domains, like the social sciences, $R^2 = 0.3$ might be considered good, while in others, like physics, $R^2 = 0.9$ might be unacceptable.
Visualizing R-squared
A higher R-squared means the regression line fits the data points more closely. Below is a conceptual illustration:
Plot 1: High vs. Low R-squared
- Left: Data points closely follow the regression line (high R-squared)
- Right: Data points are widely scattered around the line (low R-squared)
High R-squared scenario demonstrating excellent model fit. The data points (blue circles) closely follow the regression line (red), indicating that the model explains a large proportion of the variance in the dependent variable. This visualization shows how a well-fitting model captures the underlying relationship between variables with minimal unexplained variation.
Low R-squared scenario showing poor model fit. The data points (blue circles) are widely scattered around the regression line (red), indicating that the model explains only a small proportion of the variance. This demonstrates how a poorly fitting model fails to capture the underlying relationship, leaving substantial unexplained variation in the data.
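A figure along these lines could be generated with a sketch like the following, assuming NumPy and Matplotlib; the noise levels are arbitrary values chosen to produce one tight panel and one scattered panel:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 60)

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
for ax, noise, title in zip(axes, [1.0, 8.0], ["High R-squared", "Low R-squared"]):
    # Same underlying line, different noise levels
    y = 2.0 * x + 1.0 + rng.normal(scale=noise, size=x.size)
    a, b = np.polyfit(x, y, deg=1)
    y_pred = a * x + b
    r2 = 1 - np.sum((y - y_pred) ** 2) / np.sum((y - y.mean()) ** 2)

    ax.scatter(x, y, color="tab:blue", alpha=0.7, label="data")
    ax.plot(x, y_pred, color="red", label="fit")
    ax.set_title(f"{title} (R² = {r2:.2f})")
    ax.legend()

plt.tight_layout()
plt.show()
```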
Adjusted R-squared
When dealing with multiple regression, the standard R-squared can be misleading because it always increases (or stays the same) when adding more predictors, even if those predictors don't improve the model. Adjusted R-squared accounts for the number of predictors using the formula $R^2_{\text{adj}} = 1 - (1 - R^2)\frac{n - 1}{n - k - 1}$, where $n$ is the number of observations and $k$ is the number of predictors. Unlike standard R-squared, adjusted R-squared can decrease when adding irrelevant predictors, providing a more honest assessment of model quality in multiple regression.
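A minimal sketch of this adjustment in plain Python (the R-squared, $n$, and $k$ values are illustrative only):

```python
def adjusted_r_squared(r_squared: float, n: int, k: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

# Same raw R^2, but more predictors lowers the adjusted score
print(adjusted_r_squared(0.85, n=100, k=3))   # ~0.845
print(adjusted_r_squared(0.85, n=100, k=30))  # ~0.785
```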
Comparison with Other Metrics
R-squared is just one of many regression evaluation metrics. While it measures the proportion of variance explained, other metrics provide different insights: RMSE and MAE assess absolute prediction error, AIC and BIC trade off fit against model complexity, and cross-validation evaluates out-of-sample performance. R-squared works well for a quick assessment of model fit and for communicating results to non-technical audiences, but for understanding prediction errors, use RMSE or MAE; for model selection, use AIC or BIC; and for out-of-sample performance, use cross-validation.
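To see a few of these metrics side by side, here is a sketch using scikit-learn, reusing the made-up arrays from the earlier formula example:

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

# Hypothetical values, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
y_pred = np.array([2.8, 5.3, 6.9, 9.4, 10.6])

print(f"R-squared: {r2_score(y_true, y_pred):.3f}")  # proportion of variance explained
print(f"RMSE:      {np.sqrt(mean_squared_error(y_true, y_pred)):.3f}")  # same units as y
print(f"MAE:       {mean_absolute_error(y_true, y_pred):.3f}")          # less sensitive to outliers
```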
Common Misconceptions
Several misconceptions about R-squared are widespread. First, higher R-squared isn't always better: a model with $R^2 = 0.99$ might be overfitted and perform poorly on new data. Second, R-squared doesn't indicate causation, only the strength of an association. Third, an R-squared of 0.3 isn't always bad; context matters, as the social sciences might consider this excellent while physics would find it poor. Fourth, R-squared doesn't apply to all model types: it's designed for linear regression, and models such as logistic regression require pseudo-R-squared measures.
When R-squared is Misleading
R-squared can be misleading in several scenarios. For nonlinear relationships, R-squared assumes linearity and might be low even when the model captures the true relationship well. Outliers can dramatically affect R-squared, making it unreliable for overall model performance. When variance changes across the prediction range (heteroscedasticity), R-squared might not reflect true model quality. Overfitting can make R-squared artificially high when the model has too many parameters relative to observations. Finally, with very small sample sizes, R-squared can be unstable and misleading.
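As one illustration of the outlier problem, the sketch below fits a line to synthetic data with and without a single extreme point (the outlier's coordinates are arbitrary) and compares the resulting R-squared values:

```python
import numpy as np

def r_squared(x, y):
    """Fit a least-squares line and return its R-squared."""
    a, b = np.polyfit(x, y, deg=1)
    y_pred = a * x + b
    return 1 - np.sum((y - y_pred) ** 2) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 30)
y = 2.0 * x + rng.normal(scale=1.0, size=x.size)

print(f"Without outlier: {r_squared(x, y):.3f}")

# Add a single extreme point and refit
x_out = np.append(x, 10.0)
y_out = np.append(y, -40.0)
print(f"With outlier:    {r_squared(x_out, y_out):.3f}")
```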
Limitations
- R-squared does not indicate whether a regression model is appropriate.
- It can be artificially high for models with many predictors (use adjusted R-squared for multiple regression).
- A high R-squared does not imply causation.
- R-squared doesn't measure prediction accuracy on new data.
- It can be misleading with nonlinear relationships or outliers.
R-squared is a useful first check for model fit, but always consider it alongside other diagnostics and domain knowledge.
Summary
In summary, R-squared measures how well a regression model explains the variability of the dependent variable. It typically ranges from 0 to 1, with higher values indicating a better fit (though it can be negative when a model performs worse than the mean). However, R-squared alone does not guarantee that the model is appropriate or meaningful, so it should be interpreted in context and used alongside other evaluation metrics.
