A comprehensive guide to R-squared, the coefficient of determination. Learn what R-squared means, how to calculate it, interpret its value, and use it to evaluate regression models. Includes formulas, intuitive explanations, practical guidelines, and visualizations.

This article is part of the free-to-read Data Science Handbook
R-squared: Measuring Model Fit
R-squared, also known as the coefficient of determination, is a key metric used to evaluate how well a regression model explains the variability of the dependent variable. This section provides an overview of R-squared, its formula, interpretation, and visual intuition.
Introduction
When building regression models, it's important to assess how well the model fits the data. R-squared quantifies the proportion of the variance in the dependent variable that is predictable from the independent variables.
The R-squared Formula
R-squared is defined as:

R² = 1 − SS_res / SS_tot

Where:
- SS_res = Σ(y_i − ŷ_i)²: Residual sum of squares (unexplained variance)
- SS_tot = Σ(y_i − ȳ)²: Total sum of squares (total variance)
- y_i: Actual value
- ŷ_i: Predicted value from the model
- ȳ: Mean of the actual values
Mathematical Intuition
The formula can be understood through variance decomposition. The total variance in the dependent variable can be split into explained variance (how much the model explains) and unexplained variance (how much remains). R-squared represents the proportion of total variance that is explained by the model. For simple linear regression, R-squared equals the square of the correlation coefficient between the predictor and response variables (R² = r²).
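This decomposition is easy to verify numerically. The sketch below, using made-up data, fits a simple linear regression with NumPy, computes R² from the residual and total sums of squares, and confirms that it matches the squared correlation coefficient:

```python
import numpy as np

# Hypothetical data: a linear trend with a little noise
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])

# Fit a simple linear regression by least squares
slope, intercept = np.polyfit(x, y, 1)
y_hat = slope * x + intercept

# Variance decomposition
ss_res = np.sum((y - y_hat) ** 2)      # residual (unexplained) sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)   # total sum of squares
r_squared = 1 - ss_res / ss_tot

# For simple linear regression, R^2 equals the squared correlation
r = np.corrcoef(x, y)[0, 1]
print(round(r_squared, 4), round(r ** 2, 4))  # the two values match
```

Because the data here are nearly perfectly linear, R² comes out close to 1; noisier data would pull it down.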
Interpretation
- R² = 1: The model explains all the variability of the response data around its mean (perfect fit).
- R² = 0: The model explains none of the variability (no better than using the mean).
- R² < 0: The model performs worse than simply using the mean as a predictor (rare, but possible with poor models).
- 0 < R² < 1: Indicates the proportion of variance explained by the model.
Example:
If R² = 0.85, then 85% of the variance in the dependent variable is explained by the model.
Visualizing R-squared
A higher R-squared means the regression line fits the data points more closely. Below is a conceptual illustration:
Plot 1: High vs. Low R-squared
- Left: Data points closely follow the regression line (high R-squared)
- Right: Data points are widely scattered around the line (low R-squared)

High R-squared scenario demonstrating excellent model fit. The data points (blue circles) closely follow the regression line (red), indicating that the model explains a large proportion of the variance in the dependent variable. This visualization shows how a well-fitting model captures the underlying relationship between variables with minimal unexplained variation.

Low R-squared scenario showing poor model fit. The data points (blue circles) are widely scattered around the regression line (red), indicating that the model explains only a small proportion of the variance. This demonstrates how a poorly fitting model fails to capture the underlying relationship, leaving substantial unexplained variation in the data.
Adjusted R-squared
When dealing with multiple regression, the standard R-squared can be misleading because it always increases (or stays the same) when adding more predictors, even if those predictors don't improve the model.
Adjusted R-squared accounts for the number of predictors using the formula:

Adjusted R² = 1 − (1 − R²) × (n − 1) / (n − p − 1)

Where:
- n is the number of observations
- p is the number of predictors
Unlike standard R-squared, adjusted R-squared can decrease when adding irrelevant predictors, providing a more honest assessment of model quality in multiple regression.
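The adjustment is simple to apply. The sketch below, using a hypothetical model with R² = 0.85 fitted on n = 50 observations, shows how the penalty grows as the number of predictors increases:

```python
def adjusted_r_squared(r_squared, n, p):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

# Hypothetical scenario: R^2 = 0.85 from n = 50 observations
few = adjusted_r_squared(0.85, n=50, p=3)    # few predictors: small penalty
many = adjusted_r_squared(0.85, n=50, p=20)  # many predictors: larger penalty
print(round(few, 4), round(many, 4))
```

With the same raw R², the 20-predictor model receives a noticeably lower adjusted score, reflecting the risk that many of those predictors add noise rather than signal.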
Comparison with Other Metrics
R-squared is just one of many regression evaluation metrics. While R-squared measures the proportion of variance explained, other metrics provide different insights:
- RMSE and MAE: Assess absolute prediction errors in the same units as your data
- AIC and BIC: Help with model selection by balancing fit and complexity
- Cross-validation: Evaluates out-of-sample performance
When to use each metric:
- For quick assessment of model fit: Use R-squared
- For understanding prediction errors: Use RMSE or MAE
- For model selection: Use AIC or BIC
- For out-of-sample performance: Use cross-validation
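The sketch below, using made-up predictions, computes R², RMSE, and MAE side by side on the same data, showing how each metric summarizes error differently:

```python
import numpy as np

# Hypothetical actual and predicted values
y_true = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
y_pred = np.array([2.8, 5.3, 6.9, 9.4, 10.6])

residuals = y_true - y_pred
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2 = 1 - ss_res / ss_tot                 # proportion of variance explained
rmse = np.sqrt(np.mean(residuals ** 2))  # typical error, in the data's units
mae = np.mean(np.abs(residuals))         # average error, less sensitive to large misses
print(f"R^2={r2:.3f}  RMSE={rmse:.3f}  MAE={mae:.3f}")
```

R² is unitless, while RMSE and MAE are in the same units as y, which is why they are the better choice when you need to communicate how far off predictions are in practical terms.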
Common Misconceptions
Several misconceptions about R-squared are widespread:
- Higher R-squared is always better: Not true! A model with a very high R-squared (say, 0.99) might be overfitted and perform poorly on new data.
- R-squared indicates causation: R-squared only measures correlation, not causation. A high R-squared doesn't mean your predictors cause changes in the outcome.
- Low R-squared is always bad: Context matters. In the social sciences, an R-squared of 0.3 might be considered excellent, while in physics an R-squared of 0.9 might be unacceptable.
- R-squared works for all models: R-squared is designed for linear regression. For logistic regression and other models, you need pseudo-R-squared measures.
When R-squared is Misleading
R-squared can be misleading in several scenarios:
- Nonlinear relationships: R-squared assumes linearity and might be low even when the model captures the true relationship well.
- Outliers: Outliers can dramatically affect R-squared, making it unreliable for assessing overall model performance.
- Heteroscedasticity: When variance changes across the prediction range, R-squared might not reflect true model quality.
- Overfitting: R-squared can be artificially high when the model has too many parameters relative to observations.
- Small sample sizes: With very few observations, R-squared can be unstable and misleading.
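The outlier effect is easy to demonstrate. The sketch below, using synthetic data, fits clean linear data, then appends a single extreme outlier and refits, showing how sharply R² can fall:

```python
import numpy as np

def r_squared(x, y):
    """Fit a simple linear regression and return its R^2."""
    slope, intercept = np.polyfit(x, y, 1)
    y_hat = slope * x + intercept
    return 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 30)
y = 2 * x + 1 + rng.normal(scale=0.5, size=x.size)  # clean linear data

r2_clean = r_squared(x, y)  # high: the line fits well

# Add one extreme outlier and refit
x_out = np.append(x, 5.0)
y_out = np.append(y, 60.0)
r2_outlier = r_squared(x_out, y_out)  # drops noticeably

print(round(r2_clean, 3), round(r2_outlier, 3))
```

A single bad point inflates the residual sum of squares far more than the total sum of squares, so R² collapses even though the underlying relationship for the other 30 points is unchanged.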
Limitations
- R-squared does not indicate whether a regression model is appropriate.
- It can be artificially high for models with many predictors (use adjusted R-squared for multiple regression).
- A high R-squared does not imply causation.
- R-squared doesn't measure prediction accuracy on new data.
- It can be misleading with nonlinear relationships or outliers.
R-squared is a useful first check for model fit, but always consider it alongside other diagnostics and domain knowledge.
Summary
In summary, R-squared measures how well a regression model explains the variability of the dependent variable. For most fitted models it falls between 0 and 1, with higher values indicating a better fit, though it can be negative for models that predict worse than the mean. However, R-squared alone does not guarantee that the model is appropriate or meaningful, so it should be interpreted in context and used alongside other evaluation metrics.
About the author: Michael Brenndoerfer
All opinions expressed here are my own and do not reflect the views of my employer.
Michael currently works as an Associate Director of Data Science at EQT Partners in Singapore, where he drives AI and data initiatives across private capital investments.
With over a decade of experience spanning private equity, management consulting, and software engineering, he specializes in building and scaling analytics capabilities from the ground up. He has published research in leading AI conferences and holds expertise in machine learning, natural language processing, and value creation through data.