Regression Analysis: Beta Estimation & Factor Models in Finance

Michael Brenndoerfer · December 11, 2025 · 51 min read

Master regression analysis for finance: estimate market beta, test alpha significance, diagnose heteroskedasticity, and apply multi-factor models with robust standard errors.

Regression Analysis for Financial Relationships

Financial markets are built on relationships. The return of a stock depends on broad market movements. Bond yields respond to changes in interest rate expectations. Currency pairs move with economic fundamentals. Understanding these relationships, including quantifying their strength, testing their significance, and using them for prediction, requires a powerful statistical framework: regression analysis.

Regression analysis provides the foundation for many core concepts in quantitative finance. When you hear that a stock has a "beta of 1.2," that number comes from a regression. When you claim to generate "alpha" as a portfolio manager, you're making a statement about regression intercepts. When we test whether momentum or value factors explain returns, we're running regressions. The Capital Asset Pricing Model, which we'll explore in depth in Part IV, is fundamentally a regression model relating expected returns to market exposure.

Building on the statistical inference techniques from Part I and the time-series models from the previous chapters, this chapter develops regression analysis with a focus on financial applications. We'll start with the ordinary least squares framework, then apply it to estimate factor exposures, conduct hypothesis tests, and diagnose the problems that arise when regression meets financial time series data. By the end, you'll be able to estimate a stock's market beta, test whether it generates statistically significant alpha, and correct for the autocorrelation and heteroskedasticity that plague financial data.

The Ordinary Least Squares Framework

Regression analysis models the relationship between a dependent variable and one or more explanatory variables. In finance, we often want to explain asset returns using factor returns, or forecast prices using economic indicators. The power of regression lies in its ability to decompose variation: separating what can be explained by observable factors from what remains unexplained. This decomposition forms the basis for understanding risk, measuring performance, and building predictive models.

The Simple Linear Model

Consider a linear relationship between a dependent variable $y$ and an independent variable $x$. The core idea is straightforward: we believe that changes in $x$ produce predictable changes in $y$, but this relationship is obscured by random noise. Our goal is to uncover the underlying systematic relationship despite this noise.

$$y_i = \alpha + \beta x_i + \varepsilon_i$$

where:

  • $y_i$: observation of the dependent variable for observation $i$
  • $x_i$: corresponding independent variable
  • $\alpha$: intercept, representing the expected value of $y$ when $x = 0$
  • $\beta$: slope coefficient, representing the change in $y$ for a one-unit change in $x$
  • $\varepsilon_i$: error term capturing the variation in $y$ not explained by $x$

The intercept $\alpha$ anchors the regression line at a specific level. In financial contexts, when we regress stock returns on market returns, the intercept represents the return we would expect when the market return is exactly zero. This parameter captures the baseline performance that exists independently of market movements.

The slope coefficient $\beta$ is the heart of the regression. It tells us the marginal effect: for every one-unit increase in $x$, we expect $y$ to change by $\beta$ units on average. In the context of finance, if $\beta = 1.3$ in a stock-versus-market regression, then a 1% market move corresponds to an expected 1.3% move in the stock. This amplification (or dampening, if $\beta < 1$) characterizes the stock's systematic risk profile.

The error term $\varepsilon_i$ represents everything we cannot observe or measure that affects $y$. This includes firm-specific news, measurement error, and the influence of countless unmeasured factors. We assume it has zero mean, constant variance, and is uncorrelated across observations; these are assumptions we'll scrutinize carefully later. These assumptions matter because they determine whether our statistical inference is valid and whether our confidence intervals are trustworthy.

The OLS Estimator

Having specified the model, we need a method to estimate the unknown parameters α\alpha and β\beta. Ordinary least squares takes an intuitive approach: find the line that minimizes the total squared distance between observed values and the line's predictions. Why squared distances? Squaring serves two purposes: it treats positive and negative deviations symmetrically, and it penalizes large errors more heavily than small ones, making the estimator sensitive to outliers but also more efficient when errors are normally distributed.

Ordinary least squares finds the estimators by minimizing the sum of squared residuals $S$:

$$S(\alpha, \beta) = \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2$$

where:

  • $S$: sum of squared residuals to be minimized
  • $y_i, x_i$: dependent and independent variable observations for observation $i$
  • $\alpha, \beta$: intercept and slope coefficients to be estimated
  • $n$: number of observations

This objective function creates a surface in two-dimensional parameter space. The optimal parameter values correspond to the lowest point on this surface, where the total squared error reaches its minimum. Finding this minimum requires calculus: we take derivatives with respect to each parameter and set them equal to zero.

Taking partial derivatives with respect to $\alpha$ and $\beta$ and setting them to zero yields the normal equations:

$$\begin{aligned} \frac{\partial S}{\partial \alpha} &= -2 \sum_{i=1}^n (y_i - \alpha - \beta x_i) = 0 \\ \frac{\partial S}{\partial \beta} &= -2 \sum_{i=1}^n x_i (y_i - \alpha - \beta x_i) = 0 \end{aligned}$$

The first equation has an intuitive interpretation. It says that the sum of residuals must equal zero, meaning the regression line must pass through the center of mass of the data. Positive errors exactly balance negative errors. From the first equation, dividing by $n$ yields $\bar{y} - \alpha - \beta\bar{x} = 0$, which gives $\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}$. This result confirms that the fitted line passes through the point $(\bar{x}, \bar{y})$, the sample means of both variables.

Substituting this into the second equation eliminates $\alpha$:

$$\sum_{i=1}^n x_i \left(y_i - (\bar{y} - \hat{\beta}\bar{x}) - \hat{\beta} x_i\right) = 0$$

Rearranging terms to group $y$ and $x$:

$$\sum_{i=1}^n x_i (y_i - \bar{y}) = \hat{\beta} \sum_{i=1}^n x_i (x_i - \bar{x})$$

Using the algebraic identity $\sum x_i(z_i - \bar{z}) = \sum (x_i - \bar{x})(z_i - \bar{z})$ to center the variables:

$$\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) = \hat{\beta} \sum_{i=1}^n (x_i - \bar{x})^2$$

Solving for $\hat{\beta}$, and using the earlier relation for $\hat{\alpha}$, gives:

$$\hat{\beta} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{\text{Cov}(x, y)}{\text{Var}(x)}, \qquad \hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}$$

where:

  • $\hat{\beta}$: estimated slope coefficient
  • $\hat{\alpha}$: estimated intercept
  • $\bar{x}, \bar{y}$: sample means of $x$ and $y$
  • $\text{Cov}(x, y)$: sample covariance between the independent and dependent variables
  • $\text{Var}(x)$: sample variance of the independent variable

The slope coefficient equals the covariance between $x$ and $y$ divided by the variance of $x$. This formula reveals the essential logic of regression. The numerator, covariance, measures how $x$ and $y$ move together. The denominator, variance, normalizes this co-movement by the spread of $x$. The ratio measures how much $y$ moves, on average, when $x$ moves by one unit. If $x$ has high variance, we have more information to estimate the relationship precisely, so we divide by a larger number. The intercept ensures the regression line passes through the point $(\bar{x}, \bar{y})$.

Matrix Formulation

For multiple regression with $k$ explanatory variables, the matrix formulation is essential. The scalar derivation becomes unwieldy with multiple variables, but matrix algebra provides an elegant and general solution. This formulation also reveals the geometric interpretation: OLS finds the projection of the dependent variable onto the space spanned by the explanatory variables.

We write:

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$$

where:

  • $\mathbf{y}$: $n \times 1$ vector of observations
  • $\mathbf{X}$: $n \times (k+1)$ matrix of explanatory variables (including a column of ones for the intercept)
  • $\boldsymbol{\beta}$: $(k+1) \times 1$ vector of coefficients
  • $\boldsymbol{\varepsilon}$: $n \times 1$ vector of errors

The design matrix $\mathbf{X}$ stacks all observations of all explanatory variables. Each row represents one observation, and each column represents one variable. The first column typically consists of ones, which corresponds to the intercept term. This structure allows us to handle any number of explanatory variables using the same mathematical framework.

As we covered in Part I's treatment of linear algebra, the OLS solution minimizes the objective function $S(\boldsymbol{\beta})$. We expand the quadratic form and take the derivative:

$$\begin{aligned} S(\boldsymbol{\beta}) &= (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \\ &= \mathbf{y}'\mathbf{y} - \mathbf{y}'\mathbf{X}\boldsymbol{\beta} - \boldsymbol{\beta}'\mathbf{X}'\mathbf{y} + \boldsymbol{\beta}'\mathbf{X}'\mathbf{X}\boldsymbol{\beta} \\ &= \mathbf{y}'\mathbf{y} - 2\boldsymbol{\beta}'\mathbf{X}'\mathbf{y} + \boldsymbol{\beta}'\mathbf{X}'\mathbf{X}\boldsymbol{\beta} && \text{(since } \mathbf{y}'\mathbf{X}\boldsymbol{\beta} \text{ is scalar)} \end{aligned}$$

The expansion follows from distributing the transpose operation and recognizing that $\mathbf{y}'\mathbf{X}\boldsymbol{\beta}$ produces a scalar. Since a scalar equals its own transpose, we have $\mathbf{y}'\mathbf{X}\boldsymbol{\beta} = \boldsymbol{\beta}'\mathbf{X}'\mathbf{y}$, allowing us to combine these terms.

Taking the derivative with respect to $\boldsymbol{\beta}$:

$$\frac{\partial S}{\partial \boldsymbol{\beta}} = -2\mathbf{X}'\mathbf{y} + 2\mathbf{X}'\mathbf{X}\boldsymbol{\beta}$$

Setting the derivative to zero yields the normal equations:

$$\mathbf{X}'\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}'\mathbf{y}$$

These normal equations have a clear interpretation. The term $\mathbf{X}'\mathbf{y}$ captures the correlation between each explanatory variable and the dependent variable. The term $\mathbf{X}'\mathbf{X}$ is the Gram matrix of the regressors, encoding their variances and mutual correlations. The normal equations balance these two components.

Assuming $\mathbf{X}'\mathbf{X}$ is invertible:

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$$

where:

  • $\hat{\boldsymbol{\beta}}$: vector of estimated coefficients
  • $\mathbf{X}$: design matrix of explanatory variables
  • $\mathbf{y}$: vector of dependent variable observations

Intuitively, this generalizes the scalar slope formula $\text{Cov}(x,y)/\text{Var}(x)$. The term $\mathbf{X}'\mathbf{y}$ captures the covariation between regressors and the dependent variable, while $(\mathbf{X}'\mathbf{X})^{-1}$ normalizes by the variance-covariance structure of the regressors, effectively "dividing" by the variance while adjusting for correlations between explanatory variables. When regressors are correlated with each other, the matrix inverse accounts for this overlap, ensuring that each coefficient represents the unique contribution of its variable after controlling for all others.

This closed-form solution is the foundation for all OLS regression analysis. Its existence guarantees that we can always compute the optimal coefficients directly, without iterative optimization. The requirement that $\mathbf{X}'\mathbf{X}$ be invertible means no explanatory variable can be a perfect linear combination of others, a condition known as the absence of perfect multicollinearity.
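
To make the closed-form solution concrete, here is a minimal, self-contained sketch that builds a small design matrix and solves the normal equations with NumPy. The data and coefficients are made up for illustration; in practice, solving the linear system is preferred to explicitly inverting $\mathbf{X}'\mathbf{X}$.

```python
import numpy as np

# Minimal sketch: solve the normal equations X'X b = X'y directly.
rng = np.random.default_rng(0)
n, k = 200, 2

# Design matrix: a column of ones (intercept) plus two explanatory variables
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
true_beta = np.array([0.5, 1.2, -0.7])
y = X @ true_beta + rng.normal(scale=0.3, size=n)

# beta_hat = (X'X)^{-1} X'y, computed by solving the linear system
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # estimates should land close to [0.5, 1.2, -0.7]
```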

Factor Exposures and the CAPM Beta

The single most important application of regression in finance is estimating how an asset responds to systematic risk factors. This relationship, called factor exposure or factor loading, tells us how much of an asset's return can be explained by common factors versus idiosyncratic risk. Understanding factor exposures is essential for portfolio construction, risk management, and performance evaluation. You cannot diversify away systematic risk, so measuring these exposures is crucial for understanding and pricing that risk.

The Market Model

The simplest factor model relates individual stock returns to market returns. This approach, developed in the early days of modern finance, recognizes that stocks tend to move together because they are all affected by common economic forces. Broad market movements, driven by changes in interest rates, economic growth expectations, and investor sentiment, influence virtually all stocks to varying degrees.

$$r_{i,t} = \alpha_i + \beta_i r_{m,t} + \varepsilon_{i,t}$$

where:

  • $r_{i,t}$: return on stock $i$ at time $t$
  • $r_{m,t}$: return on the market index
  • $\beta_i$: beta coefficient, measuring sensitivity to market movements
  • $\alpha_i$: alpha coefficient, measuring return unexplained by the market
  • $\varepsilon_{i,t}$: idiosyncratic error term

This is called the market model or single-index model. The coefficient $\beta_i$ measures the stock's systematic risk, representing how much the stock moves when the market moves. A stock with $\beta = 1.2$ rises 1.2% on average when the market rises 1%, and falls 1.2% when the market falls 1%. This amplification makes high-beta stocks more volatile than the market and more risky in a portfolio context. Conversely, a stock with $\beta = 0.7$ is defensive: it dampens market movements rather than amplifying them.

The idiosyncratic error term $\varepsilon_{i,t}$ captures firm-specific events: earnings surprises, management changes, product launches, and other news that affects only this particular company. Unlike systematic risk, idiosyncratic risk can be diversified away by holding many stocks, because positive surprises at some firms offset negative surprises at others.
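
The diversification claim is easy to verify with a quick simulation. The sketch below, using made-up parameters, gives every stock the same market exposure plus independent firm-specific noise; the residual volatility of an equal-weighted portfolio shrinks roughly as $1/\sqrt{N}$, while the market component does not diversify away.

```python
import numpy as np

# Sketch: idiosyncratic risk diversifies away, systematic risk does not.
rng = np.random.default_rng(1)
n_days, idio_vol = 1000, 0.015
market = rng.normal(0.0004, 0.012, n_days)

for n_stocks in [1, 10, 100]:
    # Each stock: beta of 1 on the market plus independent firm-specific noise
    idio = rng.normal(0, idio_vol, size=(n_days, n_stocks))
    portfolio = market + idio.mean(axis=1)  # equal-weighted portfolio return
    resid_vol = (portfolio - market).std()
    print(f"{n_stocks:4d} stocks: residual vol = {resid_vol:.4f} "
          f"(theory ~ {idio_vol / np.sqrt(n_stocks):.4f})")
```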

Beta

The beta coefficient measures an asset's sensitivity to market movements. Mathematically, it equals the covariance of the asset's return with the market return, divided by the variance of the market return:

$$\beta_i = \frac{\text{Cov}(r_i, r_m)}{\text{Var}(r_m)}$$

where:

  • $\beta_i$: beta coefficient for asset $i$
  • $\text{Cov}(r_i, r_m)$: covariance between asset and market returns
  • $\text{Var}(r_m)$: variance of market returns

This formula has a clear interpretation. The numerator measures how the asset and market move together. The denominator normalizes by market volatility. A stock that moves perfectly with the market, with the same magnitude, has beta equal to one. The market portfolio itself always has a beta of exactly one.
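
Because beta is just a ratio of covariance to variance, it can be read directly off a return covariance matrix. A minimal sketch with simulated returns (all numbers illustrative) also confirms that the market's beta to itself is exactly one:

```python
import numpy as np

# Sketch: beta as Cov(r_i, r_m) / Var(r_m), read from a covariance matrix
rng = np.random.default_rng(2)
market = rng.normal(0.0004, 0.012, 1000)
stock = 0.0002 + 1.3 * market + rng.normal(0, 0.015, 1000)

cov = np.cov(stock, market)          # 2x2 sample covariance matrix
beta_stock = cov[0, 1] / cov[1, 1]   # Cov(stock, market) / Var(market)
beta_market = cov[1, 1] / cov[1, 1]  # the market's beta to itself is 1
print(round(beta_stock, 3), beta_market)
```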

Excess Returns and the CAPM

The Capital Asset Pricing Model, which we'll derive formally in Part IV, states that the expected excess return on any asset should be proportional to its beta. This fundamental result connects risk and return: assets with higher systematic risk must offer higher expected returns to compensate investors for bearing that risk. The CAPM provides the theoretical foundation for using beta as a measure of risk.

$$E[r_i - r_f] = \beta_i \cdot E[r_m - r_f]$$

where:

  • $E[r_i - r_f]$: expected excess return on asset $i$
  • $E[r_m - r_f]$: expected market risk premium
  • $r_f$: risk-free interest rate
  • $\beta_i$: asset's beta

The left side represents the expected return above the risk-free rate, which is the premium investors demand for taking risk. The right side says this premium should equal beta times the market risk premium. An asset with beta of 1.5 should earn 1.5 times the market risk premium, because it carries 1.5 times the market's systematic risk.

To test this empirically, we regress excess returns:

$$r_{i,t} - r_{f,t} = \alpha_i + \beta_i (r_{m,t} - r_{f,t}) + \varepsilon_{i,t}$$

where:

  • $r_{i,t} - r_{f,t}$: excess return on asset $i$
  • $r_{m,t} - r_{f,t}$: excess return on the market portfolio
  • $\alpha_i$: Jensen's alpha measuring abnormal performance
  • $\beta_i$: systematic risk exposure
  • $\varepsilon_{i,t}$: idiosyncratic error term

If the CAPM holds, the intercept $\alpha_i$ should be zero, because all expected return should be explained by beta exposure to the market. A significantly positive alpha suggests the stock earns more than its risk-adjusted required return, while a negative alpha means it underperforms relative to its risk. This interpretation makes the alpha regression one of the most scrutinized statistical tests in investment management. Every active manager claims to generate alpha, and every client wants evidence to support that claim.

Implementing Beta Estimation

Let's estimate the market beta for a stock using historical data. We'll use synthetic data that mimics realistic return characteristics.

In[2]:
Code
import numpy as np
import pandas as pd

# Set random seed for reproducibility
np.random.seed(42)

# Generate synthetic market returns (daily, 2 years)
n_days = 504
market_returns = np.random.normal(
    0.0004, 0.012, n_days
)  # ~10% annual return, 19% vol

# Generate stock returns with true beta = 1.3, alpha = 0.0002 (small positive)
true_beta = 1.3
true_alpha = 0.0002
idiosyncratic_vol = 0.015
stock_returns = (
    true_alpha
    + true_beta * market_returns
    + np.random.normal(0, idiosyncratic_vol, n_days)
)

# Create DataFrame
dates = pd.date_range(start="2022-01-01", periods=n_days, freq="B")
df = pd.DataFrame(
    {
        "date": dates,
        "market_return": market_returns,
        "stock_return": stock_returns,
    }
)

Now we estimate the regression using both manual calculation and statsmodels:

In[3]:
Code
# Manual OLS calculation
x_mean = df["market_return"].mean()
y_mean = df["stock_return"].mean()

# Calculate beta (covariance / variance)
covariance = (
    (df["market_return"] - x_mean) * (df["stock_return"] - y_mean)
).sum()
variance = ((df["market_return"] - x_mean) ** 2).sum()
beta_manual = covariance / variance

# Calculate alpha
alpha_manual = y_mean - beta_manual * x_mean

# Calculate fitted values and residuals
fitted_values = alpha_manual + beta_manual * df["market_return"]
residuals = df["stock_return"] - fitted_values
Out[4]:
Console
Manual OLS Estimation:
  Beta:  1.3698 (true: 1.3)
  Alpha: 0.000707 (true: 0.0002)

Residual standard deviation: 0.014560
True idiosyncratic vol: 0.015000

The manual estimates closely match the true parameters used to generate the data, confirming the accuracy of the OLS formulas.

Let's also use statsmodels for a complete regression output:

In[5]:
Code
import statsmodels.api as sm

# Add constant for intercept
X = sm.add_constant(df["market_return"])
y = df["stock_return"]

# Fit OLS regression
model = sm.OLS(y, X).fit()
Out[6]:
Console
                            OLS Regression Results                            
==============================================================================
Dep. Variable:           stock_return   R-squared:                       0.552
Model:                            OLS   Adj. R-squared:                  0.552
Method:                 Least Squares   F-statistic:                     619.7
Date:                Sat, 17 Jan 2026   Prob (F-statistic):           1.09e-89
Time:                        22:08:06   Log-Likelihood:                 1417.0
No. Observations:                 504   AIC:                            -2830.
Df Residuals:                     502   BIC:                            -2822.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
=================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
const             0.0007      0.001      1.088      0.277      -0.001       0.002
market_return     1.3698      0.055     24.894      0.000       1.262       1.478
==============================================================================
Omnibus:                        0.311   Durbin-Watson:                   1.998
Prob(Omnibus):                  0.856   Jarque-Bera (JB):                0.397
Skew:                           0.052   Prob(JB):                        0.820
Kurtosis:                       2.909   Cond. No.                         84.8
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The regression output provides the coefficient estimates along with standard errors, t-statistics, and p-values. Let's visualize the relationship:

Out[7]:
Visualization
Scatter plot showing positive linear relationship between market and stock returns with regression line.
Daily stock returns plotted against market returns with the estimated OLS regression line. The slope of approximately 1.37 is the stock's estimated market beta, indicating that the asset amplifies market movements by about 37% and carries higher systematic risk than the market index.

The scatter plot shows that stock returns move in the same direction as market returns but with greater magnitude, consistent with a beta above 1. The estimated coefficient implies the stock amplifies market movements by roughly 37%.

To understand the economic meaning of different beta values, consider how stocks with varying market sensitivities respond to the same market movement:

Out[8]:
Visualization
Line plot showing regression lines for different beta values from 0.5 to 2.0, demonstrating how higher beta means steeper slope.
Hypothetical stock return responses for beta values ranging from 0.5 to 1.8. Steeper slopes represent higher sensitivity to market movements, where high-beta stocks ($\beta > 1$) amplify market returns and low-beta stocks ($\beta < 1$) serve as defensive investments.

Hypothesis Testing and Model Evaluation

Estimating coefficients is only the first step. We need to assess whether these estimates are statistically meaningful and how well the model explains variation in the dependent variable. The challenge is that our estimates are random variables: they depend on the particular sample we observed. Different samples would produce different estimates. Statistical inference quantifies this uncertainty, telling us how confident we can be in our conclusions.

Standard Errors and t-Statistics

The uncertainty in coefficient estimates comes from sampling variation. We observe only one realization of the data, but if we could repeatedly sample from the same underlying process, we would get different coefficient estimates each time. The distribution of these hypothetical estimates determines how confident we can be in any single estimate.

Under the classical OLS assumptions (which we'll examine shortly), the coefficient estimates follow a multivariate normal distribution:

$$\hat{\boldsymbol{\beta}} \sim N\left(\boldsymbol{\beta}, \sigma^2(\mathbf{X}'\mathbf{X})^{-1}\right)$$

where:

  • $\hat{\boldsymbol{\beta}}$: estimator vector
  • $\boldsymbol{\beta}$: true coefficient vector
  • $\sigma^2$: variance of the error term
  • $(\mathbf{X}'\mathbf{X})^{-1}$: inverse of the moment matrix of regressors

This result tells us that OLS estimates are unbiased: their expected value equals the true parameter. The covariance matrix $\sigma^2(\mathbf{X}'\mathbf{X})^{-1}$ determines the precision of each estimate. The diagonal elements give the variances of individual coefficients, while off-diagonal elements capture how estimation errors in different coefficients relate to each other.

Since we don't know $\sigma^2$, we estimate it using the residual variance:

$$\hat{\sigma}^2 = \frac{\sum_{i=1}^{n} \hat{\varepsilon}_i^2}{n - k - 1} = \frac{SSR}{n - k - 1}$$

where:

  • $\hat{\sigma}^2$: estimated variance of the error term
  • $\hat{\varepsilon}_i$: residual for observation $i$
  • $SSR$: sum of squared residuals
  • $n$: number of observations
  • $k$: number of explanatory variables (excluding intercept)

The denominator $n - k - 1$ accounts for the degrees of freedom lost in estimating $k + 1$ parameters (including the intercept). This adjustment ensures the variance estimate is unbiased. With more parameters to estimate, we have less information available to estimate the error variance, so we divide by a smaller number.

The standard error of coefficient jj is:

$$SE(\hat{\beta}_j) = \hat{\sigma}\sqrt{[(\mathbf{X}'\mathbf{X})^{-1}]_{jj}}$$

where:

  • $SE(\hat{\beta}_j)$: standard error of the $j$-th coefficient
  • $\hat{\sigma}$: estimated standard deviation of the error term
  • $[(\mathbf{X}'\mathbf{X})^{-1}]_{jj}$: the $j$-th diagonal element of the inverse matrix

The term $[(\mathbf{X}'\mathbf{X})^{-1}]_{jj}$ reflects both the variance of the $j$-th regressor and its correlation with other regressors. High variance in $x_j$ reduces the standard error (more information), while high correlation with other variables increases it (multicollinearity). Intuitively, if a variable moves a lot, we can more precisely estimate its effect. But if it moves together with other variables, it becomes harder to isolate its unique contribution.

To test whether a coefficient differs from zero, we compute the t-statistic:

$$t_j = \frac{\hat{\beta}_j}{SE(\hat{\beta}_j)}$$

where:

  • $t_j$: t-statistic for the $j$-th coefficient
  • $\hat{\beta}_j$: estimated coefficient
  • $SE(\hat{\beta}_j)$: standard error of the estimate

This ratio compares the magnitude of the estimate to its precision. A large t-statistic (in absolute value) means the estimate is large relative to its uncertainty, suggesting a genuine non-zero effect. Under the null hypothesis that $\beta_j = 0$, this statistic follows a t-distribution with $n - k - 1$ degrees of freedom. The t-distribution accounts for the additional uncertainty from estimating $\sigma^2$.
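
These formulas are easy to verify against statsmodels. The sketch below assumes the `df` DataFrame from the earlier beta-estimation example is in scope and rebuilds the standard errors, t-statistics, and p-values from $\hat{\sigma}^2$ and $(\mathbf{X}'\mathbf{X})^{-1}$; the numbers should match the regression table shown above.

```python
import numpy as np
from scipy import stats

# Rebuild OLS standard errors and t-statistics from first principles.
# Assumes `df` (market_return / stock_return) from the earlier example.
X_mat = np.column_stack([np.ones(len(df)), df["market_return"].values])
y_vec = df["stock_return"].values

b_hat = np.linalg.solve(X_mat.T @ X_mat, X_mat.T @ y_vec)
resid = y_vec - X_mat @ b_hat

n_obs, n_params = X_mat.shape                    # n_params = k + 1
sigma2_hat = resid @ resid / (n_obs - n_params)  # unbiased error-variance estimate
cov_b = sigma2_hat * np.linalg.inv(X_mat.T @ X_mat)
se = np.sqrt(np.diag(cov_b))
t_stats = b_hat / se
p_vals = 2 * stats.t.sf(np.abs(t_stats), df=n_obs - n_params)

print("coef:", b_hat.round(4))
print("se:  ", se.round(4))
print("t:   ", t_stats.round(2))
print("p:   ", p_vals.round(4))
```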

Interpreting p-Values

The p-value gives the probability of observing a t-statistic as extreme as or more extreme than the one calculated, assuming the null hypothesis is true. A small p-value (typically below 0.05 or 0.01) suggests the coefficient is statistically significantly different from zero. The logic is indirect: if the null hypothesis were true, the observed result would be very unlikely, so we reject the null.

In financial applications, statistical significance doesn't guarantee economic significance. A beta estimate might be highly significant (p < 0.001) but the relationship might be too weak or unstable to be useful for trading. Conversely, economically meaningful effects might not achieve statistical significance in small samples. The p-value depends on both the magnitude of the effect and the precision of our estimate, so a small but precisely estimated effect can be highly significant while a large but imprecisely estimated effect may not be.

In[9]:
Code
# Extract coefficient estimates and statistics
beta_est = model.params["market_return"]
alpha_est = model.params["const"]
beta_se = model.bse["market_return"]
alpha_se = model.bse["const"]
beta_tstat = model.tvalues["market_return"]
alpha_tstat = model.tvalues["const"]
beta_pval = model.pvalues["market_return"]
alpha_pval = model.pvalues["const"]
Out[10]:
Console
Hypothesis Testing Results:
--------------------------------------------------

Beta (market sensitivity):
  Estimate:    1.3698
  Std Error:   0.0550
  t-statistic: 24.89
  p-value:     1.09e-89
  Significant at 5%: Yes

Alpha (abnormal return):
  Estimate:    0.000707 (17.81% annualized)
  Std Error:   0.000650
  t-statistic: 1.09
  p-value:     0.2772
  Significant at 5%: No

The beta estimate is highly significant, with a t-statistic far exceeding critical values. This means we can confidently reject the null hypothesis that the stock has no market exposure. The alpha estimate, while positive, is not significant at the 5% level (p ≈ 0.28), even though the data were generated with a small positive true alpha. This illustrates the challenge of detecting true alpha.

R-Squared: Measuring Fit

The coefficient of determination, $R^2$, measures the proportion of variance in the dependent variable explained by the model. This statistic answers a fundamental question: how much of the variation in returns can we attribute to the factors in our model, versus unexplained idiosyncratic variation?

$$R^2 = 1 - \frac{SSR}{SST} = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$$

where:

  • $R^2$: coefficient of determination
  • $SSR$: sum of squared residuals (unexplained variation)
  • $SST$: total sum of squares (total variation)
  • $y_i$: observed value of dependent variable
  • $\hat{y}_i$: fitted value for observation $i$
  • $\bar{y}$: sample mean of the dependent variable

$R^2$ ranges from 0 to 1, with higher values indicating better fit. An $R^2$ of 0.5 means that 50% of the variation in returns is explained by the model, while 50% remains unexplained. The unexplained portion represents idiosyncratic risk and the effects of factors not included in the model.

Adjusted R-Squared

When comparing models with different numbers of predictors, use adjusted R-squared, which penalizes for adding variables:

$$\bar{R}^2 = 1 - (1 - R^2)\frac{n-1}{n-k-1}$$

where:

  • $\bar{R}^2$: adjusted coefficient of determination
  • $R^2$: standard coefficient of determination
  • $n$: number of observations
  • $k$: number of predictors

This penalty guards against overfitting, because adding irrelevant predictors increases $R^2$ mechanically. The adjustment recognizes that adding any variable, even a random one, will weakly increase $R^2$. Adjusted $R^2$ can actually decrease if the added variable doesn't improve the fit enough to justify the lost degree of freedom.
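
To see the penalty in action, the sketch below appends a pure-noise regressor to the market-model regression estimated earlier (it assumes the `df` DataFrame is in scope): $R^2$ creeps up mechanically, while adjusted $R^2$ typically falls.

```python
import numpy as np
import statsmodels.api as sm

# Adding an irrelevant regressor: R-squared rises mechanically,
# adjusted R-squared usually falls. Assumes `df` from the earlier example.
np.random.seed(7)
noise = np.random.normal(size=len(df))  # regressor unrelated to returns

X1 = sm.add_constant(df[["market_return"]])
X2 = sm.add_constant(df[["market_return"]].assign(noise=noise))

fit1 = sm.OLS(df["stock_return"], X1).fit()
fit2 = sm.OLS(df["stock_return"], X2).fit()

print(f"1 factor : R2 = {fit1.rsquared:.4f}, adj R2 = {fit1.rsquared_adj:.4f}")
print(f"+ noise  : R2 = {fit2.rsquared:.4f}, adj R2 = {fit2.rsquared_adj:.4f}")
```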

For individual stocks regressed against the market, $R^2$ typically ranges from 0.2 to 0.6. Lower values indicate more idiosyncratic risk that cannot be diversified through market exposure alone. For portfolios, $R^2$ tends to be higher because idiosyncratic risks cancel out. A well-diversified portfolio might have an $R^2$ of 0.9 or higher against the market, indicating that almost all its risk is systematic.

In[11]:
Code
# Calculate R-squared components
SST = ((df["stock_return"] - df["stock_return"].mean()) ** 2).sum()
SSR = (residuals**2).sum()
SSE = SST - SSR  # Sum of squares explained

r_squared = 1 - SSR / SST
adj_r_squared = 1 - (1 - r_squared) * (len(df) - 1) / (len(df) - 2)
Out[12]:
Console
Model Fit Statistics:
----------------------------------------
R-squared:          0.5525
Adjusted R-squared: 0.5516

The R-squared value indicates that the market factor explains roughly 55% of the stock's return variance, while the remaining variation is idiosyncratic.

Out[13]:
Visualization
Stacked bar chart showing total variance split into explained and unexplained components.
Comparison of explained variance (SSE) and unexplained residual variance (SSR) for the single-factor regression. The fact that SSR exceeds SSE indicates that idiosyncratic factors drive more than 50% of the return variation, highlighting the importance of portfolio diversification.

The F-Test for Overall Significance

The F-test assesses whether the model as a whole has explanatory power, meaning whether at least one coefficient (excluding the intercept) is significantly different from zero. While t-tests evaluate individual coefficients, the F-test provides a joint test of all slope coefficients simultaneously. This distinction matters because individual coefficients might be insignificant due to multicollinearity even when the model as a whole explains significant variation.

$$F = \frac{(SST - SSR)/k}{SSR/(n-k-1)} = \frac{R^2/k}{(1-R^2)/(n-k-1)}$$

where:

  • $F$: F-statistic for the model
  • $SST$: total sum of squares
  • $SSR$: sum of squared residuals
  • $R^2$: coefficient of determination
  • $k$: number of slope coefficients
  • $n$: number of observations

The F-statistic compares the variance explained by the model (per parameter) to the unexplained variance (per degree of freedom). A value significantly greater than 1 indicates the model captures more signal than expected by random chance. The numerator represents average explained variation per variable, while the denominator represents unexplained variation per observation. If the model has no explanatory power, these should be roughly equal, yielding an F-statistic near 1.

Under the null hypothesis that all slope coefficients equal zero, the F-statistic follows an F-distribution with $k$ and $n - k - 1$ degrees of freedom.
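
The equivalence of the two expressions for $F$ is easy to check numerically. The sketch below recomputes the statistic from $R^2$ for the market-model regression (assuming the fitted `model` object from earlier is in scope) and compares it with the value statsmodels reports.

```python
from scipy import stats

# Recompute the F-statistic from R-squared for the market-model regression.
# Assumes `model` (fitted earlier with statsmodels) is in scope.
n = int(model.nobs)
k = int(model.df_model)  # number of slope coefficients
r2 = model.rsquared

f_manual = (r2 / k) / ((1 - r2) / (n - k - 1))
p_manual = stats.f.sf(f_manual, k, n - k - 1)

print(f"manual F = {f_manual:.2f}, statsmodels F = {model.fvalue:.2f}")
print(f"manual p = {p_manual:.2e}")
```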

In[14]:
Code
f_statistic = model.fvalue
f_pvalue = model.f_pvalue
Out[15]:
Console
F-test for overall model significance:
  F-statistic: 619.72
  p-value:     1.09e-89

The highly significant F-statistic confirms that the model explains a statistically significant portion of the variance in stock returns.

For simple regression with one explanatory variable, the F-test is equivalent to the t-test on the slope coefficient (the F-statistic equals the squared t-statistic).

Diagnosing Regression Problems

The validity of OLS inference depends on several assumptions about the error term. With financial time series data, these assumptions are frequently violated, leading to incorrect standard errors and misleading hypothesis tests. Understanding these assumptions and knowing how to detect their violations is essential for producing reliable results. Failing to address these issues can lead to false confidence in spurious relationships or missed opportunities to identify genuine effects.

The Classical Assumptions

The classical assumptions for OLS (the first four are the Gauss-Markov conditions) are:

  • Linearity: The relationship between $y$ and $x$ is linear
  • Exogeneity: $E[\varepsilon_i \mid x_i] = 0$ (errors are uncorrelated with explanatory variables)
  • Homoskedasticity: $\text{Var}(\varepsilon_i) = \sigma^2$ (constant error variance)
  • No autocorrelation: $\text{Cov}(\varepsilon_i, \varepsilon_j) = 0$ for $i \neq j$ (errors are uncorrelated across observations)
  • Normality: $\varepsilon_i \sim N(0, \sigma^2)$ (required for exact inference in small samples)

When these assumptions hold, OLS is the Best Linear Unbiased Estimator (BLUE); no other linear unbiased estimator has smaller variance. This optimality result justifies the widespread use of OLS. However, when assumptions are violated, the BLUE property fails, and either the estimates themselves become biased or the standard errors become unreliable.

Heteroskedasticity

Heteroskedasticity occurs when the error variance changes across observations. In finance, this is common: volatility varies over time, as we explored in Chapter 18 on GARCH models. Periods of market stress have higher return volatility than calm periods. The variance of returns during the 2008 financial crisis was many times higher than during the calm period of 2005-2006. This time-varying volatility directly translates into heteroskedastic errors in regression models.

Heteroskedasticity doesn't bias the coefficient estimates, but it makes the standard errors incorrect, invalidating hypothesis tests. The OLS formula for standard errors assumes constant variance, so when variance actually varies, these standard errors are wrong. They might be too small (leading to false positives) or too large (leading to false negatives), depending on the pattern of heteroskedasticity.

In[16]:
Code
# Create data with heteroskedasticity (volatility increases with market magnitude)
np.random.seed(123)
market_het = np.random.normal(0.0004, 0.012, 500)

# Error variance increases with absolute market return
heterosked_errors = np.random.normal(0, 0.01 + 0.5 * np.abs(market_het), 500)
stock_het = 0.0001 + 1.2 * market_het + heterosked_errors

# Fit regression
X_het = sm.add_constant(market_het)
model_het = sm.OLS(stock_het, X_het).fit()
residuals_het = model_het.resid

The Breusch-Pagan test formally tests for heteroskedasticity by regressing squared residuals on the explanatory variables:

In[17]:
Code
from statsmodels.stats.diagnostic import het_breuschpagan

# Breusch-Pagan test
bp_stat, bp_pval, _, _ = het_breuschpagan(model_het.resid, model_het.model.exog)
Out[18]:
Console
Breusch-Pagan Test for Heteroskedasticity:
  Test statistic: 5.5053
  p-value:        0.0190

The low p-value indicates we reject the null hypothesis of homoskedasticity, confirming the presence of time-varying volatility in our synthetic data.

Let's visualize heteroskedasticity by plotting residuals against fitted values:

Out[19]:
Visualization
Residuals versus fitted values for heteroskedastic data. The funnel shape shows variance increasing with the magnitude of fitted values.
Residuals versus fitted values for homoskedastic data. The consistent spread around zero indicates constant error variance.

Autocorrelation

Autocorrelation occurs when errors are correlated across time periods. This is particularly problematic in financial time series, where returns often exhibit short-term momentum or mean reversion and volatility is persistent. When today's error is correlated with yesterday's error, the observations are not truly independent. This reduces the effective sample size: 500 correlated observations contain less information than 500 independent observations.

The Durbin-Watson statistic tests for first-order autocorrelation:

$$DW = \frac{\sum_{t=2}^{n}(\hat{\varepsilon}_t - \hat{\varepsilon}_{t-1})^2}{\sum_{t=1}^{n}\hat{\varepsilon}_t^2}$$

where:

  • $DW$: Durbin-Watson statistic
  • $\hat{\varepsilon}_t$: residual at time $t$
  • $n$: number of observations

The numerator sums squared differences between adjacent residuals. If errors are positively correlated, $\hat{\varepsilon}_t \approx \hat{\varepsilon}_{t-1}$, making the differences small and $DW$ low (closer to 0). If errors are uncorrelated, the differences are as large as the errors themselves, and the expected value is 2. If errors are negatively correlated, consecutive residuals tend to have opposite signs, making the differences large and $DW$ high (closer to 4).

Values near 2 indicate no autocorrelation, values below 2 indicate positive autocorrelation, and values above 2 indicate negative autocorrelation.
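
The statistic is simple enough to compute by hand, and for AR(1) errors it is approximately $2(1 - \rho)$. A self-contained sketch (parameters made up for illustration):

```python
import numpy as np

# Manual Durbin-Watson on synthetic AR(1) residuals: DW is roughly 2*(1 - rho)
rng = np.random.default_rng(3)
rho, n = 0.3, 5000
e = np.zeros(n)
for t in range(1, n):
    e[t] = rho * e[t - 1] + rng.normal(0, 0.015)

dw = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)
print(f"DW = {dw:.3f}, approx 2*(1 - rho) = {2 * (1 - rho):.3f}")
```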

In[20]:
Code
from statsmodels.stats.stattools import durbin_watson

# Generate data with autocorrelated errors
np.random.seed(456)
market_ac = np.random.normal(0.0004, 0.012, 500)

# Create AR(1) errors with persistence parameter 0.3
ar_coef = 0.3
errors_ac = np.zeros(500)
errors_ac[0] = np.random.normal(0, 0.015)
for t in range(1, 500):
    errors_ac[t] = ar_coef * errors_ac[t - 1] + np.random.normal(0, 0.015)

true_beta_ac = 1.2
stock_ac = 0.0001 + true_beta_ac * market_ac + errors_ac

# Fit regression
X_ac = sm.add_constant(market_ac)
model_ac = sm.OLS(stock_ac, X_ac).fit()

# Durbin-Watson test
dw_stat = durbin_watson(model_ac.resid)
dw_original = durbin_watson(model.resid)
Out[21]:
Console
Durbin-Watson Test for Autocorrelation:

Autocorrelated data:
  DW statistic: 1.4305

Original data (no autocorrelation):
  DW statistic: 1.9979

The Durbin-Watson statistic differs significantly from 2 in the autocorrelated sample, signaling the presence of serial correlation.

We can also examine the autocorrelation function of the residuals:

Out[22]:
Visualization
Autocorrelation function (ACF) of residuals with serial correlation. Significant spikes at early lags indicate that past errors predict future errors.
Autocorrelation function (ACF) of well-behaved residuals. Correlations fall within the shaded confidence bands, indicating independence.

Multicollinearity

Multicollinearity arises when explanatory variables are highly correlated with each other. While not strictly a violation of OLS assumptions, it inflates standard errors and makes coefficient estimates unstable. The fundamental problem is identification: when two variables move together, it becomes difficult to determine which one is driving the effect. Mathematically, the matrix $\mathbf{X}'\mathbf{X}$ approaches singularity, making its inverse unstable.

The Variance Inflation Factor (VIF) quantifies multicollinearity:

$$VIF_j = \frac{1}{1 - R_j^2}$$

where:

  • $VIF_j$: Variance Inflation Factor for variable $j$
  • $R_j^2$: R-squared from regressing variable $j$ on all other explanatory variables

If variable $j$ is highly correlated with other predictors, $R_j^2$ approaches 1, causing the denominator to approach zero and $VIF_j$ to become very large. This mathematically quantifies how much the variance of the estimated coefficient is inflated due to collinearity. A VIF of 5 means the variance is five times what it would be if the variable were uncorrelated with other predictors, making the standard error about 2.2 times larger.

A VIF above 5 or 10 suggests problematic multicollinearity.
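
The definition translates directly into code: regress each variable on the others and plug the resulting $R_j^2$ into the formula. A small self-contained sketch (the factor names and parameters are made up); statsmodels' `variance_inflation_factor`, used in the next cell, wraps the same computation.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Manual VIF via auxiliary regressions on two simulated, correlated factors
rng = np.random.default_rng(4)
f1 = rng.normal(size=500)
f2 = 0.6 * f1 + rng.normal(scale=0.8, size=500)  # correlated with f1
factors = pd.DataFrame({"f1": f1, "f2": f2})

for col in factors.columns:
    others = sm.add_constant(factors.drop(columns=col))
    r2_j = sm.OLS(factors[col], others).fit().rsquared
    print(f"VIF({col}) = {1 / (1 - r2_j):.2f}")
```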

In[23]:
Code
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Create correlated factors (simulating market, size, and value factors)
np.random.seed(789)
n = 500

# Market factor
factor_market = np.random.normal(0.0004, 0.012, n)

# Size factor (moderately correlated with market)
factor_size = 0.5 * factor_market + np.random.normal(0, 0.008, n)

# Value factor (slightly correlated with both)
factor_value = (
    0.3 * factor_market + 0.2 * factor_size + np.random.normal(0, 0.006, n)
)

# Create DataFrame
factors_df = pd.DataFrame(
    {"market": factor_market, "size": factor_size, "value": factor_value}
)

# Add constant
X_multi = sm.add_constant(factors_df)

# Calculate VIF for each variable
vif_data = pd.DataFrame()
vif_data["Variable"] = X_multi.columns
vif_data["VIF"] = [
    variance_inflation_factor(X_multi.values, i)
    for i in range(X_multi.shape[1])
]
Out[24]:
Console
Variance Inflation Factors:
------------------------------
  market    : 2.02
  size      : 1.63
  value     : 1.82

Correlation matrix of factors:
        market   size  value
market   1.000  0.593  0.646
size     0.593  1.000  0.528
value    0.646  0.528  1.000

The VIF values quantify the inflation in variance due to correlation among predictors. Values above 5 would suggest problematic multicollinearity requiring attention.

Out[25]:
Visualization
Heatmap showing correlation matrix between market, size, and value factors with color intensity indicating correlation strength.
Correlation heatmap for the simulated market, size, and value factors. Although the market and size factors exhibit a moderate correlation of roughly 0.59, the relationships are not strong enough to trigger problematic multicollinearity or unstable coefficient estimates.

Robust Standard Errors

When heteroskedasticity or autocorrelation are present, we can still obtain valid inference by using robust standard errors that don't assume constant variance or independence. These methods adjust the standard error calculation to account for the actual error structure, providing valid confidence intervals and hypothesis tests even when classical assumptions fail. The coefficient estimates themselves remain unbiased; only the uncertainty quantification changes.

Heteroskedasticity-Consistent Standard Errors

White (1980) proposed heteroskedasticity-consistent (HC) standard errors that remain valid even when error variance varies across observations. The idea is to estimate the covariance matrix of $\hat{\boldsymbol{\beta}}$ using the squared residuals to account for varying error variance. Rather than assuming all errors have the same variance $\sigma^2$, we allow each observation to have its own variance, estimated by its squared residual.

$$\widehat{\text{Var}}(\hat{\boldsymbol{\beta}})_{HC} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\hat{\boldsymbol{\Omega}}\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}$$

where:

  • $\widehat{\text{Var}}(\hat{\boldsymbol{\beta}})_{HC}$: heteroskedasticity-consistent covariance matrix of coefficients
  • $\mathbf{X}$: matrix of regressors
  • $\hat{\boldsymbol{\Omega}}$: diagonal matrix with squared residuals on the diagonal, accounting for heteroskedasticity

This formula is sometimes called the "sandwich" estimator because the middle term $\mathbf{X}'\hat{\boldsymbol{\Omega}}\mathbf{X}$ is sandwiched between two copies of $(\mathbf{X}'\mathbf{X})^{-1}$. The outer terms handle the variance of the regressors, while the middle term accounts for heteroskedastic errors.
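
The sandwich formula takes only a few lines to implement. This sketch computes the basic White (HC0) covariance for the heteroskedastic example above (it assumes `X_het` and `model_het` from that cell are in scope); the HC3 variant used later applies an additional finite-sample correction, so its numbers will differ slightly.

```python
import numpy as np

# Basic White (HC0) sandwich covariance for the heteroskedastic example.
# Assumes X_het and model_het from the earlier cell are in scope.
X = np.asarray(X_het)
u = np.asarray(model_het.resid)

XtX_inv = np.linalg.inv(X.T @ X)
meat = X.T @ (X * (u ** 2)[:, None])  # X' diag(u^2) X without building an n x n matrix
cov_hc0 = XtX_inv @ meat @ XtX_inv

print("HC0 robust SEs:", np.sqrt(np.diag(cov_hc0)).round(6))
print("Naive OLS SEs: ", np.asarray(model_het.bse).round(6))
```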

HAC Standard Errors (Newey-West)

For financial time series with both heteroskedasticity and autocorrelation, Newey-West (1987) HAC (Heteroskedasticity and Autocorrelation Consistent) standard errors are the standard approach. These extend the White estimator by also accounting for correlation between errors at different time periods. The key insight is that autocorrelated errors cause cross-products of residuals at different lags to be non-zero, and these must be incorporated into the variance estimate.

$$\widehat{\text{Var}}(\hat{\boldsymbol{\beta}})_{HAC} = (\mathbf{X}'\mathbf{X})^{-1}\hat{\mathbf{S}}(\mathbf{X}'\mathbf{X})^{-1}$$

where:

  • $\widehat{\text{Var}}(\hat{\boldsymbol{\beta}})_{HAC}$: heteroskedasticity and autocorrelation consistent covariance matrix
  • $\mathbf{X}$: matrix of regressors
  • $\hat{\mathbf{S}}$: estimate of the long-run covariance matrix, accounting for both heteroskedasticity and autocorrelation up to a specified lag

The matrix $\hat{\mathbf{S}}$ is computed using a weighted sum of autocovariance matrices at different lags, with weights that decline as the lag increases. This downweighting ensures that distant correlations, which are typically weaker and harder to estimate, have less influence on the variance estimate. The choice of maximum lag (bandwidth) involves a tradeoff: too few lags may miss important autocorrelation, while too many lags reduce precision.
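
The bandwidth is the main tuning decision. One commonly cited rule of thumb sets the maximum lag at $\lfloor 4 (T/100)^{2/9} \rfloor$; the sketch below applies it to the autocorrelated example (assuming `stock_ac` and `X_ac` are in scope) and compares the resulting standard error with the fixed four-lag choice used in the next cell. Treat the rule as a starting point rather than a definitive prescription.

```python
import numpy as np
import statsmodels.api as sm

# One commonly cited Newey-West bandwidth rule of thumb: 4 * (T/100)^(2/9).
# Assumes stock_ac and X_ac from the autocorrelation example are in scope.
T = len(stock_ac)
maxlags_rule = int(np.floor(4 * (T / 100) ** (2 / 9)))

fit_rule = sm.OLS(stock_ac, X_ac).fit(
    cov_type="HAC", cov_kwds={"maxlags": maxlags_rule}
)
fit_4 = sm.OLS(stock_ac, X_ac).fit(cov_type="HAC", cov_kwds={"maxlags": 4})

print(f"T = {T}, rule-of-thumb lags = {maxlags_rule}")
print(f"SE(beta), rule of thumb: {fit_rule.bse[1]:.6f}")
print(f"SE(beta), 4 lags:        {fit_4.bse[1]:.6f}")
```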

In[26]:
Code
# Compare standard errors: OLS vs. White (HC) vs. Newey-West (HAC)

# Original OLS on data with autocorrelation
model_ols = sm.OLS(stock_ac, X_ac).fit()

# Heteroskedasticity-consistent (HC3)
model_hc = sm.OLS(stock_ac, X_ac).fit(cov_type="HC3")

# HAC (Newey-West with 4 lags)
model_hac = sm.OLS(stock_ac, X_ac).fit(cov_type="HAC", cov_kwds={"maxlags": 4})
Out[27]:
Console
Comparison of Standard Errors for Beta Coefficient:
------------------------------------------------------------
Method               SE(Beta)     t-stat     p-value     
------------------------------------------------------------
OLS (naive)          0.054965    21.89      6.8200e-75  
White (HC3)          0.056571    21.26      2.4227e-100 
Newey-West (HAC)     0.057793    20.82      3.1594e-96  
------------------------------------------------------------

Note: Larger standard errors (HAC) account for autocorrelation
and lead to more conservative t-statistics.

The robust standard errors are typically larger than naive OLS standard errors when autocorrelation is present. This means t-statistics are smaller and p-values are larger, making it harder to reject null hypotheses. This is appropriate: we should be more uncertain about estimates when errors are correlated because we effectively have fewer independent observations. The adjustment prevents us from falsely claiming statistical significance based on overstated precision.

Out[28]:
Visualization
Forest plot showing beta coefficient estimates with error bars for OLS, White, and Newey-West standard errors.
Market beta estimates with 95% confidence intervals using OLS, White, and Newey-West standard errors. The point estimate is identical across methods, but the Newey-West interval is somewhat wider because it accounts for the serial correlation present in the residuals.

Multiple Regression and Factor Models

While the market model uses a single factor, financial research has established that multiple factors explain cross-sectional return variation. The single-factor CAPM leaves significant patterns unexplained: small stocks earn higher average returns than large stocks, and value stocks (high book-to-market) earn more than growth stocks. These patterns motivated the development of multi-factor models that better capture the systematic risks in equity markets.

The Fama-French three-factor model adds size (SMB: Small Minus Big) and value (HML: High Minus Low) factors:

$$r_{i,t} - r_{f,t} = \alpha_i + \beta_{i,m}(r_{m,t} - r_{f,t}) + \beta_{i,s} SMB_t + \beta_{i,h} HML_t + \varepsilon_{i,t}$$

where:

  • $r_{i,t} - r_{f,t}$: excess return on asset $i$
  • $r_{m,t} - r_{f,t}$: excess return on the market portfolio
  • $SMB_t$: Small Minus Big size factor, capturing the return premium of small stocks over large stocks
  • $HML_t$: High Minus Low value factor, capturing the return premium of value stocks over growth stocks
  • $\alpha_i$: intercept (abnormal return)
  • $\beta_{i,m}$: market beta
  • $\beta_{i,s}$: stock's sensitivity to the size factor
  • $\beta_{i,h}$: stock's sensitivity to the value factor
  • $\varepsilon_{i,t}$: error term

Each coefficient represents exposure to a different systematic risk factor. The market beta captures overall equity market risk. The SMB loading indicates whether the stock behaves more like a small-cap or large-cap stock, regardless of its actual size. The HML loading reveals whether the stock exhibits value or growth characteristics in its return behavior. We'll explore these multi-factor models thoroughly in Part IV when we cover Arbitrage Pricing Theory.

Implementing Multi-Factor Regression

In[29]:
Code
# Generate stock return with multiple factor exposures
np.random.seed(101)
n = 504

# Generate factor returns
mkt_factor = np.random.normal(0.0004, 0.012, n)  # Market excess return
smb_factor = np.random.normal(0.0001, 0.008, n)  # Size factor
hml_factor = np.random.normal(0.0001, 0.007, n)  # Value factor

# True exposures: high market beta, positive size tilt, negative value tilt
true_betas = {"market": 1.15, "smb": 0.4, "hml": -0.25}
true_alpha_ff = 0.0001  # Small positive alpha

stock_ff = (
    true_alpha_ff
    + true_betas["market"] * mkt_factor
    + true_betas["smb"] * smb_factor
    + true_betas["hml"] * hml_factor
    + np.random.normal(0, 0.012, n)
)

# Create DataFrame
ff_df = pd.DataFrame(
    {
        "excess_return": stock_ff,
        "MKT": mkt_factor,
        "SMB": smb_factor,
        "HML": hml_factor,
    }
)

# Fit three-factor model
X_ff = sm.add_constant(ff_df[["MKT", "SMB", "HML"]])
model_ff = sm.OLS(ff_df["excess_return"], X_ff).fit(
    cov_type="HAC", cov_kwds={"maxlags": 4}
)
Out[30]:
Console
Fama-French Three-Factor Model Results
============================================================

Factor       True β     Estimate     Std Err    t-stat     p-value   
------------------------------------------------------------
Alpha        0.0001     0.001172     0.000554   2.12       0.0344    
Market       1.1500     1.0826       0.0458     23.62      2.3307e-123
SMB          0.4000     0.3942       0.0684     5.76       0.0000    
HML          -0.2500    -0.3324      0.0726     -4.58      0.0000    
------------------------------------------------------------

R-squared: 0.5526
Adjusted R-squared: 0.5499
F-statistic: 204.16 (p-value: 1.97e-86)

The results show significant exposure to all three factors. The estimated market beta of roughly 1.08 (true value 1.15) means the stock is more volatile than the market. The positive SMB loading indicates a tilt toward small-cap characteristics, while the negative HML loading suggests growth rather than value exposure. Adding factors improves the model's explanatory power, as shown by the higher R-squared compared to a single-factor model.

Out[31]:
Visualization
Horizontal bar chart showing estimated factor loadings for Market, SMB, and HML with confidence interval whiskers.
Estimated factor loadings for the Fama-French three-factor model. The stock exhibits significant positive market beta and size exposure (SMB), with negative value exposure (HML).

Comparing Single and Multi-Factor Models

In[32]:
Code
# Fit single-factor (CAPM) model for comparison
X_capm = sm.add_constant(ff_df["MKT"])
model_capm = sm.OLS(ff_df["excess_return"], X_capm).fit(
    cov_type="HAC", cov_kwds={"maxlags": 4}
)
Out[33]:
Console
Model Comparison: CAPM vs. Three-Factor
---------------------------------------------
Metric                    CAPM       3-Factor  
---------------------------------------------
R-squared                 0.5060     0.5526    
Adjusted R-squared        0.5050     0.5499    
Residual Std Err          0.013374   0.012752  
Alpha (annualized)        31.81%     29.53%    

The three-factor model explains more variance (higher R-squared) and has smaller residual standard errors. Importantly, the alpha estimate often changes when additional factors are included: what appeared as "alpha" in the CAPM might be explained by exposure to size or value factors.

Out[34]:
Visualization
Comparison of explanatory power ($R^2$) between CAPM and Three-Factor models. The Three-Factor model explains a larger proportion of return variance.
Comparison of residual standard error. The Three-Factor model reduces unexplained volatility compared to the single-factor CAPM.

A Complete Regression Workflow

Let's demonstrate a complete workflow for estimating and validating a factor model, incorporating all the diagnostic and inference techniques covered.

In[35]:
Code
from scipy import stats


def complete_regression_analysis(y, X, variable_names=None):
    """
    Perform comprehensive regression analysis with diagnostics.

    Parameters:
    -----------
    y : array-like
        Dependent variable
    X : array-like
        Independent variables (without constant)
    variable_names : list, optional
        Names for the X variables

    Returns:
    --------
    dict : Dictionary containing model results and diagnostics
    """
    if variable_names is None:
        variable_names = [
            f"X{i}" for i in range(X.shape[1] if len(X.shape) > 1 else 1)
        ]

    # Convert to DataFrame if needed
    if isinstance(X, np.ndarray):
        X = pd.DataFrame(X, columns=variable_names)

    # Add constant
    X_const = sm.add_constant(X)

    # Fit models with different standard errors
    model_ols = sm.OLS(y, X_const).fit()
    model_hac = sm.OLS(y, X_const).fit(cov_type="HAC", cov_kwds={"maxlags": 4})

    # Diagnostics
    residuals = model_ols.resid

    # Durbin-Watson
    dw = durbin_watson(residuals)

    # Breusch-Pagan
    bp_stat, bp_pval, _, _ = het_breuschpagan(residuals, X_const)

    # Jarque-Bera (normality)
    jb_stat, jb_pval = stats.jarque_bera(residuals)

    results = {
        "model_ols": model_ols,
        "model_hac": model_hac,
        "durbin_watson": dw,
        "breusch_pagan": (bp_stat, bp_pval),
        "jarque_bera": (jb_stat, jb_pval),
        "residuals": residuals,
    }

    return results
In[36]:
Code
# Apply to our three-factor data
analysis = complete_regression_analysis(
    ff_df["excess_return"], ff_df[["MKT", "SMB", "HML"]], ["MKT", "SMB", "HML"]
)
Out[37]:
Console
Complete Regression Analysis
============================================================

1. COEFFICIENT ESTIMATES
------------------------------------------------------------
Variable     Coef         OLS SE       HAC SE      
------------------------------------------------------------
const        0.001172     0.000570     0.000554    
MKT          1.082611     0.045050     0.045832    
SMB          0.394167     0.067886     0.068394    
HML          -0.332426    0.081125     0.072631    

2. MODEL FIT
------------------------------------------------------------
R-squared:          0.5526
Adjusted R-squared: 0.5499
F-statistic:        205.88 (p = 6.19e-87)

3. DIAGNOSTIC TESTS
------------------------------------------------------------
Durbin-Watson:      1.9696  [OK]
Breusch-Pagan:      2.9407 (p = 0.4009)  [OK]
Jarque-Bera:        5.3142 (p = 0.0702)  [OK]

The diagnostic summary quickly flags potential issues. In this case, all three tests come back [OK]; a "WARNING" flag would signal the need for robust standard errors or a re-specified model.
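For reference, here is a minimal sketch of how the diagnostic portion of that report could be printed from the dictionary returned by complete_regression_analysis. The [OK]/[WARNING] thresholds are common rules of thumb (a Durbin-Watson statistic between roughly 1.5 and 2.5, and test p-values above 0.05), not hard cutoffs.

Code
def print_diagnostics(results):
    """Print [OK]/[WARNING] flags for the diagnostics returned by complete_regression_analysis."""
    dw = results["durbin_watson"]
    bp_stat, bp_pval = results["breusch_pagan"]
    jb_stat, jb_pval = results["jarque_bera"]

    # Rule-of-thumb thresholds: DW near 2, test p-values above 5%
    dw_flag = "[OK]" if 1.5 < dw < 2.5 else "[WARNING]"
    bp_flag = "[OK]" if bp_pval > 0.05 else "[WARNING]"
    jb_flag = "[OK]" if jb_pval > 0.05 else "[WARNING]"

    print(f"Durbin-Watson:      {dw:.4f}  {dw_flag}")
    print(f"Breusch-Pagan:      {bp_stat:.4f} (p = {bp_pval:.4f})  {bp_flag}")
    print(f"Jarque-Bera:        {jb_stat:.4f} (p = {jb_pval:.4f})  {jb_flag}")


print_diagnostics(analysis)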

Key Parameters

The key parameters for regression models in finance are:

  • y: Dependent variable (e.g., stock returns). The target variable to be explained.
  • X: Independent variables (factors). The explanatory variables such as market returns.
  • $\alpha$: Intercept (alpha). Measures abnormal performance not explained by the factors.
  • $\beta$: Slope (beta). Measures systematic risk, or sensitivity to a factor.
  • $\varepsilon$: Error term. Captures idiosyncratic risk and unexplained variation.
  • $R^2$: R-squared. Indicates the proportion of variance explained by the model.
Out[38]:
Visualization
Scatter plot of residuals versus fitted values showing random pattern around zero.
Residuals versus fitted values. Random scatter around zero indicates homoskedasticity and correct model specification.
Quantile-quantile plot comparing residual distribution to normal distribution.
Q-Q plot of residuals. Points falling along the diagonal line indicate normally distributed residuals.

Limitations and Practical Considerations

Regression analysis is a powerful tool, but its proper application in finance requires understanding several limitations and pitfalls.

The assumption of a stable, stationary relationship underlying OLS is often violated in financial applications. Beta estimates, for instance, are not constant through time: a stock's sensitivity to the market can change as the company grows, changes industries, or faces different competitive pressures. Rolling-window regressions or time-varying parameter models can address this, but the fundamental issue remains that historical relationships may not persist into the future. This is why professional risk managers often blend statistical estimates with fundamental judgment.
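To make the point concrete, a rolling beta can be computed directly from rolling covariances, since the OLS slope in a single-factor regression equals cov(stock, market) / var(market). This is a minimal sketch assuming the daily ff_df DataFrame from the examples above and a one-year (252-day) window.

Code
# Rolling one-year beta: cov(stock, market) / var(market) over each window.
# Assumes ff_df from the earlier examples (daily excess returns).
window = 252

rolling_cov = ff_df["excess_return"].rolling(window).cov(ff_df["MKT"])
rolling_var = ff_df["MKT"].rolling(window).var()
rolling_beta = rolling_cov / rolling_var

# How much does beta drift over the sample?
print(rolling_beta.dropna().describe())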

The look-ahead bias problem plagues many regression applications in backtesting. If you use data from the entire sample period to estimate betas, then evaluate performance using those same betas, you've implicitly used future information. Proper out-of-sample testing requires estimating parameters only with data available at each point in time, such as through a rolling or expanding window approach. Factor exposures estimated with complete sample data almost always look better than what would have been achievable in real-time.
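A simple guard against look-ahead bias is to estimate parameters point-in-time, using only data available up to each date. The sketch below illustrates the idea with an expanding window; variable names follow the earlier examples, and the minimum history length is an assumption.

Code
import pandas as pd
import statsmodels.api as sm

# Expanding-window (point-in-time) beta: at each date t, fit the regression
# on data strictly before t, so the estimate uses no future information.
min_obs = 252  # assumed minimum history before producing an estimate
betas = {}
for t in range(min_obs, len(ff_df)):
    past = ff_df.iloc[:t]
    X_past = sm.add_constant(past["MKT"])
    fit = sm.OLS(past["excess_return"], X_past).fit()
    betas[ff_df.index[t]] = fit.params["MKT"]

point_in_time_beta = pd.Series(betas)
# These betas could then be used to evaluate period-t performance out of sample.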

Omitted variable bias occurs when relevant explanatory variables are excluded from the regression. If an omitted variable is correlated with both the dependent variable and included explanatory variables, coefficient estimates will be biased. This is why CAPM alpha estimates often shrink or disappear when additional factors are included: what looked like skill was actually exposure to priced factors not in the model.

Measurement error in explanatory variables causes attenuation bias: coefficients biased toward zero. Financial data often contains measurement issues: prices might include stale quotes, returns might be calculated from non-synchronous data, and accounting data has reporting lags. Errors-in-variables models and instrumental variable techniques can address this, but they require additional assumptions and data.

Despite these limitations, regression analysis remains foundational because it provides a systematic framework for decomposing returns into explained and unexplained components, testing hypotheses about relationships, and making the uncertainty in our estimates explicit through confidence intervals and p-values. The key is to use regression as a tool for understanding rather than as a black box: always checking assumptions, validating results out-of-sample, and interpreting findings in economic context.

Summary

This chapter developed regression analysis as the primary tool for understanding relationships between financial variables. We covered:

  • The OLS framework provides closed-form estimates for linear relationships. The slope coefficient measures the change in the dependent variable for a unit change in the explanatory variable, while the intercept captures the baseline level.

  • Factor exposures and beta quantify systematic risk. In the market model, beta measures an asset's sensitivity to market movements, which is the foundation for the CAPM that we'll explore in Part IV.

  • Hypothesis testing assesses whether relationships are statistically significant. T-tests evaluate individual coefficients, F-tests assess overall model significance, and R-squared measures explanatory power.

  • Diagnostic tests reveal violations of OLS assumptions. The Durbin-Watson test detects autocorrelation, the Breusch-Pagan test identifies heteroskedasticity, and VIF measures multicollinearity.

  • Robust standard errors (HAC/Newey-West) provide valid inference when classical assumptions fail. These are essential for financial time series data where volatility clusters and returns may be serially correlated.

  • Multiple regression extends the framework to multiple factors, decomposing returns into exposures to market, size, value, and other systematic factors.

The techniques developed here will serve as building blocks for the upcoming chapters. In Chapter 20, we'll use Principal Component Analysis to extract factors from return data rather than specifying them a priori. Part IV will apply these regression tools extensively to estimate and test asset pricing models, measure portfolio performance, and attribute returns to different sources.

Quiz

Ready to test your understanding? Take this quick quiz to reinforce what you've learned about regression analysis for financial relationships.

