Hypothesis Testing: P-values, Z-tests, T-tests, F-tests & ANOVA

Michael Brenndoerfer · January 7, 2026 · 113 min read

A comprehensive guide to hypothesis testing in statistics, covering p-values, null and alternative hypotheses, z-tests for known variance and proportions, t-tests (one-sample, two-sample, paired, Welch's), F-tests for comparing variances, ANOVA for multiple groups, Type I and Type II errors, statistical power, effect sizes, and multiple comparison corrections. Learn how to make rigorous data-driven decisions under uncertainty.


Hypothesis Testing: Making Decisions Under Uncertainty

Hypothesis testing provides a rigorous framework for evaluating claims about populations using sample data. Rather than relying on intuition or subjective judgment, hypothesis testing establishes formal procedures for determining whether observed patterns in data reflect genuine phenomena or merely random chance. This framework underpins A/B testing in technology, clinical trials in medicine, quality control in manufacturing, and countless other applications where data-driven decisions matter.

Introduction

Every time you make a claim based on data, you face uncertainty. Did that new feature actually improve user engagement, or did you just happen to observe a lucky streak? Is the average response time of your system really above the acceptable threshold, or is the apparent problem just sampling noise? Hypothesis testing gives you a structured way to answer these questions, quantifying the strength of evidence and controlling the rate at which you make incorrect conclusions.

The core logic is surprisingly elegant. You start with a skeptical assumption, typically that nothing interesting is happening, and then ask: if this assumption were true, how surprising would my observed data be? If the data would be extremely unlikely under the skeptical assumption, you have grounds to reject it. If the data are reasonably consistent with the skeptical assumption, you lack sufficient evidence to claim otherwise.

This chapter builds from foundations to practical application. We begin with the precise definition of p-values, addressing common misconceptions that plague even experienced practitioners. We then establish the mechanics of hypothesis testing: formulating null and alternative hypotheses, choosing between one-sided and two-sided tests, and understanding critical regions and test statistics. The connection between hypothesis tests and confidence intervals reveals these as two sides of the same inferential coin.

We then examine the assumptions underlying common tests and what happens when they fail. The choice between z-tests and t-tests, between pooled and Welch's t-tests, depends on what you know about your data and what you can safely assume. Error analysis introduces Type I and Type II errors, leading naturally to the concept of statistical power and sample size determination. Finally, we address effect sizes, which tell you whether statistically significant results are practically meaningful, and multiple comparison corrections for when you conduct many tests simultaneously.

What a P-value Is (and Isn't)

The p-value is perhaps the most misunderstood concept in statistics. Before we can properly conduct or interpret hypothesis tests, we need a crystal-clear understanding of what p-values actually mean.

P-value Definition

The p-value is the probability of observing data at least as extreme as the data actually observed, assuming the null hypothesis is true.

Let's unpack this definition carefully. The p-value is a conditional probability. It asks: given that the null hypothesis is true, what is the probability of seeing results as extreme as, or more extreme than, what we actually observed? The null hypothesis is the skeptical claim we are testing against, typically asserting no effect, no difference, or no relationship.

Consider a concrete example. Suppose you are testing whether a coin is fair. Your null hypothesis states that the probability of heads is 0.5. You flip the coin 100 times and observe 63 heads. The p-value answers the question: if the coin really were fair, what is the probability of getting 63 or more heads (or equivalently, 37 or fewer heads if we are doing a two-sided test)?

Out[2]:
Visualization
Histogram showing binomial distribution of coin flips with shaded p-value region.
Distribution of heads in 100 flips of a fair coin. The vertical dashed line marks the observed result of 63 heads. The shaded region to the right shows the probability of observing 63 or more heads if the coin is fair. This probability is the one-tailed p-value. For a two-tailed test, we would also shade the symmetric region on the left (37 or fewer heads).
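To make this concrete, the p-value for the coin example can be computed directly from the binomial distribution. The following sketch (assuming SciPy is available, and using the 100-flip, 63-heads numbers from above) computes the one-tailed probability and, using the symmetry of the fair-coin binomial, the two-tailed version.

from scipy import stats

# Null hypothesis: the coin is fair, p = 0.5
n_flips, observed_heads, p_null = 100, 63, 0.5

# One-tailed p-value: P(X >= 63) under the null
p_one_tailed = stats.binom.sf(observed_heads - 1, n_flips, p_null)

# Two-tailed p-value: by symmetry of the fair-coin binomial, also include P(X <= 37)
p_two_tailed = p_one_tailed + stats.binom.cdf(n_flips - observed_heads, n_flips, p_null)

print(f"One-tailed p-value: {p_one_tailed:.4f}")
print(f"Two-tailed p-value: {p_two_tailed:.4f}")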

Common Misinterpretations

The p-value is emphatically not the probability that the null hypothesis is true. This is the single most common and consequential misinterpretation. The null hypothesis is either true or false; it is not a random variable with a probability distribution. The p-value tells you about the probability of the data given the hypothesis, not the probability of the hypothesis given the data.

Similarly, 1 minus the p-value is not the probability that the alternative hypothesis is true. A p-value of 0.02 does not mean there is a 98% chance that the effect is real. Converting p-values into statements about hypothesis probabilities requires Bayesian reasoning and prior probabilities, which classical hypothesis testing does not provide.

The p-value also does not measure the size or importance of an effect. A tiny, practically meaningless difference can produce a minuscule p-value if the sample size is large enough. Conversely, a substantial and important effect might yield a non-significant p-value if the sample is too small. The p-value speaks only to the evidence against the null hypothesis, not to the magnitude of any effect.

Why P < 0.05 Is Not a Magic Threshold

The conventional threshold of 0.05 for "statistical significance" is a historical convention, not a law of nature. Ronald Fisher originally suggested 0.05 as a reasonable threshold for preliminary evidence, but it has calcified into an arbitrary binary cutoff that distorts scientific practice.

A p-value of 0.049 is not fundamentally different from a p-value of 0.051. Both represent similar strength of evidence against the null hypothesis. Yet the former is often reported as "significant" and the latter as "not significant," leading to dramatically different conclusions and publication outcomes. This cliff-like treatment of a continuous quantity loses information and encourages practices like p-hacking, where researchers manipulate analyses until they achieve the magic threshold.

Better practice involves reporting exact p-values and interpreting them on a continuum. A p-value of 0.001 provides much stronger evidence against the null hypothesis than a p-value of 0.04, yet both clear the 0.05 threshold. Different contexts warrant different thresholds: particle physics famously uses 5-sigma (approximately p < 0.0000003) for discovery claims, while exploratory analyses might reasonably use p < 0.10. The appropriate threshold depends on the costs of different types of errors in your specific application.

Test Setup Basics

With the p-value concept firmly established, we can now examine the mechanics of setting up a hypothesis test. This involves formulating hypotheses, choosing the test type, and understanding how test statistics relate to sampling distributions.

Null vs Alternative Hypotheses

Every hypothesis test involves two competing hypotheses. The null hypothesis, denoted H_0, represents the skeptical position, the claim we seek evidence against. Typically, the null hypothesis asserts that nothing interesting is happening: no difference between groups, no effect of a treatment, no relationship between variables. The null hypothesis is what we assume to be true unless the data provide sufficient evidence to the contrary.

The alternative hypothesis, denoted H_1 or H_a, represents the research claim we seek to establish. It contradicts the null hypothesis and is what we accept if we reject the null. The alternative might claim that a treatment is effective, that two groups differ, or that a relationship exists.

For example, when testing whether a new drug lowers blood pressure:

  • H_0: The drug has no effect (mean blood pressure change = 0)
  • H_1: The drug lowers blood pressure (mean blood pressure change < 0)

Or when testing whether two teaching methods produce different outcomes:

  • H_0: The methods are equally effective (mean difference = 0)
  • H_1: The methods differ in effectiveness (mean difference \neq 0)

The asymmetry between hypotheses is crucial. We never "accept" the null hypothesis; we either reject it or fail to reject it. Failing to reject is not the same as proving the null true. It simply means the data did not provide sufficient evidence against it.

One-Sided vs Two-Sided Tests

The choice between one-sided and two-sided tests depends on the research question and what alternatives are meaningful.

A two-sided test considers alternatives in both directions. When testing H_0: \mu = \mu_0, the two-sided alternative is H_1: \mu \neq \mu_0. You would reject the null if the sample mean is either much larger or much smaller than the hypothesized value. Two-sided tests are appropriate when departures from the null in either direction are scientifically meaningful and when you do not have strong prior reason to expect the effect to go in a particular direction.

A one-sided test considers alternatives in only one direction. The alternative might be H_1: \mu > \mu_0 or H_1: \mu < \mu_0. You specify before seeing the data which direction you expect, and you would only reject the null if the data deviate from the hypothesized value in that specific direction.

Out[3]:
Visualization
Normal distribution with two-sided rejection regions shaded in both tails.
Two-sided test with α = 0.05. The rejection region is split equally between both tails, with 0.025 in each tail. The null hypothesis is rejected if the test statistic falls in either shaded region, indicating the sample mean is significantly different from the hypothesized value in either direction.
Normal distribution with one-sided rejection region shaded in right tail.
One-sided test (right-tailed) with α = 0.05. The entire rejection region is in the right tail. The null hypothesis is rejected only if the test statistic exceeds the critical value, indicating the sample mean is significantly greater than the hypothesized value. One-sided tests have more power to detect effects in the hypothesized direction.

One-sided tests have more power to detect effects in the hypothesized direction because the entire significance level is concentrated in one tail. However, they cannot detect effects in the opposite direction, no matter how large. Using a one-sided test and then observing an effect in the unexpected direction creates an interpretive problem: technically, you should fail to reject the null, even if the effect in the "wrong" direction is substantial.

The choice should be made before examining the data, based on substantive considerations. If you would take the same action regardless of which direction an effect goes (for example, if any difference between methods warrants further investigation), use a two-sided test. If only one direction is scientifically meaningful or practically actionable (for example, you only care whether the new treatment is better, not whether it is worse), a one-sided test may be appropriate.
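To see the difference in code, the sketch below uses an illustrative z-statistic of 1.8 (a hypothetical value, not from any dataset in this chapter) and computes the two-sided and one-sided p-values with SciPy's normal distribution.

from scipy import stats

z = 1.8  # hypothetical test statistic

# Two-sided: probability of |Z| >= 1.8 in either tail
p_two_sided = 2 * stats.norm.sf(abs(z))

# One-sided (right-tailed): probability of Z >= 1.8 only
p_right = stats.norm.sf(z)

# One-sided (left-tailed): probability of Z <= 1.8
p_left = stats.norm.cdf(z)

print(f"Two-sided p-value:    {p_two_sided:.4f}")  # ~0.072
print(f"Right-tailed p-value: {p_right:.4f}")      # ~0.036
print(f"Left-tailed p-value:  {p_left:.4f}")       # ~0.964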

Test Statistics and Sampling Distributions

At the heart of every hypothesis test lies a simple but powerful idea: we need a single number that summarizes how strongly our data contradict the null hypothesis. This number is the test statistic, and understanding how to construct and interpret it is essential for mastering hypothesis testing.

A test statistic is a numerical summary of the data that captures how far the observed result deviates from what the null hypothesis predicts. But raw deviation alone is not enough. A sample mean that differs from the hypothesized value by 5 units might be highly significant or completely unremarkable, depending on how much variability we expect. If individual measurements typically vary by 2 units, a 5-unit difference is substantial. If they typically vary by 50 units, a 5-unit difference is noise.

This insight leads us to the most common form of test statistic, which measures deviation in standard error units:

\text{Test Statistic} = \frac{\text{Observed} - \text{Expected under } H_0}{\text{Standard Error}}

This general formula captures the essence of hypothesis testing: measuring how far your observed result deviates from what the null hypothesis predicts, scaled by the uncertainty in your estimate. The numerator asks "how different is our observation from what the null hypothesis claims?" while the denominator asks "how much variation would we typically expect due to chance alone?" The ratio tells us how surprising our result is, expressed in standardized units that allow comparison across different contexts.

Think of it this way: if you observe a sample mean 2 standard errors away from the hypothesized value, you know your observation is moderately unusual under the null hypothesis, regardless of whether you're measuring heights in centimeters, weights in kilograms, or response times in milliseconds. This standardization is what makes hypothesis testing so broadly applicable.

Larger test statistics indicate observations that are less consistent with the null hypothesis. A test statistic of 0.5 suggests your observation is quite typical of what you'd expect if the null were true. A test statistic of 3 suggests your observation would be quite rare under the null hypothesis, providing strong evidence against it.

For testing a population mean, the z-statistic (when population standard deviation is known) or t-statistic (when it is unknown) takes this form:

z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}} \quad \text{or} \quad t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}

where:

  • \bar{x}: the sample mean, our best estimate of the population mean
  • \mu_0: the hypothesized population mean under H_0
  • \sigma: the known population standard deviation (for z-test)
  • s: the sample standard deviation (for t-test, used when \sigma is unknown)
  • n: the sample size
  • \sigma / \sqrt{n} or s / \sqrt{n}: the standard error of the mean

Let's trace through the logic of why each component is necessary. The numerator \bar{x} - \mu_0 measures the raw discrepancy between what we observed and what the null hypothesis claims. If the null hypothesis states that the population mean is 100, and we observe a sample mean of 105, the numerator is 5. But as we noted, this raw difference is meaningless without context.

The denominator provides that context through the standard error. The standard error \sigma / \sqrt{n} (or s / \sqrt{n}) tells us how much sample means typically vary from sample to sample. If you drew many samples of size n from a population with mean \mu and standard deviation \sigma, the sample means would cluster around \mu with a standard deviation of \sigma / \sqrt{n}. This is the standard error of the mean.

The critical insight is that the standard error shrinks as the sample size grows, in proportion to 1/\sqrt{n}. With 4 times as many observations, the standard error is half as large. This reflects the fundamental statistical principle that averaging reduces noise: with more observations, random fluctuations tend to cancel out, and the sample mean becomes a more precise estimate of the population mean.

Under the null hypothesis, we know the probability distribution of the test statistic. This is the sampling distribution. It tells us what values of the test statistic we would expect to see if we repeated the sampling process many times and the null hypothesis were true. For the z-statistic, the sampling distribution is the standard normal distribution N(0, 1). For the t-statistic, it is a t-distribution with n - 1 degrees of freedom. These distributions are completely specified, allowing us to compute exact probabilities.

The critical region consists of values of the test statistic that are sufficiently unlikely under the null hypothesis. If the calculated test statistic falls in the critical region, we reject the null hypothesis. The boundaries of the critical region are determined by the significance level \alpha. For a two-sided test at \alpha = 0.05, the critical region includes values in both tails of the distribution, each tail containing 2.5% of the probability. For a one-sided test, all 5% is concentrated in one tail, making it easier to reject in the hypothesized direction but impossible to reject in the opposite direction.

In[4]:
Code
import numpy as np
from scipy import stats

# Example: Testing whether population mean differs from 100
sample = [98, 104, 97, 101, 99, 103, 95, 102, 100, 106]
hypothesized_mean = 100

sample_mean = np.mean(sample)
sample_std = np.std(sample, ddof=1)  # Sample standard deviation
n = len(sample)
standard_error = sample_std / np.sqrt(n)

# Calculate t-statistic
t_stat = (sample_mean - hypothesized_mean) / standard_error

# Calculate p-value (two-sided)
p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df=n - 1))
Out[5]:
Console
Sample mean: 100.50
Sample standard deviation: 3.37
Standard error: 1.07
t-statistic: 0.469
Two-sided p-value: 0.6506

The t-statistic of 0.469 indicates the sample mean is about half a standard error above the hypothesized mean of 100. The p-value of 0.6506 tells us that if the true mean were 100, we would observe a sample mean this far or farther from 100 about 65% of the time. This provides essentially no evidence against the null hypothesis.

Confidence Intervals as the Twin of Hypothesis Tests

Confidence intervals and hypothesis tests are mathematically equivalent tools that answer complementary questions. This equivalence is one of the most important conceptual insights in statistical inference, and understanding it deepens your appreciation of both methods.

A 95% confidence interval contains all values of the parameter that would not be rejected by a two-sided hypothesis test at the 0.05 significance level. Conversely, a hypothesis test rejects the null hypothesis at level \alpha if and only if the hypothesized value falls outside the (1-\alpha) confidence interval. These are not just related tools. They are two perspectives on the same underlying calculation.

The CI-Hypothesis Test Equivalence

To see why this equivalence holds, consider how we construct each object. A 95% confidence interval for a population mean takes the form [\bar{x} - z_{0.975} \cdot SE, \bar{x} + z_{0.975} \cdot SE] for large samples with known variance (using t critical values when the variance is estimated). The interval extends 1.96 standard errors in each direction from the sample mean.

Now consider testing the null hypothesis H_0: \mu = \mu_0 at the 0.05 significance level. We reject when the z-statistic z = (\bar{x} - \mu_0)/SE has absolute value exceeding 1.96. Rearranging this inequality:

|z| > 1.96 \iff \left|\frac{\bar{x} - \mu_0}{SE}\right| > 1.96 \iff |\bar{x} - \mu_0| > 1.96 \cdot SE

This last condition is precisely the condition for \mu_0 to fall outside the confidence interval! If \mu_0 lies within 1.96 standard errors of \bar{x}, the hypothesis test fails to reject; if \mu_0 lies more than 1.96 standard errors away, the hypothesis test rejects. The confidence interval is the set of all \mu_0 values that would not be rejected.

This equivalence provides a powerful interpretation. A confidence interval shows you the range of parameter values that are "compatible" with your data at the specified confidence level. Any value inside the interval would not be rejected as a hypothesis; any value outside would be rejected. In this sense, the confidence interval is more informative than a single hypothesis test, telling you the result of infinitely many hypothesis tests at once.

For instance, suppose you calculate a 95% confidence interval of [2.3, 5.7] for the difference between two treatment means. This interval immediately tells you:

  • H_0: \mu_1 - \mu_2 = 0 would be rejected (0 is outside the interval)
  • H_0: \mu_1 - \mu_2 = 3 would not be rejected (3 is inside the interval)
  • H_0: \mu_1 - \mu_2 = 6 would be rejected (6 is outside the interval)
  • Any value between 2.3 and 5.7 would not be rejected
Out[6]:
Visualization
Number line showing confidence interval and hypothesized values inside and outside the interval.
Equivalence between confidence intervals and hypothesis testing. The 95% confidence interval for the mean is [97.6, 102.5]. Testing H₀: μ = 100 at α = 0.05 would fail to reject because 100 lies within the interval. Testing H₀: μ = 96 would reject because 96 lies outside the interval. Any value inside the confidence interval corresponds to a non-rejected null hypothesis.
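The equivalence can also be verified numerically. The sketch below uses a small made-up sample, builds a 95% t-based confidence interval, and checks that SciPy's one-sample t-test rejects a hypothesized mean exactly when that mean falls outside the interval.

import numpy as np
from scipy import stats

# Hypothetical sample
sample = np.array([101.2, 99.4, 102.8, 100.9, 98.7, 103.1, 100.2, 101.7])
n = len(sample)
mean, se = sample.mean(), sample.std(ddof=1) / np.sqrt(n)

# 95% confidence interval based on the t-distribution
t_crit = stats.t.ppf(0.975, df=n - 1)
ci = (mean - t_crit * se, mean + t_crit * se)
print(f"95% CI: [{ci[0]:.2f}, {ci[1]:.2f}]")

# Any hypothesized mean inside the CI should not be rejected at alpha = 0.05,
# and any value outside should be rejected.
for mu0 in [98.0, 100.0, 103.0]:
    p = stats.ttest_1samp(sample, mu0).pvalue
    inside = ci[0] <= mu0 <= ci[1]
    print(f"H0: mu = {mu0:5.1f}  p = {p:.4f}  inside CI: {inside}  reject: {p < 0.05}")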

Interpreting CI Width and Practical Meaning

The width of a confidence interval communicates the precision of your estimate. Narrow intervals indicate precise estimates; wide intervals indicate substantial uncertainty. Several factors affect interval width:

  • Sample size: Larger samples produce narrower intervals because the standard error decreases as 1/\sqrt{n}.
  • Population variability: Greater variation in the population produces wider intervals.
  • Confidence level: Higher confidence requires wider intervals to maintain coverage probability.

Beyond statistical significance, confidence intervals convey practical meaning. An interval of [0.1, 0.3] for an effect size suggests the effect is positive but small. An interval of [-0.5, 2.5] suggests substantial uncertainty about even the sign of the effect. This additional context is lost when you report only p-values or binary significance decisions.
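A quick sketch makes these width factors concrete. Assuming a fixed sample standard deviation of 10 (an illustrative value), the snippet below shows how the half-width of a t-based interval shrinks with sample size and grows with the confidence level.

import numpy as np
from scipy import stats

s = 10.0  # assumed sample standard deviation

for n in [10, 40, 160]:
    se = s / np.sqrt(n)
    for conf in [0.90, 0.95, 0.99]:
        # Two-sided t critical value for this confidence level
        t_crit = stats.t.ppf(1 - (1 - conf) / 2, df=n - 1)
        margin = t_crit * se
        print(f"n = {n:4d}, {conf:.0%} CI half-width = {margin:5.2f}")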

Assumptions and What Breaks When They Fail

Every statistical test rests on assumptions about the data-generating process. When these assumptions are violated, the test may yield misleading results: p-values may be too small or too large, and Type I error rates may deviate from nominal levels. Understanding assumptions helps you choose appropriate tests and interpret results cautiously when assumptions are questionable.

Independence and Random Sampling

The most fundamental assumption for most tests is independence: observations should not influence each other. This assumption is violated in many practical situations:

  • Repeated measures: Multiple observations from the same individual are correlated.
  • Clustering: Students within the same classroom, patients at the same hospital, or observations from the same time period tend to be more similar than observations from different clusters.
  • Time series: Sequential observations often exhibit autocorrelation.

Violating independence typically leads to underestimated standard errors and inflated Type I error rates. If you treat 100 observations from 10 people (10 per person) as if they were 100 independent observations, you dramatically overstate the effective sample size. The consequences can be severe: what appears to be strong evidence against the null hypothesis may simply reflect the dependence structure.

When independence is violated, use methods designed for the data structure: paired tests for matched data, mixed-effects models for clustered data, or time series methods for autocorrelated data.

Normality Assumptions

Many tests assume that the data, or certain functions of the data, are normally distributed. For t-tests and z-tests, the assumption is that the sampling distribution of the mean is normal. Thanks to the Central Limit Theorem, this is approximately true for large samples regardless of the population distribution, which is why these tests are often described as "robust to non-normality for large n."

For small samples, the situation is more nuanced. Moderate departures from normality, such as mild skewness, typically have minor effects on t-test validity. Severe skewness, heavy tails, or outliers can substantially distort results, leading to either inflated or deflated Type I error rates depending on the specific departure.

Out[7]:
Visualization
Histogram showing symmetric distribution close to normal.
Symmetric distribution: The t-test performs well even for small samples when the population distribution is roughly symmetric, even if not perfectly normal.
Histogram showing moderately right-skewed distribution.
Moderately skewed distribution: With moderate skewness, the t-test is reasonably robust for sample sizes of 30 or more, but small samples may have inflated or deflated Type I error rates.
Histogram showing heavily right-skewed distribution.
Heavily skewed distribution: Severe skewness can substantially distort t-test results even for moderate sample sizes. Consider nonparametric alternatives or transformations.

Rules of thumb suggest n > 30 is "large enough" for the CLT to provide adequate approximation, but this depends on how non-normal the population is. For heavily skewed distributions, even n = 100 may not be sufficient. When in doubt, use nonparametric tests that make fewer distributional assumptions, or use bootstrap methods to empirically assess the sampling distribution.
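As a sketch of the bootstrap idea mentioned above, the snippet below resamples a deliberately skewed synthetic sample with replacement to approximate the sampling distribution of the mean and a percentile confidence interval. The data and the number of resamples are illustrative choices.

import numpy as np

rng = np.random.default_rng(42)

# Hypothetical right-skewed data (e.g., response times)
data = rng.exponential(scale=2.0, size=40)

# Bootstrap the sampling distribution of the mean
n_boot = 10_000
boot_means = np.array([
    rng.choice(data, size=len(data), replace=True).mean()
    for _ in range(n_boot)
])

# Percentile 95% confidence interval for the mean
ci_lower, ci_upper = np.percentile(boot_means, [2.5, 97.5])
print(f"Sample mean: {data.mean():.3f}")
print(f"Bootstrap 95% CI: [{ci_lower:.3f}, {ci_upper:.3f}]")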

Equal Variances

Two-sample t-tests traditionally assume equal variances in the two populations (homoscedasticity). The pooled t-test uses this assumption to combine variance estimates from both groups, gaining precision when the assumption holds.

When variances differ substantially (heteroscedasticity), the pooled t-test can yield incorrect p-values. Welch's t-test, which does not assume equal variances, is the safer default choice. It adjusts the degrees of freedom to account for unequal variances and performs nearly as well as the pooled test when variances are actually equal, while providing valid inference when they are not.

When CLT Makes z/t "Okay Enough"

The Central Limit Theorem states that the sampling distribution of the mean approaches normality as sample size increases, regardless of the population distribution. This is the primary reason why t-tests and z-tests work so broadly: even when individual observations are non-normal, their averages tend toward normality.

The convergence rate depends on the population distribution. Symmetric distributions converge quickly; five or ten observations may suffice. Moderately skewed distributions require sample sizes of 30 to 50 for reasonable approximation. Heavily skewed or heavy-tailed distributions may require sample sizes in the hundreds.

For practical purposes, if you have a sample size of at least 30, the t-test is usually acceptable unless you observe severe outliers or extreme skewness in your data. If you have smaller samples, check the data for obvious non-normality using histograms or Q-Q plots, and consider nonparametric alternatives if the assumption appears questionable.
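For a quick diagnostic along these lines, the sketch below applies the Shapiro-Wilk test and draws a Q-Q plot for a small synthetic sample (assuming SciPy and matplotlib are available). Both are screening tools rather than definitive verdicts, especially for small n.

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
sample = rng.lognormal(mean=0, sigma=0.8, size=25)  # skewed example data

# Shapiro-Wilk test: small p-values suggest non-normality
stat, p = stats.shapiro(sample)
print(f"Shapiro-Wilk: W = {stat:.3f}, p = {p:.4f}")

# Q-Q plot against the normal distribution
stats.probplot(sample, dist="norm", plot=plt)
plt.title("Q-Q plot vs normal")
plt.show()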

Choosing Between z and t Tests

The choice between z-tests and t-tests hinges on whether the population standard deviation is known.

Known vs Unknown σ

The z-test assumes you know the population standard deviation \sigma. The test statistic is:

z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}

Under the null hypothesis, this follows a standard normal distribution exactly, regardless of sample size (assuming the population is normal) or approximately (for large samples from any distribution with finite variance).

In practice, knowing \sigma is rare. You might know it if the measurement process has been extensively calibrated, if you have very large historical data, or if external standards specify it. But in most research and data science applications, \sigma is unknown and must be estimated from the sample.

The t-test replaces \sigma with the sample standard deviation s:

t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}

This substitution introduces additional uncertainty: s itself is a random variable that varies from sample to sample. The t-distribution accounts for this extra uncertainty by having heavier tails than the normal distribution, yielding wider confidence intervals and larger p-values for the same observed deviation.

Small Sample Behavior

The t-distribution depends on the degrees of freedom, which for a one-sample test equals n - 1. With few degrees of freedom, the t-distribution is substantially heavier-tailed than the normal, and critical values are larger. As degrees of freedom increase, the t-distribution converges to the standard normal.

Out[8]:
Visualization
Line plot comparing t-distributions with varying degrees of freedom to the standard normal distribution.
Comparison of t-distributions with different degrees of freedom and the standard normal (z) distribution. With few degrees of freedom (df=3), the t-distribution has heavy tails, meaning extreme values are more likely than under the normal. As degrees of freedom increase, the t-distribution approaches the normal. For df > 30, the two are nearly indistinguishable.

For sample sizes below 30, using the z-test instead of the t-test when \sigma is unknown can produce p-values that are too small, leading to inflated Type I error rates. Always use the t-test unless you have genuine prior knowledge of the population standard deviation.
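The gap between t and z critical values is easy to quantify. The sketch below compares the two-sided critical values at α = 0.05 across several degrees of freedom using SciPy's quantile functions.

from scipy import stats

z_crit = stats.norm.ppf(0.975)
print(f"z critical value (alpha = 0.05, two-sided): {z_crit:.3f}")

# t critical values shrink toward the z value as degrees of freedom grow
for df in [3, 5, 10, 30, 100]:
    t_crit = stats.t.ppf(0.975, df=df)
    print(f"df = {df:3d}: t critical = {t_crit:.3f} (vs z = {z_crit:.3f})")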

Welch vs Pooled t-test

When comparing means between two independent groups, you have a choice between the pooled (Student's) t-test and Welch's t-test.

The pooled t-test assumes equal variances in both populations and combines (pools) the variance estimates:

s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}

where:

  • s_p^2: the pooled variance estimate, a weighted average of the two sample variances
  • n_1, n_2: the sample sizes of groups 1 and 2
  • s_1^2, s_2^2: the sample variances of groups 1 and 2
  • (n_1 - 1) and (n_2 - 1): degrees of freedom for each sample, which serve as weights

The weighting by degrees of freedom gives more influence to larger samples, which provide more reliable variance estimates. This pooled variance estimate is used to calculate the standard error of the difference in means.

Welch's t-test does not assume equal variances and uses a different formula for the standard error and degrees of freedom:

SE = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}

where:

  • SE: the standard error of the difference between means
  • s_1^2, s_2^2: the sample variances of groups 1 and 2
  • n_1, n_2: the sample sizes of groups 1 and 2

This formula treats each group's variance independently, adding the variances of the two sampling distributions. The intuition is that the uncertainty in the difference comes from uncertainty in both means, and these uncertainties combine additively when the samples are independent.

The degrees of freedom are calculated using the Welch-Satterthwaite approximation, which is typically not an integer.

When variances are equal, both tests have similar power, though the pooled test is slightly more efficient. When variances differ, the pooled test can have either inflated or deflated Type I error rates depending on the relationship between variances and sample sizes. Welch's test maintains the correct Type I error rate regardless of variance equality.

Given this asymmetry, many statisticians recommend using Welch's test as the default for two-sample comparisons. You lose little when variances are equal and gain robustness when they are not.

In[9]:
Code
from scipy import stats

# Two groups with unequal variances
group_a = [23, 25, 28, 22, 26, 24, 27, 29, 25, 24]
group_b = [35, 42, 38, 45, 40, 36, 48, 41, 39, 44]

# Pooled t-test (assumes equal variances)
t_pooled, p_pooled = stats.ttest_ind(group_a, group_b, equal_var=True)

# Welch's t-test (does not assume equal variances)
t_welch, p_welch = stats.ttest_ind(group_a, group_b, equal_var=False)
Out[10]:
Console
Group A: mean = 25.30, std = 2.21
Group B: mean = 40.80, std = 4.08

Pooled t-test: t = -10.565, p = 0.000000
Welch's t-test: t = -10.565, p = 0.000000

In this example, both tests strongly reject the null hypothesis of equal means. Because the two groups have equal sample sizes, the pooled and Welch standard errors coincide and the t-statistics are identical; the tests differ only in their degrees of freedom, so the p-values differ slightly (both round to zero here). When variances and sample sizes both differ, the discrepancy between the two tests can be much larger.

The Z-Test in Detail

The z-test is the simplest parametric test for comparing a sample mean to a hypothesized population mean. While rarely used in practice due to its requirement of knowing the population standard deviation, understanding the z-test provides essential foundation for grasping more complex tests. The z-test also applies directly in certain scenarios, particularly when working with proportions or when historical data provides reliable variance estimates.

When to Use the Z-Test

The z-test is appropriate under specific conditions:

  • Known population standard deviation: You have reliable prior knowledge of \sigma from extensive historical data, calibration studies, or external standards.
  • Large sample sizes: Even without knowing \sigma exactly, the z-test provides a reasonable approximation when n is very large (typically n > 100), because the t-distribution converges to the normal distribution.
  • Testing proportions: When testing hypotheses about population proportions, the z-test is the standard approach, using the normal approximation to the binomial distribution.

In practice, genuine knowledge of \sigma is rare in research settings. You might encounter it in manufacturing contexts where measurement instruments have been extensively characterized, in standardized testing where population parameters are well-established, or when working with proportions where the variance is determined by the proportion itself.

Mathematical Foundation

To truly understand the z-test, we need to appreciate both the formula itself and the deep statistical reasoning that makes it work. The z-test rests on one of the most remarkable results in probability theory: when you know the population standard deviation, the sampling distribution of the mean follows a predictable pattern that we can exploit for inference.

The z-test statistic measures how many standard errors the sample mean lies from the hypothesized population mean:

z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}

where:

  • \bar{x}: Sample mean, calculated as \frac{1}{n}\sum_{i=1}^{n} x_i
  • \mu_0: Hypothesized population mean under H_0
  • \sigma: Known population standard deviation
  • n: Sample size
  • \sigma / \sqrt{n}: Standard error of the mean

Let's build intuition for why this formula works by tracing the logic step by step.

Step 1: The sampling distribution of the mean. When you draw a random sample of size n from a population with mean \mu and standard deviation \sigma, the sample mean \bar{x} is itself a random variable. If you could repeat the sampling process infinitely many times, calculating a sample mean each time, those sample means would form a distribution called the sampling distribution of the mean. This distribution has two crucial properties:

  • Its center (expected value) equals the population mean: E[\bar{x}] = \mu
  • Its spread (standard deviation) equals \sigma / \sqrt{n}

The first property says that sample means are unbiased estimators of the population mean. On average, they hit the target. The second property says that sample means are more tightly clustered than individual observations, and the clustering improves with larger samples.

Step 2: Standardization. If we know \mu and \sigma, we can standardize \bar{x} by subtracting its mean and dividing by its standard deviation:

Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}

This standardized variable Z has mean 0 and standard deviation 1. If the population is normally distributed, Z follows exactly a standard normal distribution N(0, 1). Even if the population is not normal, the Central Limit Theorem guarantees that Z will be approximately N(0, 1) for large samples.

Step 3: Testing the hypothesis. Under the null hypothesis H_0: \mu = \mu_0, we substitute \mu_0 for \mu in the standardization formula:

z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}

If the null hypothesis is true, this z-statistic follows a standard normal distribution. We can therefore compute the probability of observing a z-statistic as extreme as (or more extreme than) the one we calculated. This probability is the p-value.

Under the null hypothesis, this z-statistic follows a standard normal distribution N(0, 1) exactly when the population is normally distributed, and approximately when the sample size is large enough for the Central Limit Theorem to apply. This approximation is remarkably good even for moderately non-normal populations when n exceeds 30 or so.

The standard error \sigma / \sqrt{n} deserves careful attention. It represents the standard deviation of the sampling distribution of the mean. While \sigma measures how much individual observations vary around the population mean, the standard error measures how much sample means would vary if you repeatedly drew samples of size n.

The factor of \sqrt{n} in the denominator reflects the reduction in variability achieved by averaging: with more observations, chance fluctuations tend to cancel out. This is sometimes called the "square root law" and has profound practical implications. To cut the standard error in half, you need to quadruple your sample size. To reduce it to one-tenth of its original value, you need 100 times as many observations. This diminishing return is why there is always a practical limit to how precise you can make your estimates through sheer sample size.
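The square root law is easy to verify numerically. Assuming an illustrative population standard deviation of 20, the sketch below shows that each quadrupling of the sample size halves the standard error.

import numpy as np

sigma = 20.0  # assumed known population standard deviation

for n in [25, 100, 400, 1600]:
    se = sigma / np.sqrt(n)
    print(f"n = {n:5d}: standard error = {se:.2f}")
# Each 4x increase in n cuts the standard error in half: 4.0, 2.0, 1.0, 0.5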

The beauty of the z-statistic is that it converts the complex question "Is my sample mean consistent with the null hypothesis?" into the simpler question "Is this number consistent with a standard normal distribution?" Since the standard normal distribution is completely characterized and tabulated, we can answer the second question precisely.

One-Sample Z-Test: Complete Procedure

Let's work through a complete one-sample z-test example. Suppose a manufacturer claims their light bulbs have a mean lifetime of 1000 hours, with a known population standard deviation of 100 hours (established through extensive quality testing). You sample 50 bulbs and find a mean lifetime of 975 hours. Is there evidence that the true mean differs from the claimed 1000 hours?

Step 1: State the hypotheses

  • H_0: \mu = 1000 (The mean lifetime equals the claimed value)
  • H_1: \mu \neq 1000 (The mean lifetime differs from the claimed value)

This is a two-sided test because we are interested in deviations in either direction.

Step 2: Choose the significance level

We use \alpha = 0.05, accepting a 5% risk of a false positive.

Step 3: Calculate the test statistic

z = \frac{975 - 1000}{100 / \sqrt{50}} = \frac{-25}{14.14} = -1.77

Step 4: Determine the p-value or compare to critical values

For a two-sided test at \alpha = 0.05, the critical values are \pm 1.96. Since |-1.77| < 1.96, the test statistic does not fall in the rejection region.

Alternatively, the p-value is 2 \times P(Z < -1.77) = 2 \times 0.0384 = 0.0768.

Step 5: Make a decision

Since p = 0.0768 > 0.05, we fail to reject the null hypothesis. The sample does not provide sufficient evidence to conclude that the mean lifetime differs from 1000 hours.

In[11]:
Code
import numpy as np
from scipy import stats

# One-sample z-test
sample_mean = 975
hypothesized_mean = 1000
population_std = 100
n = 50

# Calculate z-statistic
standard_error = population_std / np.sqrt(n)
z_stat = (sample_mean - hypothesized_mean) / standard_error

# Calculate p-value (two-sided)
p_value = 2 * stats.norm.sf(abs(z_stat))

# Critical values for alpha = 0.05 (two-sided)
z_critical = stats.norm.ppf(0.975)

# Confidence interval
margin_of_error = z_critical * standard_error
ci_lower = sample_mean - margin_of_error
ci_upper = sample_mean + margin_of_error
Out[12]:
Console
One-Sample Z-Test Results
========================================
Sample mean: 975
Hypothesized mean: 1000
Population std: 100
Sample size: 50
Standard error: 14.14

z-statistic: -1.768
p-value (two-sided): 0.0771
Critical values (α=0.05): ±1.960

95% CI: [947.28, 1002.72]

Decision: Fail to reject H₀

The 95% confidence interval [947.28, 1002.72] contains the hypothesized value of 1000, which is consistent with failing to reject the null hypothesis. This illustrates the equivalence between confidence intervals and hypothesis tests.

Two-Sample Z-Test

When comparing means from two independent populations with known variances, the two-sample z-test is appropriate. The test statistic becomes:

z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)_0}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}

Under the null hypothesis that \mu_1 = \mu_2 (or more generally, that \mu_1 - \mu_2 = \delta_0 for some hypothesized difference \delta_0), this statistic follows a standard normal distribution.

The denominator is the standard error of the difference between means, calculated as the square root of the sum of the individual variances divided by their respective sample sizes. This formula assumes independence between the two samples.

In[13]:
Code
import numpy as np
from scipy import stats


def two_sample_z_test(
    mean1,
    mean2,
    sigma1,
    sigma2,
    n1,
    n2,
    hypothesized_diff=0,
    alternative="two-sided",
):
    """
    Perform a two-sample z-test with known population variances.

    Parameters:
    -----------
    mean1, mean2 : float
        Sample means
    sigma1, sigma2 : float
        Known population standard deviations
    n1, n2 : int
        Sample sizes
    hypothesized_diff : float
        Hypothesized difference μ1 - μ2 under H0 (default: 0)
    alternative : str
        'two-sided', 'greater', or 'less'

    Returns:
    --------
    z_stat, p_value, ci : tuple
    """
    # Standard error of the difference
    se_diff = np.sqrt((sigma1**2 / n1) + (sigma2**2 / n2))

    # z-statistic
    observed_diff = mean1 - mean2
    z_stat = (observed_diff - hypothesized_diff) / se_diff

    # p-value
    if alternative == "two-sided":
        p_value = 2 * stats.norm.sf(abs(z_stat))
    elif alternative == "greater":
        p_value = stats.norm.sf(z_stat)
    else:  # 'less'
        p_value = stats.norm.cdf(z_stat)

    # 95% CI for the difference
    z_crit = stats.norm.ppf(0.975)
    ci = (observed_diff - z_crit * se_diff, observed_diff + z_crit * se_diff)

    return z_stat, p_value, ci


# Example: Comparing two manufacturing processes
# Process A: mean = 52.3, known σ = 4.2, n = 40
# Process B: mean = 50.1, known σ = 3.8, n = 35
z_stat, p_value, ci = two_sample_z_test(52.3, 50.1, 4.2, 3.8, 40, 35)
Out[14]:
Console
Two-Sample Z-Test Results
========================================
Group 1: mean = 52.3, σ = 4.2, n = 40
Group 2: mean = 50.1, σ = 3.8, n = 35

z-statistic: 2.381
p-value (two-sided): 0.0173
95% CI for difference: [0.39, 4.01]

Z-Test for Proportions

One of the most common applications of the z-test is testing hypotheses about population proportions. When the sample size is sufficiently large, the sample proportion \hat{p} is approximately normally distributed, enabling the use of z-test procedures.

For a one-sample test of a proportion, the null hypothesis specifies a value p_0 for the population proportion, and the test statistic is:

z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}}

where:

  • \hat{p}: the sample proportion (observed successes divided by sample size)
  • p_0: the hypothesized population proportion under H_0
  • n: the sample size
  • \sqrt{p_0(1-p_0)/n}: the standard error of the proportion under the null hypothesis

Note that the standard error uses the hypothesized proportion p_0 rather than the sample proportion. This is because under the null hypothesis, we assume p_0 is the true proportion.

The normal approximation is generally considered adequate when both np_0 \geq 10 and n(1-p_0) \geq 10. For smaller samples or proportions near 0 or 1, exact binomial tests or other methods may be more appropriate.

Out[15]:
Visualization
Normal distribution showing sampling distribution for proportion test with rejection regions shaded.
Sampling distribution for a proportion test. The null hypothesis states that the true proportion is 0.5. With a sample of n=200 and an observed proportion of 0.58, we calculate the z-statistic and determine whether this deviation is large enough to reject the null hypothesis. The shaded regions show the rejection areas for a two-sided test at α = 0.05.
In[16]:
Code
import numpy as np
from scipy import stats


def proportion_z_test(successes, n, p0, alternative="two-sided"):
    """
    One-sample z-test for a proportion.

    Parameters:
    -----------
    successes : int
        Number of successes observed
    n : int
        Sample size
    p0 : float
        Hypothesized proportion under H0
    alternative : str
        'two-sided', 'greater', or 'less'

    Returns:
    --------
    p_hat, z_stat, p_value : tuple
    """
    p_hat = successes / n
    se = np.sqrt(p0 * (1 - p0) / n)
    z_stat = (p_hat - p0) / se

    if alternative == "two-sided":
        p_value = 2 * stats.norm.sf(abs(z_stat))
    elif alternative == "greater":
        p_value = stats.norm.sf(z_stat)
    else:
        p_value = stats.norm.cdf(z_stat)

    return p_hat, z_stat, p_value


# Example: Testing if conversion rate differs from 10%
# Observed: 127 conversions out of 1000 visitors
p_hat, z_stat, p_value = proportion_z_test(127, 1000, 0.10)
Out[17]:
Console
One-Sample Proportion Z-Test
========================================
Observed: 127 successes out of 1000
Sample proportion: 0.127
Hypothesized proportion: 0.10

z-statistic: 2.846
p-value (two-sided): 0.0044

Normal approximation check: np₀ = 100, n(1-p₀) = 900
Both ≥ 10, so normal approximation is adequate.
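For small samples or extreme proportions, the exact binomial test mentioned earlier is the safer option. As a sketch, the snippet below runs SciPy's exact test (stats.binomtest, available in recent SciPy versions) on the same 127-out-of-1000 data; because the approximation conditions are comfortably met here, it should agree closely with the z-test above.

from scipy import stats

successes, n, p0 = 127, 1000, 0.10

# Exact binomial test (no normal approximation)
exact = stats.binomtest(successes, n, p0, alternative="two-sided")
print(f"Exact binomial test p-value: {exact.pvalue:.4f}")

# Exact (Clopper-Pearson) 95% confidence interval for the proportion
ci = exact.proportion_ci(confidence_level=0.95)
print(f"Exact 95% CI for p: [{ci.low:.4f}, {ci.high:.4f}]")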

The T-Test in Detail

The t-test is the workhorse of hypothesis testing for means when the population standard deviation is unknown. Developed by William Sealy Gosset under the pseudonym "Student" in 1908 while working at the Guinness Brewery in Dublin, the t-test accounts for the additional uncertainty introduced by estimating the variance from the sample data. Gosset published under a pseudonym because Guinness had a policy against employees publishing scientific papers, fearing competitors might realize the advantage of employing statisticians.

The t-test solved a fundamental problem that had vexed statisticians. The z-test assumes we know the population standard deviation, but in practice we almost never do. When researchers simply substituted the sample standard deviation s for the unknown population standard deviation \sigma and used normal distribution critical values, they made errors more often than they should have. Gosset showed that this substitution introduces additional variability that must be accounted for, and he derived the exact distribution of the resulting test statistic.

The Student's t-Distribution

When we replace the known population standard deviation \sigma with the sample standard deviation s, the test statistic no longer follows a normal distribution. Instead, it follows a t-distribution with degrees of freedom determined by the sample size. Understanding why this happens deepens our appreciation of the t-test's elegance.

Consider what happens when we compute t = \frac{\bar{x} - \mu}{s / \sqrt{n}}. The numerator and denominator are both random quantities that vary from sample to sample. In the z-test, the denominator is fixed at \sigma / \sqrt{n}, so only the numerator varies. But in the t-test, we estimate the standard error from the same sample we use to compute the mean. This introduces a dependence between numerator and denominator that changes the distribution of the ratio.

The sample standard deviation s is itself an imperfect estimate of \sigma. With small samples, s can be considerably larger or smaller than \sigma by chance. When s underestimates \sigma, the t-statistic is inflated, making our result appear more significant than it should. When s overestimates \sigma, the t-statistic is deflated. On average, these errors don't favor either direction, but they do add variability to the test statistic.

The t-distribution captures exactly this additional variability. It has several key properties:

  • Bell-shaped and symmetric about zero: Like the normal distribution, the t-distribution is unimodal and symmetric, centered at zero under the null hypothesis.
  • Parameterized by degrees of freedom (df): The shape depends on df, which typically equals n - 1 for a one-sample test. The degrees of freedom represent the amount of independent information available to estimate the variance. With more degrees of freedom, we have a better estimate of \sigma, and the t-distribution becomes closer to normal.
  • Heavier tails than normal: The t-distribution has more probability in its tails than the standard normal. This means extreme values are more likely, reflecting the uncertainty added by estimating the variance.
  • Converges to normal: As df increases, the t-distribution approaches the standard normal distribution. For df > 30, the two are nearly identical, which is why older textbooks sometimes say "use z for large samples." For df > 100, the difference is negligible for practical purposes.
  • Accounts for estimation uncertainty: The heavier tails appropriately penalize us for not knowing the true population variance. This penalty is larger for smaller samples, where our variance estimate is less reliable.

The heavier tails of the t-distribution mean that extreme values are more probable than under the normal distribution. This translates to larger critical values and wider confidence intervals, appropriately penalizing us for not knowing the true population variance. For example, with 5 degrees of freedom, the two-sided critical value at \alpha = 0.05 is 2.571, substantially larger than the normal value of 1.96. With 30 degrees of freedom, it drops to 2.042, quite close to 1.96.

The mathematical derivation of the t-distribution involves the ratio of a standard normal random variable to the square root of an independent chi-squared random variable divided by its degrees of freedom. While this may sound esoteric, the practical consequence is simple: use t-distribution critical values when you estimate variance from the sample, and use normal critical values only when you genuinely know the population variance or have a sample so large that the distinction is irrelevant.
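This construction can be checked by simulation. The sketch below draws a standard normal variable and an independent chi-squared variable, forms the ratio, and compares tail quantiles of the simulated values with SciPy's t-distribution; the degrees of freedom and simulation size are illustrative.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
df = 5
n_sim = 200_000

# T = Z / sqrt(V / df), with Z ~ N(0, 1) and V ~ chi-squared(df), independent
z = rng.standard_normal(n_sim)
v = rng.chisquare(df, n_sim)
t_simulated = z / np.sqrt(v / df)

# Compare simulated quantiles with the theoretical t-distribution
for q in [0.95, 0.975, 0.995]:
    sim_q = np.quantile(t_simulated, q)
    theory_q = stats.t.ppf(q, df=df)
    print(f"quantile {q}: simulated = {sim_q:.3f}, theoretical = {theory_q:.3f}")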

Out[18]:
Visualization
Comparison of tail probabilities between t-distribution and normal distribution.
Comparison of t-distribution tails with the standard normal. The t-distribution with 5 degrees of freedom has substantially more probability in the tails, meaning extreme values are more likely. This is why critical values for small samples are larger than z-critical values.
Plot showing t-critical values decreasing toward z-critical value as degrees of freedom increase.
Critical values for the t-distribution at α = 0.05 (two-sided) as a function of degrees of freedom. For small df, critical values are much larger than the normal value of 1.96. As df increases, the t-critical value converges to the z-critical value.

One-Sample T-Test: Complete Procedure

The one-sample t-test compares a sample mean to a hypothesized population mean when the population variance is unknown. The test statistic is:

t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}

This follows a t-distribution with n - 1 degrees of freedom under the null hypothesis.

Let's work through a detailed example. A coffee shop claims their large drinks contain 16 ounces. You measure 15 randomly selected drinks and find a mean of 15.6 ounces with a sample standard deviation of 0.8 ounces. Is there evidence that the true mean differs from the claimed 16 ounces?

In[19]:
Code
import numpy as np
from scipy import stats

# Sample data (15 drink measurements)
drinks = [
    15.2, 15.8, 16.1, 15.4, 15.9,
    15.3, 15.7, 16.0, 15.5, 15.2,
    15.8, 15.6, 15.9, 15.4, 16.2,
]

# Hypothesized mean
mu_0 = 16.0

# Sample statistics
n = len(drinks)
x_bar = np.mean(drinks)
s = np.std(drinks, ddof=1)  # Sample std with Bessel's correction
se = s / np.sqrt(n)

# T-statistic
t_stat = (x_bar - mu_0) / se
df = n - 1

# P-value (two-sided)
p_value = 2 * stats.t.sf(abs(t_stat), df=df)

# Critical values
t_crit = stats.t.ppf(0.975, df=df)

# 95% Confidence interval
ci_lower = x_bar - t_crit * se
ci_upper = x_bar + t_crit * se
Out[20]:
Console
One-Sample T-Test: Coffee Shop Drink Size
==================================================
Sample size: n = 15
Sample mean: x̄ = 15.667 oz
Sample std: s = 0.324 oz
Standard error: SE = 0.084 oz
Hypothesized mean: μ₀ = 16.0 oz

t-statistic: t(14) = -3.980
p-value (two-sided): 0.0014
Critical values (α=0.05): ±2.145

95% CI: [15.487, 15.846]

Decision: Reject H₀ at α = 0.05
Conclusion: Evidence suggests drinks differ from 16 oz

The sample mean of 15.67 ounces is below the claimed 16 ounces, and with p = 0.0014 < 0.05, we reject the null hypothesis. The 95% confidence interval [15.49, 15.85] does not contain 16, consistent with our rejection. Note how the confidence interval and hypothesis test give the same conclusion, as they must mathematically.
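As a sanity check on the manual calculation, SciPy's built-in one-sample t-test should reproduce the same statistic and p-value from the drinks data. The confidence_interval method used here is available in recent SciPy versions.

from scipy import stats

# Same measurements as in the cell above
drinks = [15.2, 15.8, 16.1, 15.4, 15.9, 15.3, 15.7, 16.0,
          15.5, 15.2, 15.8, 15.6, 15.9, 15.4, 16.2]

result = stats.ttest_1samp(drinks, popmean=16.0)
print(f"t-statistic: {result.statistic:.3f}")  # should match -3.980 above
print(f"p-value:     {result.pvalue:.4f}")     # should match 0.0014 above

# Recent SciPy versions expose the confidence interval directly
ci = result.confidence_interval(confidence_level=0.95)
print(f"95% CI: [{ci.low:.3f}, {ci.high:.3f}]")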

Independent Two-Sample T-Test

The two-sample t-test compares means from two independent groups. There are two variants: the pooled (Student's) t-test assuming equal variances, and Welch's t-test which does not assume equal variances.

Pooled T-Test (Equal Variances)

When population variances are assumed equal, we pool the sample variances to get a more precise estimate. The pooled variance is:

s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}

where:

  • s_p^2: the pooled variance estimate
  • n_1, n_2: sample sizes for groups 1 and 2
  • s_1^2, s_2^2: sample variances for groups 1 and 2
  • n_1 + n_2 - 2: total degrees of freedom (each group contributes n - 1)

This is a weighted average of the sample variances, with weights proportional to degrees of freedom. The test statistic is:

t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}

where:

  • \bar{x}_1 - \bar{x}_2: the observed difference in sample means
  • s_p: the pooled standard deviation (square root of the pooled variance)
  • \sqrt{1/n_1 + 1/n_2}: a factor that accounts for the sample sizes in both groups

The denominator s_p \sqrt{1/n_1 + 1/n_2} is the standard error of the difference between means. This follows a t-distribution with df = n_1 + n_2 - 2 under the null hypothesis.

Welch's T-Test (Unequal Variances)

When we cannot assume equal variances, Welch's t-test uses a different standard error:

SE = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}

The degrees of freedom are calculated using the Welch-Satterthwaite approximation:

df = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{(s_1^2/n_1)^2}{n_1-1} + \frac{(s_2^2/n_2)^2}{n_2-1}}

where:

  • df: the effective degrees of freedom for the t-distribution
  • s_1^2, s_2^2: sample variances for groups 1 and 2
  • n_1, n_2: sample sizes for groups 1 and 2

This formula estimates how many degrees of freedom the combined variance estimate effectively has. When variances are unequal, the effective degrees of freedom is reduced, making the test more conservative (wider confidence intervals, larger p-values). This complex formula typically yields a non-integer degrees of freedom, which is handled by interpolation or by rounding down for a more conservative test.

Out[21]:
Visualization
Side-by-side boxplots comparing treatment and control group distributions.
Visualization of a two-sample t-test comparing treatment and control groups. The boxplots show the distribution of scores in each group, with the sample means marked. The overlap between groups determines the strength of evidence for a difference. Little overlap, combined with small within-group variability, leads to large t-statistics and small p-values.
In[22]:
Code
import numpy as np
from scipy import stats


def two_sample_t_test_detailed(group1, group2, equal_var=True):
    """
    Perform detailed two-sample t-test with full output.

    Parameters:
    -----------
    group1, group2 : array-like
        Sample data from each group
    equal_var : bool
        If True, use pooled t-test; if False, use Welch's t-test

    Returns:
    --------
    dict with all statistics
    """
    n1, n2 = len(group1), len(group2)
    x1_bar, x2_bar = np.mean(group1), np.mean(group2)
    s1, s2 = np.std(group1, ddof=1), np.std(group2, ddof=1)

    if equal_var:
        # Pooled variance
        sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
        sp = np.sqrt(sp2)
        se = sp * np.sqrt(1 / n1 + 1 / n2)
        df = n1 + n2 - 2
        test_type = "Pooled (Student's)"
    else:
        # Welch's approximation
        se = np.sqrt(s1**2 / n1 + s2**2 / n2)
        # Welch-Satterthwaite df
        num = (s1**2 / n1 + s2**2 / n2) ** 2
        denom = (s1**2 / n1) ** 2 / (n1 - 1) + (s2**2 / n2) ** 2 / (n2 - 1)
        df = num / denom
        test_type = "Welch's"

    t_stat = (x1_bar - x2_bar) / se
    p_value = 2 * stats.t.sf(abs(t_stat), df=df)

    # 95% CI for difference
    t_crit = stats.t.ppf(0.975, df=df)
    diff = x1_bar - x2_bar
    ci = (diff - t_crit * se, diff + t_crit * se)

    # Cohen's d
    pooled_std = np.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    cohens_d = (x1_bar - x2_bar) / pooled_std

    return {
        "test_type": test_type,
        "n1": n1,
        "n2": n2,
        "mean1": x1_bar,
        "mean2": x2_bar,
        "std1": s1,
        "std2": s2,
        "se": se,
        "df": df,
        "t_stat": t_stat,
        "p_value": p_value,
        "ci": ci,
        "cohens_d": cohens_d,
    }


# Example: Comparing two teaching methods
method_a = [78, 82, 85, 79, 81, 84, 77, 83, 80, 82, 79, 86]
method_b = [85, 89, 92, 88, 90, 93, 86, 91, 88, 90, 87, 94, 89, 91]

results_pooled = two_sample_t_test_detailed(method_b, method_a, equal_var=True)
results_welch = two_sample_t_test_detailed(method_b, method_a, equal_var=False)
Out[23]:
Console
Two-Sample T-Test: Comparing Teaching Methods
=======================================================

POOLED T-TEST (assumes equal variances)
-------------------------------------------------------
Method A: n = 12, mean = 81.33, std = 2.84
Method B: n = 14, mean = 89.50, std = 2.59
Mean difference (B - A): 8.17

t-statistic: t(24) = 7.662
p-value: 0.000000
95% CI for difference: [5.97, 10.37]
Cohen's d: 3.01

WELCH'S T-TEST (no assumption of equal variances)
-------------------------------------------------------
t-statistic: t(22.6) = 7.607
p-value: 0.000000
95% CI for difference: [5.94, 10.39]

Both tests yield highly significant results (p < 0.001), with Method B showing substantially higher scores. Cohen's d of 3.01 indicates a very large effect size. The similarity between the pooled and Welch's results here reflects that the sample variances are similar; when variances differ more substantially, the tests would diverge.

Paired T-Test

The paired t-test is used when observations come in matched pairs, such as before-after measurements on the same subjects, or matched case-control studies. By analyzing the differences within pairs, the paired t-test controls for individual variation, often resulting in more powerful tests than independent two-sample comparisons.

The test statistic is simply a one-sample t-test on the differences:

t = \frac{\bar{d}}{s_d / \sqrt{n}}

where:

  • \bar{d}: the mean of the paired differences (after minus before, or treatment minus control)
  • s_d: the standard deviation of the differences
  • n: the number of pairs
  • s_d / \sqrt{n}: the standard error of the mean difference

This follows a t-distribution with n - 1 degrees of freedom. The key insight is that by computing differences within pairs first, we reduce the problem to a one-sample test, eliminating between-subject variability that would otherwise obscure the treatment effect.

In[24]:
Code
import numpy as np
from scipy import stats

# Example: Blood pressure before and after medication
before = [145, 152, 148, 155, 160, 142, 158, 165, 150, 155]
after = [138, 145, 140, 150, 152, 140, 148, 158, 145, 148]

# Calculate differences
differences = np.array(after) - np.array(before)
n = len(differences)
d_bar = np.mean(differences)
s_d = np.std(differences, ddof=1)
se_d = s_d / np.sqrt(n)

# T-statistic
t_stat = d_bar / se_d
df = n - 1

# P-value
p_value = 2 * stats.t.sf(abs(t_stat), df=df)

# Using scipy directly
t_stat_scipy, p_value_scipy = stats.ttest_rel(after, before)
Out[25]:
Console
Paired T-Test: Blood Pressure Study
==================================================
Sample size: n = 10 paired observations

Before treatment: [145, 152, 148, 155, 160, 142, 158, 165, 150, 155]
After treatment:  [138, 145, 140, 150, 152, 140, 148, 158, 145, 148]
Differences:      [-7, -7, -8, -5, -8, -2, -10, -7, -5, -7]

Mean difference: d̄ = -6.60 mmHg
SD of differences: s_d = 2.17 mmHg
Standard error: SE = 0.69 mmHg

t-statistic: t(9) = -9.616
p-value (two-sided): 0.000005

95% CI for mean change: [-8.15, -5.05]

Conclusion: Significant change in blood pressure after treatment

The paired t-test reveals a significant reduction in blood pressure (mean decrease of 6.6 mmHg, p < 0.001). The confidence interval [-8.15, -5.05] excludes zero, confirming the significant effect. The paired design is powerful here because it eliminates between-subject variability, focusing only on within-subject changes.

Assumptions of T-Tests and Checking Them

T-tests rely on several assumptions. Violating these assumptions can affect the validity of conclusions:

  1. Independence: Observations should be independent (or, for paired tests, the pairs should be independent of each other).

  2. Normality: The sampling distribution of the mean should be approximately normal. This is satisfied when:

    • The population is normally distributed, OR
    • The sample size is large enough for the CLT to apply (typically n > 30)
  3. Homogeneity of variance (for pooled two-sample t-test only): The population variances should be equal.

Here's how to check these assumptions:

Out[26]:
Visualization
Q-Q plot showing sample quantiles versus theoretical normal quantiles.
Q-Q plot for checking normality. If data are normally distributed, points should fall approximately along the diagonal reference line. Deviations at the tails indicate skewness or heavy tails. This sample shows approximate normality with slight deviation in the right tail.
Side-by-side boxplots comparing variance between two groups.
Comparison of variances between groups using boxplots. The spread (IQR and whisker length) of each group provides a visual assessment of variance homogeneity. Similar spreads suggest equal variances; markedly different spreads suggest using Welch's t-test instead of the pooled t-test.
In[27]:
Code
import numpy as np
from scipy import stats

# Functions for assumption checking


def check_normality(data, alpha=0.05):
    """
    Check normality using Shapiro-Wilk test.
    Returns test statistic, p-value, and interpretation.
    """
    stat, p_value = stats.shapiro(data)
    normal = p_value > alpha
    return {"statistic": stat, "p_value": p_value, "normal": normal}


def check_equal_variance(group1, group2, alpha=0.05):
    """
    Check equality of variances using Levene's test.
    Levene's test is robust to non-normality.
    """
    stat, p_value = stats.levene(group1, group2)
    equal_var = p_value > alpha
    return {"statistic": stat, "p_value": p_value, "equal_var": equal_var}


# Example usage
np.random.seed(42)
group_a = np.random.normal(50, 8, 30)
group_b = np.random.normal(55, 15, 25)

norm_a = check_normality(group_a)
norm_b = check_normality(group_b)
var_check = check_equal_variance(group_a, group_b)
Out[28]:
Console
Assumption Checking
==================================================

Normality (Shapiro-Wilk test, H₀: data is normal)
  Group A: W = 0.975, p = 0.6868
           ✓ Normality assumption satisfied
  Group B: W = 0.986, p = 0.9754
           ✓ Normality assumption satisfied

Equal Variances (Levene's test, H₀: variances are equal)
  W = 11.979, p = 0.0011
  ✗ Equal variance assumption violated

Recommendation:
  Use Welch's t-test (unequal variances)

Deciding Which Test to Use: A Decision Framework

Choosing the right t-test variant depends on your research design and data characteristics:

Out[29]:
Visualization
Flowchart showing decision process for selecting appropriate t-test.
Decision tree for selecting the appropriate t-test. Start with the research design (one sample, two independent samples, or paired), then consider assumptions to choose between test variants. When in doubt about equal variances, Welch's t-test is the safer choice.

The key recommendations are:

  • One sample, σ known: Use z-test (rare in practice)
  • One sample, σ unknown: Use one-sample t-test
  • Two independent samples, equal variances: Pooled t-test or Welch's t-test
  • Two independent samples, unequal variances: Welch's t-test
  • Two independent samples, uncertain about variances: Default to Welch's t-test
  • Paired/matched observations: Paired t-test

When in doubt between the pooled and Welch's t-tests, choose Welch's. You sacrifice minimal power when variances are equal but gain robustness when they are not.
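
In practice you rarely need to code these tests by hand. Here is a minimal sketch using scipy.stats.ttest_ind, whose equal_var flag switches between the pooled and Welch variants (the sample values below are made up for illustration):

from scipy import stats

# Hypothetical samples with visibly different spreads
group1 = [12.1, 13.4, 11.8, 12.9, 13.1, 12.5]
group2 = [14.0, 17.2, 11.5, 18.9, 13.3, 16.1, 15.4]

# equal_var=True gives the pooled (Student's) test; equal_var=False gives Welch's
t_pooled, p_pooled = stats.ttest_ind(group1, group2, equal_var=True)
t_welch, p_welch = stats.ttest_ind(group1, group2, equal_var=False)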

The F-Test and F-Distribution

While z-tests and t-tests focus on comparing means, the F-test addresses a different but equally important question: comparing variances. The F-test also forms the foundation for Analysis of Variance (ANOVA), which extends hypothesis testing to compare means across three or more groups. Understanding the F-distribution and its applications is essential for regression analysis, experimental design, and many advanced statistical methods.

The F-test is named after Sir Ronald A. Fisher, one of the founders of modern statistics, who developed much of the theory of variance analysis in the 1920s. Fisher recognized that many important questions in science and industry involve comparing sources of variation rather than just means. Is one manufacturing process more variable than another? Do different treatments produce different amounts of variation in patient outcomes? Does adding predictors to a regression model significantly reduce residual variance? These questions all lead naturally to F-tests.

The F-Distribution

To understand the F-test, we must first understand where the F-distribution comes from and why it arises when comparing variances.

Recall that when we estimate variance from a sample, we compute s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2. If the underlying population is normally distributed with variance \sigma^2, then the quantity (n-1)s^2/\sigma^2 follows a chi-squared distribution with n-1 degrees of freedom. This is because it is a sum of n-1 independent squared standard normal random variables (the n-1 comes from losing one degree of freedom when we estimate the mean).

Now suppose we have two independent samples, each from a potentially different population. From the first sample, we compute s_1^2, and from the second, s_2^2. If both populations are normal with the same variance \sigma^2, then:

  • (n_1-1)s_1^2/\sigma^2 \sim \chi^2_{n_1-1}
  • (n_2-1)s_2^2/\sigma^2 \sim \chi^2_{n_2-1}

The F-distribution arises naturally when comparing two independent estimates of variance. If you have two independent chi-squared random variables divided by their respective degrees of freedom, their ratio follows an F-distribution:

F = \frac{\chi_1^2 / df_1}{\chi_2^2 / df_2} = \frac{s_1^2 / \sigma_1^2}{s_2^2 / \sigma_2^2}

where:

  • F: the F-statistic, a ratio of variance estimates
  • \chi_1^2, \chi_2^2: chi-squared distributed random variables (sums of squared normal deviates)
  • df_1, df_2: degrees of freedom for the numerator and denominator, typically n_1 - 1 and n_2 - 1
  • s_1^2, s_2^2: sample variances from two populations
  • \sigma_1^2, \sigma_2^2: true population variances

When \sigma_1^2 = \sigma_2^2 (the null hypothesis of equal variances), the ratio simplifies beautifully. The unknown common variance \sigma^2 cancels from numerator and denominator, leaving:

F = \frac{s_1^2}{s_2^2}

This ratio of sample variances, under the null hypothesis, follows an F-distribution with df_1 = n_1 - 1 and df_2 = n_2 - 1 degrees of freedom. The beauty of this result is that we can test whether two population variances are equal without knowing what those variances actually are.

The F-distribution has several distinctive properties that set it apart from the normal and t-distributions:

  • Two degrees of freedom parameters: Unlike the t-distribution with one df parameter, the F-distribution requires two: df_1 (numerator) and df_2 (denominator). The order matters! F(5, 10) is different from F(10, 5).
  • Always non-negative: Since it is a ratio of squared quantities (variances are always positive), F is always greater than or equal to zero. There are no negative F-values.
  • Right-skewed: The distribution is asymmetric, with a long right tail. As both df increase, it becomes more symmetric and concentrated, approaching normality.
  • Mean approximately 1 under the null: When the null hypothesis is true (equal variances), the expected value of F is close to 1, specifically E[F] = df_2/(df_2 - 2) for df_2 > 2. This makes intuitive sense: if two variances are equal, their ratio should be around 1.
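
These properties are easy to confirm numerically with scipy.stats.f (a quick check; printed values are approximate):

from scipy import stats

# The order of the df parameters matters: F(5, 10) and F(10, 5) have different quantiles
print(stats.f.ppf(0.95, 5, 10))   # roughly 3.33
print(stats.f.ppf(0.95, 10, 5))   # roughly 4.74

# Under the null, E[F] = df2 / (df2 - 2) for df2 > 2
dfn, dfd = 5, 10
print(stats.f.mean(dfn, dfd), dfd / (dfd - 2))  # both 1.25
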
Out[30]:
Visualization
Line plot showing F-distributions with varying degrees of freedom.
F-distributions with different degrees of freedom. The shape depends on both numerator (df₁) and denominator (df₂) degrees of freedom. With small degrees of freedom, the distribution is highly right-skewed. As degrees of freedom increase, the distribution becomes more concentrated around 1 and more symmetric.
F-distribution with shaded right-tail rejection region.
Critical values for the F-distribution at α = 0.05. The right tail contains the rejection region for testing whether the numerator variance exceeds the denominator variance. Unlike the symmetric normal and t-distributions, F-tests typically use only the right tail because we often test whether one variance is greater than another.

F-Test for Comparing Two Variances

The simplest application of the F-test compares variances between two independent populations. While less famous than its cousin for comparing means, this test addresses practical questions that arise frequently in applied work:

  • Checking the equal variance assumption: Before using a pooled t-test, you need to verify that the two populations have similar variances. The F-test provides a formal way to check this assumption.
  • Comparing process variability: In manufacturing and quality control, consistency often matters as much as the average. Two production lines might produce the same average output, but one might be more variable, leading to more defects.
  • Risk assessment: In finance, comparing the variances of two investments tells you about their relative risk, even if their expected returns are similar.
  • Method comparison: When evaluating two measurement methods, you might want to know if one produces more variable results than the other.

The logic of the F-test for variances is elegantly simple. If two populations have the same variance, then sample variances drawn from those populations should be similar to each other. Their ratio should be close to 1. If the ratio is far from 1, we have evidence that the population variances differ.

The test statistic is the ratio of the two sample variances:

F = \frac{s_1^2}{s_2^2}

where:

  • F: the F-statistic for comparing variances
  • s_1^2: sample variance of the first group (typically placed as the larger variance)
  • s_2^2: sample variance of the second group

Let's trace through why this ratio works. Each sample variance s_i^2 estimates its corresponding population variance \sigma_i^2. If the null hypothesis H_0: \sigma_1^2 = \sigma_2^2 is true, both sample variances are estimating the same quantity. The ratio should therefore fluctuate around 1, with the fluctuation determined by the sampling distribution of the ratio.

By convention, we place the larger variance in the numerator so that F \geq 1. Under the null hypothesis that \sigma_1^2 = \sigma_2^2, this ratio follows an F-distribution with df_1 = n_1 - 1 and df_2 = n_2 - 1. If the true variances are equal, F should be close to 1; values much larger than 1 suggest the numerator variance is genuinely larger.

The degrees of freedom reflect the information available for estimating each variance. Larger samples provide more reliable variance estimates, leading to F-ratios that cluster more tightly around 1 under the null hypothesis. This is captured by the F-distribution becoming less dispersed as both degrees of freedom increase.

Two-Tailed F-Test

When testing whether variances are different (not specifically whether one is greater), you can use a two-tailed test. However, since the F-distribution is not symmetric, you cannot simply double the upper-tail probability; instead, double the smaller of the two tail probabilities: p = 2 \times \min(P(F > f_{obs}), P(F < f_{obs})).

In[31]:
Code
import numpy as np
from scipy import stats


def f_test_variance(sample1, sample2, alternative="two-sided"):
    """
    F-test for comparing two population variances.

    Parameters:
    -----------
    sample1, sample2 : array-like
        Sample data from each population
    alternative : str
        'two-sided': H1: σ1² ≠ σ2²
        'greater': H1: σ1² > σ2²
        'less': H1: σ1² < σ2²

    Returns:
    --------
    dict with test statistics and results
    """
    n1, n2 = len(sample1), len(sample2)
    var1, var2 = np.var(sample1, ddof=1), np.var(sample2, ddof=1)

    # F-statistic: ratio of sample1's variance to sample2's
    # (the caller decides which sample goes in the numerator)
    f_stat = var1 / var2
    df1, df2 = n1 - 1, n2 - 1

    # P-value calculation
    if alternative == "greater":
        p_value = stats.f.sf(f_stat, df1, df2)
    elif alternative == "less":
        p_value = stats.f.cdf(f_stat, df1, df2)
    else:  # two-sided
        # Two-tailed: probability of being as extreme in either direction
        p_upper = stats.f.sf(f_stat, df1, df2)
        p_lower = stats.f.cdf(f_stat, df1, df2)
        p_value = 2 * min(p_upper, p_lower)

    # Critical values
    f_crit_upper = stats.f.ppf(0.975, df1, df2)
    f_crit_lower = stats.f.ppf(0.025, df1, df2)

    return {
        "var1": var1,
        "var2": var2,
        "f_stat": f_stat,
        "df1": df1,
        "df2": df2,
        "p_value": p_value,
        "f_crit_lower": f_crit_lower,
        "f_crit_upper": f_crit_upper,
    }


# Example: Comparing variability of two manufacturing processes
np.random.seed(42)
process_a = np.random.normal(50, 5, 25)  # Lower variance
process_b = np.random.normal(50, 8, 30)  # Higher variance

results = f_test_variance(process_b, process_a)
Out[32]:
Console
F-Test for Comparing Variances
==================================================
Process A: n = 25, variance = 22.87
Process B: n = 30, variance = 51.78

F-statistic: F(29, 24) = 2.264
P-value (two-sided): 0.0445

Critical values (α = 0.05, two-sided):
  Lower: 0.464
  Upper: 2.217

Decision: Reject H₀ - Evidence of unequal variances

F-Test in ANOVA: Comparing Multiple Group Means

The most common application of the F-test is in Analysis of Variance (ANOVA), which extends the two-sample t-test to compare means across three or more groups simultaneously. ANOVA is one of the most widely used statistical techniques in experimental research, from agricultural field trials (where it originated) to clinical trials, psychology experiments, and A/B/C testing in technology.

The need for ANOVA arises from a fundamental problem with multiple comparisons. Suppose you want to compare the effectiveness of four different treatments. You might be tempted to conduct all pairwise t-tests: treatment 1 vs. 2, treatment 1 vs. 3, treatment 1 vs. 4, treatment 2 vs. 3, and so on. With four groups, that's 6 separate tests. Even if all treatments are equally effective (the null is true), you would reject at least one comparison about 26% of the time at \alpha = 0.05, not the 5% you might expect. This inflation of Type I error becomes worse with more groups.

ANOVA solves this problem by testing the omnibus null hypothesis that all group means are equal, using a single test that controls the overall Type I error rate.

The Logic of ANOVA

The genius of ANOVA lies in its approach: instead of directly comparing means, it compares variances. This might seem counterintuitive. How does comparing variances tell us about means? The logic is elegant once you understand it.

ANOVA compares two sources of variation:

  1. Between-group variance: How much the group means differ from the overall mean. If the null hypothesis is true (all population means are equal), the observed differences among group means reflect only random sampling variation. If the null is false (some means differ), the group means will be more spread out than sampling variation alone would predict.

  2. Within-group variance: How much individual observations vary around their group means. This is the baseline variability inherent in the data, driven by individual differences and measurement error. Importantly, this within-group variance is unaffected by whether the population means are equal or different. It measures the "noise" in our data.

If the null hypothesis is true (all population means are equal), both variance estimates should be similar, yielding an F-ratio near 1. The between-group variance just reflects the random sampling variation of means, which should be comparable to what we'd expect given the within-group variability.

If the group means truly differ, the between-group variance will be inflated relative to the within-group variance, producing a large F-ratio. The group means will be more spread out than we'd expect from chance alone, and this excess spread is evidence that the population means differ.

The F-statistic for one-way ANOVA is:

F = \frac{\text{MS}_{\text{between}}}{\text{MS}_{\text{within}}} = \frac{SS_{\text{between}} / (k - 1)}{SS_{\text{within}} / (N - k)}

where:

  • F: the ANOVA F-statistic, the ratio of between-group to within-group variance
  • \text{MS}_{\text{between}}: mean square between groups (variance of group means around the grand mean)
  • \text{MS}_{\text{within}}: mean square within groups (pooled variance within each group)
  • SS_{\text{between}} = \sum_{j=1}^{k} n_j (\bar{x}_j - \bar{x})^2: sum of squares between groups, measuring how spread out the group means are
  • SS_{\text{within}} = \sum_{j=1}^{k} \sum_{i=1}^{n_j} (x_{ij} - \bar{x}_j)^2: sum of squares within groups, measuring variation within each group
  • k: the number of groups
  • N: the total sample size across all groups
  • n_j: the sample size of group j
  • \bar{x}_j: the mean of group j
  • \bar{x}: the overall (grand) mean of all observations
  • x_{ij}: the i-th observation in group j
  • (k - 1): degrees of freedom for between-group variance
  • (N - k): degrees of freedom for within-group variance

The intuition is that if group means truly differ, the between-group variance will be large relative to the within-group variance, producing a large F value.

Out[33]:
Visualization
Scatter plot showing three groups with their means and illustrating between-group and within-group variation.
Visualization of ANOVA concepts showing three groups with different means. The between-group variation (how far group means are from the grand mean) is compared to within-group variation (how far individual observations are from their group mean). A large ratio of between to within variation provides evidence that group means differ.

Performing One-Way ANOVA

Let's work through a complete ANOVA example. Suppose we want to test whether three different fertilizers produce different plant growth.

In[34]:
Code
import numpy as np
from scipy import stats


def one_way_anova_detailed(groups):
    """
    Perform one-way ANOVA with detailed output.

    Parameters:
    -----------
    groups : list of arrays
        List containing data for each group

    Returns:
    --------
    dict with ANOVA table components
    """
    k = len(groups)
    all_data = np.concatenate(groups)
    N = len(all_data)
    grand_mean = np.mean(all_data)

    # Calculate group statistics
    group_means = [np.mean(g) for g in groups]
    group_ns = [len(g) for g in groups]

    # Sum of Squares Between (SSB)
    ssb = sum(
        n * (mean - grand_mean) ** 2 for n, mean in zip(group_ns, group_means)
    )

    # Sum of Squares Within (SSW)
    ssw = sum(
        sum((x - mean) ** 2 for x in g) for g, mean in zip(groups, group_means)
    )

    # Total Sum of Squares
    sst = sum((x - grand_mean) ** 2 for x in all_data)

    # Degrees of freedom
    df_between = k - 1
    df_within = N - k
    df_total = N - 1

    # Mean Squares
    msb = ssb / df_between
    msw = ssw / df_within

    # F-statistic
    f_stat = msb / msw

    # P-value
    p_value = stats.f.sf(f_stat, df_between, df_within)

    # Effect size (eta-squared)
    eta_squared = ssb / sst

    return {
        "k": k,
        "N": N,
        "grand_mean": grand_mean,
        "group_means": group_means,
        "group_ns": group_ns,
        "ssb": ssb,
        "ssw": ssw,
        "sst": sst,
        "df_between": df_between,
        "df_within": df_within,
        "df_total": df_total,
        "msb": msb,
        "msw": msw,
        "f_stat": f_stat,
        "p_value": p_value,
        "eta_squared": eta_squared,
    }


# Example: Plant growth with three fertilizers
np.random.seed(42)
fertilizer_a = [23, 25, 21, 24, 22, 26, 23, 24, 25, 22]
fertilizer_b = [28, 31, 29, 30, 27, 32, 28, 29, 31, 30]
fertilizer_c = [25, 27, 24, 26, 28, 25, 26, 27, 24, 26]

results = one_way_anova_detailed([fertilizer_a, fertilizer_b, fertilizer_c])
Out[35]:
Console
One-Way ANOVA: Fertilizer Effect on Plant Growth
============================================================

Group Statistics:
  Fertilizer A: n = 10, mean = 23.50
  Fertilizer B: n = 10, mean = 29.50
  Fertilizer C: n = 10, mean = 25.80
  Grand mean: 26.27

ANOVA Table:
------------------------------------------------------------
Source                  SS    df         MS        F    p-value
------------------------------------------------------------
Between             183.27     2      91.63    40.83   0.000000
Within               60.60    27       2.24
Total               243.87    29
------------------------------------------------------------

Effect size (η²): 0.752
  Interpretation: 75.2% of variance explained by group

Decision: Reject H₀ - At least one group mean differs
In[36]:
Code
# Verify with scipy
f_scipy, p_scipy = stats.f_oneway(fertilizer_a, fertilizer_b, fertilizer_c)
Out[37]:
Console

Verification with scipy.stats.f_oneway:
  F = 40.83, p = 0.000000

The ANOVA reveals a highly significant difference among fertilizers (F(2, 27) = 40.83, p < 0.001). The effect size η² = 0.75 indicates that 75% of the variation in plant growth is explained by the fertilizer type, representing a large effect.

Post-Hoc Tests: Which Groups Differ?

When ANOVA rejects the null hypothesis, we know at least one group differs, but not which specific groups differ from each other. Post-hoc tests make pairwise comparisons while controlling the family-wise error rate.

Tukey's Honest Significant Difference (HSD) is the most common post-hoc test for ANOVA:

In[38]:
Code
from scipy import stats
import numpy as np


def tukey_hsd(groups, alpha=0.05):
    """
    Perform Tukey's HSD test for pairwise comparisons.

    Returns comparison results for all pairs.
    """
    from itertools import combinations

    k = len(groups)
    N = sum(len(g) for g in groups)

    # Calculate MSW (pooled variance estimate)
    ssw = sum(sum((x - np.mean(g)) ** 2 for x in g) for g in groups)
    df_within = N - k
    msw = ssw / df_within

    # Tukey's q critical value
    # Note: this sketch approximates Tukey's HSD with Bonferroni-corrected t-tests;
    # for exact studentized-range results, use statsmodels' pairwise_tukeyhsd
    # (or scipy.stats.tukey_hsd in recent scipy versions)

    results = []
    for (i, gi), (j, gj) in combinations(enumerate(groups), 2):
        ni, nj = len(gi), len(gj)
        mean_diff = np.mean(gi) - np.mean(gj)

        # Standard error for Tukey
        se = np.sqrt(msw * (1 / ni + 1 / nj) / 2)

        # Use t-distribution with Bonferroni correction as approximation
        n_comparisons = k * (k - 1) / 2
        t_stat = abs(mean_diff) / se
        p_value = 2 * stats.t.sf(t_stat, df_within) * n_comparisons
        p_value = min(p_value, 1.0)  # Cap at 1

        results.append(
            {
                "comparison": f"Group {i + 1} vs Group {j + 1}",
                "mean_diff": mean_diff,
                "se": se,
                "p_value": p_value,
                "significant": p_value < alpha,
            }
        )

    return results


# Perform post-hoc comparisons
posthoc = tukey_hsd([fertilizer_a, fertilizer_b, fertilizer_c])
Out[39]:
Console
Post-Hoc Pairwise Comparisons (Bonferroni-corrected)
============================================================
Group 1 vs Group 2: diff = -6.00, p = 0.0000 *
Group 1 vs Group 3: diff = -2.30, p = 0.0001 *
Group 2 vs Group 3: diff = +3.70, p = 0.0000 *

* indicates significant difference at α = 0.05

Assumptions of ANOVA

ANOVA relies on similar assumptions to the t-test:

  1. Independence: Observations within and between groups are independent.
  2. Normality: Data within each group should be approximately normally distributed.
  3. Homogeneity of variance: All groups should have similar variances (homoscedasticity).

Levene's test can check the equal variance assumption:

In[40]:
Code
# Check homogeneity of variance with Levene's test
levene_stat, levene_p = stats.levene(fertilizer_a, fertilizer_b, fertilizer_c)
Out[41]:
Console
Assumption Check: Homogeneity of Variance
==================================================
Levene's test: W = 0.471, p = 0.6295

✓ Equal variance assumption is satisfied (p > 0.05)

When the equal variance assumption is violated, use Welch's ANOVA (available as scipy.stats.alexandergovern) or the nonparametric Kruskal-Wallis test (scipy.stats.kruskal).
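
For reference, both alternatives are one-liners with the fertilizer data from above (assuming a reasonably recent scipy; the Alexander-Govern test was added in version 1.7):

from scipy import stats

# Alexander-Govern test: compares group means without assuming equal variances
ag = stats.alexandergovern(fertilizer_a, fertilizer_b, fertilizer_c)

# Kruskal-Wallis: rank-based test that the groups come from the same distribution
kw_stat, kw_p = stats.kruskal(fertilizer_a, fertilizer_b, fertilizer_c)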

F-Test in Regression: Overall Model Significance

Another crucial application of the F-test is assessing the overall significance of a regression model. The F-test compares the variance explained by the model to the unexplained (residual) variance.

In multiple regression, the F-statistic tests whether any of the predictor variables have a significant relationship with the response variable:

F = \frac{SS_{\text{regression}} / p}{SS_{\text{residual}} / (n - p - 1)} = \frac{MS_{\text{regression}}}{MS_{\text{residual}}}

where:

  • F: the overall F-statistic for the regression model
  • SS_{\text{regression}}: sum of squares explained by the model (variance captured by the predictors)
  • SS_{\text{residual}}: sum of squares not explained by the model (remaining variance)
  • MS_{\text{regression}}: mean square for regression (variance explained per predictor)
  • MS_{\text{residual}}: mean square for residuals (unexplained variance per residual degree of freedom)
  • p: the number of predictor variables (excluding the intercept)
  • n: the total number of observations
  • (n - p - 1): residual degrees of freedom

The null hypothesis is that all regression coefficients (except the intercept) are zero, meaning none of the predictors help explain the response.

In[42]:
Code
import numpy as np
from scipy import stats

# Generate regression data
np.random.seed(42)
n = 50
x1 = np.random.normal(0, 1, n)
x2 = np.random.normal(0, 1, n)
y = 3 + 2 * x1 + 1.5 * x2 + np.random.normal(0, 1, n)

# Fit regression using linear algebra
X = np.column_stack([np.ones(n), x1, x2])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
y_pred = X @ beta

# Calculate sums of squares
ss_total = np.sum((y - np.mean(y)) ** 2)
ss_residual = np.sum((y - y_pred) ** 2)
ss_regression = ss_total - ss_residual

# Degrees of freedom
p = 2  # number of predictors
df_regression = p
df_residual = n - p - 1
df_total = n - 1

# Mean squares
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

# F-statistic and p-value
f_stat = ms_regression / ms_residual
p_value = stats.f.sf(f_stat, df_regression, df_residual)

# R-squared
r_squared = ss_regression / ss_total
Out[43]:
Console
F-Test for Overall Regression Significance
=======================================================

Model: y = β₀ + β₁x₁ + β₂x₂ + ε

Coefficients: β₀ = 2.940, β₁ = 1.890, β₂ = 1.245

ANOVA Table for Regression:
-------------------------------------------------------
Source               SS    df         MS        F          p
-------------------------------------------------------
Regression       231.28     2     115.64   114.88   0.000000
Residual          47.31    47       1.01
Total            278.59    49
-------------------------------------------------------

R² = 0.8302
  Interpretation: 83.0% of variance in y explained by model

Decision: Model is statistically significant

Comparing Nested Models: F-Test for Model Comparison

The F-test can also compare two nested regression models to determine whether additional predictors significantly improve the model. A model is "nested" within another if it is a special case obtained by setting some coefficients to zero.

F = \frac{(SS_{\text{res, reduced}} - SS_{\text{res, full}}) / (df_{\text{reduced}} - df_{\text{full}})}{SS_{\text{res, full}} / df_{\text{full}}}

where:

  • F: the F-statistic for comparing nested models
  • SS_{\text{res, reduced}}: residual sum of squares for the simpler (reduced) model
  • SS_{\text{res, full}}: residual sum of squares for the complex (full) model with additional predictors
  • df_{\text{reduced}}: residual degrees of freedom for the reduced model
  • df_{\text{full}}: residual degrees of freedom for the full model
  • (df_{\text{reduced}} - df_{\text{full}}): the number of additional predictors being tested

The numerator measures the improvement (reduction in residual sum of squares) per additional predictor, while the denominator is the residual mean square of the full model, an estimate of the error variance. This tests whether the reduction in residual sum of squares from adding the extra predictors is statistically significant.

In[44]:
Code
import numpy as np
from scipy import stats


def compare_nested_models(y, X_reduced, X_full):
    """
    Compare nested regression models using F-test.

    Parameters:
    -----------
    y : array
        Response variable
    X_reduced : array
        Design matrix for reduced model (fewer predictors)
    X_full : array
        Design matrix for full model (more predictors)

    Returns:
    --------
    dict with comparison results
    """
    n = len(y)

    # Fit both models
    beta_reduced = np.linalg.lstsq(X_reduced, y, rcond=None)[0]
    beta_full = np.linalg.lstsq(X_full, y, rcond=None)[0]

    y_pred_reduced = X_reduced @ beta_reduced
    y_pred_full = X_full @ beta_full

    # Residual sums of squares
    ss_res_reduced = np.sum((y - y_pred_reduced) ** 2)
    ss_res_full = np.sum((y - y_pred_full) ** 2)

    # Degrees of freedom
    p_reduced = X_reduced.shape[1] - 1  # subtract 1 for intercept
    p_full = X_full.shape[1] - 1

    df_reduced = n - p_reduced - 1
    df_full = n - p_full - 1

    # F-statistic
    df_diff = df_reduced - df_full  # = p_full - p_reduced
    f_stat = ((ss_res_reduced - ss_res_full) / df_diff) / (
        ss_res_full / df_full
    )

    p_value = stats.f.sf(f_stat, df_diff, df_full)

    return {
        "ss_reduced": ss_res_reduced,
        "ss_full": ss_res_full,
        "df_diff": df_diff,
        "df_full": df_full,
        "f_stat": f_stat,
        "p_value": p_value,
    }


# Example: Is x2 a significant addition to a model with just x1?
np.random.seed(42)
n = 50
x1 = np.random.normal(0, 1, n)
x2 = np.random.normal(0, 1, n)
x3 = np.random.normal(0, 1, n)  # Noise variable
y = 3 + 2 * x1 + 1.5 * x2 + np.random.normal(0, 1, n)

# Models
X_reduced = np.column_stack([np.ones(n), x1])
X_full = np.column_stack([np.ones(n), x1, x2])
X_with_noise = np.column_stack([np.ones(n), x1, x2, x3])

result1 = compare_nested_models(y, X_reduced, X_full)
result2 = compare_nested_models(y, X_full, X_with_noise)
Out[45]:
Console
Nested Model Comparison using F-Test
=======================================================

Comparison 1: Model with x₁ vs Model with x₁ and x₂
  F(1, 47) = 79.82, p = 0.000000
  → Adding x₂ significantly improves the model

Comparison 2: Model with x₁, x₂ vs Model with x₁, x₂, and x₃ (noise)
  F(1, 46) = 0.04, p = 0.8470
  → Adding x₃ does not significantly improve the model (as expected for noise)

The F-test correctly identifies that x₂ (which has a true effect) significantly improves the model, while x₃ (pure noise) does not.

Summary: When to Use the F-Test

The F-test is appropriate for:

  • Comparing two variances: Testing whether two populations have equal variability
  • ANOVA: Comparing means across three or more groups
  • Regression overall significance: Testing whether any predictors have explanatory power
  • Nested model comparison: Testing whether additional predictors improve a model
  • Testing multiple coefficients simultaneously: Testing whether a subset of regression coefficients are all zero

The F-test is sensitive to violations of normality, particularly in small samples. When normality is questionable, consider nonparametric alternatives such as:

  • Levene's test or Brown-Forsythe test for comparing variances (more robust than F-test)
  • Kruskal-Wallis test as a nonparametric alternative to one-way ANOVA
  • Bootstrap methods for inference on variance ratios

Type I and Type II Errors

Hypothesis testing involves making decisions under uncertainty, which inevitably leads to the possibility of errors. Every time you draw a conclusion from data, you might be wrong. What separates statistical inference from mere guessing is that we can quantify how often we will be wrong and control these error rates through careful study design.

Understanding these errors, their probabilities, and the tradeoffs between them is essential for designing studies and interpreting results. The two error types represent fundamentally different kinds of mistakes with potentially very different consequences.

α: False Positive Rate

A Type I error occurs when you reject a true null hypothesis. You conclude that an effect exists when, in reality, nothing is going on. This is a false positive: a false alarm that claims discovery where there is none.

Consider some examples of Type I errors in practice:

  • Medical testing: Telling a healthy patient they have a disease (when they don't)
  • Quality control: Stopping a manufacturing line that is actually operating correctly
  • A/B testing: Deploying a new feature that doesn't actually improve metrics
  • Scientific research: Publishing a "discovery" that is just random noise

The probability of a Type I error is denoted \alpha and equals the significance level you choose for the test. When you set \alpha = 0.05, you accept a 5% chance of falsely rejecting the null hypothesis when it is true. This is the maximum false positive rate you tolerate.

The key insight is that the significance level is under your direct control. You choose \alpha before conducting the test based on the consequences of false positives in your specific context. Contexts where false positives are costly (such as approving an ineffective drug with side effects or claiming a scientific discovery that cannot be replicated) warrant lower significance levels. Particle physics famously uses a 5-sigma threshold (approximately \alpha = 0.0000003) before claiming a discovery, reflecting the high stakes of false claims.
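
The meaning of \alpha as a long-run false positive rate can be seen directly by simulation. The following sketch repeatedly samples from a population where the null hypothesis is exactly true and records how often a t-test rejects (the population parameters here are arbitrary):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, n_sims, n = 0.05, 10_000, 30

rejections = 0
for _ in range(n_sims):
    # H0 is true by construction: the population mean really is 100
    sample = rng.normal(loc=100, scale=15, size=n)
    _, p = stats.ttest_1samp(sample, popmean=100)
    rejections += p < alpha

print(rejections / n_sims)  # hovers around 0.05, the chosen significance level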

β: False Negative Rate

A Type II error occurs when you fail to reject a false null hypothesis. A real effect exists, but your test fails to detect it. This is a false negative: missing a genuine discovery.

Consider some examples of Type II errors in practice:

  • Medical testing: Telling a sick patient they are healthy (missing a diagnosis)
  • Quality control: Continuing production when the machine is actually malfunctioning
  • A/B testing: Abandoning a feature that would actually improve metrics
  • Scientific research: Missing a real phenomenon because your study was too small

The probability of a Type II error is denoted \beta. Unlike \alpha, which you choose directly, \beta depends on several factors:

  • The true effect size: Larger effects are easier to detect. A drug that cuts mortality in half is easier to detect than one that reduces it by 2%.
  • The sample size: Larger samples provide more power to detect effects. With more data, genuine patterns emerge from the noise.
  • The significance level: Lower \alpha means higher \beta, all else equal. Being more stringent about false positives makes you more likely to miss true positives.
  • The population variance: Less variable populations yield more precise estimates. If individual responses are highly variable, it's harder to detect systematic differences.

You do not directly set \beta, but you can influence it through study design, particularly by choosing an adequate sample size. This is why power analysis is a critical part of study planning.

Power: The Probability of Detecting Real Effects

Statistical power is defined as 1 - \beta: the probability of correctly rejecting the null hypothesis when it is false. Power represents the sensitivity of your test to detect real effects that actually exist.

Think of power as the "detection rate" of your study. If a genuine effect exists, power tells you the probability that your study will find it. A study with 80% power has an 80% chance of detecting a true effect and a 20% chance of missing it.

High power is desirable because it means you are likely to find effects that genuinely exist. Low power means you might miss important effects, leading to inconclusive studies and wasted resources. Underpowered studies are a significant problem in research: they fail to detect real effects, contribute to publication bias (because null results are often not published), and waste participants' time and researchers' effort.

The conventional target is 80% power, meaning you have an 80% chance of detecting an effect if it truly exists. Some fields use 90% power for greater assurance. Below 50% power, your study is more likely to miss a true effect than to find it, essentially worse than a coin flip for detecting reality.
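
For a one-sample test with known \sigma, power can be computed in closed form from the normal distribution. A minimal sketch (the normal approximation slightly overstates the power of the corresponding t-test):

import numpy as np
from scipy import stats


def power_one_sample_z(d, n, alpha=0.05):
    """Power of a two-sided one-sample z-test for standardized effect size d."""
    z_crit = stats.norm.ppf(1 - alpha / 2)
    shift = d * np.sqrt(n)
    # Probability the test statistic lands in either rejection region
    return stats.norm.cdf(-z_crit + shift) + stats.norm.cdf(-z_crit - shift)


print(power_one_sample_z(d=0.5, n=32))  # about 0.81, roughly the conventional 80% target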

Out[46]:
Visualization
Line plot showing power increasing with effect size for different sample sizes.
Power curves showing the probability of rejecting the null hypothesis as a function of the true effect size. Each curve represents a different sample size. Larger samples provide more power to detect effects of any given size. The dashed line at 0.80 represents the conventional target for adequate power. Note that even small effects become detectable with sufficiently large samples.

The Tradeoff Between Error Types

There is a fundamental tradeoff between Type I and Type II errors. For a fixed sample size, decreasing \alpha (being more conservative about false positives) necessarily increases \beta (making false negatives more likely). You cannot simultaneously minimize both error types without increasing your sample size.

Out[47]:
Visualization
Two overlapping normal distributions showing the relationship between alpha, beta, and power in hypothesis testing.
The tradeoff between Type I and Type II errors visualized through overlapping distributions. The left distribution (blue) shows the sampling distribution under the null hypothesis. The right distribution (orange) shows the distribution under the alternative hypothesis. The critical value (vertical dashed line) determines the rejection region. Moving it right reduces Type I error (α, blue shaded area) but increases Type II error (β, orange shaded area). Power equals 1 - β.

This tradeoff means you must weigh the relative costs of each error type for your specific application:

  • In criminal trials, the presumption of innocence prioritizes avoiding Type I errors (convicting an innocent person) even at the cost of higher Type II errors (acquitting a guilty person).
  • In medical screening for a deadly disease, you might accept more false positives (Type I errors requiring follow-up testing) to minimize false negatives (missing cases of the disease).
  • In exploratory research, you might use a liberal significance level to avoid missing potentially important leads, accepting that some will not replicate.

The only way to reduce both error types simultaneously is to increase the sample size, which narrows the sampling distribution and improves your ability to distinguish between the null and alternative hypotheses.

Sample Size and Minimum Detectable Effect

Before conducting a study, you should determine how large a sample you need. Power analysis formalizes this, connecting sample size to your ability to detect effects of a given magnitude.

How n, Variance, α, Power, and Effect Size Connect

The power of a test depends on five interrelated quantities:

  1. Sample size (n): Larger samples give more power.
  2. Effect size: Larger true effects are easier to detect.
  3. Population variance (\sigma^2): Less variability gives more power.
  4. Significance level (α): Lower α reduces power (more conservative).
  5. Power (1 - β): The target probability of detecting a true effect.

Given any four of these, you can calculate the fifth. The most common application is solving for sample size given the other quantities:

n = \left( \frac{z_{1-\alpha/2} + z_{1-\beta}}{\delta / \sigma} \right)^2

where:

  • n: the required sample size
  • z_{1-\alpha/2}: the critical value from the standard normal distribution for significance level \alpha (e.g., 1.96 for \alpha = 0.05, two-sided)
  • z_{1-\beta}: the z-value corresponding to the desired power (e.g., 0.84 for 80% power)
  • \delta: the minimum effect size you want to detect (in the original units)
  • \sigma: the population standard deviation
  • \delta / \sigma: Cohen's d, the standardized effect size

This formula reveals the key tradeoffs in study design: detecting smaller effects (\delta) requires larger samples, as does achieving higher power (1 - \beta) or using more stringent significance levels (smaller \alpha). Greater population variability (\sigma) also increases sample size requirements.

In[48]:
Code
from scipy import stats
import numpy as np


def required_sample_size(effect_size, alpha=0.05, power=0.80, sigma=1):
    """
    Calculate required sample size for a one-sample t-test.
    effect_size: the minimum detectable effect (raw units)
    alpha: significance level
    power: desired power (1 - beta)
    sigma: population standard deviation
    """
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)

    # Cohen's d
    d = effect_size / sigma

    # Sample size formula
    n = ((z_alpha + z_beta) / d) ** 2

    return int(np.ceil(n))


# Example: Detect an effect of 0.5 standard deviations
n_needed = required_sample_size(
    effect_size=0.5, alpha=0.05, power=0.80, sigma=1
)
Out[49]:
Console
To detect an effect of d = 0.5 with 80% power at α = 0.05:
Required sample size: n = 32
Out[50]:
Visualization
Heatmap showing required sample size for different effect sizes and power levels.
Required sample size as a function of effect size (Cohen's d) and desired power. Smaller effect sizes and higher power requirements demand dramatically larger samples. The white contour lines show specific sample size thresholds. Note how sample requirements escalate rapidly in the lower-left region where small effects meet high power demands.

Minimum Detectable Effect (MDE)

The minimum detectable effect (MDE) is the smallest effect size that your study can reliably detect. Given your sample size, variance, significance level, and desired power, the MDE tells you the threshold below which effects will likely go undetected.

MDE is particularly important for A/B testing and experimental design. If your business requires detecting a 2% improvement in conversion rate, you need to ensure your sample size is sufficient to achieve adequate power for that effect size. If your MDE is 5%, you might have high power but would miss smaller improvements that could still be valuable.
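
Rearranging the sample size formula gives the MDE directly. Here is a sketch (the conversion-rate numbers below are assumptions chosen for illustration):

import numpy as np
from scipy import stats


def minimum_detectable_effect(n, sigma, alpha=0.05, power=0.80):
    """Smallest raw effect detectable with the given n, variability, alpha, and power."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    return (z_alpha + z_beta) * sigma / np.sqrt(n)


# Example: with 2,000 users and an assumed per-user standard deviation of 0.4,
# the smallest reliably detectable lift is about 0.025 (2.5 percentage points)
print(minimum_detectable_effect(n=2000, sigma=0.4))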

Underpowered Studies and "Significance Chasing"

Underpowered studies are those with too few observations to reliably detect effects of the size expected or desired. They have several problematic consequences:

High false negative rates: Real effects are missed because the study lacks sensitivity to detect them.

Inflated effect size estimates: When an underpowered study does achieve statistical significance, the effect size estimate is often inflated. This is because only the studies where random variation pushed the estimate above the significance threshold get published, a phenomenon called the "winner's curse."

Significance chasing: Researchers with underpowered studies may engage in questionable practices to achieve significance: running multiple analyses and reporting only those that "work," adding more observations until significance is reached, or dropping "outliers" selectively. These practices inflate false positive rates and produce unreliable findings.

Planning adequate sample sizes before data collection helps avoid these problems. If a properly powered study is infeasible due to resource constraints, it may be better not to run the study at all than to produce unreliable results.

Effect Sizes: Practical Significance

Statistical significance tells you whether an effect is distinguishable from zero with a specified error rate. It does not tell you whether the effect is large enough to matter in practice. Effect sizes fill this gap by quantifying the magnitude of effects.

Mean Difference and Standardized Difference

The simplest effect size for comparing means is the raw mean difference: \bar{x}_1 - \bar{x}_2. This is directly interpretable in the original units of measurement. If a treatment improves response time by 50 milliseconds, that number has immediate practical meaning.

However, raw differences are hard to compare across studies with different scales. Cohen's d standardizes the difference by dividing by the pooled standard deviation:

d = \frac{\bar{x}_1 - \bar{x}_2}{s_p}

where:

  • d: Cohen's d, the standardized effect size
  • \bar{x}_1 - \bar{x}_2: the difference between group means
  • s_p: the pooled standard deviation, calculated as \sqrt{\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}}

Cohen's d expresses the difference in standard deviation units, making it comparable across different measures and studies. A value of d = 0.5 means the groups differ by half a standard deviation, regardless of whether the original scale measures test scores, reaction times, or blood pressure.

Cohen's conventional benchmarks for interpreting d are:

  • Small effect: d ≈ 0.2
  • Medium effect: d ≈ 0.5
  • Large effect: d ≈ 0.8
Out[51]:
Visualization
Two nearly overlapping normal distributions showing small effect size d=0.2.
Small effect (d = 0.2): The two distributions overlap substantially. About 85% overlap means most individuals in one group fall within the range of the other group. Differences are subtle and often hard to detect without large samples.
Two moderately separated normal distributions showing medium effect size d=0.5.
Medium effect (d = 0.5): Moderate separation between distributions. About 67% overlap means the average person in the higher group exceeds about 69% of the lower group. Differences are noticeable but groups still share considerable range.
Two clearly separated normal distributions showing large effect size d=0.8.
Large effect (d = 0.8): Clear separation between distributions. About 53% overlap means the average person in the higher group exceeds about 79% of the lower group. Differences are substantial and usually obvious.

These are rough guidelines, not rigid rules. What counts as a "large" effect depends on the context. A drug that improves survival by d = 0.2 might be clinically important if the alternative is death. A teaching intervention with d = 0.8 might not matter if the cost of implementation is prohibitive.

In[52]:
Code
import numpy as np


def cohens_d(group1, group2):
    """Calculate Cohen's d for two independent groups."""
    n1, n2 = len(group1), len(group2)
    var1, var2 = np.var(group1, ddof=1), np.var(group2, ddof=1)

    # Pooled standard deviation
    pooled_std = np.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))

    # Cohen's d
    d = (np.mean(group1) - np.mean(group2)) / pooled_std
    return d


# Example data
control = [78, 82, 85, 79, 81, 84, 77, 83, 80, 82]
treatment = [85, 89, 92, 88, 90, 93, 86, 91, 88, 90]

d = cohens_d(treatment, control)
Out[53]:
Console
Control group: mean = 81.1, std = 2.60
Treatment group: mean = 89.2, std = 2.53
Raw mean difference: 8.1
Cohen's d: 3.16

The treatment group shows a very large effect (d = 3.16), meaning the treatment mean is more than three standard deviations above the control mean.

Why Tiny P-values Can Mean Tiny Effects

With large enough samples, even trivial effects produce tiny p-values. This is because the standard error decreases as 1/\sqrt{n}, so the test statistic grows larger for any fixed effect size.

Consider testing whether the average height of a population differs from 170 cm, where individual heights have a standard deviation of roughly 7 cm. With a sample of a few hundred thousand people, an observed mean of 170.1 cm would yield a highly significant result (p < 0.001) despite representing a difference of only 0.1 cm, which is practically meaningless.

This is why effect sizes should always accompany p-values. The p-value tells you that an effect exists (with some confidence); the effect size tells you whether that effect is large enough to care about. Both pieces of information are necessary for sound interpretation.
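
The height example can be made concrete with a few lines of arithmetic, assuming a population standard deviation of about 7 cm:

import numpy as np
from scipy import stats

diff, sigma = 0.1, 7.0  # fixed 0.1 cm deviation from 170 cm, assumed sd of 7 cm
for n in [10_000, 100_000, 1_000_000]:
    z = diff / (sigma / np.sqrt(n))
    p = 2 * stats.norm.sf(z)
    print(n, round(z, 2), p)
# The same 0.1 cm effect goes from unremarkable (p around 0.15 at n = 10,000) to
# overwhelmingly "significant" purely because the standard error shrinks as 1/sqrt(n)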

Out[54]:
Visualization
Line plot showing p-value decreasing as sample size increases for a fixed small effect.
The relationship between p-value and sample size for a fixed effect size of 0.2. As sample size increases, the p-value decreases dramatically. What appears 'not significant' with n = 20 becomes highly significant with n = 500, even though the effect size remains constant. This illustrates why p-values alone are insufficient for assessing practical importance.

Multiple Comparisons

When you conduct multiple hypothesis tests, the probability of at least one false positive increases dramatically. If you perform 20 independent tests at \alpha = 0.05, the probability of at least one Type I error is 1 - (1 - 0.05)^{20} \approx 0.64, not 0.05. This is the multiple comparisons problem.
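
The inflation is easy to compute for any number of independent tests (a quick calculation that reproduces the figures quoted in this section):

alpha = 0.05
for m in [1, 5, 14, 20, 100]:
    fwer = 1 - (1 - alpha) ** m
    print(m, round(fwer, 3))
# 14 tests already push the family-wise error rate past 0.5; 100 tests give about 0.994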

Out[55]:
Visualization
Line plot showing family-wise error rate increasing with number of tests for different alpha levels.
The multiple comparisons problem: probability of at least one false positive increases rapidly with the number of tests. At α = 0.05, conducting just 14 tests gives you more than a 50% chance of at least one false positive. By 100 tests, a false positive is almost certain (99.4%). This is why multiple comparison corrections are essential when conducting many tests.

Family-Wise Error Rate vs False Discovery Rate

There are two main ways to conceptualize the error rate when conducting multiple tests:

The family-wise error rate (FWER) is the probability of making at least one Type I error among all tests. Controlling the FWER at 0.05 means you have at most a 5% chance of any false positives. This is a conservative approach appropriate when even a single false positive has serious consequences.

The false discovery rate (FDR) is the expected proportion of false positives among the rejected hypotheses. Controlling the FDR at 0.05 means that among the tests you call "significant," about 5% are expected to be false positives. This is a less conservative approach appropriate for exploratory analyses where some false positives are acceptable as long as most discoveries are real.

Correction Methods

Several methods exist for controlling error rates in multiple testing:

Bonferroni correction is the simplest approach to controlling FWER. It adjusts the significance level by dividing $\alpha$ by the number of tests: if you conduct $m$ tests and want FWER = 0.05, use $\alpha_{adj} = 0.05/m$ for each individual test. Equivalently, multiply each p-value by $m$ and compare to the original $\alpha$.

Bonferroni is conservative, especially when tests are correlated. With many tests, the adjusted significance level becomes very stringent, potentially missing real effects.

Holm's method is a step-down procedure that is uniformly more powerful than Bonferroni while still controlling FWER. You order p-values from smallest to largest, then compare each to an adjusted threshold that becomes less stringent as you proceed.
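A minimal sketch of Holm's procedure (the function name and interface here are illustrative, not taken from any particular library) might look like this:

```python
import numpy as np


def holm_correction(p_values, alpha=0.05):
    """Holm step-down procedure for controlling the family-wise error rate."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)              # indices of p-values, smallest first
    rejected = np.zeros(m, dtype=bool)
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is compared to alpha / (m - k + 1)
        if p[idx] <= alpha / (m - rank):
            rejected[idx] = True
        else:
            break                      # stop at the first non-rejection
    return rejected
```

On the p-values used in the example below, Holm happens to reject the same two hypotheses as Bonferroni, though in general it rejects at least as many and often more.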

Benjamini-Hochberg (BH) procedure controls the FDR rather than the FWER. It is the most common FDR-controlling method:

  1. Order p-values from smallest to largest: $p_{(1)} \leq p_{(2)} \leq \ldots \leq p_{(m)}$
  2. Find the largest $k$ such that $p_{(k)} \leq \frac{k}{m} \cdot \alpha$
  3. Reject all hypotheses with p-values $\leq p_{(k)}$

This method is less conservative than Bonferroni and is widely used in genomics, neuroimaging, and other fields with many simultaneous tests.

In[56]:
Code
import numpy as np


def bonferroni_correction(p_values, alpha=0.05):
    """Apply Bonferroni correction."""
    m = len(p_values)
    adjusted = np.minimum(np.array(p_values) * m, 1.0)
    rejected = adjusted < alpha
    return adjusted, rejected


def benjamini_hochberg(p_values, alpha=0.05):
    """Apply Benjamini-Hochberg FDR correction."""
    m = len(p_values)
    sorted_indices = np.argsort(p_values)
    sorted_p = np.array(p_values)[sorted_indices]

    # Calculate BH threshold for each p-value
    thresholds = alpha * np.arange(1, m + 1) / m

    # Find the largest k where p(k) <= threshold(k)
    significant = sorted_p <= thresholds
    if np.any(significant):
        max_k = np.max(np.where(significant)[0])
        rejected_indices = sorted_indices[: max_k + 1]
    else:
        rejected_indices = np.array([])

    rejected = np.zeros(m, dtype=bool)
    if len(rejected_indices) > 0:
        rejected[rejected_indices] = True

    return rejected


# Example: 10 tests, some with real effects
p_values = [0.001, 0.004, 0.015, 0.025, 0.04, 0.06, 0.12, 0.35, 0.65, 0.90]

adjusted_bonf, rejected_bonf = bonferroni_correction(p_values)
rejected_bh = benjamini_hochberg(p_values)

print(f"Original p-values: {[f'{p:.3f}' for p in p_values]}")
print("\nBonferroni correction (FWER = 0.05):")
print(f"  Adjusted p-values: {[f'{p:.3f}' for p in adjusted_bonf]}")
print(f"  Rejected: {list(rejected_bonf)}")
print(f"  Number of rejections: {rejected_bonf.sum()}")
print("\nBenjamini-Hochberg (FDR = 0.05):")
print(f"  Rejected: {list(rejected_bh)}")
print(f"  Number of rejections: {rejected_bh.sum()}")
Out[57]:
Console
Original p-values: ['0.001', '0.004', '0.015', '0.025', '0.040', '0.060', '0.120', '0.350', '0.650', '0.900']

Bonferroni correction (FWER = 0.05):
  Adjusted p-values: ['0.010', '0.040', '0.150', '0.250', '0.400', '0.600', '1.000', '1.000', '1.000', '1.000']
  Rejected: [np.True_, np.True_, np.False_, np.False_, np.False_, np.False_, np.False_, np.False_, np.False_, np.False_]
  Number of rejections: 2

Benjamini-Hochberg (FDR = 0.05):
  Rejected: [np.True_, np.True_, np.True_, np.False_, np.False_, np.False_, np.False_, np.False_, np.False_, np.False_]
  Number of rejections: 3

With 10 tests, Bonferroni requires p < 0.005 for significance, rejecting only 2 hypotheses. Benjamini-Hochberg, controlling FDR instead of FWER, rejects 3. The choice between methods depends on whether you prioritize avoiding any false positives (Bonferroni/Holm) or maximizing discoveries while controlling their expected proportion (BH).

Practical Reporting and Interpretation

Good statistical practice involves more than calculating test statistics and p-values. Clear, complete reporting enables readers to evaluate your conclusions and, if necessary, combine your results with other evidence.

What to Report

A complete statistical report includes:

  • The estimate: The sample statistic (mean, difference, correlation) that estimates the population parameter.
  • A confidence interval: The range of plausible values for the parameter, conveying uncertainty.
  • The p-value: The exact p-value (not just "p < 0.05"), enabling readers to assess evidence strength.
  • The test used: Specify whether you used a t-test, z-test, Welch's test, etc.
  • Key assumptions: State whether you assessed normality, equal variances, independence, and what you found.
  • Effect size: Report standardized effect sizes when appropriate for comparability.

For example, a complete report might read: "The treatment group (M = 89.2, SD = 2.7) scored significantly higher than the control group (M = 81.1, SD = 2.8), t(18) = 6.47, p < 0.001, 95% CI for the difference [5.5, 10.7], Cohen's d = 2.89. The assumption of equal variances was supported by Levene's test (p = 0.82)."
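The sketch below (an illustrative helper, assuming two independent samples for which a pooled t-test is appropriate) shows how these reporting elements can be computed together rather than assembled by hand:

```python
import numpy as np
from scipy import stats


def report_two_sample(group1, group2, conf=0.95):
    """Assemble the key reporting quantities for a pooled two-sample t-test."""
    n1, n2 = len(group1), len(group2)
    diff = np.mean(group1) - np.mean(group2)

    # Test statistic and exact p-value
    t_stat, p_value = stats.ttest_ind(group1, group2, equal_var=True)

    # Pooled SD, standard error, and confidence interval for the difference
    sp = np.sqrt(((n1 - 1) * np.var(group1, ddof=1) +
                  (n2 - 1) * np.var(group2, ddof=1)) / (n1 + n2 - 2))
    se = sp * np.sqrt(1 / n1 + 1 / n2)
    t_crit = stats.t.ppf(0.5 + conf / 2, df=n1 + n2 - 2)
    ci = (diff - t_crit * se, diff + t_crit * se)

    # Effect size and an assumption check
    d = diff / sp
    _, p_levene = stats.levene(group1, group2)

    print(f"t({n1 + n2 - 2}) = {t_stat:.2f}, p = {p_value:.3g}")
    print(f"Difference = {diff:.1f}, {conf:.0%} CI [{ci[0]:.1f}, {ci[1]:.1f}]")
    print(f"Cohen's d = {d:.2f}, Levene's test p = {p_levene:.2f}")
```

Calling this on the treatment and control samples from the Cohen's d example earlier would print the test statistic with its degrees of freedom, the interval for the difference, the standardized effect size, and the variance check in one place.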

Avoiding Misleading Language

Avoid language that overstates conclusions:

  • Don't say "The treatment works" or "The groups are different." Say "The data provide evidence that..." or "We observed a difference in..."
  • Don't equate "not significant" with "no effect." Absence of evidence is not evidence of absence, especially in underpowered studies.
  • Don't use p-values as measures of effect size. A highly significant result (p = 0.0001) is not necessarily a large effect.
  • Don't interpret non-significant results as "confirming the null hypothesis." You failed to reject it, which is different from proving it true.

Statistical significance is a technical term with a precise meaning. In everyday language, "significant" implies "important" or "meaningful." Statistically significant results may be neither. Use precise language that distinguishes between statistical and practical significance, and always consider whether your statistically significant findings are large enough to matter in context.

Summary

Hypothesis testing provides a rigorous framework for making decisions about populations based on sample data. The p-value, properly understood, quantifies how surprising your data would be if the null hypothesis were true. It is not the probability that the null hypothesis is true, nor does it measure effect size. The conventional threshold of 0.05 is arbitrary, and results should be interpreted on a continuum of evidence.

Setting up a hypothesis test requires specifying null and alternative hypotheses, choosing between one-sided and two-sided tests based on the research question, and understanding how test statistics relate to sampling distributions. Confidence intervals and hypothesis tests are mathematically equivalent, with confidence intervals providing the added benefit of showing the range of parameter values consistent with your data.

The choice between z-tests and t-tests depends on whether the population standard deviation is known. In practice, use t-tests unless you have genuine prior knowledge of $\sigma$. Welch's t-test is the safer default for comparing two groups, as it maintains validity when variances are unequal without substantial loss of power when they are equal.

Type I and Type II errors represent the two ways hypothesis tests can go wrong. The tradeoff between them means you cannot minimize both without increasing sample size. Power analysis helps you plan studies with adequate sensitivity to detect effects of meaningful magnitude. Effect sizes complement p-values by quantifying the magnitude of effects, preventing the conflation of statistical and practical significance.

When conducting multiple tests, use appropriate corrections to control error rates. The choice between FWER-controlling methods like Bonferroni and FDR-controlling methods like Benjamini-Hochberg depends on the relative costs of false positives and false negatives in your application.

Finally, good reporting practice requires complete information: estimates, confidence intervals, exact p-values, test specifications, assumption checks, and effect sizes. This transparency enables readers to evaluate your conclusions and builds the foundation for cumulative scientific knowledge.

Key Parameters

This section summarizes the critical parameters for hypothesis testing and how to select appropriate values for your analyses.

Significance Level (α)

The probability threshold for rejecting the null hypothesis.

  • Default: 0.05 (5% false positive rate)
  • More stringent: 0.01 or 0.001 for high-stakes decisions (drug approval, major scientific claims)
  • More lenient: 0.10 for exploratory analyses where missing effects is costly
  • Guidance: Choose before seeing data based on consequences of Type I errors

Power (1 - β)

The probability of detecting a true effect when it exists.

  • Standard target: 0.80 (80% chance of detecting a real effect)
  • High-powered studies: 0.90 or 0.95 for confirmatory research
  • Guidance: Higher power requires larger samples; balance against practical constraints

Effect Size Thresholds (Cohen's d)

Standardized measure of the difference between groups.

  • Small: d ≈ 0.2 (subtle effects, may require large samples to detect)
  • Medium: d ≈ 0.5 (noticeable effects, typical in social sciences)
  • Large: d ≈ 0.8 (substantial effects, easily detected with modest samples)
  • Guidance: Define the minimum practically meaningful effect for your domain; the sketch below shows roughly how many participants each of these thresholds requires
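To make the link between α, power, and effect size concrete, here is a minimal sketch of approximate per-group sample sizes for a two-sided, two-sample comparison, using the normal approximation (the exact t-based answer is slightly larger):

```python
from scipy import stats


def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sided two-sample test (normal approximation)."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)  # critical value for the two-sided test
    z_beta = stats.norm.ppf(power)           # quantile corresponding to the target power
    return 2 * (z_alpha + z_beta) ** 2 / d ** 2


for d in (0.2, 0.5, 0.8):
    print(f"d = {d}: roughly {n_per_group(d):.0f} participants per group for 80% power")
```

A small effect requires roughly 390 participants per group while a large effect needs only about 25, which is why defining the minimum meaningful effect up front is the single most important input to a power calculation.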

Test Selection Parameters

Test selection guide based on research design and data characteristics.
Scenario | σ Known? | Groups | Recommendation
One sample, population σ known | Yes | 1 | Z-test
One sample, population σ unknown | No | 1 | One-sample t-test
Two independent samples, equal variances | No | 2 | Pooled t-test (or default to Welch's)
Two independent samples, unequal/unknown variances | No | 2 | Welch's t-test
Paired/matched observations | No | 2 (paired) | Paired t-test
Comparing variances | N/A | 2 | F-test (or Levene's for robustness)
Multiple groups | No | ≥3 | One-way ANOVA

Multiple Comparison Corrections

When conducting multiple tests, choose a correction method:

  • Bonferroni: Simple, conservative; use when any false positive is costly
  • Holm's method: Step-down procedure, more powerful than Bonferroni while controlling FWER
  • Benjamini-Hochberg: Controls FDR; use for exploratory analyses with many tests

scipy.stats Functions

Common scipy.stats functions for hypothesis testing with their key parameters.
Test Type | Function | Key Parameters
One-sample t-test | ttest_1samp(a, popmean) | a: sample data, popmean: hypothesized mean
Two-sample t-test | ttest_ind(a, b, equal_var) | equal_var=False for Welch's t-test
Paired t-test | ttest_rel(a, b) | Paired samples as two arrays
One-way ANOVA | f_oneway(*groups) | Pass each group as separate argument
Levene's test | levene(*groups) | Tests equality of variances
Shapiro-Wilk | shapiro(x) | Tests normality assumption
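A short usage sketch tying several of these calls together (the data here are simulated purely for illustration):

```python
import numpy as np
from scipy import stats

# Simulated data for illustration only
rng = np.random.default_rng(0)
a = rng.normal(100, 15, size=30)
b = rng.normal(108, 20, size=25)
c = rng.normal(104, 15, size=28)

# Welch's t-test for two groups with possibly unequal variances
t_stat, p_welch = stats.ttest_ind(a, b, equal_var=False)

# One-way ANOVA across all three groups
f_stat, p_anova = stats.f_oneway(a, b, c)

# Assumption checks: equality of variances and normality
_, p_levene = stats.levene(a, b, c)
_, p_shapiro = stats.shapiro(a)

print(f"Welch's t-test: t = {t_stat:.2f}, p = {p_welch:.4f}")
print(f"One-way ANOVA:  F = {f_stat:.2f}, p = {p_anova:.4f}")
print(f"Levene's test:  p = {p_levene:.4f}")
print(f"Shapiro-Wilk (group A): p = {p_shapiro:.4f}")
```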

