The T-Test: One-Sample, Two-Sample (Pooled & Welch), Paired Tests & Decision Framework

Michael Brenndoerfer · January 3, 2026 · 30 min read

Complete guide to t-tests including one-sample, two-sample (pooled and Welch), paired tests, assumptions, and decision framework. Learn when to use each variant and how to check assumptions.


The T-Test

In 1908, a statistician named William Sealy Gosset faced a practical problem at the Guinness Brewery in Dublin. He needed to compare the quality of barley batches using small samples, but the statistical methods of his time assumed you knew the population variance, which he didn't. Gosset solved this problem by deriving a new distribution that accounts for the uncertainty of estimating variance from the sample itself. Because Guinness prohibited employees from publishing scientific papers (fearing competitors would realize the advantage of employing statisticians), Gosset published under the pseudonym "Student." The "Student's t-distribution" and the t-test it enables have since become the workhorse of statistical inference.

The t-test is probably the most frequently used statistical test in science, medicine, and industry. Whenever someone asks "Is this average different from that target?" or "Do these two groups differ?", a t-test is typically the answer. This chapter covers the t-test in depth: why it exists, how it works mathematically, and when to use each variant. After mastering t-tests, you'll be ready to explore F-tests for comparing variances, ANOVA for comparing multiple groups, and the broader topics of statistical power and effect sizes.

Why Do We Need the T-Test?

In the previous chapter on z-tests, we learned to compare means when the population standard deviation $\sigma$ is known. The test statistic was:

$$z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$$

This follows a standard normal distribution because the denominator $\sigma / \sqrt{n}$ is a fixed, known quantity. Only the numerator varies from sample to sample.

But here's the problem: you almost never know $\sigma$. In practice, you estimate the standard deviation from your sample data, giving you the sample standard deviation $s$. Early researchers simply substituted $s$ for $\sigma$ and continued using normal distribution critical values. This seemed reasonable, but it led to systematic errors. They rejected the null hypothesis more often than they should have, especially with small samples.

Gosset discovered why: when you estimate $\sigma$ from the same sample you're testing, you introduce additional variability into the test statistic. The sample standard deviation $s$ is itself a random variable that fluctuates from sample to sample. Sometimes $s$ underestimates $\sigma$, making your test statistic too large. Sometimes $s$ overestimates $\sigma$, making it too small. This extra variability means the test statistic no longer follows a normal distribution. It follows a distribution with heavier tails, one that appropriately penalizes us for not knowing the true population variance.

The Student's T-Distribution

Mathematical Foundation

When we replace $\sigma$ with $s$, the test statistic becomes:

$$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$$

This ratio follows a t-distribution with $n - 1$ degrees of freedom (df). The degrees of freedom represent the amount of independent information available to estimate the variance. When computing $s$ from $n$ observations, we use up one degree of freedom estimating the mean, leaving $n - 1$ degrees of freedom for estimating variance.

The t-distribution has several key properties:

  1. Bell-shaped and symmetric: Like the normal distribution, centered at zero under the null hypothesis
  2. Heavier tails: More probability in the extreme values than the normal distribution
  3. Parameterized by degrees of freedom: The shape depends on df, with smaller df meaning heavier tails
  4. Converges to normal: As $df \to \infty$, the t-distribution approaches the standard normal

The heavier tails are the crucial difference. They reflect the uncertainty added by estimating variance: with small samples, $s$ can deviate substantially from $\sigma$, so extreme t-values become more probable. The t-distribution accounts for this by requiring larger critical values to reject the null hypothesis.
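
To make the heavier tails concrete, here is a minimal sketch (the cutoff of 2 is an illustrative choice) computing the probability that a t-statistic exceeds 2 in absolute value for several degrees of freedom:

```python
from scipy import stats

# Tail probability P(|T| > 2) shrinks toward the normal value as df grows
for df in [2, 5, 10, 30, 100]:
    print(f"df = {df:>3}: P(|T| > 2) = {2 * stats.t.sf(2.0, df=df):.4f}")
print(f"normal:   P(|Z| > 2) = {2 * stats.norm.sf(2.0):.4f}")
```

With 2 degrees of freedom, values beyond 2 are roughly four times as likely as under the normal distribution; by df = 100, the two probabilities are nearly identical.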

Why Heavier Tails Make Sense

Consider what happens when $s$ happens to underestimate $\sigma$:

  • The denominator $s/\sqrt{n}$ is too small
  • The t-statistic is inflated
  • We might incorrectly reject $H_0$

When $s$ overestimates $\sigma$:

  • The denominator is too large
  • The t-statistic is deflated
  • We might miss a real effect

These errors don't favor either direction on average, but they add variability. The t-distribution captures exactly this additional spread. With only 5 observations (df = 4), the sample standard deviation is quite unreliable, so the t-distribution has very heavy tails. With 100 observations (df = 99), $s$ is a much more reliable estimate of $\sigma$, and the t-distribution is nearly identical to the normal.

Out[2]:
Visualization
Comparison of t-distributions with different degrees of freedom to the standard normal distribution.
The t-distribution compared to the standard normal. With few degrees of freedom, the t-distribution has substantially heavier tails, meaning extreme values are more likely. As degrees of freedom increase, the t-distribution converges to the normal distribution.

Critical Values: The Practical Impact

The heavier tails translate directly into larger critical values. For a two-sided test at $\alpha = 0.05$:

| Degrees of Freedom | t-critical | z-critical |
|---|---|---|
| 5 | 2.571 | 1.960 |
| 10 | 2.228 | 1.960 |
| 20 | 2.086 | 1.960 |
| 30 | 2.042 | 1.960 |
| 100 | 1.984 | 1.960 |
| $\infty$ | 1.960 | 1.960 |

With df = 5, you need a t-statistic 31% larger than the z-critical value to reject $H_0$. This penalty appropriately reflects the unreliability of variance estimates from small samples. By df = 30, the difference is only about 4%, which is why older textbooks sometimes said "use z for n > 30." Modern practice uses the t-distribution regardless, as computational tools make it trivially easy.
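
The table is easy to reproduce. A short sketch using stats.t.ppf (the inverse CDF) looks up the same two-sided critical values:

```python
from scipy import stats

# Reproduce the critical-value table: two-sided test at alpha = 0.05
alpha = 0.05
for df in [5, 10, 20, 30, 100]:
    print(f"df = {df:>3}: t-critical = {stats.t.ppf(1 - alpha / 2, df=df):.3f}")
print(f"z-critical: {stats.norm.ppf(1 - alpha / 2):.3f}")
```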

Out[3]:
Visualization
Plot showing t-critical values decreasing toward z-critical value as degrees of freedom increase.
Critical values for two-sided tests at alpha = 0.05. The t-critical values decrease toward the z-critical value (1.96) as degrees of freedom increase, reflecting the improved reliability of variance estimates with larger samples.

The One-Sample T-Test

The one-sample t-test compares a sample mean to a hypothesized population value when the population variance is unknown. This is the simplest and most direct application of the t-distribution.

The Complete Procedure

Hypotheses:

  • $H_0: \mu = \mu_0$ (population mean equals the hypothesized value)
  • $H_a: \mu \neq \mu_0$ (two-sided), or $\mu > \mu_0$ / $\mu < \mu_0$ (one-sided)

Test statistic:

$$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$$

where:

  • $\bar{x}$ = sample mean
  • $\mu_0$ = hypothesized population mean
  • $s$ = sample standard deviation (with Bessel's correction: divide by $n-1$)
  • $n$ = sample size
  • $s / \sqrt{n}$ = standard error of the mean

Distribution under $H_0$: $t \sim t_{n-1}$ (t-distribution with $n-1$ degrees of freedom)

P-value calculation:

  • Two-sided: $p = 2 \times P(T > |t|)$ where $T \sim t_{n-1}$
  • One-sided (greater): $p = P(T > t)$
  • One-sided (less): $p = P(T < t)$

Decision rule: Reject $H_0$ if $p < \alpha$, or equivalently, if $|t| > t_{\alpha/2, n-1}$ for two-sided tests.

Worked Example: Quality Control

A coffee shop claims their large drinks contain 16 ounces. A quality inspector measures 15 randomly selected drinks to verify this claim.

In[4]:
Code
import numpy as np
from scipy import stats

# Measurements from 15 randomly selected large drinks (in ounces)
drinks = [
    15.2, 15.8, 16.1, 15.4, 15.9, 15.3, 15.7, 16.0,
    15.5, 15.2, 15.8, 15.6, 15.9, 15.4, 16.2,
]

# Hypothesized mean (the claimed amount)
mu_0 = 16.0

# Step 1: Calculate sample statistics
n = len(drinks)
x_bar = np.mean(drinks)
s = np.std(drinks, ddof=1)  # ddof=1 for Bessel's correction

# Step 2: Calculate standard error
se = s / np.sqrt(n)

# Step 3: Calculate t-statistic
t_stat = (x_bar - mu_0) / se

# Step 4: Determine degrees of freedom
df = n - 1

# Step 5: Calculate p-value (two-sided)
p_value = 2 * stats.t.sf(abs(t_stat), df=df)

# Step 6: Get critical value for comparison
t_crit = stats.t.ppf(0.975, df=df)

# Step 7: Calculate 95% confidence interval
ci_lower = x_bar - t_crit * se
ci_upper = x_bar + t_crit * se

Let's walk through the calculation step by step:

Step 1: Sample Statistics

$$\bar{x} = \frac{1}{15}\sum_{i=1}^{15} x_i = \frac{15.2 + 15.8 + \cdots + 16.2}{15} = 15.667 \text{ oz}$$

$$s = \sqrt{\frac{\sum_{i=1}^{15}(x_i - \bar{x})^2}{15-1}} = 0.324 \text{ oz}$$

Step 2: Standard Error

$$SE = \frac{s}{\sqrt{n}} = \frac{0.324}{\sqrt{15}} = 0.084 \text{ oz}$$

Step 3: T-Statistic

$$t = \frac{\bar{x} - \mu_0}{SE} = \frac{15.667 - 16.0}{0.084} = \frac{-0.333}{0.084} = -3.98$$

Step 4: P-Value

With df = 14, we find the probability of observing $|t| \geq 3.98$ under $H_0$:

$$p = 2 \times P(T_{14} > 3.98) = 0.0014$$
Out[5]:
Console
One-Sample T-Test: Coffee Shop Drink Size
=======================================================

Data: 15 drink measurements (ounces)
Hypothesized mean: mu_0 = 16.0 oz

Step-by-step calculation:
-------------------------------------------------------
1. Sample mean:       x_bar = 15.667 oz
2. Sample std dev:    s = 0.324 oz
3. Standard error:    SE = s/sqrt(n) = 0.324/sqrt(15) = 0.084 oz
4. T-statistic:       t = (x_bar - mu_0)/SE = (15.667 - 16.0)/0.084 = -3.980
5. Degrees of freedom: df = n - 1 = 15 - 1 = 14
6. P-value (two-sided): p = 0.0014

Decision (alpha = 0.05):
-------------------------------------------------------
Critical value: t_crit = +/-2.145
|t| = 3.980 > 2.145 = t_crit

95% Confidence Interval: [15.487, 15.846] oz
Note: CI does not contain 16.0, consistent with rejecting H_0

Conclusion:
-------------------------------------------------------
Reject H_0. There is significant evidence (p = 0.0014) that
the true mean drink size differs from the claimed 16 oz.
The drinks appear to be underfilled by about 0.33 oz on average.
Out[6]:
Visualization
T-distribution with observed test statistic and rejection regions highlighted.
Visualization of the one-sample t-test for the coffee shop example. The t-distribution shows the sampling distribution under the null hypothesis. The observed t-statistic of -3.98 falls far into the left tail, well beyond the critical value, leading to rejection of H_0.
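
As a sanity check, scipy's built-in ttest_1samp should reproduce the step-by-step result from the same data:

```python
from scipy import stats

# Same 15 measurements as above; scipy should match the manual result
drinks = [15.2, 15.8, 16.1, 15.4, 15.9, 15.3, 15.7, 16.0,
          15.5, 15.2, 15.8, 15.6, 15.9, 15.4, 16.2]

t_stat, p_value = stats.ttest_1samp(drinks, popmean=16.0)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")  # expect t = -3.980, p = 0.0014
```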

The Two-Sample T-Test

The two-sample t-test compares means from two independent groups. This is perhaps the most common application: comparing treatment vs. control, method A vs. method B, or any two distinct populations.

A crucial question arises: do the two groups have equal variance? The answer determines which variant of the t-test you should use.

Why Variance Equality Matters

When comparing two means, we need to estimate the standard error of the difference $\bar{x}_1 - \bar{x}_2$. If both groups share the same variance $\sigma^2$, we can pool their data to get a more precise estimate. But if variances differ, pooling gives incorrect results, biasing our inference.

This leads to two variants:

  • Pooled (Student's) t-test: Assumes equal variances, pools data for efficiency
  • Welch's t-test: Makes no variance assumption, handles unequal variances correctly

Pooled Two-Sample T-Test

When we assume $\sigma_1^2 = \sigma_2^2 = \sigma^2$, we can estimate this common variance using data from both groups.

Pooled variance estimate:

$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$$

This is a weighted average of the two sample variances, with weights proportional to their degrees of freedom. Larger samples contribute more to the pooled estimate because they provide more reliable information.

Standard error of the difference:

$$SE = s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$

The term $\sqrt{1/n_1 + 1/n_2}$ accounts for the fact that uncertainty in the difference comes from uncertainty in both means.

Test statistic:

$$t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$$

Degrees of freedom: $df = n_1 + n_2 - 2$

Each group contributes $n_i - 1$ degrees of freedom for variance estimation, and we combine them.

Welch's T-Test

When variances may differ, we cannot pool them. Instead, we estimate standard errors separately and combine them using a different formula.

Standard error of the difference:

$$SE = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$

This follows from the variance of a difference: $\text{Var}(\bar{x}_1 - \bar{x}_2) = \text{Var}(\bar{x}_1) + \text{Var}(\bar{x}_2)$ for independent samples.

Test statistic:

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$

Degrees of freedom (Welch-Satterthwaite approximation):

$$df = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{(s_1^2/n_1)^2}{n_1-1} + \frac{(s_2^2/n_2)^2}{n_2-1}}$$

This formula estimates the effective degrees of freedom when variances are unequal. The result is typically non-integer, which poses no problem: the t-distribution is defined for any positive real df, and software evaluates it directly.
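
As a quick illustration, scipy accepts fractional degrees of freedom directly, for example the df of about 22.6 that also appears in the worked example below (the t-value of 2.1 here is an arbitrary illustrative choice):

```python
from scipy import stats

# The t-distribution is defined for non-integer df, so Welch's
# fractional degrees of freedom can be used as-is
p = 2 * stats.t.sf(2.1, df=22.6)
print(f"Two-sided p-value at t = 2.1 with df = 22.6: {p:.4f}")
```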

Why Welch's is the recommended default:

  • When variances are equal, Welch's test loses only slightly in power compared to the pooled test
  • When variances are unequal, the pooled test can give seriously incorrect results
  • The asymmetric risk/reward favors Welch's as the default choice

Worked Example: Comparing Teaching Methods

A researcher compares test scores from two teaching methods: Method A (traditional lectures) and Method B (active learning).

In[7]:
Code
import numpy as np
from scipy import stats

# Test scores from two teaching methods
method_a = [78, 82, 85, 79, 81, 84, 77, 83, 80, 82, 79, 86]  # n=12
method_b = [85, 89, 92, 88, 90, 93, 86, 91, 88, 90, 87, 94, 89, 91]  # n=14

# Sample statistics
n1, n2 = len(method_a), len(method_b)
x1_bar, x2_bar = np.mean(method_a), np.mean(method_b)
s1, s2 = np.std(method_a, ddof=1), np.std(method_b, ddof=1)

# ===== POOLED T-TEST =====
# Pooled variance
sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
sp = np.sqrt(sp2)

# Standard error (pooled)
se_pooled = sp * np.sqrt(1 / n1 + 1 / n2)

# Test statistic
t_pooled = (x2_bar - x1_bar) / se_pooled
df_pooled = n1 + n2 - 2

# P-value
p_pooled = 2 * stats.t.sf(abs(t_pooled), df=df_pooled)

# ===== WELCH'S T-TEST =====
# Standard error (Welch)
se_welch = np.sqrt(s1**2 / n1 + s2**2 / n2)

# Test statistic
t_welch = (x2_bar - x1_bar) / se_welch

# Degrees of freedom (Welch-Satterthwaite)
num = (s1**2 / n1 + s2**2 / n2) ** 2
denom = (s1**2 / n1) ** 2 / (n1 - 1) + (s2**2 / n2) ** 2 / (n2 - 1)
df_welch = num / denom

# P-value
p_welch = 2 * stats.t.sf(abs(t_welch), df=df_welch)

# Effect size (Cohen's d)
cohens_d = (x2_bar - x1_bar) / sp
Out[8]:
Console
Two-Sample T-Test: Comparing Teaching Methods
============================================================

Sample statistics:
------------------------------------------------------------
Method A: n = 12, mean = 81.33, std = 2.84
Method B: n = 14, mean = 89.50, std = 2.59
Difference (B - A): 8.17 points

POOLED T-TEST (assumes equal variances)
------------------------------------------------------------
Pooled std: s_p = 2.709
Standard error: SE = 1.066
t-statistic: t(24) = 7.662
p-value: 0.000000

WELCH'S T-TEST (no variance assumption)
------------------------------------------------------------
Standard error: SE = 1.074
t-statistic: t(22.6) = 7.607
p-value: 0.000000

Effect size:
------------------------------------------------------------
Cohen's d = 3.01 (very large effect)

Conclusion:
------------------------------------------------------------
Both tests show highly significant results (p < 0.001).
Method B produces substantially higher scores than Method A.
The effect size (d = 3.01) indicates a very large practical difference.
Out[9]:
Visualization
Side-by-side boxplots comparing Method A and Method B test scores.
Comparison of test scores between two teaching methods. The boxplots show the distribution of scores in each group, with individual data points overlaid. The minimal overlap between groups, combined with the large mean difference, results in a highly significant t-test.
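
Both variants are built into scipy. A quick check with the same scores should reproduce the manual results; the equal_var argument selects the variant:

```python
from scipy import stats

method_a = [78, 82, 85, 79, 81, 84, 77, 83, 80, 82, 79, 86]
method_b = [85, 89, 92, 88, 90, 93, 86, 91, 88, 90, 87, 94, 89, 91]

# equal_var=True gives the pooled test, equal_var=False gives Welch's
t_pooled, p_pooled = stats.ttest_ind(method_b, method_a, equal_var=True)
t_welch, p_welch = stats.ttest_ind(method_b, method_a, equal_var=False)

print(f"Pooled:  t = {t_pooled:.3f}, p = {p_pooled:.6f}")
print(f"Welch's: t = {t_welch:.3f}, p = {p_welch:.6f}")
```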

Checking the Equal Variance Assumption

Before choosing between pooled and Welch's tests, you might want to check whether variances are equal. Two common approaches:

1. Visual inspection: Compare the spread in boxplots or calculate the ratio of sample variances. A ratio greater than 2 suggests meaningful inequality.

2. Levene's test: A formal hypothesis test for equality of variances. Unlike older alternatives (Bartlett's test), Levene's test is robust to non-normality.

In[10]:
Code
import numpy as np
from scipy import stats

# Sample data with different variances
group1 = [45, 52, 48, 51, 49, 53, 47, 50, 46, 54]
group2 = [42, 58, 35, 62, 48, 55, 38, 65, 44, 52]

# Visual check: variance ratio
s1 = np.std(group1, ddof=1)
s2 = np.std(group2, ddof=1)
print(f"Group 1 std: {s1:.2f}")
print(f"Group 2 std: {s2:.2f}")
print(f"Variance ratio: {(s2 / s1) ** 2:.2f}")
print()

# Levene's test
stat, p_value = stats.levene(group1, group2)
print(f"Levene's test: W = {stat:.3f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Conclusion: Evidence of unequal variances. Use Welch's t-test.")
else:
    print(
        "Conclusion: No evidence of unequal variances. Either test is appropriate."
    )
Out[10]:
Console
Group 1 std: 3.03
Group 2 std: 10.19
Variance ratio: 11.33

Levene's test: W = 13.935, p = 0.0015
Conclusion: Evidence of unequal variances. Use Welch's t-test.
Practical Recommendation

Many statisticians now recommend always using Welch's t-test as the default, regardless of whether variances appear equal. The reasoning:

  1. When variances are truly equal, Welch's test loses very little power compared to the pooled test
  2. When variances are unequal, the pooled test can produce seriously misleading results
  3. Pre-testing for equal variances adds complexity and has its own issues (low power with small samples, unnecessary when samples are large)

This "Welch by default" approach is increasingly adopted in statistical software and practice.

The Paired T-Test

The paired t-test is used when observations come in matched pairs, where each pair shares some characteristic that makes them non-independent. Common scenarios include:

  • Before-after measurements: Same subjects measured twice
  • Matched case-control studies: Cases matched to controls on key characteristics
  • Twin studies: Comparing outcomes between twins
  • Left-right comparisons: Same subject, different sides

Why Pairing Matters

Consider testing whether a training program improves performance. You could:

  1. Independent design: Randomly assign people to training or control groups, compare group means
  2. Paired design: Measure each person before and after training, analyze the changes

The paired design is often more powerful because it eliminates between-subject variability. People differ enormously in baseline ability, but by looking at changes within each person, we focus only on the training effect.

The Mathematics

The paired t-test is simply a one-sample t-test on the differences:

$$d_i = x_{i,\text{after}} - x_{i,\text{before}}$$

Test statistic:

$$t = \frac{\bar{d}}{s_d / \sqrt{n}}$$

where:

  • $\bar{d}$ = mean of the paired differences
  • $s_d$ = standard deviation of the differences
  • $n$ = number of pairs
  • $df = n - 1$

Null hypothesis: $H_0: \mu_d = 0$ (no average change)
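
This equivalence is easy to verify. A minimal check on synthetic (made-up) data shows that ttest_rel and ttest_1samp on the differences agree exactly:

```python
import numpy as np
from scipy import stats

# Illustrative synthetic data: a paired t-test is identical to a
# one-sample t-test on the within-pair differences
rng = np.random.default_rng(1)
before = rng.normal(100, 15, size=12)
after = before + rng.normal(-3, 4, size=12)  # hypothetical shift of -3

paired = stats.ttest_rel(after, before)
one_sample = stats.ttest_1samp(after - before, popmean=0)

print(f"Paired:     t = {paired.statistic:.4f}, p = {paired.pvalue:.4f}")
print(f"One-sample: t = {one_sample.statistic:.4f}, p = {one_sample.pvalue:.4f}")
```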

Worked Example: Blood Pressure Medication

A study tests whether a new medication reduces blood pressure. Ten patients have their systolic blood pressure measured before starting medication and again after 4 weeks of treatment.

In[11]:
Code
import numpy as np
from scipy import stats

# Blood pressure measurements (mmHg)
before = np.array([145, 152, 148, 155, 160, 142, 158, 165, 150, 155])
after = np.array([138, 145, 140, 150, 152, 140, 148, 158, 145, 148])

# Step 1: Calculate differences (after - before)
differences = after - before
n = len(differences)

# Step 2: Calculate mean and std of differences
d_bar = np.mean(differences)
s_d = np.std(differences, ddof=1)

# Step 3: Calculate standard error
se_d = s_d / np.sqrt(n)

# Step 4: Calculate t-statistic
t_stat = d_bar / se_d
df = n - 1

# Step 5: Calculate p-value (two-sided)
p_value = 2 * stats.t.sf(abs(t_stat), df=df)

# Step 6: Confidence interval for mean change
t_crit = stats.t.ppf(0.975, df=df)
ci = (d_bar - t_crit * se_d, d_bar + t_crit * se_d)

# Verify with scipy
t_scipy, p_scipy = stats.ttest_rel(after, before)
Out[12]:
Console
Paired T-Test: Blood Pressure Medication Study
============================================================

Data (n = 10 patients):
------------------------------------------------------------
Before: [145, 152, 148, 155, 160, 142, 158, 165, 150, 155]
After:  [138, 145, 140, 150, 152, 140, 148, 158, 145, 148]
Change: [-7, -7, -8, -5, -8, -2, -10, -7, -5, -7]

Step-by-step calculation:
------------------------------------------------------------
1. Mean change:     d_bar = -6.60 mmHg
2. Std of changes:  s_d = 2.17 mmHg
3. Standard error:  SE = s_d/sqrt(n) = 2.17/sqrt(10) = 0.69
4. T-statistic:     t = d_bar/SE = -6.60/0.69 = -9.616
5. Degrees of freedom: df = n - 1 = 9
6. P-value (two-sided): 0.000005

95% CI for mean change: [-8.15, -5.05] mmHg

Interpretation:
------------------------------------------------------------
The average blood pressure decreased by 6.6 mmHg (p < 0.001).
We are 95% confident the true mean reduction is between
5.0 and 8.2 mmHg.
Out[13]:
Visualization
Paired data plot showing before and after blood pressure measurements connected by lines.
Paired t-test visualization showing blood pressure before and after medication. Each line connects measurements from the same patient. The consistent downward slope indicates most patients experienced a reduction, which the paired t-test detects with high significance.

Why Not Use an Independent T-Test?

You might wonder: what if we ignored the pairing and used an independent two-sample t-test? Let's compare:

In[14]:
Code
# Paired t-test (correct)
t_paired, p_paired = stats.ttest_rel(after, before)

# Independent t-test (ignoring pairing)
t_indep, p_indep = stats.ttest_ind(after, before)

print("Comparison of paired vs. independent analysis:")
print("-" * 50)
print(f"Paired t-test:      t = {t_paired:.3f}, p = {p_paired:.6f}")
print(f"Independent t-test: t = {t_indep:.3f}, p = {p_indep:.4f}")
print()
print("The paired test is much more powerful because it eliminates")
print("between-subject variability, focusing only on within-subject changes.")
Out[14]:
Console
Comparison of paired vs. independent analysis:
--------------------------------------------------
Paired t-test:      t = -9.616, p = 0.000005
Independent t-test: t = -2.233, p = 0.0385

The paired test is much more powerful because it eliminates
between-subject variability, focusing only on within-subject changes.

The paired test detects the effect with much higher confidence because it accounts for the correlation between before and after measurements within each subject.
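
We can see this directly from the variance of a difference, $\text{Var}(d) = \text{Var}(x_{\text{after}}) + \text{Var}(x_{\text{before}}) - 2\,\text{Cov}(x_{\text{after}}, x_{\text{before}})$: the strong positive correlation between a patient's two measurements shrinks the variance of the differences far below what independent samples would give. Using the same blood pressure data:

```python
import numpy as np

# Same blood pressure data as above
before = np.array([145, 152, 148, 155, 160, 142, 158, 165, 150, 155])
after = np.array([138, 145, 140, 150, 152, 140, 148, 158, 145, 148])

# A strong positive correlation makes the differences far less variable
# than the two measurement sets would suggest on their own
r = np.corrcoef(before, after)[0, 1]
var_if_independent = np.var(after, ddof=1) + np.var(before, ddof=1)
var_of_differences = np.var(after - before, ddof=1)

print(f"Correlation between before and after: r = {r:.2f}")
print(f"Var(after) + Var(before):  {var_if_independent:.1f}")
print(f"Var(after - before):       {var_of_differences:.1f}")
```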

Assumptions and Robustness

All t-tests rely on assumptions. Understanding what these assumptions are, how to check them, and how robust the tests are to violations is essential for proper application.

Core Assumptions

1. Independence

  • One-sample: Observations must be independent of each other
  • Two-sample (independent): Observations between groups must be independent; observations within groups must be independent
  • Paired: The pairs must be independent of each other (though observations within a pair are dependent by design)

Independence is often the most critical assumption and the hardest to verify. It depends on study design rather than data characteristics.

2. Normality

The sampling distribution of the mean should be approximately normal. This is satisfied when:

  • The population is normally distributed, OR
  • The sample size is large enough for the Central Limit Theorem to apply

Robustness: The t-test is fairly robust to non-normality, especially for:

  • Two-sided tests
  • Larger sample sizes (n > 30 per group)
  • Symmetric distributions

It is less robust for:

  • Small samples from highly skewed distributions
  • One-sided tests
  • Comparing variances or making precise probability statements

3. Homogeneity of Variance (pooled t-test only)

The pooled two-sample t-test assumes equal population variances. Violations distort the Type I error rate, which can be inflated or deflated, and the distortion grows worse when sample sizes are unequal.

Solution: Use Welch's t-test, which doesn't assume equal variances.

Checking Assumptions

Out[15]:
Visualization
Q-Q plot showing sample quantiles versus theoretical normal quantiles.
Q-Q plot for checking normality. Points should fall approximately along the diagonal line if data are normally distributed. Deviations at the tails indicate skewness or heavy tails.
Side-by-side boxplots comparing variance between two groups.
Boxplot comparison for checking variance equality. Similar spreads suggest equal variances; markedly different spreads suggest using Welch's t-test.
In[16]:
Code
import numpy as np
from scipy import stats

# Formal tests for assumptions
np.random.seed(42)
sample = np.random.normal(50, 10, 30)
group1 = np.random.normal(50, 8, 25)
group2 = np.random.normal(55, 15, 30)

# Shapiro-Wilk test for normality
stat, p = stats.shapiro(sample)
print("Normality check (Shapiro-Wilk test):")
print(f"  W = {stat:.4f}, p = {p:.4f}")
print(
    f"  {'No evidence against normality' if p > 0.05 else 'Evidence of non-normality'}"
)
print()

# Levene's test for equal variances
stat, p = stats.levene(group1, group2)
print("Variance equality check (Levene's test):")
print(f"  W = {stat:.4f}, p = {p:.4f}")
print(
    f"  {'No evidence of unequal variances' if p > 0.05 else 'Evidence of unequal variances'}"
)
Out[16]:
Console
Normality check (Shapiro-Wilk test):
  W = 0.9751, p = 0.6868
  No evidence against normality

Variance equality check (Levene's test):
  W = 7.3540, p = 0.0090
  Evidence of unequal variances

Choosing the Right T-Test: Decision Framework

Selecting the appropriate t-test depends on your research design and data characteristics:

Out[17]:
Visualization
Flowchart showing decision process for t-test selection.
Decision tree for selecting the appropriate t-test. Start with the research design, then consider assumptions to choose between test variants. When in doubt about equal variances, Welch's t-test is the safer default.

Quick Reference:

| Scenario | Test | scipy.stats function |
|---|---|---|
| One sample vs. hypothesized value | One-sample t-test | ttest_1samp(data, mu_0) |
| Two independent groups (default) | Welch's t-test | ttest_ind(a, b, equal_var=False) |
| Two independent groups (equal variance confirmed) | Pooled t-test | ttest_ind(a, b, equal_var=True) |
| Matched pairs / before-after | Paired t-test | ttest_rel(after, before) |
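
The same four calls in code form, with small illustrative arrays standing in for real data:

```python
from scipy import stats

data = [15.2, 15.8, 16.1, 15.4, 15.9]
a = [78, 82, 85, 79, 81]
b = [85, 89, 92, 88, 90]
before, after = [145, 152, 148, 155], [138, 145, 140, 150]

print(stats.ttest_1samp(data, popmean=16.0))   # one sample vs. mu_0
print(stats.ttest_ind(a, b, equal_var=False))  # Welch's (default choice)
print(stats.ttest_ind(a, b, equal_var=True))   # pooled
print(stats.ttest_rel(after, before))          # paired
```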

Summary

The t-test is the fundamental tool for comparing means when population variance is unknown, which is nearly always. Key takeaways:

The t-distribution:

  • Accounts for uncertainty from estimating variance
  • Has heavier tails than the normal distribution
  • Converges to normal as degrees of freedom increase
  • Requires larger critical values for small samples

One-sample t-test:

  • Compares sample mean to hypothesized value
  • Test statistic: $t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$
  • Degrees of freedom: $n - 1$

Two-sample t-test:

  • Compares means from two independent groups
  • Pooled variant assumes equal variances
  • Welch's variant (recommended default) handles unequal variances
  • Degrees of freedom differ between variants

Paired t-test:

  • For matched pairs or repeated measures
  • Analyzes within-pair differences
  • Often more powerful than independent tests by eliminating between-subject variability

Assumptions:

  • Independence (critical, design-based)
  • Normality (robust for large samples)
  • Equal variances (pooled test only; avoid by using Welch's)

When in doubt: Use Welch's t-test for two-sample comparisons. You sacrifice minimal power when variances are equal but gain robustness when they differ.

What's Next

This chapter covered t-tests for comparing one or two means. But what if you want to compare variances rather than means? Or compare more than two groups? The next chapters extend these ideas:

  • The F-Test introduces the F-distribution for comparing variances. This is important both as a standalone test (are two groups equally variable?) and as foundation for ANOVA.

  • ANOVA extends the t-test to compare three or more groups simultaneously, avoiding the multiple comparison problems that arise from running many t-tests.

  • Error types, statistical power, effect sizes, and multiple comparison corrections are all essential for interpreting and designing studies that use t-tests and related methods.

The t-test you've learned here is the building block for all these extensions. Master it, and the more advanced methods will follow naturally.

