The T-Test: One-Sample, Two-Sample (Pooled & Welch), Paired Tests & Decision Framework

Michael Brenndoerfer · January 3, 2026 · 30 min read

Complete guide to t-tests including one-sample, two-sample (pooled and Welch), paired tests, assumptions, and decision framework. Learn when to use each variant and how to check assumptions.


The T-Test

In 1908, a statistician named William Sealy Gosset faced a practical problem at the Guinness Brewery in Dublin. He needed to compare the quality of barley batches using small samples, but the statistical methods of his time assumed you knew the population variance, which he didn't. Gosset solved this problem by deriving a new distribution that accounts for the uncertainty of estimating variance from the sample itself. Because Guinness prohibited employees from publishing scientific papers (fearing competitors would realize the advantage of employing statisticians), Gosset published under the pseudonym "Student." The "Student's t-distribution" and the t-test it enables have since become the workhorse of statistical inference.

The t-test is probably the most frequently used statistical test in science, medicine, and industry. Whenever someone asks "Is this average different from that target?" or "Do these two groups differ?", a t-test is typically the answer. This chapter covers the t-test in depth: why it exists, how it works mathematically, and when to use each variant. After mastering t-tests, you'll be ready to explore F-tests for comparing variances, ANOVA for comparing multiple groups, and the broader topics of statistical power and effect sizes.

Why Do We Need the T-Test?

In the previous chapter on z-tests, we learned to compare means when the population standard deviation $\sigma$ is known. The test statistic was:

$$z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$$

This follows a standard normal distribution because the denominator $\sigma / \sqrt{n}$ is a fixed, known quantity. Only the numerator varies from sample to sample.

But here's the problem: you almost never know $\sigma$. In practice, you estimate the standard deviation from your sample data, giving you the sample standard deviation $s$. Early researchers simply substituted $s$ for $\sigma$ and continued using normal distribution critical values. This seemed reasonable, but it led to systematic errors. They rejected the null hypothesis more often than they should have, especially with small samples.

Gosset discovered why: when you estimate $\sigma$ from the same sample you're testing, you introduce additional variability into the test statistic. The sample standard deviation $s$ is itself a random variable that fluctuates from sample to sample. Sometimes $s$ underestimates $\sigma$, making your test statistic too large. Sometimes $s$ overestimates $\sigma$, making it too small. This extra variability means the test statistic no longer follows a normal distribution. It follows a distribution with heavier tails, one that appropriately penalizes us for not knowing the true population variance.

The Student's T-Distribution

Mathematical Foundation

When we replace $\sigma$ with $s$, the test statistic becomes:

$$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$$

This ratio follows a t-distribution with $n - 1$ degrees of freedom (df). The degrees of freedom represent the amount of independent information available to estimate the variance. When computing $s$ from $n$ observations, we use up one degree of freedom estimating the mean, leaving $n - 1$ degrees of freedom for estimating variance.

The t-distribution has several key properties:

  1. Bell-shaped and symmetric: Like the normal distribution, centered at zero under the null hypothesis
  2. Heavier tails: More probability in the extreme values than the normal distribution
  3. Parameterized by degrees of freedom: The shape depends on df, with smaller df meaning heavier tails
  4. Converges to normal: As $df \to \infty$, the t-distribution approaches the standard normal

The heavier tails are the crucial difference. They reflect the uncertainty added by estimating variance: with small samples, $s$ can deviate substantially from $\sigma$, so extreme t-values become more probable. The t-distribution accounts for this by requiring larger critical values to reject the null hypothesis.
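
To make the heavier tails concrete, here is a minimal sketch (the cutoff of 2 is an illustrative choice) computing the probability that a t-statistic exceeds 2 in absolute value for several degrees of freedom:

```python
from scipy import stats

# Tail probability P(|T| > 2) shrinks toward the normal value as df grows
for df in [2, 5, 10, 30, 100]:
    print(f"df = {df:>3}: P(|T| > 2) = {2 * stats.t.sf(2.0, df=df):.4f}")
print(f"normal:   P(|Z| > 2) = {2 * stats.norm.sf(2.0):.4f}")
```

With 2 degrees of freedom, values beyond 2 are roughly four times as likely as under the normal distribution; by df = 100, the two probabilities are nearly identical.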

Why Heavier Tails Make Sense

Consider what happens when $s$ happens to underestimate $\sigma$:

  • The denominator $s/\sqrt{n}$ is too small
  • The t-statistic is inflated
  • We might incorrectly reject $H_0$

When $s$ overestimates $\sigma$:

  • The denominator is too large
  • The t-statistic is deflated
  • We might miss a real effect

These errors don't favor either direction on average, but they add variability. The t-distribution captures exactly this additional spread. With only 5 observations (df = 4), the sample standard deviation is quite unreliable, so the t-distribution has very heavy tails. With 100 observations (df = 99), $s$ is a much more reliable estimate of $\sigma$, and the t-distribution is nearly identical to the normal.

Out[2]:
Visualization
Comparison of t-distributions with different degrees of freedom to the standard normal distribution.
The t-distribution compared to the standard normal. With few degrees of freedom, the t-distribution has substantially heavier tails, meaning extreme values are more likely. As degrees of freedom increase, the t-distribution converges to the normal distribution.

Critical Values: The Practical Impact

The heavier tails translate directly into larger critical values. For a two-sided test at $\alpha = 0.05$:

| Degrees of Freedom | t-critical | z-critical |
|---|---|---|
| 5 | 2.571 | 1.960 |
| 10 | 2.228 | 1.960 |
| 20 | 2.086 | 1.960 |
| 30 | 2.042 | 1.960 |
| 100 | 1.984 | 1.960 |
| $\infty$ | 1.960 | 1.960 |

With df = 5, you need a t-statistic 31% larger than the z-critical value to reject $H_0$. This penalty appropriately reflects the unreliability of variance estimates from small samples. By df = 30, the difference is only about 4%, which is why older textbooks sometimes said "use z for n > 30." Modern practice uses the t-distribution regardless, as computational tools make it trivially easy.
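
The table is easy to reproduce. A short sketch using stats.t.ppf (the inverse CDF) looks up the same two-sided critical values:

```python
from scipy import stats

# Reproduce the critical-value table: two-sided test at alpha = 0.05
alpha = 0.05
for df in [5, 10, 20, 30, 100]:
    print(f"df = {df:>3}: t-critical = {stats.t.ppf(1 - alpha / 2, df=df):.3f}")
print(f"z-critical: {stats.norm.ppf(1 - alpha / 2):.3f}")
```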

Out[3]:
Visualization
Plot showing t-critical values decreasing toward z-critical value as degrees of freedom increase.
Critical values for two-sided tests at alpha = 0.05. The t-critical values decrease toward the z-critical value (1.96) as degrees of freedom increase, reflecting the improved reliability of variance estimates with larger samples.

The One-Sample T-Test

The one-sample t-test compares a sample mean to a hypothesized population value when the population variance is unknown. This is the simplest and most direct application of the t-distribution.

The Complete Procedure

Hypotheses:

  • $H_0: \mu = \mu_0$ (population mean equals the hypothesized value)
  • $H_a: \mu \neq \mu_0$ (two-sided), or $\mu > \mu_0$ / $\mu < \mu_0$ (one-sided)

Test statistic:

$$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$$

where:

  • $\bar{x}$ = sample mean
  • $\mu_0$ = hypothesized population mean
  • $s$ = sample standard deviation (with Bessel's correction: divide by $n-1$)
  • $n$ = sample size
  • $s / \sqrt{n}$ = standard error of the mean

Distribution under $H_0$: $t \sim t_{n-1}$ (t-distribution with $n-1$ degrees of freedom)

P-value calculation:

  • Two-sided: $p = 2 \times P(T > |t|)$ where $T \sim t_{n-1}$
  • One-sided (greater): $p = P(T > t)$
  • One-sided (less): $p = P(T < t)$

Decision rule: Reject $H_0$ if $p < \alpha$, or equivalently, if $|t| > t_{\alpha/2, n-1}$ for two-sided tests.

Worked Example: Quality Control

A coffee shop claims their large drinks contain 16 ounces. A quality inspector measures 15 randomly selected drinks to verify this claim.

In[4]:
Code
import numpy as np
from scipy import stats

# Measurements from 15 randomly selected large drinks (in ounces)
drinks = [
    15.2, 15.8, 16.1, 15.4, 15.9, 15.3, 15.7, 16.0,
    15.5, 15.2, 15.8, 15.6, 15.9, 15.4, 16.2,
]

# Hypothesized mean (the claimed amount)
mu_0 = 16.0

# Step 1: Calculate sample statistics
n = len(drinks)
x_bar = np.mean(drinks)
s = np.std(drinks, ddof=1)  # ddof=1 for Bessel's correction

# Step 2: Calculate standard error
se = s / np.sqrt(n)

# Step 3: Calculate t-statistic
t_stat = (x_bar - mu_0) / se

# Step 4: Determine degrees of freedom
df = n - 1

# Step 5: Calculate p-value (two-sided)
p_value = 2 * stats.t.sf(abs(t_stat), df=df)

# Step 6: Get critical value for comparison
t_crit = stats.t.ppf(0.975, df=df)

# Step 7: Calculate 95% confidence interval
ci_lower = x_bar - t_crit * se
ci_upper = x_bar + t_crit * se

Let's walk through the calculation step by step:

Step 1: Sample Statistics

$$\bar{x} = \frac{1}{15}\sum_{i=1}^{15} x_i = \frac{15.2 + 15.8 + \cdots + 16.2}{15} = 15.667 \text{ oz}$$

$$s = \sqrt{\frac{\sum_{i=1}^{15}(x_i - \bar{x})^2}{15-1}} = 0.324 \text{ oz}$$

Step 2: Standard Error

$$SE = \frac{s}{\sqrt{n}} = \frac{0.324}{\sqrt{15}} = 0.084 \text{ oz}$$

Step 3: T-Statistic

$$t = \frac{\bar{x} - \mu_0}{SE} = \frac{15.667 - 16.0}{0.084} = \frac{-0.333}{0.084} = -3.98$$

Step 4: P-Value

With df = 14, we find the probability of observing $|t| \geq 3.98$ under $H_0$:

$$p = 2 \times P(T_{14} > 3.98) = 0.0014$$
Out[5]:
Console
One-Sample T-Test: Coffee Shop Drink Size
=======================================================

Data: 15 drink measurements (ounces)
Hypothesized mean: mu_0 = 16.0 oz

Step-by-step calculation:
-------------------------------------------------------
1. Sample mean:       x_bar = 15.667 oz
2. Sample std dev:    s = 0.324 oz
3. Standard error:    SE = s/sqrt(n) = 0.324/sqrt(15) = 0.084 oz
4. T-statistic:       t = (x_bar - mu_0)/SE = (15.667 - 16.0)/0.084 = -3.980
5. Degrees of freedom: df = n - 1 = 15 - 1 = 14
6. P-value (two-sided): p = 0.0014

Decision (alpha = 0.05):
-------------------------------------------------------
Critical value: t_crit = +/-2.145
|t| = 3.980 > 2.145 = t_crit

95% Confidence Interval: [15.487, 15.846] oz
Note: CI does not contain 16.0, consistent with rejecting H_0

Conclusion:
-------------------------------------------------------
Reject H_0. There is significant evidence (p = 0.0014) that
the true mean drink size differs from the claimed 16 oz.
The drinks appear to be underfilled by about 0.33 oz on average.
Out[6]:
Visualization
T-distribution with observed test statistic and rejection regions highlighted.
Visualization of the one-sample t-test for the coffee shop example. The t-distribution shows the sampling distribution under the null hypothesis. The observed t-statistic of -3.98 falls far into the left tail, well beyond the critical value, leading to rejection of H_0.
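
As a sanity check, scipy's built-in ttest_1samp should reproduce the step-by-step result from the same data:

```python
from scipy import stats

# Same 15 measurements as above; scipy should match the manual result
drinks = [15.2, 15.8, 16.1, 15.4, 15.9, 15.3, 15.7, 16.0,
          15.5, 15.2, 15.8, 15.6, 15.9, 15.4, 16.2]

t_stat, p_value = stats.ttest_1samp(drinks, popmean=16.0)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")  # expect t = -3.980, p = 0.0014
```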

The Two-Sample T-Test

The two-sample t-test compares means from two independent groups. This is perhaps the most common application: comparing treatment vs. control, method A vs. method B, or any two distinct populations.

A crucial question arises: do the two groups have equal variance? The answer determines which variant of the t-test you should use.

Why Variance Equality Matters

When comparing two means, we need to estimate the standard error of the difference $\bar{x}_1 - \bar{x}_2$. If both groups share the same variance $\sigma^2$, we can pool their data to get a more precise estimate. But if variances differ, pooling gives incorrect results, biasing our inference.

This leads to two variants:

  • Pooled (Student's) t-test: Assumes equal variances, pools data for efficiency
  • Welch's t-test: Makes no variance assumption, handles unequal variances correctly

Pooled Two-Sample T-Test

When we assume $\sigma_1^2 = \sigma_2^2 = \sigma^2$, we can estimate this common variance using data from both groups.

Pooled variance estimate:

$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$$

This is a weighted average of the two sample variances, with weights proportional to their degrees of freedom. Larger samples contribute more to the pooled estimate because they provide more reliable information.

Standard error of the difference:

$$SE = s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$

The term $\sqrt{1/n_1 + 1/n_2}$ accounts for the fact that uncertainty in the difference comes from uncertainty in both means.

Test statistic:

$$t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$$

Degrees of freedom: $df = n_1 + n_2 - 2$

Each group contributes $n_i - 1$ degrees of freedom for variance estimation, and we combine them.

Welch's T-Test

When variances may differ, we cannot pool them. Instead, we estimate standard errors separately and combine them using a different formula.

Standard error of the difference:

$$SE = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$

This follows from the variance of a difference: $\text{Var}(\bar{x}_1 - \bar{x}_2) = \text{Var}(\bar{x}_1) + \text{Var}(\bar{x}_2)$ for independent samples.

Test statistic:

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$

Degrees of freedom (Welch-Satterthwaite approximation):

$$df = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{(s_1^2/n_1)^2}{n_1-1} + \frac{(s_2^2/n_2)^2}{n_2-1}}$$

This formula estimates the effective degrees of freedom when variances are unequal. The result is typically non-integer, which poses no problem: the t-distribution is defined for any positive real df, and software evaluates it directly.
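
As a quick illustration, scipy accepts fractional degrees of freedom directly, for example the df of about 22.6 that also appears in the worked example below (the t-value of 2.1 here is an arbitrary illustrative choice):

```python
from scipy import stats

# The t-distribution is defined for non-integer df, so Welch's
# fractional degrees of freedom can be used as-is
p = 2 * stats.t.sf(2.1, df=22.6)
print(f"Two-sided p-value at t = 2.1 with df = 22.6: {p:.4f}")
```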

Why Welch's is the recommended default:

  • When variances are equal, Welch's test loses only slightly in power compared to the pooled test
  • When variances are unequal, the pooled test can give seriously incorrect results
  • The asymmetric risk/reward favors Welch's as the default choice

Worked Example: Comparing Teaching Methods

A researcher compares test scores from two teaching methods: Method A (traditional lectures) and Method B (active learning).

In[7]:
Code
import numpy as np
from scipy import stats

# Test scores from two teaching methods
method_a = [78, 82, 85, 79, 81, 84, 77, 83, 80, 82, 79, 86]  # n=12
method_b = [85, 89, 92, 88, 90, 93, 86, 91, 88, 90, 87, 94, 89, 91]  # n=14

# Sample statistics
n1, n2 = len(method_a), len(method_b)
x1_bar, x2_bar = np.mean(method_a), np.mean(method_b)
s1, s2 = np.std(method_a, ddof=1), np.std(method_b, ddof=1)

# ===== POOLED T-TEST =====
# Pooled variance
sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
sp = np.sqrt(sp2)

# Standard error (pooled)
se_pooled = sp * np.sqrt(1 / n1 + 1 / n2)

# Test statistic
t_pooled = (x2_bar - x1_bar) / se_pooled
df_pooled = n1 + n2 - 2

# P-value
p_pooled = 2 * stats.t.sf(abs(t_pooled), df=df_pooled)

# ===== WELCH'S T-TEST =====
# Standard error (Welch)
se_welch = np.sqrt(s1**2 / n1 + s2**2 / n2)

# Test statistic
t_welch = (x2_bar - x1_bar) / se_welch

# Degrees of freedom (Welch-Satterthwaite)
num = (s1**2 / n1 + s2**2 / n2) ** 2
denom = (s1**2 / n1) ** 2 / (n1 - 1) + (s2**2 / n2) ** 2 / (n2 - 1)
df_welch = num / denom

# P-value
p_welch = 2 * stats.t.sf(abs(t_welch), df=df_welch)

# Effect size (Cohen's d)
cohens_d = (x2_bar - x1_bar) / sp
Out[8]:
Console
Two-Sample T-Test: Comparing Teaching Methods
============================================================

Sample statistics:
------------------------------------------------------------
Method A: n = 12, mean = 81.33, std = 2.84
Method B: n = 14, mean = 89.50, std = 2.59
Difference (B - A): 8.17 points

POOLED T-TEST (assumes equal variances)
------------------------------------------------------------
Pooled std: s_p = 2.709
Standard error: SE = 1.066
t-statistic: t(24) = 7.662
p-value: 0.000000

WELCH'S T-TEST (no variance assumption)
------------------------------------------------------------
Standard error: SE = 1.074
t-statistic: t(22.6) = 7.607
p-value: 0.000000

Effect size:
------------------------------------------------------------
Cohen's d = 3.01 (very large effect)

Conclusion:
------------------------------------------------------------
Both tests show highly significant results (p < 0.001).
Method B produces substantially higher scores than Method A.
The effect size (d = 3.01) indicates a very large practical difference.
Out[9]:
Visualization
Side-by-side boxplots comparing Method A and Method B test scores.
Comparison of test scores between two teaching methods. The boxplots show the distribution of scores in each group, with individual data points overlaid. The minimal overlap between groups, combined with the large mean difference, results in a highly significant t-test.
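
Both variants are built into scipy. A quick check with the same scores should reproduce the manual results; the equal_var argument selects the variant:

```python
from scipy import stats

method_a = [78, 82, 85, 79, 81, 84, 77, 83, 80, 82, 79, 86]
method_b = [85, 89, 92, 88, 90, 93, 86, 91, 88, 90, 87, 94, 89, 91]

# equal_var=True gives the pooled test, equal_var=False gives Welch's
t_pooled, p_pooled = stats.ttest_ind(method_b, method_a, equal_var=True)
t_welch, p_welch = stats.ttest_ind(method_b, method_a, equal_var=False)

print(f"Pooled:  t = {t_pooled:.3f}, p = {p_pooled:.6f}")
print(f"Welch's: t = {t_welch:.3f}, p = {p_welch:.6f}")
```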

Checking the Equal Variance Assumption

Before choosing between pooled and Welch's tests, you might want to check whether variances are equal. Two common approaches:

1. Visual inspection: Compare the spread in boxplots or calculate the ratio of sample variances. A ratio greater than 2 suggests meaningful inequality.

2. Levene's test: A formal hypothesis test for equality of variances. Unlike older alternatives (Bartlett's test), Levene's test is robust to non-normality.

In[10]:
Code
import numpy as np
from scipy import stats

# Sample data with different variances
group1 = [45, 52, 48, 51, 49, 53, 47, 50, 46, 54]
group2 = [42, 58, 35, 62, 48, 55, 38, 65, 44, 52]

# Visual check: variance ratio
s1 = np.std(group1, ddof=1)
s2 = np.std(group2, ddof=1)
print(f"Group 1 std: {s1:.2f}")
print(f"Group 2 std: {s2:.2f}")
print(f"Variance ratio: {(s2 / s1) ** 2:.2f}")
print()

# Levene's test
stat, p_value = stats.levene(group1, group2)
print(f"Levene's test: W = {stat:.3f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Conclusion: Evidence of unequal variances. Use Welch's t-test.")
else:
    print(
        "Conclusion: No evidence of unequal variances. Either test is appropriate."
    )
Out[10]:
Console
Group 1 std: 3.03
Group 2 std: 10.19
Variance ratio: 11.33

Levene's test: W = 13.935, p = 0.0015
Conclusion: Evidence of unequal variances. Use Welch's t-test.
Practical Recommendation

Many statisticians now recommend always using Welch's t-test as the default, regardless of whether variances appear equal. The reasoning:

  1. When variances are truly equal, Welch's test loses very little power compared to the pooled test
  2. When variances are unequal, the pooled test can produce seriously misleading results
  3. Pre-testing for equal variances adds complexity and has its own issues (low power with small samples, unnecessary when samples are large)

This "Welch by default" approach is increasingly adopted in statistical software and practice.

The Paired T-Test

The paired t-test is used when observations come in matched pairs, where each pair shares some characteristic that makes them non-independent. Common scenarios include:

  • Before-after measurements: Same subjects measured twice
  • Matched case-control studies: Cases matched to controls on key characteristics
  • Twin studies: Comparing outcomes between twins
  • Left-right comparisons: Same subject, different sides

Why Pairing Matters

Consider testing whether a training program improves performance. You could:

  1. Independent design: Randomly assign people to training or control groups, compare group means
  2. Paired design: Measure each person before and after training, analyze the changes

The paired design is often more powerful because it eliminates between-subject variability. People differ enormously in baseline ability, but by looking at changes within each person, we focus only on the training effect.

The Mathematics

The paired t-test is simply a one-sample t-test on the differences:

$$d_i = x_{i,\text{after}} - x_{i,\text{before}}$$

Test statistic:

$$t = \frac{\bar{d}}{s_d / \sqrt{n}}$$

where:

  • $\bar{d}$ = mean of the paired differences
  • $s_d$ = standard deviation of the differences
  • $n$ = number of pairs
  • $df = n - 1$

Null hypothesis: $H_0: \mu_d = 0$ (no average change)
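
This equivalence is easy to verify. A minimal check on synthetic (made-up) data shows that ttest_rel and ttest_1samp on the differences agree exactly:

```python
import numpy as np
from scipy import stats

# Illustrative synthetic data: a paired t-test is identical to a
# one-sample t-test on the within-pair differences
rng = np.random.default_rng(1)
before = rng.normal(100, 15, size=12)
after = before + rng.normal(-3, 4, size=12)  # hypothetical shift of -3

paired = stats.ttest_rel(after, before)
one_sample = stats.ttest_1samp(after - before, popmean=0)

print(f"Paired:     t = {paired.statistic:.4f}, p = {paired.pvalue:.4f}")
print(f"One-sample: t = {one_sample.statistic:.4f}, p = {one_sample.pvalue:.4f}")
```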

Worked Example: Blood Pressure Medication

A study tests whether a new medication reduces blood pressure. Ten patients have their systolic blood pressure measured before starting medication and again after 4 weeks of treatment.

In[11]:
Code
import numpy as np
from scipy import stats

# Blood pressure measurements (mmHg)
before = np.array([145, 152, 148, 155, 160, 142, 158, 165, 150, 155])
after = np.array([138, 145, 140, 150, 152, 140, 148, 158, 145, 148])

# Step 1: Calculate differences (after - before)
differences = after - before
n = len(differences)

# Step 2: Calculate mean and std of differences
d_bar = np.mean(differences)
s_d = np.std(differences, ddof=1)

# Step 3: Calculate standard error
se_d = s_d / np.sqrt(n)

# Step 4: Calculate t-statistic
t_stat = d_bar / se_d
df = n - 1

# Step 5: Calculate p-value (two-sided)
p_value = 2 * stats.t.sf(abs(t_stat), df=df)

# Step 6: Confidence interval for mean change
t_crit = stats.t.ppf(0.975, df=df)
ci = (d_bar - t_crit * se_d, d_bar + t_crit * se_d)

# Verify with scipy
t_scipy, p_scipy = stats.ttest_rel(after, before)
Out[12]:
Console
Paired T-Test: Blood Pressure Medication Study
============================================================

Data (n = 10 patients):
------------------------------------------------------------
Before: [145, 152, 148, 155, 160, 142, 158, 165, 150, 155]
After:  [138, 145, 140, 150, 152, 140, 148, 158, 145, 148]
Change: [-7, -7, -8, -5, -8, -2, -10, -7, -5, -7]

Step-by-step calculation:
------------------------------------------------------------
1. Mean change:     d_bar = -6.60 mmHg
2. Std of changes:  s_d = 2.17 mmHg
3. Standard error:  SE = s_d/sqrt(n) = 2.17/sqrt(10) = 0.69
4. T-statistic:     t = d_bar/SE = -6.60/0.69 = -9.616
5. Degrees of freedom: df = n - 1 = 9
6. P-value (two-sided): 0.000005

95% CI for mean change: [-8.15, -5.05] mmHg

Interpretation:
------------------------------------------------------------
The average blood pressure decreased by 6.6 mmHg (p < 0.001).
We are 95% confident the true mean reduction is between
5.0 and 8.2 mmHg.
Out[13]:
Visualization
Paired data plot showing before and after blood pressure measurements connected by lines.
Paired t-test visualization showing blood pressure before and after medication. Each line connects measurements from the same patient. The consistent downward slope indicates most patients experienced a reduction, which the paired t-test detects with high significance.

Why Not Use an Independent T-Test?

You might wonder: what if we ignored the pairing and used an independent two-sample t-test? Let's compare:

In[14]:
Code
# Paired t-test (correct)
t_paired, p_paired = stats.ttest_rel(after, before)

# Independent t-test (ignoring pairing)
t_indep, p_indep = stats.ttest_ind(after, before)

print("Comparison of paired vs. independent analysis:")
print("-" * 50)
print(f"Paired t-test:      t = {t_paired:.3f}, p = {p_paired:.6f}")
print(f"Independent t-test: t = {t_indep:.3f}, p = {p_indep:.4f}")
print()
print("The paired test is much more powerful because it eliminates")
print("between-subject variability, focusing only on within-subject changes.")
Out[14]:
Console
Comparison of paired vs. independent analysis:
--------------------------------------------------
Paired t-test:      t = -9.616, p = 0.000005
Independent t-test: t = -2.233, p = 0.0385

The paired test is much more powerful because it eliminates
between-subject variability, focusing only on within-subject changes.

The paired test detects the effect with much higher confidence because it accounts for the correlation between before and after measurements within each subject.
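
We can see this directly from the variance of a difference, $\text{Var}(d) = \text{Var}(x_{\text{after}}) + \text{Var}(x_{\text{before}}) - 2\,\text{Cov}(x_{\text{after}}, x_{\text{before}})$: the strong positive correlation between a patient's two measurements shrinks the variance of the differences far below what independent samples would give. Using the same blood pressure data:

```python
import numpy as np

# Same blood pressure data as above
before = np.array([145, 152, 148, 155, 160, 142, 158, 165, 150, 155])
after = np.array([138, 145, 140, 150, 152, 140, 148, 158, 145, 148])

# A strong positive correlation makes the differences far less variable
# than the two measurement sets would suggest on their own
r = np.corrcoef(before, after)[0, 1]
var_if_independent = np.var(after, ddof=1) + np.var(before, ddof=1)
var_of_differences = np.var(after - before, ddof=1)

print(f"Correlation between before and after: r = {r:.2f}")
print(f"Var(after) + Var(before):  {var_if_independent:.1f}")
print(f"Var(after - before):       {var_of_differences:.1f}")
```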

Assumptions and Robustness

All t-tests rely on assumptions. Understanding what these assumptions are, how to check them, and how robust the tests are to violations is essential for proper application.

Core Assumptions

1. Independence

  • One-sample: Observations must be independent of each other
  • Two-sample (independent): Observations between groups must be independent; observations within groups must be independent
  • Paired: The pairs must be independent of each other (though observations within a pair are dependent by design)

Independence is often the most critical assumption and the hardest to verify. It depends on study design rather than data characteristics.

2. Normality

The sampling distribution of the mean should be approximately normal. This is satisfied when:

  • The population is normally distributed, OR
  • The sample size is large enough for the Central Limit Theorem to apply

Robustness: The t-test is fairly robust to non-normality, especially for:

  • Two-sided tests
  • Larger sample sizes (n > 30 per group)
  • Symmetric distributions

It is less robust for:

  • Small samples from highly skewed distributions
  • One-sided tests
  • Comparing variances or making precise probability statements

3. Homogeneity of Variance (pooled t-test only)

The pooled two-sample t-test assumes equal population variances. Violations distort the Type I error rate, which can be inflated or deflated, and the distortion grows worse when sample sizes are unequal.

Solution: Use Welch's t-test, which doesn't assume equal variances.

Checking Assumptions

Out[15]:
Visualization
Q-Q plot showing sample quantiles versus theoretical normal quantiles.
Q-Q plot for checking normality. Points should fall approximately along the diagonal line if data are normally distributed. Deviations at the tails indicate skewness or heavy tails.
Side-by-side boxplots comparing variance between two groups.
Boxplot comparison for checking variance equality. Similar spreads suggest equal variances; markedly different spreads suggest using Welch's t-test.
In[16]:
Code
import numpy as np
from scipy import stats

# Formal tests for assumptions
np.random.seed(42)
sample = np.random.normal(50, 10, 30)
group1 = np.random.normal(50, 8, 25)
group2 = np.random.normal(55, 15, 30)

# Shapiro-Wilk test for normality
stat, p = stats.shapiro(sample)
print("Normality check (Shapiro-Wilk test):")
print(f"  W = {stat:.4f}, p = {p:.4f}")
print(
    f"  {'No evidence against normality' if p > 0.05 else 'Evidence of non-normality'}"
)
print()

# Levene's test for equal variances
stat, p = stats.levene(group1, group2)
print("Variance equality check (Levene's test):")
print(f"  W = {stat:.4f}, p = {p:.4f}")
print(
    f"  {'No evidence of unequal variances' if p > 0.05 else 'Evidence of unequal variances'}"
)
Out[16]:
Console
Normality check (Shapiro-Wilk test):
  W = 0.9751, p = 0.6868
  No evidence against normality

Variance equality check (Levene's test):
  W = 7.3540, p = 0.0090
  Evidence of unequal variances

Choosing the Right T-Test: Decision Framework

Selecting the appropriate t-test depends on your research design and data characteristics:

Out[17]:
Visualization
Flowchart showing decision process for t-test selection.
Decision tree for selecting the appropriate t-test. Start with the research design, then consider assumptions to choose between test variants. When in doubt about equal variances, Welch's t-test is the safer default.

Quick Reference:

| Scenario | Test | scipy.stats function |
|---|---|---|
| One sample vs. hypothesized value | One-sample t-test | ttest_1samp(data, mu_0) |
| Two independent groups (default) | Welch's t-test | ttest_ind(a, b, equal_var=False) |
| Two independent groups (equal variance confirmed) | Pooled t-test | ttest_ind(a, b, equal_var=True) |
| Matched pairs / before-after | Paired t-test | ttest_rel(after, before) |
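
The same four calls in code form, with small illustrative arrays standing in for real data:

```python
from scipy import stats

data = [15.2, 15.8, 16.1, 15.4, 15.9]
a = [78, 82, 85, 79, 81]
b = [85, 89, 92, 88, 90]
before, after = [145, 152, 148, 155], [138, 145, 140, 150]

print(stats.ttest_1samp(data, popmean=16.0))   # one sample vs. mu_0
print(stats.ttest_ind(a, b, equal_var=False))  # Welch's (default choice)
print(stats.ttest_ind(a, b, equal_var=True))   # pooled
print(stats.ttest_rel(after, before))          # paired
```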

Summary

The t-test is the fundamental tool for comparing means when population variance is unknown, which is nearly always. Key takeaways:

The t-distribution:

  • Accounts for uncertainty from estimating variance
  • Has heavier tails than the normal distribution
  • Converges to normal as degrees of freedom increase
  • Requires larger critical values for small samples

One-sample t-test:

  • Compares sample mean to hypothesized value
  • Test statistic: $t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$
  • Degrees of freedom: $n - 1$

Two-sample t-test:

  • Compares means from two independent groups
  • Pooled variant assumes equal variances
  • Welch's variant (recommended default) handles unequal variances
  • Degrees of freedom differ between variants

Paired t-test:

  • For matched pairs or repeated measures
  • Analyzes within-pair differences
  • Often more powerful than independent tests by eliminating between-subject variability

Assumptions:

  • Independence (critical, design-based)
  • Normality (robust for large samples)
  • Equal variances (pooled test only; avoid by using Welch's)

When in doubt: Use Welch's t-test for two-sample comparisons. You sacrifice minimal power when variances are equal but gain robustness when they differ.

What's Next

This chapter covered t-tests for comparing one or two means. But what if you want to compare variances rather than means? Or compare more than two groups? The next chapters extend these ideas:

  • The F-Test introduces the F-distribution for comparing variances. This is important both as a standalone test (are two groups equally variable?) and as foundation for ANOVA.

  • ANOVA extends the t-test to compare three or more groups simultaneously, avoiding the multiple comparison problems that arise from running many t-tests.

  • Error types, statistical power, effect sizes, and multiple comparison corrections are all essential for interpreting and designing studies that use t-tests and related methods.

The t-test you've learned here is the building block for all these extensions. Master it, and the more advanced methods will follow naturally.

