Foundations of hypothesis testing, covering p-values, null and alternative hypotheses, one-sided vs two-sided tests, and test statistics. Learn how to set up and interpret hypothesis tests correctly.

P-values and Hypothesis Test Setup
You run an A/B test and your new checkout flow shows a 3% higher conversion rate. Is this a real improvement, or just random noise? A clinical trial finds that patients on a new medication have lower blood pressure than those on placebo. Should doctors start prescribing it? Your manufacturing line produces widgets with an average weight that seems off from the specification. Is the machine miscalibrated, or are you just seeing normal variation?
These questions share a common structure: you have data, you have a claim, and you need to decide whether the data support the claim or whether what you're seeing could easily happen by chance. Hypothesis testing provides the framework for making these decisions rigorously, quantifying the strength of evidence so you can make informed choices rather than guessing.
This chapter is the first in a comprehensive series on hypothesis testing. Here, we establish the foundational concepts that everything else builds upon: what p-values really mean, how to set up null and alternative hypotheses, the difference between one-sided and two-sided tests, and how test statistics translate raw data into evidence. Master these fundamentals, and the specific tests covered in later chapters (z-tests, t-tests, F-tests, ANOVA) will feel like natural applications of the same core logic.
The Logic of Hypothesis Testing
Before diving into formulas, let's understand the reasoning that makes hypothesis testing work. The logic is elegant but counterintuitive: we don't try to prove what we believe is true. Instead, we assume the opposite and see if the data make that assumption look ridiculous.
Imagine you flip a coin 100 times and get 63 heads. You suspect the coin might be biased toward heads. How would you go about testing this suspicion?
The direct approach, trying to prove the coin is biased, runs into an immediate problem: what does "biased" mean exactly? A coin that lands heads 51% of the time is biased, but so is a coin that lands heads 99% of the time. There are infinitely many ways a coin could be biased, so you can't calculate the probability of "biased" without specifying exactly how biased.
The indirect approach works much better. Start by assuming the coin is fair (50% heads). Under this assumption, calculate the probability of getting 63 or more heads in 100 flips. If this probability is very small, say 1%, then either:
1. The coin really is fair, and you just witnessed something that happens only 1% of the time, or
2. The coin isn't fair, and what you witnessed is actually quite likely
When the probability is small enough, option (2) becomes more plausible than option (1). You reject the assumption that the coin is fair.
This is the essence of hypothesis testing: assume nothing interesting is happening (the null hypothesis), calculate how surprising your data would be under that assumption (the p-value), and reject the assumption if the data are surprising enough.
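To see this logic in action before any formulas, here is a small simulation sketch: generate many batches of 100 fair-coin flips and check how often 63 or more heads turn up. The number of simulated experiments and the random seed are illustrative choices, not part of the example above.

```python
# Monte Carlo check of the coin-flip logic: if the coin is fair, how often
# do we see 63 or more heads in 100 flips?
import numpy as np

rng = np.random.default_rng(seed=42)

n_experiments = 100_000   # number of simulated 100-flip experiments
n_flips = 100
observed_heads = 63

# Simulate the number of heads in each experiment under a fair coin.
heads = rng.binomial(n=n_flips, p=0.5, size=n_experiments)

# Fraction of fair-coin experiments at least as extreme as what we observed
# (one direction only; the two-sided version doubles it by symmetry).
frac_extreme = np.mean(heads >= observed_heads)
print(f"P(>= {observed_heads} heads | fair coin) ~ {frac_extreme:.4f}")
```

With a fair coin, only around 0.6% of simulated experiments reach 63 heads, which is exactly the kind of "surprise" the p-value formalizes below.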
What a P-value Is (and Isn't)
The p-value is the most misunderstood concept in statistics, but it's actually simple once you internalize what question it answers.
The p-value is the probability of observing data at least as extreme as what you actually observed, assuming the null hypothesis is true.
Let's unpack this carefully with our coin example. You flipped 100 times, got 63 heads, and the null hypothesis says P(heads) = 0.5. The p-value asks: if the coin really were fair, how often would you see 63 or more heads (or, for a two-sided test, also 37 or fewer heads)?
The p-value of about 0.012 tells us that if the coin were fair, we'd see a result this extreme only about 1.2% of the time. That's pretty surprising! It doesn't prove the coin is biased, but it suggests the "fair coin" assumption doesn't fit our data very well.
Computing the P-value Step by Step
Let's work through the math explicitly. Under the null hypothesis that P(heads) = 0.5, the number of heads X follows a binomial distribution with n = 100 and p = 0.5:

$$X \sim \text{Binomial}(n = 100,\ p = 0.5), \qquad P(X = k) = \binom{100}{k} (0.5)^{100}$$

For a two-tailed test, we want the probability of being as extreme or more extreme than 63 in either direction. Since 63 is 13 away from the expected value of 50, we count outcomes of 63 or more heads, OR 37 or fewer heads:

$$p\text{-value} = P(X \geq 63) + P(X \leq 37)$$

By symmetry of the binomial distribution when p = 0.5:

$$p\text{-value} = 2 \cdot P(X \geq 63) \approx 2 \times 0.006 \approx 0.012$$
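The same calculation can be done in a few lines of code, a minimal sketch using scipy (not required anywhere else in this chapter):

```python
# Exact two-sided p-value for 63 heads in 100 flips under H0: P(heads) = 0.5.
from scipy.stats import binom

n, p_null, k = 100, 0.5, 63

p_upper = binom.sf(k - 1, n, p_null)    # P(X >= 63), since sf(62) = P(X > 62)
p_lower = binom.cdf(n - k, n, p_null)   # P(X <= 37)
p_two_sided = p_upper + p_lower         # equals 2 * p_upper by symmetry

print(f"P(X >= 63)        = {p_upper:.4f}")
print(f"two-sided p-value = {p_two_sided:.4f}")   # ~0.012
```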
What the P-value Does NOT Mean
Understanding what the p-value isn't is as important as understanding what it is. These three misinterpretations are incredibly common, even among experienced practitioners:
Misinterpretation 1: "The p-value is the probability that the null hypothesis is true."
This is wrong. The null hypothesis is either true or false. It's a fixed (but unknown) fact about reality, not a random variable. The p-value is about the probability of the data given the hypothesis, not the probability of the hypothesis given the data.
If you want to know P(hypothesis | data), you need Bayesian methods with prior probabilities. Classical hypothesis testing doesn't give you this.
Misinterpretation 2: "1 minus the p-value is the probability the effect is real."
Also wrong. A p-value of 0.02 does not mean there's a 98% chance the alternative hypothesis is true. This is the same error as above, just phrased differently.
Misinterpretation 3: "A small p-value means the effect is large or important."
Wrong again. A tiny p-value can arise from a tiny effect with a huge sample size. If you test a million users, even a 0.001% difference in conversion rates might be "statistically significant" with p < 0.001, but that doesn't make it practically meaningful.
The p-value measures evidence against the null hypothesis, not the size or importance of an effect. Effect sizes, covered in a later chapter, address that question.
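To make misinterpretation 3 concrete, here is a sketch of a two-proportion z-test with made-up numbers: once the sample is large enough, a lift of less than a tenth of a percentage point becomes highly "significant." The sample sizes and conversion rates are illustrative assumptions, not figures from this chapter.

```python
# Statistical vs practical significance: a huge sample makes a tiny difference
# in conversion rates "significant" even though the lift barely matters.
import numpy as np
from scipy.stats import norm

n_a, n_b = 5_000_000, 5_000_000          # users per arm (hypothetical)
conv_a, conv_b = 0.1000, 0.1008          # a 0.08 percentage-point difference

# Two-proportion z-test with a pooled standard error
p_pool = (conv_a * n_a + conv_b * n_b) / (n_a + n_b)
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (conv_b - conv_a) / se
p_value = 2 * norm.sf(abs(z))            # two-sided

print(f"z = {z:.2f}, p = {p_value:.2e}")  # tiny p, yet the lift is ~0.08 points
```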
Why P < 0.05 Is a Convention, Not a Law
The threshold of 0.05 for "statistical significance" comes from Ronald Fisher, who in 1925 wrote that results below this threshold were "worth taking seriously." It was never meant to be a rigid cutoff.
Yet the 0.05 threshold has calcified into a binary decision rule that distorts research. A p-value of 0.049 gets published as "significant," while 0.051 gets buried as "not significant," despite conveying nearly identical evidence. This cliff encourages p-hacking, selective reporting, and publication bias.
Better practice is to:
- Report exact p-values, not just "p < 0.05"
- Interpret p-values on a continuum (0.001 is much stronger evidence than 0.04)
- Choose thresholds based on context (physics uses 5-sigma ≈ p < 0.0000003 for discoveries)
- Consider effect sizes alongside p-values
Setting Up a Hypothesis Test
With p-values understood, let's examine the mechanics of actually conducting a test. Every hypothesis test has the same structure: formulate hypotheses, choose a test, collect data, compute a test statistic, and make a decision.
Null and Alternative Hypotheses
The null hypothesis ($H_0$) represents the skeptical position. It typically claims "nothing interesting is happening": no effect, no difference, no relationship. The null is what we assume true unless the data provide strong evidence otherwise.
The alternative hypothesis ($H_1$ or $H_a$) is what we're trying to establish. It contradicts the null and represents the research claim.
Example 1: Drug efficacy
- $H_0$: The drug has no effect on blood pressure (mean change = 0)
- $H_1$: The drug affects blood pressure (mean change ≠ 0)
Example 2: A/B test
- $H_0$: The new design has the same conversion rate as the old (rate_new = rate_old)
- $H_1$: The new design has a different conversion rate (rate_new ≠ rate_old)
Example 3: Quality control
- $H_0$: The machine is calibrated correctly (mean weight = 50g)
- $H_1$: The machine is miscalibrated (mean weight ≠ 50g)
We never "accept" the null hypothesis. We either reject it (if evidence is strong enough) or fail to reject it (if evidence is insufficient). Failing to reject doesn't prove the null is true. It just means we don't have enough evidence to claim otherwise. This asymmetry reflects the logic: we're testing whether the data are inconsistent with the null, not whether they prove it.
One-Sided vs Two-Sided Tests
The alternative hypothesis can be directional (one-sided) or non-directional (two-sided).
A two-sided test considers alternatives in both directions:

$$H_0: \mu = \mu_0 \qquad \text{vs.} \qquad H_1: \mu \neq \mu_0$$
Use this when deviations in either direction are scientifically meaningful. If you're testing whether a new drug affects blood pressure, you care whether it raises OR lowers it.
A one-sided test considers alternatives in only one direction:
- $H_0: \mu = \mu_0$ (or $\mu \leq \mu_0$)
- $H_1: \mu > \mu_0$ (right-tailed)
Or:
- $H_0: \mu = \mu_0$ (or $\mu \geq \mu_0$)
- $H_1: \mu < \mu_0$ (left-tailed)
Use this when only one direction matters. If you're testing whether a new treatment improves outcomes, you might not care if it makes things worse (that's a different problem).
The tradeoff: One-sided tests have more power to detect effects in the hypothesized direction because the entire significance level is concentrated in one tail. But they completely miss effects in the opposite direction. If you use a right-tailed test and the true effect is negative, you'll never reject the null, no matter how strong the effect.
Rule of thumb: Unless you have a compelling scientific reason to expect the effect in only one direction, use a two-sided test. It's more conservative and avoids interpretive problems.
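The sketch below runs the same made-up sample through a two-sided and a right-tailed one-sample t-test to show the power tradeoff. The data and hypothesized mean are illustrative assumptions, and the `alternative` keyword requires a reasonably recent scipy (1.6+).

```python
# One-sided vs two-sided p-values on the same sample.
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(seed=0)
sample = rng.normal(loc=10.4, scale=1.0, size=30)   # true mean slightly above 10

mu_0 = 10.0  # hypothesized mean under H0

two_sided = ttest_1samp(sample, popmean=mu_0, alternative="two-sided")
right_tail = ttest_1samp(sample, popmean=mu_0, alternative="greater")

# When the effect lies in the hypothesized direction, the right-tailed p-value
# is half the two-sided one -- more power, but a negative effect would be missed.
print(f"two-sided: t = {two_sided.statistic:.2f}, p = {two_sided.pvalue:.4f}")
print(f"one-sided: t = {right_tail.statistic:.2f}, p = {right_tail.pvalue:.4f}")
```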
The Test Statistic: Standardizing Evidence
Raw differences are meaningless without context. If I tell you my sample mean is 5 units above the hypothesized value, you can't assess the evidence without knowing how much variability to expect.
The test statistic solves this by standardizing the deviation. It answers: "How many standard errors away from the null hypothesis is my observation?"
The General Formula
For most tests involving means, the test statistic takes this form:

$$\text{test statistic} = \frac{\text{observed value} - \text{value claimed by } H_0}{\text{standard error}}$$
Let's break this down:
Numerator: How far is my observation from what the null predicts?
If $H_0$ claims the population mean is 100 and I observe a sample mean of 105, the numerator is $105 - 100 = 5$.
Denominator: How much random variation should I expect?
The standard error measures the typical sampling variation. For a sample mean, it's $\sigma/\sqrt{n}$ (if we know the population standard deviation) or $s/\sqrt{n}$ (if we estimate it from the sample).
The ratio: How surprising is my observation?
A test statistic of 0.5 means my observation is half a standard error from the null. That's not surprising. A test statistic of 3 means my observation is 3 standard errors from the null. That's quite unusual.
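Here is the recipe as a few lines of code. It reuses the small example above (sample mean 105 vs. a hypothesized 100); the standard deviation and sample size are assumptions added purely for illustration.

```python
# A minimal sketch of the general test-statistic recipe: how many standard
# errors is the observation from the value the null hypothesis predicts?
import math

def test_statistic(sample_mean, null_mean, sample_sd, n):
    """(observed - hypothesized) / standard error of the mean."""
    standard_error = sample_sd / math.sqrt(n)
    return (sample_mean - null_mean) / standard_error

# Sample mean 105, H0 mean 100, assumed sd = 20, n = 64
t_stat = test_statistic(sample_mean=105, null_mean=100, sample_sd=20, n=64)
print(f"test statistic = {t_stat:.2f}")   # 5 / (20/8) = 2.0 standard errors
```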
Example: Step-by-Step Calculation
Let's work through a complete example. Suppose you want to test whether the average weight of packages from a production line differs from the target of 500g. You sample 25 packages and measure their weights.
Step 1: State the hypotheses
- $H_0: \mu = 500$ g (the machine is calibrated correctly)
- $H_1: \mu \neq 500$ g (the machine is miscalibrated)
Step 2: Collect data and compute summary statistics
From the 25 measured weights, compute the sample mean $\bar{x}$ and the sample standard deviation $s$; these two summaries, together with $n = 25$, are all the test needs.
Step 3: Calculate the test statistic
Since we don't know the population standard deviation, we use the sample standard deviation and the t-statistic:

$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{\bar{x} - 500}{s/\sqrt{25}}$$
Step 4: Find the p-value
The t-statistic follows a t-distribution with $n - 1 = 24$ degrees of freedom. For a two-sided test, we need the probability of being as extreme or more extreme than our observed t-statistic in either direction:

$$p\text{-value} = 2 \cdot P(T_{24} \geq |t|)$$
Step 5: Make a decision
At the conventional α = 0.05 significance level, our p-value of 0.023 is less than 0.05. We reject the null hypothesis and conclude there's statistically significant evidence that the mean weight differs from 500g.
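For readers who want to reproduce the five steps in code, here is a minimal sketch using scipy's one-sample t-test. The 25 weights are simulated stand-ins (the chapter's raw measurements aren't shown), so the exact t-statistic and p-value will differ from the numbers above.

```python
# The five steps in code: one-sample, two-sided t-test against a 500 g target.
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(seed=7)
weights = rng.normal(loc=504.0, scale=8.0, size=25)   # hypothetical sample

target = 500.0
result = ttest_1samp(weights, popmean=target)          # two-sided by default

alpha = 0.05
print(f"sample mean = {weights.mean():.1f} g, s = {weights.std(ddof=1):.1f} g")
print(f"t = {result.statistic:.2f} (df = {len(weights) - 1}), p = {result.pvalue:.3f}")
print("reject H0" if result.pvalue <= alpha else "fail to reject H0")
```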
Why Sample Size Matters
The standard error shrinks as sample size grows, but not linearly. It shrinks proportionally to $1/\sqrt{n}$:
- 4× the sample size → 2× smaller standard error
- 100× the sample size → 10× smaller standard error
This has profound implications:
- Larger samples detect smaller effects: With enough data, even tiny differences become "statistically significant."
- Diminishing returns: Each additional observation contributes less than the previous one. Quadrupling your sample only halves your uncertainty.
- Statistical vs practical significance: A huge sample might find a "significant" effect that's too small to matter in practice.
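To make the $1/\sqrt{n}$ scaling concrete, this short sketch prints the standard error of the mean as the sample size quadruples; the population standard deviation of 10 is an illustrative assumption.

```python
# Quadrupling n only halves the standard error of the mean.
import math

sigma = 10.0
for n in [25, 100, 400, 1600]:
    se = sigma / math.sqrt(n)
    print(f"n = {n:5d}  ->  standard error = {se:.2f}")
```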
The Decision Process
We've calculated our test statistic and p-value. Now what? The decision follows this logic:
- Before seeing data: Choose a significance level α (conventionally 0.05, but context-dependent).
- After computing the p-value:
  - If p ≤ α: Reject $H_0$. The data provide sufficient evidence against the null.
  - If p > α: Fail to reject $H_0$. The data don't provide sufficient evidence.
- Report results: Include the test statistic, p-value, sample size, and effect size. Never just say "significant" or "not significant."
An equivalent approach uses critical values instead of p-values. The critical value is the test statistic boundary separating "reject" from "fail to reject."
For a two-sided z-test at α = 0.05, the critical values are ±1.96. You reject if |z| > 1.96.
For a two-sided t-test, critical values depend on degrees of freedom. With df = 24 and α = 0.05, the critical values are approximately ±2.064.
Both approaches give the same answer. The p-value approach is more informative because it tells you exactly how extreme your result was.
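Here is a small sketch showing both approaches agreeing for the two-sided t-test with df = 24; the observed t-statistic of 2.40 is an assumed value for illustration.

```python
# Critical-value and p-value approaches reach the same decision.
from scipy.stats import t

alpha = 0.05
df = 24
t_observed = 2.40   # illustrative value

t_critical = t.ppf(1 - alpha / 2, df)        # ~2.064 for df = 24
p_value = 2 * t.sf(abs(t_observed), df)      # two-sided p-value

print(f"critical value = +/-{t_critical:.3f}, observed t = {t_observed}")
print(f"reject by critical value: {abs(t_observed) > t_critical}")
print(f"p-value = {p_value:.3f}, reject by p-value: {p_value <= alpha}")
```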
Summary
This chapter established the foundations of hypothesis testing:
- The logic: We assume the null hypothesis is true and ask how surprising our data would be under that assumption. If the data are surprising enough (low p-value), we reject the null.
- P-values: The probability of observing data at least as extreme as ours, assuming the null is true. P-values measure evidence against the null, not the probability the null is true, and not the size of any effect.
- Hypotheses: The null ($H_0$) is the skeptical position; the alternative ($H_1$) is what we're trying to establish. We never accept $H_0$; we either reject it or fail to reject it.
- One-sided vs two-sided: Two-sided tests are the default. Use one-sided only when effects in one direction are meaningless.
- Test statistics: Standardize the observed deviation by the expected variability. This allows comparison across different contexts and scales.
- Decisions: Compare the p-value to your significance level α. Report exact p-values, not just "significant" or "not significant."
What's Next
In the next chapter, Confidence Intervals and Test Assumptions, we'll explore:
- The deep connection between confidence intervals and hypothesis tests (they're mathematically equivalent!)
- The assumptions underlying common tests and what happens when they fail
- How to choose between z-tests and t-tests
- Welch's t-test as a robust default for comparing means
After that, you'll dive into specific tests: the z-test for when population variance is known, the t-test for when it's unknown, F-tests for comparing variances, and ANOVA for comparing multiple groups. Later chapters will cover Type I and Type II errors, power analysis, effect sizes, and multiple comparison corrections.
Each chapter builds on what you've learned here. The logic of hypothesis testing (assume nothing, measure surprise, make decisions) remains the same throughout. Only the specific formulas change.