Multiple Comparisons: FWER, FDR, Bonferroni, Holm & Benjamini-Hochberg

Michael Brenndoerfer · January 9, 2026 · 28 min read

Family-wise error rate, false discovery rate, Bonferroni correction, Holm's method, and Benjamini-Hochberg procedure. Learn how to control error rates when conducting multiple hypothesis tests.


Multiple Comparisons and False Discovery Control

In 2009, neuroscientist Craig Bennett presented a poster at a conference that would become one of the most discussed studies in modern science. He had placed a subject in an fMRI scanner, shown them photographs of people in social situations, and asked them to determine what emotion the person was experiencing. The scanner detected statistically significant brain activity during this task. The catch? The subject was a dead Atlantic salmon.

How could a dead fish show "brain activity"? The answer lies in the multiple comparisons problem. An fMRI scan divides the brain into roughly 130,000 tiny regions called voxels. At each voxel, researchers test whether activity differs from baseline. With 130,000 tests at α = 0.05, we expect about 6,500 false positives purely by chance, more than enough to create apparent "activation patterns" in a dead fish. Bennett's study was a brilliant demonstration of why multiple comparison corrections are essential in modern research.

This chapter explores the mathematics of multiple testing: why conducting many tests inflates error rates, how to quantify this inflation, and the correction methods that restore valid inference. Understanding these concepts is crucial for anyone working with data that involves multiple statistical tests: from genomics and neuroimaging to A/B testing and clinical trials.

The Multiple Comparisons Problem

When we conduct a single hypothesis test at significance level α = 0.05, we accept a 5% chance of a Type I error: rejecting a true null hypothesis. This seems like an acceptable risk. But what happens when we conduct many tests?

The Mathematics of Multiplicity

Consider conducting m independent hypothesis tests, each at significance level α. If all null hypotheses are true, each test has probability α of producing a false positive. The probability of making at least one false positive across all tests is:

P(\text{at least one false positive}) = 1 - P(\text{no false positives})

Since the tests are independent:

P(\text{no false positives}) = (1 - \alpha)^m

Therefore:

P(\text{at least one false positive}) = 1 - (1 - \alpha)^m

This is called the family-wise error rate (FWER): the probability of making at least one Type I error among a family of tests.

Let's see how quickly this grows:

Number of tests (m) | FWER at α = 0.05
1   | 0.050
5   | 0.226
10  | 0.401
14  | 0.512
20  | 0.642
50  | 0.923
100 | 0.994

With just 14 tests, you have more than a 50% chance of at least one false positive. By 100 tests, a false positive is virtually certain (99.4%).
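
The table values follow directly from the formula. Here is a quick check, a minimal self-contained sketch rather than part of the original notebook:

import numpy as np

# FWER = 1 - (1 - alpha)^m for selected numbers of tests
alpha = 0.05
for m in [1, 5, 10, 14, 20, 50, 100]:
    fwer = 1 - (1 - alpha) ** m
    print(f"m = {m:3d}  ->  FWER = {fwer:.3f}")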

In[3]:
Code
import numpy as np
import matplotlib.pyplot as plt

n_tests = np.arange(1, 101)
alphas = [0.05, 0.01, 0.001]
colors = ["#d62728", "#ff7f0e", "#2ca02c"]
labels = ["α = 0.05", "α = 0.01", "α = 0.001"]

fig, ax = plt.subplots(figsize=(8, 5))

for alpha, color, label in zip(alphas, colors, labels):
    fwer = 1 - (1 - alpha) ** n_tests
    ax.plot(n_tests, fwer, linewidth=2.5, color=color, label=label)

# Reference lines
ax.axhline(0.5, color="gray", linestyle="--", linewidth=1, alpha=0.7)
ax.axhline(0.95, color="gray", linestyle="--", linewidth=1, alpha=0.7)

# Find where α=0.05 crosses 50%
m_50 = np.log(0.5) / np.log(1 - 0.05)
ax.axvline(m_50, color="gray", linestyle=":", linewidth=1, alpha=0.5)
ax.annotate(
    f"50% FWER at m = {int(m_50)}",
    xy=(m_50, 0.5),
    xytext=(25, 0.35),
    fontsize=9,
    arrowprops=dict(arrowstyle="->", color="gray", alpha=0.7),
)

ax.annotate(
    "95% FWER",
    xy=(80, 0.95),
    xytext=(60, 0.85),
    fontsize=9,
    arrowprops=dict(arrowstyle="->", color="gray", alpha=0.7),
)

ax.set_xlabel("Number of Independent Tests (m)")
ax.set_ylabel("Family-Wise Error Rate (FWER)")
ax.set_title("The Multiple Comparisons Problem")
ax.legend(loc="lower right")
ax.set_xlim(0, 100)
ax.set_ylim(0, 1.05)
plt.show()
Out[3]:
Visualization
Line plot showing FWER increasing with number of tests for different alpha levels.
The family-wise error rate grows rapidly with the number of tests. At the commonly used α = 0.05, conducting 14 tests already gives a 50% chance of at least one false positive. This illustrates why multiple comparison corrections are essential in high-dimensional research.

Simulation: Seeing False Positives Emerge

Let's simulate the multiple comparisons problem directly. We'll generate data where all null hypotheses are true (no real effects) and watch how false positives accumulate.

In[4]:
Code
from scipy import stats

np.random.seed(42)
n_tests_sim = 100
n_samples = 30
p_values = []

# Conduct 100 t-tests with no real effect
for _ in range(n_tests_sim):
    group1 = np.random.normal(0, 1, n_samples)
    group2 = np.random.normal(0, 1, n_samples)  # Same distribution!
    _, p = stats.ttest_ind(group1, group2)
    p_values.append(p)

p_values = np.array(p_values)
significant = p_values < 0.05

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Left: P-value distribution
ax1 = axes[0]
ax1.hist(p_values, bins=20, edgecolor="white", alpha=0.7, color="#1f77b4")
ax1.axvline(
    0.05, color="#d62728", linewidth=2, linestyle="--", label="α = 0.05"
)
ax1.set_xlabel("P-value")
ax1.set_ylabel("Frequency")
ax1.set_title("Distribution of P-values (No Real Effects)")
ax1.legend()
ax1.text(
    0.5,
    0.9,
    "P-values under null hypothesis\nshould be uniformly distributed",
    transform=ax1.transAxes,
    fontsize=9,
    verticalalignment="top",
    bbox=dict(boxstyle="round", facecolor="wheat", alpha=0.5),
)

# Right: Test results visualization
ax2 = axes[1]
test_indices = np.arange(n_tests_sim)
colors_scatter = ["#d62728" if sig else "#1f77b4" for sig in significant]
ax2.scatter(test_indices, p_values, c=colors_scatter, s=50, alpha=0.7)
ax2.axhline(0.05, color="#d62728", linewidth=2, linestyle="--")
ax2.set_xlabel("Test Number")
ax2.set_ylabel("P-value")
ax2.set_title(
    f"100 Tests with No Real Effects: {sum(significant)} 'Significant' Results"
)
ax2.text(
    0.95,
    0.95,
    f"False positives: {sum(significant)}\nExpected: {int(0.05 * n_tests_sim)}",
    transform=ax2.transAxes,
    fontsize=10,
    verticalalignment="top",
    horizontalalignment="right",
    bbox=dict(boxstyle="round", facecolor="lightyellow", alpha=0.8),
)

plt.tight_layout()
plt.show()
Out[4]:
Visualization
Visualization of p-values from 100 tests with significant results highlighted.
Simulation of 100 tests where all null hypotheses are true. Each test compares two groups of 30 samples drawn from identical N(0,1) distributions. Despite no real effects, 7 tests show 'significant' results (p < 0.05), close to the 5 false positives we would expect from chance alone.

The simulation confirms our mathematical analysis: with 100 tests and no real effects, we obtained 7 "significant" results, false positives that could mislead researchers into believing they've found real effects.

Error Rate Concepts: FWER vs FDR

Before diving into correction methods, we need to clearly distinguish two different ways to conceptualize error rates in multiple testing. Each has its place depending on your research goals.

Family-Wise Error Rate (FWER)

The family-wise error rate is the probability of making at least one Type I error among all tests:

\text{FWER} = P(V \geq 1)

where V is the number of false positives (true null hypotheses incorrectly rejected).

When to control FWER:

  • Confirmatory research where conclusions must be definitive
  • Clinical trials where each significant result leads to treatment decisions
  • Any situation where a single false positive has serious consequences
  • Small numbers of pre-specified comparisons

Controlling FWER at level α guarantees that the probability of any false positive is at most α. This is the most conservative form of error control.
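
To make the definition concrete, here is a minimal simulation sketch (self-contained; the family size, seed, and repetition count are arbitrary choices) that estimates P(V ≥ 1) when all null hypotheses are true and every test uses the uncorrected threshold α, recovering the 1 - (1 - α)^m formula from earlier:

import numpy as np

rng = np.random.default_rng(0)
m, alpha, n_rep = 20, 0.05, 10_000

# Under the global null each p-value is Uniform(0, 1); V counts false positives per family
p = rng.uniform(size=(n_rep, m))
V = (p < alpha).sum(axis=1)

print(f"Estimated FWER = P(V >= 1): {(V >= 1).mean():.3f}")
print(f"Formula 1 - (1 - alpha)^m:  {1 - (1 - alpha) ** m:.3f}")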

False Discovery Rate (FDR)

The false discovery rate is the expected proportion of false positives among rejected hypotheses:

\text{FDR} = E\left[\frac{V}{R}\right] = E\left[\frac{V}{\max(R, 1)}\right]

where V is the number of false positives and R is the total number of rejections; the max(R, 1) in the denominator defines the ratio as 0 when nothing is rejected.

When to control FDR:

  • Exploratory research where you're screening many hypotheses
  • Genomics, proteomics, and other -omics studies
  • Neuroimaging with thousands of voxels
  • A/B testing with many metrics
  • Situations where some false positives are acceptable if most discoveries are real

Controlling FDR at level q means that among your "significant" results, the expected proportion of false discoveries is at most q. This allows more discoveries while still controlling the error rate.
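
As a rough illustration of what FDR measures, the sketch below (a toy simulation; the 90/10 split, effect size, and seed are arbitrary assumptions) applies an uncorrected p < 0.05 threshold to a mix of true nulls and real effects and averages V/max(R, 1) over many repetitions. With these numbers, roughly a third of the uncorrected "discoveries" are expected to be false:

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_rep, n_null, n_alt, alpha = 2_000, 90, 10, 0.05

fdp = []
for _ in range(n_rep):
    # 90 null tests (uniform p-values) and 10 real effects (two-sided z-tests, mean shift 3)
    p_null = rng.uniform(size=n_null)
    z_alt = rng.normal(loc=3.0, size=n_alt)
    p_alt = 2 * (1 - stats.norm.cdf(np.abs(z_alt)))
    p = np.concatenate([p_null, p_alt])

    reject = p < alpha                 # naive, uncorrected threshold
    V = reject[:n_null].sum()          # false positives
    R = reject.sum()                   # total rejections
    fdp.append(V / max(R, 1))          # false discovery proportion for this family

print(f"Estimated FDR of uncorrected testing: {np.mean(fdp):.2f}")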

Visualizing the Difference

In[5]:
Code
np.random.seed(42)
n_null = 950  # True null hypotheses
n_alt = 50  # True alternatives (real effects)
effect_size = 3.0

# Generate p-values
p_null = np.random.uniform(0, 1, n_null)  # Uniform under null
p_alt = 2 * (
    1 - stats.norm.cdf(effect_size / 2 + np.random.normal(0, 0.5, n_alt))
)  # Small p-values for alternatives
p_values_all = np.concatenate([p_null, p_alt])
is_null = np.array([True] * n_null + [False] * n_alt)

# Sort for plotting
sorted_indices = np.argsort(p_values_all)
p_sorted = p_values_all[sorted_indices]
null_sorted = is_null[sorted_indices]

m = len(p_values_all)

# Bonferroni threshold
bonf_threshold = 0.05 / m

# BH threshold (find largest k where p(k) <= k/m * 0.05)
bh_thresholds = 0.05 * np.arange(1, m + 1) / m
bh_significant = p_sorted <= bh_thresholds
if np.any(bh_significant):
    bh_cutoff_idx = np.max(np.where(bh_significant)[0])
    bh_cutoff = p_sorted[bh_cutoff_idx]
else:
    bh_cutoff_idx = -1
    bh_cutoff = 0

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Left: Bonferroni (FWER)
ax1 = axes[0]
colors1 = ["#2ca02c" if not null else "#1f77b4" for null in null_sorted[:100]]
ax1.scatter(range(100), p_sorted[:100], c=colors1, s=30, alpha=0.7)
ax1.axhline(
    bonf_threshold,
    color="#d62728",
    linewidth=2,
    linestyle="--",
    label=f"Bonferroni: α/m = {bonf_threshold:.2e}",
)
ax1.set_xlabel("Sorted Test Rank")
ax1.set_ylabel("P-value")
ax1.set_title("FWER Control (Bonferroni)")
ax1.legend(loc="upper left", fontsize=8)
ax1.set_yscale("log")
ax1.set_ylim(1e-6, 1)

# Count rejections
bonf_reject = p_sorted < bonf_threshold
n_bonf_reject = np.sum(bonf_reject)
n_bonf_fp = np.sum(bonf_reject & null_sorted)
n_bonf_tp = np.sum(bonf_reject & ~null_sorted)
ax1.text(
    0.95,
    0.5,
    f"Rejected: {n_bonf_reject}\nTrue positives: {n_bonf_tp}\nFalse positives: {n_bonf_fp}",
    transform=ax1.transAxes,
    fontsize=9,
    verticalalignment="center",
    horizontalalignment="right",
    bbox=dict(boxstyle="round", facecolor="lightyellow", alpha=0.8),
)

# Right: Benjamini-Hochberg (FDR)
ax2 = axes[1]
colors2 = ["#2ca02c" if not null else "#1f77b4" for null in null_sorted[:100]]
ax2.scatter(range(100), p_sorted[:100], c=colors2, s=30, alpha=0.7)
ax2.plot(
    range(100),
    bh_thresholds[:100],
    color="#d62728",
    linewidth=2,
    linestyle="--",
    label="BH: (k/m)×α threshold",
)
ax2.set_xlabel("Sorted Test Rank (k)")
ax2.set_ylabel("P-value")
ax2.set_title("FDR Control (Benjamini-Hochberg)")
ax2.legend(loc="upper left", fontsize=8)
ax2.set_yscale("log")
ax2.set_ylim(1e-6, 1)

# Count rejections for BH
bh_reject = np.zeros(m, dtype=bool)
if bh_cutoff_idx >= 0:
    bh_reject[: bh_cutoff_idx + 1] = True
n_bh_reject = np.sum(bh_reject)
n_bh_fp = np.sum(bh_reject & null_sorted)
n_bh_tp = np.sum(bh_reject & ~null_sorted)
ax2.text(
    0.95,
    0.5,
    f"Rejected: {n_bh_reject}\nTrue positives: {n_bh_tp}\nFalse positives: {n_bh_fp}",
    transform=ax2.transAxes,
    fontsize=9,
    verticalalignment="center",
    horizontalalignment="right",
    bbox=dict(boxstyle="round", facecolor="lightyellow", alpha=0.8),
)

# Add legend for colors
for ax in axes:
    ax.scatter([], [], c="#2ca02c", s=30, label="True alternative")
    ax.scatter([], [], c="#1f77b4", s=30, label="True null")
    handles, labels = ax.get_legend_handles_labels()
    ax.legend(handles, labels, loc="upper left", fontsize=8)

plt.tight_layout()
plt.show()
Out[5]:
Visualization
Side-by-side comparison of FWER and FDR correction methods showing different rejection thresholds.
Comparison of FWER and FDR control. With 1000 tests (950 true nulls, 50 true alternatives), FWER control (Bonferroni) rejects only the strongest signals to ensure no false positives, while FDR control (Benjamini-Hochberg) accepts some false positives to detect more true effects.

The figure illustrates the key tradeoff:

  • Bonferroni (FWER control): Very conservative threshold rejects only the strongest signals, minimizing false positives but missing many true effects
  • Benjamini-Hochberg (FDR control): Less stringent threshold detects more true effects at the cost of accepting some false positives

FWER-Controlling Methods

Bonferroni Correction

The Bonferroni correction is the simplest and most widely known method for controlling FWER. The idea is mathematically elegant: if you want the overall error rate to be at most α across m tests, make each individual test more stringent.

The method: To test m hypotheses while controlling FWER at level α:

  • Reject hypothesis i if p_i < α/m
  • Equivalently, compute adjusted p-values: p̃_i = min(m · p_i, 1)

Mathematical justification: The Bonferroni inequality states that for any events A_1, ..., A_m:

P\left(\bigcup_{i=1}^{m} A_i\right) \leq \sum_{i=1}^{m} P(A_i)

Let A_i be the event that test i produces a false positive. If we use threshold α/m for each test:

\text{FWER} = P\left(\bigcup_{i=1}^{m} A_i\right) \leq \sum_{i=1}^{m} P(A_i) = \sum_{i=1}^{m} \frac{\alpha}{m} = \alpha

Properties:

  • Always valid (makes no assumptions about test dependence)
  • Very conservative, especially for large m
  • Power decreases as m increases
  • Simple to implement and explain
In[6]:
Code
def bonferroni_correction(p_values, alpha=0.05):
    """
    Apply Bonferroni correction for multiple comparisons.

    Parameters
    ----------
    p_values : array-like
        Raw p-values from multiple tests
    alpha : float
        Desired FWER level

    Returns
    -------
    adjusted_p : ndarray
        Bonferroni-adjusted p-values
    rejected : ndarray
        Boolean array indicating which hypotheses to reject
    """
    p = np.asarray(p_values)
    m = len(p)
    adjusted_p = np.minimum(p * m, 1.0)
    rejected = p < (alpha / m)
    return adjusted_p, rejected

Šidák Correction

The Šidák correction is slightly less conservative than Bonferroni and is exact when tests are independent.

The method: For m independent tests at FWER level α:

  • Reject hypothesis i if p_i < 1 - (1 - α)^(1/m)

Mathematical justification: If tests are independent and we want FWER = α:

\text{FWER} = 1 - (1-\alpha_{\text{per-test}})^m = \alpha

Solving for the per-test significance level:

\alpha_{\text{per-test}} = 1 - (1-\alpha)^{1/m}

Comparison with Bonferroni: For m = 20 tests at α = 0.05:

  • Bonferroni: α/m = 0.0025
  • Šidák: 1 - (1 - 0.05)^(1/20) = 0.00256

The difference is small but Šidák is slightly more powerful. However, Šidák requires independence while Bonferroni is always valid.

In[7]:
Code
def sidak_correction(p_values, alpha=0.05):
    """
    Apply Šidák correction for multiple comparisons.
    Requires test independence for exact FWER control.
    """
    p = np.asarray(p_values)
    m = len(p)
    alpha_adj = 1 - (1 - alpha) ** (1 / m)
    # Adjusted p-values
    adjusted_p = 1 - (1 - p) ** m
    rejected = p < alpha_adj
    return adjusted_p, rejected

Holm's Step-Down Procedure

Holm's method (1979) is uniformly more powerful than Bonferroni while still controlling FWER. The key insight is that we can be less stringent on tests with larger p-values.

The algorithm:

  1. Order p-values: p_(1) ≤ p_(2) ≤ ... ≤ p_(m)
  2. For j = 1, 2, ..., m:
    • If p_(j) > α/(m - j + 1), stop and reject only the hypotheses with smaller p-values (ranks 1 through j - 1)
    • Otherwise, continue to the next p-value
  3. If all p-values pass their thresholds, reject all hypotheses

Mathematical justification: The adjusted thresholds form a step-down sequence: α/m, α/(m-1), α/(m-2), ..., α/2, α

The first comparison uses the Bonferroni threshold, but subsequent comparisons are less stringent. This maintains FWER control because once we fail to reject at step j, we know that hypothesis has p > α/(m - j + 1), and remaining hypotheses have even larger p-values.

In[8]:
Code
def holm_correction(p_values, alpha=0.05):
    """
    Apply Holm's step-down procedure for FWER control.

    More powerful than Bonferroni while maintaining FWER control.
    """
    p = np.asarray(p_values)
    m = len(p)

    # Get sorted indices
    sorted_idx = np.argsort(p)
    p_sorted = p[sorted_idx]

    # Calculate Holm thresholds
    thresholds = alpha / (m - np.arange(m))

    # Find first non-significant result
    rejected_sorted = np.zeros(m, dtype=bool)
    for j in range(m):
        if p_sorted[j] > thresholds[j]:
            break
        rejected_sorted[j] = True

    # Map back to original order
    rejected = np.zeros(m, dtype=bool)
    rejected[sorted_idx] = rejected_sorted

    # Calculate adjusted p-values
    adjusted_sorted = np.zeros(m)
    for j in range(m):
        adjusted_sorted[j] = min(
            max(p_sorted[: j + 1] * (m - np.arange(j + 1))), 1.0
        )
    adjusted_p = np.zeros(m)
    adjusted_p[sorted_idx] = adjusted_sorted

    return adjusted_p, rejected

Hochberg's Step-Up Procedure

Hochberg's method (1988) is similar to Holm but works in the opposite direction (step-up instead of step-down). It's more powerful than Holm but requires the assumption that tests are independent or positively correlated.

The algorithm:

  1. Order p-values: p_(1) ≤ p_(2) ≤ ... ≤ p_(m)
  2. For j = m, m-1, ..., 1 (starting from the largest):
    • If p_(j) ≤ α/(m - j + 1), reject all hypotheses with p-values ≤ p_(j) and stop
    • Otherwise, continue to the next smaller p-value

When to use:

  • When tests are independent or positively correlated
  • When you want more power than Holm
  • Not valid for negatively correlated tests
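
Unlike the other methods in this section, Hochberg's procedure has no implementation above, so here is a minimal sketch in the same style as the earlier helper functions (the function name and interface are my own choices, not a library API):

import numpy as np


def hochberg_correction(p_values, alpha=0.05):
    """
    Apply Hochberg's step-up procedure for FWER control.

    Valid for independent or positively correlated tests.
    """
    p = np.asarray(p_values)
    m = len(p)

    # Sort p-values in ascending order
    sorted_idx = np.argsort(p)
    p_sorted = p[sorted_idx]

    # Same thresholds as Holm: alpha/m, alpha/(m-1), ..., alpha
    thresholds = alpha / (m - np.arange(m))

    # Step up: scan from the largest p-value downward; at the first rank j
    # with p_(j) <= alpha/(m - j + 1), reject that hypothesis and all smaller ones
    rejected_sorted = np.zeros(m, dtype=bool)
    for j in range(m - 1, -1, -1):
        if p_sorted[j] <= thresholds[j]:
            rejected_sorted[: j + 1] = True
            break

    # Map decisions back to the original order
    rejected = np.zeros(m, dtype=bool)
    rejected[sorted_idx] = rejected_sorted
    return rejected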

Comparing FWER Methods

In[9]:
Code
# Example p-values
p_example = np.array(
    [0.001, 0.008, 0.012, 0.025, 0.032, 0.058, 0.089, 0.15, 0.42, 0.76]
)
m = len(p_example)
alpha = 0.05

fig, ax = plt.subplots(figsize=(10, 6))

# Plot p-values
x = np.arange(m)
ax.scatter(
    x, np.sort(p_example), s=100, c="#1f77b4", zorder=5, label="Sorted p-values"
)

# Bonferroni threshold (constant)
bonf_thresh = alpha / m
ax.axhline(
    bonf_thresh,
    color="#d62728",
    linestyle="-",
    linewidth=2,
    label=f"Bonferroni (α/m = {bonf_thresh:.4f})",
)

# Holm thresholds (step-down)
holm_thresh = alpha / (m - np.arange(m))
ax.step(
    x,
    holm_thresh,
    where="mid",
    color="#2ca02c",
    linestyle="--",
    linewidth=2,
    label="Holm (α/(m-j+1))",
)

# Hochberg thresholds (same as Holm but applied step-up)
ax.step(
    x,
    holm_thresh,
    where="mid",
    color="#9467bd",
    linestyle=":",
    linewidth=2,
    label="Hochberg (step-up)",
)

# Mark rejections
p_sorted = np.sort(p_example)
bonf_reject = p_sorted < bonf_thresh
holm_reject = p_sorted < holm_thresh
# For Holm, need to find first failure
holm_reject_corrected = np.zeros(m, dtype=bool)
for j in range(m):
    if p_sorted[j] > holm_thresh[j]:
        break
    holm_reject_corrected[j] = True

ax.scatter(
    x[bonf_reject],
    p_sorted[bonf_reject],
    s=200,
    facecolors="none",
    edgecolors="#d62728",
    linewidths=2,
    marker="o",
    zorder=6,
)
ax.scatter(
    x[holm_reject_corrected],
    p_sorted[holm_reject_corrected],
    s=250,
    facecolors="none",
    edgecolors="#2ca02c",
    linewidths=2,
    marker="s",
    zorder=6,
)

ax.set_xlabel("Sorted Rank (j)")
ax.set_ylabel("P-value / Threshold")
ax.set_title("Comparison of FWER-Controlling Methods")
ax.legend(loc="upper left")
ax.set_yscale("log")
ax.set_ylim(0.0005, 1)
ax.set_xlim(-0.5, m - 0.5)

# Annotations
ax.text(
    0.95,
    0.15,
    f"Bonferroni rejects: {sum(bonf_reject)}\nHolm rejects: {sum(holm_reject_corrected)}",
    transform=ax.transAxes,
    fontsize=10,
    verticalalignment="bottom",
    horizontalalignment="right",
    bbox=dict(boxstyle="round", facecolor="lightyellow", alpha=0.8),
)

plt.show()
Out[9]:
Visualization
Visualization comparing different FWER correction methods showing their thresholds and rejections.
Comparison of FWER-controlling methods on the same set of 10 p-values. Holm and Hochberg reject more hypotheses than Bonferroni while maintaining FWER control. The step-down (Holm) and step-up (Hochberg) procedures use increasingly lenient thresholds, allowing more rejections.

FDR-Controlling Methods

Benjamini-Hochberg Procedure

The Benjamini-Hochberg (BH) procedure (1995) is the most widely used method for controlling the false discovery rate. It's less conservative than FWER methods, allowing more discoveries while controlling the expected proportion of false positives among rejected hypotheses.

The algorithm:

  1. Order p-values: p_(1) ≤ p_(2) ≤ ... ≤ p_(m)
  2. Find the largest k such that p_(k) ≤ (k/m) · q (where q is the desired FDR level)
  3. Reject all hypotheses with p-values ≤ p_(k)

Mathematical justification: The BH procedure controls FDR at level q under the assumption that the test statistics are independent (or positively dependent under some conditions). The key insight is that the threshold (k/m) · q increases with rank, allowing more rejections among the smaller p-values.

Adjusted p-values: The BH-adjusted p-value for the hypothesis with rank k is:

\tilde{p}_{(k)} = \min_{j \geq k} \left\{ \min\left(\frac{m \cdot p_{(j)}}{j}, 1\right) \right\}

This ensures monotonicity: if p_(i) < p_(j), then the adjusted values satisfy p̃_(i) ≤ p̃_(j).

In[10]:
Code
def benjamini_hochberg(p_values, q=0.05):
    """
    Apply Benjamini-Hochberg procedure for FDR control.

    Parameters
    ----------
    p_values : array-like
        Raw p-values from multiple tests
    q : float
        Desired FDR level

    Returns
    -------
    adjusted_p : ndarray
        BH-adjusted p-values
    rejected : ndarray
        Boolean array indicating which hypotheses to reject
    """
    p = np.asarray(p_values)
    m = len(p)

    # Sort p-values
    sorted_idx = np.argsort(p)
    p_sorted = p[sorted_idx]

    # BH thresholds
    thresholds = q * np.arange(1, m + 1) / m

    # Find largest k where p(k) <= threshold(k)
    significant = p_sorted <= thresholds
    if np.any(significant):
        k_max = np.max(np.where(significant)[0])
        rejected_sorted = np.arange(m) <= k_max
    else:
        rejected_sorted = np.zeros(m, dtype=bool)

    # Map back to original order
    rejected = np.zeros(m, dtype=bool)
    rejected[sorted_idx] = rejected_sorted

    # Calculate adjusted p-values (ensuring monotonicity)
    adjusted_sorted = np.zeros(m)
    adjusted_sorted[m - 1] = p_sorted[m - 1]
    for i in range(m - 2, -1, -1):
        adjusted_sorted[i] = min(
            adjusted_sorted[i + 1], m * p_sorted[i] / (i + 1)
        )
    adjusted_sorted = np.minimum(adjusted_sorted, 1.0)

    adjusted_p = np.zeros(m)
    adjusted_p[sorted_idx] = adjusted_sorted

    return adjusted_p, rejected

Benjamini-Yekutieli Procedure

The Benjamini-Yekutieli (BY) procedure (2001) extends BH to handle arbitrary dependence structures among tests. It's more conservative than BH but valid under any dependence.

The method: Replace the BH threshold (k/m) · q with (k / (m · c(m))) · q, where:

c(m) = \sum_{i=1}^{m} \frac{1}{i} \approx \ln(m) + 0.577

This correction factor accounts for potential negative correlations among tests.

In[11]:
Code
def benjamini_yekutieli(p_values, q=0.05):
    """
    Apply Benjamini-Yekutieli procedure for FDR control under arbitrary dependence.
    """
    p = np.asarray(p_values)
    m = len(p)

    # Correction factor for arbitrary dependence
    c_m = np.sum(1 / np.arange(1, m + 1))

    # Sort p-values
    sorted_idx = np.argsort(p)
    p_sorted = p[sorted_idx]

    # BY thresholds
    thresholds = q * np.arange(1, m + 1) / (m * c_m)

    # Find largest k where p(k) <= threshold(k)
    significant = p_sorted <= thresholds
    if np.any(significant):
        k_max = np.max(np.where(significant)[0])
        rejected_sorted = np.arange(m) <= k_max
    else:
        rejected_sorted = np.zeros(m, dtype=bool)

    rejected = np.zeros(m, dtype=bool)
    rejected[sorted_idx] = rejected_sorted

    return rejected

Visualizing the BH Procedure

In[12]:
Code
np.random.seed(42)
# Generate example p-values (mix of null and alternative)
m = 25
p_null = np.random.uniform(0, 1, 20)
p_alt = np.array([0.001, 0.003, 0.008, 0.015, 0.022])  # True effects
p_example = np.concatenate([p_null, p_alt])
is_null_ex = np.array([True] * 20 + [False] * 5)

# Sort
sorted_idx = np.argsort(p_example)
p_sorted = p_example[sorted_idx]
null_sorted = is_null_ex[sorted_idx]

# BH procedure
q = 0.05
bh_threshold = q * np.arange(1, m + 1) / m

# Find cutoff
significant = p_sorted <= bh_threshold
k_max = np.max(np.where(significant)[0]) if np.any(significant) else -1

fig, ax = plt.subplots(figsize=(10, 6))

# Plot p-values colored by true status
colors = ["#2ca02c" if not null else "#1f77b4" for null in null_sorted]
ax.scatter(np.arange(1, m + 1), p_sorted, c=colors, s=100, zorder=5)

# Plot BH threshold line
ax.plot(
    np.arange(1, m + 1),
    bh_threshold,
    color="#d62728",
    linewidth=2,
    linestyle="--",
    label="BH threshold: (k/m)×q",
)

# Mark the cutoff
if k_max >= 0:
    ax.axvline(k_max + 1.5, color="gray", linestyle=":", linewidth=2, alpha=0.7)
    ax.fill_between(
        [0, k_max + 1.5], 0, max(p_sorted) * 1.1, alpha=0.1, color="green"
    )
    ax.text(
        k_max / 2 + 1,
        max(p_sorted) * 0.9,
        "Rejected",
        fontsize=12,
        ha="center",
        color="#2ca02c",
        fontweight="bold",
    )
    ax.text(
        k_max + 4,
        max(p_sorted) * 0.9,
        "Not rejected",
        fontsize=12,
        ha="center",
        color="gray",
        fontweight="bold",
    )

# Legend
ax.scatter([], [], c="#2ca02c", s=80, label="True alternative")
ax.scatter([], [], c="#1f77b4", s=80, label="True null")
ax.legend(loc="center right")

ax.set_xlabel("Rank (k)")
ax.set_ylabel("P-value")
ax.set_title(f"Benjamini-Hochberg Procedure (q = {q})")
ax.set_xlim(0, m + 1)
ax.set_ylim(-0.02, max(p_sorted) * 1.1)

# Summary statistics
n_rejected = k_max + 1 if k_max >= 0 else 0
n_tp = np.sum(~null_sorted[: k_max + 1]) if k_max >= 0 else 0
n_fp = np.sum(null_sorted[: k_max + 1]) if k_max >= 0 else 0
ax.text(
    0.98,
    0.3,
    f"Rejected: {n_rejected}\nTrue positives: {n_tp}\nFalse positives: {n_fp}\nFDR: {n_fp / max(n_rejected, 1):.1%}",
    transform=ax.transAxes,
    fontsize=10,
    verticalalignment="center",
    horizontalalignment="right",
    bbox=dict(boxstyle="round", facecolor="lightyellow", alpha=0.8),
)

plt.show()
Out[12]:
Visualization
Visualization of the BH procedure showing sorted p-values against the BH threshold line.
The Benjamini-Hochberg procedure finds the largest k where p(k) ≤ (k/m)×q, then rejects all hypotheses with smaller p-values. The diagonal line shows the BH threshold, which increases with rank. P-values below this line (to the left of the vertical dashed line) are rejected.

Post-Hoc Tests for ANOVA

When ANOVA reveals significant differences among groups, post-hoc tests determine which specific pairs differ. These are inherently multiple comparison problems, since comparing k groups involves k(k-1)/2 pairwise comparisons; with k = 6 groups, for example, that is already 15 tests.

Tukey's Honest Significant Difference (HSD)

Tukey's HSD is the most common post-hoc test for ANOVA. It controls the family-wise error rate for all pairwise comparisons when group sizes are equal.

The method: Two groups i and j are significantly different if:

|\bar{x}_i - \bar{x}_j| > q_{\alpha, k, df_W} \cdot \sqrt{\frac{MS_W}{n}}

where:

  • q_{α, k, df_W} is the critical value from the Studentized range distribution
  • k is the number of groups
  • df_W = N - k is the within-groups degrees of freedom
  • MS_W is the within-groups mean square (from ANOVA)
  • n is the sample size per group
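
To see where the threshold comes from numerically, the sketch below computes the HSD critical difference directly from the Studentized range distribution (scipy.stats.studentized_range, available in recent SciPy versions). The three groups are placeholder data, not the teaching-methods example used below:

import numpy as np
from scipy.stats import studentized_range

# Hypothetical balanced design: k groups of n observations each
rng = np.random.default_rng(0)
groups = [rng.normal(loc, 10, 25) for loc in (75, 82, 78)]
k, n = len(groups), len(groups[0])
df_w = k * n - k  # within-groups degrees of freedom, N - k

# Within-groups mean square from the ANOVA decomposition
ms_w = sum(((g - g.mean()) ** 2).sum() for g in groups) / df_w

# Critical value of the Studentized range at alpha = 0.05 and the HSD threshold
q_crit = studentized_range.ppf(0.95, k, df_w)
hsd = q_crit * np.sqrt(ms_w / n)
print(f"Group means differing by more than {hsd:.2f} are declared significant")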

When to use:

  • All pairwise comparisons are of interest
  • Equal (or approximately equal) sample sizes
  • Homogeneity of variance assumption met
In[13]:
Code
from scipy.stats import tukey_hsd

# Example: Three teaching methods
np.random.seed(42)
method_a = np.random.normal(75, 10, 25)  # Traditional
method_b = np.random.normal(82, 10, 25)  # Interactive
method_c = np.random.normal(78, 10, 25)  # Mixed

# Perform Tukey's HSD
result = tukey_hsd(method_a, method_b, method_c)
print(result)
Out[13]:
Console
Pairwise Group Comparisons (95.0% Confidence Interval)
Comparison  Statistic  p-value  Lower CI  Upper CI
 (0 - 1)     -5.761     0.091   -12.237     0.715
 (0 - 2)     -5.696     0.096   -12.172     0.780
 (1 - 0)      5.761     0.091    -0.715    12.237
 (1 - 2)      0.065     1.000    -6.411     6.541
 (2 - 0)      5.696     0.096    -0.780    12.172
 (2 - 1)     -0.065     1.000    -6.541     6.411

Dunnett's Test

Dunnett's test compares each treatment group to a single control group. It's more powerful than Tukey's HSD when you only care about comparisons with the control.

When to use:

  • One group is a control/reference
  • Only comparisons with control are of interest
  • More powerful than Tukey for this specific situation
In[14]:
Code
from scipy.stats import dunnett

# Example: Control vs two treatments
np.random.seed(42)
control = np.random.normal(50, 8, 20)
treatment1 = np.random.normal(55, 8, 20)
treatment2 = np.random.normal(58, 8, 20)

# Dunnett's test (comparing each treatment to control)
result = dunnett(treatment1, treatment2, control=control)
print(f"Treatment 1 vs Control: p = {result.pvalue[0]:.4f}")
print(f"Treatment 2 vs Control: p = {result.pvalue[1]:.4f}")
Out[14]:
Console
Treatment 1 vs Control: p = 0.1299
Treatment 2 vs Control: p = 0.0004

Scheffé's Method

Scheffé's method is the most flexible but also most conservative post-hoc test. It controls FWER for all possible contrasts (not just pairwise comparisons), including complex comparisons like "Group A vs. the average of Groups B and C."

When to use:

  • Unplanned (post-hoc) comparisons
  • Complex contrasts beyond simple pairwise comparisons
  • Unequal sample sizes
  • Most conservative option
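
There is no dedicated scipy function for Scheffé's method, but the critical value is easy to compute from the F distribution: a contrast is significant when its |t| statistic exceeds sqrt((k - 1) · F_{α, k-1, N-k}). A minimal sketch for one custom contrast, using hypothetical data and illustrative contrast weights:

import numpy as np
from scipy import stats

# Hypothetical contrast: group A vs. the average of groups B and C
rng = np.random.default_rng(0)
groups = [rng.normal(loc, 8, 20) for loc in (50, 55, 58)]
weights = np.array([1.0, -0.5, -0.5])

k = len(groups)
n = np.array([len(g) for g in groups])
df_w = n.sum() - k
ms_w = sum(((g - g.mean()) ** 2).sum() for g in groups) / df_w

# Contrast estimate and its standard error
estimate = sum(w * g.mean() for w, g in zip(weights, groups))
se = np.sqrt(ms_w * np.sum(weights**2 / n))

# Scheffé critical value controls FWER over all possible contrasts
f_crit = stats.f.ppf(0.95, k - 1, df_w)
scheffe_crit = np.sqrt((k - 1) * f_crit)

print(f"|t| = {abs(estimate / se):.2f}, Scheffé critical value = {scheffe_crit:.2f}, "
      f"significant: {abs(estimate / se) > scheffe_crit}")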

Summary of Post-Hoc Methods

Method     | Comparisons      | Sample Sizes | Power    | Use Case
Tukey HSD  | All pairwise     | Equal        | Moderate | Standard choice for all pairwise
Bonferroni | Any              | Any          | Low      | Simple, always valid
Holm       | Any              | Any          | Moderate | Better than Bonferroni
Dunnett    | vs. Control only | Any          | High     | Treatment vs. control
Scheffé    | Any contrast     | Any          | Low      | Complex, unplanned contrasts

Practical Implementation with scipy.stats

Let's work through a complete example using scipy's built-in multiple comparison functions.

In[15]:
Code
from scipy.stats import false_discovery_control

# Simulate a genomics study: testing 1000 genes
np.random.seed(42)
n_genes = 1000
n_real_effects = 50

# Generate p-values
# Most genes: null hypothesis true (uniform p-values)
p_null = np.random.uniform(0, 1, n_genes - n_real_effects)
# Some genes: real effects (small p-values)
p_alt = np.random.beta(0.5, 5, n_real_effects)  # Skewed toward 0

p_values = np.concatenate([p_null, p_alt])
is_null = np.array(
    [True] * (n_genes - n_real_effects) + [False] * n_real_effects
)

# Shuffle to mix them up
shuffle_idx = np.random.permutation(n_genes)
p_values = p_values[shuffle_idx]
is_null = is_null[shuffle_idx]

print(f"Testing {n_genes} genes ({n_real_effects} with true effects)")
print(f"Without correction: {np.sum(p_values < 0.05)} significant (α = 0.05)")
print()

# Apply different corrections
# Bonferroni
bonf_sig = p_values < (0.05 / n_genes)
print(f"Bonferroni (FWER): {np.sum(bonf_sig)} significant")
print(f"  True positives: {np.sum(bonf_sig & ~is_null)}")
print(f"  False positives: {np.sum(bonf_sig & is_null)}")
print()

# Benjamini-Hochberg using scipy
bh_adjusted = false_discovery_control(p_values, method="bh")
bh_sig = bh_adjusted < 0.05
print(f"Benjamini-Hochberg (FDR): {np.sum(bh_sig)} significant")
print(f"  True positives: {np.sum(bh_sig & ~is_null)}")
print(f"  False positives: {np.sum(bh_sig & is_null)}")
print(
    f"  Empirical FDR: {np.sum(bh_sig & is_null) / max(np.sum(bh_sig), 1):.2%}"
)
print()

# Benjamini-Yekutieli using scipy
by_adjusted = false_discovery_control(p_values, method="by")
by_sig = by_adjusted < 0.05
print(f"Benjamini-Yekutieli (FDR, conservative): {np.sum(by_sig)} significant")
print(f"  True positives: {np.sum(by_sig & ~is_null)}")
print(f"  False positives: {np.sum(by_sig & is_null)}")
Out[15]:
Console
Testing 1000 genes (50 with true effects)
Without correction: 72 significant (α = 0.05)

Bonferroni (FWER): 0 significant
  True positives: 0
  False positives: 0

Benjamini-Hochberg (FDR): 0 significant
  True positives: 0
  False positives: 0
  Empirical FDR: 0.00%

Benjamini-Yekutieli (FDR, conservative): 0 significant
  True positives: 0
  False positives: 0

Comparing Methods on Real-World Scale

In[16]:
Code
methods = ["Uncorrected", "Bonferroni", "Holm", "BH (FDR)", "BY (FDR)"]

# Apply all methods
uncorr_sig = p_values < 0.05
bonf_sig = p_values < (0.05 / n_genes)
_, holm_sig = holm_correction(p_values, 0.05)
_, bh_sig = benjamini_hochberg(p_values, 0.05)
by_sig = benjamini_yekutieli(p_values, 0.05)

all_sigs = [uncorr_sig, bonf_sig, holm_sig, bh_sig, by_sig]

tp = [np.sum(sig & ~is_null) for sig in all_sigs]
fp = [np.sum(sig & is_null) for sig in all_sigs]

fig, ax = plt.subplots(figsize=(10, 6))

x = np.arange(len(methods))
width = 0.35

bars1 = ax.bar(
    x - width / 2, tp, width, label="True Positives", color="#2ca02c", alpha=0.8
)
bars2 = ax.bar(
    x + width / 2,
    fp,
    width,
    label="False Positives",
    color="#d62728",
    alpha=0.8,
)

ax.axhline(
    n_real_effects,
    color="gray",
    linestyle="--",
    linewidth=1,
    alpha=0.7,
    label=f"True effects ({n_real_effects})",
)

ax.set_xlabel("Correction Method")
ax.set_ylabel("Number of Rejections")
ax.set_title(
    f"Multiple Testing Corrections: {n_genes} Tests, {n_real_effects} True Effects"
)
ax.set_xticks(x)
ax.set_xticklabels(methods)
ax.legend(loc="upper right")

# Add value labels
for bar, val in zip(bars1, tp):
    ax.text(
        bar.get_x() + bar.get_width() / 2,
        bar.get_height() + 1,
        str(val),
        ha="center",
        va="bottom",
        fontsize=9,
        color="#2ca02c",
    )
for bar, val in zip(bars2, fp):
    if val > 0:
        ax.text(
            bar.get_x() + bar.get_width() / 2,
            bar.get_height() + 1,
            str(val),
            ha="center",
            va="bottom",
            fontsize=9,
            color="#d62728",
        )

plt.tight_layout()
plt.show()
Out[16]:
Visualization
Bar chart comparing rejections by different correction methods, split by true/false positives.
Comparison of multiple testing correction methods on simulated genomics data (1000 genes, 50 with true effects). Uncorrected testing rejects many hypotheses, most of them false positives. The corrected methods are far more stringent: with the weak simulated effects used here they reject few if any hypotheses, a reminder that power after correction depends heavily on how strong the true effects are.

Common Pitfalls and Best Practices

Pitfall 1: Ignoring Multiple Comparisons

The most serious pitfall is ignoring the multiple comparisons problem entirely. This leads to inflated false positive rates and unreliable conclusions.

Warning: The Hidden Multiplicity

Multiple comparisons can be hidden in various ways:

  • Testing many subgroups without pre-specification
  • Analyzing data at multiple time points
  • Testing multiple outcomes without correction
  • "Peeking" at data during sequential data collection
  • Testing different model specifications

All of these constitute multiple testing and require appropriate correction.

Pitfall 2: Overcorrecting

Applying overly conservative corrections (like Bonferroni when BH would suffice) reduces power and may cause you to miss real effects. Match your correction method to your research goals.

Pitfall 3: Correcting Within Correlated Families

When tests are strongly positively correlated (e.g., testing the same hypothesis in overlapping subsamples), standard corrections become overly conservative. Consider methods designed for dependent tests, or collapse strongly correlated tests into a smaller set of effectively independent ones.

Best Practices

  1. Pre-specify your analysis plan: Define which tests you'll conduct before seeing the data
  2. Match correction to research phase: Use FWER for confirmatory research, FDR for exploratory
  3. Report the correction method used: Always state which correction you applied
  4. Report both raw and adjusted p-values: Allows readers to apply their own standards
  5. Consider the dependence structure: Use BH for independent/positive dependence, BY for arbitrary dependence
  6. Focus on effect sizes: A barely significant corrected p-value with a tiny effect size isn't very informative
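
In practice you rarely need to hand-roll these corrections. For instance, statsmodels exposes most of them behind a single function, which makes it easy to report raw and adjusted p-values side by side (a minimal sketch; the p-values below are placeholders):

import numpy as np
from statsmodels.stats.multitest import multipletests

raw_p = np.array([0.001, 0.008, 0.012, 0.025, 0.032, 0.058, 0.089, 0.15, 0.42, 0.76])

for method in ("bonferroni", "holm", "fdr_bh"):
    reject, adj_p, _, _ = multipletests(raw_p, alpha=0.05, method=method)
    print(f"{method:>10s}: {reject.sum()} rejected, adjusted p = {np.round(adj_p, 3)}")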

Decision Framework

In[17]:
Code
fig, ax = plt.subplots(figsize=(12, 8))
ax.set_xlim(0, 12)
ax.set_ylim(0, 10)
ax.axis("off")


# Define boxes
def draw_box(x, y, w, h, text, color, fontsize=10, bold=False):
    rect = plt.Rectangle(
        (x, y), w, h, facecolor=color, edgecolor="black", linewidth=2
    )
    ax.add_patch(rect)
    weight = "bold" if bold else "normal"
    ax.text(
        x + w / 2,
        y + h / 2,
        text,
        ha="center",
        va="center",
        fontsize=fontsize,
        wrap=True,
        weight=weight,
    )


def draw_arrow(x1, y1, x2, y2, text=""):
    ax.annotate(
        "",
        xy=(x2, y2),
        xytext=(x1, y1),
        arrowprops=dict(arrowstyle="->", color="black", lw=1.5),
    )
    if text:
        ax.text((x1 + x2) / 2 + 0.2, (y1 + y2) / 2, text, fontsize=9, ha="left")


# Start node
draw_box(4, 8.5, 4, 1, "Multiple Tests?", "#ffcc80", fontsize=11, bold=True)

# First branch - research type
draw_box(
    1.5, 6.5, 3.5, 1, "Confirmatory\n(FWER control)", "#aed581", fontsize=10
)
draw_box(7, 6.5, 3.5, 1, "Exploratory\n(FDR control)", "#81d4fa", fontsize=10)

draw_arrow(5, 8.5, 3.25, 7.5, "Hypothesis-\ndriven")
draw_arrow(7, 8.5, 8.75, 7.5, "Data-\ndriven")

# FWER branches
draw_box(
    0, 4.5, 2.5, 1.2, "Few tests\nor simple\nBonferroni", "#c8e6c9", fontsize=9
)
draw_box(
    3, 4.5, 2.5, 1.2, "Many tests\nor need power\nHolm", "#c8e6c9", fontsize=9
)

draw_arrow(2.5, 6.5, 1.25, 5.7)
draw_arrow(4, 6.5, 4.25, 5.7)

# FDR branches
draw_box(
    6, 4.5, 2.5, 1.2, "Independent\nor positive\nBH", "#b3e5fc", fontsize=9
)
draw_box(9, 4.5, 2.5, 1.2, "Arbitrary\ndependence\nBY", "#b3e5fc", fontsize=9)

draw_arrow(8, 6.5, 7.25, 5.7)
draw_arrow(9.5, 6.5, 10.25, 5.7)

# Post-hoc tests (special case)
draw_box(
    4,
    2,
    4,
    1.5,
    "Post-hoc (after ANOVA):\n• All pairs: Tukey HSD\n• vs Control: Dunnett\n• Complex: Scheffé",
    "#fff59d",
    fontsize=9,
)

ax.set_title(
    "Choosing a Multiple Comparison Correction",
    fontsize=14,
    fontweight="bold",
    y=1.02,
)

plt.tight_layout()
plt.show()
Out[17]:
Visualization
Decision tree diagram for selecting multiple comparison correction methods.
Decision framework for choosing a multiple comparison correction method. The choice depends on whether you're doing confirmatory or exploratory research, the dependence structure among tests, and your tolerance for different types of errors.

Summary

The multiple comparisons problem is one of the most important concepts in modern statistics. When conducting many hypothesis tests, the probability of false positives increases dramatically: from 5% for a single test to 99.4% for 100 tests at α = 0.05.

Key concepts:

  • Family-wise error rate (FWER): Probability of at least one false positive; controlled by Bonferroni, Šidák, Holm, Hochberg
  • False discovery rate (FDR): Expected proportion of false positives among rejections; controlled by Benjamini-Hochberg, Benjamini-Yekutieli
  • FWER methods are more conservative, appropriate for confirmatory research
  • FDR methods allow more discoveries, appropriate for exploratory research

Correction methods:

Method     | Controls | Power         | Assumptions
Bonferroni | FWER     | Low           | None
Šidák      | FWER     | Low-Moderate  | Independence
Holm       | FWER     | Moderate      | None
Hochberg   | FWER     | Moderate-High | Independence/positive dependence
BH         | FDR      | High          | Independence/positive dependence
BY         | FDR      | Moderate      | None

For ANOVA post-hoc tests:

  • Tukey HSD for all pairwise comparisons
  • Dunnett for comparisons with control
  • Scheffé for complex contrasts

The dead salmon study reminds us that even absurd results can appear "significant" when we conduct enough tests without correction. Proper handling of multiple comparisons is essential for valid statistical inference.

What's Next

This chapter has covered the multiple comparisons problem and methods for controlling error rates when conducting many tests. You've learned the distinction between FWER and FDR, implemented correction methods from Bonferroni to Benjamini-Hochberg, and explored post-hoc tests for ANOVA.

In the final chapter of this series, we'll bring together all the concepts from hypothesis testing into a comprehensive practical guide. You'll find reference tables, decision frameworks, reporting guidelines, and a complete workflow for conducting hypothesis tests in practice: everything you need to apply these concepts to real data analysis.

