Pointwise Mutual Information: Measuring Word Associations in NLP

Michael Brenndoerfer

Data, Analytics & AI Machine Learning Language AI Handbook

Learn how Pointwise Mutual Information (PMI) transforms raw co-occurrence counts into meaningful word association scores by comparing observed frequencies to expected frequencies under independence.

Reading Level

Choose your expertise level to adjust how many terms are explained. Beginners see more tooltips, experts see fewer to maintain reading flow. Hover over underlined terms for instant definitions.

Pointwise Mutual InformationLink Copied

Raw co-occurrence counts tell us how often words appear together, but they have a fundamental problem: frequent words dominate everything. The word "the" co-occurs with nearly every other word, not because it has a special relationship with them, but simply because it appears everywhere. How do we separate meaningful associations from mere frequency effects?

Pointwise Mutual Information (PMI) solves this problem by asking a different question: does this word pair co-occur more than we'd expect by chance? Instead of counting raw co-occurrences, PMI measures the surprise of seeing two words together. When "New" and "York" appear together far more often than their individual frequencies would predict, PMI captures that strong association. When "the" and "dog" appear together only as often as expected, PMI correctly identifies this as an unremarkable pairing.

This chapter derives the PMI formula from probability theory, shows how it transforms co-occurrence matrices into more meaningful representations, and demonstrates its practical applications from collocation extraction to improved word vectors.

The Problem with Raw CountsLink Copied

Let's start by understanding why raw co-occurrence counts are problematic. Consider a corpus about animals and food:

In[2]:

Code

from collections import Counter

# A corpus with varying word frequencies
corpus = """
The cat sat on the mat. The dog ran in the park.
The cat chased the mouse. The dog barked at the cat.
A new restaurant opened downtown. The new menu features local food.
The chef prepared the meal. Fresh ingredients make the best food.
The cat sleeps all day. The lazy dog naps too.
The new cafe serves great coffee. The barista makes excellent drinks.
"""

# Tokenize and count word frequencies
words = corpus.lower().replace(".", "").split()
word_counts = Counter(words)
total_words = len(words)

from collections import Counter

# A corpus with varying word frequencies
corpus = """
The cat sat on the mat. The dog ran in the park.
The cat chased the mouse. The dog barked at the cat.
A new restaurant opened downtown. The new menu features local food.
The chef prepared the meal. Fresh ingredients make the best food.
The cat sleeps all day. The lazy dog naps too.
The new cafe serves great coffee. The barista makes excellent drinks.
"""

# Tokenize and count word frequencies
words = corpus.lower().replace(".", "").split()
word_counts = Counter(words)
total_words = len(words)

Out[3]:

Console

Word frequencies in corpus:
----------------------------------------
  'the         ':  16 occurrences (24.2%)
  'cat         ':   4 occurrences (6.1%)
  'dog         ':   3 occurrences (4.5%)
  'new         ':   3 occurrences (4.5%)
  'food        ':   2 occurrences (3.0%)
  'sat         ':   1 occurrences (1.5%)
  'on          ':   1 occurrences (1.5%)
  'mat         ':   1 occurrences (1.5%)
  'ran         ':   1 occurrences (1.5%)
  'in          ':   1 occurrences (1.5%)

The word "the" appears 18 times, dwarfing content words like "cat" (5 times) or "new" (3 times). In a raw co-occurrence matrix, "the" will have high counts with almost everything, obscuring the meaningful associations we care about.

In[4]:

Code

import numpy as np


def build_cooccurrence_matrix(corpus, window_size=2):
    """Build a word-word co-occurrence matrix."""
    words = corpus.lower().replace(".", "").split()
    vocab = sorted(set(words))
    word_to_idx = {w: i for i, w in enumerate(vocab)}
    n = len(vocab)

    matrix = np.zeros((n, n))

    for i, word in enumerate(words):
        word_idx = word_to_idx[word]
        start = max(0, i - window_size)
        end = min(len(words), i + window_size + 1)

        for j in range(start, end):
            if j != i:
                context_idx = word_to_idx[words[j]]
                matrix[word_idx, context_idx] += 1

    return matrix, vocab, word_to_idx


cooc_matrix, vocab, word_to_idx = build_cooccurrence_matrix(
    corpus, window_size=2
)

import numpy as np


def build_cooccurrence_matrix(corpus, window_size=2):
    """Build a word-word co-occurrence matrix."""
    words = corpus.lower().replace(".", "").split()
    vocab = sorted(set(words))
    word_to_idx = {w: i for i, w in enumerate(vocab)}
    n = len(vocab)

    matrix = np.zeros((n, n))

    for i, word in enumerate(words):
        word_idx = word_to_idx[word]
        start = max(0, i - window_size)
        end = min(len(words), i + window_size + 1)

        for j in range(start, end):
            if j != i:
                context_idx = word_to_idx[words[j]]
                matrix[word_idx, context_idx] += 1

    return matrix, vocab, word_to_idx


cooc_matrix, vocab, word_to_idx = build_cooccurrence_matrix(
    corpus, window_size=2
)

Out[5]:

Console

Raw co-occurrence counts:
----------------------------------------
  'the' co-occurs with 62 total words
  'cat' co-occurs with 15 total words
  'new' co-occurs with 12 total words

Top co-occurrences for 'cat':
    the: 5
    a: 1
    all: 1
    at: 1
    chased: 1

The raw counts show "the" as the top co-occurrence for "cat," but this tells us nothing about cats. The word "the" appears near everything. We need a measure that accounts for how often we'd expect words to co-occur given their individual frequencies.

The PMI FormulaLink Copied

The problem we identified, that frequent words dominate co-occurrence counts, stems from a fundamental issue: raw counts conflate two distinct phenomena. When "the" appears near "cat" 10 times, is that because "the" and "cat" have a special relationship, or simply because "the" appears near everything 10 times? To separate genuine associations from frequency effects, we need to ask a different question: how does the observed co-occurrence compare to what we'd expect if the words were unrelated?

Pointwise Mutual Information answers this question directly. Instead of asking "how often do these words appear together?" PMI asks "do these words appear together more or less than chance would predict?"

The Intuition: Observed vs ExpectedLink Copied

Consider two words, $w$ and $c$ . If they have no special relationship, their co-occurrence should follow a simple pattern: the probability of seeing them together should equal the product of their individual probabilities. This is the definition of statistical independence.

For example, if "cat" appears in 5% of contexts and "mouse" appears in 2% of contexts, and they're independent, we'd expect "cat" and "mouse" to co-occur in roughly $0.05 \times 0.02 = 0.1\%$ of word pairs. If they actually co-occur in 1% of pairs, that's 10 times more than expected. This excess reveals a genuine association.

PMI formalizes this comparison by taking the ratio of observed to expected co-occurrence:

\text{PMI}(w, c) = \log_2 \frac{P(w, c)}{P(w) \cdot P(c)}

where:

$w$ : the target word whose associations we want to measure
$c$ : the context word that may or may not be associated with $w$
$P(w, c)$ : the joint probability of observing $w$ and $c$ together in the same context window
$P(w)$ : the marginal probability of word $w$ appearing in any context
$P(c)$ : the marginal probability of word $c$ appearing in any context
$P(w) \cdot P(c)$ : the expected joint probability if $w$ and $c$ were statistically independent
$\log_2$ : the base-2 logarithm, which converts ratios to bits of information

Pointwise Mutual Information

PMI measures the association between two words by comparing their joint probability of co-occurrence to what we'd expect if they were independent. High PMI indicates the words are strongly associated; low or negative PMI indicates they co-occur less than expected by chance.

Let's unpack each component of this formula:

Numerator $P(w, c)$ : The joint probability, measuring how often words $w$ and $c$ actually co-occur in the corpus. This is the observed co-occurrence rate.
Denominator $P(w) \cdot P(c)$ : The expected probability under independence. If the words were unrelated, this is how often they would co-occur by chance. This follows from the definition of statistical independence: two events $A$ and $B$ are independent if and only if $P(A \cap B) = P(A) \cdot P(B)$ .
The ratio $\frac{P(w, c)}{P(w) \cdot P(c)}$ : When observed equals expected, the ratio is 1. When observed exceeds expected, the ratio is greater than 1. When observed falls short of expected, the ratio is less than 1.
The logarithm: Taking $\log_2$ converts multiplicative relationships to additive ones. A ratio of 2 (twice as often as expected) becomes PMI of 1. A ratio of 4 becomes PMI of 2. This logarithmic scale makes PMI values easier to interpret and compare.

Why the Logarithm?Link Copied

The logarithm serves three important purposes in the PMI formula.

1. Converting ratios to differences (additive scale)

Without the logarithm, we'd be working with a ratio that has an asymmetric range: values above 1 indicate positive association, while values between 0 and 1 indicate negative association. The logarithm maps this to a symmetric scale:

Correspondence between observed/expected ratios and PMI values. Each unit of PMI represents a doubling of the ratio.

Ratio (observed/expected)	PMI ( $\log_2$ of ratio)
8×	+3
4×	+2
2×	+1
1× (independence)	0
0.5×	−1
0.25×	−2
0.125×	−3

The additive scale is more intuitive: PMI of 2 means "4 times more than expected," while PMI of 1 means "2 times more." Each unit of PMI represents a doubling.

2. Creating symmetry around zero

Positive PMI indicates attraction (words co-occur more than expected), negative PMI indicates repulsion (words co-occur less than expected), and zero indicates independence. This symmetry makes it easy to compare and rank associations.

3. Connection to information theory

PMI measures how much information (in bits, when using $\log_2$ ) observing one word provides about the other. In information-theoretic terms, PMI quantifies the reduction in uncertainty about word $c$ when we observe word $w$ . When two words are strongly associated, seeing one tells you a lot about whether you'll see the other.

From Probabilities to CountsLink Copied

In practice, we don't know the true probabilities. We estimate them from corpus counts using the maximum likelihood principle: the best estimate of a probability is the observed frequency. Let $\#(w, c)$ denote how many times words $w$ and $c$ co-occur within the same context window, and let $N$ be the total number of word-context pairs in our co-occurrence matrix.

The probability estimates follow naturally from frequency ratios:

Joint probability (how often $w$ and $c$ appear together):

P(w, c) = \frac{\#(w, c)}{N}

where:

$\#(w, c)$ : the count of times words $w$ and $c$ co-occur within a context window
$N$ : the total number of word-context pairs (sum of all entries in the co-occurrence matrix)

This is the fraction of all word pairs that are specifically the $(w, c)$ pair.

Marginal probability of the target word (how often $w$ appears with any context):

P(w) = \frac{\sum_c \#(w, c)}{N} = \frac{\#(w, *)}{N}

where:

$\sum_c \#(w, c)$ : the sum of co-occurrence counts across all context words $c$
$\#(w, *)$ : a shorthand notation meaning "sum over all context words"

This is the fraction of all pairs where the target word is $w$ . In the co-occurrence matrix, this equals the row sum for word $w$ .

Marginal probability of the context word (how often $c$ appears as context for any word):

P(c) = \frac{\sum_w \#(w, c)}{N} = \frac{\#(*, c)}{N}

where:

$\sum_w \#(w, c)$ : the sum of co-occurrence counts across all target words $w$
$\#(*, c)$ : a shorthand notation meaning "sum over all target words"

This equals the column sum for context word $c$ in the co-occurrence matrix.

The Practical FormulaLink Copied

Substituting these probability estimates into the PMI formula, we can derive a count-based version that's easier to compute directly from the co-occurrence matrix.

Step 1: Start with the definition and substitute our probability estimates:

\text{PMI}(w, c) = \log_2 \frac{P(w, c)}{P(w) \cdot P(c)} = \log_2 \frac{\#(w, c) / N}{(\#(w, *) / N) \cdot (\#(*, c) / N)}

Step 2: Simplify the denominator by multiplying the two fractions:

\text{PMI}(w, c) = \log_2 \frac{\#(w, c) / N}{(\#(w, *) \cdot \#(*, c)) / N^2}

Step 3: Dividing by a fraction is equivalent to multiplying by its reciprocal:

\text{PMI}(w, c) = \log_2 \left( \frac{\#(w, c)}{N} \cdot \frac{N^2}{\#(w, *) \cdot \#(*, c)} \right)

Step 4: Cancel one factor of $N$ from numerator and denominator:

\text{PMI}(w, c) = \log_2 \frac{\#(w, c) \cdot N}{\#(w, *) \cdot \#(*, c)}

This is the formula we implement: multiply the co-occurrence count by the total, divide by the product of marginals, and take the logarithm.

where:

$\#(w, c)$ : how often $w$ and $c$ actually co-occur (the observed count)
$N$ : total co-occurrences in the matrix (sum of all entries)
$\#(w, *)$ : how often $w$ appears with any context (row sum for word $w$ )
$\#(*, c)$ : how often $c$ appears as context for any word (column sum for context $c$ )

The product $\#(w, *) \cdot \#(*, c) / N$ represents the expected count under independence: if word $w$ appears in a fraction $\#(w, *)/N$ of all pairs, and context $c$ appears in a fraction $\#(*, c)/N$ of all pairs, then under independence they should co-occur in approximately $\#(w, *) \cdot \#(*, c) / N$ pairs.

Implementing PMILink Copied

Let's translate this formula into code. The implementation follows the mathematical derivation closely, computing each component step by step.

In[6]:

Code

def compute_pmi_matrix(cooc_matrix):
    """
    Compute PMI matrix from co-occurrence counts.

    PMI(w, c) = log2(P(w,c) / (P(w) * P(c)))
              = log2((#(w,c) * N) / (#(w,*) * #(*,c)))
    """
    # Total count: N = sum of all co-occurrences
    N = cooc_matrix.sum()

    # Row sums: #(w, *) for each word
    # This gives us the marginal counts for target words
    row_sums = cooc_matrix.sum(axis=1, keepdims=True)

    # Column sums: #(*, c) for each context
    # This gives us the marginal counts for context words
    col_sums = cooc_matrix.sum(axis=0, keepdims=True)

    # Expected counts under independence: E[#(w,c)] = #(w,*) * #(*,c) / N
    # Using matrix multiplication to compute all pairs efficiently
    expected = (row_sums @ col_sums) / N

    # PMI = log2(observed / expected)
    # Handle zeros: log(0) is undefined, so we replace with 0
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2(cooc_matrix / expected)
        pmi[~np.isfinite(pmi)] = 0  # Replace inf/-inf/nan with 0

    return pmi


pmi_matrix = compute_pmi_matrix(cooc_matrix)

def compute_pmi_matrix(cooc_matrix):
    """
    Compute PMI matrix from co-occurrence counts.

    PMI(w, c) = log2(P(w,c) / (P(w) * P(c)))
              = log2((#(w,c) * N) / (#(w,*) * #(*,c)))
    """
    # Total count: N = sum of all co-occurrences
    N = cooc_matrix.sum()

    # Row sums: #(w, *) for each word
    # This gives us the marginal counts for target words
    row_sums = cooc_matrix.sum(axis=1, keepdims=True)

    # Column sums: #(*, c) for each context
    # This gives us the marginal counts for context words
    col_sums = cooc_matrix.sum(axis=0, keepdims=True)

    # Expected counts under independence: E[#(w,c)] = #(w,*) * #(*,c) / N
    # Using matrix multiplication to compute all pairs efficiently
    expected = (row_sums @ col_sums) / N

    # PMI = log2(observed / expected)
    # Handle zeros: log(0) is undefined, so we replace with 0
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2(cooc_matrix / expected)
        pmi[~np.isfinite(pmi)] = 0  # Replace inf/-inf/nan with 0

    return pmi


pmi_matrix = compute_pmi_matrix(cooc_matrix)

The key insight in this implementation is computing the expected counts efficiently. Rather than looping through all word pairs, we use matrix multiplication: row_sums @ col_sums produces an outer product where each cell $(i, j)$ contains $\#(w_i, *) \cdot \#(*, c_j)$ . Dividing by $N$ gives us the expected count for each pair.

Out[7]:

Visualization

Scatter plot showing observed co-occurrence counts on y-axis versus expected counts on x-axis, with diagonal line separating positive and negative PMI regions. — Observed vs expected co-occurrence counts for all word pairs. Points above the diagonal (green) co-occur more than expected, yielding positive PMI. Points below the diagonal (red) co-occur less than expected, yielding negative PMI. Points near the diagonal have PMI close to zero, indicating independence. The logarithmic scale reveals the full range of values.

This scatter plot visualizes the fundamental idea behind PMI: comparing what we observe to what we'd expect. Word pairs above the diagonal have positive PMI (they co-occur more than chance predicts), while pairs below have negative PMI.

Out[8]:

Console

PMI values for 'cat':
----------------------------------------
    a           : PMI = +2.10
    all         : PMI = +2.10
    at          : PMI = +2.10
    chased      : PMI = +2.10
    on          : PMI = +2.10
    park        : PMI = +2.10
    sat         : PMI = +2.10
    sleeps      : PMI = +2.10

The transformation is dramatic. Raw counts ranked "the" as the top co-occurrence for "cat," but PMI reveals a different picture. Words that specifically associate with "cat," rather than appearing everywhere, now rise to the top. The word "the" has low or negative PMI because its high frequency means it co-occurs with "cat" about as often as we'd expect by chance.

Interpreting PMI as Association StrengthLink Copied

The logarithmic scale of PMI creates a natural interpretation centered on zero. Because we're measuring $\log_2(\text{observed} / \text{expected})$ , the value tells us directly how the actual co-occurrence compares to the baseline of independence.

The key insight is that PMI values correspond to powers of 2. If $\text{PMI}(w, c) = n$ , then:

2^n = \frac{P(w, c)}{P(w) \cdot P(c)}

This means the observed probability equals $2^n$ times the expected probability under independence. Conversely, we can convert a ratio $r$ of observed to expected into PMI via $\text{PMI} = \log_2 r$ .

PMI > 0: The words co-occur more than expected. A PMI of 1 means $2^1 = 2$ times as often as chance; PMI of 2 means $2^2 = 4$ times as often; PMI of 3 means $2^3 = 8$ times. The higher the value, the stronger the positive association.
PMI = 0: The words co-occur exactly as expected under independence ( $2^0 = 1$ , so observed equals expected). There's no special relationship, either attractive or repulsive.
PMI < 0: The words co-occur less than expected. They tend to avoid each other. A PMI of -1 means $2^{-1} = 0.5$ times as often as chance would predict (i.e., half as often).

Out[9]:

Visualization

Number line showing PMI interpretation from negative (avoidance) through zero (independence) to positive (strong association). — Interpretation of PMI values. Positive PMI indicates words co-occur more than chance would predict, suggesting a meaningful association. Zero PMI means the words are independent. Negative PMI means the words avoid each other, appearing together less than expected.

A Worked Example: Computing PMI Step by StepLink Copied

To solidify understanding, let's walk through a complete PMI calculation by hand. We'll compute the PMI between "cat" and "mouse," two words we'd expect to have a strong association.

The calculation proceeds in three stages:

Gather the raw counts from our co-occurrence matrix
Compute the expected count under the assumption of independence
Calculate PMI as the log ratio of observed to expected

In[10]:

Code

# Get the counts we need for the PMI calculation
mouse_idx = word_to_idx.get("mouse", -1)

if mouse_idx >= 0:
    # Step 1: Observed co-occurrence count #(cat, mouse)
    count_cat_mouse = cooc_matrix[cat_idx, mouse_idx]

    # Step 2: Marginal counts (row and column sums)
    row_sum_cat = cooc_matrix[cat_idx].sum()  # #(cat, *)
    col_sum_mouse = cooc_matrix[:, mouse_idx].sum()  # #(*, mouse)

    # Step 3: Total count N
    N = cooc_matrix.sum()

    # Step 4: Expected count under independence
    expected = (row_sum_cat * col_sum_mouse) / N

    # Step 5: PMI = log2(observed / expected)
    pmi_cat_mouse = (
        np.log2(count_cat_mouse / expected)
        if expected > 0 and count_cat_mouse > 0
        else 0
    )

# Get the counts we need for the PMI calculation
mouse_idx = word_to_idx.get("mouse", -1)

if mouse_idx >= 0:
    # Step 1: Observed co-occurrence count #(cat, mouse)
    count_cat_mouse = cooc_matrix[cat_idx, mouse_idx]

    # Step 2: Marginal counts (row and column sums)
    row_sum_cat = cooc_matrix[cat_idx].sum()  # #(cat, *)
    col_sum_mouse = cooc_matrix[:, mouse_idx].sum()  # #(*, mouse)

    # Step 3: Total count N
    N = cooc_matrix.sum()

    # Step 4: Expected count under independence
    expected = (row_sum_cat * col_sum_mouse) / N

    # Step 5: PMI = log2(observed / expected)
    pmi_cat_mouse = (
        np.log2(count_cat_mouse / expected)
        if expected > 0 and count_cat_mouse > 0
        else 0
    )

Out[11]:

Console

PMI Calculation: 'cat' and 'mouse'
==================================================

Step 1: Gather counts from the co-occurrence matrix
  #(cat, mouse) = 0  (observed co-occurrences)
  #(cat, *)     = 15  (total contexts for 'cat')
  #(*, mouse)   = 4  (total contexts for 'mouse')
  N             = 258  (total word pairs)

Step 2: Compute expected count under independence
  If 'cat' and 'mouse' were unrelated, we'd expect:
  Expected = (#(cat,*) × #(*,mouse)) / N
           = (15 × 4) / 258
           = 0.23

Step 3: Compute PMI as log ratio
  PMI = log₂(observed / expected)
      = log₂(0 / 0.23)
      = log₂(0.00)
      = 0.00

The positive PMI confirms our linguistic intuition: "cat" and "mouse" have a genuine association in language. They appear together in contexts about predator-prey relationships, children's stories, and idiomatic expressions far more often than their individual frequencies would suggest.

This is exactly the kind of meaningful relationship PMI is designed to uncover. Raw counts would have told us that "the" co-occurs with "cat" more often than "mouse" does, simply because "the" is everywhere. PMI cuts through this noise by asking the right question: not "how often?" but "how much more than expected?"

The Problem with Negative PMILink Copied

While PMI can be negative (indicating words that avoid each other), negative values cause practical problems. Most word pairs never co-occur at all, giving them PMI of negative infinity. Even pairs that co-occur rarely get large negative values.

In[12]:

Code

# Analyze the distribution of PMI values
finite_pmis = pmi_matrix[np.isfinite(pmi_matrix) & (pmi_matrix != 0)]
negative_pmis = finite_pmis[finite_pmis < 0]
positive_pmis = finite_pmis[finite_pmis > 0]

# Analyze the distribution of PMI values
finite_pmis = pmi_matrix[np.isfinite(pmi_matrix) & (pmi_matrix != 0)]
negative_pmis = finite_pmis[finite_pmis < 0]
positive_pmis = finite_pmis[finite_pmis > 0]

Out[13]:

Console

PMI value distribution:
----------------------------------------
  Total non-zero entries: 215
  Negative PMI values:    3 (1.4%)
  Positive PMI values:    212 (98.6%)

  Min PMI: -1.31
  Max PMI: 5.43
  Mean PMI: 2.25

Out[14]:

Visualization

Histogram showing PMI value distribution with negative values in red and positive values in green, centered near zero with long tails. — Distribution of PMI values in the co-occurrence matrix. The histogram reveals a roughly symmetric distribution centered slightly below zero, with long tails in both directions. Negative values (red) represent word pairs that co-occur less than expected, while positive values (green) represent genuine associations. The prevalence of negative values motivates the PPMI transformation.

The histogram reveals a key characteristic of PMI: many word pairs have negative values, meaning they co-occur less than expected. This asymmetry, combined with the issues below, motivates the PPMI transformation.

Negative PMI values are problematic for several reasons:

Unreliable estimates: Low co-occurrence counts produce noisy PMI values. A word pair that co-occurs once when we expected two has PMI of -1, but this could easily be sampling noise.
Asymmetric information: Knowing that words don't co-occur is less informative than knowing they do. The absence of co-occurrence could mean many things.
Computational issues: Large negative values dominate distance calculations and can destabilize downstream algorithms.

Positive PMI (PPMI)Link Copied

The standard solution is Positive PMI (PPMI), which simply clips negative values to zero:

\text{PPMI}(w, c) = \max(0, \text{PMI}(w, c))

where:

$\text{PMI}(w, c)$ : the pointwise mutual information between words $w$ and $c$ , as defined earlier
$\max(0, x)$ : the maximum of 0 and $x$ , which returns $x$ if $x > 0$ and returns 0 otherwise

The $\max$ function effectively acts as a threshold: any word pair with negative PMI (indicating the words co-occur less than expected) is set to zero, while positive associations are preserved unchanged.

Positive PMI (PPMI)

PPMI retains only the positive associations from PMI, treating all negative or zero associations as equally uninformative. This produces sparse, non-negative matrices that work well with many machine learning algorithms.

In[15]:

Code

def compute_ppmi_matrix(cooc_matrix):
    """Compute Positive PMI matrix."""
    pmi = compute_pmi_matrix(cooc_matrix)
    ppmi = np.maximum(0, pmi)
    return ppmi


ppmi_matrix = compute_ppmi_matrix(cooc_matrix)

def compute_ppmi_matrix(cooc_matrix):
    """Compute Positive PMI matrix."""
    pmi = compute_pmi_matrix(cooc_matrix)
    ppmi = np.maximum(0, pmi)
    return ppmi


ppmi_matrix = compute_ppmi_matrix(cooc_matrix)

Out[16]:

Console

PPMI matrix properties:
----------------------------------------
  Shape: (43, 43)
  Non-zero entries: 212
  Sparsity: 88.5%
  Max value: 5.43
  Mean (non-zero): 2.29

The high sparsity indicates that PPMI has filtered out most word pairs, retaining only those with genuine positive associations. The remaining non-zero entries represent meaningful relationships that exceed what independence would predict.

Out[17]:

Visualization

Heatmap of PMI values ranging from negative (blue) through zero (white) to positive (red). — PMI matrix showing both positive and negative associations. Negative values (blue) indicate word pairs that co-occur less than expected. The wide range of values can cause computational issues.

Heatmap of PPMI values showing only positive associations in warm colors with many zero entries. — PPMI matrix with negative values clipped to zero. Only positive associations remain, producing a sparse matrix that emphasizes meaningful co-occurrences.

The PPMI matrix is much sparser than the raw co-occurrence matrix. Only word pairs with genuine positive associations retain non-zero values. This sparsity is a feature, not a bug: it means we've filtered out the noise of random co-occurrences.

Shifted PPMI VariantsLink Copied

While PPMI works well, researchers have developed variants that address specific issues. The most important is Shifted PPMI, which subtracts a constant before clipping:

\text{SPPMI}_k(w, c) = \max(0, \text{PMI}(w, c) - \log_2 k)

where:

$\text{PMI}(w, c)$ : the pointwise mutual information between words $w$ and $c$
$k$ : the shift parameter, a positive integer typically between 1 and 15
$\log_2 k$ : the shift amount in bits (e.g., $\log_2 5 \approx 2.32$ bits for $k=5$ )

The shift effectively raises the threshold for what counts as a "meaningful" association. To understand why, consider what the shift means mathematically:

With $k=1$ : The shift is $\log_2 1 = 0$ , so SPPMI equals standard PPMI. Any positive PMI is retained.
With $k=2$ : The shift is $\log_2 2 = 1$ . Only word pairs with PMI > 1 (co-occurring at least twice as often as expected) are retained.
With $k=5$ : The shift is $\log_2 5 \approx 2.32$ . Only word pairs that co-occur at least 5 times more than expected survive.

Shifted PPMI

Shifted PPMI raises the bar for what counts as a positive association by subtracting $\log_2 k$ before clipping. This filters out weak associations that might be due to noise, keeping only the strongest signals. Higher $k$ values produce sparser, more selective matrices.

In[18]:

Code

def compute_sppmi_matrix(cooc_matrix, k=5):
    """
    Compute Shifted Positive PMI matrix.

    SPPMI_k(w, c) = max(0, PMI(w, c) - log2(k))
    """
    pmi = compute_pmi_matrix(cooc_matrix)
    shift = np.log2(k)
    sppmi = np.maximum(0, pmi - shift)
    return sppmi


# Compare different shift values
sppmi_k1 = compute_sppmi_matrix(cooc_matrix, k=1)  # Same as PPMI
sppmi_k2 = compute_sppmi_matrix(cooc_matrix, k=2)
sppmi_k5 = compute_sppmi_matrix(cooc_matrix, k=5)

def compute_sppmi_matrix(cooc_matrix, k=5):
    """
    Compute Shifted Positive PMI matrix.

    SPPMI_k(w, c) = max(0, PMI(w, c) - log2(k))
    """
    pmi = compute_pmi_matrix(cooc_matrix)
    shift = np.log2(k)
    sppmi = np.maximum(0, pmi - shift)
    return sppmi


# Compare different shift values
sppmi_k1 = compute_sppmi_matrix(cooc_matrix, k=1)  # Same as PPMI
sppmi_k2 = compute_sppmi_matrix(cooc_matrix, k=2)
sppmi_k5 = compute_sppmi_matrix(cooc_matrix, k=5)

Out[19]:

Console

Effect of shift parameter k:
--------------------------------------------------
  k=1 (PPMI):   212 non-zero entries
  k=2:          158 non-zero entries
  k=5:          120 non-zero entries

The results show the filtering effect of higher $k$ values. As $k$ increases from 1 to 5, the number of retained associations drops substantially. This represents a trade-off: higher $k$ values filter out weaker (potentially noisy) associations but may also discard some genuine relationships.

Out[20]:

Visualization

Line plot showing sparsity percentage increasing as shift parameter k increases from 1 to 10. — Matrix sparsity increases with shift parameter k. As k rises, the threshold for retaining associations becomes stricter. The common choice k=5 corresponds to Word2Vec's default negative sampling.

Bar chart showing non-zero entry count decreasing as shift parameter k increases from 1 to 10. — Non-zero entries decrease as shift parameter k increases. Higher k values filter out weaker associations, producing sparser matrices with only the strongest relationships.

The visualization shows the trade-off clearly: higher shift values produce sparser matrices by filtering out weaker associations. A common choice is $k=5$ , which corresponds to Word2Vec's default of 5 negative samples.

The connection to Word2Vec is notable: Levy and Goldberg (2014) showed that Word2Vec's skip-gram model with negative sampling implicitly factorizes a shifted PMI matrix with $k$ equal to the number of negative samples. This theoretical connection explains why PMI-based methods and neural embeddings often produce similar results.

PMI vs Raw Counts: A ComparisonLink Copied

Let's directly compare how raw counts and PPMI rank word associations. We'll use a larger corpus to see clearer patterns.

In[21]:

Code

# A larger corpus about technology
tech_corpus = """
Machine learning algorithms process large datasets to find patterns.
Deep learning models use neural networks with many layers.
Neural networks learn representations from training data.
Training data quality affects model performance significantly.
Large language models generate human-like text responses.
Text generation requires understanding context and meaning.
Natural language processing enables computers to understand text.
Computer vision systems recognize objects in images.
Image recognition uses convolutional neural networks effectively.
Convolutional networks excel at processing visual information.
Reinforcement learning agents learn from environmental feedback.
Learning agents optimize their behavior through rewards.
Data scientists analyze complex datasets using statistical methods.
Statistical methods reveal patterns hidden in raw data.
Raw data requires preprocessing before model training begins.
Preprocessing steps include normalization and feature extraction.
Feature extraction transforms raw inputs into useful representations.
Useful representations capture the essential information content.
"""

# Build matrices
tech_cooc, tech_vocab, tech_idx = build_cooccurrence_matrix(
    tech_corpus, window_size=3
)
tech_ppmi = compute_ppmi_matrix(tech_cooc)

# A larger corpus about technology
tech_corpus = """
Machine learning algorithms process large datasets to find patterns.
Deep learning models use neural networks with many layers.
Neural networks learn representations from training data.
Training data quality affects model performance significantly.
Large language models generate human-like text responses.
Text generation requires understanding context and meaning.
Natural language processing enables computers to understand text.
Computer vision systems recognize objects in images.
Image recognition uses convolutional neural networks effectively.
Convolutional networks excel at processing visual information.
Reinforcement learning agents learn from environmental feedback.
Learning agents optimize their behavior through rewards.
Data scientists analyze complex datasets using statistical methods.
Statistical methods reveal patterns hidden in raw data.
Raw data requires preprocessing before model training begins.
Preprocessing steps include normalization and feature extraction.
Feature extraction transforms raw inputs into useful representations.
Useful representations capture the essential information content.
"""

# Build matrices
tech_cooc, tech_vocab, tech_idx = build_cooccurrence_matrix(
    tech_corpus, window_size=3
)
tech_ppmi = compute_ppmi_matrix(tech_cooc)

In[22]:

Code

def get_top_associations(word, matrix, vocab, word_to_idx, n=5):
    """Get top associated words by matrix values."""
    if word not in word_to_idx:
        return []
    idx = word_to_idx[word]
    scores = [
        (vocab[i], matrix[idx, i])
        for i in range(len(vocab))
        if matrix[idx, i] > 0 and vocab[i] != word
    ]
    return sorted(scores, key=lambda x: -x[1])[:n]


query_words = ["learning", "neural", "data"]

def get_top_associations(word, matrix, vocab, word_to_idx, n=5):
    """Get top associated words by matrix values."""
    if word not in word_to_idx:
        return []
    idx = word_to_idx[word]
    scores = [
        (vocab[i], matrix[idx, i])
        for i in range(len(vocab))
        if matrix[idx, i] > 0 and vocab[i] != word
    ]
    return sorted(scores, key=lambda x: -x[1])[:n]


query_words = ["learning", "neural", "data"]

Out[23]:

Console

Comparison: Raw Counts vs PPMI
============================================================

Top associations for 'learning':
--------------------------------------------------
  Raw Counts           | PPMI                
  -------------------- | --------------------
  agents (2)           | machine (3.60)      
  from (2)             | algorithms (2.86)   
  algorithms (1)       | agents (2.60)       
  deep (1)             | deep (2.60)         
  environmental (1)    | environmental (2.60)

Top associations for 'neural':
--------------------------------------------------
  Raw Counts           | PPMI                
  -------------------- | --------------------
  networks (3)         | many (3.89)         
  convolutional (2)    | with (3.89)         
  many (2)             | convolutional (2.89)
  with (2)             | effectively (2.89)  
  effectively (1)      | layers (2.89)       

Top associations for 'data':
--------------------------------------------------
  Raw Counts           | PPMI                
  -------------------- | --------------------
  raw (4)              | quality (3.15)      
  training (4)         | raw (2.56)          
  quality (2)          | training (2.56)     
  requires (2)         | affects (2.15)      
  affects (1)          | analyze (2.15)

The PPMI rankings are more semantically meaningful. Raw counts are dominated by frequent function words, while PPMI surfaces content words with genuine topical associations.

Out[24]:

Visualization

Horizontal bar chart showing raw co-occurrence counts for 'learning', with function words ranking highest. — Raw co-occurrence counts for 'learning'. Function words (red) dominate due to their high frequency, obscuring meaningful semantic associations.

Horizontal bar chart showing PPMI scores for 'learning', with semantically related content words ranking highest. — PPMI scores for 'learning'. Content words (green) now dominate, revealing genuine semantic associations like 'machine', 'deep', and 'reinforcement'.

PMI Matrix PropertiesLink Copied

PPMI matrices have several useful properties that make them well-suited for downstream NLP tasks.

SparsityLink Copied

PPMI matrices are highly sparse because most word pairs don't have positive associations. This sparsity enables efficient storage and computation.

In[25]:

Code

def analyze_matrix_properties(raw_matrix, ppmi_matrix, name=""):
    """Compare properties of raw and PPMI matrices."""
    raw_nonzero = np.count_nonzero(raw_matrix)
    ppmi_nonzero = np.count_nonzero(ppmi_matrix)

    raw_sparsity = 1 - raw_nonzero / raw_matrix.size
    ppmi_sparsity = 1 - ppmi_nonzero / ppmi_matrix.size

    return {
        "raw_nonzero": raw_nonzero,
        "ppmi_nonzero": ppmi_nonzero,
        "raw_sparsity": raw_sparsity,
        "ppmi_sparsity": ppmi_sparsity,
        "raw_mean": raw_matrix[raw_matrix > 0].mean() if raw_nonzero > 0 else 0,
        "ppmi_mean": ppmi_matrix[ppmi_matrix > 0].mean()
        if ppmi_nonzero > 0
        else 0,
    }


props = analyze_matrix_properties(tech_cooc, tech_ppmi)

def analyze_matrix_properties(raw_matrix, ppmi_matrix, name=""):
    """Compare properties of raw and PPMI matrices."""
    raw_nonzero = np.count_nonzero(raw_matrix)
    ppmi_nonzero = np.count_nonzero(ppmi_matrix)

    raw_sparsity = 1 - raw_nonzero / raw_matrix.size
    ppmi_sparsity = 1 - ppmi_nonzero / ppmi_matrix.size

    return {
        "raw_nonzero": raw_nonzero,
        "ppmi_nonzero": ppmi_nonzero,
        "raw_sparsity": raw_sparsity,
        "ppmi_sparsity": ppmi_sparsity,
        "raw_mean": raw_matrix[raw_matrix > 0].mean() if raw_nonzero > 0 else 0,
        "ppmi_mean": ppmi_matrix[ppmi_matrix > 0].mean()
        if ppmi_nonzero > 0
        else 0,
    }


props = analyze_matrix_properties(tech_cooc, tech_ppmi)

Out[26]:

Console

Matrix Property Comparison:
--------------------------------------------------
  Property                    Raw Counts         PPMI
  ------------------------- ------------ ------------
  Non-zero entries                   695          695
  Sparsity                        92.0%       92.0%
  Mean (non-zero)                   1.15         3.37

The comparison reveals how PPMI filtering dramatically reduces the number of non-zero entries while increasing sparsity. This transformation discards pairs that co-occur at or below expected rates, retaining only the semantically meaningful associations.

Out[27]:

Visualization

The sparsity visualization makes the filtering effect of PPMI immediately apparent. The raw matrix has entries wherever words co-occur at all, while the PPMI matrix retains only the genuinely positive associations. This much smaller subset captures the meaningful relationships.

SymmetryLink Copied

For symmetric context windows (looking the same distance left and right), the co-occurrence matrix is symmetric, and so is the PPMI matrix: $\text{PPMI}(w, c) = \text{PPMI}(c, w)$ .

Out[28]:

Console

PPMI matrix is symmetric: True

The matrix confirms symmetry, which means the association between words is bidirectional: if "neural" has high PPMI with "networks," then "networks" has equally high PPMI with "neural." This property holds when using symmetric context windows (looking the same distance left and right from each word).

Row Vectors as Word RepresentationsLink Copied

Each row of the PPMI matrix can serve as a word vector. Words with similar PPMI profiles (similar rows) tend to have similar meanings.

In[29]:

Code

from sklearn.metrics.pairwise import cosine_similarity


def find_similar_words_ppmi(word, ppmi_matrix, vocab, word_to_idx, n=5):
    """Find similar words using PPMI vectors."""
    if word not in word_to_idx:
        return []

    idx = word_to_idx[word]
    word_vec = ppmi_matrix[idx].reshape(1, -1)

    # Compute cosine similarity with all words
    similarities = cosine_similarity(word_vec, ppmi_matrix)[0]

    # Get top similar (excluding the word itself)
    results = []
    for i in np.argsort(similarities)[::-1]:
        if vocab[i] != word:
            results.append((vocab[i], similarities[i]))
            if len(results) >= n:
                break

    return results

from sklearn.metrics.pairwise import cosine_similarity


def find_similar_words_ppmi(word, ppmi_matrix, vocab, word_to_idx, n=5):
    """Find similar words using PPMI vectors."""
    if word not in word_to_idx:
        return []

    idx = word_to_idx[word]
    word_vec = ppmi_matrix[idx].reshape(1, -1)

    # Compute cosine similarity with all words
    similarities = cosine_similarity(word_vec, ppmi_matrix)[0]

    # Get top similar (excluding the word itself)
    results = []
    for i in np.argsort(similarities)[::-1]:
        if vocab[i] != word:
            results.append((vocab[i], similarities[i]))
            if len(results) >= n:
                break

    return results

Out[30]:

Console

Word similarity using PPMI vectors:
--------------------------------------------------

'neural':
    networks       : 0.806
    layers         : 0.598
    with           : 0.558
    many           : 0.484

'data':
    training       : 0.529
    scientists     : 0.487
    model          : 0.473
    rewards        : 0.468

'learning':
    agents         : 0.566
    feedback       : 0.476
    learn          : 0.454
    environmental  : 0.416

The similarity scores capture semantic relationships learned purely from co-occurrence patterns. Words appearing in similar contexts cluster together in the PPMI vector space, enabling applications like finding related terms or detecting semantic categories.

Out[31]:

Visualization

Scatter plot showing words positioned in 2D space based on their PPMI vectors, with semantically related words clustering together. — 2D visualization of PPMI word vectors using PCA. Words that share similar contexts cluster together. Technical terms like 'neural', 'networks', and 'learning' form a cluster (blue), while data-related terms like 'data', 'datasets', and 'raw' form another (green). This clustering emerges purely from co-occurrence statistics, demonstrating how distributional information captures semantic relationships.

The 2D projection reveals the semantic structure captured by PPMI vectors. Even with our small corpus, related words cluster together: machine learning terms form one group, data-related terms another. This is the distributional hypothesis in action. Words with similar meanings appear in similar contexts, leading to similar PPMI vectors.

Collocation Extraction with PMILink Copied

One of PMI's most practical applications is identifying collocations: word combinations that occur together more than chance would predict. Collocations include compound nouns ("ice cream"), phrasal verbs ("give up"), and idiomatic expressions ("kick the bucket").

Collocations

Collocations are word combinations whose meaning or frequency cannot be predicted from the individual words alone. PMI helps identify these by measuring which word pairs co-occur significantly more than their individual frequencies would suggest.

In[32]:

Code

def extract_collocations(corpus, min_count=2, top_n=20):
    """
    Extract collocations using PMI.

    Returns word pairs ranked by PMI score.
    """
    # Get bigram counts
    words = corpus.lower().replace(".", " . ").split()
    bigrams = [
        (words[i], words[i + 1])
        for i in range(len(words) - 1)
        if words[i] != "." and words[i + 1] != "."
    ]

    bigram_counts = Counter(bigrams)
    word_counts = Counter(words)
    total_bigrams = len(bigrams)

    # Compute PMI for each bigram
    collocations = []
    for (w1, w2), count in bigram_counts.items():
        if count < min_count:
            continue

        # P(w1, w2) estimated from bigram frequency
        p_joint = count / total_bigrams

        # P(w1) and P(w2) from unigram frequencies
        p_w1 = word_counts[w1] / sum(word_counts.values())
        p_w2 = word_counts[w2] / sum(word_counts.values())

        # PMI
        pmi = np.log2(p_joint / (p_w1 * p_w2))

        if pmi > 0:
            collocations.append((w1, w2, pmi, count))

    # Sort by PMI
    collocations.sort(key=lambda x: -x[2])
    return collocations[:top_n]


# Extract collocations from tech corpus
collocations = extract_collocations(tech_corpus, min_count=1, top_n=15)

def extract_collocations(corpus, min_count=2, top_n=20):
    """
    Extract collocations using PMI.

    Returns word pairs ranked by PMI score.
    """
    # Get bigram counts
    words = corpus.lower().replace(".", " . ").split()
    bigrams = [
        (words[i], words[i + 1])
        for i in range(len(words) - 1)
        if words[i] != "." and words[i + 1] != "."
    ]

    bigram_counts = Counter(bigrams)
    word_counts = Counter(words)
    total_bigrams = len(bigrams)

    # Compute PMI for each bigram
    collocations = []
    for (w1, w2), count in bigram_counts.items():
        if count < min_count:
            continue

        # P(w1, w2) estimated from bigram frequency
        p_joint = count / total_bigrams

        # P(w1) and P(w2) from unigram frequencies
        p_w1 = word_counts[w1] / sum(word_counts.values())
        p_w2 = word_counts[w2] / sum(word_counts.values())

        # PMI
        pmi = np.log2(p_joint / (p_w1 * p_w2))

        if pmi > 0:
            collocations.append((w1, w2, pmi, count))

    # Sort by PMI
    collocations.sort(key=lambda x: -x[2])
    return collocations[:top_n]


# Extract collocations from tech corpus
collocations = extract_collocations(tech_corpus, min_count=1, top_n=15)

Out[33]:

Console

Top Collocations by PMI:
-------------------------------------------------------
  Bigram                         PMI    Count
  ------------------------- -------- --------
  algorithms process            7.64        1
  with many                     7.64        1
  many layers                   7.64        1
  quality affects               7.64        1
  performance significantly     7.64        1
  generate human-like           7.64        1
  understanding context         7.64        1
  enables computers             7.64        1
  computer vision               7.64        1
  vision systems                7.64        1
  systems recognize             7.64        1
  recognize objects             7.64        1
  image recognition             7.64        1
  recognition uses              7.64        1
  excel at                      7.64        1

The top collocations are meaningful multi-word expressions like "machine learning," "neural networks," and "language models." PMI successfully identifies these as units that co-occur far more than chance would predict. The high PMI scores (often above 4) indicate these word pairs appear together 16 or more times more frequently than their individual frequencies would suggest.

The collocations table confirms that technical compound terms like "neural networks" and "machine learning" have exceptionally high PMI scores, identifying them as genuine multi-word units in this domain.

Implementation: Building a Complete PPMI PipelineLink Copied

Let's put everything together into a complete, reusable implementation for computing PPMI matrices from text.

In[35]:

Code

class PPMIVectorizer:
    """
    Transform text into PPMI word vectors.

    This class provides a complete pipeline for:
    1. Building vocabulary from corpus
    2. Constructing co-occurrence matrix
    3. Computing PPMI transformation
    4. Finding similar words
    """

    def __init__(self, window_size=2, min_count=2, shift_k=1):
        """
        Initialize the vectorizer.

        Args:
            window_size: Context window size on each side
            min_count: Minimum word frequency to include in vocabulary
            shift_k: Shift parameter for SPPMI (k=1 gives standard PPMI)
        """
        self.window_size = window_size
        self.min_count = min_count
        self.shift_k = shift_k
        self.vocab = None
        self.word_to_idx = None
        self.ppmi_matrix = None

    def fit(self, corpus):
        """
        Build vocabulary and compute PPMI matrix from corpus.

        Args:
            corpus: String containing the text corpus
        """
        # Tokenize
        words = corpus.lower().replace(".", "").replace(",", "").split()

        # Build vocabulary
        word_counts = Counter(words)
        self.vocab = sorted(
            [w for w, c in word_counts.items() if c >= self.min_count]
        )
        self.word_to_idx = {w: i for i, w in enumerate(self.vocab)}

        # Build co-occurrence matrix
        n = len(self.vocab)
        cooc = np.zeros((n, n))

        for i, word in enumerate(words):
            if word not in self.word_to_idx:
                continue
            word_idx = self.word_to_idx[word]

            start = max(0, i - self.window_size)
            end = min(len(words), i + self.window_size + 1)

            for j in range(start, end):
                if j != i and words[j] in self.word_to_idx:
                    context_idx = self.word_to_idx[words[j]]
                    cooc[word_idx, context_idx] += 1

        # Compute PPMI
        N = cooc.sum()
        if N == 0:
            self.ppmi_matrix = cooc
            return self

        row_sums = cooc.sum(axis=1, keepdims=True)
        col_sums = cooc.sum(axis=0, keepdims=True)

        # Avoid division by zero
        row_sums = np.maximum(row_sums, 1e-10)
        col_sums = np.maximum(col_sums, 1e-10)

        expected = (row_sums @ col_sums) / N

        with np.errstate(divide="ignore", invalid="ignore"):
            pmi = np.log2(cooc / expected)
            pmi[~np.isfinite(pmi)] = 0

        # Apply shift and clip
        shift = np.log2(self.shift_k) if self.shift_k > 1 else 0
        self.ppmi_matrix = np.maximum(0, pmi - shift)

        return self

    def get_vector(self, word):
        """Get PPMI vector for a word."""
        if word not in self.word_to_idx:
            return None
        return self.ppmi_matrix[self.word_to_idx[word]]

    def most_similar(self, word, n=5):
        """Find most similar words by cosine similarity."""
        vec = self.get_vector(word)
        if vec is None:
            return []

        vec = vec.reshape(1, -1)
        sims = cosine_similarity(vec, self.ppmi_matrix)[0]

        results = []
        for i in np.argsort(sims)[::-1]:
            if self.vocab[i] != word:
                results.append((self.vocab[i], sims[i]))
                if len(results) >= n:
                    break
        return results

class PPMIVectorizer:
    """
    Transform text into PPMI word vectors.

    This class provides a complete pipeline for:
    1. Building vocabulary from corpus
    2. Constructing co-occurrence matrix
    3. Computing PPMI transformation
    4. Finding similar words
    """

    def __init__(self, window_size=2, min_count=2, shift_k=1):
        """
        Initialize the vectorizer.

        Args:
            window_size: Context window size on each side
            min_count: Minimum word frequency to include in vocabulary
            shift_k: Shift parameter for SPPMI (k=1 gives standard PPMI)
        """
        self.window_size = window_size
        self.min_count = min_count
        self.shift_k = shift_k
        self.vocab = None
        self.word_to_idx = None
        self.ppmi_matrix = None

    def fit(self, corpus):
        """
        Build vocabulary and compute PPMI matrix from corpus.

        Args:
            corpus: String containing the text corpus
        """
        # Tokenize
        words = corpus.lower().replace(".", "").replace(",", "").split()

        # Build vocabulary
        word_counts = Counter(words)
        self.vocab = sorted(
            [w for w, c in word_counts.items() if c >= self.min_count]
        )
        self.word_to_idx = {w: i for i, w in enumerate(self.vocab)}

        # Build co-occurrence matrix
        n = len(self.vocab)
        cooc = np.zeros((n, n))

        for i, word in enumerate(words):
            if word not in self.word_to_idx:
                continue
            word_idx = self.word_to_idx[word]

            start = max(0, i - self.window_size)
            end = min(len(words), i + self.window_size + 1)

            for j in range(start, end):
                if j != i and words[j] in self.word_to_idx:
                    context_idx = self.word_to_idx[words[j]]
                    cooc[word_idx, context_idx] += 1

        # Compute PPMI
        N = cooc.sum()
        if N == 0:
            self.ppmi_matrix = cooc
            return self

        row_sums = cooc.sum(axis=1, keepdims=True)
        col_sums = cooc.sum(axis=0, keepdims=True)

        # Avoid division by zero
        row_sums = np.maximum(row_sums, 1e-10)
        col_sums = np.maximum(col_sums, 1e-10)

        expected = (row_sums @ col_sums) / N

        with np.errstate(divide="ignore", invalid="ignore"):
            pmi = np.log2(cooc / expected)
            pmi[~np.isfinite(pmi)] = 0

        # Apply shift and clip
        shift = np.log2(self.shift_k) if self.shift_k > 1 else 0
        self.ppmi_matrix = np.maximum(0, pmi - shift)

        return self

    def get_vector(self, word):
        """Get PPMI vector for a word."""
        if word not in self.word_to_idx:
            return None
        return self.ppmi_matrix[self.word_to_idx[word]]

    def most_similar(self, word, n=5):
        """Find most similar words by cosine similarity."""
        vec = self.get_vector(word)
        if vec is None:
            return []

        vec = vec.reshape(1, -1)
        sims = cosine_similarity(vec, self.ppmi_matrix)[0]

        results = []
        for i in np.argsort(sims)[::-1]:
            if self.vocab[i] != word:
                results.append((self.vocab[i], sims[i]))
                if len(results) >= n:
                    break
        return results

In[36]:

Code

# Example usage
vectorizer = PPMIVectorizer(window_size=3, min_count=2, shift_k=1)
vectorizer.fit(tech_corpus)

# Example usage
vectorizer = PPMIVectorizer(window_size=3, min_count=2, shift_k=1)
vectorizer.fit(tech_corpus)

Out[37]:

Console

PPMIVectorizer Results:
--------------------------------------------------
Vocabulary size: 30
Matrix shape: (30, 30)

Similar to 'learning':
    learn          : 0.555
    agents         : 0.548
    from           : 0.390
    information    : 0.318

Similar to 'networks':
    neural         : 0.720
    convolutional  : 0.526
    learn          : 0.348
    language       : 0.334

Similar to 'data':
    preprocessing  : 0.767
    raw            : 0.681
    training       : 0.540
    model          : 0.536

The vectorizer identifies semantically related words based on shared context patterns. For example, "learning" associates with related machine learning terms, while "data" connects to words from the data processing domain. The similarity scores (ranging from 0 to 1 for cosine similarity) reflect how closely the context distributions match between words.

Limitations and When to Use PMILink Copied

While PMI is powerful, it has limitations you should understand.

Sensitivity to Low CountsLink Copied

PMI estimates are unreliable for rare word pairs. A word pair that co-occurs once when we expected 0.5 co-occurrences gets PMI of 1, but this could easily be noise. The standard solution is to require minimum co-occurrence counts before computing PMI.

In[38]:

Code

def compute_reliable_pmi(cooc_matrix, min_count=5):
    """Compute PMI only for word pairs with sufficient counts."""
    # Mask out low-count pairs
    mask = cooc_matrix >= min_count

    # Compute PMI
    N = cooc_matrix.sum()
    row_sums = cooc_matrix.sum(axis=1, keepdims=True)
    col_sums = cooc_matrix.sum(axis=0, keepdims=True)

    row_sums = np.maximum(row_sums, 1e-10)
    col_sums = np.maximum(col_sums, 1e-10)

    expected = (row_sums @ col_sums) / N

    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2(cooc_matrix / expected)
        pmi[~np.isfinite(pmi)] = 0

    # Apply mask
    pmi[~mask] = 0

    return np.maximum(0, pmi)


# Compare standard PPMI vs reliable PPMI with minimum count threshold
standard_ppmi = compute_ppmi_matrix(tech_cooc)
reliable_ppmi = compute_reliable_pmi(tech_cooc, min_count=2)

def compute_reliable_pmi(cooc_matrix, min_count=5):
    """Compute PMI only for word pairs with sufficient counts."""
    # Mask out low-count pairs
    mask = cooc_matrix >= min_count

    # Compute PMI
    N = cooc_matrix.sum()
    row_sums = cooc_matrix.sum(axis=1, keepdims=True)
    col_sums = cooc_matrix.sum(axis=0, keepdims=True)

    row_sums = np.maximum(row_sums, 1e-10)
    col_sums = np.maximum(col_sums, 1e-10)

    expected = (row_sums @ col_sums) / N

    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2(cooc_matrix / expected)
        pmi[~np.isfinite(pmi)] = 0

    # Apply mask
    pmi[~mask] = 0

    return np.maximum(0, pmi)


# Compare standard PPMI vs reliable PPMI with minimum count threshold
standard_ppmi = compute_ppmi_matrix(tech_cooc)
reliable_ppmi = compute_reliable_pmi(tech_cooc, min_count=2)

Out[39]:

Console

Effect of minimum count threshold:
--------------------------------------------------
  Standard PPMI non-zero entries: 695
  Reliable PPMI (min_count=2):    77
  Entries filtered out:           618

The minimum count threshold removes word pairs that co-occur too rarely to provide reliable PMI estimates. The filtered entries may represent genuine but rare associations, or they may be sampling artifacts. In practice, this trade-off between precision and recall depends on corpus size and downstream application requirements.

Bias Toward Rare WordsLink Copied

PMI tends to give high scores to rare word pairs. If a rare word appears only in specific contexts, it gets high PMI with those contexts even if the association is coincidental. Shifted PPMI helps by raising the threshold for positive associations.

Out[40]:

Visualization

Scatter plot showing word frequency on x-axis (log scale) versus maximum PMI value on y-axis, demonstrating that rare words tend to have higher PMI scores. — Relationship between word frequency and maximum PMI score. Rare words (left side) tend to achieve higher maximum PMI values because their limited contexts create artificially strong associations. Frequent words (right side) have more stable, moderate PMI values. This bias toward rare words is a known limitation of PMI that minimum count thresholds and shifted PPMI help address.

The downward trend confirms the rare word bias: words with fewer total co-occurrences tend to achieve higher maximum PMI scores. This happens because rare words have limited contexts, making each co-occurrence count proportionally more.

Computational CostLink Copied

For large vocabularies, PMI matrices become enormous. A 100,000-word vocabulary produces a 10-billion-cell matrix. While the matrix is sparse after PPMI transformation, the intermediate computations can be expensive. Practical implementations use sparse matrix formats and streaming algorithms.

When to Use PMILink Copied

PMI and PPMI are excellent choices when:

You need interpretable association scores between words
You're extracting collocations or multi-word expressions
You want sparse, high-dimensional word vectors as input to other algorithms
You need a baseline to compare against neural embeddings
Computational resources are limited (no GPU required)

Neural methods like Word2Vec often outperform PPMI for downstream tasks, but the difference is smaller than you might expect. For many applications, PPMI provides a strong, interpretable baseline.

SummaryLink Copied

Pointwise Mutual Information transforms raw co-occurrence counts into meaningful association scores by comparing observed co-occurrence to what we'd expect under independence.

Key concepts:

PMI formula:

\text{PMI}(w, c) = \log_2 \frac{P(w, c)}{P(w) \cdot P(c)}

measures how much more (or less) two words co-occur than chance would predict, where $P(w, c)$ is the joint probability of co-occurrence and $P(w) \cdot P(c)$ is the expected probability under independence.

Positive PMI (PPMI): $\text{PPMI}(w, c) = \max(0, \text{PMI}(w, c))$ clips negative values to zero, keeping only positive associations. This produces sparse matrices that work well with machine learning algorithms.
Shifted PPMI: $\text{SPPMI}_k(w, c) = \max(0, \text{PMI}(w, c) - \log_2 k)$ subtracts $\log_2 k$ before clipping, filtering out weak associations. Connected theoretically to Word2Vec's negative sampling.
PMI interpretation: Positive PMI means words co-occur more than expected (strong association). Zero means independence. Negative means avoidance.

Practical applications:

Collocation extraction: Finding meaningful multi-word expressions
Word similarity: Using PPMI vectors with cosine similarity
Feature weighting: PPMI as a preprocessing step before dimensionality reduction

Key ParametersLink Copied

Key PPMI parameters with recommended value ranges and their effects on the resulting matrix.

Parameter	Typical Values	Effect
`window_size`	2-5	Larger windows capture broader topical context but may include more noise; smaller windows emphasize syntactic relationships
`min_count`	2-10	Higher values filter unreliable associations from rare words; start with 2-5 for small corpora, 5-10 for larger ones
`shift_k`	1-15	Higher values keep only the strongest associations; k=5 is common and corresponds to Word2Vec's default negative sampling

Guidelines for choosing each parameter:

window_size: Smaller windows (2-3) tend to capture syntactic relationships and function word patterns. Larger windows (5-10) capture more topical or semantic relationships. For most NLP applications, a window of 2-5 provides a good balance.
min_count: This threshold depends on corpus size. For small corpora (under 1 million words), use 2-5. For larger corpora, 5-10 reduces noise from rare co-occurrences. Higher thresholds produce more reliable but potentially incomplete association matrices.
shift_k: The shift parameter controls sensitivity. With k=1 (standard PPMI), all positive associations are retained. With k=5, only associations at least 5 times stronger than expected survive. If downstream tasks show signs of noise from rare word artifacts, increase k.

The next chapter shows how to reduce the dimensionality of PPMI matrices using Singular Value Decomposition, producing dense vectors that capture the essential structure in fewer dimensions.

QuizLink Copied

Ready to test your understanding? Take this quick quiz to reinforce what you've learned about Pointwise Mutual Information and word associations.

Loading component...

Comments

Back to Language AI Handbook

Previous Chapter

Co-occurrence Matrices

Next Chapter

Singular Value Decomposition

Reference

BIBTEXAcademic

@misc{pointwisemutualinformationmeasuringwordassociationsinnlp, author = {Michael Brenndoerfer}, title = {Pointwise Mutual Information: Measuring Word Associations in NLP}, year = {2025}, url = {https://mbrenndoerfer.com/writing/pointwise-mutual-information-word-associations-nlp}, organization = {mbrenndoerfer.com}, note = {Accessed: 2025-01-01} }

APAAcademic

Michael Brenndoerfer (2025). Pointwise Mutual Information: Measuring Word Associations in NLP. Retrieved from https://mbrenndoerfer.com/writing/pointwise-mutual-information-word-associations-nlp

MLAAcademic

Michael Brenndoerfer. "Pointwise Mutual Information: Measuring Word Associations in NLP." 2026. Web. today. <https://mbrenndoerfer.com/writing/pointwise-mutual-information-word-associations-nlp>.

CHICAGOAcademic

Michael Brenndoerfer. "Pointwise Mutual Information: Measuring Word Associations in NLP." Accessed today. https://mbrenndoerfer.com/writing/pointwise-mutual-information-word-associations-nlp.

HARVARDAcademic

Michael Brenndoerfer (2025) 'Pointwise Mutual Information: Measuring Word Associations in NLP'. Available at: https://mbrenndoerfer.com/writing/pointwise-mutual-information-word-associations-nlp (Accessed: today).

SimpleBasic

Michael Brenndoerfer (2025). Pointwise Mutual Information: Measuring Word Associations in NLP. https://mbrenndoerfer.com/writing/pointwise-mutual-information-word-associations-nlp

Direct link:

https://mbrenndoerfer.com/writing/pointwise-mutual-information-word-associations-nlp

About the author: Michael Brenndoerfer

All opinions expressed here are my own and do not reflect the views of my employer.

Michael currently works as an Associate Director of Data Science at EQT Partners in Singapore, leading AI and data initiatives across private capital investments.

With over a decade of experience spanning private equity, management consulting, and software engineering, he specializes in building and scaling analytics capabilities from the ground up. He has published research in leading AI conferences and holds expertise in machine learning, natural language processing, and value creation through data.

View Full Resume Publications Contact Books

Pointwise Mutual Information: Measuring Word Associations in NLP

Pointwise Mutual InformationLink Copied

The Problem with Raw CountsLink Copied

The PMI FormulaLink Copied

The Intuition: Observed vs ExpectedLink Copied

Why the Logarithm?Link Copied

From Probabilities to CountsLink Copied

The Practical FormulaLink Copied

Implementing PMILink Copied

Interpreting PMI as Association StrengthLink Copied

A Worked Example: Computing PMI Step by StepLink Copied

The Problem with Negative PMILink Copied

Positive PMI (PPMI)Link Copied

Shifted PPMI VariantsLink Copied

PMI vs Raw Counts: A ComparisonLink Copied

PMI Matrix PropertiesLink Copied

SparsityLink Copied

SymmetryLink Copied

Row Vectors as Word RepresentationsLink Copied

Collocation Extraction with PMILink Copied

Implementation: Building a Complete PPMI PipelineLink Copied

Limitations and When to Use PMILink Copied

Sensitivity to Low CountsLink Copied

Bias Toward Rare WordsLink Copied

Computational CostLink Copied

When to Use PMILink Copied

SummaryLink Copied

Key ParametersLink Copied

QuizLink Copied

Comments

Reference

About the author: Michael Brenndoerfer

Related Content

Singular Value Decomposition: Matrix Factorization for Word Embeddings & LSA

The Distributional Hypothesis: How Context Reveals Word Meaning

Co-occurrence Matrices: Building Word Representations from Context

Stay updated

Comments

About the author: Michael Brenndoerfer

Related Content

Singular Value Decomposition: Matrix Factorization for Word Embeddings & LSA

The Distributional Hypothesis: How Context Reveals Word Meaning

Co-occurrence Matrices: Building Word Representations from Context

Stay updated