GloVe: Global Vectors for Word Representation

Michael Brenndoerfer · December 11, 2025 · 41 min read · 9,750 words

Learn how GloVe creates word embeddings by factorizing co-occurrence matrices. Covers the derivation, weighted least squares objective, and Python implementation.


GloVe

Word2Vec learns embeddings through local context windows, predicting surrounding words one pair at a time. This approach works well but ignores a fundamental insight: some word relationships are global properties of the corpus. The word "ice" might appear near "cold" millions of times across a billion-word corpus, but Word2Vec treats each occurrence as an independent prediction task. What if we could leverage this global co-occurrence information directly?

GloVe (Global Vectors for Word Representation) takes a different path. Developed by Pennington, Socher, and Manning at Stanford in 2014, GloVe starts with a co-occurrence matrix that captures how often words appear together across the entire corpus. It then factorizes this matrix to produce word vectors. The result: embeddings that encode both local context patterns and global corpus statistics.

This chapter develops GloVe from first principles. We'll see how the objective function emerges from a simple requirement that word vectors encode co-occurrence ratios, work through the weighted least squares formulation, and implement GloVe from scratch. By the end, you'll understand why GloVe achieves comparable results to Word2Vec despite taking a fundamentally different approach.

The Insight: Co-occurrence Ratios Reveal Meaning

GloVe's key insight is simple but powerful: the ratio of co-occurrence probabilities encodes semantic relationships more reliably than raw probabilities. Consider the words "ice" and "steam." Both relate to water, but in different ways. How can we distinguish them?

Let's look at co-occurrence with probe words:

Probe word k | P(k | ice) | P(k | steam) | Ratio P(k | ice) / P(k | steam)
---------------------------------------------------------------------------
solid        | high       | low          | large (≫ 1)
gas          | low        | high         | small (≪ 1)
water        | high       | high         | ≈ 1
fashion      | low        | low          | ≈ 1

The raw probabilities P(k \mid \text{ice}) and P(k \mid \text{steam}) depend on many factors: how common each word is, the corpus domain, and so on. But the ratio tells a cleaner story:

  • Large ratio: The probe word relates more to "ice" than "steam" (like "solid")
  • Small ratio: The probe word relates more to "steam" than "ice" (like "gas")
  • Ratio ≈ 1: The probe word relates equally to both (like "water") or to neither (like "fashion")

This ratio invariance is powerful. It factors out corpus-specific biases and isolates the semantic relationship we care about.

In[2]:
import numpy as np

# Simulated co-occurrence probabilities (illustrative values)
# In practice, these come from corpus statistics
co_occurrence = {
    'ice': {'solid': 0.00019, 'gas': 0.000022, 'water': 0.003, 'fashion': 0.000018},
    'steam': {'solid': 0.000022, 'gas': 0.00078, 'water': 0.0022, 'fashion': 0.000018}
}

def compute_ratio(word1, word2, probe):
    """Compute P(probe | word1) / P(probe | word2)."""
    p1 = co_occurrence[word1][probe]
    p2 = co_occurrence[word2][probe]
    return p1 / p2

probes = ['solid', 'gas', 'water', 'fashion']
ratios = {probe: compute_ratio('ice', 'steam', probe) for probe in probes}
Out[3]:
Co-occurrence Ratio Analysis: ice vs steam
-------------------------------------------------------
Probe            P(k|ice)   P(k|steam)        Ratio
-------------------------------------------------------
solid            0.000190     0.000022         8.64
gas              0.000022     0.000780         0.03
water            0.003000     0.002200         1.36
fashion          0.000018     0.000018         1.00

The ratios reveal clear discriminative patterns. "Solid" has a ratio far greater than 1 (approximately 8.6), indicating strong association with "ice" rather than "steam." Conversely, "gas" has a ratio well below 1 (approximately 0.03), showing the opposite relationship. Both "water" and "fashion" have ratios near 1, but for different reasons: "water" relates equally to both states, while "fashion" is irrelevant to either.

Out[4]:
Visualization
Bar chart showing co-occurrence ratios on log scale for four probe words.
Co-occurrence ratios for probe words distinguishing 'ice' from 'steam'. Ratios above 1 (dashed line) indicate stronger association with 'ice'; ratios below 1 indicate stronger association with 'steam'. The log scale reveals the dramatic difference between discriminative probes (solid, gas) and neutral probes (water, fashion).
Co-occurrence Ratio

The ratio of co-occurrence probabilities \frac{P(k \mid w_i)}{P(k \mid w_j)} encodes how a probe word k discriminates between target words w_i and w_j. GloVe's objective function is designed so that word vectors can reconstruct these ratios.

From Ratios to Vectors: Deriving the Objective

We've established that co-occurrence ratios encode semantic relationships. The next question is: how do we design word vectors that naturally capture these ratios? The answer comes through a derivation that starts with a simple requirement and, through a series of logical constraints, arrives at GloVe's objective function.

The derivation unfolds like a detective story. Each constraint eliminates possibilities, narrowing the space of potential solutions until only one sensible answer remains. By the end, the objective function won't feel like an arbitrary choice. It will feel inevitable.

Setting Up the Problem

Our starting point is the co-occurrence matrix X. This matrix is the foundation of everything GloVe does. Each entry X_{ij} counts how often word j appears within a context window of word i, accumulated across the entire corpus. From these raw counts, we can define probabilities:

P_{ij} = P(j \mid i) = \frac{X_{ij}}{X_i}

where:

  • P_{ij}: probability of word j appearing in the context of word i
  • X_{ij}: co-occurrence count for words i and j
  • X_i = \sum_k X_{ik}: total co-occurrence count for word i

Now we can state our goal precisely. We want to learn word vectors \mathbf{w}_i and context vectors \tilde{\mathbf{w}}_k such that some function F of these vectors recovers the co-occurrence ratio:

F(\mathbf{w}_i, \mathbf{w}_j, \tilde{\mathbf{w}}_k) = \frac{P_{ik}}{P_{jk}}

where:

  • \mathbf{w}_i, \mathbf{w}_j: word vectors for the target words
  • \tilde{\mathbf{w}}_k: context vector for the probe word
  • F: function to be determined

This equation captures our key insight: the ratio of co-occurrence probabilities, the same ratio that distinguishes "ice" from "steam" via probe words like "solid" and "gas", should be computable from word vectors alone. The question is: what form must F take?

Constraint 1: Vector Differences Encode Contrasts

The ratio \frac{P_{ik}}{P_{jk}} fundamentally measures a contrast: how does word i's relationship with k differ from word j's relationship with k? In vector spaces, the natural way to represent contrasts is through subtraction. When we compute \mathbf{w}_i - \mathbf{w}_j, we obtain a vector pointing from j toward i, encoding everything that distinguishes them.

This suggests simplifying our function to depend on the difference:

F((\mathbf{w}_i - \mathbf{w}_j), \tilde{\mathbf{w}}_k) = \frac{P_{ik}}{P_{jk}}

Now F takes two inputs: the difference vector (\mathbf{w}_i - \mathbf{w}_j) and the context vector \tilde{\mathbf{w}}_k. This is already more constrained. F doesn't need to handle three arbitrary vectors, just a difference and a context.

Constraint 2: Producing a Scalar from Vectors

Look at the right-hand side: \frac{P_{ik}}{P_{jk}} is a scalar, a single number. But our inputs are vectors, high-dimensional objects with many components. How do we combine two vectors to produce a single number?

The most natural choice is the dot product. The dot product \mathbf{a} \cdot \mathbf{b} measures how aligned two vectors are: positive when they point similarly, negative when opposite, zero when perpendicular. It also has mathematical properties that will prove crucial shortly.

This gives us:

F((\mathbf{w}_i - \mathbf{w}_j) \cdot \tilde{\mathbf{w}}_k) = \frac{P_{ik}}{P_{jk}}

The function F now operates on a scalar (the dot product) and produces another scalar. We've reduced the problem significantly.

Constraint 3: The Exponential Emerges

Here's where the key insight emerges. Expand the dot product:

(\mathbf{w}_i - \mathbf{w}_j) \cdot \tilde{\mathbf{w}}_k = \mathbf{w}_i \cdot \tilde{\mathbf{w}}_k - \mathbf{w}_j \cdot \tilde{\mathbf{w}}_k

The left side is a difference of dot products. The right side of our equation is a ratio of probabilities. We need a function F that transforms differences into ratios.

Think about this algebraically: we need F such that F(a - b) = F(a)/F(b) for scalars a and b. This is asking for a homomorphism from addition to multiplication, a function that converts additive structure into multiplicative structure.

There's essentially one continuous function with this property: the exponential. Since e^{a-b} = e^a / e^b, the exponential naturally converts differences in the exponent into ratios in the output.

Applying this insight:

\exp(\mathbf{w}_i \cdot \tilde{\mathbf{w}}_k - \mathbf{w}_j \cdot \tilde{\mathbf{w}}_k) = \frac{P_{ik}}{P_{jk}}

Using the exponential property, we can separate this into:

\frac{\exp(\mathbf{w}_i \cdot \tilde{\mathbf{w}}_k)}{\exp(\mathbf{w}_j \cdot \tilde{\mathbf{w}}_k)} = \frac{P_{ik}}{P_{jk}}

For this to hold for all word pairs, each individual term must satisfy:

\exp(\mathbf{w}_i \cdot \tilde{\mathbf{w}}_k) = P_{ik} \cdot C

for some constant C that may depend on k but cancels in the ratio. Taking logarithms of both sides:

\mathbf{w}_i \cdot \tilde{\mathbf{w}}_k = \log(P_{ik}) + \log(C)
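The exponential step above can be sanity-checked numerically. The short sketch below uses arbitrary illustrative vectors (not learned embeddings) to confirm that exponentiating a difference of dot products equals the ratio of the individually exponentiated dot products.

# Quick check: exp turns a difference of dot products into a ratio.
# These vectors are arbitrary, for illustration only.
w_i = np.array([0.2, -0.1, 0.4])
w_j = np.array([-0.3, 0.5, 0.1])
w_tilde_k = np.array([0.6, 0.2, -0.2])

lhs = np.exp((w_i - w_j) @ w_tilde_k)                    # exp of the difference
rhs = np.exp(w_i @ w_tilde_k) / np.exp(w_j @ w_tilde_k)  # ratio of exps
print(lhs, rhs)  # identical up to floating-point error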

Arriving at the Core Equation

We're almost there. Substituting the definition P_{ik} = X_{ik} / X_i:

\mathbf{w}_i \cdot \tilde{\mathbf{w}}_k = \log(X_{ik}) - \log(X_i) + \log(C)

Now notice something important: \log(X_i) depends only on word i, not on the context word k. This term captures how often word i appears in the corpus overall, a frequency effect rather than a semantic relationship. Similarly, \log(C) might depend only on k.

The solution is to absorb these word-specific terms into bias terms:

  • Let b_i = -\log(X_i) + (other word-i-specific terms)
  • Let \tilde{b}_k = \log(C) + (other context-k-specific terms)

This yields GloVe's core equation:

\mathbf{w}_i \cdot \tilde{\mathbf{w}}_k + b_i + \tilde{b}_k = \log(X_{ik})

where:

  • \mathbf{w}_i: word vector for word i (captures semantic content)
  • \tilde{\mathbf{w}}_k: context vector for word k (captures contextual role)
  • b_i: bias term for word i (absorbs overall frequency effects)
  • \tilde{b}_k: bias term for context word k (absorbs context-specific effects)
  • X_{ik}: co-occurrence count (the observed data)

This equation has a clear interpretation. The dot product \mathbf{w}_i \cdot \tilde{\mathbf{w}}_k measures the semantic compatibility between word i and context k. The biases adjust for how common each word is overall. Together, they should predict the logarithm of how often we actually observe the pair together.

This is GloVe's core equation: the dot product of word and context vectors, plus biases, should equal the log co-occurrence count.

Out[5]:
Visualization
Flowchart showing derivation steps from ratio encoding to final objective.
The logical chain of GloVe's derivation. Starting from the requirement that word vectors encode co-occurrence ratios, successive constraints narrow the functional form until we arrive at the weighted least squares objective.

The GloVe Objective Function

We've derived that word vectors should satisfy \mathbf{w}_i \cdot \tilde{\mathbf{w}}_k + b_i + \tilde{b}_k = \log(X_{ik}). But this is an idealized equation. In practice, no finite-dimensional embedding can perfectly satisfy it for every word pair. We need to frame this as an optimization problem: find the vectors and biases that come as close as possible to satisfying the equation across all pairs.

The journey from ideal equation to practical objective reveals important design decisions. A naive formulation encounters serious problems, and solving them leads to GloVe's distinctive weighted least squares approach.

Naive Least Squares (and Its Problems)

The most straightforward optimization minimizes squared error:

J = \sum_{i,j} \left( \mathbf{w}_i \cdot \tilde{\mathbf{w}}_j + b_i + \tilde{b}_j - \log(X_{ij}) \right)^2

This says: for every word pair, measure how far our prediction deviates from the target log co-occurrence, square it (so positive and negative errors contribute equally), and sum across all pairs. Standard least squares.

But this formulation has critical flaws:

  1. Zero counts are catastrophic: Many word pairs never co-occur in any corpus. "Quantum" and "umbrella" might never appear together. For these pairs, X_{ij} = 0, and \log(0) = -\infty. The objective becomes undefined.

  2. Not all pairs deserve equal attention: A co-occurrence count of 1 million reflects a strong, statistically reliable signal. A count of 1 might be noise, a single accidental co-occurrence. Yet naive least squares weights them identically.

  3. Rare pairs can dominate: If rare word pairs have large errors (which they often do, being noisy), they can disproportionately influence training, pulling embeddings away from configurations that would serve common words well.

These problems demand a more thoughtful objective.

The Weighting Function

GloVe's solution is to introduce a weighting function f(X_{ij}) that modulates how much each word pair contributes to the objective. This function is designed to satisfy three requirements:

  1. Zero weight for zero counts: When X_{ij} = 0, set f(0) = 0. This pair simply doesn't contribute. We never try to predict \log(0).

  2. Increasing weight with frequency: Pairs that co-occur more often provide more reliable statistics. The function should increase with X_{ij}, giving more weight to confident observations.

  3. Bounded influence: Extremely common word pairs (like "the" with almost everything) shouldn't completely dominate training. The weight should eventually plateau.

The function that GloVe adopts balances these requirements:

f(x) = \begin{cases} (x / x_{\max})^\alpha & \text{if } x < x_{\max} \\ 1 & \text{otherwise} \end{cases}

where:

  • x: co-occurrence count X_{ij}
  • x_{\max}: cutoff parameter (typically 100)
  • \alpha: exponent (typically 0.75)
In[6]:
def weighting_function(x, x_max=100, alpha=0.75):
    """
    GloVe weighting function.
    
    Gives higher weight to frequent co-occurrences,
    but caps at 1 to prevent very frequent pairs from dominating.
    """
    if x < x_max:
        return (x / x_max) ** alpha
    else:
        return 1.0

# Vectorized version for efficiency
def weighting_vectorized(x, x_max=100, alpha=0.75):
    """Vectorized weighting function."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

# Compute weights for various co-occurrence counts
counts = np.array([1, 5, 10, 25, 50, 100, 500, 1000])
weights = weighting_vectorized(counts)
Out[7]:
GloVe Weighting Function (x_max=100, α=0.75):
---------------------------------------------
  Count X_ij  Weight f(X_ij)
---------------------------------------------
           1          0.0316  
           5          0.1057  ██
          10          0.1778  ███
          25          0.3536  ███████
          50          0.5946  ███████████
         100          1.0000  ████████████████████
         500          1.0000  ████████████████████
       1,000          1.0000  ████████████████████

The weighting shows a clear progression: a count of 1 receives a weight of only about 0.03, while a count of 50 gets roughly 0.59. Once the count reaches 100 (the x_max threshold), the weight caps at 1.0 and stays there for higher counts. This sublinear scaling (\alpha = 0.75) means common word pairs contribute meaningfully to training without completely dominating rare but informative pairs.

Out[8]:
Visualization
Line plot comparing weighting functions with different alpha values.
Effect of the α parameter on the weighting function. Lower α values (0.5) more aggressively suppress rare pairs, while higher values (1.0) approach linear weighting. The default α=0.75 provides a balanced middle ground.
Out[9]:
Visualization
Curve showing weighting function rising from 0 and capping at 1.
GloVe's weighting function f(x). Low co-occurrence counts receive low weight, reducing the influence of noisy rare events. The weight increases sublinearly (exponent 0.75), then caps at 1.0 when count reaches x_max=100. This balances the training signal across different frequency ranges.

The Complete Objective

Combining the core equation with the weighting function:

J = \sum_{i,j=1}^{V} f(X_{ij}) \left( \mathbf{w}_i \cdot \tilde{\mathbf{w}}_j + b_i + \tilde{b}_j - \log(X_{ij}) \right)^2

where:

  • J: objective function to minimize
  • V: vocabulary size
  • f(X_{ij}): weighting function for word pair (i, j)
  • \mathbf{w}_i: word vector for word i
  • \tilde{\mathbf{w}}_j: context vector for word j
  • b_i, \tilde{b}_j: bias terms
  • X_{ij}: co-occurrence count

This is a weighted least squares problem: find vectors and biases that minimize the weighted squared error between predicted and actual log co-occurrences.

GloVe Objective

GloVe minimizes the weighted squared error between \mathbf{w}_i \cdot \tilde{\mathbf{w}}_j + b_i + \tilde{b}_j and \log(X_{ij}), where the weight f(X_{ij}) increases with co-occurrence frequency up to a maximum. The sum runs only over pairs with X_{ij} > 0.

In[10]:
def glove_objective(word_vecs, context_vecs, word_biases, context_biases, 
                    cooccurrence_matrix, x_max=100, alpha=0.75):
    """
    Compute the GloVe objective function value.
    
    Args:
        word_vecs: Word embedding matrix (V x d)
        context_vecs: Context embedding matrix (V x d)
        word_biases: Word bias vector (V,)
        context_biases: Context bias vector (V,)
        cooccurrence_matrix: Sparse or dense co-occurrence matrix (V x V)
        x_max: Weighting function cutoff
        alpha: Weighting function exponent
    
    Returns:
        total_loss: Sum of weighted squared errors
    """
    total_loss = 0.0
    vocab_size = word_vecs.shape[0]
    
    for i in range(vocab_size):
        for j in range(vocab_size):
            x_ij = cooccurrence_matrix[i, j]
            if x_ij > 0:  # Only non-zero entries
                # Compute weight
                weight = weighting_function(x_ij, x_max, alpha)
                
                # Compute prediction
                prediction = (np.dot(word_vecs[i], context_vecs[j]) + 
                             word_biases[i] + context_biases[j])
                
                # Compute error
                error = prediction - np.log(x_ij)
                
                # Accumulate weighted squared error
                total_loss += weight * error ** 2
    
    return total_loss

Building the Co-occurrence Matrix

Before training GloVe, we must construct the co-occurrence matrix from a corpus. This preprocessing step represents a fundamental difference from Word2Vec: while Word2Vec processes training pairs on-the-fly during optimization, GloVe separates statistics gathering from model training. We scan the corpus once to build the co-occurrence matrix, then train on these precomputed counts.

This separation has important implications. The matrix construction phase is embarrassingly parallel, since each document can be processed independently. Once complete, the training phase operates on a fixed set of statistics, making it more predictable and easier to tune. The tradeoff is memory: we must store the entire matrix (though sparse representations help enormously).

Defining Co-occurrence

The core question is: what exactly should we count? A word pair (i, j) "co-occurs" when word j appears within a context window of word i. But not all co-occurrences are equal. GloVe uses distance-weighted counting, where closer words contribute more to the co-occurrence count than distant ones.

The rationale is linguistic: words immediately adjacent typically have stronger relationships than words at the edges of a context window. In the phrase "the quick brown fox," "quick" and "brown" are more closely related than "the" and "fox," even though both pairs fall within a five-word window.

Specifically, if words i and j are separated by d positions, we add 1/d to X_{ij}. Adjacent words (distance 1) contribute a full count of 1.0. Words two positions apart contribute 0.5. At the edge of a window of size 5, the contribution is just 0.2. This inverse-distance weighting encodes the intuition that proximity correlates with semantic relevance.

Out[11]:
Visualization
Bar chart showing how co-occurrence weight decreases with word distance.
Distance-weighted co-occurrence counting. Adjacent words contribute weight 1.0, with contribution decreasing as 1/d for distance d. This reflects the linguistic intuition that nearby words have stronger semantic relationships than distant words within the same context window.
In[12]:
def build_cooccurrence_matrix(corpus, vocab, window_size=5):
    """
    Build a co-occurrence matrix from a corpus.
    
    Uses distance-weighted counting: words closer together
    contribute more to the co-occurrence count.
    
    Args:
        corpus: List of sentences (each sentence is a list of words)
        vocab: Dictionary mapping words to indices
        window_size: Context window size
    
    Returns:
        cooccurrence: Dense co-occurrence matrix (V x V)
    """
    vocab_size = len(vocab)
    cooccurrence = np.zeros((vocab_size, vocab_size), dtype=np.float64)
    
    for sentence in corpus:
        # Convert words to indices, skipping unknown words
        indices = [vocab[w] for w in sentence if w in vocab]
        
        for center_pos, center_idx in enumerate(indices):
            # Look at context words within window
            for offset in range(1, window_size + 1):
                # Weight by inverse distance
                weight = 1.0 / offset
                
                # Left context
                left_pos = center_pos - offset
                if left_pos >= 0:
                    context_idx = indices[left_pos]
                    cooccurrence[center_idx, context_idx] += weight
                
                # Right context
                right_pos = center_pos + offset
                if right_pos < len(indices):
                    context_idx = indices[right_pos]
                    cooccurrence[center_idx, context_idx] += weight
    
    return cooccurrence

# Example corpus
example_sentences = [
    ['the', 'king', 'sits', 'on', 'the', 'throne'],
    ['the', 'queen', 'rules', 'the', 'kingdom'],
    ['the', 'prince', 'and', 'princess', 'live', 'in', 'the', 'palace'],
    ['a', 'man', 'and', 'woman', 'walk', 'together'],
    ['the', 'king', 'and', 'queen', 'wear', 'royal', 'crowns'],
]

# Build vocabulary
all_words = [w for sent in example_sentences for w in sent]
vocab = {w: i for i, w in enumerate(sorted(set(all_words)))}
idx_to_word = {i: w for w, i in vocab.items()}

# Build co-occurrence matrix
cooc_matrix = build_cooccurrence_matrix(example_sentences, vocab, window_size=3)
Out[13]:
Co-occurrence Matrix Statistics:
---------------------------------------------
Vocabulary size: 22
Matrix shape: (22, 22)
Non-zero entries: 113
Sparsity: 76.7%

Sample co-occurrences (top 10 by count):
  (king, the): 2.33
  (the, king): 2.33
  (queen, the): 1.83
  (the, queen): 1.83
  (rules, the): 1.50
  (the, rules): 1.50
  (on, the): 1.33
  (the, on): 1.33
  (a, man): 1.00
  (and, king): 1.00

The matrix shows typical characteristics of word co-occurrence data. Even with this tiny corpus, the sparsity is substantial, as most word pairs never appear together within a context window. The highest co-occurrence counts involve "the," which appears frequently throughout the corpus. The distance-weighted counting produces fractional values: a count of 1.50 indicates two co-occurrences at different distances (e.g., adjacent words contribute 1.0, while words two positions apart contribute 0.5).

Out[14]:
Visualization
Histogram showing the distribution of co-occurrence counts with most values near zero.
Distribution of non-zero co-occurrence counts. The distribution is heavily right-skewed, with most word pairs having low co-occurrence counts while a few high-frequency pairs dominate. This skewed distribution motivates GloVe's weighting function: without it, training would be dominated by a handful of common pairs.
Out[15]:
Visualization
Heatmap of word co-occurrence matrix with color intensity showing count magnitude.
Visualization of the co-occurrence matrix for a small corpus. Brighter cells indicate higher co-occurrence counts. The matrix is approximately symmetric (with small differences due to window boundaries). High co-occurrence between 'king' and 'queen', 'the' and common nouns reflects the corpus structure.

Symmetry and Sparse Storage

The co-occurrence matrix is nearly symmetric: X_{ij} \approx X_{ji}. For undirected context windows (looking both left and right), it's exactly symmetric. In practice, we often symmetrize the matrix by averaging: X'_{ij} = (X_{ij} + X_{ji}) / 2.

For large vocabularies, the co-occurrence matrix is extremely sparse. A 100,000-word vocabulary produces a 10-billion-entry matrix, but most entries are zero. Efficient implementations use sparse matrix formats, storing only non-zero entries.
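The averaging step is a one-liner in NumPy. The sketch below assumes the dense cooc_matrix built earlier; with the symmetric window used there the operation is a no-op, but it matters when counts come from asymmetric windows or from corpus shards counted separately.

# Symmetrize by averaging X_ij and X_ji (sketch; cooc_matrix is the dense matrix from above).
cooc_symmetric = (cooc_matrix + cooc_matrix.T) / 2.0

# How asymmetric was the original? For a symmetric context window this prints 0.0.
print(np.abs(cooc_matrix - cooc_matrix.T).max())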

In[16]:
from scipy import sparse

def build_sparse_cooccurrence(corpus, vocab, window_size=5):
    """
    Build a sparse co-occurrence matrix.
    
    More memory-efficient for large vocabularies.
    """
    from collections import defaultdict
    
    # Accumulate counts in a dictionary
    cooc_counts = defaultdict(float)
    
    for sentence in corpus:
        indices = [vocab[w] for w in sentence if w in vocab]
        
        for center_pos, center_idx in enumerate(indices):
            for offset in range(1, window_size + 1):
                weight = 1.0 / offset
                
                # Left context
                if center_pos - offset >= 0:
                    context_idx = indices[center_pos - offset]
                    cooc_counts[(center_idx, context_idx)] += weight
                
                # Right context
                if center_pos + offset < len(indices):
                    context_idx = indices[center_pos + offset]
                    cooc_counts[(center_idx, context_idx)] += weight
    
    # Convert to sparse matrix
    rows, cols, data = [], [], []
    for (i, j), count in cooc_counts.items():
        rows.append(i)
        cols.append(j)
        data.append(count)
    
    vocab_size = len(vocab)
    return sparse.csr_matrix((data, (rows, cols)), shape=(vocab_size, vocab_size))

sparse_cooc = build_sparse_cooccurrence(example_sentences, vocab, window_size=3)
Out[17]:
Sparse vs Dense Storage:
---------------------------------------------
Dense matrix memory: 3.78 KB
Sparse matrix memory: 1.41 KB
Compression ratio: 2.7x

For this small vocabulary, the sparse format provides modest savings. The real benefit emerges at scale: a 100,000-word vocabulary would require approximately 80 GB for a dense matrix (100,000² × 8 bytes), while the sparse representation stores only non-zero entries, typically a few hundred megabytes. This difference makes large-scale GloVe training feasible on commodity hardware.

Relationship to Matrix Factorization

Stepping back from the implementation details reveals a deeper perspective on what GloVe is doing. The objective function we derived places GloVe squarely in the family of matrix factorization methods, the same family that includes techniques like Singular Value Decomposition (SVD) and Latent Semantic Analysis (LSA). Understanding this connection illuminates both why GloVe works and how it relates to classical dimensionality reduction.

Consider the equation we derived:

\mathbf{w}_i \cdot \tilde{\mathbf{w}}_j + b_i + \tilde{b}_j = \log(X_{ij})

This equation holds for every word pair. If we stack all word vectors into a matrix W and all context vectors into a matrix \tilde{W}, we can write this as a matrix equation. Define M_{ij} = \log(X_{ij}) for non-zero entries. Then GloVe approximately factorizes this log-count matrix:

M \approx W \tilde{W}^T + \mathbf{b} \mathbf{1}^T + \mathbf{1} \tilde{\mathbf{b}}^T

where:

  • M: log co-occurrence matrix where M_{ij} = \log(X_{ij})
  • W: matrix of word vectors (V \times d)
  • \tilde{W}: matrix of context vectors (V \times d)
  • \mathbf{b}: word biases (V \times 1)
  • \tilde{\mathbf{b}}: context biases (V \times 1)
  • \mathbf{1}: vector of ones (V \times 1)

This is a form of weighted, biased matrix factorization. The key insight is dimensional: the original matrix M is V \times V, potentially enormous for large vocabularies. The factorization represents it as the product of much smaller matrices: W and \tilde{W} are both V \times d, where d (typically 50-300) is vastly smaller than V (potentially hundreds of thousands).

This compression is exactly what we want. The low-rank structure forces the model to discover patterns. It can't store the full matrix, so it must learn generalizable representations that explain many co-occurrences with few parameters. The resulting vectors \mathbf{w}_i capture the essential semantic content of words, distilled from millions of co-occurrence observations.

The weighting function f(X_{ij}) makes this a weighted factorization, prioritizing accurate reconstruction of reliable observations. The biases make it biased (in the technical sense), allowing frequency effects to be absorbed separately from semantic content.
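In code, the matrix form of the reconstruction is a single expression. The sketch below uses randomly initialized parameters purely to illustrate the shapes involved; after training, W, W_tilde, and the bias vectors would be the learned values.

# Shape sketch of the factorization view (random parameters, for illustration only).
V, d = 1000, 50
W = np.random.randn(V, d) * 0.01         # word vectors
W_tilde = np.random.randn(V, d) * 0.01   # context vectors
b = np.zeros(V)                           # word biases
b_tilde = np.zeros(V)                     # context biases

# Reconstruct the V x V matrix of predicted log co-occurrences from V x d factors.
M_hat = W @ W_tilde.T + b[:, None] + b_tilde[None, :]
print(M_hat.shape)  # (1000, 1000), described by roughly 2*V*d + 2*V parameters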

Out[18]:
Visualization
Diagram showing matrix factorization M equals W times W-tilde transpose plus biases.
GloVe as matrix factorization. The log co-occurrence matrix M is approximately reconstructed as the product of word and context embedding matrices, plus row and column biases. The embedding dimension d (typically 50-300) is much smaller than vocabulary size V, forcing the model to learn a compressed representation of co-occurrence patterns.

Connection to Classical Methods

This matrix factorization perspective connects GloVe to classical methods like Latent Semantic Analysis (LSA), which factorizes term-document matrices using Singular Value Decomposition (SVD). Key differences:

Aspect        | LSA (SVD)            | GloVe
----------------------------------------------------------------------
Matrix        | Term-document        | Word-word co-occurrence
Transform     | Raw counts or TF-IDF | Log counts
Weighting     | Uniform              | Frequency-based f(x)
Optimization  | Exact SVD            | Stochastic gradient descent
Biases        | None                 | Word and context biases

GloVe inherits the global perspective of matrix factorization while adding neural-network-style training flexibility.

Training GloVe

With the objective function and co-occurrence matrix defined, we can train GloVe using stochastic gradient descent. Unlike neural networks with complex layer compositions, GloVe's gradient computation is straightforward. The objective is a simple weighted sum of squared errors, and each term depends on only four parameters (two vectors and two biases).

The training process iterates through non-zero entries of the co-occurrence matrix, computing gradients and updating parameters. Because each word pair's contribution is independent, the computation parallelizes naturally across CPU cores or GPU threads.

Gradient Derivation

Understanding the gradients illuminates how learning proceeds. For a single word pair (i, j) with co-occurrence count X_{ij}, the contribution to the objective is:

J_{ij} = f(X_{ij}) \left( \mathbf{w}_i \cdot \tilde{\mathbf{w}}_j + b_i + \tilde{b}_j - \log(X_{ij}) \right)^2

To derive the gradients, let's introduce cleaner notation:

  • \hat{y}_{ij} = \mathbf{w}_i \cdot \tilde{\mathbf{w}}_j + b_i + \tilde{b}_j (prediction)
  • y_{ij} = \log(X_{ij}) (target)
  • e_{ij} = \hat{y}_{ij} - y_{ij} (error)

The objective for this pair becomes J_{ij} = f(X_{ij}) \cdot e_{ij}^2. Using the chain rule, the gradient with respect to any parameter \theta is:

\frac{\partial J_{ij}}{\partial \theta} = 2 f(X_{ij}) \cdot e_{ij} \cdot \frac{\partial \hat{y}_{ij}}{\partial \theta}

Since \hat{y}_{ij} = \mathbf{w}_i \cdot \tilde{\mathbf{w}}_j + b_i + \tilde{b}_j, the partial derivatives of the prediction are simple:

  • \frac{\partial \hat{y}_{ij}}{\partial \mathbf{w}_i} = \tilde{\mathbf{w}}_j (the context vector)
  • \frac{\partial \hat{y}_{ij}}{\partial \tilde{\mathbf{w}}_j} = \mathbf{w}_i (the word vector)
  • \frac{\partial \hat{y}_{ij}}{\partial b_i} = 1
  • \frac{\partial \hat{y}_{ij}}{\partial \tilde{b}_j} = 1

Substituting these, we get the complete gradients:

\frac{\partial J_{ij}}{\partial \mathbf{w}_i} = 2 f(X_{ij}) \cdot e_{ij} \cdot \tilde{\mathbf{w}}_j
\frac{\partial J_{ij}}{\partial \tilde{\mathbf{w}}_j} = 2 f(X_{ij}) \cdot e_{ij} \cdot \mathbf{w}_i
\frac{\partial J_{ij}}{\partial b_i} = 2 f(X_{ij}) \cdot e_{ij}
\frac{\partial J_{ij}}{\partial \tilde{b}_j} = 2 f(X_{ij}) \cdot e_{ij}

Notice the symmetry: the gradient for word vectors involves the context vectors, and vice versa. This makes intuitive sense. To improve the prediction for the pair (i, j), we adjust \mathbf{w}_i in the direction of \tilde{\mathbf{w}}_j (or opposite, if we're overshooting the target). The magnitude of the adjustment depends on the error e_{ij} and the weight f(X_{ij}).

The biases have particularly simple gradients: just the weighted error, with no vector component. They act as global adjustments, shifting predictions up or down for all contexts involving word i or context j.
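These formulas can be spot-checked numerically. The sketch below compares the analytic gradient with respect to \mathbf{w}_i against a central finite difference; all names are local to this check, the values are arbitrary, and it reuses the weighting_function defined earlier.

# Finite-difference check of the gradient with respect to w_i (illustrative values).
rng = np.random.default_rng(0)
d = 5
w_i, w_tilde_j = rng.normal(size=d), rng.normal(size=d)
b_i, b_tilde_j = 0.1, -0.2
x_ij = 12.0
f_x = weighting_function(x_ij)  # weighting function defined above

def pair_loss(w):
    """Weighted squared error for this single pair, as a function of w_i."""
    error = w @ w_tilde_j + b_i + b_tilde_j - np.log(x_ij)
    return f_x * error ** 2

# Analytic gradient: 2 * f(X_ij) * e_ij * w_tilde_j
e_ij = w_i @ w_tilde_j + b_i + b_tilde_j - np.log(x_ij)
analytic = 2 * f_x * e_ij * w_tilde_j

# Central finite differences, one coordinate at a time
eps = 1e-6
numeric = np.array([
    (pair_loss(w_i + eps * np.eye(d)[k]) - pair_loss(w_i - eps * np.eye(d)[k])) / (2 * eps)
    for k in range(d)
])
print(np.max(np.abs(analytic - numeric)))  # tiny difference: the formulas match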

In[19]:
class GloVe:
    """
    GloVe implementation for learning word embeddings.
    
    Learns word vectors by factorizing the log co-occurrence matrix
    using weighted least squares.
    """
    
    def __init__(self, vocab_size, embedding_dim, x_max=100, alpha=0.75):
        """
        Initialize GloVe model.
        
        Args:
            vocab_size: Number of words in vocabulary
            embedding_dim: Dimension of word vectors
            x_max: Weighting function cutoff
            alpha: Weighting function exponent
        """
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        self.x_max = x_max
        self.alpha = alpha
        
        # Initialize embeddings randomly
        scale = 1.0 / embedding_dim
        self.W = np.random.uniform(-scale, scale, (vocab_size, embedding_dim))
        self.W_context = np.random.uniform(-scale, scale, (vocab_size, embedding_dim))
        
        # Initialize biases to zero
        self.b = np.zeros(vocab_size)
        self.b_context = np.zeros(vocab_size)
        
        # For AdaGrad
        self.W_grad_sq = np.ones((vocab_size, embedding_dim))
        self.W_context_grad_sq = np.ones((vocab_size, embedding_dim))
        self.b_grad_sq = np.ones(vocab_size)
        self.b_context_grad_sq = np.ones(vocab_size)
    
    def weight(self, x):
        """Compute weighting function f(x)."""
        if x < self.x_max:
            return (x / self.x_max) ** self.alpha
        return 1.0
    
    def train_pair(self, i, j, x_ij, learning_rate=0.05):
        """
        Train on a single (i, j) word pair with co-occurrence x_ij.
        
        Uses AdaGrad for adaptive learning rates.
        
        Returns:
            loss: The weighted squared error for this pair
        """
        # Compute weight
        w = self.weight(x_ij)
        
        # Compute prediction and error
        dot_product = np.dot(self.W[i], self.W_context[j])
        prediction = dot_product + self.b[i] + self.b_context[j]
        target = np.log(x_ij)
        error = prediction - target
        
        # Weighted loss
        loss = w * error ** 2
        
        # Compute gradients
        grad_common = 2 * w * error
        
        grad_W_i = grad_common * self.W_context[j]
        grad_W_context_j = grad_common * self.W[i]
        grad_b_i = grad_common
        grad_b_context_j = grad_common
        
        # AdaGrad updates
        self.W_grad_sq[i] += grad_W_i ** 2
        self.W_context_grad_sq[j] += grad_W_context_j ** 2
        self.b_grad_sq[i] += grad_b_i ** 2
        self.b_context_grad_sq[j] += grad_b_context_j ** 2
        
        # Update parameters
        self.W[i] -= learning_rate * grad_W_i / np.sqrt(self.W_grad_sq[i])
        self.W_context[j] -= learning_rate * grad_W_context_j / np.sqrt(self.W_context_grad_sq[j])
        self.b[i] -= learning_rate * grad_b_i / np.sqrt(self.b_grad_sq[i])
        self.b_context[j] -= learning_rate * grad_b_context_j / np.sqrt(self.b_context_grad_sq[j])
        
        return loss
    
    def get_embedding(self, word_idx):
        """
        Get the embedding for a word.
        
        Following the GloVe paper, we combine word and context vectors.
        """
        return self.W[word_idx] + self.W_context[word_idx]
    
    def most_similar(self, word_idx, top_n=5):
        """Find most similar words by cosine similarity."""
        # Get combined embeddings
        embeddings = self.W + self.W_context
        
        word_vec = embeddings[word_idx]
        word_vec_norm = word_vec / (np.linalg.norm(word_vec) + 1e-10)
        
        # Compute similarities
        norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
        normalized = embeddings / (norms + 1e-10)
        similarities = normalized @ word_vec_norm
        
        # Exclude self
        similarities[word_idx] = -np.inf
        
        # Get top indices
        top_indices = np.argsort(similarities)[::-1][:top_n]
        return [(idx, similarities[idx]) for idx in top_indices]

Training Loop

Training iterates over all non-zero co-occurrence pairs, updating embeddings using gradient descent. The original GloVe implementation uses AdaGrad, an adaptive learning rate method that helps with the highly varying frequencies of word pairs.

In[20]:
def train_glove(model, cooccurrence_matrix, epochs=50, learning_rate=0.05, 
                verbose=True):
    """
    Train a GloVe model on a co-occurrence matrix.
    
    Args:
        model: GloVe model instance
        cooccurrence_matrix: Dense or sparse co-occurrence matrix
        epochs: Number of training epochs
        learning_rate: Initial learning rate for AdaGrad
        verbose: Whether to print progress
    
    Returns:
        losses: List of average losses per epoch
    """
    # Extract non-zero entries for training
    if hasattr(cooccurrence_matrix, 'tocoo'):
        # Sparse matrix
        coo = cooccurrence_matrix.tocoo()
        pairs = list(zip(coo.row, coo.col, coo.data))
    else:
        # Dense matrix
        pairs = []
        for i in range(cooccurrence_matrix.shape[0]):
            for j in range(cooccurrence_matrix.shape[1]):
                if cooccurrence_matrix[i, j] > 0:
                    pairs.append((i, j, cooccurrence_matrix[i, j]))
    
    losses = []
    
    for epoch in range(epochs):
        # Shuffle training pairs
        np.random.shuffle(pairs)
        
        epoch_loss = 0
        for i, j, x_ij in pairs:
            loss = model.train_pair(i, j, x_ij, learning_rate)
            epoch_loss += loss
        
        avg_loss = epoch_loss / len(pairs)
        losses.append(avg_loss)
        
        if verbose and (epoch + 1) % 10 == 0:
            print(f"Epoch {epoch + 1}: loss = {avg_loss:.4f}")
    
    return losses

# Create and train model
np.random.seed(42)
glove_model = GloVe(vocab_size=len(vocab), embedding_dim=20, x_max=100, alpha=0.75)
training_losses = train_glove(glove_model, cooc_matrix, epochs=100, 
                              learning_rate=0.05, verbose=False)
Out[21]:
GloVe Training Complete:
---------------------------------------------
Vocabulary size: 22
Embedding dimension: 20
Non-zero pairs: 113
Training epochs: 100

Initial loss: 0.0072
Final loss: 0.0043
Reduction: 40.3%

The substantial loss reduction indicates the model successfully learned to predict log co-occurrence counts. The weighted least squares objective prioritizes high-frequency pairs, so the final embeddings should capture the dominant co-occurrence patterns in the corpus. The remaining loss reflects both inherent noise in co-occurrence statistics and the capacity limitations of a 20-dimensional embedding space.

Out[22]:
Visualization
Scatter plot comparing predicted and actual log co-occurrence values.
Model predictions vs actual log co-occurrence values. Each point represents a word pair with non-zero co-occurrence. Points near the diagonal indicate accurate predictions. The color intensity shows the GloVe weighting, with higher-weighted pairs (more frequent co-occurrences) clustering more tightly around the diagonal.
Out[23]:
Visualization
Line plot showing decreasing training loss over epochs with rapid initial decline.
GloVe training loss over epochs. The weighted squared error decreases rapidly in early epochs as the model learns to reconstruct log co-occurrences. The loss eventually plateaus as the model converges to a solution that balances prediction accuracy across different frequency ranges.

The Role of Bias Terms

GloVe includes bias terms b_i and \tilde{b}_j in its objective. These aren't just mathematical conveniences; they play a crucial role in capturing word frequency effects.

What Biases Capture

Consider a very frequent word like "the." It co-occurs with nearly every word in the vocabulary, leading to high co-occurrence counts across the board. Without biases, the model would try to explain these high counts through the embedding: "the" would need a large vector that has high dot product with everything.

Biases absorb this frequency effect. The bias b_i captures "how often word i tends to co-occur in general." A word like "the" has a high bias, explaining its high co-occurrence counts without distorting its embedding.

In[24]:
# Analyze learned biases
word_freqs = {w: sum(cooc_matrix[vocab[w], :]) for w in vocab}
sorted_words = sorted(vocab.keys(), key=lambda w: word_freqs[w], reverse=True)

bias_data = []
for w in sorted_words[:10]:
    idx = vocab[w]
    freq = word_freqs[w]
    bias = glove_model.b[idx]
    context_bias = glove_model.b_context[idx]
    bias_data.append((w, freq, bias, context_bias))
Out[25]:
Word Frequency vs Learned Bias:
------------------------------------------------------------
Word           Total Cooc    Word Bias Context Bias
------------------------------------------------------------
the                 15.83       0.2419       0.2413
and                 10.00      -0.2053      -0.2046
queen                6.50      -0.0860      -0.0860
king                 5.67       0.0462       0.0468
live                 3.67      -0.2479      -0.2487
princess             3.67      -0.1334      -0.1331
sits                 3.33      -0.1061      -0.1073
in                   3.33      -0.2005      -0.2004
on                   3.33      -0.1052      -0.1040
wear                 3.33      -0.1829      -0.1845

The most frequent word in the corpus, "the," shows the highest combined bias, absorbing its tendency to co-occur with many different words. With such a tiny corpus the trend is noisy for the remaining words, but the mechanism is the point: biases soak up frequency effects, so the embedding vectors don't need unnaturally large magnitudes to explain high co-occurrence counts, keeping the learned semantic relationships clean and interpretable.

Out[26]:
Visualization
Scatter plot comparing word frequency with embedding vector norms.
Embedding vector norms vs word frequency. Unlike biases, vector norms show weaker correlation with frequency. The biases successfully absorb frequency effects, leaving the embedding geometry relatively clean. This is crucial for downstream tasks where cosine similarity should reflect semantic similarity rather than frequency.

The comparison is striking: biases correlate strongly with word frequency, while embedding norms show a much weaker correlation. This confirms that biases are doing their job, absorbing frequency effects so that the embedding vectors can focus on encoding semantic content.

Out[27]:
Visualization
Scatter plot showing positive correlation between word frequency and bias magnitude.
Relationship between word frequency and learned biases. More frequent words (higher total co-occurrence) tend to have larger biases. This correlation shows that biases successfully absorb frequency effects, preventing them from distorting the embedding geometry.

Combining Word and Context Vectors

GloVe learns two sets of vectors: word embeddings \mathbf{w}_i and context embeddings \tilde{\mathbf{w}}_i. Unlike Word2Vec, where these play asymmetric roles, GloVe's objective is symmetric in i and j. This means \mathbf{w}_i and \tilde{\mathbf{w}}_i carry similar information.

The original GloVe paper recommends combining them:

\mathbf{v}_i = \mathbf{w}_i + \tilde{\mathbf{w}}_i

where:

  • \mathbf{v}_i: final word vector for word i
  • \mathbf{w}_i: word embedding from the word matrix W
  • \tilde{\mathbf{w}}_i: context embedding from the context matrix \tilde{W}

This combination often produces better embeddings than either matrix alone. Intuitively, it averages out noise and captures complementary aspects of word meaning.

In[28]:
# Compare word-only, context-only, and combined embeddings
def cosine_similarity_matrix(embeddings):
    """Compute pairwise cosine similarities."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    normalized = embeddings / (norms + 1e-10)
    return normalized @ normalized.T

# Get different embedding versions
W_only = glove_model.W
W_context_only = glove_model.W_context
W_combined = glove_model.W + glove_model.W_context

# Compute similarity matrices
sim_word = cosine_similarity_matrix(W_only)
sim_context = cosine_similarity_matrix(W_context_only)
sim_combined = cosine_similarity_matrix(W_combined)
Out[29]:
Visualization
Heatmap of pairwise similarities using only word vectors.
Word vectors only (W). Similarity patterns are somewhat noisy.

GloVe vs Word2Vec: Key Differences

GloVe and Word2Vec both produce high-quality word embeddings, but they take fundamentally different approaches. Understanding these differences helps you choose the right method for your application.

Training Paradigm

Aspect     | Word2Vec (Skip-gram)                          | GloVe
------------------------------------------------------------------------------------------------
Approach   | Predictive (neural network)                   | Count-based (matrix factorization)
Input      | Local context windows, one pair at a time     | Global co-occurrence matrix
Objective  | Predict context words from center word        | Reconstruct log co-occurrence counts
Training   | Online (streaming, can process infinite data) | Batch (requires full matrix upfront)

Practical Considerations

Memory usage: GloVe requires storing the co-occurrence matrix (sparse but potentially large), while Word2Vec can stream data from disk.

Training speed: GloVe often converges faster because it uses global statistics. Word2Vec may need multiple passes over the corpus.

Parallelization: GloVe's matrix operations parallelize naturally on GPUs. Word2Vec's sequential updates are harder to parallelize, though practical implementations scale across CPU threads with asynchronous, lock-free (Hogwild-style) updates.

Rare words: Both struggle with rare words, but GloVe's explicit co-occurrence counts make the signal clearer.

Empirical Performance

The original GloVe paper reported competitive results on word analogy and similarity tasks:

Out[30]:
Benchmark Results (Word Analogy Task):
-------------------------------------------------------
Model                 | Semantic | Syntactic | Total
-------------------------------------------------------
Word2Vec Skip-gram    |    ~65% |    ~55%  |  ~60%
GloVe (300d)          |    ~80% |    ~70%  |  ~75%
-------------------------------------------------------

Note: Results vary with corpus size, preprocessing, and hyperparameters.

The GloVe paper showed substantial improvements over Word2Vec on these benchmarks, particularly for semantic analogies. However, subsequent research has demonstrated that careful hyperparameter tuning can close much of this gap. The choice between GloVe and Word2Vec often depends on practical considerations like data availability, memory constraints, and training infrastructure, rather than absolute quality differences.

Out[31]:
Visualization
Side-by-side diagrams comparing sequential Word2Vec training with batch GloVe training.
Conceptual comparison of GloVe and Word2Vec training approaches. Word2Vec (left) processes context windows sequentially, making local predictions. GloVe (right) first computes global co-occurrence statistics, then factorizes the resulting matrix. Both produce semantically meaningful embeddings.

Training GloVe Efficiently

Real-world GloVe training involves millions of words and billions of co-occurrences. Several techniques make this feasible.

Sparse Storage and Iteration

The co-occurrence matrix is extremely sparse. A 100,000-word vocabulary has 10 billion potential entries, but only millions are non-zero. Efficient training iterates only over non-zero entries.

AdaGrad Optimization

GloVe uses AdaGrad, which adapts the learning rate for each parameter based on its historical gradients. Parameters updated frequently (like biases for common words) receive smaller updates, while rare parameters receive larger updates.

\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} \cdot g_t

where:

  • \theta_t: parameter at time t
  • \eta: base learning rate
  • G_t: sum of squared gradients up to time t
  • g_t: current gradient
  • \epsilon: small constant for numerical stability
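As a standalone sketch of this update rule (the GloVe class above folds \epsilon into the accumulator by initializing it to ones; this version keeps it explicit, and the names are illustrative):

# One AdaGrad step for a single parameter array (illustrative sketch).
def adagrad_step(theta, grad, grad_sq_sum, learning_rate=0.05, eps=1e-8):
    """Update theta in place; return the updated squared-gradient accumulator."""
    grad_sq_sum = grad_sq_sum + grad ** 2                        # G_t accumulates g_t^2
    theta -= learning_rate * grad / np.sqrt(grad_sq_sum + eps)   # scaled update
    return grad_sq_sum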

Parallelization

Each co-occurrence pair can be processed independently (up to synchronization of embedding updates). GPU implementations parallelize across thousands of pairs simultaneously.

In[32]:
def train_glove_parallel_ready(pairs, model, learning_rate=0.05, batch_size=512):
    """
    Process training pairs in batches (pseudo-parallel).
    
    In a real implementation, each batch would be processed
    on different GPU cores or distributed across machines.
    """
    np.random.shuffle(pairs)
    total_loss = 0
    
    for start in range(0, len(pairs), batch_size):
        batch = pairs[start:start + batch_size]
        batch_loss = 0
        
        # In parallel, each pair updates independently
        for i, j, x_ij in batch:
            loss = model.train_pair(i, j, x_ij, learning_rate)
            batch_loss += loss
        
        total_loss += batch_loss
    
    return total_loss / len(pairs)

Evaluating GloVe Embeddings

Let's examine what our trained GloVe model has learned by looking at word similarities and the embedding structure.

In[33]:
# Find similar words for test cases
test_words = ['king', 'the', 'and']
similarity_results = {}

for word in test_words:
    if word in vocab:
        similar = glove_model.most_similar(vocab[word], top_n=5)
        similarity_results[word] = [(idx_to_word[idx], sim) for idx, sim in similar]
Out[34]:
GloVe Word Similarities:
--------------------------------------------------

Most similar to 'king':
  the         : +0.401 ██████████
  man         : +0.241 █████████
  rules       : +0.230 █████████
  in          : +0.152 ████████
  together    : +0.134 ████████

Most similar to 'the':
  queen       : +0.549 ███████████
  king        : +0.401 ██████████
  rules       : +0.336 ██████████
  a           : +0.309 █████████
  woman       : +0.301 █████████

Most similar to 'and':
  kingdom     : +0.473 ███████████
  princess    : +0.404 ██████████
  sits        : +0.387 ██████████
  on          : +0.270 █████████
  live        : +0.166 ████████

With only five short sentences, the learned similarities reflect the limited corpus structure rather than general semantic knowledge. Words that frequently co-occur together show higher similarity scores. For larger corpora, GloVe embeddings would capture broader semantic relationships. "King" would be similar to "queen" because both appear in similar royal contexts across thousands of documents, not just in a handful of training sentences.

Out[35]:
Visualization
Scatter plot of word embeddings projected to 2D with labels.
2D PCA projection of GloVe embeddings. Words that frequently co-occur in similar contexts cluster together. The embedding space reflects corpus structure: function words like 'the' and 'and' may cluster together, while content words organize by topic.

Practical Training with Larger Data

Our implementation works for small vocabularies, but real applications require optimization. Here's how production GloVe differs:

Memory-Mapped Files

Large co-occurrence matrices don't fit in memory. Production implementations use memory-mapped files or distributed storage.
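As a rough sketch (the filename, record layout, and helper name below are hypothetical, not part of any standard GloVe tooling), the non-zero triples can be written once as a flat binary file and then memory-mapped during training:

# Hypothetical sketch: stream co-occurrence triples from a memory-mapped file.
# The file 'cooccurrences.bin' and its record layout are assumptions for illustration.
record = np.dtype([('i', np.int32), ('j', np.int32), ('x', np.float64)])

def iterate_pairs(path='cooccurrences.bin', chunk_size=1_000_000):
    """Yield (i, j, x_ij) triples without loading the whole file into RAM."""
    triples = np.memmap(path, dtype=record, mode='r')
    for start in range(0, len(triples), chunk_size):
        for rec in triples[start:start + chunk_size]:
            yield int(rec['i']), int(rec['j']), float(rec['x'])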

Shuffled Iteration

To ensure stable training, iterations over co-occurrence pairs should be shuffled. This prevents the model from seeing all pairs involving "the" consecutively.

Early Stopping

Monitor the loss on a held-out portion of the co-occurrence matrix. Stop when validation loss stops improving.
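A minimal version of this check, assuming the cooc_matrix and glove_model from the earlier examples (the split fraction, improvement threshold, and patience of 5 epochs are illustrative choices):

# Hold out a fraction of non-zero pairs and stop when their loss stops improving.
pairs = [(i, j, cooc_matrix[i, j])
         for i in range(cooc_matrix.shape[0])
         for j in range(cooc_matrix.shape[1])
         if cooc_matrix[i, j] > 0]
np.random.shuffle(pairs)
split = int(0.9 * len(pairs))
train_pairs, val_pairs = pairs[:split], pairs[split:]

def validation_loss(model, val_pairs):
    """Weighted squared error on held-out pairs (no parameter updates)."""
    total = 0.0
    for i, j, x_ij in val_pairs:
        pred = model.W[i] @ model.W_context[j] + model.b[i] + model.b_context[j]
        total += model.weight(x_ij) * (pred - np.log(x_ij)) ** 2
    return total / len(val_pairs)

best, patience = np.inf, 0
for epoch in range(100):
    for i, j, x_ij in train_pairs:
        glove_model.train_pair(i, j, x_ij)
    current = validation_loss(glove_model, val_pairs)
    if current < best - 1e-4:   # meaningful improvement
        best, patience = current, 0
    else:
        patience += 1
        if patience >= 5:       # stop after 5 stagnant epochs
            break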

Vector Dimension

Typical dimensions range from 50-300. Larger dimensions capture more nuance but require more data and compute:

Out[36]:
Recommended Embedding Dimensions:
-------------------------------------------------------
Dimension | Use Case                         | Trade-offs
-------------------------------------------------------
   50     | Mobile/embedded, fast similarity | Less nuanced
  100     | General purpose, good balance    | Standard choice
  200     | Research, downstream tasks       | More expressive
  300     | State-of-the-art benchmarks      | Slower, needs more data
-------------------------------------------------------

The choice of embedding dimension involves a bias-variance tradeoff. Lower dimensions force the model to compress information, which can provide regularization but may miss subtle distinctions. Higher dimensions allow more expressive representations but require more training data to avoid overfitting and more memory for storage and computation.

Limitations and Considerations

GloVe produces high-quality embeddings but has limitations worth understanding:

Static embeddings: Like Word2Vec, GloVe produces one vector per word regardless of context. "Bank" has the same embedding whether referring to a financial institution or a river bank.

Out-of-vocabulary words: GloVe cannot generate embeddings for words not in the training vocabulary. Unlike subword methods (FastText), it has no mechanism for morphological generalization.

Window size sensitivity: The choice of context window affects which co-occurrences are captured. Larger windows capture more topical similarity; smaller windows capture syntactic patterns.

Memory requirements: The co-occurrence matrix, while sparse, can be large. A 400,000-word vocabulary might produce a matrix with billions of non-zero entries.

Corpus bias: Embeddings reflect biases in the training corpus. Associations between professions and genders, for example, are encoded in the vectors.

Summary

GloVe approaches word embeddings from a different angle than Word2Vec. By explicitly factorizing the log co-occurrence matrix with carefully designed weighting, it produces embeddings that encode both local and global corpus statistics.

Key takeaways:

  • Co-occurrence ratios encode meaning: The ratio \frac{P(k \mid w_i)}{P(k \mid w_j)} reveals how a probe word discriminates between targets. GloVe's objective makes word vectors reconstruct these ratios.

  • Weighted least squares: The objective minimizes weighted squared error between \mathbf{w}_i \cdot \tilde{\mathbf{w}}_j + b_i + \tilde{b}_j and \log(X_{ij}), with weights that prioritize frequent co-occurrences.

  • Matrix factorization perspective: GloVe factorizes the log co-occurrence matrix into word and context embeddings plus biases. This connects it to classical methods like LSA.

  • Bias terms absorb frequency: Word and context biases capture overall frequency effects, preventing common words from distorting the embedding geometry.

  • Combined vectors work best: The final embedding is typically \mathbf{w}_i + \tilde{\mathbf{w}}_i, the sum of the word and context vectors learned during training.

  • Competitive with Word2Vec: Despite the different approach, GloVe achieves comparable results on standard benchmarks, with trade-offs in memory usage and training paradigm.

The next chapter explores FastText, which extends Word2Vec with subword information, enabling the model to handle morphologically rich languages and out-of-vocabulary words.

Key Parameters

When training GloVe models, several hyperparameters affect the quality of learned embeddings:

embedding_dim (typical range: 50-300): The dimensionality of word vectors.

  • Lower values (50-100): Faster training, smaller memory footprint. Good for similarity tasks.
  • Higher values (200-300): Captures more nuanced relationships. Better for downstream tasks and analogy completion.
  • Common choice: 100-200 for most applications; 300 for benchmarks.

window_size (typical range: 5-15): Context window for building the co-occurrence matrix.

  • Smaller windows (5-8): Emphasize syntactic relationships.
  • Larger windows (10-15): Capture broader semantic/topical similarity.
  • Common choice: 10-15 for semantic tasks.

x_max (typical value: 100): Cutoff for the weighting function.

  • Co-occurrences above this threshold receive weight 1.0.
  • Lower values give more uniform weighting; higher values let frequent pairs dominate more.
  • Common choice: 100 (from the original paper).

alpha (typical value: 0.75): Exponent in the weighting function.

  • Controls how quickly weight increases with co-occurrence count.
  • Lower values (0.5) more aggressively dampen frequent pairs.
  • Higher values (1.0) approach raw-count weighting.
  • Common choice: 0.75 (from the original paper).

min_count (typical range: 1-100): Minimum word frequency to include in vocabulary.

  • Lower values include rare words but may produce noisy embeddings.
  • Higher values produce more robust embeddings but exclude rare words.
  • Common choice: 5-10 for large corpora.

learning_rate (typical value: 0.05): Initial learning rate for AdaGrad.

  • Higher values speed training but may overshoot.
  • AdaGrad adapts rates per-parameter, so the initial value is less critical than with SGD.
  • Common choice: 0.05.

epochs (typical range: 25-100): Number of passes through the co-occurrence data.

  • Fewer epochs for very large matrices.
  • More epochs for smaller datasets or when loss hasn't converged.
  • Common choice: 50-100.

