GloVe: Global Vectors for Word Representation

Michael Brenndoerfer · December 11, 2025 · 41 min read · 9,750 words

Learn how GloVe creates word embeddings by factorizing co-occurrence matrices. Covers the derivation, weighted least squares objective, and Python implementation.


GloVe

Word2Vec learns embeddings through local context windows, predicting surrounding words one pair at a time. This approach works well but ignores a fundamental insight: some word relationships are global properties of the corpus. The word "ice" might appear near "cold" millions of times across a billion-word corpus, but Word2Vec treats each occurrence as an independent prediction task. What if we could leverage this global co-occurrence information directly?

GloVe (Global Vectors for Word Representation) takes a different path. Developed by Pennington, Socher, and Manning at Stanford in 2014, GloVe starts with a co-occurrence matrix that captures how often words appear together across the entire corpus. It then factorizes this matrix to produce word vectors. The result: embeddings that encode both local context patterns and global corpus statistics.

This chapter develops GloVe from first principles. We'll see how the objective function emerges from a simple requirement that word vectors encode co-occurrence ratios, work through the weighted least squares formulation, and implement GloVe from scratch. By the end, you'll understand why GloVe achieves comparable results to Word2Vec despite taking a fundamentally different approach.

The Insight: Co-occurrence Ratios Reveal Meaning

GloVe's key insight is simple but powerful: the ratio of co-occurrence probabilities encodes semantic relationships more reliably than raw probabilities. Consider the words "ice" and "steam." Both relate to water, but in different ways. How can we distinguish them?

Let's look at co-occurrence with probe words:

Probe word k | P(k | ice) | P(k | steam) | Ratio P(k | ice) / P(k | steam)
---------------------------------------------------------------------------
solid        | high       | low          | large (≫ 1)
gas          | low        | high         | small (≪ 1)
water        | high       | high         | ≈ 1
fashion      | low        | low          | ≈ 1

The raw probabilities P(k \mid \text{ice}) and P(k \mid \text{steam}) depend on many factors: how common each word is, the corpus domain, and so on. But the ratio tells a cleaner story:

  • Large ratio: The probe word relates more to "ice" than "steam" (like "solid")
  • Small ratio: The probe word relates more to "steam" than "ice" (like "gas")
  • Ratio ≈ 1: The probe word relates equally to both (like "water") or to neither (like "fashion")

This ratio invariance is powerful. It factors out corpus-specific biases and isolates the semantic relationship we care about.

In[2]:
import numpy as np

# Simulated co-occurrence probabilities (illustrative values)
# In practice, these come from corpus statistics
co_occurrence = {
    'ice': {'solid': 0.00019, 'gas': 0.000022, 'water': 0.003, 'fashion': 0.000018},
    'steam': {'solid': 0.000022, 'gas': 0.00078, 'water': 0.0022, 'fashion': 0.000018}
}

def compute_ratio(word1, word2, probe):
    """Compute P(probe | word1) / P(probe | word2)."""
    p1 = co_occurrence[word1][probe]
    p2 = co_occurrence[word2][probe]
    return p1 / p2

probes = ['solid', 'gas', 'water', 'fashion']
ratios = {probe: compute_ratio('ice', 'steam', probe) for probe in probes}
Out[3]:
Co-occurrence Ratio Analysis: ice vs steam
-------------------------------------------------------
Probe            P(k|ice)   P(k|steam)        Ratio
-------------------------------------------------------
solid            0.000190     0.000022         8.64
gas              0.000022     0.000780         0.03
water            0.003000     0.002200         1.36
fashion          0.000018     0.000018         1.00

The ratios reveal clear discriminative patterns. "Solid" has a ratio far greater than 1 (approximately 8.6), indicating strong association with "ice" rather than "steam." Conversely, "gas" has a ratio well below 1 (approximately 0.03), showing the opposite relationship. Both "water" and "fashion" have ratios near 1, but for different reasons: "water" relates equally to both states, while "fashion" is irrelevant to either.

Out[4]:
Visualization
Bar chart showing co-occurrence ratios on log scale for four probe words.
Co-occurrence ratios for probe words distinguishing 'ice' from 'steam'. Ratios above 1 (dashed line) indicate stronger association with 'ice'; ratios below 1 indicate stronger association with 'steam'. The log scale reveals the dramatic difference between discriminative probes (solid, gas) and neutral probes (water, fashion).
Co-occurrence Ratio

The ratio of co-occurrence probabilities \frac{P(k \mid w_i)}{P(k \mid w_j)} encodes how a probe word k discriminates between target words w_i and w_j. GloVe's objective function is designed so that word vectors can reconstruct these ratios.

From Ratios to Vectors: Deriving the Objective

We've established that co-occurrence ratios encode semantic relationships. The next question is: how do we design word vectors that naturally capture these ratios? The answer comes through a derivation that starts with a simple requirement and, through a series of logical constraints, arrives at GloVe's objective function.

The derivation unfolds like a detective story. Each constraint eliminates possibilities, narrowing the space of potential solutions until only one sensible answer remains. By the end, the objective function won't feel like an arbitrary choice. It will feel inevitable.

Setting Up the Problem

Our starting point is the co-occurrence matrix X. This matrix is the foundation of everything GloVe does. Each entry X_{ij} counts how often word j appears within a context window of word i, accumulated across the entire corpus. From these raw counts, we can define probabilities:

P_{ij} = P(j \mid i) = \frac{X_{ij}}{X_i}

where:

  • P_{ij}: probability of word j appearing in the context of word i
  • X_{ij}: co-occurrence count for words i and j
  • X_i = \sum_k X_{ik}: total co-occurrence count for word i

Now we can state our goal precisely. We want to learn word vectors \mathbf{w}_i and context vectors \tilde{\mathbf{w}}_k such that some function F of these vectors recovers the co-occurrence ratio:

F(\mathbf{w}_i, \mathbf{w}_j, \tilde{\mathbf{w}}_k) = \frac{P_{ik}}{P_{jk}}

where:

  • \mathbf{w}_i, \mathbf{w}_j: word vectors for the target words
  • \tilde{\mathbf{w}}_k: context vector for the probe word
  • F: function to be determined

This equation captures our key insight: the ratio of co-occurrence probabilities, the same ratio that distinguishes "ice" from "steam" via probe words like "solid" and "gas", should be computable from word vectors alone. The question is: what form must F take?

Constraint 1: Vector Differences Encode Contrasts

The ratio \frac{P_{ik}}{P_{jk}} fundamentally measures a contrast: how does word i's relationship with k differ from word j's relationship with k? In vector spaces, the natural way to represent contrasts is through subtraction. When we compute \mathbf{w}_i - \mathbf{w}_j, we obtain a vector pointing from j toward i, encoding everything that distinguishes them.

This suggests simplifying our function to depend on the difference:

F((\mathbf{w}_i - \mathbf{w}_j), \tilde{\mathbf{w}}_k) = \frac{P_{ik}}{P_{jk}}

Now F takes two inputs: the difference vector (\mathbf{w}_i - \mathbf{w}_j) and the context vector \tilde{\mathbf{w}}_k. This is already more constrained. F doesn't need to handle three arbitrary vectors, just a difference and a context.

Constraint 2: Producing a Scalar from Vectors

Look at the right-hand side: \frac{P_{ik}}{P_{jk}} is a scalar, a single number. But our inputs are vectors, high-dimensional objects with many components. How do we combine two vectors to produce a single number?

The most natural choice is the dot product. The dot product \mathbf{a} \cdot \mathbf{b} measures how aligned two vectors are: positive when they point similarly, negative when opposite, zero when perpendicular. It also has mathematical properties that will prove crucial shortly.

This gives us:

F((\mathbf{w}_i - \mathbf{w}_j) \cdot \tilde{\mathbf{w}}_k) = \frac{P_{ik}}{P_{jk}}

The function F now operates on a scalar (the dot product) and produces another scalar. We've reduced the problem significantly.

Constraint 3: The Exponential Emerges

Here's where the key insight emerges. Expand the dot product:

(\mathbf{w}_i - \mathbf{w}_j) \cdot \tilde{\mathbf{w}}_k = \mathbf{w}_i \cdot \tilde{\mathbf{w}}_k - \mathbf{w}_j \cdot \tilde{\mathbf{w}}_k

The left side is a difference of dot products. The right side of our equation is a ratio of probabilities. We need a function F that transforms differences into ratios.

Think about this algebraically: we need F such that F(a - b) = F(a)/F(b) for scalars a and b. This is asking for a homomorphism from addition to multiplication, a function that converts additive structure into multiplicative structure.

There's essentially one continuous function with this property: the exponential. Since e^{a-b} = e^a / e^b, the exponential naturally converts differences in the exponent into ratios in the output.

Applying this insight:

\exp(\mathbf{w}_i \cdot \tilde{\mathbf{w}}_k - \mathbf{w}_j \cdot \tilde{\mathbf{w}}_k) = \frac{P_{ik}}{P_{jk}}

Using the exponential property, we can separate this into:

\frac{\exp(\mathbf{w}_i \cdot \tilde{\mathbf{w}}_k)}{\exp(\mathbf{w}_j \cdot \tilde{\mathbf{w}}_k)} = \frac{P_{ik}}{P_{jk}}

For this to hold for all word pairs, each individual term must satisfy:

\exp(\mathbf{w}_i \cdot \tilde{\mathbf{w}}_k) = P_{ik} \cdot C

for some constant C that may depend on k but cancels in the ratio. Taking logarithms of both sides:

\mathbf{w}_i \cdot \tilde{\mathbf{w}}_k = \log(P_{ik}) + \log(C)
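The exponential step above can be sanity-checked numerically. The short sketch below uses arbitrary illustrative vectors (not learned embeddings) to confirm that exponentiating a difference of dot products equals the ratio of the individually exponentiated dot products.

# Quick check: exp turns a difference of dot products into a ratio.
# These vectors are arbitrary, for illustration only.
w_i = np.array([0.2, -0.1, 0.4])
w_j = np.array([-0.3, 0.5, 0.1])
w_tilde_k = np.array([0.6, 0.2, -0.2])

lhs = np.exp((w_i - w_j) @ w_tilde_k)                    # exp of the difference
rhs = np.exp(w_i @ w_tilde_k) / np.exp(w_j @ w_tilde_k)  # ratio of exps
print(lhs, rhs)  # identical up to floating-point error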

Arriving at the Core Equation

We're almost there. Substituting the definition P_{ik} = X_{ik} / X_i:

\mathbf{w}_i \cdot \tilde{\mathbf{w}}_k = \log(X_{ik}) - \log(X_i) + \log(C)

Now notice something important: \log(X_i) depends only on word i, not on the context word k. This term captures how often word i appears in the corpus overall, a frequency effect rather than a semantic relationship. Similarly, \log(C) might depend only on k.

The solution is to absorb these word-specific terms into bias terms:

  • Let b_i = -\log(X_i) + (other word-i-specific terms)
  • Let \tilde{b}_k = \log(C) + (other context-k-specific terms)

This yields GloVe's core equation:

\mathbf{w}_i \cdot \tilde{\mathbf{w}}_k + b_i + \tilde{b}_k = \log(X_{ik})

where:

  • \mathbf{w}_i: word vector for word i (captures semantic content)
  • \tilde{\mathbf{w}}_k: context vector for word k (captures contextual role)
  • b_i: bias term for word i (absorbs overall frequency effects)
  • \tilde{b}_k: bias term for context word k (absorbs context-specific effects)
  • X_{ik}: co-occurrence count (the observed data)

This equation has a clear interpretation. The dot product \mathbf{w}_i \cdot \tilde{\mathbf{w}}_k measures the semantic compatibility between word i and context k. The biases adjust for how common each word is overall. Together, they should predict the logarithm of how often we actually observe the pair together.

This is GloVe's core equation: the dot product of word and context vectors, plus biases, should equal the log co-occurrence count.

Out[5]:
Visualization
Flowchart showing derivation steps from ratio encoding to final objective.
The logical chain of GloVe's derivation. Starting from the requirement that word vectors encode co-occurrence ratios, successive constraints narrow the functional form until we arrive at the weighted least squares objective.

The GloVe Objective Function

We've derived that word vectors should satisfy \mathbf{w}_i \cdot \tilde{\mathbf{w}}_k + b_i + \tilde{b}_k = \log(X_{ik}). But this is an idealized equation. In practice, no finite-dimensional embedding can perfectly satisfy it for every word pair. We need to frame this as an optimization problem: find the vectors and biases that come as close as possible to satisfying the equation across all pairs.

The journey from ideal equation to practical objective reveals important design decisions. A naive formulation encounters serious problems, and solving them leads to GloVe's distinctive weighted least squares approach.

Naive Least Squares (and Its Problems)

The most straightforward optimization minimizes squared error:

J = \sum_{i,j} \left( \mathbf{w}_i \cdot \tilde{\mathbf{w}}_j + b_i + \tilde{b}_j - \log(X_{ij}) \right)^2

This says: for every word pair, measure how far our prediction deviates from the target log co-occurrence, square it (so positive and negative errors contribute equally), and sum across all pairs. Standard least squares.

But this formulation has critical flaws:

  1. Zero counts are catastrophic: Many word pairs never co-occur in any corpus. "Quantum" and "umbrella" might never appear together. For these pairs, X_{ij} = 0, and \log(0) = -\infty. The objective becomes undefined.

  2. Not all pairs deserve equal attention: A co-occurrence count of 1 million reflects a strong, statistically reliable signal. A count of 1 might be noise, a single accidental co-occurrence. Yet naive least squares weights them identically.

  3. Rare pairs can dominate: If rare word pairs have large errors (which they often do, being noisy), they can disproportionately influence training, pulling embeddings away from configurations that would serve common words well.

These problems demand a more thoughtful objective.

The Weighting Function

GloVe's solution is to introduce a weighting function f(X_{ij}) that modulates how much each word pair contributes to the objective. This function is designed to satisfy three requirements:

  1. Zero weight for zero counts: When X_{ij} = 0, set f(0) = 0. This pair simply doesn't contribute. We never try to predict \log(0).

  2. Increasing weight with frequency: Pairs that co-occur more often provide more reliable statistics. The function should increase with X_{ij}, giving more weight to confident observations.

  3. Bounded influence: Extremely common word pairs (like "the" with almost everything) shouldn't completely dominate training. The weight should eventually plateau.

The function that GloVe adopts balances these requirements:

f(x) = \begin{cases} (x / x_{\max})^\alpha & \text{if } x < x_{\max} \\ 1 & \text{otherwise} \end{cases}

where:

  • x: co-occurrence count X_{ij}
  • x_{\max}: cutoff parameter (typically 100)
  • \alpha: exponent (typically 0.75)
In[6]:
def weighting_function(x, x_max=100, alpha=0.75):
    """
    GloVe weighting function.
    
    Gives higher weight to frequent co-occurrences,
    but caps at 1 to prevent very frequent pairs from dominating.
    """
    if x < x_max:
        return (x / x_max) ** alpha
    else:
        return 1.0

# Vectorized version for efficiency
def weighting_vectorized(x, x_max=100, alpha=0.75):
    """Vectorized weighting function."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

# Compute weights for various co-occurrence counts
counts = np.array([1, 5, 10, 25, 50, 100, 500, 1000])
weights = weighting_vectorized(counts)
Out[7]:
GloVe Weighting Function (x_max=100, α=0.75):
---------------------------------------------
  Count X_ij  Weight f(X_ij)
---------------------------------------------
           1          0.0316  
           5          0.1057  ██
          10          0.1778  ███
          25          0.3536  ███████
          50          0.5946  ███████████
         100          1.0000  ████████████████████
         500          1.0000  ████████████████████
       1,000          1.0000  ████████████████████

The weighting shows a clear progression: a count of 1 receives a weight of only about 0.03, while a count of 50 gets roughly 0.59. Once the count reaches 100 (the x_max threshold), the weight caps at 1.0 and stays there for higher counts. This sublinear scaling (\alpha = 0.75) means common word pairs contribute meaningfully to training without completely dominating rare but informative pairs.

Out[8]:
Visualization
Line plot comparing weighting functions with different alpha values.
Effect of the α parameter on the weighting function. Lower α values (0.5) more aggressively suppress rare pairs, while higher values (1.0) approach linear weighting. The default α=0.75 provides a balanced middle ground.
Out[9]:
Visualization
Curve showing weighting function rising from 0 and capping at 1.
GloVe's weighting function f(x). Low co-occurrence counts receive low weight, reducing the influence of noisy rare events. The weight increases sublinearly (exponent 0.75), then caps at 1.0 when count reaches x_max=100. This balances the training signal across different frequency ranges.

The Complete Objective

Combining the core equation with the weighting function:

J = \sum_{i,j=1}^{V} f(X_{ij}) \left( \mathbf{w}_i \cdot \tilde{\mathbf{w}}_j + b_i + \tilde{b}_j - \log(X_{ij}) \right)^2

where:

  • J: objective function to minimize
  • V: vocabulary size
  • f(X_{ij}): weighting function for word pair (i, j)
  • \mathbf{w}_i: word vector for word i
  • \tilde{\mathbf{w}}_j: context vector for word j
  • b_i, \tilde{b}_j: bias terms
  • X_{ij}: co-occurrence count

This is a weighted least squares problem: find vectors and biases that minimize the weighted squared error between predicted and actual log co-occurrences.

GloVe Objective

GloVe minimizes the weighted squared error between \mathbf{w}_i \cdot \tilde{\mathbf{w}}_j + b_i + \tilde{b}_j and \log(X_{ij}), where the weight f(X_{ij}) increases with co-occurrence frequency up to a maximum. The sum runs only over pairs with X_{ij} > 0.

In[10]:
def glove_objective(word_vecs, context_vecs, word_biases, context_biases, 
                    cooccurrence_matrix, x_max=100, alpha=0.75):
    """
    Compute the GloVe objective function value.
    
    Args:
        word_vecs: Word embedding matrix (V x d)
        context_vecs: Context embedding matrix (V x d)
        word_biases: Word bias vector (V,)
        context_biases: Context bias vector (V,)
        cooccurrence_matrix: Sparse or dense co-occurrence matrix (V x V)
        x_max: Weighting function cutoff
        alpha: Weighting function exponent
    
    Returns:
        total_loss: Sum of weighted squared errors
    """
    total_loss = 0.0
    vocab_size = word_vecs.shape[0]
    
    for i in range(vocab_size):
        for j in range(vocab_size):
            x_ij = cooccurrence_matrix[i, j]
            if x_ij > 0:  # Only non-zero entries
                # Compute weight
                weight = weighting_function(x_ij, x_max, alpha)
                
                # Compute prediction
                prediction = (np.dot(word_vecs[i], context_vecs[j]) + 
                             word_biases[i] + context_biases[j])
                
                # Compute error
                error = prediction - np.log(x_ij)
                
                # Accumulate weighted squared error
                total_loss += weight * error ** 2
    
    return total_loss

Building the Co-occurrence Matrix

Before training GloVe, we must construct the co-occurrence matrix from a corpus. This preprocessing step represents a fundamental difference from Word2Vec: while Word2Vec processes training pairs on-the-fly during optimization, GloVe separates statistics gathering from model training. We scan the corpus once to build the co-occurrence matrix, then train on these precomputed counts.

This separation has important implications. The matrix construction phase is embarrassingly parallel, since each document can be processed independently. Once complete, the training phase operates on a fixed set of statistics, making it more predictable and easier to tune. The tradeoff is memory: we must store the entire matrix (though sparse representations help enormously).

Defining Co-occurrence

The core question is: what exactly should we count? A word pair (i, j) "co-occurs" when word j appears within a context window of word i. But not all co-occurrences are equal. GloVe uses distance-weighted counting, where closer words contribute more to the co-occurrence count than distant ones.

The rationale is linguistic: words immediately adjacent typically have stronger relationships than words at the edges of a context window. In the phrase "the quick brown fox," "quick" and "brown" are more closely related than "the" and "fox," even though both pairs fall within a five-word window.

Specifically, if words i and j are separated by d positions, we add 1/d to X_{ij}. Adjacent words (distance 1) contribute a full count of 1.0. Words two positions apart contribute 0.5. At the edge of a window of size 5, the contribution is just 0.2. This inverse-distance weighting encodes the intuition that proximity correlates with semantic relevance.

Out[11]:
Visualization
Bar chart showing how co-occurrence weight decreases with word distance.
Distance-weighted co-occurrence counting. Adjacent words contribute weight 1.0, with contribution decreasing as 1/d for distance d. This reflects the linguistic intuition that nearby words have stronger semantic relationships than distant words within the same context window.
In[12]:
def build_cooccurrence_matrix(corpus, vocab, window_size=5):
    """
    Build a co-occurrence matrix from a corpus.
    
    Uses distance-weighted counting: words closer together
    contribute more to the co-occurrence count.
    
    Args:
        corpus: List of sentences (each sentence is a list of words)
        vocab: Dictionary mapping words to indices
        window_size: Context window size
    
    Returns:
        cooccurrence: Dense co-occurrence matrix (V x V)
    """
    vocab_size = len(vocab)
    cooccurrence = np.zeros((vocab_size, vocab_size), dtype=np.float64)
    
    for sentence in corpus:
        # Convert words to indices, skipping unknown words
        indices = [vocab[w] for w in sentence if w in vocab]
        
        for center_pos, center_idx in enumerate(indices):
            # Look at context words within window
            for offset in range(1, window_size + 1):
                # Weight by inverse distance
                weight = 1.0 / offset
                
                # Left context
                left_pos = center_pos - offset
                if left_pos >= 0:
                    context_idx = indices[left_pos]
                    cooccurrence[center_idx, context_idx] += weight
                
                # Right context
                right_pos = center_pos + offset
                if right_pos < len(indices):
                    context_idx = indices[right_pos]
                    cooccurrence[center_idx, context_idx] += weight
    
    return cooccurrence

# Example corpus
example_sentences = [
    ['the', 'king', 'sits', 'on', 'the', 'throne'],
    ['the', 'queen', 'rules', 'the', 'kingdom'],
    ['the', 'prince', 'and', 'princess', 'live', 'in', 'the', 'palace'],
    ['a', 'man', 'and', 'woman', 'walk', 'together'],
    ['the', 'king', 'and', 'queen', 'wear', 'royal', 'crowns'],
]

# Build vocabulary
all_words = [w for sent in example_sentences for w in sent]
vocab = {w: i for i, w in enumerate(sorted(set(all_words)))}
idx_to_word = {i: w for w, i in vocab.items()}

# Build co-occurrence matrix
cooc_matrix = build_cooccurrence_matrix(example_sentences, vocab, window_size=3)
Out[13]:
Co-occurrence Matrix Statistics:
---------------------------------------------
Vocabulary size: 22
Matrix shape: (22, 22)
Non-zero entries: 113
Sparsity: 76.7%

Sample co-occurrences (top 10 by count):
  (king, the): 2.33
  (the, king): 2.33
  (queen, the): 1.83
  (the, queen): 1.83
  (rules, the): 1.50
  (the, rules): 1.50
  (on, the): 1.33
  (the, on): 1.33
  (a, man): 1.00
  (and, king): 1.00

The matrix shows typical characteristics of word co-occurrence data. Even with this tiny corpus, the sparsity is substantial, as most word pairs never appear together within a context window. The highest co-occurrence counts involve "the," which appears frequently throughout the corpus. The distance-weighted counting produces fractional values: a count of 1.50 indicates two co-occurrences at different distances (e.g., adjacent words contribute 1.0, while words two positions apart contribute 0.5).

Out[14]:
Visualization
Histogram showing the distribution of co-occurrence counts with most values near zero.
Distribution of non-zero co-occurrence counts. The distribution is heavily right-skewed, with most word pairs having low co-occurrence counts while a few high-frequency pairs dominate. This skewed distribution motivates GloVe's weighting function: without it, training would be dominated by a handful of common pairs.
Out[15]:
Visualization
Heatmap of word co-occurrence matrix with color intensity showing count magnitude.
Visualization of the co-occurrence matrix for a small corpus. Brighter cells indicate higher co-occurrence counts. The matrix is approximately symmetric (with small differences due to window boundaries). High co-occurrence between 'king' and 'queen', 'the' and common nouns reflects the corpus structure.

Symmetry and Sparse Storage

The co-occurrence matrix is nearly symmetric: X_{ij} \approx X_{ji}. For undirected context windows (looking both left and right), it's exactly symmetric. In practice, we often symmetrize the matrix by averaging: X'_{ij} = (X_{ij} + X_{ji}) / 2.

For large vocabularies, the co-occurrence matrix is extremely sparse. A 100,000-word vocabulary produces a 10-billion-entry matrix, but most entries are zero. Efficient implementations use sparse matrix formats, storing only non-zero entries.
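The averaging step is a one-liner in NumPy. The sketch below assumes the dense cooc_matrix built earlier; with the symmetric window used there the operation is a no-op, but it matters when counts come from asymmetric windows or from corpus shards counted separately.

# Symmetrize by averaging X_ij and X_ji (sketch; cooc_matrix is the dense matrix from above).
cooc_symmetric = (cooc_matrix + cooc_matrix.T) / 2.0

# How asymmetric was the original? For a symmetric context window this prints 0.0.
print(np.abs(cooc_matrix - cooc_matrix.T).max())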

In[16]:
from scipy import sparse

def build_sparse_cooccurrence(corpus, vocab, window_size=5):
    """
    Build a sparse co-occurrence matrix.
    
    More memory-efficient for large vocabularies.
    """
    from collections import defaultdict
    
    # Accumulate counts in a dictionary
    cooc_counts = defaultdict(float)
    
    for sentence in corpus:
        indices = [vocab[w] for w in sentence if w in vocab]
        
        for center_pos, center_idx in enumerate(indices):
            for offset in range(1, window_size + 1):
                weight = 1.0 / offset
                
                # Left context
                if center_pos - offset >= 0:
                    context_idx = indices[center_pos - offset]
                    cooc_counts[(center_idx, context_idx)] += weight
                
                # Right context
                if center_pos + offset < len(indices):
                    context_idx = indices[center_pos + offset]
                    cooc_counts[(center_idx, context_idx)] += weight
    
    # Convert to sparse matrix
    rows, cols, data = [], [], []
    for (i, j), count in cooc_counts.items():
        rows.append(i)
        cols.append(j)
        data.append(count)
    
    vocab_size = len(vocab)
    return sparse.csr_matrix((data, (rows, cols)), shape=(vocab_size, vocab_size))

sparse_cooc = build_sparse_cooccurrence(example_sentences, vocab, window_size=3)
Out[17]:
Sparse vs Dense Storage:
---------------------------------------------
Dense matrix memory: 3.78 KB
Sparse matrix memory: 1.41 KB
Compression ratio: 2.7x

For this small vocabulary, the sparse format provides modest savings. The real benefit emerges at scale: a 100,000-word vocabulary would require approximately 80 GB for a dense matrix (100,000² × 8 bytes), while the sparse representation stores only non-zero entries, typically a few hundred megabytes. This difference makes large-scale GloVe training feasible on commodity hardware.

Relationship to Matrix Factorization

Stepping back from the implementation details reveals a deeper perspective on what GloVe is doing. The objective function we derived places GloVe squarely in the family of matrix factorization methods, the same family that includes techniques like Singular Value Decomposition (SVD) and Latent Semantic Analysis (LSA). Understanding this connection illuminates both why GloVe works and how it relates to classical dimensionality reduction.

Consider the equation we derived:

\mathbf{w}_i \cdot \tilde{\mathbf{w}}_j + b_i + \tilde{b}_j = \log(X_{ij})

This equation holds for every word pair. If we stack all word vectors into a matrix W and all context vectors into a matrix \tilde{W}, we can write this as a matrix equation. Define M_{ij} = \log(X_{ij}) for non-zero entries. Then GloVe approximately factorizes this log-count matrix:

M \approx W \tilde{W}^T + \mathbf{b} \mathbf{1}^T + \mathbf{1} \tilde{\mathbf{b}}^T

where:

  • M: log co-occurrence matrix where M_{ij} = \log(X_{ij})
  • W: matrix of word vectors (V \times d)
  • \tilde{W}: matrix of context vectors (V \times d)
  • \mathbf{b}: word biases (V \times 1)
  • \tilde{\mathbf{b}}: context biases (V \times 1)
  • \mathbf{1}: vector of ones (V \times 1)

This is a form of weighted, biased matrix factorization. The key insight is dimensional: the original matrix M is V \times V, potentially enormous for large vocabularies. The factorization represents it as the product of much smaller matrices: W and \tilde{W} are both V \times d, where d (typically 50-300) is vastly smaller than V (potentially hundreds of thousands).

This compression is exactly what we want. The low-rank structure forces the model to discover patterns. It can't store the full matrix, so it must learn generalizable representations that explain many co-occurrences with few parameters. The resulting vectors \mathbf{w}_i capture the essential semantic content of words, distilled from millions of co-occurrence observations.

The weighting function f(X_{ij}) makes this a weighted factorization, prioritizing accurate reconstruction of reliable observations. The biases make it biased (in the technical sense), allowing frequency effects to be absorbed separately from semantic content.
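In code, the matrix form of the reconstruction is a single expression. The sketch below uses randomly initialized parameters purely to illustrate the shapes involved; after training, W, W_tilde, and the bias vectors would be the learned values.

# Shape sketch of the factorization view (random parameters, for illustration only).
V, d = 1000, 50
W = np.random.randn(V, d) * 0.01         # word vectors
W_tilde = np.random.randn(V, d) * 0.01   # context vectors
b = np.zeros(V)                           # word biases
b_tilde = np.zeros(V)                     # context biases

# Reconstruct the V x V matrix of predicted log co-occurrences from V x d factors.
M_hat = W @ W_tilde.T + b[:, None] + b_tilde[None, :]
print(M_hat.shape)  # (1000, 1000), described by roughly 2*V*d + 2*V parameters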

Out[18]:
Visualization
Diagram showing matrix factorization M equals W times W-tilde transpose plus biases.
GloVe as matrix factorization. The log co-occurrence matrix M is approximately reconstructed as the product of word and context embedding matrices, plus row and column biases. The embedding dimension d (typically 50-300) is much smaller than vocabulary size V, forcing the model to learn a compressed representation of co-occurrence patterns.

Connection to Classical Methods

This matrix factorization perspective connects GloVe to classical methods like Latent Semantic Analysis (LSA), which factorizes term-document matrices using Singular Value Decomposition (SVD). Key differences:

Aspect        | LSA (SVD)            | GloVe
----------------------------------------------------------------------
Matrix        | Term-document        | Word-word co-occurrence
Transform     | Raw counts or TF-IDF | Log counts
Weighting     | Uniform              | Frequency-based f(x)
Optimization  | Exact SVD            | Stochastic gradient descent
Biases        | None                 | Word and context biases

GloVe inherits the global perspective of matrix factorization while adding neural-network-style training flexibility.

Training GloVe

With the objective function and co-occurrence matrix defined, we can train GloVe using stochastic gradient descent. Unlike neural networks with complex layer compositions, GloVe's gradient computation is straightforward. The objective is a simple weighted sum of squared errors, and each term depends on only four parameters (two vectors and two biases).

The training process iterates through non-zero entries of the co-occurrence matrix, computing gradients and updating parameters. Because each word pair's contribution is independent, the computation parallelizes naturally across CPU cores or GPU threads.

Gradient Derivation

Understanding the gradients illuminates how learning proceeds. For a single word pair (i, j) with co-occurrence count X_{ij}, the contribution to the objective is:

J_{ij} = f(X_{ij}) \left( \mathbf{w}_i \cdot \tilde{\mathbf{w}}_j + b_i + \tilde{b}_j - \log(X_{ij}) \right)^2

To derive the gradients, let's introduce cleaner notation:

  • \hat{y}_{ij} = \mathbf{w}_i \cdot \tilde{\mathbf{w}}_j + b_i + \tilde{b}_j (prediction)
  • y_{ij} = \log(X_{ij}) (target)
  • e_{ij} = \hat{y}_{ij} - y_{ij} (error)

The objective for this pair becomes J_{ij} = f(X_{ij}) \cdot e_{ij}^2. Using the chain rule, the gradient with respect to any parameter \theta is:

\frac{\partial J_{ij}}{\partial \theta} = 2 f(X_{ij}) \cdot e_{ij} \cdot \frac{\partial \hat{y}_{ij}}{\partial \theta}

Since \hat{y}_{ij} = \mathbf{w}_i \cdot \tilde{\mathbf{w}}_j + b_i + \tilde{b}_j, the partial derivatives of the prediction are simple:

  • \frac{\partial \hat{y}_{ij}}{\partial \mathbf{w}_i} = \tilde{\mathbf{w}}_j (the context vector)
  • \frac{\partial \hat{y}_{ij}}{\partial \tilde{\mathbf{w}}_j} = \mathbf{w}_i (the word vector)
  • \frac{\partial \hat{y}_{ij}}{\partial b_i} = 1
  • \frac{\partial \hat{y}_{ij}}{\partial \tilde{b}_j} = 1

Substituting these, we get the complete gradients:

\frac{\partial J_{ij}}{\partial \mathbf{w}_i} = 2 f(X_{ij}) \cdot e_{ij} \cdot \tilde{\mathbf{w}}_j
\frac{\partial J_{ij}}{\partial \tilde{\mathbf{w}}_j} = 2 f(X_{ij}) \cdot e_{ij} \cdot \mathbf{w}_i
\frac{\partial J_{ij}}{\partial b_i} = 2 f(X_{ij}) \cdot e_{ij}
\frac{\partial J_{ij}}{\partial \tilde{b}_j} = 2 f(X_{ij}) \cdot e_{ij}

Notice the symmetry: the gradient for word vectors involves the context vectors, and vice versa. This makes intuitive sense. To improve the prediction for the pair (i, j), we adjust \mathbf{w}_i in the direction of \tilde{\mathbf{w}}_j (or opposite, if we're overshooting the target). The magnitude of the adjustment depends on the error e_{ij} and the weight f(X_{ij}).

The biases have particularly simple gradients: just the weighted error, with no vector component. They act as global adjustments, shifting predictions up or down for all contexts involving word i or context j.
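These formulas can be spot-checked numerically. The sketch below compares the analytic gradient with respect to \mathbf{w}_i against a central finite difference; all names are local to this check, the values are arbitrary, and it reuses the weighting_function defined earlier.

# Finite-difference check of the gradient with respect to w_i (illustrative values).
rng = np.random.default_rng(0)
d = 5
w_i, w_tilde_j = rng.normal(size=d), rng.normal(size=d)
b_i, b_tilde_j = 0.1, -0.2
x_ij = 12.0
f_x = weighting_function(x_ij)  # weighting function defined above

def pair_loss(w):
    """Weighted squared error for this single pair, as a function of w_i."""
    error = w @ w_tilde_j + b_i + b_tilde_j - np.log(x_ij)
    return f_x * error ** 2

# Analytic gradient: 2 * f(X_ij) * e_ij * w_tilde_j
e_ij = w_i @ w_tilde_j + b_i + b_tilde_j - np.log(x_ij)
analytic = 2 * f_x * e_ij * w_tilde_j

# Central finite differences, one coordinate at a time
eps = 1e-6
numeric = np.array([
    (pair_loss(w_i + eps * np.eye(d)[k]) - pair_loss(w_i - eps * np.eye(d)[k])) / (2 * eps)
    for k in range(d)
])
print(np.max(np.abs(analytic - numeric)))  # tiny difference: the formulas match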

In[19]:
class GloVe:
    """
    GloVe implementation for learning word embeddings.
    
    Learns word vectors by factorizing the log co-occurrence matrix
    using weighted least squares.
    """
    
    def __init__(self, vocab_size, embedding_dim, x_max=100, alpha=0.75):
        """
        Initialize GloVe model.
        
        Args:
            vocab_size: Number of words in vocabulary
            embedding_dim: Dimension of word vectors
            x_max: Weighting function cutoff
            alpha: Weighting function exponent
        """
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        self.x_max = x_max
        self.alpha = alpha
        
        # Initialize embeddings randomly
        scale = 1.0 / embedding_dim
        self.W = np.random.uniform(-scale, scale, (vocab_size, embedding_dim))
        self.W_context = np.random.uniform(-scale, scale, (vocab_size, embedding_dim))
        
        # Initialize biases to zero
        self.b = np.zeros(vocab_size)
        self.b_context = np.zeros(vocab_size)
        
        # For AdaGrad
        self.W_grad_sq = np.ones((vocab_size, embedding_dim))
        self.W_context_grad_sq = np.ones((vocab_size, embedding_dim))
        self.b_grad_sq = np.ones(vocab_size)
        self.b_context_grad_sq = np.ones(vocab_size)
    
    def weight(self, x):
        """Compute weighting function f(x)."""
        if x < self.x_max:
            return (x / self.x_max) ** self.alpha
        return 1.0
    
    def train_pair(self, i, j, x_ij, learning_rate=0.05):
        """
        Train on a single (i, j) word pair with co-occurrence x_ij.
        
        Uses AdaGrad for adaptive learning rates.
        
        Returns:
            loss: The weighted squared error for this pair
        """
        # Compute weight
        w = self.weight(x_ij)
        
        # Compute prediction and error
        dot_product = np.dot(self.W[i], self.W_context[j])
        prediction = dot_product + self.b[i] + self.b_context[j]
        target = np.log(x_ij)
        error = prediction - target
        
        # Weighted loss
        loss = w * error ** 2
        
        # Compute gradients
        grad_common = 2 * w * error
        
        grad_W_i = grad_common * self.W_context[j]
        grad_W_context_j = grad_common * self.W[i]
        grad_b_i = grad_common
        grad_b_context_j = grad_common
        
        # AdaGrad updates
        self.W_grad_sq[i] += grad_W_i ** 2
        self.W_context_grad_sq[j] += grad_W_context_j ** 2
        self.b_grad_sq[i] += grad_b_i ** 2
        self.b_context_grad_sq[j] += grad_b_context_j ** 2
        
        # Update parameters
        self.W[i] -= learning_rate * grad_W_i / np.sqrt(self.W_grad_sq[i])
        self.W_context[j] -= learning_rate * grad_W_context_j / np.sqrt(self.W_context_grad_sq[j])
        self.b[i] -= learning_rate * grad_b_i / np.sqrt(self.b_grad_sq[i])
        self.b_context[j] -= learning_rate * grad_b_context_j / np.sqrt(self.b_context_grad_sq[j])
        
        return loss
    
    def get_embedding(self, word_idx):
        """
        Get the embedding for a word.
        
        Following the GloVe paper, we combine word and context vectors.
        """
        return self.W[word_idx] + self.W_context[word_idx]
    
    def most_similar(self, word_idx, top_n=5):
        """Find most similar words by cosine similarity."""
        # Get combined embeddings
        embeddings = self.W + self.W_context
        
        word_vec = embeddings[word_idx]
        word_vec_norm = word_vec / (np.linalg.norm(word_vec) + 1e-10)
        
        # Compute similarities
        norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
        normalized = embeddings / (norms + 1e-10)
        similarities = normalized @ word_vec_norm
        
        # Exclude self
        similarities[word_idx] = -np.inf
        
        # Get top indices
        top_indices = np.argsort(similarities)[::-1][:top_n]
        return [(idx, similarities[idx]) for idx in top_indices]

Training Loop

Training iterates over all non-zero co-occurrence pairs, updating embeddings using gradient descent. The original GloVe implementation uses AdaGrad, an adaptive learning rate method that helps with the highly varying frequencies of word pairs.

In[20]:
def train_glove(model, cooccurrence_matrix, epochs=50, learning_rate=0.05, 
                verbose=True):
    """
    Train a GloVe model on a co-occurrence matrix.
    
    Args:
        model: GloVe model instance
        cooccurrence_matrix: Dense or sparse co-occurrence matrix
        epochs: Number of training epochs
        learning_rate: Initial learning rate for AdaGrad
        verbose: Whether to print progress
    
    Returns:
        losses: List of average losses per epoch
    """
    # Extract non-zero entries for training
    if hasattr(cooccurrence_matrix, 'tocoo'):
        # Sparse matrix
        coo = cooccurrence_matrix.tocoo()
        pairs = list(zip(coo.row, coo.col, coo.data))
    else:
        # Dense matrix
        pairs = []
        for i in range(cooccurrence_matrix.shape[0]):
            for j in range(cooccurrence_matrix.shape[1]):
                if cooccurrence_matrix[i, j] > 0:
                    pairs.append((i, j, cooccurrence_matrix[i, j]))
    
    losses = []
    
    for epoch in range(epochs):
        # Shuffle training pairs
        np.random.shuffle(pairs)
        
        epoch_loss = 0
        for i, j, x_ij in pairs:
            loss = model.train_pair(i, j, x_ij, learning_rate)
            epoch_loss += loss
        
        avg_loss = epoch_loss / len(pairs)
        losses.append(avg_loss)
        
        if verbose and (epoch + 1) % 10 == 0:
            print(f"Epoch {epoch + 1}: loss = {avg_loss:.4f}")
    
    return losses

# Create and train model
np.random.seed(42)
glove_model = GloVe(vocab_size=len(vocab), embedding_dim=20, x_max=100, alpha=0.75)
training_losses = train_glove(glove_model, cooc_matrix, epochs=100, 
                              learning_rate=0.05, verbose=False)
Out[21]:
GloVe Training Complete:
---------------------------------------------
Vocabulary size: 22
Embedding dimension: 20
Non-zero pairs: 113
Training epochs: 100

Initial loss: 0.0072
Final loss: 0.0043
Reduction: 40.3%

The substantial loss reduction indicates the model successfully learned to predict log co-occurrence counts. The weighted least squares objective prioritizes high-frequency pairs, so the final embeddings should capture the dominant co-occurrence patterns in the corpus. The remaining loss reflects both inherent noise in co-occurrence statistics and the capacity limitations of a 20-dimensional embedding space.

Out[22]:
Visualization
Scatter plot comparing predicted and actual log co-occurrence values.
Model predictions vs actual log co-occurrence values. Each point represents a word pair with non-zero co-occurrence. Points near the diagonal indicate accurate predictions. The color intensity shows the GloVe weighting, with higher-weighted pairs (more frequent co-occurrences) clustering more tightly around the diagonal.
Out[23]:
Visualization
Line plot showing decreasing training loss over epochs with rapid initial decline.
GloVe training loss over epochs. The weighted squared error decreases rapidly in early epochs as the model learns to reconstruct log co-occurrences. The loss eventually plateaus as the model converges to a solution that balances prediction accuracy across different frequency ranges.

The Role of Bias Terms

GloVe includes bias terms b_i and \tilde{b}_j in its objective. These aren't just mathematical conveniences; they play a crucial role in capturing word frequency effects.

What Biases Capture

Consider a very frequent word like "the." It co-occurs with nearly every word in the vocabulary, leading to high co-occurrence counts across the board. Without biases, the model would try to explain these high counts through the embedding: "the" would need a large vector that has high dot product with everything.

Biases absorb this frequency effect. The bias b_i captures "how often word i tends to co-occur in general." A word like "the" has a high bias, explaining its high co-occurrence counts without distorting its embedding.

In[24]:
# Analyze learned biases
word_freqs = {w: sum(cooc_matrix[vocab[w], :]) for w in vocab}
sorted_words = sorted(vocab.keys(), key=lambda w: word_freqs[w], reverse=True)

bias_data = []
for w in sorted_words[:10]:
    idx = vocab[w]
    freq = word_freqs[w]
    bias = glove_model.b[idx]
    context_bias = glove_model.b_context[idx]
    bias_data.append((w, freq, bias, context_bias))
Out[25]:
Word Frequency vs Learned Bias:
------------------------------------------------------------
Word           Total Cooc    Word Bias Context Bias
------------------------------------------------------------
the                 15.83       0.2419       0.2413
and                 10.00      -0.2053      -0.2046
queen                6.50      -0.0860      -0.0860
king                 5.67       0.0462       0.0468
live                 3.67      -0.2479      -0.2487
princess             3.67      -0.1334      -0.1331
sits                 3.33      -0.1061      -0.1073
in                   3.33      -0.2005      -0.2004
on                   3.33      -0.1052      -0.1040
wear                 3.33      -0.1829      -0.1845

The most frequent word in the corpus, "the," shows the highest combined bias, absorbing its tendency to co-occur with many different words. With such a tiny corpus the trend is noisy for the remaining words, but the mechanism is the point: biases soak up frequency effects, so the embedding vectors don't need unnaturally large magnitudes to explain high co-occurrence counts, keeping the learned semantic relationships clean and interpretable.

Out[26]:
Visualization
Scatter plot comparing word frequency with embedding vector norms.
Embedding vector norms vs word frequency. Unlike biases, vector norms show weaker correlation with frequency. The biases successfully absorb frequency effects, leaving the embedding geometry relatively clean. This is crucial for downstream tasks where cosine similarity should reflect semantic similarity rather than frequency.

The comparison is striking: biases correlate strongly with word frequency, while embedding norms show a much weaker correlation. This confirms that biases are doing their job, absorbing frequency effects so that the embedding vectors can focus on encoding semantic content.

Out[27]:
Visualization
Scatter plot showing positive correlation between word frequency and bias magnitude.
Relationship between word frequency and learned biases. More frequent words (higher total co-occurrence) tend to have larger biases. This correlation shows that biases successfully absorb frequency effects, preventing them from distorting the embedding geometry.

Combining Word and Context Vectors

GloVe learns two sets of vectors: word embeddings \mathbf{w}_i and context embeddings \tilde{\mathbf{w}}_i. Unlike Word2Vec, where these play asymmetric roles, GloVe's objective is symmetric in i and j. This means \mathbf{w}_i and \tilde{\mathbf{w}}_i carry similar information.

The original GloVe paper recommends combining them:

\mathbf{v}_i = \mathbf{w}_i + \tilde{\mathbf{w}}_i

where:

  • \mathbf{v}_i: final word vector for word i
  • \mathbf{w}_i: word embedding from the word matrix W
  • \tilde{\mathbf{w}}_i: context embedding from the context matrix \tilde{W}

This combination often produces better embeddings than either matrix alone. Intuitively, it averages out noise and captures complementary aspects of word meaning.

In[28]:
# Compare word-only, context-only, and combined embeddings
def cosine_similarity_matrix(embeddings):
    """Compute pairwise cosine similarities."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    normalized = embeddings / (norms + 1e-10)
    return normalized @ normalized.T

# Get different embedding versions
W_only = glove_model.W
W_context_only = glove_model.W_context
W_combined = glove_model.W + glove_model.W_context

# Compute similarity matrices
sim_word = cosine_similarity_matrix(W_only)
sim_context = cosine_similarity_matrix(W_context_only)
sim_combined = cosine_similarity_matrix(W_combined)
Out[29]:
Visualization
Heatmap of pairwise similarities using only word vectors.
Word vectors only (W). Similarity patterns are somewhat noisy.

GloVe vs Word2Vec: Key Differences

GloVe and Word2Vec both produce high-quality word embeddings, but they take fundamentally different approaches. Understanding these differences helps you choose the right method for your application.

Training Paradigm

Aspect     | Word2Vec (Skip-gram)                          | GloVe
------------------------------------------------------------------------------------------------
Approach   | Predictive (neural network)                   | Count-based (matrix factorization)
Input      | Local context windows, one pair at a time     | Global co-occurrence matrix
Objective  | Predict context words from center word        | Reconstruct log co-occurrence counts
Training   | Online (streaming, can process infinite data) | Batch (requires full matrix upfront)

Practical Considerations

Memory usage: GloVe requires storing the co-occurrence matrix (sparse but potentially large), while Word2Vec can stream data from disk.

Training speed: GloVe often converges faster because it uses global statistics. Word2Vec may need multiple passes over the corpus.

Parallelization: GloVe's matrix operations parallelize naturally on GPUs. Word2Vec's sequential updates are harder to parallelize, though practical implementations scale across CPU threads with asynchronous, lock-free (Hogwild-style) updates.

Rare words: Both struggle with rare words, but GloVe's explicit co-occurrence counts make the signal clearer.

Empirical Performance

The original GloVe paper reported competitive results on word analogy and similarity tasks:

Out[30]:
Benchmark Results (Word Analogy Task):
-------------------------------------------------------
Model                 | Semantic | Syntactic | Total
-------------------------------------------------------
Word2Vec Skip-gram    |    ~65% |    ~55%  |  ~60%
GloVe (300d)          |    ~80% |    ~70%  |  ~75%
-------------------------------------------------------

Note: Results vary with corpus size, preprocessing, and hyperparameters.

The GloVe paper showed substantial improvements over Word2Vec on these benchmarks, particularly for semantic analogies. However, subsequent research has demonstrated that careful hyperparameter tuning can close much of this gap. The choice between GloVe and Word2Vec often depends on practical considerations like data availability, memory constraints, and training infrastructure, rather than absolute quality differences.

Out[31]:
Visualization
Side-by-side diagrams comparing sequential Word2Vec training with batch GloVe training.
Conceptual comparison of GloVe and Word2Vec training approaches. Word2Vec (left) processes context windows sequentially, making local predictions. GloVe (right) first computes global co-occurrence statistics, then factorizes the resulting matrix. Both produce semantically meaningful embeddings.

Training GloVe Efficiently

Real-world GloVe training involves millions of words and billions of co-occurrences. Several techniques make this feasible.

Sparse Storage and Iteration

The co-occurrence matrix is extremely sparse. A 100,000-word vocabulary has 10 billion potential entries, but only millions are non-zero. Efficient training iterates only over non-zero entries.

AdaGrad Optimization

GloVe uses AdaGrad, which adapts the learning rate for each parameter based on its historical gradients. Parameters updated frequently (like biases for common words) receive smaller updates, while rare parameters receive larger updates.

\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} \cdot g_t

where:

  • \theta_t: parameter at time t
  • \eta: base learning rate
  • G_t: sum of squared gradients up to time t
  • g_t: current gradient
  • \epsilon: small constant for numerical stability
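As a standalone sketch of this update rule (the GloVe class above folds \epsilon into the accumulator by initializing it to ones; this version keeps it explicit, and the names are illustrative):

# One AdaGrad step for a single parameter array (illustrative sketch).
def adagrad_step(theta, grad, grad_sq_sum, learning_rate=0.05, eps=1e-8):
    """Update theta in place; return the updated squared-gradient accumulator."""
    grad_sq_sum = grad_sq_sum + grad ** 2                        # G_t accumulates g_t^2
    theta -= learning_rate * grad / np.sqrt(grad_sq_sum + eps)   # scaled update
    return grad_sq_sum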

Parallelization

Each co-occurrence pair can be processed independently (up to synchronization of embedding updates). GPU implementations parallelize across thousands of pairs simultaneously.

In[32]:
def train_glove_parallel_ready(pairs, model, learning_rate=0.05, batch_size=512):
    """
    Process training pairs in batches (pseudo-parallel).
    
    In a real implementation, each batch would be processed
    on different GPU cores or distributed across machines.
    """
    np.random.shuffle(pairs)
    total_loss = 0
    
    for start in range(0, len(pairs), batch_size):
        batch = pairs[start:start + batch_size]
        batch_loss = 0
        
        # In parallel, each pair updates independently
        for i, j, x_ij in batch:
            loss = model.train_pair(i, j, x_ij, learning_rate)
            batch_loss += loss
        
        total_loss += batch_loss
    
    return total_loss / len(pairs)

Evaluating GloVe Embeddings

Let's examine what our trained GloVe model has learned by looking at word similarities and the embedding structure.

In[33]:
# Find similar words for test cases
test_words = ['king', 'the', 'and']
similarity_results = {}

for word in test_words:
    if word in vocab:
        similar = glove_model.most_similar(vocab[word], top_n=5)
        similarity_results[word] = [(idx_to_word[idx], sim) for idx, sim in similar]
Out[34]:
GloVe Word Similarities:
--------------------------------------------------

Most similar to 'king':
  the         : +0.401 ██████████
  man         : +0.241 █████████
  rules       : +0.230 █████████
  in          : +0.152 ████████
  together    : +0.134 ████████

Most similar to 'the':
  queen       : +0.549 ███████████
  king        : +0.401 ██████████
  rules       : +0.336 ██████████
  a           : +0.309 █████████
  woman       : +0.301 █████████

Most similar to 'and':
  kingdom     : +0.473 ███████████
  princess    : +0.404 ██████████
  sits        : +0.387 ██████████
  on          : +0.270 █████████
  live        : +0.166 ████████

With only five short sentences, the learned similarities reflect the limited corpus structure rather than general semantic knowledge. Words that frequently co-occur together show higher similarity scores. For larger corpora, GloVe embeddings would capture broader semantic relationships. "King" would be similar to "queen" because both appear in similar royal contexts across thousands of documents, not just in a handful of training sentences.

Out[35]:
Visualization
Scatter plot of word embeddings projected to 2D with labels.
2D PCA projection of GloVe embeddings. Words that frequently co-occur in similar contexts cluster together. The embedding space reflects corpus structure: function words like 'the' and 'and' may cluster together, while content words organize by topic.

Practical Training with Larger Data

Our implementation works for small vocabularies, but real applications require optimization. Here's how production GloVe differs:

Memory-Mapped Files

Large co-occurrence matrices don't fit in memory. Production implementations use memory-mapped files or distributed storage.
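As a rough sketch (the filename, record layout, and helper name below are hypothetical, not part of any standard GloVe tooling), the non-zero triples can be written once as a flat binary file and then memory-mapped during training:

# Hypothetical sketch: stream co-occurrence triples from a memory-mapped file.
# The file 'cooccurrences.bin' and its record layout are assumptions for illustration.
record = np.dtype([('i', np.int32), ('j', np.int32), ('x', np.float64)])

def iterate_pairs(path='cooccurrences.bin', chunk_size=1_000_000):
    """Yield (i, j, x_ij) triples without loading the whole file into RAM."""
    triples = np.memmap(path, dtype=record, mode='r')
    for start in range(0, len(triples), chunk_size):
        for rec in triples[start:start + chunk_size]:
            yield int(rec['i']), int(rec['j']), float(rec['x'])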

Shuffled Iteration

To ensure stable training, iterations over co-occurrence pairs should be shuffled. This prevents the model from seeing all pairs involving "the" consecutively.

Early Stopping

Monitor the loss on a held-out portion of the co-occurrence matrix. Stop when validation loss stops improving.
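A minimal version of this check, assuming the cooc_matrix and glove_model from the earlier examples (the split fraction, improvement threshold, and patience of 5 epochs are illustrative choices):

# Hold out a fraction of non-zero pairs and stop when their loss stops improving.
pairs = [(i, j, cooc_matrix[i, j])
         for i in range(cooc_matrix.shape[0])
         for j in range(cooc_matrix.shape[1])
         if cooc_matrix[i, j] > 0]
np.random.shuffle(pairs)
split = int(0.9 * len(pairs))
train_pairs, val_pairs = pairs[:split], pairs[split:]

def validation_loss(model, val_pairs):
    """Weighted squared error on held-out pairs (no parameter updates)."""
    total = 0.0
    for i, j, x_ij in val_pairs:
        pred = model.W[i] @ model.W_context[j] + model.b[i] + model.b_context[j]
        total += model.weight(x_ij) * (pred - np.log(x_ij)) ** 2
    return total / len(val_pairs)

best, patience = np.inf, 0
for epoch in range(100):
    for i, j, x_ij in train_pairs:
        glove_model.train_pair(i, j, x_ij)
    current = validation_loss(glove_model, val_pairs)
    if current < best - 1e-4:   # meaningful improvement
        best, patience = current, 0
    else:
        patience += 1
        if patience >= 5:       # stop after 5 stagnant epochs
            break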

Vector Dimension

Typical dimensions range from 50-300. Larger dimensions capture more nuance but require more data and compute:

Out[36]:
Recommended Embedding Dimensions:
-------------------------------------------------------
Dimension | Use Case                         | Trade-offs
-------------------------------------------------------
   50     | Mobile/embedded, fast similarity | Less nuanced
  100     | General purpose, good balance    | Standard choice
  200     | Research, downstream tasks       | More expressive
  300     | State-of-the-art benchmarks      | Slower, needs more data
-------------------------------------------------------

The choice of embedding dimension involves a bias-variance tradeoff. Lower dimensions force the model to compress information, which can provide regularization but may miss subtle distinctions. Higher dimensions allow more expressive representations but require more training data to avoid overfitting and more memory for storage and computation.

Limitations and Considerations

GloVe produces high-quality embeddings but has limitations worth understanding:

Static embeddings: Like Word2Vec, GloVe produces one vector per word regardless of context. "Bank" has the same embedding whether referring to a financial institution or a river bank.

Out-of-vocabulary words: GloVe cannot generate embeddings for words not in the training vocabulary. Unlike subword methods (FastText), it has no mechanism for morphological generalization.

Window size sensitivity: The choice of context window affects which co-occurrences are captured. Larger windows capture more topical similarity; smaller windows capture syntactic patterns.

Memory requirements: The co-occurrence matrix, while sparse, can be large. A 400,000-word vocabulary might produce a matrix with billions of non-zero entries.

Corpus bias: Embeddings reflect biases in the training corpus. Associations between professions and genders, for example, are encoded in the vectors.

Summary

GloVe approaches word embeddings from a different angle than Word2Vec. By explicitly factorizing the log co-occurrence matrix with carefully designed weighting, it produces embeddings that encode both local and global corpus statistics.

Key takeaways:

  • Co-occurrence ratios encode meaning: The ratio \frac{P(k \mid w_i)}{P(k \mid w_j)} reveals how a probe word discriminates between targets. GloVe's objective makes word vectors reconstruct these ratios.

  • Weighted least squares: The objective minimizes weighted squared error between \mathbf{w}_i \cdot \tilde{\mathbf{w}}_j + b_i + \tilde{b}_j and \log(X_{ij}), with weights that prioritize frequent co-occurrences.

  • Matrix factorization perspective: GloVe factorizes the log co-occurrence matrix into word and context embeddings plus biases. This connects it to classical methods like LSA.

  • Bias terms absorb frequency: Word and context biases capture overall frequency effects, preventing common words from distorting the embedding geometry.

  • Combined vectors work best: The final embedding is typically \mathbf{w}_i + \tilde{\mathbf{w}}_i, the sum of the word and context vectors learned during training.

  • Competitive with Word2Vec: Despite the different approach, GloVe achieves comparable results on standard benchmarks, with trade-offs in memory usage and training paradigm.

The next chapter explores FastText, which extends Word2Vec with subword information, enabling the model to handle morphologically rich languages and out-of-vocabulary words.

Key Parameters

When training GloVe models, several hyperparameters affect the quality of learned embeddings:

embedding_dim (typical range: 50-300): The dimensionality of word vectors.

  • Lower values (50-100): Faster training, smaller memory footprint. Good for similarity tasks.
  • Higher values (200-300): Captures more nuanced relationships. Better for downstream tasks and analogy completion.
  • Common choice: 100-200 for most applications; 300 for benchmarks.

window_size (typical range: 5-15): Context window for building the co-occurrence matrix.

  • Smaller windows (5-8): Emphasize syntactic relationships.
  • Larger windows (10-15): Capture broader semantic/topical similarity.
  • Common choice: 10-15 for semantic tasks.

x_max (typical value: 100): Cutoff for the weighting function.

  • Co-occurrences above this threshold receive weight 1.0.
  • Lower values give more uniform weighting; higher values let frequent pairs dominate more.
  • Common choice: 100 (from the original paper).

alpha (typical value: 0.75): Exponent in the weighting function.

  • Controls how quickly weight increases with co-occurrence count.
  • Lower values (0.5) more aggressively dampen frequent pairs.
  • Higher values (1.0) approach raw-count weighting.
  • Common choice: 0.75 (from the original paper).

min_count (typical range: 1-100): Minimum word frequency to include in vocabulary.

  • Lower values include rare words but may produce noisy embeddings.
  • Higher values produce more robust embeddings but exclude rare words.
  • Common choice: 5-10 for large corpora.

learning_rate (typical value: 0.05): Initial learning rate for AdaGrad.

  • Higher values speed training but may overshoot.
  • AdaGrad adapts rates per-parameter, so the initial value is less critical than with SGD.
  • Common choice: 0.05.

epochs (typical range: 25-100): Number of passes through the co-occurrence data.

  • Fewer epochs for very large matrices.
  • More epochs for smaller datasets or when loss hasn't converged.
  • Common choice: 50-100.

