Learn how GloVe creates word embeddings by factorizing co-occurrence matrices. Covers the derivation, weighted least squares objective, and Python implementation.

This article is part of the free-to-read Language AI Handbook
GloVe
Word2Vec learns embeddings through local context windows, predicting surrounding words one pair at a time. This approach works well but ignores a fundamental insight: some word relationships are global properties of the corpus. The word "ice" might appear near "cold" millions of times across a billion-word corpus, but Word2Vec treats each occurrence as an independent prediction task. What if we could leverage this global co-occurrence information directly?
GloVe (Global Vectors for Word Representation) takes a different path. Developed by Pennington, Socher, and Manning at Stanford in 2014, GloVe starts with a co-occurrence matrix that captures how often words appear together across the entire corpus. It then factorizes this matrix to produce word vectors. The result: embeddings that encode both local context patterns and global corpus statistics.
This chapter develops GloVe from first principles. We'll see how the objective function emerges from a simple requirement that word vectors encode co-occurrence ratios, work through the weighted least squares formulation, and implement GloVe from scratch. By the end, you'll understand why GloVe achieves comparable results to Word2Vec despite taking a fundamentally different approach.
The Insight: Co-occurrence Ratios Reveal Meaning
GloVe's key insight is simple but powerful: the ratio of co-occurrence probabilities encodes semantic relationships more reliably than raw probabilities. Consider the words "ice" and "steam." Both relate to water, but in different ways. How can we distinguish them?
Let's look at co-occurrence with probe words:
| Probe word $k$ | $P(k \mid \text{ice})$ | $P(k \mid \text{steam})$ | Ratio |
|---|---|---|---|
| solid | high | low | large (>> 1) |
| gas | low | high | small (<< 1) |
| water | high | high | ≈ 1 |
| fashion | low | low | ≈ 1 |
The raw probabilities $P(k \mid \text{ice})$ and $P(k \mid \text{steam})$ depend on many factors: how common each word is, the corpus domain, and so on. But the ratio tells a cleaner story:
- Large ratio: The probe word relates more to "ice" than "steam" (like "solid")
- Small ratio: The probe word relates more to "steam" than "ice" (like "gas")
- Ratio ≈ 1: The probe word relates equally to both (like "water") or to neither (like "fashion")
This ratio invariance is powerful. It factors out corpus-specific biases and isolates the semantic relationship we care about.
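To make this concrete, here's a minimal sketch that computes such ratios directly. The probability values are illustrative stand-ins chosen to mirror the analysis printed below; in practice they would come from corpus co-occurrence counts.

```python
# Illustrative co-occurrence probabilities P(k | target) for two target words.
# These values are assumed for demonstration, not measured from a real corpus.
p_ice = {"solid": 0.000190, "gas": 0.000022, "water": 0.003000, "fashion": 0.000018}
p_steam = {"solid": 0.000022, "gas": 0.000780, "water": 0.002200, "fashion": 0.000018}

print(f"{'Probe':<10}{'P(k|ice)':>12}{'P(k|steam)':>12}{'Ratio':>8}")
for probe in p_ice:
    ratio = p_ice[probe] / p_steam[probe]
    print(f"{probe:<10}{p_ice[probe]:>12.6f}{p_steam[probe]:>12.6f}{ratio:>8.2f}")
```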
Co-occurrence Ratio Analysis: ice vs steam
-------------------------------------------------------
Probe        P(k|ice)    P(k|steam)      Ratio
-------------------------------------------------------
solid        0.000190      0.000022       8.64
gas          0.000022      0.000780       0.03
water        0.003000      0.002200       1.36
fashion      0.000018      0.000018       1.00
The ratios reveal clear discriminative patterns. "Solid" has a ratio far greater than 1 (approximately 8.6), indicating strong association with "ice" rather than "steam." Conversely, "gas" has a ratio well below 1 (approximately 0.03), showing the opposite relationship. Both "water" and "fashion" have ratios near 1, but for different reasons: "water" relates equally to both states, while "fashion" is irrelevant to either.

The ratio of co-occurrence probabilities $P(k \mid i) / P(k \mid j)$ encodes how a probe word $k$ discriminates between target words $i$ and $j$. GloVe's objective function is designed so that word vectors can reconstruct these ratios.
From Ratios to Vectors: Deriving the Objective
We've established that co-occurrence ratios encode semantic relationships. The next question is: how do we design word vectors that naturally capture these ratios? The answer comes through a derivation that starts with a simple requirement and, through a series of logical constraints, arrives at GloVe's objective function.
The derivation unfolds like a detective story. Each constraint eliminates possibilities, narrowing the space of potential solutions until only one sensible answer remains. By the end, the objective function won't feel like an arbitrary choice. It will feel inevitable.
Setting Up the Problem
Our starting point is the co-occurrence matrix $X$. This matrix is the foundation of everything GloVe does. Each entry $X_{ij}$ counts how often word $j$ appears within a context window of word $i$, accumulated across the entire corpus. From these raw counts, we can define probabilities:

$$P(j \mid i) = \frac{X_{ij}}{X_i}, \quad \text{where } X_i = \sum_k X_{ik}$$

where:
- $P(j \mid i)$: probability of word $j$ appearing in the context of word $i$
- $X_{ij}$: co-occurrence count for words $i$ and $j$
- $X_i$: total co-occurrence count for word $i$
Now we can state our goal precisely. We want to learn word vectors $w_i, w_j$ and context vectors $\tilde{w}_k$ such that some function of these vectors recovers the co-occurrence ratio:

$$F(w_i, w_j, \tilde{w}_k) = \frac{P(k \mid i)}{P(k \mid j)}$$

where:
- $w_i, w_j$: word vectors for the target words
- $\tilde{w}_k$: context vector for the probe word
- $F$: function to be determined

This equation captures our key insight: the ratio of co-occurrence probabilities, the same ratio that distinguishes "ice" from "steam" via probe words like "solid" and "gas", should be computable from word vectors alone. The question is: what form must $F$ take?
Constraint 1: Vector Differences Encode Contrasts
The ratio fundamentally measures a contrast: how does word $i$'s relationship with $k$ differ from word $j$'s relationship with $k$? In vector spaces, the natural way to represent contrasts is through subtraction. When we compute $w_i - w_j$, we obtain a vector pointing from $w_j$ toward $w_i$, encoding everything that distinguishes them.
This suggests simplifying our function to depend on the difference:

$$F(w_i - w_j, \tilde{w}_k) = \frac{P(k \mid i)}{P(k \mid j)}$$

Now $F$ takes two inputs: the difference vector $w_i - w_j$ and the context vector $\tilde{w}_k$. This is already more constrained. $F$ doesn't need to handle three arbitrary vectors, just a difference and a context.
Constraint 2: Producing a Scalar from Vectors
Look at the right-hand side: $P(k \mid i) / P(k \mid j)$ is a scalar, a single number. But our inputs are vectors, high-dimensional objects with many components. How do we combine two vectors to produce a single number?
The most natural choice is the dot product. The dot product measures how aligned two vectors are: positive when they point similarly, negative when opposite, zero when perpendicular. It also has mathematical properties that will prove crucial shortly.
This gives us:

$$F\big((w_i - w_j)^\top \tilde{w}_k\big) = \frac{P(k \mid i)}{P(k \mid j)}$$

The function $F$ now operates on a scalar (the dot product) and produces another scalar. We've reduced the problem significantly.
Constraint 3: The Exponential Emerges
Here's where the key insight emerges. Expand the dot product:

$$(w_i - w_j)^\top \tilde{w}_k = w_i^\top \tilde{w}_k - w_j^\top \tilde{w}_k$$
The left side is a difference of dot products. The right side of our equation is a ratio of probabilities. We need a function that transforms differences into ratios.
Think about this algebraically: we need $F$ such that $F(a - b) = \frac{F(a)}{F(b)}$ for scalars $a$ and $b$. This is asking for a homomorphism from addition to multiplication, a function that converts additive structure into multiplicative structure.
There's essentially one continuous function with this property: the exponential. Since $e^{a - b} = \frac{e^a}{e^b}$, the exponential naturally converts differences in the exponent into ratios in the output.
Applying this insight:

$$\exp\big((w_i - w_j)^\top \tilde{w}_k\big) = \frac{P(k \mid i)}{P(k \mid j)}$$

Using the exponential property, we can separate this into:

$$\frac{\exp(w_i^\top \tilde{w}_k)}{\exp(w_j^\top \tilde{w}_k)} = \frac{P(k \mid i)}{P(k \mid j)}$$

For this to hold for all word pairs, each individual term must satisfy:

$$\exp(w_i^\top \tilde{w}_k) = C_i \, P(k \mid i)$$

for some constant $C_i$ that may depend on $i$ but cancels in the ratio. Taking logarithms of both sides:

$$w_i^\top \tilde{w}_k = \log P(k \mid i) + \log C_i$$
Arriving at the Core Equation
We're almost there. Substituting the definition $P(k \mid i) = X_{ik} / X_i$:

$$w_i^\top \tilde{w}_k = \log X_{ik} - \log X_i + \log C_i$$

Now notice something important: $\log X_i$ depends only on word $i$, not on the context word $k$. This term captures how often word $i$ appears in the corpus overall, a frequency effect rather than a semantic relationship. Similarly, a matching term on the context side would depend only on $k$.
The solution is to absorb these word-specific terms into bias terms:
- Let $b_i = \log X_i - \log C_i$, absorbing everything that depends only on word $i$
- Let $\tilde{b}_k$ be a context bias, absorbing anything that depends only on $k$ and restoring symmetry between words and contexts
This yields GloVe's core equation:

$$w_i^\top \tilde{w}_k + b_i + \tilde{b}_k = \log X_{ik}$$
where:
- $w_i$: word vector for word $i$ (captures semantic content)
- $\tilde{w}_k$: context vector for word $k$ (captures contextual role)
- $b_i$: bias term for word $i$ (absorbs overall frequency effects)
- $\tilde{b}_k$: bias term for context word $k$ (absorbs context-specific effects)
- $X_{ik}$: co-occurrence count (the observed data)
This equation has a clear interpretation. The dot product $w_i^\top \tilde{w}_k$ measures the semantic compatibility between word $i$ and context $k$. The biases adjust for how common each word is overall. Together, they should predict the logarithm of how often we actually observe the pair together.
This is GloVe's core equation: the dot product of word and context vectors, plus biases, should equal the log co-occurrence count.

The GloVe Objective Function
We've derived that word vectors should satisfy $w_i^\top \tilde{w}_k + b_i + \tilde{b}_k = \log X_{ik}$. But this is an idealized equation. In practice, no finite-dimensional embedding can perfectly satisfy it for every word pair. We need to frame this as an optimization problem: find the vectors and biases that come as close as possible to satisfying the equation across all pairs.
The journey from ideal equation to practical objective reveals important design decisions. A naive formulation encounters serious problems, and solving them leads to GloVe's distinctive weighted least squares approach.
Naive Least Squares (and Its Problems)
The most straightforward optimization minimizes squared error:

$$J_{\text{naive}} = \sum_{i,j} \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$
This says: for every word pair, measure how far our prediction deviates from the target log co-occurrence, square it (so positive and negative errors contribute equally), and sum across all pairs. Standard least squares.
But this formulation has critical flaws:
- Zero counts are catastrophic: Many word pairs never co-occur in any corpus. "Quantum" and "umbrella" might never appear together. For these pairs, $X_{ij} = 0$, and $\log 0 = -\infty$. The objective becomes undefined.
- Not all pairs deserve equal attention: A co-occurrence count of 1 million reflects a strong, statistically reliable signal. A count of 1 might be noise, a single accidental co-occurrence. Yet naive least squares weights them identically.
- Rare pairs can dominate: If rare word pairs have large errors (which they often do, being noisy), they can disproportionately influence training, pulling embeddings away from configurations that would serve common words well.
These problems demand a more thoughtful objective.
The Weighting Function
GloVe's solution is to introduce a weighting function that modulates how much each word pair contributes to the objective. This function is designed to satisfy three requirements:
- Zero weight for zero counts: When $X_{ij} = 0$, set $f(X_{ij}) = 0$. This pair simply doesn't contribute. We never try to predict $\log 0$.
- Increasing weight with frequency: Pairs that co-occur more often provide more reliable statistics. The function should increase with $X_{ij}$, giving more weight to confident observations.
- Bounded influence: Extremely common word pairs (like "the" with almost everything) shouldn't completely dominate training. The weight should eventually plateau.
The function that GloVe adopts balances these requirements:

$$f(x) = \begin{cases} \left( \dfrac{x}{x_{\max}} \right)^{\alpha} & \text{if } x < x_{\max} \\ 1 & \text{otherwise} \end{cases}$$

where:
- $x$: co-occurrence count $X_{ij}$
- $x_{\max}$: cutoff parameter (typically 100)
- $\alpha$: exponent (typically 0.75)
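Translating this piecewise definition into code is direct. Here's a small sketch (the function name is mine; the defaults follow the typical values above) that reproduces the kind of table shown below:

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting f(x): zero at zero counts, sublinear growth, capped at 1."""
    x = np.asarray(x, dtype=float)
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

for count in [1, 5, 10, 25, 50, 100, 500, 1000]:
    print(f"{count:>6}  {float(glove_weight(count)):.4f}")
```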
GloVe Weighting Function (x_max=100, α=0.75):
---------------------------------------------
Count X_ij Weight f(X_ij)
---------------------------------------------
1 0.0316
5 0.1057 ██
10 0.1778 ███
25 0.3536 ███████
50 0.5946 ███████████
100 1.0000 ████████████████████
500 1.0000 ████████████████████
1,000 1.0000 ████████████████████
The weighting shows clear progression: a count of 1 receives weight 0.03, while a count of 50 gets about 0.59. Once the count reaches 100 (the x_max threshold), the weight caps at 1.0 and stays there for higher counts. This sublinear scaling ($\alpha = 0.75$) means common word pairs contribute meaningfully to training without completely dominating rare but informative pairs.


The Complete Objective
Combining the core equation with the weighting function:

$$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$

where:
- $J$: objective function to minimize
- $V$: vocabulary size
- $f(X_{ij})$: weighting function for word pair $(i, j)$
- $w_i$: word vector for word $i$
- $\tilde{w}_j$: context vector for word $j$
- $b_i, \tilde{b}_j$: bias terms
- $X_{ij}$: co-occurrence count
This is a weighted least squares problem: find vectors and biases that minimize the weighted squared error between predicted and actual log co-occurrences.
GloVe minimizes weighted squared error between $w_i^\top \tilde{w}_j + b_i + \tilde{b}_j$ and $\log X_{ij}$, where the weight increases with co-occurrence frequency up to a maximum. The sum runs only over pairs with $X_{ij} > 0$.
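As a sketch, the objective can be evaluated in a few lines of NumPy. The function below assumes a dense co-occurrence array `X` alongside embedding matrices `W`, `W_ctx` and bias vectors `b`, `b_ctx` (names are mine); it sums only over non-zero entries, exactly as the objective requires.

```python
import numpy as np

def glove_loss(W, W_ctx, b, b_ctx, X, x_max=100.0, alpha=0.75):
    """Weighted least squares objective J, summed over entries with X_ij > 0."""
    i_idx, j_idx = np.nonzero(X)                       # only pairs that co-occur
    x = X[i_idx, j_idx]
    weight = np.where(x < x_max, (x / x_max) ** alpha, 1.0)
    pred = np.sum(W[i_idx] * W_ctx[j_idx], axis=1) + b[i_idx] + b_ctx[j_idx]
    error = pred - np.log(x)
    return float(np.sum(weight * error ** 2))
```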
Building the Co-occurrence Matrix
Before training GloVe, we must construct the co-occurrence matrix from a corpus. This preprocessing step represents a fundamental difference from Word2Vec: while Word2Vec processes training pairs on-the-fly during optimization, GloVe separates statistics gathering from model training. We scan the corpus once to build the co-occurrence matrix, then train on these precomputed counts.
This separation has important implications. The matrix construction phase is embarrassingly parallel, since each document can be processed independently. Once complete, the training phase operates on a fixed set of statistics, making it more predictable and easier to tune. The tradeoff is memory: we must store the entire matrix (though sparse representations help enormously).
Defining Co-occurrence
The core question is: what exactly should we count? A word pair "co-occurs" when word $j$ appears within a context window of word $i$. But not all co-occurrences are equal. GloVe uses distance-weighted counting, where closer words contribute more to the co-occurrence count than distant ones.
The rationale is linguistic: words immediately adjacent typically have stronger relationships than words at the edges of a context window. In the phrase "the quick brown fox," "quick" and "brown" are more closely related than "the" and "fox," even though both pairs fall within a five-word window.
Specifically, if words $i$ and $j$ are separated by $d$ positions, we add $1/d$ to $X_{ij}$. Adjacent words (distance 1) contribute a full count of 1.0. Words two positions apart contribute 0.5. At the edge of a window of size 5, the contribution is just 0.2. This inverse-distance weighting encodes the intuition that proximity correlates with semantic relevance.
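Here's a minimal sketch of distance-weighted counting over a tokenized corpus. The function name, sparse dictionary representation, and default window size are my choices; statistics like those printed below can be derived from the resulting counts.

```python
from collections import defaultdict

def build_cooccurrence(tokenized_docs, window_size=5):
    """Accumulate distance-weighted counts: a pair at distance d adds 1/d to X[i, j]."""
    vocab = {w: i for i, w in enumerate(sorted({t for doc in tokenized_docs for t in doc}))}
    X = defaultdict(float)                          # sparse storage: (i, j) -> weighted count
    for doc in tokenized_docs:
        ids = [vocab[t] for t in doc]
        for center, wi in enumerate(ids):
            for d in range(1, window_size + 1):     # look right; symmetric update covers left
                if center + d >= len(ids):
                    break
                wj = ids[center + d]
                X[(wi, wj)] += 1.0 / d
                X[(wj, wi)] += 1.0 / d
    return X, vocab

# Toy usage: "king" and "the" co-occur at distance 1 and distance 2 -> 1.0 + 0.5 = 1.5
X, vocab = build_cooccurrence([["the", "king", "rules", "the", "kingdom"]])
print(X[(vocab["king"], vocab["the"])])
```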

Co-occurrence Matrix Statistics:
---------------------------------------------
Vocabulary size:    22
Matrix shape:       (22, 22)
Non-zero entries:   113
Sparsity:           76.7%

Sample co-occurrences (top 10 by count):
  (king, the):    2.33
  (the, king):    2.33
  (queen, the):   1.83
  (the, queen):   1.83
  (rules, the):   1.50
  (the, rules):   1.50
  (on, the):      1.33
  (the, on):      1.33
  (a, man):       1.00
  (and, king):    1.00
The matrix shows typical characteristics of word co-occurrence data. Even with this tiny corpus, the sparsity is substantial, as most word pairs never appear together within a context window. The highest co-occurrence counts involve "the," which appears frequently throughout the corpus. The distance-weighted counting produces fractional values: a count of 1.50 indicates two co-occurrences at different distances (e.g., adjacent words contribute 1.0, while words two positions apart contribute 0.5).


Symmetry and Sparse Storage
The co-occurrence matrix is nearly symmetric: $X_{ij} \approx X_{ji}$. For undirected context windows (looking both left and right), it's exactly symmetric. In practice, we often symmetrize the matrix by averaging: $X \leftarrow \frac{1}{2}(X + X^\top)$.
For large vocabularies, the co-occurrence matrix is extremely sparse. A 100,000-word vocabulary produces a 10-billion-entry matrix, but most entries are zero. Efficient implementations use sparse matrix formats, storing only non-zero entries.
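The sketch below compares dense and sparse storage for a toy matrix with the same shape and density as the statistics printed below (22 words, 113 non-zero entries); the exact byte counts depend on the SciPy format and index dtype.

```python
import numpy as np
from scipy.sparse import csr_matrix

V, nnz = 22, 113
rng = np.random.default_rng(0)
dense = np.zeros((V, V))
dense.flat[rng.choice(V * V, size=nnz, replace=False)] = rng.random(nnz)

sparse = csr_matrix(dense)
dense_kb = dense.nbytes / 1024
sparse_kb = (sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes) / 1024
print(f"Dense: {dense_kb:.2f} KB   Sparse: {sparse_kb:.2f} KB   Ratio: {dense_kb / sparse_kb:.1f}x")
```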
Sparse vs Dense Storage:
---------------------------------------------
Dense matrix memory:   3.78 KB
Sparse matrix memory:  1.41 KB
Compression ratio:     2.7x
For this small vocabulary, the sparse format provides modest savings. The real benefit emerges at scale: a 100,000-word vocabulary would require approximately 80 GB for a dense matrix (100,000² × 8 bytes), while the sparse representation stores only non-zero entries, typically a few hundred megabytes. This difference makes large-scale GloVe training feasible on commodity hardware.
Relationship to Matrix Factorization
Stepping back from the implementation details reveals a deeper perspective on what GloVe is doing. The objective function we derived places GloVe squarely in the family of matrix factorization methods, the same family that includes techniques like Singular Value Decomposition (SVD) and Latent Semantic Analysis (LSA). Understanding this connection illuminates both why GloVe works and how it relates to classical dimensionality reduction.
Consider the equation we derived:

$$w_i^\top \tilde{w}_j + b_i + \tilde{b}_j = \log X_{ij}$$

This equation holds for every word pair. If we stack all word vectors into a matrix $W$ and all context vectors into a matrix $\tilde{W}$, we can write this as a matrix equation. Define $M_{ij} = \log X_{ij}$ for non-zero entries. Then GloVe approximately factorizes this log-count matrix:

$$M \approx W \tilde{W}^\top + b\, \mathbf{1}^\top + \mathbf{1}\, \tilde{b}^\top$$

where:
- $M$: log co-occurrence matrix where $M_{ij} = \log X_{ij}$
- $W$: matrix of word vectors ($V \times d$)
- $\tilde{W}$: matrix of context vectors ($V \times d$)
- $b$: word biases ($V \times 1$)
- $\tilde{b}$: context biases ($V \times 1$)
- $\mathbf{1}$: vector of ones ($V \times 1$)
This is a form of weighted, biased matrix factorization. The key insight is dimensional: the original matrix $M$ is $V \times V$, potentially enormous for large vocabularies. The factorization represents it as the product of much smaller matrices: $W$ and $\tilde{W}$ are both $V \times d$, where $d$ (typically 50-300) is vastly smaller than $V$ (potentially hundreds of thousands).
This compression is exactly what we want. The low-rank structure forces the model to discover patterns. It can't store the full matrix, so it must learn generalizable representations that explain many co-occurrences with few parameters. The resulting vectors capture the essential semantic content of words, distilled from millions of co-occurrence observations.
The weighting function $f(X_{ij})$ makes this a weighted factorization, prioritizing accurate reconstruction of reliable observations. The biases make it biased (in the technical sense), allowing frequency effects to be absorbed separately from semantic content.

Connection to Classical Methods
This matrix factorization perspective connects GloVe to classical methods like Latent Semantic Analysis (LSA), which factorizes term-document matrices using Singular Value Decomposition (SVD). Key differences:
| Aspect | LSA (SVD) | GloVe |
|---|---|---|
| Matrix | Term-document | Word-word co-occurrence |
| Transform | Raw counts or TF-IDF | Log counts |
| Weighting | Uniform | Frequency-based |
| Optimization | Exact SVD | Stochastic gradient descent |
| Biases | None | Word and context biases |
GloVe inherits the global perspective of matrix factorization while adding neural-network-style training flexibility.
Training GloVe
With the objective function and co-occurrence matrix defined, we can train GloVe using stochastic gradient descent. Unlike neural networks with complex layer compositions, GloVe's gradient computation is straightforward. The objective is a simple weighted sum of squared errors, and each term depends on only four parameters (two vectors and two biases).
The training process iterates through non-zero entries of the co-occurrence matrix, computing gradients and updating parameters. Because each word pair's contribution is independent, the computation parallelizes naturally across CPU cores or GPU threads.
Gradient Derivation
Understanding the gradients illuminates how learning proceeds. For a single word pair $(i, j)$ with co-occurrence count $X_{ij}$, the contribution to the objective is:

$$J_{ij} = f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$

To derive the gradients, let's introduce cleaner notation:
- $\hat{y}_{ij} = w_i^\top \tilde{w}_j + b_i + \tilde{b}_j$ (prediction)
- $y_{ij} = \log X_{ij}$ (target)
- $e_{ij} = \hat{y}_{ij} - y_{ij}$ (error)
The objective for this pair becomes $J_{ij} = f(X_{ij})\, e_{ij}^2$. Using the chain rule, the gradient with respect to any parameter $\theta$ is:

$$\frac{\partial J_{ij}}{\partial \theta} = 2 f(X_{ij})\, e_{ij}\, \frac{\partial \hat{y}_{ij}}{\partial \theta}$$

Since $\hat{y}_{ij} = w_i^\top \tilde{w}_j + b_i + \tilde{b}_j$, the partial derivatives of the prediction are simple:
- $\partial \hat{y}_{ij} / \partial w_i = \tilde{w}_j$ (the context vector)
- $\partial \hat{y}_{ij} / \partial \tilde{w}_j = w_i$ (the word vector)
Substituting these, we get the complete gradients:

$$\frac{\partial J_{ij}}{\partial w_i} = 2 f(X_{ij})\, e_{ij}\, \tilde{w}_j, \qquad \frac{\partial J_{ij}}{\partial \tilde{w}_j} = 2 f(X_{ij})\, e_{ij}\, w_i, \qquad \frac{\partial J_{ij}}{\partial b_i} = \frac{\partial J_{ij}}{\partial \tilde{b}_j} = 2 f(X_{ij})\, e_{ij}$$
Notice the symmetry: the gradient for word vectors involves the context vectors, and vice versa. This makes intuitive sense. To improve the prediction for the pair $(i, j)$, we adjust $w_i$ in the direction of $\tilde{w}_j$ (or opposite, if we're overshooting the target). The magnitude of the adjustment depends on the error $e_{ij}$ and the weight $f(X_{ij})$.
The biases have particularly simple gradients: just the weighted error, with no vector component. They act as global adjustments, shifting predictions up or down for all contexts involving word $i$ or context $j$.
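These per-pair gradients translate directly into NumPy. The sketch below (variable names are mine) returns the gradients for a single pair, keeping the factor of 2 from the derivation explicit:

```python
import numpy as np

def pair_gradients(w_i, w_ctx_j, b_i, b_ctx_j, x_ij, x_max=100.0, alpha=0.75):
    """Gradients of one weighted squared-error term w.r.t. its four parameters."""
    weight = min((x_ij / x_max) ** alpha, 1.0)
    error = w_i @ w_ctx_j + b_i + b_ctx_j - np.log(x_ij)    # e_ij
    grad_w_i = 2.0 * weight * error * w_ctx_j               # involves the context vector
    grad_w_ctx_j = 2.0 * weight * error * w_i                # involves the word vector
    grad_bias = 2.0 * weight * error                          # same scalar for both biases
    return grad_w_i, grad_w_ctx_j, grad_bias
```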
Training Loop
Training iterates over all non-zero co-occurrence pairs, updating embeddings using gradient descent. The original GloVe implementation uses AdaGrad, an adaptive learning rate method that helps with the highly varying frequencies of word pairs.
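Putting the pieces together, here's a compact from-scratch training loop in the spirit of the original implementation, with AdaGrad-style per-parameter learning rates. The initialization scale, accumulator initialization, and hyperparameter defaults are assumptions for this sketch; it consumes the sparse co-occurrence dictionary from the earlier snippet and can be called as `W, W_ctx, b, b_ctx = train_glove(X, len(vocab))`.

```python
import random
import numpy as np

def train_glove(X, vocab_size, dim=20, epochs=100, lr=0.05, x_max=100.0, alpha=0.75, seed=0):
    """Train GloVe on a sparse co-occurrence dict {(i, j): count} with AdaGrad updates."""
    rng = np.random.default_rng(seed)
    random.seed(seed)
    W = (rng.random((vocab_size, dim)) - 0.5) / dim        # word vectors
    W_ctx = (rng.random((vocab_size, dim)) - 0.5) / dim    # context vectors
    b = np.zeros(vocab_size)                               # word biases
    b_ctx = np.zeros(vocab_size)                           # context biases
    # AdaGrad accumulators, initialized to 1 so the first updates are well-scaled
    gW, gW_ctx = np.ones_like(W), np.ones_like(W_ctx)
    gb, gb_ctx = np.ones_like(b), np.ones_like(b_ctx)

    pairs = list(X.items())
    for epoch in range(epochs):
        random.shuffle(pairs)                              # shuffled iteration over pairs
        total_loss = 0.0
        for (i, j), x_ij in pairs:
            weight = min((x_ij / x_max) ** alpha, 1.0)
            error = W[i] @ W_ctx[j] + b[i] + b_ctx[j] - np.log(x_ij)
            total_loss += weight * error ** 2

            # Gradients (the constant factor of 2 is folded into the learning rate)
            grad_wi = weight * error * W_ctx[j]
            grad_wj = weight * error * W[i]
            grad_b = weight * error

            # AdaGrad: scale each update by the root of the accumulated squared gradients
            W[i] -= lr * grad_wi / np.sqrt(gW[i])
            W_ctx[j] -= lr * grad_wj / np.sqrt(gW_ctx[j])
            b[i] -= lr * grad_b / np.sqrt(gb[i])
            b_ctx[j] -= lr * grad_b / np.sqrt(gb_ctx[j])
            gW[i] += grad_wi ** 2
            gW_ctx[j] += grad_wj ** 2
            gb[i] += grad_b ** 2
            gb_ctx[j] += grad_b ** 2
    return W, W_ctx, b, b_ctx
```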
GloVe Training Complete:
---------------------------------------------
Vocabulary size:      22
Embedding dimension:  20
Non-zero pairs:       113
Training epochs:      100
Initial loss:         0.0072
Final loss:           0.0043
Reduction:            40.3%
The substantial loss reduction indicates the model successfully learned to predict log co-occurrence counts. The weighted least squares objective prioritizes high-frequency pairs, so the final embeddings should capture the dominant co-occurrence patterns in the corpus. The remaining loss reflects both inherent noise in co-occurrence statistics and the capacity limitations of a 20-dimensional embedding space.


The Role of Bias Terms
GloVe includes bias terms $b_i$ and $\tilde{b}_j$ in its objective. These aren't just mathematical conveniences; they play a crucial role in capturing word frequency effects.
What Biases Capture
Consider a very frequent word like "the." It co-occurs with nearly every word in the vocabulary, leading to high co-occurrence counts across the board. Without biases, the model would try to explain these high counts through the embedding: "the" would need a large vector that has high dot product with everything.
Biases absorb this frequency effect. The bias $b_i$ captures "how often word $i$ tends to co-occur in general." A word like "the" has a high bias, explaining its high co-occurrence counts without distorting its embedding.
Word Frequency vs Learned Bias:
------------------------------------------------------------
Word         Total Cooc    Word Bias    Context Bias
------------------------------------------------------------
the               15.83       0.2419          0.2413
and               10.00      -0.2053         -0.2046
queen              6.50      -0.0860         -0.0860
king               5.67       0.0462          0.0468
live               3.67      -0.2479         -0.2487
princess           3.67      -0.1334         -0.1331
sits               3.33      -0.1061         -0.1073
in                 3.33      -0.2005         -0.2004
on                 3.33      -0.1052         -0.1040
wear               3.33      -0.1829         -0.1845
The pattern is clear: words with higher total co-occurrence counts tend to have larger learned biases. The most frequent word in the corpus shows the highest combined bias, absorbing its tendency to co-occur with many different words. This prevents the embedding vectors from needing unnaturally large magnitudes to explain high co-occurrence counts, keeping the learned semantic relationships clean and interpretable.

The comparison is striking. Biases show strong correlation with frequency (left panel), while embedding norms show weaker correlation (right panel). This confirms that biases are doing their job: absorbing frequency effects so that the embedding vectors can focus on encoding semantic content.

Combining Word and Context Vectors
GloVe learns two sets of vectors: word embeddings $W$ and context embeddings $\tilde{W}$. Unlike Word2Vec, where these play asymmetric roles, GloVe's objective is symmetric in $W$ and $\tilde{W}$. This means $w_i$ and $\tilde{w}_i$ carry similar information.
The original GloVe paper recommends combining them:

$$w_i^{\text{final}} = w_i + \tilde{w}_i$$

where:
- $w_i^{\text{final}}$: final word vector for word $i$
- $w_i$: word embedding from the word matrix $W$
- $\tilde{w}_i$: context embedding from the context matrix $\tilde{W}$
This combination often produces better embeddings than either matrix alone. Intuitively, it averages out noise and captures complementary aspects of word meaning.
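Given the two matrices from the training sketch above (the `W` and `W_ctx` names are mine), the combination is a one-line addition:

```python
# Sum the word and context matrices; dividing by 2 to average would only rescale
# the vectors and leave cosine similarities unchanged.
final_embeddings = W + W_ctx
```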

GloVe vs Word2Vec: Key Differences
GloVe and Word2Vec both produce high-quality word embeddings, but they take fundamentally different approaches. Understanding these differences helps you choose the right method for your application.
Training Paradigm
| Aspect | Word2Vec (Skip-gram) | GloVe |
|---|---|---|
| Approach | Predictive (neural network) | Count-based (matrix factorization) |
| Input | Local context windows, one pair at a time | Global co-occurrence matrix |
| Objective | Predict context words from center word | Reconstruct log co-occurrence counts |
| Training | Online (streaming, can process infinite data) | Batch (requires full matrix upfront) |
Practical Considerations
Memory usage: GloVe requires storing the co-occurrence matrix (sparse but potentially large), while Word2Vec can stream data from disk.
Training speed: GloVe often converges faster because it uses global statistics. Word2Vec may need multiple passes over the corpus.
Parallelization: GloVe's matrix operations parallelize naturally on GPUs. Word2Vec's sequential updates are harder to parallelize, though implementations rely on negative sampling and asynchronous, lock-free updates across threads to scale.
Rare words: Both struggle with rare words, but GloVe's explicit co-occurrence counts make the signal clearer.
Empirical Performance
The original GloVe paper reported competitive results on word analogy and similarity tasks:
Benchmark Results (Word Analogy Task):
-------------------------------------------------------
Model                  | Semantic | Syntactic | Total
-------------------------------------------------------
Word2Vec Skip-gram     |   ~65%   |   ~55%    |  ~60%
GloVe (300d)           |   ~80%   |   ~70%    |  ~75%
-------------------------------------------------------
Note: Results vary with corpus size, preprocessing, and hyperparameters.
The GloVe paper showed substantial improvements over Word2Vec on these benchmarks, particularly for semantic analogies. However, subsequent research has demonstrated that careful hyperparameter tuning can close much of this gap. The choice between GloVe and Word2Vec often depends on practical considerations like data availability, memory constraints, and training infrastructure, rather than absolute quality differences.

Training GloVe Efficiently
Real-world GloVe training involves millions of words and billions of co-occurrences. Several techniques make this feasible.
Sparse Storage and Iteration
The co-occurrence matrix is extremely sparse. A 100,000-word vocabulary has 10 billion potential entries, but only millions are non-zero. Efficient training iterates only over non-zero entries.
AdaGrad Optimization
GloVe uses AdaGrad, which adapts the learning rate for each parameter based on its historical gradients. Parameters updated frequently (like biases for common words) receive smaller updates, while rare parameters receive larger updates.
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}}\, g_t$$

where:
- $\theta_t$: parameter at time $t$
- $\eta$: base learning rate
- $G_t$: sum of squared gradients up to time $t$
- $g_t$: current gradient
- $\epsilon$: small constant for numerical stability
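For illustration, a single AdaGrad step for one parameter array might look like this textbook-style sketch (function name and epsilon default are assumptions):

```python
import numpy as np

def adagrad_step(theta, grad, G, lr=0.05, eps=1e-8):
    """Accumulate squared gradients, then scale the update per parameter."""
    G = G + grad ** 2                              # G_t: running sum of squared gradients
    theta = theta - lr * grad / np.sqrt(G + eps)   # larger history -> smaller step
    return theta, G
```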
Parallelization
Each co-occurrence pair can be processed independently (up to synchronization of embedding updates). GPU implementations parallelize across thousands of pairs simultaneously.
Evaluating GloVe Embeddings
Let's examine what our trained GloVe model has learned by looking at word similarities and the embedding structure.
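A small cosine-similarity helper along these lines is enough to generate listings like the ones below, assuming the `final_embeddings` matrix and `vocab` mapping from the earlier snippets:

```python
import numpy as np

def most_similar(word, vocab, embeddings, top_n=5):
    """Rank vocabulary words by cosine similarity to the query word's vector."""
    inv_vocab = {i: w for w, i in vocab.items()}
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = unit @ unit[vocab[word]]
    ranked = np.argsort(-sims)
    return [(inv_vocab[i], float(sims[i])) for i in ranked if i != vocab[word]][:top_n]

# Example: print(most_similar("king", vocab, final_embeddings))
```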
GloVe Word Similarities:
--------------------------------------------------
Most similar to 'king':
  the       : +0.401  ██████████
  man       : +0.241  █████████
  rules     : +0.230  █████████
  in        : +0.152  ████████
  together  : +0.134  ████████

Most similar to 'the':
  queen     : +0.549  ███████████
  king      : +0.401  ██████████
  rules     : +0.336  ██████████
  a         : +0.309  █████████
  woman     : +0.301  █████████

Most similar to 'and':
  kingdom   : +0.473  ███████████
  princess  : +0.404  ██████████
  sits      : +0.387  ██████████
  on        : +0.270  █████████
  live      : +0.166  ████████
With only five short sentences, the learned similarities reflect the limited corpus structure rather than general semantic knowledge. Words that frequently co-occur together show higher similarity scores. For larger corpora, GloVe embeddings would capture broader semantic relationships. "King" would be similar to "queen" because both appear in similar royal contexts across thousands of documents, not just in a handful of training sentences.

Practical Training with Larger Data
Our implementation works for small vocabularies, but real applications require optimization. Here's how production GloVe differs:
Memory-Mapped Files
Large co-occurrence matrices don't fit in memory. Production implementations use memory-mapped files or distributed storage.
Shuffled Iteration
To ensure stable training, iterations over co-occurrence pairs should be shuffled. This prevents the model from seeing all pairs involving "the" consecutively.
Early Stopping
Monitor the loss on a held-out portion of the co-occurrence matrix. Stop when validation loss stops improving.
Vector Dimension
Typical dimensions range from 50-300. Larger dimensions capture more nuance but require more data and compute:
Recommended Embedding Dimensions:
-------------------------------------------------------
Dimension | Use Case                          | Trade-offs
-------------------------------------------------------
50        | Mobile/embedded, fast similarity  | Less nuanced
100       | General purpose, good balance     | Standard choice
200       | Research, downstream tasks        | More expressive
300       | State-of-the-art benchmarks       | Slower, needs more data
-------------------------------------------------------
The choice of embedding dimension involves a bias-variance tradeoff. Lower dimensions force the model to compress information, which can provide regularization but may miss subtle distinctions. Higher dimensions allow more expressive representations but require more training data to avoid overfitting and more memory for storage and computation.
Limitations and Considerations
GloVe produces high-quality embeddings but has limitations worth understanding:
Static embeddings: Like Word2Vec, GloVe produces one vector per word regardless of context. "Bank" has the same embedding whether referring to a financial institution or a river bank.
Out-of-vocabulary words: GloVe cannot generate embeddings for words not in the training vocabulary. Unlike subword methods (FastText), it has no mechanism for morphological generalization.
Window size sensitivity: The choice of context window affects which co-occurrences are captured. Larger windows capture more topical similarity; smaller windows capture syntactic patterns.
Memory requirements: The co-occurrence matrix, while sparse, can be large. A 400,000-word vocabulary might produce a matrix with billions of non-zero entries.
Corpus bias: Embeddings reflect biases in the training corpus. Associations between professions and genders, for example, are encoded in the vectors.
Summary
GloVe approaches word embeddings from a different angle than Word2Vec. By explicitly factorizing the log co-occurrence matrix with carefully designed weighting, it produces embeddings that encode both local and global corpus statistics.
Key takeaways:
- Co-occurrence ratios encode meaning: The ratio $P(k \mid i) / P(k \mid j)$ reveals how a probe word discriminates between targets. GloVe's objective makes word vectors reconstruct these ratios.
- Weighted least squares: The objective minimizes weighted squared error between $w_i^\top \tilde{w}_j + b_i + \tilde{b}_j$ and $\log X_{ij}$, with weights that prioritize frequent co-occurrences.
- Matrix factorization perspective: GloVe factorizes the log co-occurrence matrix into word and context embeddings plus biases. This connects it to classical methods like LSA.
- Bias terms absorb frequency: Word and context biases capture overall frequency effects, preventing common words from distorting the embedding geometry.
- Combined vectors work best: The final embedding is typically $w_i + \tilde{w}_i$, combining the word and context vectors learned during training.
- Competitive with Word2Vec: Despite the different approach, GloVe achieves comparable results on standard benchmarks, with trade-offs in memory usage and training paradigm.
The next chapter explores FastText, which extends Word2Vec with subword information, enabling the model to handle morphologically rich languages and out-of-vocabulary words.
Key Parameters
When training GloVe models, several hyperparameters affect the quality of learned embeddings:
embedding_dim (typical range: 50-300): The dimensionality of word vectors.
- Lower values (50-100): Faster training, smaller memory footprint. Good for similarity tasks.
- Higher values (200-300): Captures more nuanced relationships. Better for downstream tasks and analogy completion.
- Common choice: 100-200 for most applications; 300 for benchmarks.
window_size (typical range: 5-15): Context window for building the co-occurrence matrix.
- Smaller windows (5-8): Emphasize syntactic relationships.
- Larger windows (10-15): Capture broader semantic/topical similarity.
- Common choice: 10-15 for semantic tasks.
x_max (typical value: 100): Cutoff for the weighting function.
- Co-occurrences above this threshold receive weight 1.0.
- Lower values give more uniform weighting; higher values let frequent pairs dominate more.
- Common choice: 100 (from the original paper).
alpha (typical value: 0.75): Exponent in the weighting function.
- Controls how quickly weight increases with co-occurrence count.
- Lower values (0.5) more aggressively dampen frequent pairs.
- Higher values (1.0) approach raw-count weighting.
- Common choice: 0.75 (from the original paper).
min_count (typical range: 1-100): Minimum word frequency to include in vocabulary.
- Lower values include rare words but may produce noisy embeddings.
- Higher values produce more robust embeddings but exclude rare words.
- Common choice: 5-10 for large corpora.
learning_rate (typical value: 0.05): Initial learning rate for AdaGrad.
- Higher values speed training but may overshoot.
- AdaGrad adapts rates per-parameter, so the initial value is less critical than with SGD.
- Common choice: 0.05.
epochs (typical range: 25-100): Number of passes through the co-occurrence data.
- Fewer epochs for very large matrices.
- More epochs for smaller datasets or when loss hasn't converged.
- Common choice: 50-100.
Reference

Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532-1543.