Word Embeddings
How do we represent words as numbers? This question has driven NLP research for decades. Early approaches used one-hot encoding: each word gets a unique vector where all values are zero except one position. "cat" might be [1, 0, 0, 0, ...], "dog" might be [0, 1, 0, 0, ...]. This works, but it's fundamentally limited. These vectors tell us nothing about relationships between words. "cat" and "dog" are as similar as "cat" and "airplane" in one-hot space.
Word embeddings solve this by learning dense, low-dimensional vectors where semantically similar words have similar vector representations. Instead of sparse one-hot vectors with thousands of dimensions, we get compact vectors (typically 100-300 dimensions) where words with related meanings cluster together in vector space. The distance between "cat" and "dog" becomes small, while "cat" and "airplane" remain far apart.
The breakthrough came from a simple insight: words that appear in similar contexts tend to have similar meanings. This distributional hypothesis, dating back to linguist J.R. Firth's 1957 observation that "you shall know a word by the company it keeps," became the foundation for modern word embedding methods. Word2Vec and GloVe, the two most influential embedding algorithms, both exploit this principle, but through different mathematical approaches.
Word embeddings transformed NLP by enabling models to capture semantic relationships directly from data. They unlocked transfer learning: embeddings trained on massive text corpora could be reused across tasks, dramatically improving performance on downstream applications with limited training data. Today, word embeddings are foundational to everything from search engines to chatbots, though they've been largely superseded by contextual embeddings from transformers.
The Problem with Traditional Representations
Before word embeddings, NLP systems relied on sparse, high-dimensional representations that couldn't capture semantic relationships. Understanding these limitations helps explain why embeddings represented a significant advance in NLP.
One-Hot Encoding: The Baseline
One-hot encoding represents each word as a binary vector where exactly one position is 1 and all others are 0. If your vocabulary has 10,000 words, each word gets a 10,000-dimensional vector:
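For illustration, here is a minimal sketch with a four-word vocabulary instead of 10,000 (the words and helper names are arbitrary; the idea is identical at any scale):

```python
import numpy as np

vocab = ["cat", "dog", "airplane", "runs"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word, vocab_size):
    """A vector of zeros with a single 1 at the word's index."""
    vec = np.zeros(vocab_size)
    vec[word_to_idx[word]] = 1.0
    return vec

cat = one_hot("cat", len(vocab))
dog = one_hot("dog", len(vocab))
print(cat)        # [1. 0. 0. 0.]
print(cat @ dog)  # 0.0 -- the dot product carries no notion of similarity
```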
Notice that "cat" and "dog" are as different as "cat" and "airplane" in this representation. The dot product between any two different one-hot vectors is always zero, meaning we can't measure similarity. This is a fundamental limitation: one-hot encoding treats all words as equally different.
The Curse of Dimensionality
One-hot encoding also suffers from the curse of dimensionality. With a vocabulary of 50,000 words (typical for English), each word requires a 50,000-dimensional vector. Most of these dimensions are wasted: 99.998% of the vector is zeros. This creates several problems:
- Storage inefficiency: Storing millions of sparse vectors wastes memory
- Computational overhead: Operations on sparse vectors are slower
- No generalization: A model can't learn that "running" and "runs" are related if they're represented as completely orthogonal vectors
The Need for Dense Representations
What we need is a dense representation where:
- Similarity is measurable: Words with related meanings have similar vectors
- Dimensionality is manageable: Compact vectors (100-300 dimensions) instead of thousands
- Relationships are preserved: Analogies like "king - man + woman = queen" emerge naturally
Word embeddings provide exactly this. They learn dense vectors where semantic relationships are encoded in the geometry of the vector space.
A word embedding is a dense, low-dimensional vector representation of a word learned from text data. Unlike one-hot encoding, embeddings capture semantic relationships: words with similar meanings have similar vectors, and relationships like analogies can be expressed through vector arithmetic.
Word2Vec: Learning Embeddings from Local Context
Word2Vec, introduced by Mikolov et al. in 2013, learns word embeddings by predicting words from their local context. The key insight is simple: train a neural network to predict surrounding words, and the learned weights become meaningful word representations.
Word2Vec offers two architectures: Continuous Bag of Words (CBOW) and Skip-gram. Both use shallow neural networks, but they solve inverse problems:
- CBOW: Predicts the center word from surrounding context words
- Skip-gram: Predicts surrounding context words from the center word
Skip-gram typically performs better on rare words and is more commonly used, so we'll focus on it. The principles apply to both.
Building Intuition: From Context to Embeddings
Let's start with a simple observation. When you read the sentence "The quick brown fox jumps over the lazy dog," you immediately understand that "fox" is related to "quick," "brown," "jumps," and "over" because they appear together. This is the distributional hypothesis in action: words that appear in similar contexts tend to have similar meanings.
Word2Vec formalizes this intuition through a prediction task. Instead of asking "what do these words mean?", we ask "can we predict which words appear together?" The key insight is that learning to predict context automatically discovers semantic relationships. Words that appear in similar contexts will need similar internal representations to make accurate predictions, and these representations become our embeddings.
The Skip-gram Architecture: A Simple Prediction Machine
Skip-gram takes a center word and tries to predict words that appear nearby in the text. Consider our example sentence: "The quick brown fox jumps over the lazy dog." If "fox" is our center word and we use a window size of 2, we want to predict: "quick", "brown", "jumps", "over".
The architecture is surprisingly simple, consisting of just three layers:
- Input layer: A one-hot encoded center word (vocabulary size $V$). This is a sparse vector where exactly one position is 1, indicating which word we're currently considering.
- Hidden layer: A linear projection to the embedding dimension $d$ (typically 100-300). When we multiply the one-hot vector by the weight matrix, we're selecting one row of that matrix, which is the embedding vector for our center word.
- Output layer: A softmax over the entire vocabulary to predict which word appears in the context. This gives us a probability distribution over all possible context words.
The key insight is that the weight matrix connecting input to hidden layer becomes our word embeddings. Each row of this matrix is the embedding vector for one word. As the network learns to predict context words accurately, it must learn to place similar words (those with similar contexts) in similar regions of embedding space.
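As a quick sanity check on that claim, here is a tiny NumPy sketch (the dimensions and random initialization are arbitrary) showing that multiplying a one-hot vector by the input weight matrix is just a row lookup:

```python
import numpy as np

V, d = 10, 4                               # toy vocabulary size and embedding dimension
rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(V, d))  # input-to-hidden weights: one row per word

center = 3                                 # index of the center word
x = np.zeros(V)
x[center] = 1.0                            # one-hot input vector

hidden = x @ W_in                          # the "projection" is just a row lookup
assert np.allclose(hidden, W_in[center])   # hidden layer == embedding of the center word
```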
Deriving the Objective Function: From Intuition to Mathematics
Now let's formalize this intuition into a mathematical objective. We want to maximize the probability of observing the actual context words given our center word. For a center word $w_t$ at position $t$ and a context window of size $m$, we observe the context words $w_{t-m}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+m}$.

Our goal is to maximize:

$$P(w_{t-m}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+m} \mid w_t)$$

This probability tells us: "Given that we see word $w_t$ as the center word, how likely are we to see these specific context words around it?" To make this tractable, we make a simplifying assumption: context words are independent given the center word. This means the probability of seeing "quick" and "brown" around "fox" is the product of the probabilities of seeing each individually:

$$P(w_{t-m}, \ldots, w_{t+m} \mid w_t) = \prod_{-m \le j \le m,\ j \ne 0} P(w_{t+j} \mid w_t)$$
where:
- $w_t$: the center word at position $t$ in the corpus
- $w_{t+j}$: a context word at position $t+j$ (within the window around $w_t$)
- $m$: the context window size (typically 5-10 words on each side)
- $P(w_{t+j} \mid w_t)$: the conditional probability of observing context word $w_{t+j}$ given center word $w_t$
This assumption isn't perfectly true (words in context aren't truly independent), but it works well in practice and makes the math manageable. Each term asks: "Given center word $w_t$, what's the probability that word $w_{t+j}$ appears in the context?"
To convert this into a loss function we can minimize (standard practice in machine learning), we take the negative logarithm:

$$L = -\sum_{-m \le j \le m,\ j \ne 0} \log P(w_{t+j} \mid w_t)$$

where:
- $L$: the loss function to minimize
- The negative sign converts maximization (of probabilities) to minimization (standard in optimization)
- The logarithm converts the product into a sum, which is numerically more stable and has better gradient properties
Why the negative logarithm? Two reasons: (1) maximizing probabilities is equivalent to minimizing negative log probabilities, and (2) logarithms convert products into sums, which are easier to work with numerically and have better gradient properties.
Computing Context Probabilities: The Softmax Connection
Now we need to compute $P(w_O \mid w_I)$, the probability that a specific context word $w_O$ appears given the center word $w_I$. This is where embeddings enter the picture.

We compute this probability using the softmax function over the entire vocabulary:

$$P(w_O \mid w_I) = \frac{\exp\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w=1}^{V} \exp\left({v'_{w}}^{\top} v_{w_I}\right)}$$
where:
- $w_I$: the input (center) word
- $w_O$: the output (context) word
- $v_{w_I} \in \mathbb{R}^d$: the input embedding vector for word $w_I$ (center word representation), where $d$ is the embedding dimension
- $v'_{w_O} \in \mathbb{R}^d$: the output embedding vector for word $w_O$ (context word representation)
- ${v'_{w_O}}^{\top} v_{w_I}$: the dot product (a scalar) measuring similarity between the two embedding vectors
- $V$: the vocabulary size (total number of unique words)
- $\exp(\cdot)$: the exponential function, which amplifies differences in dot products
- The numerator measures the "compatibility" between center and context words
- The denominator normalizes over all possible context words, ensuring probabilities sum to 1
This is the softmax function, which converts raw similarity scores (dot products) into a probability distribution over the vocabulary.
Why the dot product? The dot product measures how similar two vectors are. If embeddings are similar (point in similar directions), the dot product is large. If they're different (point in different directions), the dot product is small. The exponential function amplifies these differences, making similar word pairs have much higher probabilities.
Why two embedding matrices? Notice that Word2Vec actually learns two embedding matrices: one for input words (center words) and one for output words (context words). This asymmetry allows the model to learn different representations for the same word depending on whether it's being used as a center word or a context word. In practice, we typically use the input matrix as our final word embeddings, though some applications average or concatenate both.
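In code, the softmax probability above can be sketched as follows (a toy example with random matrices; here `W_in` holds the $v_w$ vectors and `W_out` the $v'_w$ vectors):

```python
import numpy as np

def context_probability(center_idx, context_idx, W_in, W_out):
    """P(w_O | w_I) under the skip-gram softmax.

    W_in:  (V, d) input/center embeddings  (the v_w vectors)
    W_out: (V, d) output/context embeddings (the v'_w vectors)
    """
    scores = W_out @ W_in[center_idx]  # dot product with every vocabulary word
    scores -= scores.max()             # shift for stability (softmax is shift-invariant)
    exp_scores = np.exp(scores)
    return exp_scores[context_idx] / exp_scores.sum()

rng = np.random.default_rng(0)
V, d = 10, 4
W_in, W_out = rng.normal(size=(V, d)), rng.normal(size=(V, d))
print(context_probability(3, 7, W_in, W_out))
```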
The Computational Challenge: Why We Need Negative Sampling
The softmax formulation has a significant problem: it's computationally expensive. To compute $P(w_O \mid w_I)$, we need the dot product between the center word's embedding and the output embedding of every single word in the vocabulary, and we must exponentiate each result. With a vocabulary of $V = 50{,}000$ words, this means 50,000 dot products and 50,000 exponentiations for every single training example. For a corpus with millions of training examples, this becomes prohibitively slow.
We need a way to approximate the softmax that:
- Is computationally efficient (doesn't require evaluating all vocabulary words)
- Still learns meaningful embeddings
- Captures the same semantic relationships
Negative sampling solves this by reframing the problem. Instead of asking "what's the probability of each word appearing in context?" (a multi-class classification problem), we ask "is this word likely to appear in context, or not?" (a binary classification problem).
Here's how it works:
- Positive examples: The actual context words we observe are treated as positive examples. We want the model to assign high probability to these.
- Negative examples: We randomly sample $k$ words from the vocabulary (typically $k = 5$-$20$) that did NOT appear in the context. These are negative examples: words we want the model to assign low probability to.
- Binary classification: For each positive and negative example, we train a binary classifier using the sigmoid function instead of the softmax.
The loss function becomes:

$$L_{\text{neg}} = -\log \sigma\left({v'_{w_O}}^{\top} v_{w_I}\right) - \sum_{i=1}^{k} \log \sigma\left(-{v'_{w_i}}^{\top} v_{w_I}\right)$$
where:
- $L_{\text{neg}}$: the negative sampling loss function
- $\sigma(x) = \frac{1}{1 + e^{-x}}$: the sigmoid function, which maps any real number to the interval $(0, 1)$
- $w_I$: the center (input) word
- $w_O$: the positive context word (observed in the actual context)
- $v_{w_I}$: the input embedding vector for the center word
- $v'_{w_O}$: the output embedding vector for the positive context word
- $w_i$: the $i$-th negative sample word (randomly sampled from the vocabulary, not in the context)
- $v'_{w_i}$: the output embedding vector for negative sample $w_i$
- $k$: the number of negative samples (typically 5-20)
Let's understand each term:
- First term $-\log \sigma({v'_{w_O}}^{\top} v_{w_I})$: We want to maximize the probability that the positive context word $w_O$ appears with center word $w_I$. When the dot product ${v'_{w_O}}^{\top} v_{w_I}$ is large (positive), $\sigma({v'_{w_O}}^{\top} v_{w_I})$ approaches 1, so $-\log \sigma({v'_{w_O}}^{\top} v_{w_I})$ approaches 0 (minimizing the loss, which is good).
- Second term $-\sum_{i=1}^{k} \log \sigma(-{v'_{w_i}}^{\top} v_{w_I})$: We want to minimize the probability that negative words appear with center word $w_I$. The negative sign before the dot product in the sigmoid argument is key: we compute $\sigma(-{v'_{w_i}}^{\top} v_{w_I})$ instead of $\sigma({v'_{w_i}}^{\top} v_{w_I})$. When the dot product is large (positive), $-{v'_{w_i}}^{\top} v_{w_I}$ is large and negative, so $\sigma(-{v'_{w_i}}^{\top} v_{w_I})$ approaches 0, making $-\log \sigma(-{v'_{w_i}}^{\top} v_{w_I})$ large. Since we're minimizing the loss, this encourages the dot product between the center word and each negative sample to be small (or negative), pushing their embeddings apart in vector space.
Why does this work? The key insight is that most words are unrelated to any given center word. By explicitly teaching the model that random words are negative examples, we implicitly learn that words with high dot products (similar embeddings) are likely to be related. The model learns to pull related words together in embedding space and push unrelated words apart.
Computational savings: Instead of $O(V)$ operations per training example, we now need only $O(k + 1)$ operations. With $V = 50{,}000$ and $k = 5$, this is approximately an 8,000x speedup ($50{,}000 / 6 \approx 8{,}333$), making training feasible on large corpora.
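A direct NumPy rendering of this loss for a single training example (illustrative; the random embeddings and the helper names are placeholders):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(v_center, v_pos, v_negs):
    """Skip-gram loss with negative sampling for one (center, context) pair.

    v_center: (d,)   input embedding of the center word
    v_pos:    (d,)   output embedding of the observed context word
    v_negs:   (k, d) output embeddings of the k sampled negative words
    """
    pos_term = -np.log(sigmoid(v_pos @ v_center))              # pull the true pair together
    neg_term = -np.sum(np.log(sigmoid(-(v_negs @ v_center))))  # push the negatives apart
    return pos_term + neg_term

rng = np.random.default_rng(0)
d, k = 100, 5
loss = negative_sampling_loss(0.01 * rng.normal(size=d),
                              0.01 * rng.normal(size=d),
                              0.01 * rng.normal(size=(k, d)))
print(loss)   # requires only k + 1 dot products instead of V
```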
Training Process
Word2Vec training proceeds as follows:
- Initialize embeddings: Random vectors for each word in vocabulary
- Slide window: Move a context window across the training corpus
- For each center word:
  - Sample positive context words from the window
  - Sample negative words from the vocabulary (weighted by frequency)
  - Update embeddings using gradient descent
- Repeat until convergence
The key hyperparameters are:
- Embedding dimension ($d$): Typically 100-300. Larger dimensions capture more nuance but require more data
- Window size ($m$): Typically 5-10. Larger windows capture more global relationships
- Negative samples ($k$): Typically 5-20. More negatives improve quality but slow training
- Learning rate: Typically 0.01-0.05, often with decay
GloVe: Global Vectors from Matrix Factorization
GloVe (Global Vectors), introduced by Pennington et al. in 2014, takes a fundamentally different approach from Word2Vec. Instead of learning from local context windows one at a time, GloVe leverages global co-occurrence statistics across the entire corpus. This shift from local to global information allows GloVe to capture more nuanced word relationships.
The Key Insight: Co-occurrence Ratios Reveal Meaning
The breakthrough insight behind GloVe is that word relationships can be captured through ratios of co-occurrence probabilities, not just the probabilities themselves. Let's see why this matters.
Consider the words "ice" and "steam". Both are forms of water, so they might both co-occur with words like "water" and "temperature". But their differences become clear when we look at ratios. "Ice" co-occurs much more frequently with "solid" than "steam" does, while "steam" co-occurs much more with "gas" than "ice" does. The ratios reveal these relationships:

$$\frac{P(\text{solid} \mid \text{ice})}{P(\text{solid} \mid \text{steam})} \gg 1, \qquad \frac{P(\text{gas} \mid \text{ice})}{P(\text{gas} \mid \text{steam})} \ll 1$$
where:
- $P(\text{solid} \mid \text{ice})$: the conditional probability of observing "solid" in the context of "ice"
- $P(\text{solid} \mid \text{steam})$: the conditional probability of observing "solid" in the context of "steam"
- A ratio $\gg 1$ means the probability is much greater for "ice" than for "steam"
- Similarly, a ratio $\ll 1$ means "gas" is much more likely with "steam" than with "ice"
The first ratio is much greater than 1 because "solid" appears far more often with "ice" than with "steam". The second ratio is much less than 1 because "gas" appears far more often with "steam" than with "ice". These ratios capture semantic relationships that raw co-occurrence counts might miss.
GloVe's goal is to learn embeddings such that vector differences capture these co-occurrence ratios. If we can represent words as vectors where the difference $w_i - w_j$ captures how words $i$ and $j$ relate differently to other words, we've encoded semantic relationships directly in the embedding space.
Building the Co-occurrence Matrix: Capturing Global Statistics
GloVe starts by building a co-occurrence matrix $X$ that captures how often words appear together across the entire corpus. This matrix is fundamentally different from Word2Vec's approach: instead of processing text one window at a time, we first collect all co-occurrence statistics, then learn embeddings from this global view.
The matrix has dimensions $V \times V$ (vocabulary size by vocabulary size), where $X_{ij}$ counts how often word $j$ appears in the context of word $i$. The context is typically defined by a symmetric window around each word. For example, with window size 2, we count co-occurrences within 2 words on either side.
Here's how we build it: slide a window across the corpus, and for each center word $i$ and context word $j$ within the window, increment $X_{ij}$. We also apply distance weighting: words closer to the center word contribute more to the count than words farther away. This reflects the intuition that nearby words are more relevant.
The resulting matrix is typically very sparse: most word pairs never co-occur in the corpus. But the non-zero entries contain rich information. Words that appear in similar contexts will have similar rows (or columns) in this matrix, which is exactly what we want embeddings to capture.
From Co-occurrence Counts to Embeddings: The Factorization Problem
Now we face the core challenge: how do we convert this co-occurrence matrix into dense word embeddings? This is a matrix factorization problem. We want to find low-dimensional vectors (embeddings) such that their interactions approximate the co-occurrence statistics.
GloVe's approach is elegant: learn embeddings such that the dot product between word vectors approximates the logarithm of their co-occurrence count. Specifically, we want:

$$w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j \approx \log X_{ij}$$
where:
- $w_i \in \mathbb{R}^d$: the embedding vector for word $i$ when it appears as a center word, where $d$ is the embedding dimension
- $\tilde{w}_j \in \mathbb{R}^d$: the embedding vector for word $j$ when it appears as a context word
- $w_i^{\top} \tilde{w}_j$: the dot product (a scalar) between the two embedding vectors
- $b_i$: a bias term for word $i$ that captures its overall frequency as a center word
- $\tilde{b}_j$: a bias term for word $j$ that captures its overall frequency as a context word
- $X_{ij}$: the co-occurrence count (how many times word $j$ appears in the context of word $i$)
- $\log X_{ij}$: the natural logarithm of the co-occurrence count (using base $e$)
The goal is to learn embeddings such that their dot product (plus biases) approximates the logarithm of the co-occurrence count. The logarithm transforms multiplicative relationships into additive ones, which are easier to model with linear operations like dot products.
Why the logarithm? Co-occurrence counts can vary over many orders of magnitude. The word "the" might co-occur with thousands of words millions of times, while rare word pairs might co-occur only once. The logarithm compresses this range, making the optimization more stable. It also has a nice mathematical property: ratios become differences under the logarithm, which aligns with our goal of capturing relationships through vector differences.
Why two embedding matrices? Like Word2Vec, GloVe learns separate embeddings for words as center words ($w$) and as context words ($\tilde{w}$). This asymmetry allows the model to capture different aspects of word meaning depending on role. After training, we typically combine them (often by addition: $w_i + \tilde{w}_i$) or use just the center word embeddings.
The GloVe Objective: Weighted Least Squares
To learn these embeddings, GloVe minimizes a weighted least squares objective:

$$J = \sum_{i=1}^{V} \sum_{j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$
where:
- $J$: the GloVe objective function (loss) to minimize
- $V$: the vocabulary size
- $i, j$: indices of words in the vocabulary
- $X_{ij}$: the co-occurrence count for word pair $(i, j)$
- $f(X_{ij})$: the weighting function (defined below) that determines the importance of each word pair
- $w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j$: the predicted log co-occurrence from the embeddings
- $\log X_{ij}$: the actual log co-occurrence from the corpus
- The squared term $(\,\cdot\,)^2$: the squared error between predicted and actual log co-occurrence
The squared term $\left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$ measures how well our predicted log co-occurrence (from the embeddings) matches the actual log co-occurrence (from the matrix $X$). We want this difference to be small for all word pairs.
The weighting function $f(X_{ij})$: This is crucial. Not all co-occurrence counts are equally important. The weighting function is:

$$f(x) = \begin{cases} (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\ 1 & \text{otherwise} \end{cases}$$
where:
- $f(x)$: the weighting function that assigns an importance to co-occurrence count $x$
- $x$: the co-occurrence count for a word pair
- $x_{\max}$: the maximum count threshold (typically $x_{\max} = 100$)
- $\alpha$: the exponent controlling the downweighting strength (typically $\alpha = 0.75$)
- When $x = 0$: $f(x) = 0$ (zero weight for non-co-occurring pairs)
- When $0 < x < x_{\max}$: $f(x) = (x / x_{\max})^{\alpha}$ (gradually increasing weight)
- When $x \ge x_{\max}$: $f(x) = 1$ (full weight for frequent pairs)
This weighting function serves three purposes:
- Handles sparsity: Zero co-occurrences ($X_{ij} = 0$) get zero weight, so we don't try to fit the vast number of word pairs that never appear together. This is important because the matrix $X$ is mostly zeros.
- Downweights very frequent co-occurrences: Words like "the" co-occur with almost everything. Without downweighting, these high-frequency pairs would dominate the objective function, preventing the model from learning subtle relationships between content words.
- Prevents rare co-occurrences from dominating: Very rare word pairs (co-occurring once or twice) might be noise rather than meaningful relationships. The weighting function gives them less influence than moderate-frequency pairs, which are more reliable.
The exponent $\alpha$ is a hyperparameter that controls the strength of downweighting. Values closer to 1 give more weight to frequent pairs; values closer to 0 give more uniform weighting. The value $\alpha = 0.75$ was found empirically to work well across different corpora.
Why least squares? The squared error is a natural choice for regression problems. It's differentiable, has nice optimization properties, and penalizes large errors more than small ones (quadratic penalty). This encourages the model to get the most important relationships right, even if it sacrifices accuracy on rare or noisy pairs.
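A compact NumPy sketch of the weighting function and the weighted least-squares objective (illustrative: the random co-occurrence counts, initialization, and function names are placeholders, not a full GloVe trainer):

```python
import numpy as np

def weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting function f(x)."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(W, W_tilde, b, b_tilde, X):
    """Weighted least-squares objective summed over all word pairs."""
    log_X = np.log(np.where(X > 0, X, 1.0))  # placeholder where X_ij = 0 (its weight is 0 anyway)
    pred = W @ W_tilde.T + b[:, None] + b_tilde[None, :]
    return np.sum(weight(X) * (pred - log_X) ** 2)

rng = np.random.default_rng(0)
V, d = 8, 4
X = rng.integers(0, 50, size=(V, V)).astype(float)  # stand-in co-occurrence counts
loss = glove_loss(0.1 * rng.normal(size=(V, d)), 0.1 * rng.normal(size=(V, d)),
                  np.zeros(V), np.zeros(V), X)
print(loss)
```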
Why This Approach Works: Connecting Local and Global
GloVe's matrix factorization approach has several advantages over Word2Vec's local window method:
- Efficient use of statistics: All co-occurrence information is used simultaneously. Word2Vec processes one window at a time, potentially missing global patterns. GloVe sees the full picture from the start.
- Explicit relationship to co-occurrence: The objective function directly relates embeddings to co-occurrence statistics. This makes the model more interpretable: we can understand why certain embeddings emerge by looking at the co-occurrence matrix.
- Better handling of rare words: By using global statistics, GloVe can learn better representations for rare words that might not appear in enough local windows for Word2Vec to learn effectively.
However, GloVe requires storing the full co-occurrence matrix, which can be memory-intensive for very large vocabularies (though the matrix is sparse and can be stored efficiently). In practice, both Word2Vec and GloVe produce high-quality embeddings, with GloVe often performing slightly better on semantic analogy tasks.
Worked Example: Understanding Embedding Space
Let's work through a concrete example to see how embeddings capture semantic relationships. We'll use a small vocabulary and train simple embeddings to illustrate the concepts.
Setting Up a Mini Corpus
Consider this tiny corpus about animals and transportation:
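One possible mini corpus of this kind (illustrative; the exact sentences used here are our own, not reproduced from elsewhere):

```python
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat ran in the garden",
    "the dog ran in the park",
    "the car drives on the road",
    "the plane flies in the sky",
]
```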
Even with this minimal data, we can see patterns: "cat" and "dog" appear in similar contexts (both with "the", both animals), while "car" and "plane" both appear with movement verbs (like "flies") but differ in their specific contexts.
Computing Co-occurrence Statistics
Let's build a co-occurrence matrix to understand what GloVe would learn:
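A short sketch of how such a matrix can be built, using the illustrative `corpus` above and a window size of 2 (the variable names are our own):

```python
import numpy as np

window_size = 2
tokens = [sentence.split() for sentence in corpus]
vocab = sorted({w for sent in tokens for w in sent})
word_to_idx = {w: i for i, w in enumerate(vocab)}

# X[i, j] counts how often word j appears within the window around word i.
X = np.zeros((len(vocab), len(vocab)))
for sent in tokens:
    for i, center in enumerate(sent):
        lo, hi = max(0, i - window_size), min(len(sent), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:
                X[word_to_idx[center], word_to_idx[sent[j]]] += 1.0

print(vocab)
print("cat-the:", X[word_to_idx["cat"], word_to_idx["the"]],
      "dog-the:", X[word_to_idx["dog"], word_to_idx["the"]])
```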
The vocabulary contains all unique words from our small corpus. The co-occurrence matrix shows how often each word pair appears together within the context window. Notice that words like "cat" and "dog" have similar co-occurrence patterns (both frequently co-occur with "the"), while "car" and "plane" share some patterns but differ in others. This is exactly what embeddings learn to capture: words with similar co-occurrence patterns will have similar embeddings.
Visualizing the Co-occurrence Matrix
A heatmap visualization makes the co-occurrence patterns immediately visible:
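One way to draw it with matplotlib, continuing from the matrix `X` and `vocab` built above:

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 5))
im = ax.imshow(X, cmap="Blues")
ax.set_xticks(range(len(vocab)))
ax.set_xticklabels(vocab, rotation=90)
ax.set_yticks(range(len(vocab)))
ax.set_yticklabels(vocab)
ax.set_xlabel("context word")
ax.set_ylabel("center word")
fig.colorbar(im, ax=ax, label="co-occurrence count")
fig.tight_layout()
plt.show()
```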
The heatmap reveals several important patterns. First, the matrix is sparse: most word pairs never co-occur (white cells with count 0). Second, semantically related words show similar patterns: "cat" and "dog" both have high co-occurrence with "the", "on", and "in", reflecting their similar grammatical roles. Third, the diagonal is empty because we don't count a word co-occurring with itself. These patterns are exactly what GloVe's matrix factorization captures: words with similar rows (or columns) in this matrix will have similar embeddings after training.
Visualizing Relationships
Even in this tiny example, we can see semantic clusters emerging. Words that appear in similar contexts will have similar rows (or columns) in the co-occurrence matrix, which translates to similar embeddings after factorization.
Code Implementation: Training Word Embeddings from Scratch
Now let's implement Word2Vec skip-gram with negative sampling from scratch. This implementation will help you understand exactly how embeddings are learned.
Implementing Skip-gram
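A minimal NumPy sketch of such a class (illustrative: the initialization scale and update details are choices made here, and real implementations usually also subsample frequent words):

```python
import numpy as np

class Word2Vec:
    """Minimal skip-gram model with negative sampling (NumPy sketch)."""

    def __init__(self, vocab_size, embedding_dim=100, learning_rate=0.025, seed=0):
        rng = np.random.default_rng(seed)
        self.lr = learning_rate
        # Input (center) and output (context) embedding matrices, small random init.
        self.W_in = rng.normal(scale=0.01, size=(vocab_size, embedding_dim))
        self.W_out = rng.normal(scale=0.01, size=(vocab_size, embedding_dim))

    @staticmethod
    def _sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def train_pair(self, center, context, negatives):
        """One SGD step on a (center, context) pair plus k negative samples."""
        v_c = self.W_in[center]                           # (d,)
        idx = np.concatenate(([context], negatives))      # true context first, then negatives
        labels = np.zeros(len(idx))
        labels[0] = 1.0                                   # 1 = real context word, 0 = noise
        v_out = self.W_out[idx]                           # (k + 1, d)

        scores = self._sigmoid(v_out @ v_c)               # "is this a real context word?"
        grad = scores - labels                            # gradient w.r.t. the raw scores

        self.W_out[idx] -= self.lr * np.outer(grad, v_c)  # update output vectors
        self.W_in[center] -= self.lr * (grad @ v_out)     # update the center vector

        eps = 1e-10                                       # avoid log(0)
        return -np.log(scores[0] + eps) - np.sum(np.log(1.0 - scores[1:] + eps))

model = Word2Vec(vocab_size=1000, embedding_dim=100)
print(model.W_in.shape)   # (1000, 100): one embedding vector per vocabulary word
```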
The Word2Vec class implements the skip-gram architecture with negative sampling. The embedding matrices are initialized with small random values, which will be updated during training. The embedding shape shows we have one vector per word in the vocabulary, with each vector having the specified embedding dimension.
Training on Real Data
Now let's train on a larger corpus. We'll use a simple text preprocessing pipeline and train for multiple epochs:
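One possible training loop over a small illustrative corpus (the sentences, hyperparameter values, and the uniform negative sampler are assumptions; the full algorithm samples negatives in proportion to unigram frequency raised to the 0.75 power). It reuses the `Word2Vec` class sketched above:

```python
import numpy as np

# Small illustrative training corpus (stands in for a larger text collection).
sentences = [
    "the cat sat on the mat", "the dog sat on the rug",
    "the cat chased the bird", "the dog chased the cat",
    "the bird flew over the tree", "the cat and the dog played in the park",
    "the dog ran in the garden", "the car drives on the road",
    "the train runs on the track", "the plane flies in the sky",
    "the car and the train moved fast", "the plane and the car are fast",
]

tokens = [s.split() for s in sentences]
vocab = sorted({w for sent in tokens for w in sent})
word_to_idx = {w: i for i, w in enumerate(vocab)}

# Generate (center, context) pairs with a sliding window.
window_size = 2
pairs = []
for sent in tokens:
    ids = [word_to_idx[w] for w in sent]
    for i, center in enumerate(ids):
        lo, hi = max(0, i - window_size), min(len(ids), i + window_size + 1)
        pairs.extend((center, ids[j]) for j in range(lo, hi) if j != i)
pairs = np.array(pairs)

rng = np.random.default_rng(0)
model = Word2Vec(vocab_size=len(vocab), embedding_dim=25, learning_rate=0.05)

num_epochs, k = 200, 5
losses = []
for epoch in range(num_epochs):
    rng.shuffle(pairs)                                # reshuffle training pairs each epoch
    epoch_loss = 0.0
    for center, context in pairs:
        negatives = rng.integers(0, len(vocab), size=k)   # uniform sampling for simplicity
        epoch_loss += model.train_pair(center, context, negatives)
    losses.append(epoch_loss / len(pairs))

print(f"vocabulary size: {len(vocab)}, training pairs: {len(pairs)}")
print(f"average loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```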
The vocabulary contains all unique words from our training corpus. We generate training pairs by sliding a context window across each sentence, creating (center word, context word) pairs. During training, the loss decreases as the model learns to predict context words more accurately. The decreasing loss indicates that the embeddings are learning meaningful relationships: words that appear in similar contexts are being pulled together in embedding space.
Visualizing Training Progress
A loss curve shows how the model improves over training:
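Assuming the `losses` list collected in the training sketch above, a quick matplotlib plot might look like this:

```python
import matplotlib.pyplot as plt

plt.figure(figsize=(6, 4))
plt.plot(losses)
plt.xlabel("epoch")
plt.ylabel("average negative-sampling loss")
plt.title("Word2Vec training loss")
plt.tight_layout()
plt.show()
```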
The loss curve demonstrates successful training: the loss decreases steadily from the initial random embeddings to a much lower value, indicating the model is learning meaningful word relationships. The initial rapid decrease shows the model quickly learns basic patterns, while the gradual decline and eventual plateau suggest it's converging to a stable solution where embeddings capture semantic relationships effectively.
Visualizing Learned Embeddings
Now let's visualize the embeddings to see if semantically similar words cluster together:
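One way to project the learned vectors to two dimensions is PCA computed via an SVD (t-SNE is another common choice); this sketch assumes the `model` and `word_to_idx` from the training sketch above:

```python
import numpy as np
import matplotlib.pyplot as plt

# Project the learned input embeddings to 2D with PCA (via SVD).
E = model.W_in - model.W_in.mean(axis=0)
_, _, Vt = np.linalg.svd(E, full_matrices=False)
coords = E @ Vt[:2].T

plt.figure(figsize=(7, 6))
plt.scatter(coords[:, 0], coords[:, 1], s=15)
for word, idx in word_to_idx.items():
    plt.annotate(word, coords[idx], fontsize=9)
plt.title("Learned word embeddings (2D PCA projection)")
plt.tight_layout()
plt.show()
```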
The visualization shows that even with minimal training data, Word2Vec learns meaningful structure. Animals cluster together, vehicles form their own group, and function words occupy a distinct region. This demonstrates how semantic relationships emerge naturally from co-occurrence patterns.
Computing Word Similarities
We can measure similarity between words using cosine similarity:
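A small helper for cosine similarity and nearest neighbours, again assuming the `model` and `word_to_idx` from the training sketch (the helper names are our own):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 = identical direction)."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-10))

def most_similar(word, top_n=3):
    """Rank all other words by cosine similarity to `word` in embedding space."""
    v = model.W_in[word_to_idx[word]]
    sims = [(other, cosine_similarity(v, model.W_in[i]))
            for other, i in word_to_idx.items() if other != word]
    return sorted(sims, key=lambda pair: pair[1], reverse=True)[:top_n]

print("cat:", most_similar("cat"))
print("car:", most_similar("car"))
```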
The similarity scores show how closely related words are in embedding space. Words with higher cosine similarity (closer to 1.0) have more similar embeddings, meaning they appear in similar contexts. For "cat", we expect to see other animals like "dog" or "bird" with high similarity. For "car", we expect vehicles like "plane" or "train". These results demonstrate that Word2Vec successfully captures semantic relationships: words that are semantically related have similar embeddings.
Word Analogies
Word embeddings can capture analogies through vector arithmetic. The classic example is: "king - man + woman = queen". Let's test this:
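A sketch of analogy solving via vector arithmetic, reusing the `cosine_similarity` helper and trained `model` from the previous sketches:

```python
def analogy(a, b, c, top_n=3):
    """Solve 'a is to b as c is to ?' with vector arithmetic: b - a + c."""
    target = (model.W_in[word_to_idx[b]]
              - model.W_in[word_to_idx[a]]
              + model.W_in[word_to_idx[c]])
    exclude = {a, b, c}
    sims = [(w, cosine_similarity(target, model.W_in[i]))
            for w, i in word_to_idx.items() if w not in exclude]
    return sorted(sims, key=lambda pair: pair[1], reverse=True)[:top_n]

print("cat : dog :: car : ?", analogy("cat", "dog", "car"))
```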
Word analogies test whether embeddings capture relational patterns. The analogy "cat is to dog as car is to ?" asks: what word has the same relationship to "car" that "dog" has to "cat"? We expect "plane" or "train" (another vehicle). The vector arithmetic should point toward the answer. With our small vocabulary and limited training, results may not be perfect, but the principle demonstrates how embeddings encode relationships: semantic relationships become geometric relationships in vector space.
Key Parameters
The Word2Vec implementation uses several key parameters that control training and embedding quality:
- `embedding_dim` (default: 100-300): The dimensionality of the embedding vectors. Larger dimensions can capture more nuanced relationships but require more training data and computation. Typical values range from 50-300, with 100-200 being most common for general-purpose embeddings.
- `window_size` (default: 5-10): The size of the context window on each side of the center word. Larger windows capture more global relationships (document-level patterns), while smaller windows focus on local syntactic relationships. For most applications, 5-10 words on each side works well.
- `negative_samples` (default: 5-20): The number of negative examples sampled for each positive example. More negative samples improve embedding quality but slow training. Values between 5-20 are typical, with 5-10 being common for large corpora.
- `learning_rate` (default: 0.01-0.05): The step size for gradient descent updates. Higher learning rates train faster but may overshoot optimal values. Lower learning rates are more stable but require more epochs. Start with 0.01-0.05 and adjust based on loss convergence.
- `vocab_size`: The number of unique words in the vocabulary. This is determined by your corpus and preprocessing choices. Larger vocabularies require more memory and computation but capture more word diversity.
These parameters interact: larger embedding dimensions and more negative samples improve quality but increase training time. For production use, start with moderate values (embedding_dim=200, window_size=5, negative_samples=5) and tune based on your specific task and computational constraints.
Comparing Word2Vec and GloVe
Both Word2Vec and GloVe produce high-quality embeddings, but they have different strengths:
Word2Vec advantages:
- Online learning: can process streaming data
- Memory efficient: doesn't need to store full co-occurrence matrix
- Fast training with negative sampling
- Better for very large corpora
GloVe advantages:
- Uses global statistics: all co-occurrence information simultaneously
- Often performs better on analogy tasks
- More interpretable: explicit relationship to co-occurrence statistics
- Can leverage pre-computed co-occurrence matrices
In practice, both methods produce embeddings of similar quality. The choice often depends on your specific constraints: use Word2Vec for streaming data or very large corpora, use GloVe when you can pre-compute co-occurrence statistics and want slightly better performance on semantic tasks.
Limitations & Impact
Word embeddings revolutionized NLP, but they have important limitations that contextual embeddings (like BERT) address.
Key Limitations
- Static representations: Each word has a single embedding regardless of context. "bank" (financial institution) and "bank" (river edge) have the same vector, even though they mean different things.
- Out-of-vocabulary words: Words not seen during training get no embedding. This is particularly problematic for morphologically rich languages and domain-specific terminology.
- No sentence-level understanding: Embeddings capture word-level semantics but don't understand how words combine to form meaning. The phrase "not good" might be represented as the average of "not" and "good" embeddings, losing the negation.
- Bias amplification: Embeddings learn biases present in the training data. Gender stereotypes, racial biases, and other problematic associations get encoded in the vector space.
- Limited to word-level: Can't represent subword units (important for morphologically rich languages) or multi-word expressions without special handling.
Historical Impact
Despite these limitations, word embeddings had significant impact:
- Transfer learning: Pre-trained embeddings enabled NLP models to work with limited task-specific data
- Semantic search: Embeddings power modern search engines and recommendation systems
- Foundation for transformers: The embedding concept evolved into positional encodings and token embeddings in transformers
- Research catalyst: Sparked interest in representation learning that led to contextual embeddings
The Path Forward
Word embeddings were a crucial stepping stone. They proved that dense vector representations could capture semantic relationships, setting the stage for contextual embeddings from transformers. Today, while static word embeddings are rarely used directly in state-of-the-art systems, the principles they established (dense representations, transfer learning, semantic relationships in vector space) remain foundational to modern NLP.
Summary
Word embeddings solve the fundamental problem of representing words as numbers in a way that captures semantic relationships. Unlike one-hot encoding, which treats all words as equally different, embeddings learn dense, low-dimensional vectors where similar words have similar representations.
Key takeaways:
- Distributional hypothesis: Words appearing in similar contexts have similar meanings. This principle underlies both Word2Vec and GloVe.
- Word2Vec: Learns embeddings by predicting context words (skip-gram) or center words (CBOW) using local windows. Negative sampling makes training efficient by approximating the expensive softmax.
- GloVe: Learns embeddings by factorizing a global co-occurrence matrix. Uses ratios of co-occurrence probabilities to capture word relationships.
- Semantic relationships: Embeddings encode meaning in vector geometry. Similarity is measurable (cosine similarity), and analogies can be solved through vector arithmetic.
- Limitations: Static representations can't handle polysemy, require special handling of OOV words, and may encode biases from the training data.
- Impact: Enabled transfer learning in NLP and laid the foundation for modern contextual embeddings.
Word embeddings transformed NLP by making semantic relationships computable. While superseded by contextual embeddings in most applications, understanding embeddings is important for grasping how modern language models represent and understand text.





