Complete guide to word embeddings covering Word2Vec skip-gram, GloVe matrix factorization, negative sampling, and co-occurrence statistics. Learn how to implement embeddings from scratch and understand how semantic relationships emerge from vector space geometry.

This article is part of the free-to-read Language AI Handbook
Word Embeddings
How do we represent words as numbers? This question has driven NLP research for decades. Early approaches used one-hot encoding: each word gets a unique vector where all values are zero except one position. "cat" might be [1, 0, 0, 0, ...], "dog" might be [0, 1, 0, 0, ...]. This works, but it's fundamentally limited. These vectors tell us nothing about relationships between words. "cat" and "dog" are as similar as "cat" and "airplane" in one-hot space.
Word embeddings solve this by learning dense, low-dimensional vectors where semantically similar words have similar vector representations. Instead of sparse one-hot vectors with thousands of dimensions, we get compact vectors (typically 100-300 dimensions) where words with related meanings cluster together in vector space. The distance between "cat" and "dog" becomes small, while "cat" and "airplane" remain far apart.
The breakthrough came from a simple insight: words that appear in similar contexts tend to have similar meanings. This distributional hypothesis, dating back to linguist J.R. Firth's 1957 observation that "you shall know a word by the company it keeps," became the foundation for modern word embedding methods. Word2Vec and GloVe, the two most influential embedding algorithms, both exploit this principle, but through different mathematical approaches.
Word embeddings transformed NLP by enabling models to capture semantic relationships directly from data. They unlocked transfer learning: embeddings trained on massive text corpora could be reused across tasks, dramatically improving performance on downstream applications with limited training data. Today, word embeddings are foundational to everything from search engines to chatbots, though they've been largely superseded by contextual embeddings from transformers.
The Problem with Traditional Representations
Before word embeddings, NLP systems relied on sparse, high-dimensional representations that couldn't capture semantic relationships. Understanding these limitations helps explain why embeddings represented a significant advance in NLP.
One-Hot Encoding: The Baseline
One-hot encoding represents each word as a binary vector where exactly one position is 1 and all others are 0. If your vocabulary has 10,000 words, each word gets a 10,000-dimensional vector:
```python
vocabulary = ["cat", "dog", "bird", "airplane", "car"]
word_to_index = {word: i for i, word in enumerate(vocabulary)}

def one_hot_encode(word, vocab_size):
    vector = [0] * vocab_size
    vector[word_to_index[word]] = 1
    return vector

cat_vector = one_hot_encode("cat", len(vocabulary))
dog_vector = one_hot_encode("dog", len(vocabulary))
airplane_vector = one_hot_encode("airplane", len(vocabulary))
```

```
cat: [1, 0, 0, 0, 0]
dog: [0, 1, 0, 0, 0]
airplane: [0, 0, 0, 1, 0]
```
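A quick check makes the limitation concrete. The sketch below shows that any two distinct one-hot vectors are orthogonal, so their cosine similarity is 0 regardless of which words are compared. The `one_hot` and `cosine` helpers are illustrative names introduced here, not part of the article's code.

```python
import math

vocabulary = ["cat", "dog", "bird", "airplane", "car"]
word_to_index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    """Build the one-hot list for a word in the toy vocabulary."""
    vector = [0] * len(vocabulary)
    vector[word_to_index[word]] = 1
    return vector

def cosine(u, v):
    """Cosine similarity of two equal-length lists (hypothetical helper)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

cat_dog = cosine(one_hot("cat"), one_hot("dog"))
cat_airplane = cosine(one_hot("cat"), one_hot("airplane"))
print(cat_dog, cat_airplane)  # both are 0.0
```

Because every pair of different words scores exactly 0, no notion of "closer" or "farther" can be read off one-hot vectors.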
```python
corpus = [
    "the cat sits on the mat",
    "the dog runs in the park",
    "the bird flies in the sky",
    "the car drives on the road",
    "the plane flies in the air",
    "the train moves on tracks"
]
```

Even with this minimal data, we can see patterns: "cat" and "dog" appear in similar contexts (both follow "the" and are animals acting as the subject of a verb), while "car" and "plane" both pair with movement verbs ("drives", "flies") but differ in the rest of their contexts ("on the road" versus "in the air").
Computing Co-occurrence Statistics
Let's build a co-occurrence matrix to understand what GloVe would learn:
```python
from collections import defaultdict
import numpy as np

def build_cooccurrence_matrix(corpus, window_size=2):
    """Build a co-occurrence matrix from a corpus."""
    # Tokenize
    sentences = [sentence.split() for sentence in corpus]

    # Build vocabulary
    vocab = set()
    for sentence in sentences:
        vocab.update(sentence)
    vocab = sorted(list(vocab))
    word_to_idx = {word: i for i, word in enumerate(vocab)}

    # Initialize co-occurrence matrix
    cooccurrence = defaultdict(int)

    # Count co-occurrences
    for sentence in sentences:
        for i, center_word in enumerate(sentence):
            center_idx = word_to_idx[center_word]
            # Look at context window
            for j in range(max(0, i - window_size), min(len(sentence), i + window_size + 1)):
                if i != j:
                    context_word = sentence[j]
                    context_idx = word_to_idx[context_word]
                    cooccurrence[(center_idx, context_idx)] += 1

    # Convert to matrix
    V = len(vocab)
    matrix = np.zeros((V, V))
    for (i, j), count in cooccurrence.items():
        matrix[i, j] = count

    return matrix, vocab, word_to_idx

cooccurrence_matrix, vocabulary, word_to_idx = build_cooccurrence_matrix(corpus)
```

```
Vocabulary: ['air', 'bird', 'car', 'cat', 'dog', 'drives', 'flies', 'in', 'mat', 'moves', 'on', 'park', 'plane', 'road', 'runs', 'sits', 'sky', 'the', 'tracks', 'train']
Vocabulary size: 20
Co-occurrence matrix (first 10x10):
[[0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 1. 0. 0.]
 [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 2. 0. 0.]
 [1. 1. 0. 0. 1. 0. 2. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
```
The vocabulary contains all unique words from our small corpus. The co-occurrence matrix shows how often each word pair appears together within the context window. Notice that words like "cat" and "dog" have similar co-occurrence patterns (both frequently co-occur with "the"), while "car" and "plane" share some patterns but differ in others. This is exactly what embeddings learn to capture: words with similar co-occurrence patterns will have similar embeddings.
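To put a number on "similar co-occurrence patterns", one option is to compare rows of the matrix with cosine similarity. The sketch below rebuilds the same window-2 counts in compact form so it runs on its own; `cooccurrence_rows` and `row_cosine` are illustrative names, not functions from the article.

```python
import numpy as np

corpus = [
    "the cat sits on the mat",
    "the dog runs in the park",
    "the bird flies in the sky",
    "the car drives on the road",
    "the plane flies in the air",
    "the train moves on tracks",
]

def cooccurrence_rows(corpus, window_size=2):
    """Count context words within `window_size` positions of each word."""
    sentences = [s.split() for s in corpus]
    vocab = sorted({w for s in sentences for w in s})
    idx = {w: i for i, w in enumerate(vocab)}
    matrix = np.zeros((len(vocab), len(vocab)))
    for sent in sentences:
        for i, center in enumerate(sent):
            lo = max(0, i - window_size)
            hi = min(len(sent), i + window_size + 1)
            for j in range(lo, hi):
                if j != i:
                    matrix[idx[center], idx[sent[j]]] += 1
    return matrix, idx

def row_cosine(matrix, idx, a, b):
    """Cosine similarity between the co-occurrence rows of two words."""
    u, v = matrix[idx[a]], matrix[idx[b]]
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

matrix, idx = cooccurrence_rows(corpus)
for a, b in [("cat", "dog"), ("cat", "car"), ("bird", "plane")]:
    print(a, b, round(row_cosine(matrix, idx, a, b), 3))
```

On this six-sentence corpus the scores come out at roughly 0.33 for cat/dog, 0.67 for cat/car, and 1.0 for bird/plane. "cat" and "car" score higher than "cat" and "dog" only because both happen to be followed by "on", which is a reminder that with so little data the rows reflect sentence frames as much as meaning.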
Visualizing the Co-occurrence Matrix
A heatmap visualization makes the co-occurrence patterns immediately visible:

Heatmap of the co-occurrence matrix showing how often word pairs appear together in the corpus. Darker cells indicate higher co-occurrence counts. Notice how semantically related words like 'cat' and 'dog' have similar co-occurrence patterns (both frequently co-occur with 'the'), while 'car' and 'plane' share some patterns but differ in others. The sparse nature of the matrix (many zeros) is also evident, which is why GloVe uses a weighting function to handle this sparsity.
The heatmap reveals several important patterns. First, the matrix is sparse: most word pairs never co-occur (white cells with count 0). Second, words that play the same grammatical role show similar patterns: "cat" and "dog" each co-occur with "the", a verb, and a preposition ("on" for "cat", "in" for "dog"). Third, the diagonal is empty: the counting loop skips the center position, and no word repeats within two positions of itself in this corpus. These patterns are what GloVe's matrix factorization captures: words with similar rows (or columns) in this matrix end up with similar embeddings after training.
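The weighting function and the factorization objective can be sketched in a few lines. The form below follows the GloVe paper's weighted least-squares loss with its published defaults (`x_max=100`, `alpha=0.75`); the function names and the tiny matrix `X` are assumptions made for illustration, and no training loop is shown.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting f(x): (x / x_max) ** alpha below x_max, else 1."""
    x = np.asarray(x, dtype=float)
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(X, W, W_ctx, b, b_ctx):
    """Weighted squared error between w_i . w_j + b_i + b_j and log X_ij.

    Only nonzero counts contribute, which is how the sparsity of the
    co-occurrence matrix is handled.
    """
    rows, cols = np.nonzero(X)
    pred = np.sum(W[rows] * W_ctx[cols], axis=1) + b[rows] + b_ctx[cols]
    err = pred - np.log(X[rows, cols])
    return float(np.sum(glove_weight(X[rows, cols]) * err ** 2))

# Hypothetical 2-word co-occurrence matrix and random parameters.
X = np.array([[0.0, 2.0], [2.0, 0.0]])
rng = np.random.default_rng(0)
W, W_ctx = rng.normal(size=(2, 3)), rng.normal(size=(2, 3))
b, b_ctx = np.zeros(2), np.zeros(2)
print(glove_loss(X, W, W_ctx, b, b_ctx))
```

Rare pairs get small weights, frequent pairs are capped at weight 1, and zero counts are skipped entirely, so the logarithm is never taken of 0.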
Visualizing Relationships
Even in this tiny example, we can see semantic clusters emerging. Words that appear in similar contexts will have similar rows (or columns) in the co-occurrence matrix, which translates to similar embeddings after factorization.
Code Implementation: Training Word Embeddings from Scratch
Now let's implement Word2Vec skip-gram with negative sampling from scratch. This implementation will help you understand exactly how embeddings are learned.
Implementing Skip-gram
```python
import numpy as np
from collections import Counter, defaultdict
import random

class Word2Vec:
    def __init__(self, vocab_size, embedding_dim=100, window_size=5,
                 negative_samples=5, learning_rate=0.01):
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        self.window_size = window_size
        self.negative_samples = negative_samples
        self.learning_rate = learning_rate

        # Initialize embedding matrices
        # W_in: input embeddings (what we'll use as word embeddings)
        # W_out: output embeddings (for predicting context words)
        self.W_in = np.random.randn(vocab_size, embedding_dim) * 0.01
        self.W_out = np.random.randn(vocab_size, embedding_dim) * 0.01

    def sigmoid(self, x):
        """Sigmoid function with numerical stability."""
        return 1 / (1 + np.exp(-np.clip(x, -250, 250)))

    def train_step(self, center_idx, context_idx, negative_indices):
        """Perform one training step for a center word and its context."""
        # Get embeddings
        center_emb = self.W_in[center_idx]  # Shape: (embedding_dim,)
        context_emb = self.W_out[context_idx]  # Shape: (embedding_dim,)
        negative_embs = self.W_out[negative_indices]  # Shape: (negative_samples, embedding_dim)

        # Positive example: maximize P(context | center)
        pos_score = np.dot(center_emb, context_emb)
        pos_prob = self.sigmoid(pos_score)
        pos_error = 1 - pos_prob  # Target is 1 for positive

        # Negative examples: minimize P(negative | center)
        neg_scores = np.dot(center_emb, negative_embs.T)  # Shape: (negative_samples,)
        neg_probs = self.sigmoid(neg_scores)
        neg_errors = neg_probs  # Target is 0 for negative

        # Compute gradients
        # For input embedding (center word)
        grad_in = pos_error * context_emb - np.sum(neg_errors[:, None] * negative_embs, axis=0)

        # For output embedding (context word)
        grad_out_pos = pos_error * center_emb

        # For output embeddings (negative words)
        grad_out_neg = neg_errors[:, None] * center_emb

        # Update weights
        self.W_in[center_idx] += self.learning_rate * grad_in
        self.W_out[context_idx] += self.learning_rate * grad_out_pos
        self.W_out[negative_indices] -= self.learning_rate * grad_out_neg

        # Return loss for monitoring
        loss = -np.log(pos_prob + 1e-10) - np.sum(np.log(1 - neg_probs + 1e-10))
        return loss

    def get_embeddings(self):
        """Return the input embeddings as the word representations."""
        return self.W_in.copy()

# Example usage with a small vocabulary
vocab = ["cat", "dog", "bird", "car", "plane", "the", "runs", "flies"]
vocab_size = len(vocab)
word_to_idx = {word: i for i, word in enumerate(vocab)}

# Initialize model
model = Word2Vec(vocab_size=vocab_size, embedding_dim=10,
                 window_size=2, negative_samples=3, learning_rate=0.1)
embedding_shape = model.get_embeddings().shape
```

```
Initialized Word2Vec model
Embedding shape: (8, 10)
```
The Word2Vec class implements the skip-gram architecture with negative sampling. The embedding matrices are initialized with small random values, which will be updated during training. The embedding shape shows we have one vector per word in the vocabulary, with each vector having the specified embedding dimension.
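For reference, the per-pair loss that `train_step` returns, and the gradients behind its three updates, can be written out explicitly. The notation is an assumption chosen to mirror the code: $v_c$ is the row of `W_in` for the center word, $u_o$ the row of `W_out` for the observed context word, and $u_1, \dots, u_K$ the rows of `W_out` for the $K$ negative samples.

```latex
\begin{aligned}
J &= -\log \sigma(u_o^\top v_c) - \sum_{k=1}^{K} \log \sigma(-u_k^\top v_c) \\
\frac{\partial J}{\partial v_c} &= -\bigl(1 - \sigma(u_o^\top v_c)\bigr)\, u_o + \sum_{k=1}^{K} \sigma(u_k^\top v_c)\, u_k \\
\frac{\partial J}{\partial u_o} &= -\bigl(1 - \sigma(u_o^\top v_c)\bigr)\, v_c \\
\frac{\partial J}{\partial u_k} &= \sigma(u_k^\top v_c)\, v_c
\end{aligned}
```

In the code, `pos_error` is $1 - \sigma(u_o^\top v_c)$ and `neg_errors` holds $\sigma(u_k^\top v_c)$. The `+=` updates on `W_in` and the context row, and the `-=` update on the negative rows, each move one parameter against its gradient, which is plain gradient descent on $J$.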
Training on Real Data
Now let's train on a larger corpus. We'll use a simple text preprocessing pipeline and train for multiple epochs:
```python
def prepare_training_data(corpus, word_to_idx, window_size=5):
    """Prepare training examples from corpus."""
    training_pairs = []

    for sentence in corpus:
        tokens = sentence.lower().split()
        # Drop words that are not in the vocabulary
        indices = [word_to_idx[word] for word in tokens if word in word_to_idx]

        for i, center_idx in enumerate(indices):
            # Get context window
            start = max(0, i - window_size)
            end = min(len(indices), i + window_size + 1)

            for j in range(start, end):
                if i != j:
                    context_idx = indices[j]
                    training_pairs.append((center_idx, context_idx))

    return training_pairs

def sample_negative_words(center_idx, vocab_size, num_samples, word_freqs):
    """Sample negative words uniformly, excluding the center word.

    Simplified: `word_freqs` is accepted but unused here. In practice,
    negatives are drawn from a frequency-based unigram distribution.
    """
    probs = np.ones(vocab_size)
    probs[center_idx] = 0
    probs = probs / probs.sum()

    negative_indices = np.random.choice(vocab_size, size=num_samples,
                                        replace=False, p=probs)
    return negative_indices

# Prepare corpus
training_corpus = [
    "the cat sits on the mat",
    "the dog runs in the park",
    "the bird flies in the sky",
    "the car drives on the road",
    "the plane flies in the air",
    "the train moves on tracks",
    "cats and dogs are pets",
    "birds fly in the sky",
    "cars drive on roads",
    "planes fly in the air"
]

# Build vocabulary from corpus
all_words = []
for sentence in training_corpus:
    all_words.extend(sentence.lower().split())
word_counts = Counter(all_words)
vocab = [word for word, count in word_counts.items() if count >= 1]  # Keep all words for this example
vocab_size = len(vocab)
word_to_idx = {word: i for i, word in enumerate(vocab)}
idx_to_word = {i: word for word, i in word_to_idx.items()}

# Initialize model
model = Word2Vec(vocab_size=vocab_size, embedding_dim=20,
                 window_size=2, negative_samples=5, learning_rate=0.1)

# Prepare training data
training_pairs = prepare_training_data(training_corpus, word_to_idx, window_size=2)

# Train for a few epochs
epochs = 50
loss_history = []
for epoch in range(epochs):
    total_loss = 0
    random.shuffle(training_pairs)

    for center_idx, context_idx in training_pairs:
        negative_indices = sample_negative_words(center_idx, vocab_size,
                                                 model.negative_samples, None)
        loss = model.train_step(center_idx, context_idx, negative_indices)
        total_loss += loss

    avg_loss = total_loss / len(training_pairs)
    loss_history.append(avg_loss)
```

```
Vocabulary size: 31
Vocabulary: ['the', 'cat', 'sits', 'on', 'mat', 'dog', 'runs', 'in', 'park', 'bird', 'flies', 'sky', 'car', 'drives', 'road', 'plane', 'air', 'train', 'moves', 'tracks', 'cats', 'and', 'dogs', 'are', 'pets', 'birds', 'fly', 'cars', 'drive', 'roads', 'planes']
Training pairs: 156
Training progress:
Epoch 1/50, Average Loss: 4.1580
Epoch 11/50, Average Loss: 2.2803
Epoch 21/50, Average Loss: 1.8897
Epoch 31/50, Average Loss: 1.7426
Epoch 41/50, Average Loss: 1.7041
Training complete!
Final loss: 1.5618
```
The vocabulary contains all unique words from our training corpus. We generate training pairs by sliding a context window across each sentence, creating (center word, context word) pairs. During training, the loss decreases as the model learns to predict context words more accurately. The decreasing loss indicates that the embeddings are learning meaningful relationships: words that appear in similar contexts are being pulled together in embedding space.
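To see what the sliding window produces, here is a minimal sketch that works on words rather than indices. `skipgram_pairs` is an illustrative helper that mirrors the windowing logic of `prepare_training_data`; it is not the article's function.

```python
def skipgram_pairs(tokens, window_size=2):
    """Return (center, context) word pairs for one tokenized sentence."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window_size)
        end = min(len(tokens), i + window_size + 1)
        for j in range(start, end):
            if j != i:  # skip the center position itself
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs("the cat sits on the mat".split())
print(len(pairs))  # 18
print(pairs[:5])
# [('the', 'cat'), ('the', 'sits'), ('cat', 'the'), ('cat', 'sits'), ('cat', 'on')]
```

A six-word sentence yields 18 pairs with a window of 2, a five-word sentence 14, and a four-word sentence 10. The training corpus has five, four, and one sentences of those lengths, giving 5 × 18 + 4 × 14 + 10 = 156, which matches the "Training pairs: 156" line in the output.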
Visualizing Training Progress
A loss curve shows how the model improves over training:

Training loss curve showing how the negative sampling loss decreases over 50 epochs. The steady decline indicates that the model is successfully learning to distinguish between positive context words and negative samples, pulling related words together in embedding space while pushing unrelated words apart. The loss plateaus near the end, suggesting the model has converged to a good solution.
The loss curve demonstrates successful training: the loss decreases steadily from the initial random embeddings to a much lower value, indicating the model is learning meaningful word relationships. The initial rapid decrease shows the model quickly learns basic patterns, while the gradual decline and eventual plateau suggest it's converging to a stable solution where embeddings capture semantic relationships effectively.
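The starting point of the curve can be sanity-checked by hand. With embeddings initialized near zero, every dot product is close to 0 and every sigmoid is close to 0.5, so each of the one positive and five negative terms contributes about ln 2 to the loss. The short calculation below is a back-of-the-envelope check under that assumption, not part of the training code.

```python
import math

negative_samples = 5  # value used in the training run above

# Near-zero embeddings: sigmoid(0) = 0.5 for the positive pair and for
# each negative sample, so every term of the loss is -log(0.5) = ln 2.
initial_loss = (1 + negative_samples) * -math.log(0.5)
print(round(initial_loss, 4))  # 4.1589
```

The reported first-epoch average of 4.1580 sits just below this value, as expected, since the weights are already being updated during that epoch.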
Visualizing Learned Embeddings
Now let's visualize the embeddings to see if semantically similar words cluster together:

Word embeddings projected to 2D using PCA. Semantically related words cluster together: animals ('cat', 'dog', 'bird') form one group, vehicles ('car', 'plane', 'train') form another, and function words ('the', 'on', 'in') cluster separately. This demonstrates how Word2Vec learns to encode semantic relationships in the geometry of vector space.
The visualization shows that even with minimal training data, Word2Vec learns meaningful structure. Animals cluster together, vehicles form their own group, and function words occupy a distinct region. This demonstrates how semantic relationships emerge naturally from co-occurrence patterns.
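The projection step behind such a plot can be sketched with NumPy alone. The article does not show its plotting code, so the version below is an assumption: PCA computed via SVD on mean-centered vectors, with a random placeholder matrix standing in for `model.get_embeddings()`.

```python
import numpy as np

def pca_2d(embeddings):
    """Project the rows of `embeddings` onto their first two principal components."""
    centered = embeddings - embeddings.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

rng = np.random.default_rng(0)
placeholder = rng.normal(size=(31, 20))  # stand-in for the trained (31, 20) embeddings
coords = pca_2d(placeholder)
print(coords.shape)  # (31, 2)
```

Each row of `coords` is an (x, y) position that can be passed to a scatter plot and labeled with the corresponding word.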
Computing Word Similarities
We can measure similarity between words using cosine similarity:
```python
from sklearn.metrics.pairwise import cosine_similarity

def find_most_similar(word, embeddings, vocab, word_to_idx, top_k=5):
    """Find the most similar words to a given word."""
    if word not in word_to_idx:
        return []

    word_idx = word_to_idx[word]
    word_emb = embeddings[word_idx:word_idx+1]  # Keep 2D for cosine_similarity

    # Compute similarities with all words
    similarities = cosine_similarity(word_emb, embeddings)[0]

    # Get top k (excluding the word itself)
    top_indices = np.argsort(similarities)[::-1][1:top_k+1]

    results = [(vocab[idx], similarities[idx]) for idx in top_indices]
    return results

# Test similarity
embeddings = model.get_embeddings()
cat_similar = find_most_similar("cat", embeddings, vocab, word_to_idx)
car_similar = find_most_similar("car", embeddings, vocab, word_to_idx)
```

```
Most similar to 'cat':
car: 0.8797
sits: 0.8629
mat: 0.8382
road: 0.8239
on: 0.8001
Most similar to 'car':
drives: 0.8947
cat: 0.8797
on: 0.8470
mat: 0.8277
road: 0.8002
```
The similarity scores show how close two words are in embedding space. Words with higher cosine similarity (closer to 1.0) have more similar embeddings, meaning they appear in similar contexts. With a large corpus we would expect animals such as "dog" or "bird" near "cat", and vehicles such as "plane" or "train" near "car". On this ten-sentence corpus the neighbors instead reflect shared sentence frames: "cat" sits closest to "car" (both are the subject of a "the X verb on the Y" sentence) and to words from its own sentence ("sits", "mat", "on"). The embeddings are capturing co-occurrence structure, but the corpus is far too small for that structure to line up cleanly with semantic categories.
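If scikit-learn is not available, the same quantity is easy to compute directly. The sketch below uses hand-picked 2D vectors (hypothetical, chosen so the answers are easy to verify) and an illustrative helper named `cosine_to_all`.

```python
import numpy as np

def cosine_to_all(query, embeddings):
    """Cosine similarity between one vector and every row of a matrix.

    Assumes no row (and not the query) is the zero vector.
    """
    norms = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query)
    return embeddings @ query / norms

# Hypothetical vectors: same direction, 45 degrees, orthogonal, opposite.
toy = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0], [-1.0, 0.0]])
sims = cosine_to_all(toy[0], toy)
print(np.round(sims, 4))  # approximately 1, 0.7071, 0, -1
```

Note that the third vector has length 2 yet scores 0: cosine similarity depends only on direction, not magnitude.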
Word Analogies
Word embeddings can capture analogies through vector arithmetic. The classic example is: "king - man + woman = queen". Let's test this:
```python
def word_analogy(word1, word2, word3, embeddings, vocab, word_to_idx, top_k=5):
    """Solve word analogy: word1 is to word2 as word3 is to ?"""
    if not all(w in word_to_idx for w in [word1, word2, word3]):
        return []

    # Compute analogy vector: vec3 + (vec2 - vec1)
    vec1 = embeddings[word_to_idx[word1]]
    vec2 = embeddings[word_to_idx[word2]]
    vec3 = embeddings[word_to_idx[word3]]

    analogy_vec = vec3 + (vec2 - vec1)
    analogy_vec = analogy_vec.reshape(1, -1)

    # Find most similar words
    similarities = cosine_similarity(analogy_vec, embeddings)[0]

    # Exclude the input words
    exclude_indices = [word_to_idx[w] for w in [word1, word2, word3]]
    similarities[exclude_indices] = -1

    top_indices = np.argsort(similarities)[::-1][:top_k]
    results = [(vocab[idx], similarities[idx]) for idx in top_indices]
    return results

# Test analogies (with our limited vocabulary)
analogy1 = word_analogy("cat", "dog", "car", embeddings, vocab, word_to_idx)
analogy2 = word_analogy("flies", "bird", "drives", embeddings, vocab, word_to_idx)
```

```
cat is to dog as car is to ?
bird: 0.8005
plane: 0.7982
sky: 0.7787
air: 0.7643
park: 0.7554
flies is to bird as drives is to ?
car: 0.9235
on: 0.8215
cat: 0.7997
mat: 0.7539
road: 0.7397
```
Word analogies test whether embeddings capture relational patterns. The analogy "cat is to dog as car is to ?" asks: what word has the same relationship to "car" that "dog" has to "cat"? We would hope for "plane" or "train" (another vehicle); in the run above "plane" ranks second, just behind "bird". The second analogy ranks "car" first, as expected. With a small vocabulary and limited training the results are noisy, but the principle stands: semantic relationships become geometric relationships in vector space.
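The arithmetic is easiest to follow on vectors where the answer is known in advance. The toy table below is entirely hypothetical: two hand-set dimensions that loosely stand for "royalty" and "gender", plus one distractor word. `solve_analogy` is an illustrative helper, not the article's `word_analogy`.

```python
import numpy as np

# Hypothetical 2D vectors: axis 0 ~ "royalty", axis 1 ~ "gender".
toy = {
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "apple": np.array([-1.0, 0.2]),
}

def solve_analogy(a, b, c, vectors):
    """Return the word closest (by cosine) to vec(b) - vec(a) + vec(c)."""
    target = vectors[b] - vectors[a] + vectors[c]
    best_word, best_sim = None, -np.inf
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue  # exclude the query words, as word_analogy does
        sim = vec @ target / (np.linalg.norm(vec) * np.linalg.norm(target))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

# "man is to king as woman is to ?"
print(solve_analogy("man", "king", "woman", toy))  # queen
```

Here king - man + woman equals [1, -1] exactly, which is the vector for "queen". Learned embeddings only approximate this, which is why the nearest neighbor of the result is reported rather than an exact match.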
Key Parameters
The Word2Vec implementation uses several key parameters that control training and embedding quality:
- embedding_dim (class default: 100; typical range: 50-300): The dimensionality of the embedding vectors. Larger dimensions can capture more nuanced relationships but require more training data and computation, with 100-200 being most common for general-purpose embeddings.
- window_size (class default: 5; typical range: 5-10): The size of the context window on each side of the center word. Larger windows capture broader, topical relationships, while smaller windows focus on local syntactic relationships. For most applications, 5-10 words on each side works well.
- negative_samples (class default: 5; typical range: 5-20): The number of negative examples sampled for each positive example. More negative samples can improve embedding quality but slow training, with 5-10 being common for large corpora.
- learning_rate (class default: 0.01; typical range: 0.01-0.05): The step size for gradient descent updates. Higher learning rates train faster but may overshoot optimal values. Lower learning rates are more stable but require more epochs. Start within the typical range and adjust based on loss convergence.
- vocab_size: The number of unique words in the vocabulary. This is determined by your corpus and preprocessing choices. Larger vocabularies require more memory and computation but capture more word diversity.
These parameters interact: larger embedding dimensions and more negative samples improve quality but increase training time. For production use, start with moderate values (embedding_dim=200, window_size=5, negative_samples=5) and tune based on your specific task and computational constraints.
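One detail tied to negative_samples is which distribution the negatives are drawn from. The sampler in this article draws uniformly; the original word2vec implementation instead draws in proportion to unigram counts raised to the 0.75 power, which damps very frequent words like "the". A minimal sketch, with `unigram_sampling_probs` as an illustrative name and a made-up token list:

```python
import numpy as np
from collections import Counter

def unigram_sampling_probs(tokens, power=0.75):
    """Return (vocab, probs) with probs proportional to count ** power."""
    counts = Counter(tokens)
    vocab = sorted(counts)
    weights = np.array([counts[w] for w in vocab], dtype=float) ** power
    return vocab, weights / weights.sum()

tokens = "the cat sits on the mat the dog runs in the park".split()
vocab, probs = unigram_sampling_probs(tokens)

# replace=False mirrors the sampler used earlier in this article.
rng = np.random.default_rng(0)
negatives = rng.choice(len(vocab), size=3, replace=False, p=probs)
print([vocab[i] for i in negatives])
print(round(float(probs[vocab.index("the")]), 4))  # 0.2612, versus a raw frequency of 4/12 = 0.3333
```

The exponent flattens the distribution: "the" is still sampled most often, but less than its raw frequency would suggest, so rarer words appear as negatives more regularly.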
Comparing Word2Vec and GloVe
Both Word2Vec and GloVe produce high-quality embeddings, but they have different strengths:
Word2Vec advantages:
- Online learning: can process streaming data
- Memory efficient: doesn't need to store full co-occurrence matrix
- Fast training with negative sampling
- Better for very large corpora
GloVe advantages:
- Uses global statistics: all co-occurrence information simultaneously
- Often performs better on analogy tasks
- More interpretable: explicit relationship to co-occurrence statistics
- Can leverage pre-computed co-occurrence matrices
In practice, both methods produce embeddings of similar quality. The choice often depends on your specific constraints: use Word2Vec for streaming data or very large corpora, use GloVe when you can pre-compute co-occurrence statistics and want slightly better performance on semantic tasks.
Limitations & Impact
Word embeddings revolutionized NLP, but they have important limitations that contextual embeddings (like BERT) address.
Key Limitations
- Static representations: Each word has a single embedding regardless of context. "bank" (financial institution) and "bank" (river edge) have the same vector, even though they mean different things.
- Out-of-vocabulary words: Words not seen during training get no embedding. This is particularly problematic for morphologically rich languages and domain-specific terminology.
- No sentence-level understanding: Embeddings capture word-level semantics but don't understand how words combine to form meaning. The phrase "not good" might be represented as the average of "not" and "good" embeddings, losing the negation.
- Bias amplification: Embeddings learn biases present in training data. Gender stereotypes, racial biases, and other problematic associations get encoded in the vector space.
- Limited to word-level: Can't represent subword units (important for morphologically rich languages) or multi-word expressions without special handling.
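Three of these limitations are easy to reproduce with a plain lookup table. The sketch below uses hand-picked 2-D vectors rather than trained embeddings, and the `lookup` and `average_vector` helpers are hypothetical names introduced only for this illustration.

```python
# Hand-picked 2-D vectors for illustration only; they are not trained embeddings.
embeddings = {
    "bank": [0.2, 0.7],
    "river": [0.9, 0.1],
    "loan": [0.1, 0.9],
    "not": [0.0, -0.5],
    "good": [0.8, 0.6],
}


def lookup(word):
    """Return the single stored vector for a word, or None if it is out of vocabulary."""
    return embeddings.get(word)


def average_vector(words):
    """Represent a phrase as the element-wise mean of its word vectors."""
    vectors = [embeddings[w] for w in words]
    return [sum(component) / len(vectors) for component in zip(*vectors)]


# Static representation: the surrounding words play no role in the lookup.
bank_in_finance = lookup("bank")  # as in "bank loan"
bank_in_nature = lookup("bank")   # as in "river bank"
print(bank_in_finance == bank_in_nature)

# Out-of-vocabulary: a word missing from the table has no vector at all.
print(lookup("blockchain"))

# Averaging discards word order, so the phrase vector cannot mark what "not" negates.
print(average_vector(["not", "good"]) == average_vector(["good", "not"]))
```

Both senses of "bank" resolve to the same stored vector, the unseen word returns nothing, and the two word orders produce identical phrase vectors.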
Historical Impact
Despite these limitations, word embeddings had a significant impact:
- Transfer learning: Pre-trained embeddings enabled NLP models to work with limited task-specific data
- Semantic search: Embeddings power modern search engines and recommendation systems
- Foundation for transformers: Learned embedding tables carry over as the token embedding layer in transformers, which add positional encodings to supply word order
- Research catalyst: Sparked interest in representation learning that led to contextual embeddings
The Path Forward
Word embeddings were a crucial stepping stone. They proved that dense vector representations could capture semantic relationships, setting the stage for contextual embeddings from transformers. Today, while static word embeddings are rarely used directly in state-of-the-art systems, the principles they established (dense representations, transfer learning, semantic relationships in vector space) remain foundational to modern NLP.
Summary
Word embeddings solve the fundamental problem of representing words as numbers in a way that captures semantic relationships. Unlike one-hot encoding, which treats all words as equally different, embeddings learn dense, low-dimensional vectors where similar words have similar representations.
Key takeaways:
- Distributional hypothesis: Words appearing in similar contexts have similar meanings. This principle underlies both Word2Vec and GloVe.
- Word2Vec: Learns embeddings by predicting context words (skip-gram) or center words (CBOW) using local windows. Negative sampling makes training efficient by approximating the expensive softmax.
- GloVe: Learns embeddings by factorizing a global co-occurrence matrix. Uses ratios of co-occurrence probabilities to capture word relationships.
- Semantic relationships: Embeddings encode meaning in vector geometry. Similarity is measurable (cosine similarity), and analogies can be solved through vector arithmetic.
- Limitations: Static representations can't handle polysemy, require handling of OOV words, and may encode biases from training data.
- Impact: Enabled transfer learning in NLP and laid the foundation for modern contextual embeddings.
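As a compact reminder of the geometry behind the "semantic relationships" takeaway, the sketch below computes cosine similarity and solves one analogy by vector arithmetic. The vectors are hand-picked toy values (their dimensions loosely stand for royalty, male, and female), not trained embeddings, and the function names are hypothetical.

```python
import math

# Hand-picked toy vectors (dimensions loosely: royalty, male, female); not trained.
vectors = {
    "king": [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.9],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.5, 0.5],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot product over product of norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def solve_analogy(a, b, c):
    """Answer 'a is to b as c is to ?' via the nearest neighbor of b - a + c."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # Exclude the three query words, as is common practice for analogy evaluation.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_similarity(vectors[w], target))


print(solve_analogy("man", "king", "woman"))
print(round(cosine_similarity(vectors["king"], vectors["queen"]), 3))
```

With these toy values, king - man + woman lands next to "queen". Trained embeddings behave less cleanly, so the answer is taken as the nearest neighbor of the target vector, not an exact match.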
Word embeddings transformed NLP by making semantic relationships computable. While superseded by contextual embeddings in most applications, understanding embeddings is important for grasping how modern language models represent and understand text.
About the author: Michael Brenndoerfer
All opinions expressed here are my own and do not reflect the views of my employer.
Michael currently works as an Associate Director of Data Science at EQT Partners in Singapore, where he drives AI and data initiatives across private capital investments.
With over a decade of experience spanning private equity, management consulting, and software engineering, he specializes in building and scaling analytics capabilities from the ground up. He has published research in leading AI conferences and holds expertise in machine learning, natural language processing, and value creation through data.