Skip-gram Model: Learning Word Embeddings by Predicting Context

A comprehensive guide to the Skip-gram model from Word2Vec, covering architecture, objective function, training data generation, and implementation from scratch.

Skip-gram Model

The distributional hypothesis tells us that words appearing in similar contexts have similar meanings. But how do we turn this insight into practical word representations? Co-occurrence matrices capture contextual patterns, but they're sparse, high-dimensional, and computationally expensive. What if we could learn dense, low-dimensional vectors that encode the same distributional information more efficiently?

In 2013, Mikolov et al. introduced Word2Vec, a family of neural network models that changed how we create word representations. The Skip-gram model, one of the two Word2Vec architectures, takes a simple approach: given a word, predict its context. By training a neural network on this task across billions of words, we learn dense vectors that capture rich semantic relationships. Words like "king" and "queen" end up close together in vector space. Vector arithmetic even works: $\vec{\text{king}} - \vec{\text{man}} + \vec{\text{woman}} \approx \vec{\text{queen}}$.

This chapter introduces the Skip-gram architecture from the ground up. We'll build intuition for why predicting context words leads to meaningful representations, work through the mathematics step by step, and implement a working Skip-gram model from scratch.

The Core Idea: Predicting Context from Words

Traditional distributional methods count co-occurrences and store them in massive matrices. Skip-gram flips the script: instead of counting, we predict. Given a target word, the model tries to predict which words appear nearby in the training corpus.

Skip-gram Model

The Skip-gram model learns word representations by training a neural network to predict context words given a center word. The learned weights of this network become the word embeddings.

Consider the sentence: "The quick brown fox jumps over the lazy dog." If we take "fox" as our target word with a context window of size 2, Skip-gram asks: given "fox," can we predict that "brown," "quick," "jumps," and "over" appear nearby?

In[2]:
# Demonstrating Skip-gram's prediction task
sentence = "The quick brown fox jumps over the lazy dog"
words = sentence.lower().split()

def get_skipgram_pairs(words, target_idx, window_size=2):
    """Generate (target, context) pairs for Skip-gram training."""
    target = words[target_idx]
    pairs = []
    
    for offset in range(-window_size, window_size + 1):
        if offset == 0:  # Skip the target word itself
            continue
        context_idx = target_idx + offset
        if 0 <= context_idx < len(words):
            pairs.append((target, words[context_idx]))
    
    return pairs

# Get pairs for "fox" (index 3)
fox_pairs = get_skipgram_pairs(words, target_idx=3, window_size=2)
Out[3]:
Skip-gram Training Pairs for 'fox':
---------------------------------------------
Sentence: 'The quick brown fox jumps over the lazy dog'
Target word: 'fox' (position 3)
Window size: 2

Generated (target → context) pairs:
  'fox' → 'quick'
  'fox' → 'brown'
  'fox' → 'jumps'
  'fox' → 'over'

The model learns by trying to maximize the probability of these context words given the target. If "fox" frequently appears near "brown" in the training corpus, the model adjusts its weights to make $P(\text{brown} | \text{fox})$ high. Through millions of such updates, words that appear in similar contexts develop similar vector representations.

Architecture: Two Embedding Matrices

The Skip-gram architecture is simple: a shallow neural network with a single hidden layer and no activation function. The key insight lies in what we do with the learned weights.

Out[4]:
Visualization
Neural network diagram showing input one-hot vector, embedding layer, hidden representation, output layer, and softmax probabilities.
Skip-gram architecture. The input is a one-hot encoded target word, which is projected through the embedding matrix W to produce a dense vector. This vector is then projected through the context matrix W' to produce scores for every word in the vocabulary. Softmax converts these scores to probabilities over context words.

The network has two weight matrices:

  1. Embedding matrix $\mathbf{W}$ (size $V \times d$): Maps input words to dense vectors. Each row is the embedding for one vocabulary word. When we input a one-hot vector for word $w$, multiplying by $\mathbf{W}^T$ simply selects the corresponding row.

  2. Context matrix $\mathbf{W}'$ (size $d \times V$): Maps the hidden representation to output scores. Each column represents a word as a potential context.

Here $V$ is the vocabulary size (often 100,000+ words) and $d$ is the embedding dimension (typically 100-300).

In[5]:
import numpy as np

# Initialize Skip-gram model parameters
vocab_size = 10000  # V: number of unique words
embedding_dim = 100  # d: dimension of word vectors

# Embedding matrix: each row is a word's embedding
W = np.random.randn(vocab_size, embedding_dim) * 0.01

# Context matrix: each column is a word's context representation
W_prime = np.random.randn(embedding_dim, vocab_size) * 0.01
Out[6]:
Skip-gram Model Dimensions:
---------------------------------------------
Vocabulary size (V):     10,000
Embedding dimension (d): 100

Embedding matrix W:      10,000 × 100
Context matrix W':       100 × 10,000

Total parameters:        2,000,000

With 2 million parameters, Skip-gram is lightweight compared to modern language models. This efficiency comes from the shallow architecture: just two matrix multiplications, with no nonlinear activation functions between them. Despite this simplicity, Skip-gram learns rich representations.

Why Two Matrices?

Skip-gram maintains separate embeddings for words as targets ($\mathbf{W}$) and as contexts ($\mathbf{W}'$). After training, we typically use only the embedding matrix $\mathbf{W}$ as our word vectors, though some implementations average both matrices or concatenate them.
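
Concretely, here is a minimal sketch of these three options for turning the trained matrices into final word vectors. It uses small stand-in matrices with the same shapes as the W and W_prime defined above; which option you pick is a convention, not part of the algorithm itself.

```python
import numpy as np

# Stand-in matrices with the same shapes as W and W_prime above (V = 8, d = 4)
V, d = 8, 4
W = np.random.randn(V, d) * 0.01        # target embeddings: one row per word
W_prime = np.random.randn(d, V) * 0.01  # context embeddings: one column per word

# Option 1: use only the target embeddings (the most common choice)
vectors_target = W                                       # shape (V, d)

# Option 2: average the target and context representations
vectors_avg = (W + W_prime.T) / 2                        # shape (V, d)

# Option 3: concatenate them, doubling the dimensionality
vectors_concat = np.concatenate([W, W_prime.T], axis=1)  # shape (V, 2d)
```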

Input and Output Representations

One-Hot Encoding

The input to Skip-gram is a one-hot encoded vector. For a vocabulary of $V$ words, each word is represented as a vector of length $V$ with a single 1 at the word's index and 0s everywhere else.

In[7]:
def create_one_hot(word_idx, vocab_size):
    """Create a one-hot vector for a word."""
    one_hot = np.zeros(vocab_size)
    one_hot[word_idx] = 1
    return one_hot

# Example: vocabulary and one-hot encoding
small_vocab = ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'lazy', 'dog']
word_to_idx = {word: i for i, word in enumerate(small_vocab)}

# One-hot for "fox"
fox_one_hot = create_one_hot(word_to_idx['fox'], len(small_vocab))
Out[8]:
One-Hot Encoding Example:
---------------------------------------------
Vocabulary: ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'lazy', 'dog']
Word-to-index mapping: {'the': 0, 'quick': 1, 'brown': 2, 'fox': 3, 'jumps': 4, 'over': 5, 'lazy': 6, 'dog': 7}

One-hot vector for 'fox':
  [0. 0. 0. 1. 0. 0. 0. 0.]
  Position 3 = 1 (fox's index)

The one-hot vector is extremely sparse: 7 zeros and a single 1. For a real vocabulary of 100,000 words, each input would have 99,999 zeros. This sparsity is why the embedding lookup is so efficient: multiplying by a one-hot vector simply selects one row.

From One-Hot to Embedding

When we multiply the one-hot vector by the transpose of the embedding matrix, $\mathbf{W}^T$, something useful happens: we simply extract the row corresponding to our input word.

$$\mathbf{h} = \mathbf{W}^T \mathbf{x}$$

If $\mathbf{x}$ is one-hot with a 1 at position $i$, then $\mathbf{h}$ is exactly the $i$-th row of $\mathbf{W}$. This is why we call $\mathbf{W}$ the "embedding matrix": its rows are the word embeddings.

In[9]:
# Small embedding matrix for demonstration
small_W = np.random.randn(len(small_vocab), 4) * 0.5  # 8 words, 4 dimensions

def get_embedding(word, W, word_to_idx):
    """Get the embedding for a word (equivalent to W.T @ one_hot)."""
    idx = word_to_idx[word]
    return W[idx]  # Simply select the row

# Get embedding for "fox"
fox_embedding = get_embedding('fox', small_W, word_to_idx)
Out[10]:
Embedding Lookup:
---------------------------------------------
Embedding matrix shape: (8, 4)

Row 3 of W (fox's embedding):
  [-0.21746121  0.5706275   0.44208308  0.25415296]

W.T @ one_hot('fox'):
  [-0.21746121  0.5706275   0.44208308  0.25415296]

Both methods give the same result!

The direct row selection W[idx] is computationally equivalent to the matrix multiplication W.T @ one_hot but far more efficient. In practice, embedding layers in deep learning frameworks use this lookup optimization rather than actual matrix multiplication.

Output Scores and Softmax

The hidden vector $\mathbf{h}$ is then projected through the context matrix $\mathbf{W}'$ to produce a score for each vocabulary word:

$$\mathbf{z} = \mathbf{W}'^T \mathbf{h}$$

Each element $z_j$ represents how likely word $j$ is to be a context word. To convert these scores to probabilities, we apply the softmax function:

$$P(w_j | w_i) = \frac{\exp(z_j)}{\sum_{k=1}^{V} \exp(z_k)} = \frac{\exp(\mathbf{w}'_j \cdot \mathbf{w}_i)}{\sum_{k=1}^{V} \exp(\mathbf{w}'_k \cdot \mathbf{w}_i)}$$

where:

  • $w_i$: the input (center) word
  • $w_j$: a candidate context word
  • $\mathbf{w}_i$: the embedding vector of the input word (row $i$ of $\mathbf{W}$)
  • $\mathbf{w}'_j$: the context vector of word $j$ (column $j$ of $\mathbf{W}'$)
  • $V$: the vocabulary size
In[11]:
def softmax(z):
    """Compute softmax probabilities."""
    # Subtract max for numerical stability
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)

def forward_pass(input_word, W, W_prime, word_to_idx):
    """Complete forward pass through Skip-gram."""
    # Get input embedding
    h = get_embedding(input_word, W, word_to_idx)
    
    # Compute output scores
    z = W_prime.T @ h
    
    # Apply softmax
    probs = softmax(z)
    
    return h, z, probs

# Small context matrix
small_W_prime = np.random.randn(4, len(small_vocab)) * 0.5

# Forward pass for "fox"
h, z, probs = forward_pass('fox', small_W, small_W_prime, word_to_idx)
Out[12]:
Forward Pass for 'fox':
---------------------------------------------
Hidden vector h (embedding): [-0.21746121  0.5706275   0.44208308  0.25415296]

Output scores z:
  the     :  0.142
  quick   :  0.192
  brown   :  0.036
  fox     : -0.022
  jumps   :  0.633
  over    :  0.042
  lazy    :  0.321
  dog     :  0.145

Softmax probabilities P(context | fox):
  the     : 0.1172
  quick   : 0.1231
  brown   : 0.1054
  fox     : 0.0994
  jumps   : 0.1914
  over    : 0.1060
  lazy    : 0.1401
  dog     : 0.1175

Sum of probabilities: 1.0000

The raw scores (logits) can be any real number, positive or negative. Softmax transforms them into a valid probability distribution: all values between 0 and 1, summing to exactly 1. Notice how the word with the highest score gets a disproportionately large probability. This "winner-take-more" behavior is characteristic of the exponential function in softmax.

Out[13]:
Visualization
Two bar charts comparing raw scores and softmax probabilities across vocabulary words.
Transformation from raw scores to softmax probabilities. The left panel shows raw output scores (logits) which can be any real number. The right panel shows the corresponding softmax probabilities, which are positive and sum to 1. Softmax amplifies differences: the highest score gets a disproportionately large probability.

Understanding softmax behavior matters because it determines how the model distributes probability mass. The exponential function amplifies differences: even small gaps in raw scores become large probability differences. This property helps the model make confident predictions but also creates computational challenges we'll address later.
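
To see this amplification concretely, the sketch below (using made-up scores) scales the same set of logits and shows how score gaps translate into probability gaps.

```python
import numpy as np

def softmax(z):
    """Convert raw scores to probabilities."""
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)

# Made-up raw scores for a four-word vocabulary
z = np.array([1.0, 2.0, 3.0, 4.0])

for scale in [0.5, 1.0, 2.0]:
    print(f"scale {scale}: {np.round(softmax(scale * z), 3)}")

# Widening the gaps between scores concentrates probability mass on the
# top-scoring word; shrinking them pushes the distribution toward uniform.
```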

The Skip-gram Objective Function

We've seen how Skip-gram transforms words into vectors and predicts context probabilities. But how does the model actually learn? What signal tells it whether its current embeddings are good or bad? The answer lies in the objective function: a mathematical expression that quantifies how well the model's predictions match reality.

From Intuition to Formalization

Let's build the objective function step by step, starting from a simple intuition.

The core insight: If our embeddings are good, then given a center word, the model should assign high probability to words that actually appear nearby and low probability to words that don't. The objective function formalizes this: we want to maximize the probability of observing the actual context words.

Consider our running example: the sentence "The quick brown fox jumps over the lazy dog." When "fox" is the center word with a window of size 2, the true context words are "quick," "brown," "jumps," and "over." A well-trained model should predict:

  • $P(\text{brown} | \text{fox})$ → high
  • $P(\text{jumps} | \text{fox})$ → high
  • $P(\text{elephant} | \text{fox})$ → low

The Probability of Context Words

For a single center word $w_c$ at position $c$, we observe context words at positions $c-m, c-m+1, \ldots, c-1, c+1, \ldots, c+m$ (where $m$ is the window size). Skip-gram assumes these context words are conditionally independent given the center word, so the probability of observing all of them is the product of the individual probabilities:

$$\prod_{\substack{-m \leq j \leq m \\ j \neq 0}} P(w_{c+j} | w_c)$$

Why Products?

The product form comes from the independence assumption: we treat each context position as a separate prediction task. While context words aren't truly independent (knowing "brown" appears near "fox" tells us something about what other words might appear), this simplification makes training tractable and works well in practice.

Converting to Log-Likelihood

Working with products of probabilities is numerically unstable. Multiplying many small numbers quickly underflows to zero. The standard solution is to take the logarithm, which converts products to sums:

$$\mathcal{L}_c = \sum_{\substack{-m \leq j \leq m \\ j \neq 0}} \log P(w_{c+j} | w_c)$$

This is the log-likelihood for a single center word. Higher values mean the model assigns higher probabilities to the true context words. Our goal is to maximize this quantity.
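
The numerical argument is easy to verify. The sketch below uses made-up probabilities: multiplying a thousand of them underflows to zero in floating point, while summing their logarithms stays stable.

```python
import numpy as np

# Made-up probabilities: 1,000 context predictions, each assigned probability 0.01
probs = np.full(1000, 0.01)

product = np.prod(probs)                 # 0.01**1000 underflows to exactly 0.0
log_likelihood = np.sum(np.log(probs))   # 1000 * log(0.01) ≈ -4605.2, no underflow

print(product)          # 0.0
print(log_likelihood)   # -4605.17...
```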

Unpacking the Softmax

Recall that $P(w_{c+j} | w_c)$ is computed via softmax over dot products:

$$P(w_{c+j} | w_c) = \frac{\exp(\mathbf{w}'_{c+j} \cdot \mathbf{w}_c)}{\sum_{k=1}^{V} \exp(\mathbf{w}'_k \cdot \mathbf{w}_c)}$$

Substituting this into our log-likelihood and using the property $\log(a/b) = \log a - \log b$:

$$\mathcal{L}_c = \sum_{\substack{-m \leq j \leq m \\ j \neq 0}} \left[ \mathbf{w}'_{c+j} \cdot \mathbf{w}_c - \log \sum_{k=1}^{V} \exp(\mathbf{w}'_k \cdot \mathbf{w}_c) \right]$$

This expanded form reveals the two forces at work during training:

  1. The positive term $\mathbf{w}'_{c+j} \cdot \mathbf{w}_c$: Maximizing this pushes the context word's vector $\mathbf{w}'_{c+j}$ closer to the center word's embedding $\mathbf{w}_c$. The dot product increases when the vectors point in similar directions.

  2. The normalization term $\log \sum_{k=1}^{V} \exp(\mathbf{w}'_k \cdot \mathbf{w}_c)$: This term is subtracted, so maximizing the objective means minimizing this sum. Since the sum includes all vocabulary words, this effectively pushes all other words away from the center word.

The interplay between these forces is what makes Skip-gram work: it simultaneously pulls true context words closer while pushing non-context words away.
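
To make the decomposition concrete, here is a toy calculation with made-up vectors for a three-word vocabulary, splitting the log-probability of the true context word into the positive term and the log-normalization term.

```python
import numpy as np

# Toy example: 3-word vocabulary, 2-dimensional embeddings (made-up values)
w_center = np.array([1.0, 0.5])           # embedding of the center word
W_ctx = np.array([
    [0.9, 0.4],    # context vector of the true context word
    [-0.2, 0.1],   # context vectors of the two other words
    [0.3, -0.5],
])

scores = W_ctx @ w_center                        # dot product with every context vector
positive_term = scores[0]                        # w'_context · w_center
log_normalizer = np.log(np.sum(np.exp(scores)))  # log of the softmax denominator

log_prob = positive_term - log_normalizer        # log P(context | center)
print(f"positive term:           {positive_term:.3f}")
print(f"log normalizer:          {log_normalizer:.3f}")
print(f"log P(context | center): {log_prob:.3f}")
```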

Out[14]:
Visualization
Histogram comparing dot product distributions for context vs non-context word pairs before and after training.
Visualization of the two forces in the Skip-gram objective. Left: Distribution of dot products between center word embeddings and context vectors. Before training (random initialization), all dot products cluster around zero. After training, true context words (green) have higher dot products than non-context words (red), showing the model has learned to distinguish them. Right: The separation between distributions indicates how well the model predicts context words.

This visualization shows exactly what Skip-gram learns to do: before training, dot products between any word pair are randomly distributed around zero. After training, the distributions separate. Context words (which should have high probability) develop higher dot products, while non-context words have lower dot products. This separation is what enables the softmax to assign high probabilities to true context words.

The Full Corpus Objective

A single center word gives us one training signal. To learn robust embeddings, we aggregate over the entire corpus. If the corpus has $T$ words total, we average the log-likelihood across all positions:

$$J = \frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \leq j \leq m \\ j \neq 0}} \log P(w_{t+j} | w_t)$$

where:

  • $T$: total number of words in the training corpus
  • $t$: position of the current center word (ranging from 1 to $T$)
  • $m$: window size (number of context words on each side)
  • $w_t$: the word at position $t$ (center word)
  • $w_{t+j}$: a context word at offset $j$ from position $t$

This is what we maximize during training. In practice, we minimize the negative log-likelihood (cross-entropy loss), which is equivalent but aligns with the convention of "minimizing loss."

Implementing the Loss Function

Let's translate this mathematics into code. The loss function computes how poorly the model predicts context words for a given center word:

In[15]:
def compute_loss(center_word, context_words, W, W_prime, word_to_idx):
    """
    Compute Skip-gram loss for one center word and its contexts.
    
    The loss is the negative log-likelihood: lower values indicate
    better predictions of context words.
    """
    # Step 1: Get the center word's embedding vector
    h = get_embedding(center_word, W, word_to_idx)
    
    # Step 2: Compute output scores (dot products with all context vectors)
    z = W_prime.T @ h
    
    # Step 3: Compute log-sum-exp for numerical stability
    # This is log(sum(exp(z_k))) = log of the softmax denominator
    log_sum_exp = np.log(np.sum(np.exp(z - np.max(z)))) + np.max(z)
    
    # Step 4: Sum negative log probabilities for each context word
    total_loss = 0
    for context_word in context_words:
        context_idx = word_to_idx[context_word]
        # log P(context | center) = z[context] - log_sum_exp
        log_prob = z[context_idx] - log_sum_exp
        total_loss -= log_prob  # Negative because we minimize loss
    
    return total_loss

# Example: compute loss for "fox" predicting its context
context_words = ['brown', 'quick', 'jumps', 'over']
loss = compute_loss('fox', context_words, small_W, small_W_prime, word_to_idx)
Out[16]:
Loss Computation Example:
---------------------------------------------
Center word: 'fox'
Context words: ['brown', 'quick', 'jumps', 'over']

Negative log-likelihood loss: 8.2429

Interpretation:
  - Random baseline loss ≈ 8.32
    (if all words equally likely)
  - Lower loss = better context predictions

The loss value tells us how surprised the model is by the actual context words. With randomly initialized weights, the model assigns roughly equal probability ($1/V$) to all words, yielding a baseline loss of approximately $-\log(1/V) = \log(V)$ per context word. For our 8-word vocabulary, that's about $\log(8) \approx 2.1$ per context word, or roughly 8.3 total for four context words.

As training progresses, the model learns to assign higher probabilities to actual context words, driving the loss down. A well-trained model on a large corpus typically achieves losses in the range of 2-4 (when using negative sampling), indicating it has learned to predict context words much better than chance.

Visualizing Gradient Updates

To understand how Skip-gram learns, let's visualize what happens during a single gradient update. When we train on the pair ("fox" → "brown"), the gradients adjust the embeddings:

Out[17]:
Visualization
Bar chart showing gradient magnitudes for each word in vocabulary during one training step.
Gradient magnitudes during a single training step for the pair (fox → brown). Left: Gradients for the context matrix W' show how each word's context vector is updated. The true context word 'brown' receives a negative gradient (pulling it closer), while all other words receive positive gradients (pushing them away). Right: The magnitude of updates varies by word, with the target context word receiving the largest adjustment.

The gradient visualization reveals the core learning mechanism:

  • The target context word ("brown") receives a negative gradient, meaning its context vector will be updated to increase its dot product with "fox", pulling it closer in embedding space.
  • All other words receive positive gradients proportional to their current probability. Words the model incorrectly thinks are likely contexts get pushed away more strongly.

This push-pull dynamic, repeated millions of times across the corpus, gradually organizes the embedding space so that words appearing in similar contexts end up nearby.
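
The sketch below reproduces that single step for a toy model with random weights: the gradient with respect to the output scores is simply the predicted distribution minus a one-hot vector at the true context word. The toy vocabulary size and indices here are arbitrary.

```python
import numpy as np

# Toy model with random weights: 8-word vocabulary, 4-dimensional embeddings
rng = np.random.default_rng(0)
V, d = 8, 4
W = rng.normal(0.0, 0.5, (V, d))         # embedding matrix
W_prime = rng.normal(0.0, 0.5, (d, V))   # context matrix
center_idx, context_idx = 3, 2           # e.g. the pair ("fox" -> "brown")

# Forward pass
h = W[center_idx]
z = W_prime.T @ h
probs = np.exp(z - z.max()) / np.exp(z - z.max()).sum()

# Gradient with respect to the output scores: predicted probabilities minus one-hot target
dz = probs.copy()
dz[context_idx] -= 1.0

# dz is negative only at the true context word (pull it closer); every other entry
# is positive and proportional to that word's current probability (push it away)
print(np.round(dz, 3))
```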

Generating Training Data

With the objective function defined, we need training data to optimize it. Skip-gram's training data consists of (center word, context word) pairs extracted from raw text. The advantage of this approach is that we need no manual labels. The text itself provides supervision through co-occurrence patterns.

The Sliding Window Approach

The algorithm is straightforward:

  1. Slide a window across the corpus, one word at a time
  2. Treat each word as a potential center word
  3. Pair it with every word within the window (excluding itself)

This process transforms unstructured text into structured training examples.

In[18]:
def generate_training_data(corpus, window_size=2):
    """
    Generate all (center, context) training pairs from corpus.
    
    Each word becomes a center word, paired with every word
    within the specified window distance.
    """
    words = corpus.lower().split()
    training_pairs = []
    
    for i, center_word in enumerate(words):
        # Define context window boundaries (handle edges)
        start = max(0, i - window_size)
        end = min(len(words), i + window_size + 1)
        
        # Generate a pair with each context word
        for j in range(start, end):
            if j != i:  # Skip the center word itself
                training_pairs.append((center_word, words[j]))
    
    return training_pairs

# Generate training data from our example sentence
corpus = "The quick brown fox jumps over the lazy dog"
pairs = generate_training_data(corpus, window_size=2)
Out[19]:
Training Data Generation:
---------------------------------------------
Corpus: 'The quick brown fox jumps over the lazy dog'
Window size: 2
Total training pairs: 30

Sample pairs (center → context):
  the      → quick
  the      → brown
  quick    → the
  quick    → brown
  quick    → fox
  brown    → the
  brown    → quick
  brown    → fox
  brown    → jumps
  fox      → quick
  fox      → brown
  fox      → jumps
  ... and 18 more pairs

The Multiplication Effect

Notice the dramatic expansion: a 9-word sentence produces 30 training pairs. This happens because:

  • Each word in the middle of the sentence generates 4 pairs (2 words on each side)
  • Words at the edges generate fewer pairs (only 2-3, depending on position)

For large corpora with billions of words, this multiplication effect produces massive training datasets. A corpus of 1 billion words with window size 5 generates roughly 10 billion training pairs. This abundance of training signal is why Skip-gram can learn rich semantic representations without any manual annotation. The structure of language itself provides the supervision.
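
A quick back-of-the-envelope check of that number, ignoring edge effects at sentence boundaries:

```python
corpus_size = 1_000_000_000   # 1 billion tokens
window_size = 5

pairs_per_token = 2 * window_size           # up to window_size words on each side
approx_pairs = corpus_size * pairs_per_token

print(f"~{approx_pairs:,} training pairs")  # ~10,000,000,000
```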

Out[20]:
Visualization
Diagram showing sliding window over sentence with center and context words highlighted.
Skip-gram training pair generation with window size 2. Each word (highlighted in orange) serves as a center word, generating pairs with all words within the window (highlighted in blue). The diagram shows pairs generated for 'brown' and 'fox' as center words. Notice that pairs are directional: (brown, fox) and (fox, brown) are both generated.

Window Size: A Critical Hyperparameter

The window size $m$ determines how many words on each side of the center word count as context. This choice significantly affects what the embeddings capture.

Window Size

The window size hyperparameter controls the trade-off between syntactic and semantic similarity in learned embeddings. Smaller windows emphasize syntactic relationships, while larger windows capture topical similarity.

In[21]:
def analyze_window_effects(corpus, window_sizes=[1, 2, 5, 10]):
    """Analyze how window size affects training pairs."""
    words = corpus.lower().split()
    results = {}
    
    for ws in window_sizes:
        pairs = generate_training_data(corpus, window_size=ws)
        
        # Count unique context words per center word
        from collections import defaultdict
        context_counts = defaultdict(set)
        for center, context in pairs:
            context_counts[center].add(context)
        
        avg_contexts = np.mean([len(contexts) for contexts in context_counts.values()])
        results[ws] = {
            'total_pairs': len(pairs),
            'avg_contexts_per_word': avg_contexts,
            'pairs': pairs
        }
    
    return results

# Larger corpus for meaningful analysis
large_corpus = """
The quick brown fox jumps over the lazy dog in the sunny meadow.
A clever fox hunts small prey near the forest edge at dawn.
The dog barks loudly when strangers approach the old farmhouse.
Brown leaves fall from tall trees in the autumn breeze.
Quick movements catch the eye of the watchful predator nearby.
"""

window_results = analyze_window_effects(large_corpus, window_sizes=[1, 2, 3, 5])
Out[22]:
Window Size Analysis:
-------------------------------------------------------
Window Size    Total Pairs    Avg Contexts/Word  
-------------------------------------------------------
     1             108               2.5         
     2             214               4.9         
     3             318               6.9         
     5             520               10.5        

Larger windows create more training pairs but may
dilute the signal by including less relevant contexts.
Out[23]:
Visualization
Two bar charts showing increasing training pairs and context diversity with larger window sizes.
Effect of window size on Skip-gram training. Left: Total training pairs increase with window size as each center word pairs with more contexts. Right: Average unique context words per center word also increases. Larger windows capture broader topical relationships but may include noise from distant, less relevant words.

Small windows (1-2 words) tend to produce embeddings where syntactically similar words cluster together. Words that can substitute for each other in the same grammatical position (like "dog" and "cat" as nouns) end up nearby.

Large windows (5-10 words) capture topical similarity. Words that appear in the same documents or discuss the same subjects cluster together, even if they play different grammatical roles.

Skip-gram vs CBOW: Two Sides of the Same Coin

Word2Vec actually includes two architectures: Skip-gram and Continuous Bag of Words (CBOW). They're mirror images of each other:

  • Skip-gram: Given center word, predict context words
  • CBOW: Given context words, predict center word
Out[24]:
Visualization
Side-by-side comparison of Skip-gram and CBOW neural network architectures.
Skip-gram vs CBOW architectures. Skip-gram (left) predicts multiple context words from a single center word, making separate predictions for each context position. CBOW (right) averages the context word embeddings and predicts the single center word. Skip-gram works better for rare words; CBOW trains faster and works better for frequent words.

The key differences:

Aspect          Skip-gram                                                     CBOW
--------------  ------------------------------------------------------------  -------------------------------------
Input           Single center word                                            Multiple context words
Output          Multiple context words                                        Single center word
Rare words      Better (each occurrence creates multiple training examples)   Worse (rare words get averaged out)
Training speed  Slower (more predictions per position)                        Faster (one prediction per position)
Best for        Smaller datasets, rare words                                  Larger datasets, frequent words

Skip-gram's advantage with rare words comes from its training structure. For each occurrence of a rare word, Skip-gram generates multiple training examples (one for each context word). CBOW, by contrast, uses each occurrence only once. This gives Skip-gram more signal for learning good representations of infrequent words.
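
A small sketch makes the difference concrete. For a single occurrence of a rare word, count how many training examples have it on the input side under Skip-gram versus how many times it appears as the prediction target under CBOW (the example sentence and rare word here are made up):

```python
sentence = "the shy zyzzyva beetle hid quietly".split()
window_size = 2
rare_word = "zyzzyva"

# Skip-gram: every (rare_center, context) pair is a separate training example,
# each one updating the rare word's input embedding
skipgram_examples = 0
for i, word in enumerate(sentence):
    if word != rare_word:
        continue
    for j in range(max(0, i - window_size), min(len(sentence), i + window_size + 1)):
        if j != i:
            skipgram_examples += 1

# CBOW: the same occurrence shows up only once, as the word to be predicted
cbow_examples = sum(1 for word in sentence if word == rare_word)

print(skipgram_examples, cbow_examples)  # 4 vs 1
```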

A Complete Implementation

We've covered the theory: the architecture, the objective function, and the training data. Now let's bring it all together into a working implementation that you can run and experiment with.

The SkipGram Class

Our implementation encapsulates the complete Skip-gram model in a single class. Each method corresponds to a concept we've discussed:

In[25]:
class SkipGram:
    """
    Simple Skip-gram implementation for educational purposes.
    
    This implementation uses full softmax (not negative sampling)
    to clearly illustrate the core algorithm.
    """
    
    def __init__(self, vocab_size, embedding_dim):
        """
        Initialize the model with random embeddings.
        
        Args:
            vocab_size: Number of unique words in vocabulary (V)
            embedding_dim: Dimension of word vectors (d)
        """
        # Embedding matrix W: each row is a word's embedding
        self.W = np.random.randn(vocab_size, embedding_dim) * 0.01
        # Context matrix W': each column is a word's context representation
        self.W_prime = np.random.randn(embedding_dim, vocab_size) * 0.01
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
    
    def forward(self, center_idx):
        """
        Forward pass: compute hidden representation and output scores.
        
        Args:
            center_idx: Index of the center word
            
        Returns:
            h: Hidden layer (embedding of center word)
            z: Output scores (unnormalized log-probabilities)
        """
        h = self.W[center_idx]  # Embedding lookup (equivalent to W.T @ one_hot)
        z = self.W_prime.T @ h  # Output scores for all vocabulary words
        return h, z
    
    def softmax(self, z):
        """Compute softmax probabilities with numerical stability."""
        exp_z = np.exp(z - np.max(z))  # Subtract max for stability
        return exp_z / np.sum(exp_z)
    
    def compute_loss(self, center_idx, context_idx):
        """
        Compute cross-entropy loss for one (center, context) pair.
        
        Returns:
            loss: Negative log-probability of the context word
            h: Hidden representation (needed for backprop)
            probs: Softmax probabilities (needed for backprop)
        """
        h, z = self.forward(center_idx)
        probs = self.softmax(z)
        loss = -np.log(probs[context_idx] + 1e-10)  # Add epsilon for stability
        return loss, h, probs
    
    def backward(self, center_idx, context_idx, h, probs, learning_rate=0.01):
        """
        Backward pass: compute gradients and update weights.
        
        This implements stochastic gradient descent on the cross-entropy loss.
        """
        # Gradient of loss w.r.t. output scores: (probs - one_hot_target)
        dz = probs.copy()
        dz[context_idx] -= 1  # Subtract 1 at the true context position
        
        # Gradient for W' (context matrix): outer product of h and dz
        dW_prime = np.outer(h, dz)
        
        # Gradient for h (hidden layer): W' @ dz
        dh = self.W_prime @ dz
        
        # Update weights using gradient descent
        self.W_prime -= learning_rate * dW_prime
        self.W[center_idx] -= learning_rate * dh
    
    def train_pair(self, center_idx, context_idx, learning_rate=0.01):
        """Train on a single (center, context) pair."""
        loss, h, probs = self.compute_loss(center_idx, context_idx)
        self.backward(center_idx, context_idx, h, probs, learning_rate)
        return loss
    
    def get_embedding(self, word_idx):
        """Get the embedding vector for a word."""
        return self.W[word_idx]
    
    def most_similar(self, word_idx, top_n=5):
        """
        Find most similar words by cosine similarity.
        
        Cosine similarity measures the angle between vectors,
        ignoring magnitude. Values range from -1 to 1.
        """
        word_vec = self.W[word_idx]
        similarities = []
        
        for i in range(self.vocab_size):
            if i != word_idx:
                other_vec = self.W[i]
                # Cosine similarity: dot product of unit vectors
                sim = np.dot(word_vec, other_vec) / (
                    np.linalg.norm(word_vec) * np.linalg.norm(other_vec) + 1e-10
                )
                similarities.append((i, sim))
        
        return sorted(similarities, key=lambda x: x[1], reverse=True)[:top_n]

Training the Model

Now let's train our Skip-gram model on a small corpus designed to have clear semantic groupings. We'll use words from five categories: royalty, people, animals, emotions, and movement.

In[26]:
# A small corpus with clear semantic categories
# Words on the same line tend to appear together (similar contexts)
training_corpus = """
king queen prince princess royal throne crown palace
man woman boy girl child adult person human
cat dog pet animal fur paw tail whisker
happy sad angry joyful emotion feeling mood cheerful
run walk jump sprint move fast slow quick
"""

# Build vocabulary: unique words sorted alphabetically
words = training_corpus.lower().split()
vocab = sorted(set(words))
word_to_idx = {w: i for i, w in enumerate(vocab)}
idx_to_word = {i: w for w, i in word_to_idx.items()}

# Generate training pairs with window size 2
training_pairs = []
for i, center in enumerate(words):
    for j in range(max(0, i-2), min(len(words), i+3)):
        if j != i:
            training_pairs.append((word_to_idx[center], word_to_idx[words[j]]))
In[27]:
# Initialize model and train
model = SkipGram(vocab_size=len(vocab), embedding_dim=20)

# Training loop: multiple passes through the data
epochs = 100
losses = []

for epoch in range(epochs):
    epoch_loss = 0
    np.random.shuffle(training_pairs)  # Shuffle for stochastic gradient descent
    
    for center_idx, context_idx in training_pairs:
        loss = model.train_pair(center_idx, context_idx, learning_rate=0.05)
        epoch_loss += loss
    
    avg_loss = epoch_loss / len(training_pairs)
    losses.append(avg_loss)
Out[28]:
Skip-gram Training Complete:
---------------------------------------------
Vocabulary size: 40
Embedding dimension: 20
Training pairs: 154
Epochs: 100

Initial loss: 3.6889
Final loss: 1.6098
Loss reduction: 56.4%

Theoretical baseline (random): 3.6889

Interpreting the Training Results

The loss dropped by more than half, indicating substantial learning. Let's understand what these numbers mean:

  • Initial loss ≈ 3.69: With random weights, the model assigns roughly equal probability to all words. The expected loss is $\log(V) = \log(40) \approx 3.7$, so we start at random chance.

  • Final loss ≈ 1.61: The model now assigns much higher probabilities to actual context words. This is well below the random baseline, confirming that learning occurred.

With such a small corpus (40 words, ~150 training pairs), the representations won't generalize as well as those trained on billions of words. But they're sufficient to demonstrate the core concepts.

To see how embeddings evolve during training, let's track their positions in 2D space at different epochs:

In[29]:
# Re-train while saving embeddings at different epochs
model_tracking = SkipGram(vocab_size=len(vocab), embedding_dim=20)
np.random.seed(42)  # For reproducibility

embedding_snapshots = {}
snapshot_epochs = [0, 5, 20, 50, 100]

# Save initial embeddings
embedding_snapshots[0] = model_tracking.W.copy()

losses_tracking = []
for epoch in range(1, 101):
    epoch_loss = 0
    np.random.shuffle(training_pairs)
    
    for center_idx, context_idx in training_pairs:
        loss = model_tracking.train_pair(center_idx, context_idx, learning_rate=0.05)
        epoch_loss += loss
    
    losses_tracking.append(epoch_loss / len(training_pairs))
    
    if epoch in snapshot_epochs:
        embedding_snapshots[epoch] = model_tracking.W.copy()
Out[30]:
Visualization
Four scatter plots showing word embeddings at epochs 0, 20, 50, and 100, with visible clustering emerging over time.
Evolution of word embeddings during training. Each panel shows the 2D PCA projection of embeddings at a different training epoch. Initially (epoch 0), words are randomly scattered. As training progresses, words from the same semantic category gradually cluster together. By epoch 100, clear semantic groupings have emerged. Arrows show the trajectory of selected words (king, cat, happy) through embedding space.

The evolution is clear: at epoch 0, words are randomly scattered with no discernible structure. By epoch 20, some clustering begins to emerge. By epoch 100, the five semantic categories have formed distinct regions in the embedding space. This visualization shows what Skip-gram learns: it organizes words by their contextual similarity.

Out[31]:
Visualization
Line plot showing decreasing training loss over epochs with rapid initial decrease and gradual plateau.
Skip-gram training loss over 100 epochs. The loss decreases rapidly in early epochs as the model learns basic word associations, then plateaus as it converges. The final loss represents how well the model predicts context words from center words.

Examining the Learned Embeddings

The real test of our model: do words from the same semantic category end up with similar embeddings? Let's query the model for words similar to representatives from each category.

In[32]:
# Query similar words for one representative from each category
test_words = ['king', 'man', 'cat', 'happy', 'run']
similarity_results = {}

for word in test_words:
    if word in word_to_idx:
        similar = model.most_similar(word_to_idx[word], top_n=5)
        similarity_results[word] = [(idx_to_word[idx], sim) for idx, sim in similar]
Out[33]:
Learned Word Similarities:
--------------------------------------------------

Most similar to 'king':
  princess    : +0.806 ████████████████
  queen       : +0.604 ████████████
  prince      : +0.550 ██████████
  royal       : +0.446 ████████
  angry       : +0.192 ███

Most similar to 'man':
  girl        : +0.637 ████████████
  throne      : +0.635 ████████████
  palace      : +0.614 ████████████
  woman       : +0.582 ███████████
  crown       : +0.393 ███████

Most similar to 'cat':
  adult       : +0.637 ████████████
  animal      : +0.636 ████████████
  human       : +0.621 ████████████
  dog         : +0.570 ███████████
  person      : +0.388 ███████

Most similar to 'happy':
  joyful      : +0.646 ████████████
  paw         : +0.616 ████████████
  whisker     : +0.595 ███████████
  sad         : +0.573 ███████████
  emotion     : +0.361 ███████

Most similar to 'run':
  cheerful    : +0.610 ████████████
  feeling     : +0.607 ████████████
  walk        : +0.601 ████████████
  sprint      : +0.598 ███████████
  jump        : +0.393 ███████

Interpreting Cosine Similarity

The results show that the model has begun to capture semantic groupings from the training data, though with such a tiny corpus there is noticeable cross-category leakage (for example, 'man' also ranks 'throne' and 'palace' highly):

  • Cosine similarity > 0.5: Strong relationship, indicating words frequently appear in similar contexts
  • Cosine similarity ≈ 0: No particular relationship, indicating words appear in different contexts
  • Cosine similarity < 0: Opposing contexts (rare with Skip-gram)

Words from the same semantic category (royalty, people, animals, emotions, movement) tend to cluster together because they appeared near each other during training. With more training data, these patterns become even more pronounced. This is the foundation of how Word2Vec captures "meaning" from raw text.
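
As a quick follow-up, you can probe individual pairs directly. This sketch assumes the `model` and `word_to_idx` objects from the training cells above are still in scope:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors, ignoring their magnitudes."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-10)

# Assumes `model` and `word_to_idx` from the training cells above
for a, b in [("king", "queen"), ("king", "dog"), ("happy", "sad")]:
    sim = cosine_similarity(model.W[word_to_idx[a]], model.W[word_to_idx[b]])
    print(f"cos({a}, {b}) = {sim:+.3f}")
```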

Visualizing Pairwise Similarities

A heatmap provides a comprehensive view of how all words relate to each other. The block-diagonal structure reveals the semantic clusters our model has learned.

Out[34]:
Visualization
Heatmap showing pairwise cosine similarities between all word embeddings, with visible block structure along diagonal.
Pairwise cosine similarity heatmap of learned word embeddings. Brighter colors indicate higher similarity. The block-diagonal structure reveals semantic clusters: words within the same category (royalty, people, animals, emotions, movement) have high similarity with each other and lower similarity with words from other categories. This visualization confirms that Skip-gram successfully learned to group semantically related words.

The heatmap reveals the structure our model has learned. Each bright block along the diagonal corresponds to a semantic category. Words within the same group have high similarity (warm colors) while words across groups have lower similarity (cool colors). This block-diagonal structure is exactly what we hoped to achieve: the model has organized its embedding space to reflect semantic relationships.

Out[35]:
Visualization
Scatter plot of word embeddings projected to 2D with semantic clusters visible.
2D PCA projection of learned Skip-gram embeddings. Words that appeared in similar contexts cluster together. Notice how royalty terms (king, queen, prince, princess) form one cluster, while animal terms (cat, dog, pet) form another. The spatial relationships reflect semantic similarities learned from co-occurrence patterns.

Embedding Geometry: Norms and Directions

Word embeddings encode information in both their direction (which determines similarity via cosine) and their magnitude (norm). Let's examine how embedding norms are distributed across our vocabulary:

Out[36]:
Visualization
Two plots showing embedding norm distribution and norm vs frequency relationship.
Distribution of embedding vector norms across the vocabulary. Left: Histogram showing the distribution of L2 norms for all word embeddings. Most embeddings have similar magnitudes, indicating the model doesn't heavily favor certain words. Right: Norm vs. word frequency in training data. In larger corpora, frequent words often develop larger norms, but our small corpus shows relatively uniform norms.

In production Word2Vec models trained on large corpora, embedding norms often correlate with word frequency. Frequent words tend to have larger norms. Our small corpus doesn't show this pattern strongly, but it's an important property to be aware of when using pre-trained embeddings.
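
If you want to inspect this yourself, the sketch below computes the L2 norm of every embedding and lists the largest ones next to their training-corpus frequency. It assumes `model`, `idx_to_word`, and the tokenized `words` list from the training cells above are still in scope:

```python
import numpy as np
from collections import Counter

# Assumes `model`, `idx_to_word`, and `words` from the training cells above
norms = np.linalg.norm(model.W, axis=1)   # L2 norm of each word's embedding
freqs = Counter(words)

for idx in np.argsort(norms)[::-1][:5]:   # the five largest-norm embeddings
    word = idx_to_word[idx]
    print(f"{word:10s}  norm={norms[idx]:.3f}  freq={freqs[word]}")
```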

The Softmax Bottleneck

There's a computational elephant in the room. The softmax normalization requires summing over all $V$ vocabulary words:

$$P(w_j | w_i) = \frac{\exp(\mathbf{w}'_j \cdot \mathbf{w}_i)}{\sum_{k=1}^{V} \exp(\mathbf{w}'_k \cdot \mathbf{w}_i)}$$

The denominator $\sum_{k=1}^{V} \exp(\mathbf{w}'_k \cdot \mathbf{w}_i)$ requires a dot product and an exponential for every word in the vocabulary. For a vocabulary of 100,000 words, every single training step requires computing 100,000 dot products and exponentials. With billions of training pairs, this becomes prohibitively expensive.

In[37]:
import time

def benchmark_softmax(vocab_sizes, embedding_dim=100, num_iterations=100):
    """Benchmark softmax computation time for different vocabulary sizes."""
    results = {}
    
    for V in vocab_sizes:
        # Create random vectors
        h = np.random.randn(embedding_dim)
        W_prime = np.random.randn(embedding_dim, V)
        
        # Time the softmax computation
        start = time.time()
        for _ in range(num_iterations):
            z = W_prime.T @ h
            exp_z = np.exp(z - np.max(z))
            probs = exp_z / np.sum(exp_z)
        elapsed = time.time() - start
        
        results[V] = elapsed / num_iterations * 1000  # ms per iteration
    
    return results

vocab_sizes = [1000, 5000, 10000, 50000, 100000]
timing_results = benchmark_softmax(vocab_sizes)
Out[38]:
Softmax Computation Time vs Vocabulary Size:
--------------------------------------------------
  Vocab Size       Time (ms)     Relative
--------------------------------------------------
       1,000          15.716          1.0x
       5,000          29.543          1.9x
      10,000          22.277          1.4x
      50,000          21.486          1.4x
     100,000          18.059          1.1x

With billions of training pairs, full softmax is impractical!

In theory, the cost of the full softmax grows linearly with vocabulary size: the denominator needs one dot product and one exponential per vocabulary word. The measured times above are noisy at these scales because NumPy's matrix routines are heavily optimized, but the overall picture is unchanged: each softmax takes milliseconds, and with billions of training examples this adds up to weeks or months of training time. This computational barrier motivated the development of approximation methods that reduce the complexity from $O(V)$ to $O(k)$ where $k \ll V$.

Out[39]:
Visualization
Line plot showing linear increase in softmax computation time with vocabulary size.
Softmax computation time scales linearly with vocabulary size. For production vocabularies of 100,000+ words, the full softmax becomes a major bottleneck. This motivates approximation techniques like negative sampling and hierarchical softmax, which we'll cover in subsequent chapters.

This computational bottleneck motivated the development of approximation methods:

  • Negative Sampling: Instead of computing probabilities over all words, sample a small number of "negative" words and train a binary classifier
  • Hierarchical Softmax: Organize the vocabulary as a binary tree, reducing complexity from $O(V)$ to $O(\log V)$

We'll explore these techniques in detail in the following chapters.

Limitations and Considerations

Skip-gram produces high-quality embeddings, but it has limitations worth understanding:

Static embeddings: Each word gets one vector regardless of context. The word "bank" has the same embedding whether it means a financial institution or a river bank. Contextual models like BERT address this limitation.

No morphology: "run," "runs," "running," and "ran" are treated as completely separate words with no shared structure. FastText addresses this by incorporating subword information.

Training data bias: Embeddings reflect biases present in the training corpus. If the training data associates certain professions with specific genders, the embeddings will encode these biases.

Window-based context: Skip-gram captures local co-occurrence patterns but may miss longer-range dependencies. A word's meaning often depends on context beyond the immediate window.

Frequency effects: Very rare words don't have enough training examples to learn good representations. Very frequent words (like "the") dominate the training signal.

Key Parameters

When training Skip-gram models, several hyperparameters significantly impact the quality of learned embeddings:

embedding_dim (typical range: 50-300): The dimensionality of word vectors.

  • Lower values (50-100): Faster training, smaller memory footprint, may miss subtle semantic distinctions
  • Higher values (200-300): Captures more nuanced relationships, but requires more training data to avoid overfitting
  • Common choice: 100-200 for most applications; 300 for state-of-the-art results on analogy tasks

window_size (typical range: 2-10): Number of context words on each side of the center word.

  • Small windows (2-3): Emphasize syntactic relationships; words that can substitute for each other cluster together
  • Large windows (5-10): Capture topical/semantic similarity; words from the same domain cluster together
  • Common choice: 5 for balanced syntactic and semantic representations

min_count (typical range: 1-100): Minimum word frequency to include in vocabulary.

  • Lower values: Include rare words, but their embeddings may be unreliable due to sparse training signal
  • Higher values: More robust embeddings for included words, but rare words are excluded
  • Common choice: 5-10 for large corpora; lower for smaller datasets

learning_rate (typical range: 0.01-0.1): Step size for gradient descent updates.

  • Higher values: Faster initial convergence, but may overshoot optimal solutions
  • Lower values: More stable training, but slower convergence
  • Common choice: 0.025 with linear decay during training

epochs (typical range: 1-20): Number of passes through the training corpus.

  • Fewer epochs: Faster training, may underfit on smaller corpora
  • More epochs: Better convergence, but diminishing returns after 5-10 epochs on large corpora
  • Common choice: 5 epochs for billion-word corpora; more for smaller datasets

negative_samples (when using negative sampling, typical range: 5-20): Number of negative examples per positive example.

  • Fewer negatives (5): Faster training, may not distinguish words as sharply
  • More negatives (15-20): Better discrimination, but slower training
  • Common choice: 5-10 for large corpora; 15-20 for smaller datasets
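
For real-world training you would typically not hand-roll the model as we did above but use an optimized library. As a minimal sketch (assuming gensim >= 4.0 is installed), here is how these hyperparameters map onto gensim's Word2Vec API:

```python
from gensim.models import Word2Vec

# Tokenized sentences; in practice this would be an iterator over a large corpus
sentences = [
    ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
    # ... more sentences
]

model = Word2Vec(
    sentences=sentences,
    vector_size=100,   # embedding_dim
    window=5,          # window_size
    min_count=1,       # min_count (use 5+ on large corpora)
    sg=1,              # 1 = Skip-gram, 0 = CBOW
    negative=5,        # negative_samples
    alpha=0.025,       # initial learning_rate
    epochs=5,          # passes over the corpus
    workers=4,         # training threads
)

fox_vector = model.wv["fox"]                      # learned embedding
neighbors = model.wv.most_similar("fox", topn=3)  # nearest words by cosine similarity
```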

Summary

The Skip-gram model transforms the distributional hypothesis into a practical learning algorithm. By training a neural network to predict context words from center words, we learn dense vector representations that capture semantic relationships.

Key takeaways:

  • Prediction as learning: Skip-gram learns by predicting context words given a center word, turning co-occurrence patterns into a supervised learning task
  • Two embedding matrices: The model maintains separate embeddings for words as targets ($\mathbf{W}$) and as contexts ($\mathbf{W}'$), with $\mathbf{W}$ typically used as the final word vectors
  • Softmax over vocabulary: Output probabilities are computed via softmax, which normalizes scores across all vocabulary words
  • Window size matters: Smaller windows capture syntactic similarity; larger windows capture topical similarity
  • Skip-gram vs CBOW: Skip-gram predicts multiple contexts from one word; CBOW predicts one word from multiple contexts. Skip-gram works better for rare words
  • Computational bottleneck: Full softmax requires summing over the entire vocabulary, motivating approximations like negative sampling

The next chapter explores CBOW, Skip-gram's mirror image, which averages context embeddings to predict center words. Understanding both architectures provides insight into how neural networks learn from distributional patterns.
