A comprehensive guide to the Skip-gram model from Word2Vec, covering architecture, objective function, training data generation, and implementation from scratch.

Skip-gram Model
The distributional hypothesis tells us that words appearing in similar contexts have similar meanings. But how do we turn this insight into practical word representations? Co-occurrence matrices capture contextual patterns, but they're sparse, high-dimensional, and computationally expensive. What if we could learn dense, low-dimensional vectors that encode the same distributional information more efficiently?
In 2013, Mikolov et al. introduced Word2Vec, a family of neural network models that changed how we create word representations. The Skip-gram model, one of the two Word2Vec architectures, takes a simple approach: given a word, predict its context. By training a neural network on this task across billions of words, we learn dense vectors that capture rich semantic relationships. Words like "king" and "queen" end up close together in vector space. Vector arithmetic even works: $\vec{\text{king}} - \vec{\text{man}} + \vec{\text{woman}} \approx \vec{\text{queen}}$.
This chapter introduces the Skip-gram architecture from the ground up. We'll build intuition for why predicting context words leads to meaningful representations, work through the mathematics step by step, and implement a working Skip-gram model from scratch.
The Core Idea: Predicting Context from Words
Traditional distributional methods count co-occurrences and store them in massive matrices. Skip-gram flips the script: instead of counting, we predict. Given a target word, the model tries to predict which words appear nearby in the training corpus.
The Skip-gram model learns word representations by training a neural network to predict context words given a center word. The learned weights of this network become the word embeddings.
Consider the sentence: "The quick brown fox jumps over the lazy dog." If we take "fox" as our target word with a context window of size 2, Skip-gram asks: given "fox," can we predict that "brown," "quick," "jumps," and "over" appear nearby?
The model learns by trying to maximize the probability of these context words given the target. If "fox" frequently appears near "brown" in the training corpus, the model adjusts its weights to make $P(\text{brown} \mid \text{fox})$ high. Through millions of such updates, words that appear in similar contexts develop similar vector representations.
Architecture: Two Embedding Matrices
The Skip-gram architecture is simple: a shallow neural network with a single hidden layer and no activation function. The key insight lies in what we do with the learned weights.
The network has two weight matrices:
- Embedding matrix $W$ (size $V \times d$): Maps input words to dense vectors. Each row is the embedding for one vocabulary word. When we input a one-hot vector for word $i$, multiplying by $W$ simply selects the corresponding row.
- Context matrix $W'$ (size $d \times V$): Maps the hidden representation to output scores. Each column represents a word as a potential context.

Here $V$ is the vocabulary size (often 100,000+ words) and $d$ is the embedding dimension (typically 100-300).
With $2 \times V \times d$ parameters in total (about 2 million for a 10,000-word vocabulary and 100-dimensional embeddings), Skip-gram is lightweight compared to modern language models. This efficiency comes from the shallow architecture: just two matrix multiplications with no nonlinear activations in between. Despite this simplicity, Skip-gram learns rich representations.
Skip-gram maintains separate embeddings for words as targets (the rows of $W$) and as contexts (the columns of $W'$). After training, we typically use only the embedding matrix $W$ as our word vectors, though some implementations average both matrices or concatenate them.
Input and Output Representations
Understanding how Skip-gram represents words at input and output is essential for grasping the model's mechanics. The input uses sparse one-hot vectors that serve as lookup indices, while the output produces probability distributions over the entire vocabulary.
One-Hot Encoding
The input to Skip-gram is a one-hot encoded vector. For a vocabulary of $V$ words, each word is represented as a vector of length $V$ with a single 1 at the word's index and 0s everywhere else.
For the 8-word vocabulary of our example sentence, the one-hot vector for "fox" is extremely sparse: 7 zeros and a single 1. For a real vocabulary of 100,000 words, each input would have 99,999 zeros. This sparsity is why the embedding lookup is so efficient: multiplying by a one-hot vector simply selects one row.
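A small sketch makes this concrete; the vocabulary here is just the eight unique words of the example sentence, and the helper name `one_hot` is illustrative rather than a library function:

```python
import numpy as np

sentence = "the quick brown fox jumps over the lazy dog".split()
vocab = sorted(set(sentence))                    # 8 unique words
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word, vocab_size):
    """Return a length-V vector with a single 1 at the word's index."""
    vec = np.zeros(vocab_size)
    vec[word_to_idx[word]] = 1.0
    return vec

print(vocab)
print(one_hot("fox", len(vocab)))                # a single 1, seven 0s
```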
From One-Hot to Embedding
When we multiply the one-hot vector by the embedding matrix $W$, something useful happens: we simply extract the row corresponding to our input word. The hidden layer representation is computed as:

$$h = W^\top x$$

where:
- $h$: the hidden layer vector (the word embedding), with dimension $d$
- $W$: the embedding matrix of size $V \times d$, where each row is a word's embedding
- $x$: a one-hot encoded input vector of length $V$ with a single 1 at the word's position
- $V$: the vocabulary size
- $d$: the embedding dimension

If $x$ is one-hot with a 1 at position $i$, then $h = W^\top x$ is exactly the $i$-th row of $W$. This is why we call $W$ the "embedding matrix": its rows are the word embeddings.
The direct row selection `W[idx]` is computationally equivalent to the matrix multiplication `W.T @ one_hot` but far more efficient. In practice, embedding layers in deep learning frameworks use this lookup optimization rather than an actual matrix multiplication.
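A quick sanity check of that equivalence (the shapes and the word index are illustrative):

```python
import numpy as np

V, d = 8, 5                         # small vocabulary and embedding size for illustration
rng = np.random.default_rng(0)
W = rng.normal(size=(V, d))         # embedding matrix: one row per word

idx = 2                             # index of the input word (e.g. "fox")
x = np.zeros(V)
x[idx] = 1.0                        # one-hot input vector

h_matmul = W.T @ x                  # O(V * d) work, almost all of it multiplying by zeros
h_lookup = W[idx]                   # O(d) work: just read one row

print(np.allclose(h_matmul, h_lookup))   # True
```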
Output Scores and Softmax
The hidden vector $h$ is then projected through the context matrix $W'$ to produce a score for each vocabulary word. These raw scores (called "logits") indicate how compatible each word is as a context:

$$u = W'^\top h$$

where:
- $u$: the output score vector of length $V$, with one score per vocabulary word
- $W'$: the context matrix of size $d \times V$, where each column is a word's context representation
- $h$: the hidden layer (the center word's embedding)

Each element $u_j$ represents how likely word $w_j$ is to be a context word. Higher scores indicate stronger compatibility. To convert these scores to a valid probability distribution, we apply the softmax function:

$$P(w_j \mid w_I) = \frac{\exp(u_j)}{\sum_{k=1}^{V} \exp(u_k)}, \qquad u_j = c_{w_j}^\top v_{w_I}$$
where:
- $P(w_j \mid w_I)$: the probability of observing word $w_j$ as a context word given center word $w_I$
- $w_I$: the input (center) word
- $w_j$: a candidate context word
- $v_{w_I}$: the embedding vector of the input word (row of $W$)
- $c_{w_j}$: the context vector of word $w_j$ (column of $W'$)
- $u_j = c_{w_j}^\top v_{w_I}$: the raw score for word $w_j$, computed as the dot product of the context and embedding vectors
- $V$: the vocabulary size
- $\exp$: the exponential function, which ensures all values are positive
The softmax function has two key properties: (1) it transforms any real-valued scores into positive numbers that sum to 1, creating a valid probability distribution, and (2) it amplifies differences between scores, with larger scores receiving disproportionately more probability mass due to the exponential function.
The raw scores (logits) can be any real number, positive or negative. Softmax transforms them into a valid probability distribution: all values between 0 and 1, summing to exactly 1. The word with the highest score receives a disproportionately large share of the probability mass. This "winner-take-more" behavior is characteristic of the exponential function in softmax.
Understanding softmax behavior matters because it determines how the model distributes probability mass. The exponential function amplifies differences: even small gaps in raw scores become large probability differences. This property helps the model make confident predictions but also creates computational challenges we'll address later.
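A minimal softmax in NumPy (with the standard max-subtraction trick for numerical stability); the example logits are made up simply to show the winner-take-more effect:

```python
import numpy as np

def softmax(scores):
    """Convert raw scores (logits) into a probability distribution."""
    shifted = scores - scores.max()     # subtracting the max avoids overflow in exp
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

logits = np.array([2.0, 1.0, 0.1, -1.0])   # illustrative raw scores
probs = softmax(logits)
print(probs.round(3))                      # the top score takes a disproportionate share
print(probs.sum())                         # 1.0
```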
The Skip-gram Objective Function
We've seen how Skip-gram transforms words into vectors and predicts context probabilities. But how does the model actually learn? What signal tells it whether its current embeddings are good or bad? The answer lies in the objective function: a mathematical expression that quantifies how well the model's predictions match reality.
From Intuition to Formalization
Let's build the objective function step by step, starting from a simple intuition.
The core insight: If our embeddings are good, then given a center word, the model should assign high probability to words that actually appear nearby and low probability to words that don't. The objective function formalizes this: we want to maximize the probability of observing the actual context words.
Consider our running example: the sentence "The quick brown fox jumps over the lazy dog." When "fox" is the center word with a window of size 2, the true context words are "quick," "brown," "jumps," and "over." A well-trained model should predict:
- $P(\text{quick} \mid \text{fox})$ → high
- $P(\text{brown} \mid \text{fox})$ → high
- $P(\text{lazy} \mid \text{fox})$ → low (it appears in the sentence, but outside the window)
The Probability of Context Words
For a single center word $w_t$ at position $t$, we observe context words at positions $t-m, \dots, t-1, t+1, \dots, t+m$ (where $m$ is the window size). Skip-gram assumes these context words are conditionally independent given the center word, so the probability of observing all of them is the product of individual probabilities:

$$\prod_{-m \le j \le m,\, j \ne 0} P(w_{t+j} \mid w_t)$$
where:
- $\prod$: the product operator, multiplying together all the terms
- $m$: the window size (number of context words on each side of the center word)
- $j$: the offset from the center position, ranging from $-m$ to $m$ but excluding 0 (the center word itself)
- $w_t$: the center word at position $t$
- $w_{t+j}$: the context word at position $t+j$ (i.e., $j$ positions away from the center)
- $P(w_{t+j} \mid w_t)$: the probability of observing context word $w_{t+j}$ given center word $w_t$
For a window size of $m = 2$, this product includes probabilities for positions $t-2$, $t-1$, $t+1$, and $t+2$, giving us four context words per center word.
The product form comes from the independence assumption: we treat each context position as a separate prediction task. While context words aren't truly independent (knowing "brown" appears near "fox" tells us something about what other words might appear), this simplification makes training tractable and works well in practice.
Converting to Log-Likelihood
Working with products of probabilities is numerically unstable: multiplying many small numbers quickly underflows to zero. The standard solution is to take the logarithm, which converts products to sums. Using the property $\log(ab) = \log a + \log b$, we transform the product of probabilities into a sum of log-probabilities:

$$\mathcal{L}_t = \sum_{-m \le j \le m,\, j \ne 0} \log P(w_{t+j} \mid w_t)$$
where:
- $\mathcal{L}_t$: the log-likelihood for the center word at position $t$ (a scalar value)
- $\sum$: the summation operator, adding together all the terms
- $m$: the window size
- $j$: the offset from the center position, excluding 0
- $\log P(w_{t+j} \mid w_t)$: the log-probability of observing context word $w_{t+j}$ given center word $w_t$
This is the log-likelihood for a single center word. Since probabilities are between 0 and 1, their logarithms are negative, but higher (less negative) values mean the model assigns higher probabilities to the true context words. Our goal is to maximize this quantity.
Unpacking the Softmax
Recall that $P(w_{t+j} \mid w_t)$ is computed via softmax over dot products:

$$P(w_{t+j} \mid w_t) = \frac{\exp(c_{w_{t+j}}^\top v_{w_t})}{\sum_{k=1}^{V} \exp(c_k^\top v_{w_t})}$$
where:
- $v_{w_t}$: the embedding vector of the center word (from matrix $W$)
- $c_{w_{t+j}}$: the context vector of the context word at position $t+j$ (from matrix $W'$)
- $c_{w_{t+j}}^\top v_{w_t}$: the dot product between context and embedding vectors (a scalar score)
- $\exp$: the exponential function
- $V$: the vocabulary size
- $k$: an index iterating over all vocabulary words in the normalization sum
Substituting this into our log-likelihood and using the property $\log(a/b) = \log a - \log b$, we get:

$$\mathcal{L}_t = \sum_{-m \le j \le m,\, j \ne 0} \left( c_{w_{t+j}}^\top v_{w_t} - \log \sum_{k=1}^{V} \exp(c_k^\top v_{w_t}) \right)$$
where:
- $c_{w_{t+j}}^\top v_{w_t}$: the score for the true context word (the "positive" term we want to maximize)
- $\log \sum_{k=1}^{V} \exp(c_k^\top v_{w_t})$: the log-sum-exp over all vocabulary words (the normalization term)
This expanded form reveals the two forces at work during training:
- The positive term $c_{w_{t+j}}^\top v_{w_t}$: Maximizing this pushes the context word's vector $c_{w_{t+j}}$ closer to the center word's embedding $v_{w_t}$. The dot product increases when the vectors point in similar directions.
- The normalization term $\log \sum_{k=1}^{V} \exp(c_k^\top v_{w_t})$: This term is subtracted, so maximizing the objective means minimizing this sum. Since the sum includes all vocabulary words, this effectively pushes all other words away from the center word.
The interplay between these forces is what makes Skip-gram work: it simultaneously pulls true context words closer while pushing non-context words away.
This visualization shows exactly what Skip-gram learns to do: before training, dot products between any word pair are randomly distributed around zero. After training, the distributions separate. Context words (which should have high probability) develop higher dot products, while non-context words have lower dot products. This separation is what enables the softmax to assign high probabilities to true context words.
The Full Corpus Objective
A single center word gives us one training signal. To learn robust embeddings, we aggregate over the entire corpus. If the corpus has $T$ words in total, we average the log-likelihood across all positions:

$$J(\theta) = \frac{1}{T} \sum_{t=1}^{T} \sum_{-m \le j \le m,\, j \ne 0} \log P(w_{t+j} \mid w_t)$$
where:
- $J(\theta)$: the objective function (average log-likelihood across the corpus)
- $T$: total number of words in the training corpus
- $t$: position of the current center word (ranging from 1 to $T$)
- $m$: window size (number of context words on each side)
- $j$: offset from the center position, excluding 0
- $w_t$: the word at position $t$ (center word)
- $w_{t+j}$: a context word at offset $j$ from position $t$
- $P(w_{t+j} \mid w_t)$: the model's predicted probability for the context word given the center word
The factor $\frac{1}{T}$ normalizes by corpus size, making the objective comparable across different training sets. This is what we maximize during training. In practice, we minimize the negative log-likelihood (cross-entropy loss), which is equivalent but aligns with the convention of "minimizing loss."
Implementing the Loss Function
Let's translate this mathematics into code. The loss function computes how poorly the model predicts context words for a given center word:
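One way to write this loss in NumPy is sketched below; `W`, `W_prime`, and the index arguments are placeholder names for the quantities defined above, not a fixed API:

```python
import numpy as np

def skipgram_loss(W, W_prime, center_idx, context_indices):
    """Average negative log-likelihood of the true context words for one center word.

    W        : (V, d) embedding matrix (rows are word embeddings)
    W_prime  : (d, V) context matrix  (columns are context vectors)
    """
    v_center = W[center_idx]                            # (d,) center word embedding
    scores = W_prime.T @ v_center                       # (V,) one logit per vocabulary word
    scores -= scores.max()                              # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum())   # log-softmax over the vocabulary
    return -np.mean(log_probs[context_indices])         # cross-entropy over the true contexts

# Tiny illustrative check with random weights: the loss should sit near log(V).
rng = np.random.default_rng(0)
V, d = 8, 5
W, W_prime = rng.normal(0, 0.1, (V, d)), rng.normal(0, 0.1, (d, V))
print(skipgram_loss(W, W_prime, center_idx=2, context_indices=[1, 3, 4, 6]))
print("random baseline:", np.log(V))
```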
The computed loss tells us how surprised the model is by the actual context words. The random baseline represents the expected loss if all words were equally likely, calculated as $\log V$ (the negative log of the uniform probability $1/V$). Since our loss is close to this baseline with randomly initialized weights, the model is essentially guessing. Lower loss indicates better context predictions.
As training progresses, the model learns to assign higher probabilities to actual context words, driving the loss down. A well-trained model on a large corpus typically achieves losses in the range of 2-4 (when using negative sampling), indicating it has learned to predict context words much better than chance.
Visualizing Gradient Updates
To understand how Skip-gram learns, let's visualize what happens during a single gradient update. When we train on the pair ("fox" → "brown"), the gradients adjust the embeddings:
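Numerically, the full-softmax gradients have a simple form: the error signal at the output is "predicted probabilities minus the one-hot target." A minimal sketch of one SGD step on a single (center, context) pair, assuming NumPy arrays `W` ($V \times d$) and `W_prime` ($d \times V$) as defined earlier, might look like this:

```python
import numpy as np

def sgd_step(W, W_prime, center_idx, context_idx, lr=0.05):
    """One full-softmax gradient step on a single (center, context) pair."""
    h = W[center_idx]                          # (d,) center word embedding
    scores = W_prime.T @ h                     # (V,) one logit per vocabulary word
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                       # softmax probabilities

    error = probs.copy()                       # dL/du_j = P(w_j | center) for j != context ...
    error[context_idx] -= 1.0                  # ... and P - 1 (negative) for the true context word

    grad_h = W_prime @ error                   # (d,) gradient for the center embedding
    W_prime -= lr * np.outer(h, error)         # pull the true context column toward h, push others away
    W[center_idx] -= lr * grad_h
    return -np.log(probs[context_idx])         # cross-entropy loss for this pair

# e.g. training on the pair ("fox" -> "brown") would call:
# sgd_step(W, W_prime, word_to_idx["fox"], word_to_idx["brown"])
```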
The gradient visualization reveals the core learning mechanism:
- The target context word ("brown") receives a negative gradient, meaning its context vector will be updated to increase its dot product with "fox", pulling it closer in embedding space.
- All other words receive positive gradients proportional to their current probability. Words the model incorrectly thinks are likely contexts get pushed away more strongly.
This push-pull dynamic, repeated millions of times across the corpus, gradually organizes the embedding space so that words appearing in similar contexts end up nearby.
Generating Training Data
With the objective function defined, we need training data to optimize it. Skip-gram's training data consists of (center word, context word) pairs extracted from raw text. The advantage of this approach is that we need no manual labels. The text itself provides supervision through co-occurrence patterns.
The Sliding Window Approach
The algorithm is straightforward:
- Slide a window across the corpus, one word at a time
- Treat each word as a potential center word
- Pair it with every word within the window (excluding itself)
This process transforms unstructured text into structured training examples.
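A straightforward implementation of this procedure (the function name `generate_pairs` is just a choice for this sketch):

```python
def generate_pairs(tokens, window_size=2):
    """Return (center_word, context_word) pairs by sliding a window over the tokens."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window_size)
        end = min(len(tokens), i + window_size + 1)
        for j in range(start, end):
            if j != i:                                  # skip the center word itself
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the quick brown fox jumps over the lazy dog".split()
pairs = generate_pairs(sentence, window_size=2)
print(len(pairs))      # 30 pairs from this 9-word sentence
print(pairs[:4])       # [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown')]
```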
The Multiplication Effect
Notice the dramatic expansion: a 9-word sentence produces 30 training pairs. This happens because:
- Each word in the middle of the sentence generates 4 pairs (2 words on each side)
- Words at the edges generate fewer pairs (only 2-3, depending on position)
For large corpora with billions of words, this multiplication effect produces massive training datasets. A corpus of 1 billion words with window size 5 generates roughly 10 billion training pairs. This abundance of training signal is why Skip-gram can learn rich semantic representations without any manual annotation. The structure of language itself provides the supervision.
Window Size: A Critical Hyperparameter
The window size determines how many words on each side of the center word count as context. This choice significantly affects what the embeddings capture.
The window size hyperparameter controls the trade-off between syntactic and semantic similarity in learned embeddings. Smaller windows emphasize syntactic relationships, while larger windows capture topical similarity.
Larger windows create more training pairs, but they may dilute the signal by including less relevant contexts. The trade-off becomes clear: a window of 5 generates up to five times as many pairs as a window of 1 (somewhat fewer in practice because of sentence boundaries), but those additional context words sit further from the center and may be less semantically related.
Small windows (1-2 words) tend to produce embeddings where syntactically similar words cluster together. Words that can substitute for each other in the same grammatical position (like "dog" and "cat" as nouns) end up nearby.
Large windows (5-10 words) capture topical similarity. Words that appear in the same documents or discuss the same subjects cluster together, even if they play different grammatical roles.
Skip-gram vs CBOW: Two Sides of the Same Coin
Word2Vec actually includes two architectures: Skip-gram and Continuous Bag of Words (CBOW). They're mirror images of each other:
- Skip-gram: Given center word, predict context words
- CBOW: Given context words, predict center word
The key differences:
| Aspect | Skip-gram | CBOW |
|---|---|---|
| Input | Single center word | Multiple context words |
| Output | Multiple context words | Single center word |
| Rare words | Better (each occurrence creates multiple training examples) | Worse (rare words get averaged out) |
| Training speed | Slower (more predictions per position) | Faster (one prediction per position) |
| Best for | Smaller datasets, rare words | Larger datasets, frequent words |
Skip-gram's advantage with rare words comes from its training structure. For each occurrence of a rare word, Skip-gram generates multiple training examples (one for each context word). CBOW, by contrast, uses each occurrence only once. This gives Skip-gram more signal for learning good representations of infrequent words.
A Complete Implementation
We've covered the theory: the architecture, the objective function, and the training data. Now let's bring it all together into a working implementation that you can run and experiment with.
The SkipGram Class
Our implementation encapsulates the complete Skip-gram model in a single class. Each method corresponds to a concept we've discussed:
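A compact sketch of such a class, using full softmax and plain NumPy; the method and attribute names here (`train_pair`, `train`, `most_similar`, `W`, `W_prime`) are choices made for this sketch rather than a standard API:

```python
import numpy as np

class SkipGram:
    """Minimal full-softmax Skip-gram in plain NumPy (illustrative, not optimized)."""

    def __init__(self, vocab, embedding_dim=20, seed=0):
        self.vocab = list(vocab)
        self.word_to_idx = {w: i for i, w in enumerate(self.vocab)}
        rng = np.random.default_rng(seed)
        V, d = len(self.vocab), embedding_dim
        self.W = rng.normal(0, 0.1, (V, d))        # embedding matrix: rows are word vectors
        self.W_prime = rng.normal(0, 0.1, (d, V))  # context matrix: columns are context vectors

    def _forward(self, center_idx):
        h = self.W[center_idx]                     # (d,) hidden layer = embedding lookup
        scores = self.W_prime.T @ h                # (V,) logits
        probs = np.exp(scores - scores.max())
        return h, probs / probs.sum()              # softmax over the vocabulary

    def train_pair(self, center, context, lr=0.05):
        ci, oi = self.word_to_idx[center], self.word_to_idx[context]
        h, probs = self._forward(ci)
        error = probs.copy()
        error[oi] -= 1.0                           # dL/du = probs - one_hot(context)
        grad_h = self.W_prime @ error              # gradient w.r.t. the center embedding
        self.W_prime -= lr * np.outer(h, error)    # update context matrix
        self.W[ci] -= lr * grad_h                  # update the center word's embedding
        return -np.log(probs[oi])                  # cross-entropy loss for this pair

    def train(self, pairs, epochs=100, lr=0.05):
        for epoch in range(epochs):
            loss = np.mean([self.train_pair(c, o, lr) for c, o in pairs])
            if epoch % 20 == 0:
                print(f"epoch {epoch:3d}  mean loss {loss:.3f}")

    def most_similar(self, word, topn=3):
        v = self.W[self.word_to_idx[word]]
        sims = (self.W @ v) / (np.linalg.norm(self.W, axis=1) * np.linalg.norm(v) + 1e-12)
        order = np.argsort(-sims)                  # cosine similarity, highest first
        return [(self.vocab[i], float(sims[i])) for i in order if self.vocab[i] != word][:topn]
```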
Training the Model
Now let's train our Skip-gram model on a small corpus designed to have clear semantic groupings. We'll use words from five categories: royalty, people, animals, emotions, and movement.
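The article's exact corpus isn't reproduced here; the snippet below shows the shape of such a run with a stand-in corpus, built on the `generate_pairs` helper and `SkipGram` class sketched above, so the loss values it prints will differ somewhat from the numbers discussed next.

```python
# Stand-in corpus: a few short sentences per semantic category (illustrative only).
corpus = [
    "the king and the queen rule the royal palace",
    "the man and the woman walk with the child",
    "the dog and the cat chase the small bird",
    "she feels happy and he feels sad today",
    "they run and jump and walk very fast",
]

pairs = []
for sentence in corpus:
    pairs += generate_pairs(sentence.split(), window_size=2)

vocab = sorted({w for sentence in corpus for w in sentence.split()})
model = SkipGram(vocab, embedding_dim=20, seed=0)
model.train(pairs, epochs=100, lr=0.05)
```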
Interpreting the Training Results
The loss dropped by over 40%, indicating substantial learning. Let's understand what these numbers mean:
- Initial loss ≈ 3.4: With random weights, the model assigns roughly equal probability to all words. The expected loss under uniform predictions is $\log V$, which is in this same ballpark for our small vocabulary, so we start close to random chance.
- Final loss ≈ 2.0: The model now assigns higher probabilities to actual context words. This is well below the random baseline, confirming that learning occurred.
With such a small corpus (40 words, ~150 training pairs), the representations won't generalize as well as those trained on billions of words. But they're sufficient to demonstrate the core concepts.
To see how embeddings evolve during training, let's track their positions in 2D space at different epochs:
The evolution is clear: at epoch 0, words are randomly scattered with no discernible structure. By epoch 20, some clustering begins to emerge. By epoch 100, the five semantic categories have formed distinct regions in the embedding space. This visualization shows what Skip-gram learns: it organizes words by their contextual similarity.
Examining the Learned Embeddings
The real test of our model: do words from the same semantic category end up with similar embeddings? Let's query the model for words similar to representatives from each category.
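With the sketch model above, such a query is a short loop; the probe words here assume the stand-in corpus, so substitute words from your own vocabulary:

```python
for probe in ["king", "dog", "happy"]:
    if probe in model.word_to_idx:
        print(f"{probe:>8s} -> {model.most_similar(probe, topn=3)}")
```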
Interpreting Cosine Similarity
The results reveal that the model has captured semantic groupings from the training data:
- Cosine similarity > 0.5: Strong relationship, indicating words frequently appear in similar contexts
- Cosine similarity ≈ 0: No particular relationship, indicating words appear in different contexts
- Cosine similarity < 0: Opposing contexts (rare with Skip-gram)
Words from the same semantic category (royalty, people, animals, emotions, movement) tend to cluster together because they appeared near each other during training. With more training data, these patterns become even more pronounced. This is the foundation of how Word2Vec captures "meaning" from raw text.
Visualizing Pairwise Similarities
A heatmap provides a comprehensive view of how all words relate to each other. The block-diagonal structure reveals the semantic clusters our model has learned.
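One way to produce such a heatmap from the sketch model above (matplotlib is assumed; ordering the vocabulary by semantic category is what makes the blocks visible):

```python
import numpy as np
import matplotlib.pyplot as plt

E = model.W / np.linalg.norm(model.W, axis=1, keepdims=True)  # unit-normalize each embedding
sim = E @ E.T                                                  # (V, V) matrix of cosine similarities

plt.figure(figsize=(8, 7))
plt.imshow(sim, cmap="coolwarm", vmin=-1, vmax=1)
plt.xticks(range(len(model.vocab)), model.vocab, rotation=90, fontsize=7)
plt.yticks(range(len(model.vocab)), model.vocab, fontsize=7)
plt.colorbar(label="cosine similarity")
plt.title("Pairwise cosine similarity of learned embeddings")
plt.tight_layout()
plt.show()
```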
The heatmap reveals the structure our model has learned. Each bright block along the diagonal corresponds to a semantic category. Words within the same group have high similarity (warm colors) while words across groups have lower similarity (cool colors). This block-diagonal structure is exactly what we hoped to achieve: the model has organized its embedding space to reflect semantic relationships.
Embedding Geometry: Norms and Directions
Word embeddings encode information in both their direction (which determines similarity via cosine) and their magnitude (norm). Let's examine how embedding norms are distributed across our vocabulary:
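A quick look at the norms, again using the sketch model from above:

```python
import numpy as np

norms = np.linalg.norm(model.W, axis=1)           # L2 norm of each word's embedding
for word, norm in sorted(zip(model.vocab, norms), key=lambda x: -x[1])[:10]:
    print(f"{word:>10s}  {norm:.3f}")
print(f"mean norm: {norms.mean():.3f}   std: {norms.std():.3f}")
```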
In production Word2Vec models trained on large corpora, embedding norms often correlate with word frequency. Frequent words tend to have larger norms. Our small corpus doesn't show this pattern strongly, but it's an important property to be aware of when using pre-trained embeddings.
The Softmax Bottleneck
There's a computational elephant in the room. The softmax normalization requires summing over all vocabulary words:

$$P(w_j \mid w_I) = \frac{\exp(c_{w_j}^\top v_{w_I})}{\sum_{k=1}^{V} \exp(c_k^\top v_{w_I})}$$

where:
- $P(w_j \mid w_I)$: probability of context word $w_j$ given center word $w_I$
- $c_{w_j}$: the context vector for word $w_j$ (column of $W'$)
- $v_{w_I}$: the embedding vector for the center word (row of $W$)
- $V$: the vocabulary size (total number of unique words)
- $k$: an index iterating over all words in the vocabulary
The computational bottleneck lies in the denominator $\sum_{k=1}^{V} \exp(c_k^\top v_{w_I})$, which requires computing a dot product and an exponential for every word in the vocabulary. For a vocabulary of 100,000 words, every single training step requires computing 100,000 dot products and 100,000 exponentials. With billions of training pairs, this becomes prohibitively expensive.
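A rough benchmark of the full softmax at different vocabulary sizes (illustrative only; absolute timings depend on your hardware and BLAS build, so they will not match the article's figures exactly):

```python
import time
import numpy as np

d = 100
rng = np.random.default_rng(0)
v_center = rng.normal(size=d)

for V in [10_000, 25_000, 50_000, 100_000]:
    W_prime = rng.normal(size=(d, V))             # random context matrix of the right shape
    start = time.perf_counter()
    for _ in range(100):                          # repeat for a more stable estimate
        scores = W_prime.T @ v_center             # V dot products
        probs = np.exp(scores - scores.max())     # V exponentials
        probs /= probs.sum()
    elapsed = (time.perf_counter() - start) / 100
    print(f"V = {V:>7,}   full softmax ≈ {elapsed * 1e3:.2f} ms")
```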
The timing results confirm linear scaling: doubling the vocabulary roughly doubles the computation time. At 100,000 words, each softmax takes several milliseconds. With billions of training examples, this adds up to weeks or months of training time, making full softmax impractical for production systems. This computational barrier motivated the development of approximation methods that reduce the per-step complexity from $O(V)$ to $O(\log V)$ or $O(k)$ with $k \ll V$.
This computational bottleneck motivated the development of approximation methods:
- Negative Sampling: Instead of computing probabilities over all words, sample a small number of "negative" words and train a binary classifier
- Hierarchical Softmax: Organize the vocabulary as a binary tree, reducing the cost of each probability computation from $O(V)$ to $O(\log V)$
We'll explore these techniques in detail in the following chapters.
Limitations and Considerations
Skip-gram produces high-quality embeddings, but it has limitations worth understanding:
- Static embeddings: Each word gets one vector regardless of context. The word "bank" has the same embedding whether it means a financial institution or a river bank. Contextual models like BERT address this limitation.
- No morphology: "run," "runs," "running," and "ran" are treated as completely separate words with no shared structure. FastText addresses this by incorporating subword information.
- Training data bias: Embeddings reflect biases present in the training corpus. If the training data associates certain professions with specific genders, the embeddings will encode these biases.
- Window-based context: Skip-gram captures local co-occurrence patterns but may miss longer-range dependencies. A word's meaning often depends on context beyond the immediate window.
- Frequency effects: Very rare words don't have enough training examples to learn good representations. Very frequent words (like "the") dominate the training signal.
Key Parameters
When training Skip-gram models, several hyperparameters significantly impact the quality of learned embeddings:
- `embedding_dim` (typical range: 50-300): The dimensionality of word vectors. Lower values (50-100) offer faster training and a smaller memory footprint but may miss subtle semantic distinctions. Higher values (200-300) capture more nuanced relationships but require more training data to avoid overfitting. Common choice: 100-200 for most applications; 300 for state-of-the-art results on analogy tasks.
- `window_size` (typical range: 2-10): Number of context words on each side of the center word. Small windows (2-3) emphasize syntactic relationships, where words that can substitute for each other cluster together. Large windows (5-10) capture topical/semantic similarity, where words from the same domain cluster together. Common choice: 5 for balanced syntactic and semantic representations.
- `min_count` (typical range: 1-100): Minimum word frequency required to include a word in the vocabulary. Lower values include rare words, but their embeddings may be unreliable due to a sparse training signal. Higher values produce more robust embeddings for the included words but exclude rare words. Common choice: 5-10 for large corpora; lower for smaller datasets.
- `learning_rate` (typical range: 0.01-0.1): Step size for gradient descent updates. Higher values offer faster initial convergence but may overshoot good solutions. Lower values provide more stable training but slower convergence. Common choice: 0.025 with linear decay during training.
- `epochs` (typical range: 1-20): Number of passes through the training corpus. Fewer epochs mean faster training but may underfit on smaller corpora. More epochs offer better convergence, with diminishing returns after 5-10 epochs on large corpora. Common choice: 5 epochs for billion-word corpora; more for smaller datasets.
- `negative_samples` (typical range: 5-20, when using negative sampling): Number of negative examples per positive example. Fewer negatives (5) give faster training but may not distinguish words as sharply. More negatives (15-20) give better discrimination but slower training. Common choice: 5-10 for large corpora; 15-20 for smaller datasets.
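As an illustration of how these hyperparameters map onto a production tool, here is a sketch using the gensim library (parameter names follow gensim 4.x; the one-sentence toy corpus is only there to make the snippet runnable):

```python
from gensim.models import Word2Vec

# An iterable of tokenized sentences; replace with your real corpus.
sentences = [["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]]

model = Word2Vec(
    sentences=sentences,
    vector_size=100,   # embedding_dim
    window=5,          # window_size
    min_count=1,       # min_count (1 only because this toy corpus is tiny)
    sg=1,              # 1 = Skip-gram, 0 = CBOW
    negative=5,        # negative_samples
    alpha=0.025,       # initial learning_rate (decays linearly by default)
    epochs=5,          # passes over the corpus
    workers=4,
)
print(model.wv.most_similar("fox", topn=3))
```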
Summary
The Skip-gram model transforms the distributional hypothesis into a practical learning algorithm. By training a neural network to predict context words from center words, we learn dense vector representations that capture semantic relationships.
Key takeaways:
- Prediction as learning: Skip-gram learns by predicting context words given a center word, turning co-occurrence patterns into a supervised learning task
- Two embedding matrices: The model maintains separate embeddings for words as targets ($W$) and as contexts ($W'$), with $W$ typically used as the final word vectors
- Softmax over vocabulary: Output probabilities are computed via softmax, which normalizes scores across all vocabulary words
- Window size matters: Smaller windows capture syntactic similarity; larger windows capture topical similarity
- Skip-gram vs CBOW: Skip-gram predicts multiple contexts from one word; CBOW predicts one word from multiple contexts. Skip-gram works better for rare words
- Computational bottleneck: Full softmax requires summing over the entire vocabulary, motivating approximations like negative sampling
The next chapter explores CBOW, Skip-gram's mirror image, which averages context embeddings to predict center words. Understanding both architectures provides insight into how neural networks learn from distributional patterns.