Negative Sampling: Efficient Word Embedding Training

Michael Brenndoerfer · December 11, 2025 · 40 min read · 9,681 words

Learn how negative sampling transforms expensive softmax computation into efficient binary classification, enabling practical training of word embeddings on large corpora.

Negative Sampling

The Skip-gram model learns powerful word representations, but it has a computational Achilles' heel: the softmax denominator. Every training step requires computing a sum over the entire vocabulary:

$$P(w_c | w_t) = \frac{\exp(\mathbf{w}'_c \cdot \mathbf{w}_t)}{\sum_{k=1}^{V} \exp(\mathbf{w}'_k \cdot \mathbf{w}_t)}$$

where:

  • $P(w_c | w_t)$: probability of context word $w_c$ given center word $w_t$
  • $\mathbf{w}'_c$: context embedding vector for word $w_c$
  • $\mathbf{w}_t$: center embedding vector for word $w_t$
  • $V$: vocabulary size

For a vocabulary of 100,000 words, this means 100,000 dot products and exponentials per training step. With billions of training examples, full softmax becomes impractical. Training would take months instead of days.

Negative sampling solves this problem differently: instead of predicting the exact probability of a context word, we reformulate the task as binary classification. Given a word pair, is it a genuine context pair from the corpus, or a fake pair we constructed? This transformation reduces the computational cost from $O(V)$ to $O(k)$, where $k$ is a small constant (typically 5-20), making large-scale training feasible.

This chapter develops negative sampling from first principles. We'll derive the objective function, understand why the clever sampling distribution matters, implement efficient training, and see how this approximation achieves nearly the same quality as full softmax at a fraction of the cost.

The Core Insight: Binary Classification Instead of Softmax

The full Skip-gram model tries to answer: "What is the probability of each vocabulary word being a context word?" This requires computing a probability distribution over 100,000+ words.

Negative sampling asks a simpler question: "Is this specific word pair a real context pair or a fake one?" This is binary classification, requiring only a single probability per pair.

Negative Sampling

Negative sampling approximates the full softmax objective by converting the multi-class classification problem into multiple binary classification problems. For each positive (center, context) pair from the corpus, we sample $k$ "negative" pairs and train the model to distinguish real pairs from fake ones.

Consider the sentence "The quick brown fox jumps." When "fox" is the center word, "brown" is a genuine context word. But what about "elephant" or "democracy"? These words don't appear near "fox" in this context. They're negatives, and we can use them to train the model.

In[2]:
import numpy as np

# A genuine training pair from our corpus
center_word = "fox"
positive_context = "brown"  # Actually appears near "fox"

# Negative samples: words that don't appear near "fox" in this context
negative_contexts = ["elephant", "democracy", "refrigerator", "quantum", "bicycle"]
Out[3]:
Negative Sampling Training Example:
--------------------------------------------------
Center word: 'fox'
Positive context (from corpus): 'brown' → label: 1

Negative contexts (sampled randomly):
  'elephant' → label: 0
  'democracy' → label: 0
  'refrigerator' → label: 0
  'quantum' → label: 0
  'bicycle' → label: 0

Training task: distinguish real pair from fake pairs

The model learns by trying to maximize the probability for the positive pair and minimize it for the negative pairs. Through millions of such updates, the embedding space organizes so that words appearing in similar contexts end up nearby.

From Softmax to Sigmoid

The mathematical reformulation is straightforward. Instead of softmax, we use the sigmoid function $\sigma(x) = 1 / (1 + e^{-x})$ to output a probability for each pair independently.

For a positive (center, context) pair, we want:

$$P(D = 1 | w_c, w_t) = \sigma(\mathbf{w}'_c \cdot \mathbf{w}_t) = \frac{1}{1 + \exp(-\mathbf{w}'_c \cdot \mathbf{w}_t)}$$

where:

  • $D = 1$: indicator that the pair came from the data (is genuine)
  • $\sigma(x)$: sigmoid function
  • $\mathbf{w}'_c \cdot \mathbf{w}_t$: dot product between context and center embeddings

For negative pairs, we want $P(D = 1) \approx 0$, or equivalently $P(D = 0) \approx 1$:

$$P(D = 0 | w_n, w_t) = 1 - \sigma(\mathbf{w}'_n \cdot \mathbf{w}_t) = \sigma(-\mathbf{w}'_n \cdot \mathbf{w}_t)$$

where $w_n$ denotes a negative (randomly sampled) word and $\mathbf{w}'_n$ is its context embedding.

In[4]:
def sigmoid(x):
    """Compute sigmoid function with numerical stability."""
    # Clip to prevent overflow
    x = np.clip(x, -500, 500)
    return 1 / (1 + np.exp(-x))

# Example: probability computation for word pairs
dot_product_positive = 2.5   # High similarity (learned)
dot_product_negative = -1.0  # Low similarity (learned)

prob_positive = sigmoid(dot_product_positive)
prob_negative = sigmoid(dot_product_negative)
Out[5]:
Sigmoid Probabilities for Word Pairs:
--------------------------------------------------
Positive pair dot product: 2.5
  P(real pair) = σ(2.5) = 0.9241

Negative pair dot product: -1.0
  P(real pair) = σ(-1.0) = 0.2689

A well-trained model assigns high probability to
genuine pairs and low probability to fake pairs.

The positive pair with dot product 2.5 receives probability 0.92, indicating the model is confident this is a genuine context pair. The negative pair with dot product -1.0 receives only 0.27 probability, correctly reflecting low confidence that it came from real data.

Out[6]:
Visualization
S-shaped sigmoid curve showing probability output versus dot product input.
The sigmoid function maps dot products to probabilities. Large positive dot products (similar embeddings) yield probabilities near 1, indicating the model believes the pair is genuine. Large negative dot products (dissimilar embeddings) yield probabilities near 0. The function smoothly transitions around zero, allowing gradients to flow during training.

The Negative Sampling Objective

Having established that we can use sigmoid to score individual word pairs, we now face a fundamental question: what exactly should the model optimize? The answer lies in constructing a clever objective function that captures our core intuition: genuine context pairs should score high, while fake pairs should score low.

For each positive pair $(w_t, w_c)$ from the corpus, we sample $k$ negative words $\{w_{n_1}, w_{n_2}, \ldots, w_{n_k}\}$ from a noise distribution $P_n(w)$. The objective for this training example combines both goals into a single expression:

$$\mathcal{L} = \log \sigma(\mathbf{w}'_c \cdot \mathbf{w}_t) + \sum_{i=1}^{k} \log \sigma(-\mathbf{w}'_{n_i} \cdot \mathbf{w}_t)$$

where:

  • $\mathcal{L}$: objective function to maximize
  • $k$: number of negative samples
  • $w_{n_i}$: the $i$-th negative sample word
  • $P_n(w)$: noise distribution for sampling negatives

Let's break down what each component contributes:

The positive term $\log \sigma(\mathbf{w}'_c \cdot \mathbf{w}_t)$ rewards the model for assigning high probability to genuine pairs. When the dot product is large and positive, the sigmoid approaches 1, and $\log(1) = 0$. When the dot product is small or negative, the sigmoid drops toward 0, and the log becomes a large negative number. Thus, the model is penalized for underconfident predictions on true context words.

The negative terms $\sum_{i=1}^{k} \log \sigma(-\mathbf{w}'_{n_i} \cdot \mathbf{w}_t)$ reward the model for correctly identifying fake pairs. Notice the crucial negative sign inside the sigmoid. For a negative pair to score well, we need $\sigma(-\text{dot product}) \approx 1$, which happens when the dot product itself is negative (dissimilar embeddings). If the model mistakenly places a negative word close to the center word, the dot product becomes positive, $\sigma(-\text{positive})$ drops toward 0, and the log becomes a large penalty.

Together, these terms create opposing forces: the first pulls genuine context words toward the center word, while the second pushes random words away. Through millions of such updates, the embedding space self-organizes so that semantically related words cluster together.

In[7]:
def negative_sampling_loss(center_vec, context_vec, negative_vecs):
    """
    Compute the negative sampling loss for one training example.
    
    Args:
        center_vec: Embedding of the center word
        context_vec: Embedding of the positive context word
        negative_vecs: List of embeddings for negative samples
    
    Returns:
        loss: The negative of the objective (to be minimized)
    """
    # Positive term: log σ(context · center)
    pos_dot = np.dot(context_vec, center_vec)
    pos_loss = np.log(sigmoid(pos_dot) + 1e-10)
    
    # Negative terms: sum of log σ(-negative · center)
    neg_loss = 0
    for neg_vec in negative_vecs:
        neg_dot = np.dot(neg_vec, center_vec)
        neg_loss += np.log(sigmoid(-neg_dot) + 1e-10)
    
    # We maximize the objective, so return negative for minimization
    return -(pos_loss + neg_loss)

# Example with random vectors
np.random.seed(42)
embedding_dim = 50
center = np.random.randn(embedding_dim) * 0.5
context = np.random.randn(embedding_dim) * 0.5
negatives = [np.random.randn(embedding_dim) * 0.5 for _ in range(5)]

loss = negative_sampling_loss(center, context, negatives)
Out[8]:
Negative Sampling Loss Example:
--------------------------------------------------
Embedding dimension: 50
Number of negative samples: 5

Center · Context (positive): 1.0506
  σ(dot) = 0.7409

Center · Negatives:
  Negative 1: dot = -1.3478, σ(-dot) = 0.7938
  Negative 2: dot = 0.8562, σ(-dot) = 0.2981
  Negative 3: dot = 3.7310, σ(-dot) = 0.0234
  Negative 4: dot = 2.8612, σ(-dot) = 0.0541
  Negative 5: dot = -1.9121, σ(-dot) = 0.8713

Total loss (to minimize): 8.5505

In actual training, embeddings are initialized with small values, so the initial dot products sit near zero and the sigmoids hover around 0.5: the model starts out uncertain about every pair. The illustrative random vectors above happen to produce larger dot products, but the direction of learning is the same: training pushes these probabilities toward 1 for positive pairs and toward 0 for negatives.

Out[9]:
Visualization
Two line plots showing loss curves for positive and negative terms versus dot product.
Loss landscape for the negative sampling objective. Left: The positive term loss decreases as the dot product increases, rewarding similar embeddings for genuine pairs. Right: The negative term loss decreases as the dot product decreases (becomes more negative), rewarding dissimilar embeddings for fake pairs. The steep gradients near zero indicate where the model learns most rapidly.

The loss landscape reveals the asymmetric pressure applied by each term. For positive pairs (left), the loss drops sharply as the dot product increases, strongly encouraging the model to push genuine context words closer. For negative pairs (right), the loss drops as the dot product becomes negative, encouraging dissimilar embeddings. The steepest gradients occur near zero, where the model is most uncertain, focusing learning effort where it matters most.

Understanding the Gradient Forces

The objective function creates a tug-of-war in embedding space. Each training step applies forces that reshape the geometry of word representations. Understanding these gradients reveals how semantic structure emerges from local updates.

During backpropagation, two types of forces act on each embedding:

  1. Attractive force (positive pairs): The gradient pulls the center word embedding toward its true context word. Words that genuinely co-occur are drawn together in the vector space.

  2. Repulsive force (negative pairs): The gradient pushes the center word away from each sampled negative. Words that don't co-occur are separated in the vector space.

The magnitude of each force depends on the current prediction confidence. If the model already confidently predicts a positive pair correctly, the gradient is small. If it's uncertain or wrong, the gradient is large. This self-regulating behavior focuses learning on the most informative examples.
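
We can make these force magnitudes concrete with a few numbers. The short sketch below reuses the sigmoid helper defined earlier and prints the gradient scale factor for a positive pair, |sigma(x) - 1|, and for a negative pair, sigma(x), at a few illustrative dot products x (the specific values are only for illustration).

# Gradient scale factors at a few dot products (illustrative values)
# Positive pair: magnitude |sigmoid(x) - 1| is large when the model wrongly
# scores a genuine pair as fake (x negative), small when it is confident.
# Negative pair: magnitude sigmoid(x) is large when embeddings are wrongly
# similar (x positive), small when they are already dissimilar.
for x in [-3.0, -1.0, 0.0, 1.0, 3.0]:
    pos_force = abs(sigmoid(x) - 1)   # pull strength for a genuine pair
    neg_force = sigmoid(x)            # push strength for a sampled negative
    print(f"dot={x:+.1f}  positive-pair force={pos_force:.3f}  negative-pair force={neg_force:.3f}")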

Out[10]:
Visualization
Line plot showing gradient magnitudes for positive and negative terms across dot product values.
Gradient magnitude as a function of dot product. For positive pairs (green), the gradient magnitude |σ - 1| is largest when the dot product is negative (wrong prediction) and smallest when positive (correct prediction). For negative pairs (red), the gradient |σ| is largest when embeddings are similar (wrong) and smallest when dissimilar (correct). This self-regulating behavior focuses learning where it matters most.

The gradient curves cross at zero, where the model is maximally uncertain. For positive pairs, the model learns strongly when embeddings are dissimilar (negative dot product) but relaxes when they're already similar. For negative pairs, the opposite holds: strong learning when embeddings are incorrectly similar, weak learning when already dissimilar. This elegant self-regulation emerges naturally from the sigmoid function.

Let's derive the gradients explicitly. Since training minimizes the loss $-\mathcal{L}$ by gradient descent, we differentiate the negated objective. For the center word embedding $\mathbf{w}_t$:

$$\frac{\partial (-\mathcal{L})}{\partial \mathbf{w}_t} = (\sigma(\mathbf{w}'_c \cdot \mathbf{w}_t) - 1) \mathbf{w}'_c + \sum_{i=1}^{k} \sigma(\mathbf{w}'_{n_i} \cdot \mathbf{w}_t) \mathbf{w}'_{n_i}$$

where:

  • $\frac{\partial (-\mathcal{L})}{\partial \mathbf{w}_t}$: gradient of the loss with respect to the center word embedding
  • $\sigma(\mathbf{w}'_c \cdot \mathbf{w}_t)$: sigmoid of the positive pair dot product (current prediction)
  • $\sigma(\mathbf{w}'_{n_i} \cdot \mathbf{w}_t)$: sigmoid of the $i$-th negative pair dot product

Under a gradient-descent update, the first term pulls the center word toward the positive context (since $\sigma - 1 < 0$), while the sum pushes it away from the negatives.

For the positive context word embedding $\mathbf{w}'_c$:

$$\frac{\partial (-\mathcal{L})}{\partial \mathbf{w}'_c} = (\sigma(\mathbf{w}'_c \cdot \mathbf{w}_t) - 1) \mathbf{w}_t$$

Because $\sigma - 1 < 0$, the descent update moves the context embedding toward the center word, and it moves further the more uncertain the current prediction is.

For each negative word embedding $\mathbf{w}'_{n_i}$:

$$\frac{\partial (-\mathcal{L})}{\partial \mathbf{w}'_{n_i}} = \sigma(\mathbf{w}'_{n_i} \cdot \mathbf{w}_t) \mathbf{w}_t$$

Here the descent update pushes the negative sample away from the center word, making the pair more dissimilar.
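
As a sanity check on these formulas, the short sketch below compares the analytic gradient for the center vector against a finite-difference approximation on small random vectors. It reuses the sigmoid helper defined earlier; the dimensions and values are illustrative only.

# Finite-difference check of the center-word gradient (illustrative sketch)
rng = np.random.default_rng(0)
d = 8
center = rng.normal(scale=0.5, size=d)
context = rng.normal(scale=0.5, size=d)
negatives = [rng.normal(scale=0.5, size=d) for _ in range(3)]

def loss_wrt_center(c):
    """Negative sampling loss as a function of the center vector only."""
    pos = np.log(sigmoid(np.dot(context, c)))
    neg = sum(np.log(sigmoid(-np.dot(n, c))) for n in negatives)
    return -(pos + neg)

# Analytic gradient: (sigma(pos_dot) - 1) * context + sum of sigma(neg_dot) * negative
analytic = (sigmoid(np.dot(context, center)) - 1) * context
for n in negatives:
    analytic = analytic + sigmoid(np.dot(n, center)) * n

# Central finite differences, one coordinate at a time
eps = 1e-6
numeric = np.zeros(d)
for i in range(d):
    step = np.zeros(d)
    step[i] = eps
    numeric[i] = (loss_wrt_center(center + step) - loss_wrt_center(center - step)) / (2 * eps)

print("max |analytic - numeric|:", np.max(np.abs(analytic - numeric)))

The two gradients should agree to several decimal places, confirming the signs and terms above.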

Out[11]:
Visualization
Diagram showing arrows indicating attractive force for positive pair and repulsive forces for negative pairs.
Gradient forces during negative sampling training. The positive context word (green) is pulled toward the center word, while negative samples (red) are pushed away. The magnitude of each force depends on the current sigmoid probability: words that are incorrectly positioned receive stronger gradients.

The Sampling Distribution: Why $P_n(w) \propto f(w)^{0.75}$

The choice of how to sample negative words turns out to be surprisingly consequential. A naive approach might sample words uniformly at random from the vocabulary. But consider what happens: "the" appears in nearly every sentence, yet it would only be sampled as a negative in $1/V$ of cases. Meanwhile, "aardvark" appears rarely in real text but would be sampled just as often as "the."

This mismatch creates problems. Common function words like "the" genuinely appear near almost every word in the corpus. Sampling them as negatives is confusing because they're often legitimate context words too. Rare words, conversely, might appear as negatives so often that the model learns to push everything away from them, hurting their embedding quality.

The Word2Vec authors discovered that a modified unigram distribution works best:

$$P_n(w) = \frac{f(w)^{0.75}}{\sum_{w'} f(w')^{0.75}}$$

where:

  • $P_n(w)$: probability of sampling word $w$ as a negative
  • $f(w)$: frequency (count) of word $w$ in the corpus
  • $0.75$: smoothing exponent (empirically determined)

The exponent 0.75 was determined empirically and has two important effects:

  1. Reduces dominance of frequent words: Without the exponent, "the" might account for 7% of the sampling distribution. With 0.75, its share drops, giving other words more representation.

  2. Increases representation of rare words: Rare words are sampled more often than pure frequency would suggest, providing better training signal for their embeddings.

In[12]:
def create_sampling_distribution(word_counts, alpha=0.75):
    """
    Create the negative sampling distribution.
    
    Args:
        word_counts: Dictionary mapping words to their corpus frequencies
        alpha: Exponent to smooth the distribution (default 0.75)
    
    Returns:
        words: List of words
        probs: Sampling probabilities
    """
    words = list(word_counts.keys())
    freqs = np.array([word_counts[w] for w in words], dtype=np.float64)
    
    # Apply the smoothing exponent
    smoothed = freqs ** alpha
    
    # Normalize to create a probability distribution
    probs = smoothed / smoothed.sum()
    
    return words, probs

# Example corpus frequencies (simplified)
word_frequencies = {
    'the': 100000,
    'a': 80000,
    'is': 50000,
    'word': 1000,
    'embedding': 500,
    'neural': 400,
    'semantic': 200,
    'vector': 800,
    'aardvark': 5,
    'quixotic': 2
}

words, probs = create_sampling_distribution(word_frequencies, alpha=0.75)
Out[13]:
Comparison of Sampling Distributions:
-----------------------------------------------------------------
Word           Raw Freq    Uniform    Freq Prob     Smoothed
-----------------------------------------------------------------
the             100,000     0.1000     0.429356     0.393092
a                80,000     0.1000     0.343485     0.332516
is               50,000     0.1000     0.214678     0.233734
word              1,000     0.1000     0.004294     0.012431
embedding           500     0.1000     0.002147     0.007391
neural              400     0.1000     0.001717     0.006252
semantic            200     0.1000     0.000859     0.003718
vector              800     0.1000     0.003435     0.010515
aardvark              5     0.1000     0.000021     0.000234
quixotic              2     0.1000     0.000009     0.000118

Sum of smoothed probs: 1.0000

The comparison shows how smoothing redistributes probability mass. Common words like "the" drop from dominating the distribution (raw frequency ~43%) to a more moderate share. Rare words like "aardvark" and "quixotic" see their sampling probability increase relative to their raw frequency. This ensures rare words receive adequate negative sampling during training.

Out[14]:
Visualization
Three bar charts comparing raw frequency, smoothed frequency, and their ratio for vocabulary words.
Comparison of different sampling distributions for negative sampling. Left: Raw frequency distribution heavily favors common words. Middle: The smoothed distribution (freq^0.75) reduces the gap between frequent and rare words. Right: Ratio of smoothed to raw probability shows rare words get boosted while frequent words are dampened. The 0.75 exponent was empirically tuned to balance training efficiency and embedding quality.

Why 0.75? Intuition and Alternatives

The exponent 0.75 was found empirically by the Word2Vec authors. But why does it work?

  • Exponent = 1.0 (raw frequency): Frequent words dominate. The model sees "the" as a negative over and over, wasting computation on words that are already well-positioned.

  • Exponent = 0.0 (uniform): Rare words appear as often as frequent ones. But rare words may genuinely share contexts with many other words, making them poor negatives.

  • Exponent = 0.75: A compromise. Common words still appear often enough to learn good embeddings, but rare words get meaningful representation in the negative samples.
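
To see the effect numerically, we can reuse the create_sampling_distribution function and the toy word_frequencies from above and compare how much more likely "the" is to be drawn than "aardvark" under different exponents. This is a small illustrative check, not a tuning experiment.

# Ratio between a very frequent and a very rare word under different exponents
for exponent in [1.0, 0.75, 0.5, 0.0]:
    ws, ps = create_sampling_distribution(word_frequencies, alpha=exponent)
    p = dict(zip(ws, ps))
    ratio = p['the'] / p['aardvark']
    print(f"alpha={exponent:.2f}  P(the)={p['the']:.4f}  P(aardvark)={p['aardvark']:.6f}  ratio={ratio:,.0f}x")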

Out[15]:
Visualization
Line plot showing how different exponents affect the sampling probability of words with different frequencies.
Effect of the smoothing exponent on the sampling distribution. As the exponent decreases from 1.0 (raw frequency) toward 0.0 (uniform), the distribution becomes more balanced. The exponent 0.75 provides a good trade-off: frequent words are still sampled often enough to matter, while rare words get boosted representation.

How Many Negative Samples? The $k$ Hyperparameter

The number of negative samples $k$ is a crucial hyperparameter. More negatives provide stronger training signal but increase computation time.

Choosing k

Typical values for $k$ range from 5 to 20. Smaller values suffice for large datasets, where each word pair is seen many times; larger values help with smaller datasets or less frequent words.

The original Word2Vec paper suggests:

  • k = 2-5 for large datasets (billions of words)
  • k = 5-20 for smaller datasets

The intuition: with more data, each positive example is reinforced many times. Fewer negatives suffice because the model sees the same positive patterns repeatedly. With less data, more negatives help compensate for sparse positive signal.

In[16]:
def estimate_training_cost(vocab_size, num_negatives, num_training_pairs):
    """
    Compare computational cost of full softmax vs negative sampling.
    
    Cost is measured in dot product operations per training step.
    """
    # Full softmax: compute dot product with every vocab word
    softmax_cost = num_training_pairs * vocab_size
    
    # Negative sampling: 1 positive + k negatives per pair
    neg_sampling_cost = num_training_pairs * (1 + num_negatives)
    
    speedup = softmax_cost / neg_sampling_cost
    
    return softmax_cost, neg_sampling_cost, speedup

# Realistic training scenario
vocab_size = 100000
num_negatives = 10
num_training_pairs = 1_000_000_000  # 1 billion pairs

softmax, neg_samp, speedup = estimate_training_cost(vocab_size, num_negatives, num_training_pairs)
Out[17]:
Computational Cost Comparison:
-------------------------------------------------------
Vocabulary size:              100,000
Training pairs:         1,000,000,000
Negative samples k:                10

Full softmax cost:    100,000,000,000,000 dot products
Neg sampling cost:     11,000,000,000 dot products

Speedup factor:                 9,091x

This is why negative sampling enables practical training!

The speedup is dramatic. For a 100,000-word vocabulary with 10 negative samples, negative sampling performs approximately 9,000 times fewer operations per training step than full softmax. This translates directly to training time: what would take months with softmax can be completed in days with negative sampling.

Out[18]:
Visualization
Two plots comparing computational cost and speedup of negative sampling versus softmax across vocabulary sizes.
Computational cost comparison between full softmax and negative sampling for different vocabulary sizes. Left: Total operations scale linearly with vocabulary for softmax but remain constant for negative sampling. Right: Speedup factor increases dramatically with vocabulary size. For a 100k vocabulary with k=10, negative sampling is approximately 9,000x faster.

The trade-off between the number of negative samples and computational cost is linear: doubling $k$ doubles the work per training step. But the relationship between $k$ and embedding quality is more nuanced, with diminishing returns as $k$ increases.

Out[19]:
Visualization
Two plots showing linear cost increase and diminishing quality returns with increasing k.
Trade-off between number of negative samples (k) and training efficiency. Left: Computational cost grows linearly with k. Right: Empirically, embedding quality improves with more negatives but with diminishing returns. The sweet spot (k=5-15) balances quality against training time. Values based on typical Word2Vec performance patterns.

The recommended range of k=5-15 captures most of the quality gains while keeping computational costs manageable. Beyond k=20, additional negatives provide minimal improvement but continue to increase training time linearly.

Noise Contrastive Estimation: The Theoretical Foundation

Negative sampling didn't emerge from thin air. It's a practical simplification of a more principled technique called Noise Contrastive Estimation (NCE), developed by Gutmann and Hyvärinen (2010). Understanding NCE illuminates why negative sampling works and what theoretical guarantees we sacrifice for efficiency.

Noise Contrastive Estimation (NCE)

NCE transforms the problem of estimating a probability distribution into a binary classification problem: distinguish data samples from noise samples. Under certain conditions, the optimal classifier recovers the true data distribution.

The core insight of NCE is simple: if you can perfectly distinguish real data from noise, you must have learned what makes the data real. Here's how it works:

  1. Draw positive samples from the true data distribution $P_d(w | c)$ (context words given center)
  2. Draw negative samples from a known noise distribution $P_n(w)$
  3. Train a classifier to distinguish them

The key insight: if we use $k$ noise samples for every data sample, and our model outputs $s(w, c)$ as a score, the optimal classification decision is:

$$P(D = 1 | w, c) = \frac{P_d(w | c)}{P_d(w | c) + k \cdot P_n(w)}$$

where:

  • $P_d(w | c)$: true data distribution (probability of word $w$ given context $c$)
  • $P_n(w)$: noise distribution probability for word $w$

NCE models this probability as:

$$P(D = 1 | w, c) = \frac{\exp(s(w, c))}{\exp(s(w, c)) + k \cdot P_n(w)}$$

where:

  • $s(w, c)$: the model's score for word pair $(w, c)$, typically the dot product $\mathbf{w}' \cdot \mathbf{c}$
  • $k$: number of noise samples per data sample
  • $P_n(w)$: probability of word $w$ under the noise distribution

When training converges, the model learns that $\exp(s(w, c)) \propto P_d(w | c)$, recovering the true distribution up to a constant.

From NCE to Negative Sampling

Negative sampling departs from NCE in two important ways, each trading theoretical rigor for practical efficiency:

First simplification: Ignoring the partition function. NCE explicitly models the normalization constant (the sum over all vocabulary words that makes probabilities sum to 1). This constant is precisely what makes softmax expensive. Negative sampling simply ignores it, treating $\exp(s(w, c))$ as an unnormalized score. We lose the ability to compute true probabilities, but for learning embeddings, relative scores suffice.

Second simplification: Using sigmoid directly. NCE's probability formula accounts for the ratio of data to noise samples and the noise distribution. Negative sampling replaces this with the simpler sigmoid $\sigma(s(w, c))$. This makes implementation straightforward and gradients clean, at the cost of some theoretical precision.

What do we lose? Negative sampling no longer guarantees convergence to the true data distribution. The embeddings might not represent exact co-occurrence probabilities. But for downstream tasks like analogy completion, similarity search, and transfer learning, this rarely matters. The embeddings capture semantic relationships just as well, at a fraction of the computational cost.

In[20]:
def nce_probability(score, noise_prob, k):
    """
    Compute NCE probability that a sample is from data (not noise).
    
    Args:
        score: Model's score s(w, c) for word pair
        noise_prob: P_n(w) for the sampled word
        k: Number of noise samples per data sample
    """
    exp_score = np.exp(score)
    return exp_score / (exp_score + k * noise_prob)

def neg_sampling_probability(score):
    """
    Compute negative sampling probability using sigmoid.
    """
    return sigmoid(score)

# Compare the two for various scores
scores = np.linspace(-4, 4, 100)
noise_prob = 1e-4  # Typical probability for a word
k = 10

nce_probs = [nce_probability(s, noise_prob, k) for s in scores]
ns_probs = [neg_sampling_probability(s) for s in scores]
Out[21]:
Visualization
Line plot comparing NCE probability curve with negative sampling sigmoid curve.
Comparison of NCE and negative sampling probability functions. NCE (blue) accounts for the noise distribution and number of samples in its probability computation. Negative sampling (orange) uses the simpler sigmoid function. Despite the theoretical differences, both produce similar behavior for typical score ranges, explaining why negative sampling works well in practice.

Implementing Efficient Negative Sampling

A practical implementation needs efficient sampling from the noise distribution. Computing probabilities for 100,000 words every time we need a sample is slow. Instead, we precompute a sampling table.

The Alias Method

The most efficient approach uses the alias method, which allows $O(1)$ sampling from any discrete distribution after $O(V)$ preprocessing. A simpler approach that's nearly as fast, and the one used by the NegativeSampler class below, precomputes cumulative probabilities and samples with binary search.
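
For reference, here is a minimal sketch of the alias method (Vose's variant). The function names are illustrative rather than taken from any particular library.

def build_alias_table(probs):
    """Preprocess a discrete distribution into alias tables in O(V)."""
    n = len(probs)
    prob = np.zeros(n)                    # acceptance probability per bucket
    alias = np.zeros(n, dtype=np.int64)   # fallback index per bucket
    scaled = np.asarray(probs, dtype=np.float64) * n
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    while small and large:
        s, g = small.pop(), large.pop()
        prob[s], alias[s] = scaled[s], g
        scaled[g] += scaled[s] - 1.0      # donate the leftover mass to the big bucket
        (small if scaled[g] < 1.0 else large).append(g)
    for i in small + large:               # numerical leftovers get probability 1
        prob[i] = 1.0
    return prob, alias

def alias_sample(prob, alias):
    """Draw one index in O(1): pick a bucket, then flip a biased coin."""
    i = np.random.randint(len(prob))
    return i if np.random.random() < prob[i] else alias[i]

Either sampler works; the binary-search version below keeps the code shorter at a modest $O(\log V)$ cost per draw.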

In[22]:
class NegativeSampler:
    """
    Efficient negative sampling using precomputed cumulative distribution.
    
    Preprocessing: O(V)
    Sampling: O(log V) per sample using binary search
    """
    
    def __init__(self, word_counts, alpha=0.75):
        """
        Initialize the sampler with word frequency counts.
        
        Args:
            word_counts: Dict mapping words to frequencies
            alpha: Smoothing exponent (default 0.75)
        """
        self.words = list(word_counts.keys())
        self.word_to_idx = {w: i for i, w in enumerate(self.words)}
        
        # Compute smoothed probabilities
        freqs = np.array([word_counts[w] for w in self.words], dtype=np.float64)
        smoothed = freqs ** alpha
        self.probs = smoothed / smoothed.sum()
        
        # Precompute cumulative distribution for binary search sampling
        self.cumsum = np.cumsum(self.probs)
    
    def sample(self, k, exclude_idx=None):
        """
        Sample k negative word indices.
        
        Args:
            k: Number of samples
            exclude_idx: Index to exclude (typically the positive word)
        
        Returns:
            List of k word indices
        """
        samples = []
        while len(samples) < k:
            # Sample from uniform, use binary search to find word
            r = np.random.random()
            idx = np.searchsorted(self.cumsum, r)
            
            # Don't sample the excluded word (e.g., the true context)
            if idx != exclude_idx and idx not in samples:
                samples.append(idx)
        
        return samples
    
    def sample_batch(self, batch_size, k):
        """
        Sample negative indices for a batch of training examples.
        
        Returns:
            Array of shape (batch_size, k) containing negative indices
        """
        # Use numpy for vectorized sampling
        random_vals = np.random.random((batch_size, k * 2))  # Oversample to handle duplicates
        all_samples = np.searchsorted(self.cumsum, random_vals)
        
        # Take first k unique samples per row (simple approach)
        result = np.zeros((batch_size, k), dtype=np.int64)
        for i in range(batch_size):
            unique = np.unique(all_samples[i])[:k]
            if len(unique) < k:
                # Fallback: sample more
                unique = list(unique)
                while len(unique) < k:
                    idx = np.searchsorted(self.cumsum, np.random.random())
                    if idx not in unique:
                        unique.append(idx)
                unique = np.array(unique)
            result[i] = unique[:k]
        
        return result

# Create sampler with our example vocabulary
sampler = NegativeSampler(word_frequencies, alpha=0.75)
Out[23]:
Negative Sampling Demonstration:
-------------------------------------------------------
Drew 1000 batches of 5 negative samples each
Total samples: 5000

Word sampling frequencies (should roughly match smoothed dist):
Word           Expected   Observed    Ratio
---------------------------------------------
the              0.3931     0.2000     0.51
a                0.3325     0.1998     0.60
is               0.2337     0.1994     0.85
word             0.0124     0.1152     9.27
embedding        0.0074     0.0712     9.63
neural           0.0063     0.0618     9.88
semantic         0.0037     0.0424    11.41
vector           0.0105     0.1060    10.08

Verifying the Sampling Distribution

The observed frequencies in the table above deviate from the expected values because each batch of 5 negatives is drawn without repeats from a vocabulary of only 10 words, which flattens the distribution toward uniform. With a realistic vocabulary of tens of thousands of words, duplicate rejection is rare and this distortion vanishes. Let's verify that our sampler matches the intended distribution:

Out[24]:
Visualization
Bar chart comparing expected and observed sampling probabilities for vocabulary words.
Verification that the negative sampler produces the correct distribution. The bar chart compares expected probabilities (from the smoothed unigram distribution) against observed sampling frequencies. The close match confirms our implementation is correct. Small variations are expected due to sampling noise.

Complete Negative Sampling Implementation

Now let's put everything together into a complete, trainable Skip-gram model with negative sampling:

In[25]:
class SkipGramNegSampling:
    """
    Skip-gram model with negative sampling.
    
    This implementation combines efficient negative sampling with
    the Skip-gram objective for practical training.
    """
    
    def __init__(self, vocab_size, embedding_dim, word_counts, num_negatives=5, alpha=0.75):
        """
        Initialize model and sampler.
        
        Args:
            vocab_size: Number of words in vocabulary
            embedding_dim: Dimension of word embeddings
            word_counts: Dict mapping word indices to frequencies (for sampling)
            num_negatives: Number of negative samples per positive (k)
            alpha: Smoothing exponent for sampling distribution
        """
        # Embedding matrices
        self.W = np.random.randn(vocab_size, embedding_dim) * 0.01  # Center word embeddings
        self.W_prime = np.random.randn(vocab_size, embedding_dim) * 0.01  # Context embeddings
        
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        self.num_negatives = num_negatives
        
        # Create negative sampler
        self.sampler = NegativeSampler(word_counts, alpha=alpha)
    
    def forward(self, center_idx, context_idx, negative_indices):
        """
        Forward pass: compute loss and cache values for backprop.
        
        Returns:
            loss: Negative sampling loss
            cache: Values needed for backward pass
        """
        # Get embeddings
        center_vec = self.W[center_idx]
        context_vec = self.W_prime[context_idx]
        negative_vecs = self.W_prime[negative_indices]
        
        # Positive pair score and probability
        pos_score = np.dot(context_vec, center_vec)
        pos_sigmoid = sigmoid(pos_score)
        
        # Negative pair scores and probabilities
        neg_scores = negative_vecs @ center_vec  # (k,)
        neg_sigmoids = sigmoid(-neg_scores)  # Want these close to 1
        
        # Loss: -log(pos_sigmoid) - sum(log(neg_sigmoids))
        loss = -np.log(pos_sigmoid + 1e-10) - np.sum(np.log(neg_sigmoids + 1e-10))
        
        # Cache for backprop
        cache = {
            'center_idx': center_idx,
            'context_idx': context_idx,
            'negative_indices': negative_indices,
            'center_vec': center_vec,
            'context_vec': context_vec,
            'negative_vecs': negative_vecs,
            'pos_sigmoid': pos_sigmoid,
            'neg_sigmoids': neg_sigmoids
        }
        
        return loss, cache
    
    def backward(self, cache, learning_rate):
        """
        Backward pass: compute gradients and update weights.
        """
        center_idx = cache['center_idx']
        context_idx = cache['context_idx']
        negative_indices = cache['negative_indices']
        center_vec = cache['center_vec']
        context_vec = cache['context_vec']
        negative_vecs = cache['negative_vecs']
        pos_sigmoid = cache['pos_sigmoid']
        neg_sigmoids = cache['neg_sigmoids']
        
        # Gradient for center word embedding
        # From positive: (sigmoid - 1) * context
        # From negatives: sum of sigmoid(-dot) * negative
        grad_center = (pos_sigmoid - 1) * context_vec
        
        # Gradients for negative samples contribute: sigmoid(neg_score) * center
        for i, neg_idx in enumerate(negative_indices):
            neg_sigmoid_pos = 1 - neg_sigmoids[i]  # sigmoid(neg_score)
            grad_center += neg_sigmoid_pos * negative_vecs[i]
        
        # Gradient for positive context embedding: (sigmoid - 1) * center
        grad_context = (pos_sigmoid - 1) * center_vec
        
        # Gradients for negative embeddings: sigmoid(neg_score) * center
        grad_negatives = np.zeros_like(negative_vecs)
        for i in range(len(negative_indices)):
            neg_sigmoid_pos = 1 - neg_sigmoids[i]
            grad_negatives[i] = neg_sigmoid_pos * center_vec
        
        # Update embeddings
        self.W[center_idx] -= learning_rate * grad_center
        self.W_prime[context_idx] -= learning_rate * grad_context
        for i, neg_idx in enumerate(negative_indices):
            self.W_prime[neg_idx] -= learning_rate * grad_negatives[i]
    
    def train_pair(self, center_idx, context_idx, learning_rate=0.025):
        """
        Train on a single (center, context) pair.
        
        Returns:
            loss: The loss for this training step
        """
        # Sample negatives (excluding the true context)
        negative_indices = self.sampler.sample(self.num_negatives, exclude_idx=context_idx)
        
        # Forward and backward pass
        loss, cache = self.forward(center_idx, context_idx, negative_indices)
        self.backward(cache, learning_rate)
        
        return loss
    
    def get_embedding(self, word_idx):
        """Get the embedding vector for a word."""
        return self.W[word_idx]
    
    def most_similar(self, word_idx, top_n=5):
        """Find most similar words by cosine similarity."""
        word_vec = self.W[word_idx]
        word_vec_norm = word_vec / (np.linalg.norm(word_vec) + 1e-10)
        
        # Compute all similarities at once
        norms = np.linalg.norm(self.W, axis=1, keepdims=True)
        normalized = self.W / (norms + 1e-10)
        similarities = normalized @ word_vec_norm
        
        # Get top indices (excluding self)
        similarities[word_idx] = -np.inf
        top_indices = np.argsort(similarities)[::-1][:top_n]
        
        return [(idx, similarities[idx]) for idx in top_indices]

Training with Negative Sampling

Let's train our model on a corpus with clear semantic structure:

In[26]:
# Create a training corpus with semantic groups
training_sentences = [
    "king queen prince princess royal throne crown palace",
    "man woman boy girl child adult person human",
    "cat dog pet animal fur paw tail whisker",
    "happy sad angry joyful emotion feeling mood cheerful",
    "run walk jump sprint move fast slow quick",
    "king rules the kingdom with royal power",
    "the queen sits on the throne in the palace",
    "a man and woman walk with their child",
    "the cat and dog are beloved pets",
    "feeling happy brings a joyful mood",
    "run fast and jump quick to move",
]

# Build vocabulary
all_words = ' '.join(training_sentences).lower().split()
vocab = sorted(set(all_words))
word_to_idx = {w: i for i, w in enumerate(vocab)}
idx_to_word = {i: w for w, i in word_to_idx.items()}

# Count word frequencies
from collections import Counter
word_counts = Counter(all_words)
word_count_by_idx = {word_to_idx[w]: c for w, c in word_counts.items()}

# Generate training pairs with window size 2
def generate_pairs(sentences, word_to_idx, window=2):
    pairs = []
    for sentence in sentences:
        words = sentence.lower().split()
        for i, center in enumerate(words):
            for j in range(max(0, i-window), min(len(words), i+window+1)):
                if j != i:
                    pairs.append((word_to_idx[center], word_to_idx[words[j]]))
    return pairs

training_pairs = generate_pairs(training_sentences, word_to_idx, window=2)
Out[27]:
Training Corpus Statistics:
---------------------------------------------
Number of sentences: 11
Total words: 84
Vocabulary size: 56
Training pairs: 270

Sample vocabulary:
  ['a', 'adult', 'and', 'angry', 'animal', 'are', 'beloved', 'boy', 'brings', 'cat', 'cheerful', 'child', 'crown', 'dog', 'emotion']
  ... and 41 more words
In[28]:
# Initialize and train model
np.random.seed(42)
model = SkipGramNegSampling(
    vocab_size=len(vocab),
    embedding_dim=30,
    word_counts=word_count_by_idx,
    num_negatives=5,
    alpha=0.75
)

# Training loop
epochs = 50
losses_per_epoch = []

for epoch in range(epochs):
    np.random.shuffle(training_pairs)
    epoch_loss = 0
    
    for center_idx, context_idx in training_pairs:
        loss = model.train_pair(center_idx, context_idx, learning_rate=0.1)
        epoch_loss += loss
    
    avg_loss = epoch_loss / len(training_pairs)
    losses_per_epoch.append(avg_loss)
Out[29]:
Training Complete:
---------------------------------------------
Epochs: 50
Learning rate: 0.1
Negative samples per pair: 5

Initial loss: 4.1587
Final loss: 1.1548
Reduction: 72.2%

The loss decreases substantially over training, indicating the model is learning to distinguish positive from negative pairs. The reduction from initial to final loss shows the embeddings have organized to reflect the corpus structure.

Out[30]:
Visualization
Line plot showing decreasing training loss over epochs with initial rapid decline and eventual plateau.
Training loss curve for Skip-gram with negative sampling. The loss decreases rapidly in early epochs as the model learns to distinguish positive pairs from negatives. The plateau indicates convergence. The final loss reflects how well the model discriminates real context pairs from sampled negatives.

Evaluating the Learned Embeddings

Let's examine whether the model has learned meaningful semantic relationships:

In[31]:
# Test word similarities
test_words = ['king', 'cat', 'happy', 'run', 'man']
similarity_results = {}

for word in test_words:
    if word in word_to_idx:
        similar = model.most_similar(word_to_idx[word], top_n=5)
        similarity_results[word] = [(idx_to_word[idx], sim) for idx, sim in similar]
Out[32]:
Learned Word Similarities (Negative Sampling):
-------------------------------------------------------

Most similar to 'king':
  kingdom     : +0.748 █████████████
  sits        : +0.744 █████████████
  on          : +0.729 ████████████
  princess    : +0.719 ████████████
  queen       : +0.606 ████████████

Most similar to 'cat':
  animal      : +0.786 █████████████
  are         : +0.654 ████████████
  sits        : +0.607 ████████████
  dog         : +0.590 ███████████
  beloved     : +0.567 ███████████

Most similar to 'happy':
  joyful      : +0.914 ██████████████
  angry       : +0.687 ████████████
  brings      : +0.676 ████████████
  emotion     : +0.652 ████████████
  cheerful    : +0.640 ████████████

Most similar to 'run':
  sprint      : +0.894 ██████████████
  quick       : +0.754 █████████████
  jump        : +0.750 █████████████
  woman       : +0.749 █████████████
  slow        : +0.693 ████████████

Most similar to 'man':
  girl        : +0.755 █████████████
  brings      : +0.653 ████████████
  woman       : +0.624 ████████████
  boy         : +0.612 ████████████
  walk        : +0.591 ███████████

Visualizing the Embedding Space

Out[33]:
Visualization
Scatter plot of word embeddings projected to 2D with visible semantic clusters.
2D PCA projection of word embeddings learned with negative sampling. Words from the same semantic category tend to cluster together: royalty terms (purple), people terms (blue), animal terms (red), emotion terms (orange), and movement terms (green). The spatial organization demonstrates that negative sampling learns meaningful semantic relationships.

Similarity Heatmap

Out[34]:
Visualization
Heatmap showing pairwise cosine similarities between word embeddings with visible block structure.
Pairwise cosine similarity matrix for embeddings learned with negative sampling. The block-diagonal structure indicates that words within semantic categories have higher similarity with each other than with words from other categories. This pattern emerges purely from the training objective: distinguishing real context pairs from random negatives.

Comparing Full Softmax and Negative Sampling

How do embeddings from negative sampling compare to those from full softmax? Let's train both on the same data:

In[35]:
class SkipGramFullSoftmax:
    """Skip-gram with full softmax for comparison."""
    
    def __init__(self, vocab_size, embedding_dim):
        self.W = np.random.randn(vocab_size, embedding_dim) * 0.01
        self.W_prime = np.random.randn(vocab_size, embedding_dim) * 0.01
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
    
    def forward(self, center_idx, context_idx):
        center_vec = self.W[center_idx]
        scores = self.W_prime @ center_vec
        
        # Softmax
        exp_scores = np.exp(scores - np.max(scores))
        probs = exp_scores / exp_scores.sum()
        
        loss = -np.log(probs[context_idx] + 1e-10)
        return loss, probs, center_vec
    
    def backward(self, center_idx, context_idx, probs, center_vec, learning_rate):
        # Gradient for W_prime
        grad_probs = probs.copy()
        grad_probs[context_idx] -= 1
        
        for j in range(self.vocab_size):
            self.W_prime[j] -= learning_rate * grad_probs[j] * center_vec
        
        # Gradient for center embedding
        grad_center = self.W_prime.T @ grad_probs
        self.W[center_idx] -= learning_rate * grad_center
    
    def train_pair(self, center_idx, context_idx, learning_rate=0.025):
        loss, probs, center_vec = self.forward(center_idx, context_idx)
        self.backward(center_idx, context_idx, probs, center_vec, learning_rate)
        return loss
    
    def get_embedding(self, word_idx):
        return self.W[word_idx]

# Train both models
np.random.seed(42)
model_softmax = SkipGramFullSoftmax(vocab_size=len(vocab), embedding_dim=30)
model_negsamp = SkipGramNegSampling(
    vocab_size=len(vocab), embedding_dim=30,
    word_counts=word_count_by_idx, num_negatives=5
)

losses_softmax = []
losses_negsamp = []

for epoch in range(50):
    np.random.shuffle(training_pairs)
    epoch_loss_sm = 0
    epoch_loss_ns = 0
    
    for center_idx, context_idx in training_pairs:
        epoch_loss_sm += model_softmax.train_pair(center_idx, context_idx, learning_rate=0.1)
        epoch_loss_ns += model_negsamp.train_pair(center_idx, context_idx, learning_rate=0.1)
    
    losses_softmax.append(epoch_loss_sm / len(training_pairs))
    losses_negsamp.append(epoch_loss_ns / len(training_pairs))
Out[36]:
Visualization
Two plots comparing softmax and negative sampling training loss and embedding quality.
Training loss comparison between full softmax and negative sampling. Left: Loss curves over training epochs. Full softmax (orange) directly optimizes cross-entropy over the vocabulary, while negative sampling (blue) optimizes a binary classification surrogate. The losses are on different scales because the objectives differ. Right: Embedding quality comparison via average within-group similarity. Despite the simpler objective, negative sampling produces embeddings of comparable quality.

The comparison reveals a key insight: negative sampling produces embeddings of similar quality to full softmax, despite using a fundamentally different (and much cheaper) objective. This explains why negative sampling became the default training method for Word2Vec and similar models.

Training Dynamics Visualization

Out[37]:
Visualization
Line plot showing decreasing within-group distance and increasing between-group distance over training epochs.
Evolution of embedding distances during training. The plot tracks average cosine distance between semantically related words (within-group, green) versus unrelated words (between-group, orange) over training epochs. Effective training causes within-group distance to decrease (words cluster) while between-group distance increases (clusters separate).

Training progressively separates semantic clusters. Initially, within-group and between-group distances are similar because embeddings are random. As training proceeds, related words (within same semantic group) move closer together while unrelated words (across groups) move apart. The growing gap between these curves indicates successful learning.

Practical Considerations

Learning Rate Scheduling

The original Word2Vec implementation uses linear learning rate decay:

In[38]:
def get_learning_rate(epoch, total_epochs, initial_lr=0.025, min_lr=0.0001):
    """
    Linearly decay learning rate from initial to minimum.
    """
    progress = epoch / total_epochs
    return max(initial_lr * (1 - progress) + min_lr * progress, min_lr)

# Demonstrate learning rate schedule
epochs = np.arange(100)
lrs = [get_learning_rate(e, 100) for e in epochs]
Out[39]:
Visualization
Line plot showing linearly decreasing learning rate from 0.025 to near 0 over training epochs.
Linear learning rate decay schedule commonly used in Word2Vec training. Starting from a higher learning rate (0.025) allows fast initial progress, while the gradual decay to a minimum (0.0001) enables fine-tuning in later epochs. This schedule balances exploration and convergence.

Subsampling Frequent Words

Very frequent words like "the" and "a" provide less information than rare words. Word2Vec subsamples frequent words during training.

$$P(\text{keep } w) = \sqrt{\frac{t}{f(w)}} + \frac{t}{f(w)}$$

where:

  • $P(\text{keep } w)$: probability of keeping word $w$ during training
  • $f(w)$: relative frequency of word $w$ in the corpus (count of $w$ divided by total word count)
  • $t$: subsampling threshold, typically $10^{-5}$

Words with relative frequency above the threshold $t$ get subsampled, while rarer words are always kept. This formula ensures that very frequent words like "the" appear less often during training, reducing their dominance while preserving rare word information.

In[40]:
def subsample_prob(word_freq, total_words, threshold=1e-5):
    """
    Compute probability of keeping a word during training.
    
    Frequent words are subsampled to reduce their dominance.
    """
    freq_ratio = word_freq / total_words
    if freq_ratio > threshold:
        # Keep with probability based on frequency
        keep_prob = np.sqrt(threshold / freq_ratio) + (threshold / freq_ratio)
        return min(keep_prob, 1.0)
    return 1.0  # Always keep rare words

# Example: subsampling probabilities
total = sum(word_frequencies.values())
subsample_probs = {w: subsample_prob(f, total) for w, f in word_frequencies.items()}
Out[41]:
Subsampling Probabilities:
--------------------------------------------------
Word          Frequency    Keep Prob
--------------------------------------------------
the             100,000       0.0048
a                80,000       0.0054
is               50,000       0.0069
word              1,000       0.0506
vector              800       0.0569
embedding           500       0.0729
neural              400       0.0821
semantic            200       0.1196
aardvark              5       1.0000
quixotic              2       1.0000

The most frequent words like "the" and "a" have very low keep probabilities, meaning they're discarded during most training iterations. This prevents them from overwhelming the training signal. Rarer words like "aardvark" and "quixotic" have keep probability 1.0, ensuring every occurrence contributes to their embeddings.
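
These keep probabilities are applied while streaming over the corpus: each occurrence of a word is independently kept or dropped before training pairs are generated. Here is a minimal sketch of how that could be wired in, reusing the subsample_probs dictionary computed above; the helper name and example sentence are purely illustrative.

def subsample_tokens(tokens, keep_probs):
    """Randomly drop frequent tokens according to their keep probability."""
    return [t for t in tokens if np.random.random() < keep_probs.get(t, 1.0)]

# Example: thin out a token stream before generating skip-gram pairs
sentence = "the cat sat on the mat near the dog".split()
kept = subsample_tokens(sentence, subsample_probs)
print("original:", sentence)
print("kept    :", kept)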

Mini-batch Training

For large-scale training, processing one pair at a time is inefficient. Mini-batch training processes multiple pairs simultaneously, enabling better hardware utilization:

In[42]:
def train_minibatch(model, pairs, batch_size=32, learning_rate=0.025):
    """
    Train on mini-batches of (center, context) pairs.
    
    Note: this sketch still updates one pair at a time. A fully vectorized
    implementation would batch embedding lookups and gradient updates to
    make better use of GPU or SIMD hardware.
    """
    total_loss = 0
    for i in range(0, len(pairs), batch_size):
        batch = pairs[i:i + batch_size]
        batch_loss = 0
        
        for center_idx, context_idx in batch:
            loss = model.train_pair(center_idx, context_idx, learning_rate)
            batch_loss += loss
        
        total_loss += batch_loss
    
    return total_loss / len(pairs)

Limitations and Extensions

Negative sampling made Word2Vec practical, but it has limitations:

Approximation quality: Negative sampling doesn't optimize the true softmax objective. For most applications, this doesn't matter, but tasks requiring precise probability estimates may suffer.

Hyperparameter sensitivity: The choices of $k$ (number of negatives) and $\alpha$ (smoothing exponent) affect results. The defaults ($k$ = 5-10, $\alpha$ = 0.75) work well but aren't universally optimal.

No guarantee of convergence to NCE solution: Unlike NCE, negative sampling doesn't have theoretical guarantees about recovering the true distribution. In practice, this rarely matters for downstream tasks.

Static embeddings: Like all Word2Vec variants, negative sampling produces static embeddings. Each word gets one vector regardless of context. Modern contextual models like BERT address this limitation.

Summary

Negative sampling transforms the Skip-gram training problem from an expensive multi-class classification (softmax over the vocabulary) to efficient binary classification (real vs. fake pairs). This simple change reduces computational complexity from $O(V)$ to $O(k)$ per training step, making large-scale training feasible.

Key takeaways:

  • Binary classification reformulation: Instead of predicting which word is the context, predict whether a given pair is genuine or randomly sampled
  • Sigmoid instead of softmax: Each pair is scored independently using the sigmoid function, eliminating the need to normalize over the entire vocabulary
  • The sampling distribution matters: Using $P_n(w) \propto f(w)^{0.75}$ balances representation between frequent and rare words
  • Number of negatives ($k$): Typically 5-20, with smaller values for larger datasets. More negatives provide stronger training signal at higher computational cost
  • Theoretical foundation in NCE: Negative sampling is a simplification of Noise Contrastive Estimation, trading theoretical guarantees for practical efficiency
  • Comparable quality to full softmax: Despite the approximation, embeddings from negative sampling rival those from full softmax in downstream tasks

The next chapter explores hierarchical softmax, an alternative approximation that organizes the vocabulary as a binary tree, reducing softmax complexity from $O(V)$ to $O(\log V)$.

Key Parameters

Negative Sampling Parameters:

  • num_negatives (k): Number of negative samples per positive pair. Use 2-5 for very large corpora (billions of words) and 5-20 for smaller datasets. More negatives provide stronger training signal but increase computation.

  • alpha (smoothing exponent): Controls the noise distribution shape. The standard value 0.75 balances sampling frequency between common and rare words. Lower values (0.5) boost rare words more; higher values (1.0) sample proportionally to frequency.

  • embedding_dim: Dimensionality of word vectors. Typical values range from 50-300. Larger dimensions capture more nuance but require more data and computation.

  • learning_rate: Step size for gradient updates. Start with 0.025 and decay linearly to 0.0001. Higher rates speed early training; lower rates enable fine-tuning.

  • window_size: Context window for extracting training pairs. Common values are 5-10. Larger windows capture broader semantic relationships; smaller windows focus on syntactic patterns.

  • subsampling_threshold (t): Threshold for discarding frequent words. Typical value is $10^{-5}$. Words more frequent than this are probabilistically skipped during training.

  • epochs: Number of passes through the training data. For large corpora, 1-5 epochs often suffice. Smaller datasets may benefit from 20-50 epochs.
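
For orientation, here is how these knobs map onto a widely used implementation. This sketch assumes the gensim 4.x API (parameter names such as vector_size, negative, ns_exponent, and sample are gensim's, not part of this chapter's implementation); check the library documentation for your version before relying on it.

# Hypothetical configuration using gensim 4.x (assumed API)
from gensim.models import Word2Vec

sentences = [
    ["king", "rules", "the", "kingdom"],
    ["the", "queen", "sits", "on", "the", "throne"],
]

model = Word2Vec(
    sentences=sentences,
    sg=1,                # Skip-gram (0 would select CBOW)
    negative=10,         # number of negative samples k
    ns_exponent=0.75,    # smoothing exponent alpha for the noise distribution
    vector_size=100,     # embedding dimensionality
    window=5,            # context window size
    sample=1e-5,         # subsampling threshold t
    alpha=0.025,         # initial learning rate
    min_alpha=0.0001,    # final learning rate after linear decay
    epochs=5,            # passes over the corpus
    min_count=1,         # keep all words in this tiny toy corpus
)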
