FastText: Subword Embeddings for OOV Words & Morphology

Michael Brenndoerfer · December 11, 2025 · 39 min read · 9,072 words

Learn how FastText extends Word2Vec with character n-grams to handle out-of-vocabulary words, typos, and morphologically rich languages.


FastText

Word2Vec revolutionized word embeddings by learning dense vector representations from context. But it has a fundamental limitation: each word is treated as an atomic unit. The embedding for "running" shares nothing with "run," "runner," or "runs." Every word form needs to be seen during training to get a representation, and words absent from the training corpus receive no embedding at all.

This poses serious challenges for morphologically rich languages like German, Finnish, or Turkish, where a single root can spawn thousands of word forms through compounding and inflection. Even in English, rare words, typos, and domain-specific terms often lack embeddings despite being clearly related to known words.

FastText, introduced by Bojanowski et al. at Facebook AI Research in 2017, solves this elegantly: instead of treating words as atomic, it represents each word as a bag of character n-grams. The word "running" becomes a collection of substrings like "run," "unn," "nni," "nin," "ing," plus boundary markers. The word's embedding is simply the sum of its n-gram embeddings. This simple change unlocks three powerful capabilities: handling out-of-vocabulary words, capturing morphological patterns, and improving representations for rare words.

This chapter develops FastText from the ground up. We'll understand why subword representations matter, work through the architecture and training process, and implement a working FastText model that handles words it has never seen.

The Vocabulary Problem in Word2Vec

Before diving into FastText's solution, let's understand the problem it addresses. Word2Vec models learn one embedding per vocabulary entry. If a word wasn't in the training data, it has no representation.

In[2]:
import numpy as np

# Simulating a Word2Vec vocabulary
word2vec_vocab = {
    'run': 0,
    'running': 1,
    'runner': 2,
    'walk': 3,
    'walking': 4,
    'walker': 5,
    'swim': 6,
    'happy': 7,
    'happily': 8,
}

# Simulate embeddings (random for demonstration)
np.random.seed(42)
word2vec_embeddings = {word: np.random.randn(50) for word in word2vec_vocab}

# Words that cause problems
problem_words = ['runs', 'swam', 'swimmer', 'runing', 'happyness']
Out[3]:
Word2Vec Vocabulary Problem:
-------------------------------------------------------
Vocabulary size: 9
Known words: ['run', 'running', 'runner', 'walk', 'walking', 'walker', 'swim', 'happy', 'happily']

Problematic words (no embedding available):
  'runs': ✗ not found (different inflection of 'run')
  'swam': ✗ not found (past tense of 'swim')
  'swimmer': ✗ not found (noun form of 'swim')
  'runing': ✗ not found (typo for 'running')
  'happyness': ✗ not found (misspelling of 'happiness')

The vocabulary problem becomes severe in three scenarios:

Morphologically rich languages: German compounds like "Lebensversicherungsgesellschaft" (life insurance company) or Finnish cases like "talossanikinko" (in my house, too?) generate enormous vocabularies. Training a model that sees every possible form is impractical.

Domain-specific text: Medical terms, product names, or technical jargon often don't appear in general-purpose training corpora. A model trained on news articles won't have embeddings for "bevacizumab" or "iPhone15ProMax."

Noisy text: Social media, user reviews, and informal writing contain typos, creative spellings, and neologisms. "Cooool," "gr8," and "amaziiing" are meaningful but likely absent from formal training data.

Out[4]:
Visualization
Three panels showing examples of OOV words in German compounds, medical terms, and social media text.
Out-of-vocabulary (OOV) words pose challenges across different domains. Left: Morphological complexity in German compounds. Middle: Domain-specific terms in medicine. Right: Noisy text from social media. Traditional Word2Vec has no representation for any of these words.

The FastText Solution: Character N-grams

Word2Vec treats each word as an indivisible symbol. "Running" and "runner" share nothing in their representations despite their obvious linguistic relationship. This atomic view ignores the internal structure of words, structure that humans naturally exploit when encountering unfamiliar terms. When you see "unfamiliar," you don't need to have seen it before to guess its meaning: the prefix "un-" signals negation, and you recognize the root "familiar."

FastText captures this intuition algorithmically. Instead of treating words as atomic symbols, it represents each word as a bag of character n-grams, overlapping substrings of characters. The word's embedding emerges from combining the embeddings of its constituent parts.

Character N-gram

A character n-gram is a contiguous sequence of $n$ characters extracted from a word. For example, the 3-grams (trigrams) of "where" are: "whe", "her", "ere". FastText uses special boundary markers < and > to distinguish prefixes and suffixes, so "where" becomes "<wh", "whe", "her", "ere", "re>".

Why this representation? Consider what happens when two words share the same root. "Running" and "runner" both contain the substrings "run," "unn," and several others. If these shared substrings contribute to both words' embeddings, then related words will naturally have similar representations. This happens not because we explicitly encode morphological rules, but because the overlapping structure emerges from the character sequences themselves.

The decomposition process transforms each word through three steps:

  1. Add boundary markers: "where" → "<where>" (these mark word boundaries, distinguishing prefixes from suffixes)
  2. Extract all n-grams: Generate all substrings of length $n$ within the configured range (typically 3 to 6)
  3. Include the word itself: The full word acts as an additional feature for known vocabulary items
In[5]:
def get_ngrams(word, min_n=3, max_n=6):
    """
    Extract character n-grams from a word.
    
    Args:
        word: The input word
        min_n: Minimum n-gram length
        max_n: Maximum n-gram length
    
    Returns:
        List of character n-grams including boundary markers
    """
    # Add boundary markers
    word_with_markers = f"<{word}>"
    
    ngrams = []
    
    # Extract n-grams for each length
    for n in range(min_n, max_n + 1):
        for i in range(len(word_with_markers) - n + 1):
            ngram = word_with_markers[i:i + n]
            ngrams.append(ngram)
    
    return ngrams

# Example: extract n-grams for "where"
example_word = "where"
ngrams = get_ngrams(example_word, min_n=3, max_n=6)
Out[6]:
Character N-grams for 'where':
-------------------------------------------------------
With boundary markers: '<where>'

3-grams (5 total):
  ['<wh', 'whe', 'her', 'ere', 're>']
4-grams (4 total):
  ['<whe', 'wher', 'here', 'ere>']
5-grams (3 total):
  ['<wher', 'where', 'here>']
6-grams (2 total):
  ['<where', 'where>']

Total n-grams: 14

The word "where" decomposes into 18 n-grams spanning lengths 3 to 6. This decomposition reveals something important: different n-grams encode different aspects of the word. The boundary markers "<wh" and "re>" capture positional information: "<wh" signals a word beginning, while "re>" indicates a word ending. Meanwhile, the n-gram "her" floating in the middle carries no positional constraints.

Out[7]:
Visualization
Line plot showing n-gram count vs word length, and stacked bar chart showing n-gram distribution.
Number of character n-grams as a function of word length. Left: Total n-gram count grows linearly with word length for a fixed n-gram range (3-6). Short words have fewer n-grams, limiting their representational capacity. Right: Distribution of n-grams by length for words of different sizes. Longer words have more n-grams at each length, providing richer subword representations.

Why Boundary Markers Matter

This positional encoding matters more than it might first appear. Consider the substring "her" appearing in three different words: "where," "hero," and "other." Without boundary markers, these would all contribute the same n-gram "her" to each word's representation. But linguistically, these are quite different situations:

  • In "hero," "her" forms the start of the word (a prefix)
  • In "other," "her" ends the word (a suffix)
  • In "where," "her" sits in the middle

Boundary markers disambiguate these cases:

In[8]:
# Compare n-grams with and without markers
words_to_compare = ['where', 'hero', 'other']
trigrams_comparison = {}

for word in words_to_compare:
    marked = f"<{word}>"
    trigrams = [marked[i:i+3] for i in range(len(marked) - 2)]
    trigrams_comparison[word] = set(trigrams)
Out[9]:
Effect of Boundary Markers on Trigrams:
-------------------------------------------------------
'where' → '<where>':
  Trigrams: ['<wh', 'ere', 'her', 're>', 'whe']
'hero' → '<hero>':
  Trigrams: ['<he', 'ero', 'her', 'ro>']
'other' → '<other>':
  Trigrams: ['<ot', 'er>', 'her', 'oth', 'the']

Shared trigrams: {'her'}

The trigram 'her' appears differently:
  'where': 'her' (middle position)
  'hero':  '<he', 'her', 'ero', 'ro>' (starts with 'her')
  'other': 'the', 'her', 'er>' (ends with 'her')

The boundary markers ensure that "hero" (containing "<he," "her," "ero," "ro>") and "other" (containing "the," "her," "er>") produce overlapping but distinct n-gram sets. They share "her" because both contain that substring internally, but their boundary-marked n-grams differ entirely. This encoding captures both the shared internal structure and the distinct positional contexts.

Out[10]:
Visualization
Heatmap showing Jaccard similarity of n-gram sets between words containing 'her'.
N-gram overlap matrix for words containing the substring 'her'. The heatmap shows the Jaccard similarity (intersection over union) of n-gram sets between word pairs. 'where', 'hero', and 'other' all contain 'her' but have low overall overlap due to boundary markers capturing different positions. Words with shared roots (like 'hero' and 'heroic') show higher overlap.

From N-grams to Word Embeddings

With our words decomposed into n-gram sets, we need a mechanism to combine these pieces into a single word representation. FastText takes the simplest possible approach: summation.

Each n-gram $g$ has its own learnable embedding vector $\mathbf{z}_g$. To compute a word's embedding, we simply add up all its n-gram embeddings, optionally including a word-level embedding for vocabulary words:

$$\mathbf{v}_w = \mathbf{z}_w + \sum_{g \in G_w} \mathbf{z}_g$$

The components of this formula reveal FastText's elegance:

  • $\mathbf{v}_w$ is the final embedding we want, the vector that will represent word $w$ in downstream tasks
  • $\mathbf{z}_w$ is an optional word-level embedding that captures any meaning not encoded by subwords (for vocabulary words only)
  • $G_w$ is the set of character n-grams extracted from word $w$ (the decomposition we computed above)
  • $\mathbf{z}_g$ is the learned embedding for n-gram $g$, shared across all words containing that n-gram

The sum means that words sharing n-grams automatically share portions of their representation. The n-gram embeddings $\mathbf{z}_g$ are the true learned parameters. They encode character-sequence semantics that transfer across all words containing those sequences.

In[11]:
class FastTextEmbeddings:
    """
    Simple FastText-style embeddings using character n-grams.
    """
    
    def __init__(self, embedding_dim=50, min_n=3, max_n=6):
        """
        Initialize FastText embeddings.
        
        Args:
            embedding_dim: Dimension of embedding vectors
            min_n: Minimum n-gram length
            max_n: Maximum n-gram length
        """
        self.embedding_dim = embedding_dim
        self.min_n = min_n
        self.max_n = max_n
        
        # Embeddings stored in dictionaries
        self.word_embeddings = {}  # For known words
        self.ngram_embeddings = {}  # For n-grams
    
    def _get_ngrams(self, word):
        """Extract character n-grams from a word."""
        word_with_markers = f"<{word}>"
        ngrams = []
        for n in range(self.min_n, self.max_n + 1):
            for i in range(len(word_with_markers) - n + 1):
                ngrams.append(word_with_markers[i:i + n])
        return ngrams
    
    def _ensure_ngram_embeddings(self, ngrams):
        """Create embeddings for unseen n-grams."""
        for ng in ngrams:
            if ng not in self.ngram_embeddings:
                # Initialize with small random values
                self.ngram_embeddings[ng] = np.random.randn(self.embedding_dim) * 0.01
    
    def get_embedding(self, word):
        """
        Get the embedding for a word.
        
        Computes sum of n-gram embeddings plus word embedding (if known).
        """
        ngrams = self._get_ngrams(word)
        self._ensure_ngram_embeddings(ngrams)
        
        # Start with word embedding if known
        if word in self.word_embeddings:
            embedding = self.word_embeddings[word].copy()
        else:
            embedding = np.zeros(self.embedding_dim)
        
        # Add n-gram embeddings
        for ng in ngrams:
            embedding += self.ngram_embeddings[ng]
        
        return embedding

# Create FastText embeddings
np.random.seed(42)
ft = FastTextEmbeddings(embedding_dim=50, min_n=3, max_n=6)

# Add some known words
for word in ['run', 'running', 'runner', 'swim', 'swimming']:
    ft.word_embeddings[word] = np.random.randn(50) * 0.1

# Get embeddings for known and unknown words
known_embedding = ft.get_embedding('running')
unknown_embedding = ft.get_embedding('runs')  # Not in vocabulary
typo_embedding = ft.get_embedding('runing')   # Typo
Out[12]:
FastText Embedding Computation:
-------------------------------------------------------
Embedding dimension: 50
N-gram range: 3 to 6

Computing embedding for OOV word 'runs':
  N-grams (10): ['<ru', 'run', 'uns', 'ns>', '<run', 'runs', 'uns>', '<runs']...
  Word embedding: (not in vocabulary, starts at zero)
  Final embedding: sum of 10 n-gram vectors

Cosine similarities (benefit of shared n-grams):
  'run' vs 'runs':    0.1595
  'run' vs 'running': 0.0312
  'running' vs 'runing' (typo): 0.0091
  'run' vs 'swim':    0.1339

Even with random initialization, before any training has occurred, words that share many n-grams (such as "run" and "runs") already show positive similarity, though the signal is still noisy at this stage. This is the key insight: the structure of language is partially encoded in the character sequences themselves. Training amplifies this signal, making morphological variants cluster together not because we programmed morphological rules, but because the shared substrings naturally encode shared meaning.

Out[13]:
Visualization
Heatmap showing cosine similarities between word pairs based on shared n-grams.
Cosine similarity matrix between morphologically related words using random n-gram embeddings. Even before training, words sharing character n-grams show positive similarity. 'run' and 'runs' are highly similar due to shared n-grams, while 'run' and 'swim' (unrelated roots) show near-zero similarity.

Visualizing N-gram Overlap

To build intuition for why this works, let's visualize which n-grams different word forms share. Consider the word family "run," "runs," "running," and "runner," all derived from the same root:

Out[14]:
Visualization
Venn-style diagram showing shared and unique n-grams between related word forms.
Character n-gram overlap between related words. Words sharing the same root ('run', 'runs', 'running', 'runner') share many n-grams (shown in green), leading to similar embeddings. The n-grams '<ru', 'run', 'un>' appear in all forms, encoding the morphological relationship.

The FastText Training Objective

We've established how FastText represents words as sums of n-gram embeddings. But how do we learn what values those n-gram embeddings should take? The answer lies in adapting Word2Vec's Skip-gram training objective to operate on n-grams rather than words.

Recall that Skip-gram learns embeddings by predicting context words from center words. FastText uses the same principle, but with a crucial modification: instead of looking up a single embedding for the center word, we compute the center word's representation as the sum of its n-gram embeddings. This propagates gradients back to all n-grams during training, allowing them to learn from every word they appear in.

The scoring function measures how compatible a center word $w_t$ is with a context word $w_c$:

$$s(w_t, w_c) = \sum_{g \in G_{w_t}} \mathbf{z}_g^T \cdot \mathbf{v}_{w_c}$$

This formula computes a dot product, but with a twist: instead of a single center word vector, we sum over all n-gram vectors for the center word. The components are:

  • $s(w_t, w_c)$: compatibility score (higher means the context word is more likely given the center word)
  • $G_{w_t}$: the set of character n-grams for center word $w_t$ (our decomposition from earlier)
  • $\mathbf{z}_g$: the learnable embedding for n-gram $g$
  • $\mathbf{v}_{w_c}$: the context embedding for word $w_c$ (a separate embedding matrix, as in standard Skip-gram)

The key insight is that training this objective updates the n-gram embeddings $\mathbf{z}_g$. When we observe that "runner" appears near "fast" in training data, we don't just update an embedding for "runner." We update embeddings for all n-grams in "runner": "<ru," "run," "unn," "nne," "ner," "er>," etc. These same n-grams appear in "running," "runs," and other related words, so information flows between morphological variants through shared subword structure.
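A small sketch makes the scoring function concrete. The n-gram and context vectors below are random stand-ins rather than trained parameters, and the helper names are ours, not part of any library:

# Sketch of the FastText score s(w_t, w_c): sum of dot products between
# the center word's n-gram vectors and the context word's vector.
# All embeddings here are random placeholders, not trained values.
rng = np.random.default_rng(0)
dim = 10
ngram_vecs = {}  # lazily created n-gram embeddings z_g

def ngram_vec(g):
    if g not in ngram_vecs:
        ngram_vecs[g] = rng.normal(scale=0.1, size=dim)
    return ngram_vecs[g]

def score(center_word, context_vec):
    # s(w_t, w_c) = sum_g z_g . v_{w_c}, equivalently (sum_g z_g) . v_{w_c}
    return sum(ngram_vec(g) @ context_vec for g in get_ngrams(center_word))

context_fast = rng.normal(scale=0.1, size=dim)  # stand-in for v_{w_c} of "fast"
print(score("runner", context_fast))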

Training with Negative Sampling

Computing the full softmax over all vocabulary words would be prohibitively expensive. Like Word2Vec, FastText uses negative sampling to make training tractable. For each genuine (center word, context word) pair, we sample $k$ random "negative" context words and train the model to distinguish real pairs from fake ones:

$$\mathcal{L} = \log \sigma(s(w_t, w_c)) + \sum_{i=1}^{k} \log \sigma(-s(w_t, w_{n_i}))$$

Breaking down this objective:

  • The first term $\log \sigma(s(w_t, w_c))$ rewards high scores for genuine context words. We want the sigmoid of the score to be close to 1.
  • The second term $\sum_{i=1}^{k} \log \sigma(-s(w_t, w_{n_i}))$ penalizes high scores for randomly sampled negatives. We want the sigmoid of the negated score to be close to 1, meaning the original score should be negative.
  • $\sigma(x) = \frac{1}{1 + e^{-x}}$ is the sigmoid function, squashing scores to the $(0, 1)$ range
  • $k$ is the number of negative samples (typically 5-10)
  • $w_{n_i}$ is the $i$-th randomly sampled negative word

Maximizing this objective (equivalently, minimizing its negation as a loss) encourages the model to score real word-context pairs higher than random pairs. Because the center word representation is built from n-gram embeddings, the gradients flow back through the sum to update each n-gram's embedding individually.
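Plugging placeholder scores into the objective shows how the two terms pull in opposite directions; a minimal numeric sketch (the score values are made up, and the full training loop follows below):

# Minimal sketch of the negative-sampling objective for one training pair.
# score_pos / scores_neg are placeholder values, not outputs of a trained model.
def sigmoid_demo(x):
    return 1.0 / (1.0 + np.exp(-x))

score_pos = 2.0                                       # s(w_t, w_c) for the genuine context word
scores_neg = np.array([-1.5, 0.3, -0.7, -2.0, 0.1])   # s(w_t, w_{n_i}) for k = 5 negatives

objective = np.log(sigmoid_demo(score_pos)) + np.sum(np.log(sigmoid_demo(-scores_neg)))
print(f"objective = {objective:.4f}  (training pushes this upward)")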

Out[15]:
Visualization
Side-by-side comparison of Word2Vec and FastText architectures showing n-gram decomposition.
FastText architecture compared to Word2Vec. Word2Vec (left) looks up a single embedding for the center word. FastText (right) decomposes the center word into n-grams, looks up an embedding for each, and sums them to produce the center word representation. Both then compute a dot product with context word embeddings.

Implementing FastText Training

With the conceptual foundation in place, let's implement a complete FastText model. The implementation mirrors our mathematical formulation: we'll maintain a matrix of n-gram embeddings, compute word representations as sums over relevant n-grams, and train using negative sampling.

The implementation has three key components:

  1. N-gram extraction and indexing: Decompose vocabulary words into n-grams and assign each unique n-gram an index in our embedding matrix
  2. Word embedding computation: Sum the n-gram embeddings for any word (including out-of-vocabulary words)
  3. Training with negative sampling: Update n-gram embeddings based on the Skip-gram objective
In[16]:
def sigmoid(x):
    """Numerically stable sigmoid function."""
    return np.where(x >= 0, 
                    1 / (1 + np.exp(-x)), 
                    np.exp(x) / (1 + np.exp(x)))

class FastText:
    """
    FastText model with character n-grams and negative sampling.
    """
    
    def __init__(self, vocab, embedding_dim=50, min_n=3, max_n=6, 
                 num_negatives=5, learning_rate=0.05):
        """
        Initialize FastText model.
        
        Args:
            vocab: List of vocabulary words
            embedding_dim: Embedding dimension
            min_n, max_n: N-gram length range
            num_negatives: Number of negative samples
            learning_rate: Learning rate for SGD
        """
        self.vocab = vocab
        self.word_to_idx = {w: i for i, w in enumerate(vocab)}
        self.idx_to_word = {i: w for w, i in self.word_to_idx.items()}
        self.embedding_dim = embedding_dim
        self.min_n = min_n
        self.max_n = max_n
        self.num_negatives = num_negatives
        self.learning_rate = learning_rate
        
        # Context embeddings for each vocabulary word
        self.context_embeddings = np.random.randn(len(vocab), embedding_dim) * 0.01
        
        # N-gram embeddings (will be built during preprocessing)
        self.ngram_to_idx = {}
        self.ngram_embeddings = None
        
        # Precompute n-grams for all vocab words
        self._build_ngram_index()
        
        # Negative sampling distribution
        self._build_sampling_distribution()
    
    def _get_ngrams(self, word):
        """Extract character n-grams from a word."""
        word_with_markers = f"<{word}>"
        ngrams = []
        for n in range(self.min_n, self.max_n + 1):
            for i in range(len(word_with_markers) - n + 1):
                ngrams.append(word_with_markers[i:i + n])
        return ngrams
    
    def _build_ngram_index(self):
        """Build index of all n-grams from vocabulary."""
        all_ngrams = set()
        self.word_ngram_indices = {}
        
        for word in self.vocab:
            ngrams = self._get_ngrams(word)
            all_ngrams.update(ngrams)
            self.word_ngram_indices[word] = ngrams
        
        # Create n-gram to index mapping
        self.ngram_to_idx = {ng: i for i, ng in enumerate(sorted(all_ngrams))}
        
        # Initialize n-gram embeddings
        self.ngram_embeddings = np.random.randn(len(self.ngram_to_idx), 
                                                 self.embedding_dim) * 0.01
        
        # Precompute n-gram indices for each word
        self.word_ngram_idx_lists = {}
        for word in self.vocab:
            self.word_ngram_idx_lists[word] = [
                self.ngram_to_idx[ng] for ng in self.word_ngram_indices[word]
            ]
    
    def _build_sampling_distribution(self):
        """Build negative sampling distribution."""
        # Use uniform distribution for simplicity
        # (Real implementation would use word frequencies)
        self.sampling_probs = np.ones(len(self.vocab)) / len(self.vocab)
    
    def _get_word_embedding(self, word):
        """Get embedding for a word (sum of n-gram embeddings)."""
        if word in self.word_ngram_idx_lists:
            ngram_indices = self.word_ngram_idx_lists[word]
        else:
            # Handle OOV words
            ngrams = self._get_ngrams(word)
            ngram_indices = [self.ngram_to_idx[ng] for ng in ngrams 
                            if ng in self.ngram_to_idx]
        
        if not ngram_indices:
            return np.zeros(self.embedding_dim)
        
        return self.ngram_embeddings[ngram_indices].sum(axis=0)
    
    def _sample_negatives(self, exclude_idx, k):
        """Sample k negative word indices."""
        negatives = []
        while len(negatives) < k:
            idx = np.random.choice(len(self.vocab), p=self.sampling_probs)
            if idx != exclude_idx and idx not in negatives:
                negatives.append(idx)
        return negatives
    
    def train_pair(self, center_word, context_idx):
        """
        Train on a single (center_word, context_word) pair.
        
        Returns the loss for this example.
        """
        # Get center word embedding (sum of n-grams)
        center_emb = self._get_word_embedding(center_word)
        center_ngram_indices = self.word_ngram_idx_lists.get(
            center_word, 
            [self.ngram_to_idx[ng] for ng in self._get_ngrams(center_word) 
             if ng in self.ngram_to_idx]
        )
        
        # Get context embedding
        context_emb = self.context_embeddings[context_idx]
        
        # Positive pair
        pos_score = np.dot(center_emb, context_emb)
        pos_sigmoid = sigmoid(pos_score)
        pos_loss = -np.log(pos_sigmoid + 1e-10)
        
        # Sample negatives
        negative_indices = self._sample_negatives(context_idx, self.num_negatives)
        neg_embeddings = self.context_embeddings[negative_indices]
        neg_scores = neg_embeddings @ center_emb
        neg_sigmoids = sigmoid(-neg_scores)
        neg_loss = -np.sum(np.log(neg_sigmoids + 1e-10))
        
        total_loss = pos_loss + neg_loss
        
        # Compute gradients and update
        # Gradient for n-gram embeddings
        grad_center = (pos_sigmoid - 1) * context_emb
        for i, neg_idx in enumerate(negative_indices):
            grad_center += (1 - neg_sigmoids[i]) * self.context_embeddings[neg_idx]
        
        # Update n-gram embeddings
        for ng_idx in center_ngram_indices:
            self.ngram_embeddings[ng_idx] -= self.learning_rate * grad_center
        
        # Gradient for context embedding (positive)
        grad_context_pos = (pos_sigmoid - 1) * center_emb
        self.context_embeddings[context_idx] -= self.learning_rate * grad_context_pos
        
        # Gradient for negative context embeddings
        for i, neg_idx in enumerate(negative_indices):
            grad_neg = (1 - neg_sigmoids[i]) * center_emb
            self.context_embeddings[neg_idx] -= self.learning_rate * grad_neg
        
        return total_loss
    
    def get_embedding(self, word):
        """Get the embedding for any word (including OOV)."""
        return self._get_word_embedding(word.lower())
    
    def most_similar(self, word, top_n=5):
        """Find most similar words by cosine similarity."""
        word_emb = self.get_embedding(word)
        word_norm = np.linalg.norm(word_emb)
        if word_norm == 0:
            return []
        word_emb_normalized = word_emb / word_norm
        
        # Compute similarities with all vocab words
        similarities = []
        for vocab_word in self.vocab:
            if vocab_word.lower() == word.lower():
                continue
            emb = self.get_embedding(vocab_word)
            emb_norm = np.linalg.norm(emb)
            if emb_norm > 0:
                sim = np.dot(word_emb_normalized, emb / emb_norm)
                similarities.append((vocab_word, sim))
        
        similarities.sort(key=lambda x: x[1], reverse=True)
        return similarities[:top_n]

Training on a Sample Corpus

To see FastText in action, we'll train on a small corpus designed to highlight morphological patterns. The corpus contains word families with clear root-suffix relationships: "run/running/runner/runs," "walk/walking/walker/walks," and similar verb conjugations. This controlled setting lets us observe how shared n-grams lead to similar representations:

In[17]:
# Training corpus with clear morphological patterns
training_sentences = [
    "run running runner runs",
    "walk walking walker walks",
    "swim swimming swimmer swims",
    "jump jumping jumper jumps",
    "the runner runs fast",
    "the walker walks slow",
    "the swimmer swims well",
    "the jumper jumps high",
    "running is fun exercise",
    "walking is healthy activity",
    "swimming is great sport",
    "jumping improves fitness",
    "fast runners win races",
    "good swimmers compete well",
    "high jumpers train hard",
    "daily walkers stay healthy",
]

# Build vocabulary
all_words = ' '.join(training_sentences).lower().split()
vocab = sorted(set(all_words))

# Generate training pairs (context window = 2)
def generate_pairs(sentences, word_to_idx, window=2):
    pairs = []
    for sentence in sentences:
        words = sentence.lower().split()
        for i, center in enumerate(words):
            for j in range(max(0, i - window), min(len(words), i + window + 1)):
                if j != i:
                    pairs.append((center, word_to_idx[words[j]]))
    return pairs

word_to_idx = {w: i for i, w in enumerate(vocab)}
training_pairs = generate_pairs(training_sentences, word_to_idx)

# Initialize and train model
np.random.seed(42)
model = FastText(
    vocab=vocab,
    embedding_dim=30,
    min_n=3,
    max_n=5,
    num_negatives=5,
    learning_rate=0.1
)

# Training loop
epochs = 100
losses = []

for epoch in range(epochs):
    np.random.shuffle(training_pairs)
    epoch_loss = 0
    for center_word, context_idx in training_pairs:
        loss = model.train_pair(center_word, context_idx)
        epoch_loss += loss
    losses.append(epoch_loss / len(training_pairs))
Out[18]:
FastText Training Complete:
-------------------------------------------------------
Vocabulary size: 42
Number of n-grams: 407
Training pairs: 156
Epochs: 100

Initial loss: 4.0578
Final loss: 2.4046
Reduction: 40.7%

The model learns from a small corpus with clear morphological patterns. The substantial loss reduction indicates the model successfully learns to distinguish genuine context pairs from negative samples. The n-gram representations allow related word forms to share learned information through common substrings.

Out[19]:
Visualization
Line plot showing decreasing training loss over 100 epochs.
FastText training loss over epochs. The loss decreases as the model learns to distinguish positive context pairs from negative samples using n-gram representations. The plateau indicates convergence.

To understand what the model learned, let's compare word similarities before and after training:

Out[20]:
Visualization
Side-by-side heatmaps comparing word similarities before and after training.
Word similarity before and after FastText training. Left: With random initialization, similarities reflect only n-gram overlap. Right: After training, morphological families form tight clusters with high within-family similarity, while between-family similarity decreases. The model learns that 'running' relates to 'runner' not just by spelling but by meaning.

Handling Out-of-Vocabulary Words

We've arrived at FastText's most compelling feature: the ability to generate meaningful embeddings for words never seen during training. Word2Vec would simply fail when encountering "unfamiliarizing" or "COVID-19" if those terms weren't in its training vocabulary. FastText, by contrast, decomposes any word into n-grams and sums whatever n-gram embeddings it learned during training.

Let's test this capability with a mix of in-vocabulary and out-of-vocabulary words:

In[21]:
# Test a mix of in-vocabulary and out-of-vocabulary words
oov_words = ['runs', 'walked', 'swimmers', 'jumping', 'runing']

# Check which are in vocabulary
in_vocab = {w: w in vocab for w in oov_words}

# Get embeddings and find similar words
oov_results = {}
for word in oov_words:
    emb = model.get_embedding(word)
    similar = model.most_similar(word, top_n=3)
    oov_results[word] = {
        'in_vocab': in_vocab[word],
        'embedding_norm': np.linalg.norm(emb),
        'similar': similar
    }
Out[22]:
Out-of-Vocabulary Word Handling:
------------------------------------------------------------

'runs' (✓ in vocab):
  Embedding norm: 6.2027
  Most similar words:
    run: 0.7895
    is: 0.6315
    walker: 0.5808

'walked' (✗ OOV):
  Embedding norm: 5.0763
  Most similar words:
    walker: 0.8344
    walk: 0.8251
    is: 0.7922

'swimmers' (✓ in vocab):
  Embedding norm: 7.7432
  Most similar words:
    good: 0.4707
    well: 0.4510
    walkers: 0.4358

'jumping' (✓ in vocab):
  Embedding norm: 5.8562
  Most similar words:
    jumper: 0.6597
    high: 0.5839
    fitness: 0.5819

'runing' (✗ OOV):
  Embedding norm: 4.4863
  Most similar words:
    run: 0.8579
    running: 0.8540
    exercise: 0.7481

These results reveal FastText's power. Words absent from training (marked with ✗) still receive meaningful embeddings based on their subword structure. "Walked" relates to "walk" and "walker" through shared n-grams like "alk" and "<wa." The intentional typo "runing" correctly associates with "running" because they share most of the same character sequences. Only the doubled "n" differs.

This robustness to OOV words makes FastText particularly valuable for:

  • Morphologically rich languages where every possible word form cannot be enumerated
  • Domain adaptation where technical terms differ from general vocabulary
  • Noisy text where typos and creative spellings are common
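We can probe this robustness directly with the model trained above by comparing "running" against progressively corrupted variants. The variants are hand-picked for illustration:

# Compare the trained model's embedding of 'running' with corrupted variants.
# The variants are hand-picked examples of one to several character edits.
def cosine(a, b):
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    return float(a @ b / (na * nb)) if na > 0 and nb > 0 else 0.0

reference = model.get_embedding('running')
for variant in ['running', 'runing', 'runnign', 'rnnng', 'walking']:
    sim = cosine(reference, model.get_embedding(variant))
    print(f"'running' vs '{variant}': {sim:.3f}")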
Out[23]:
Visualization
Line plot showing cosine similarity vs edit distance for various words.
FastText's robustness to typos and spelling variations. As we introduce more character changes to a word (edit distance), the cosine similarity with the original word decreases gracefully. Single-character typos maintain high similarity, making FastText effective for noisy text processing.
Out[24]:
Visualization
Heatmap showing cosine similarities between OOV words and vocabulary words.
FastText's ability to handle OOV words. Even words absent from training (marked with ✗) receive meaningful embeddings through shared n-grams. The heatmap shows cosine similarities between OOV words and related vocabulary words, demonstrating that morphological variants cluster together.

The heatmap reveals that OOV words find their morphological relatives. "Runs" has high similarity with "run," "running," and "runner." "Walked" relates to "walk," "walking," and "walker." Even the typo "runing" correctly associates with "running."

Visualizing the Embedding Space

Out[25]:
Visualization
Scatter plot of word embeddings projected to 2D showing morphological clustering.
2D PCA projection of FastText embeddings. Words from the same morphological family cluster together (run/running/runner in blue, walk/walking/walker in green, etc.). The spatial organization emerges from shared n-grams between related words.

Hashing N-grams for Memory Efficiency

Our implementation so far stores a separate embedding vector for every unique n-gram. This works for toy examples, but scales poorly to real corpora. A vocabulary of 100,000 words with n-gram lengths 3-6 can generate millions of unique n-grams, each requiring a 300-dimensional embedding vector. At 4 bytes per float, the n-gram embeddings alone can run to several gigabytes of memory.

FastText addresses this through hash-based bucketing: instead of giving each n-gram its own embedding, we hash n-grams to a fixed number of buckets and share embeddings within each bucket.

N-gram Hashing

N-gram hashing maps each n-gram to one of $B$ buckets using a hash function (typically 2 million buckets). Multiple n-grams may collide to the same bucket and share an embedding. This trades some precision for significant memory savings. Memory usage becomes $O(B)$ instead of $O(|N|)$, where $|N|$ is the number of unique n-grams.
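The reference FastText implementation buckets n-grams with an FNV-1a-style 32-bit hash; the sketch below captures the idea, though it is not the exact production code:

# FNV-1a-style 32-bit hash, the flavor of hash used to bucket n-grams
# in the reference FastText implementation (a sketch, not the exact code).
def fnv1a_32(ngram: str) -> int:
    h = 2166136261
    for byte in ngram.encode('utf-8'):
        h ^= byte
        h = (h * 16777619) % (2 ** 32)
    return h

num_buckets = 2_000_000
for ng in ['<ru', 'run', 'ing', 'ng>']:
    print(f"{ng!r} -> bucket {fnv1a_32(ng) % num_buckets}")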

In[26]:
class HashedFastText:
    """
    FastText with hashed n-gram embeddings for memory efficiency.
    """
    
    def __init__(self, vocab, embedding_dim=50, min_n=3, max_n=6, 
                 num_buckets=10000):
        """
        Initialize with a fixed number of n-gram buckets.
        
        Args:
            vocab: Vocabulary words
            embedding_dim: Embedding dimension
            min_n, max_n: N-gram length range
            num_buckets: Number of hash buckets for n-grams
        """
        self.vocab = vocab
        self.embedding_dim = embedding_dim
        self.min_n = min_n
        self.max_n = max_n
        self.num_buckets = num_buckets
        
        # Single embedding matrix for all buckets
        self.bucket_embeddings = np.random.randn(num_buckets, embedding_dim) * 0.01
    
    def _hash_ngram(self, ngram):
        """Hash an n-gram to a bucket index."""
        # Use Python's built-in hash for simplicity. Note: string hashing is
        # salted per process (PYTHONHASHSEED), so bucket assignments are not
        # stable across runs; real FastText uses a deterministic FNV-style hash.
        return hash(ngram) % self.num_buckets
    
    def _get_ngrams(self, word):
        """Extract n-grams from a word."""
        word_with_markers = f"<{word}>"
        ngrams = []
        for n in range(self.min_n, self.max_n + 1):
            for i in range(len(word_with_markers) - n + 1):
                ngrams.append(word_with_markers[i:i + n])
        return ngrams
    
    def get_embedding(self, word):
        """Get word embedding as sum of hashed n-gram embeddings."""
        ngrams = self._get_ngrams(word)
        bucket_indices = [self._hash_ngram(ng) for ng in ngrams]
        return self.bucket_embeddings[bucket_indices].sum(axis=0)

# Compare memory usage
np.random.seed(42)

# Standard FastText
standard_ft = FastText(vocab=vocab, embedding_dim=50, min_n=3, max_n=5)
standard_ngram_count = len(standard_ft.ngram_to_idx)
standard_memory = standard_ngram_count * 50 * 8  # 8 bytes per float64

# Hashed FastText
hashed_ft = HashedFastText(vocab=vocab, embedding_dim=50, min_n=3, max_n=5, 
                           num_buckets=1000)
hashed_memory = hashed_ft.num_buckets * 50 * 8
Out[27]:
Memory Comparison: Standard vs Hashed N-grams
-------------------------------------------------------
Vocabulary size: 42

Standard FastText:
  Unique n-grams: 407
  Memory usage: 159.0 KB

Hashed FastText (1,000 buckets):
  Hash buckets: 1,000
  Memory usage: 390.6 KB
  Memory savings: -145.7%

For large vocabularies, savings are much greater:
  100K vocab → potentially millions of n-grams
  Hashing to 2M buckets = ~95% reduction

With our toy vocabulary, hashing actually costs memory: 1,000 buckets is more than the 407 unique n-grams, which is why the output above shows negative savings. The picture reverses at realistic scale, where the bucket count stays fixed while the number of unique n-grams grows into the millions. The trade-off is that some n-grams will collide to the same bucket and share an embedding, but empirically this has minimal impact on downstream task performance when the number of buckets is sufficiently large.
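To get a feel for the collision side of the trade-off, we can hash the toy vocabulary's n-grams into different bucket counts and count how many end up sharing a bucket. This reuses standard_ft and Python's built-in hash from above, so the exact numbers will vary by run:

# Rough collision check: hash our unique n-grams into varying bucket counts
# and count how many n-grams share a bucket with at least one other n-gram.
from collections import Counter

ngrams = list(standard_ft.ngram_to_idx)
for buckets in [500, 1000, 5000, 50000]:
    counts = Counter(hash(ng) % buckets for ng in ngrams)
    collided = sum(c for c in counts.values() if c > 1)
    print(f"{buckets:>6} buckets: {collided}/{len(ngrams)} n-grams share a bucket")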

Out[28]:
Visualization
Two plots comparing memory usage and collision rates for standard vs hashed n-gram storage.
Memory comparison between standard and hashed n-gram storage. Left: Memory usage grows with vocabulary size for standard storage but remains constant with hashing. Right: The trade-off is hash collisions, where different n-grams share embeddings. For most applications, moderate collision rates have minimal impact on embedding quality.

FastText for Morphologically Rich Languages

English has relatively simple morphology. Most word variations come from a handful of suffixes like "-ing," "-ed," "-er." But many languages are far more complex. German famously creates compound words by concatenation: "Lebensversicherungsgesellschaft" (life insurance company) combines three roots. Finnish and Turkish use extensive agglutination, where words accumulate suffixes to express grammatical relationships. Arabic and Hebrew have templatic morphology where roots interleave with vowel patterns.

Word2Vec struggles with these languages because the vocabulary explodes. Every possible compound or inflection needs its own embedding. FastText, by contrast, automatically captures the compositional structure through shared n-grams. Let's examine how this works with German compound words:

In[29]:
# German compound word examples
german_compounds = [
    ('Lebensversicherung', 'life insurance', ['Leben', 'Versicherung']),
    ('Krankenhaus', 'hospital', ['Krank', 'Haus']),
    ('Handschuh', 'glove', ['Hand', 'Schuh']),
    ('Kindergarten', 'kindergarten', ['Kinder', 'Garten']),
]

def analyze_german_compounds(compounds, min_n=3, max_n=5):
    """Analyze n-gram overlap in German compounds."""
    results = []
    
    for compound, meaning, components in compounds:
        compound_ngrams = set(get_ngrams(compound.lower(), min_n, max_n))
        component_ngrams = set()
        for component in components:
            component_ngrams.update(get_ngrams(component.lower(), min_n, max_n))
        
        shared = compound_ngrams.intersection(component_ngrams)
        
        results.append({
            'compound': compound,
            'meaning': meaning,
            'components': components,
            'compound_ngrams': len(compound_ngrams),
            'component_ngrams': len(component_ngrams),
            'shared_ngrams': len(shared),
            'shared_examples': list(shared)[:5]
        })
    
    return results

german_analysis = analyze_german_compounds(german_compounds)
Out[30]:
German Compound Word Analysis:
-----------------------------------------------------------------

Lebensversicherung (life insurance)
  Components: Leben + Versicherung
  Compound n-grams: 51
  Component n-grams: 45
  Shared n-grams: 39
  Examples: ['eben', 'sich', 'ersic', 'erung', 'heru']

Krankenhaus (hospital)
  Components: Krank + Haus
  Compound n-grams: 30
  Component n-grams: 21
  Shared n-grams: 15
  Examples: ['us>', '<kra', 'hau', 'kran', '<kr']

Handschuh (glove)
  Components: Hand + Schuh
  Compound n-grams: 24
  Component n-grams: 21
  Shared n-grams: 15
  Examples: ['<hand', 'uh>', 'hand', 'chu', 'chuh>']

Kindergarten (kindergarten)
  Components: Kinder + Garten
  Compound n-grams: 33
  Component n-grams: 30
  Shared n-grams: 24
  Examples: ['inde', 'rte', 'kin', 'der', 'inder']

German compound words share substantial n-gram overlap with their components. "Lebensversicherung" shares n-grams with both "Leben" (life) and "Versicherung" (insurance). This allows FastText to produce meaningful embeddings for unseen compounds based on their constituent parts.

Out[31]:
Visualization
Bar chart showing n-gram counts for German compound words and their overlap with components.
N-gram overlap in German compound words. Each compound shares substantial n-grams with its components, allowing FastText to decompose the meaning of unseen compounds. For example, 'Lebensversicherung' (life insurance) shares n-grams with both 'Leben' (life) and 'Versicherung' (insurance).

Comparison with Word2Vec

Let's directly compare FastText and Word2Vec on handling morphological variants and OOV words:

Out[32]:
Visualization
Comparison table showing FastText and Word2Vec capabilities for handling different word types.
Comparison of FastText and Word2Vec on morphological and OOV word handling. Top: FastText produces embeddings for all words, including OOV. Bottom: Word2Vec can only handle words seen during training. The table shows how FastText's subword approach enables generalization to unseen forms.

Using Pre-trained FastText Models

Training FastText from scratch requires substantial computational resources and large corpora. Fortunately, Meta (formerly Facebook) released pre-trained FastText models for 157 languages, trained on Wikipedia and Common Crawl data. These models provide high-quality embeddings out of the box, including the ability to generate vectors for OOV words through their learned n-gram embeddings.

In[56]:
# Install gensim if needed
# pip install gensim

from gensim.models import KeyedVectors

# Download and load pre-trained English FastText
# (File is ~7GB, this is just for demonstration)
# ft_model = KeyedVectors.load_word2vec_format(
#     'wiki.en.vec', 
#     binary=False
# )

# Example usage (if model were loaded):
# ft_model.most_similar('running')
# ft_model.similarity('run', 'running')

# FastText library provides native support
import fasttext
# model = fasttext.load_model('cc.en.300.bin')
# model.get_word_vector('running')
# model.get_nearest_neighbors('running')
Out[33]:
Using Pre-trained FastText Models:
-------------------------------------------------------

Available pre-trained models (from fasttext.cc):
  - 157 languages
  - Trained on Wikipedia and Common Crawl
  - 300-dimensional embeddings
  - Both .vec (text) and .bin (binary) formats

Loading with gensim:
  from gensim.models import KeyedVectors
  model = KeyedVectors.load_word2vec_format('wiki.en.vec')

Loading with fasttext library:
  import fasttext
  model = fasttext.load_model('cc.en.300.bin')

Key methods:
  model.get_word_vector('word')  # Get embedding
  model.get_nearest_neighbors('word')  # Find similar words

Limitations of FastText

FastText's character n-gram approach solves the OOV problem elegantly, but introduces its own trade-offs. Understanding these limitations helps you decide when FastText is the right tool and when alternatives might serve better.

Static embeddings: Like Word2Vec, FastText produces one embedding per word regardless of context. The word "bank" receives the same vector whether it appears in "river bank" or "bank account." This fundamental limitation, shared by all static embedding methods, motivates the contextual embeddings we'll explore in later chapters.

Character-level noise: The n-gram mechanism has no way to distinguish meaningful subwords from random character sequences. A nonsensical string like "xyzqwk" produces a non-zero embedding through its n-gram components, even though no human would recognize it as a word. This can confuse downstream models that expect zero vectors for unknown tokens.

Memory overhead: Even with hash bucketing, storing embeddings for 2 million n-gram buckets at 300 dimensions requires about 2.4GB in 32-bit floats. Full binary models with word-level embeddings can exceed 7GB, making deployment challenging on memory-constrained devices.

Language-specific assumptions: The n-gram approach assumes that subword character sequences carry meaning. This works beautifully for European languages with alphabetic scripts and productive morphology. But for logographic languages like Chinese (where characters are meaning units, not phonetic components) or languages with complex scripts (Arabic, Devanagari), character n-grams may not capture the right linguistic structure.

In[34]:
# Demonstrate the "any string gets an embedding" issue
nonsense_words = ['xyzqwk', 'asdfgh', 'qwertyuiop', 'zzzzz']

nonsense_embeddings = []
for word in nonsense_words:
    emb = model.get_embedding(word)
    norm = np.linalg.norm(emb)
    nonsense_embeddings.append((word, norm))
Out[35]:
Limitation: Any String Gets an Embedding
-------------------------------------------------------
Nonsense strings still produce non-zero embeddings:

  'xyzqwk': norm = 0.0000
  'asdfgh': norm = 0.5327
  'qwertyuiop': norm = 0.0000
  'zzzzz': norm = 0.0000

This can be problematic for downstream applications
that assume zero-norm indicates unknown words.

A real FastText model always produces an embedding by summing n-gram vectors, because every n-gram hashes to some bucket; Word2Vec, by contrast, simply has no vector for an unknown word. (Our toy model only stores n-grams seen in its training vocabulary, which is why several nonsense strings above come back with zero norm. "asdfgh" gets a non-zero vector only because it shares an n-gram, "gh>", with the vocabulary word "high".) While always returning something handles legitimate OOV words well, it means completely meaningless strings also receive embeddings. Applications requiring explicit unknown-word detection need additional logic beyond checking embedding norms.
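One pragmatic workaround, sketched here for our toy model, is to measure what fraction of a word's n-grams the model has actually seen and flag words below a threshold as unknown. The 0.5 cutoff is an arbitrary choice for illustration:

# Heuristic unknown-word detection: fraction of a word's n-grams that the
# model has embeddings for. The 0.5 threshold is an arbitrary choice.
def known_ngram_fraction(m, word):
    ngrams = m._get_ngrams(word.lower())
    if not ngrams:
        return 0.0
    return sum(ng in m.ngram_to_idx for ng in ngrams) / len(ngrams)

for w in ['running', 'walked', 'xyzqwk', 'asdfgh']:
    frac = known_ngram_fraction(model, w)
    print(f"{w:>10}: {frac:.2f} of n-grams known -> "
          f"{'OK' if frac >= 0.5 else 'treat as unknown'}")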

Summary

FastText extends Word2Vec with a deceptively simple modification that yields profound practical benefits: representing words as bags of character n-grams rather than atomic symbols. This single architectural change, summing n-gram embeddings instead of looking up word embeddings, unlocks capabilities that address Word2Vec's most limiting weakness:

  • Out-of-vocabulary handling: Any word, even one never seen during training, receives a meaningful embedding through its subword components
  • Morphological awareness: Words sharing the same root naturally cluster together because they share n-grams encoding that root
  • Robustness to noise: Typos, creative spellings, and informal text retain most n-grams of their intended words
  • Better rare word representations: Rare words benefit from shared n-grams with common words

Key takeaways:

  • Subword representation: Words are decomposed into character n-grams (typically lengths 3-6) with boundary markers
  • Embedding computation: A word's embedding is the sum of its n-gram embeddings
  • Training objective: Same as Word2Vec Skip-gram with negative sampling
  • Memory efficiency: Hashing n-grams to fixed buckets reduces memory requirements
  • Language coverage: Particularly effective for morphologically rich languages

Despite these improvements, FastText retains Word2Vec's fundamental limitation: each word has exactly one embedding regardless of context. The same vector represents "bank" whether you're discussing finance or fishing. The next chapter examines GloVe, which takes a different approach to the embedding problem: learning from global co-occurrence statistics rather than local context windows. Later chapters will address the context problem directly with models that produce different embeddings for the same word in different sentences.

Key Parameters

FastText Parameters:

  • min_n (minimum n-gram length): Typically 3. Smaller values capture more context but increase vocabulary size. Setting to 2 captures bigrams like "th" and "ed."

  • max_n (maximum n-gram length): Typically 6. Larger values capture longer morphemes but rapidly expand the n-gram vocabulary. For agglutinative languages, consider increasing to 8.

  • bucket (number of hash buckets): Typically 2,000,000. Controls memory-quality trade-off. Larger values reduce hash collisions but increase memory usage.

  • dim (embedding dimension): Typically 100-300. Higher dimensions capture more nuance but require more data and computation.

  • word_ngrams: Whether to include word unigrams in addition to character n-grams. Usually set to 1 (include words).

  • epoch: Number of training passes. Typically 5-25. More epochs help with smaller datasets.

  • lr (learning rate): Typically 0.05. Higher values speed training but may overshoot. Use learning rate scheduling for best results.

  • loss: Training objective. Options include negative sampling (ns), hierarchical softmax (hs), and softmax (softmax). Negative sampling is default and works well for most cases.

  • neg (negative samples): Number of negative samples when using negative sampling loss. Typically 5-10. More negatives help with rare words.

  • ws (window size): Context window for generating training pairs. Typically 5. Larger windows capture broader semantic relationships.
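For reference, these parameters map onto the official fasttext Python package roughly as follows. This is a hedged usage sketch: 'data.txt' is a placeholder corpus path, and the hyperparameter values simply echo the typical settings discussed above:

import fasttext

# Train a skip-gram FastText model; parameter names follow the fasttext
# Python package. 'data.txt' is a placeholder path to a plain-text corpus.
model = fasttext.train_unsupervised(
    'data.txt',
    model='skipgram',   # or 'cbow'
    dim=300,            # embedding dimension
    minn=3, maxn=6,     # character n-gram length range
    bucket=2_000_000,   # hash buckets for n-grams
    epoch=5,            # training passes
    lr=0.05,            # learning rate
    ws=5,               # context window size
    neg=5,              # negative samples
    loss='ns',          # negative sampling
)
print(model.get_word_vector('running')[:5])
print(model.get_nearest_neighbors('running', k=5))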

