Master n-gram text representations including bigrams, trigrams, character n-grams, and skip-grams. Learn extraction techniques, vocabulary explosion challenges, Zipf's law, and practical applications in NLP.

This article is part of the free-to-read Language AI Handbook
N-grams
Bag of words treats every word as an isolated unit, discarding all information about word order. But word order matters. "The dog bit the man" means something entirely different from "The man bit the dog." N-grams capture this local context by representing text as sequences of consecutive tokens rather than individual words. An n-gram is simply a contiguous sequence of n items from a text, where items can be words, characters, or any other units.
This chapter explores how n-grams preserve local word order, why vocabulary size explodes as n increases, and how the statistical properties of n-grams follow predictable patterns. You'll learn to extract n-grams efficiently, understand when character n-grams outperform word n-grams, and see how skip-grams relax the strict adjacency requirement.
From Words to Word Sequences
The bag of words model creates a vocabulary of individual words and counts how often each appears. N-grams extend this by creating a vocabulary of word sequences. A bigram (2-gram) is a sequence of two consecutive words. A trigram (3-gram) is three consecutive words. The general term n-gram covers any sequence length.
An n-gram is a contiguous sequence of n items from a text. When the items are words, we call them word n-grams. When the items are characters, we call them character n-grams. Common special cases include unigrams (n=1), bigrams (n=2), and trigrams (n=3).
Let's see how the same sentence produces different representations at each n-gram level:
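The chapter's interactive example isn't reproduced here, so the sketch below shows the idea from scratch. The nine-token sentence and the extract_ngrams helper are illustrative stand-ins rather than the handbook's own code:

```python
def extract_ngrams(tokens, n):
    # Slide a window of length n across the token list
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# A nine-token stand-in sentence
tokens = "the quick brown fox jumps over the lazy dog".split()

for n in (1, 2, 3):
    grams = extract_ngrams(tokens, n)
    print(f"{n}-grams: {len(grams)} total, first three: {grams[:3]}")
```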
Notice the pattern: a sentence with N tokens produces N - n + 1 n-grams. The formula for the number of n-grams is:
count = N - n + 1
where:
- N: the total number of tokens in the sentence
- n: the n-gram size (1 for unigrams, 2 for bigrams, 3 for trigrams, etc.)
- count: the resulting number of n-grams
This formula works because each n-gram starts at a different position, and the last valid starting position is N - n + 1 (since we need n consecutive tokens). With 9 tokens, we get 9 unigrams (9 - 1 + 1 = 9), 8 bigrams (9 - 2 + 1 = 8), and 7 trigrams (9 - 3 + 1 = 7). As n increases, we extract fewer n-grams from each document, but each n-gram carries more contextual information.
Bigrams and Trigrams in Practice
Bigrams are the most commonly used n-grams in practice. They capture immediate word relationships while keeping vocabulary size manageable. Trigrams provide richer context but at the cost of sparser data.
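The corpus used in the chapter isn't reproduced here, so the following sketch counts bigrams over a small hypothetical corpus to show how a recurring phrase rises to the top:

```python
from collections import Counter

# Hypothetical toy corpus, not the chapter's original one
corpus = [
    "machine learning powers machine translation",
    "deep learning extends machine learning",
    "machine learning needs data and deep learning needs more data",
]

unigram_counts, bigram_counts = Counter(), Counter()
for doc in corpus:
    tokens = doc.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))  # consecutive word pairs

print(bigram_counts.most_common(3))             # ('machine', 'learning') comes out on top
print(len(unigram_counts), len(bigram_counts))  # the bigram vocabulary is already larger
```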
The bigram "machine learning" appears frequently because it's a meaningful phrase in this corpus. This is the key insight: common n-grams often correspond to meaningful multi-word expressions, collocations, or phrases.
Even in this tiny corpus, the bigram vocabulary is larger than the unigram vocabulary. This vocabulary explosion becomes severe as n increases or corpus size grows.
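The co-occurrence matrix from the original isn't reproduced here; this is a rough sketch of how one can be assembled from bigram counts (rows index the first word, columns the second), reusing the same hypothetical corpus:

```python
from collections import defaultdict

corpus = [
    "machine learning powers machine translation",
    "deep learning extends machine learning",
    "machine learning needs data and deep learning needs more data",
]

# matrix[first_word][second_word] = how often the bigram occurs
matrix = defaultdict(lambda: defaultdict(int))
for doc in corpus:
    tokens = doc.split()
    for w1, w2 in zip(tokens, tokens[1:]):
        matrix[w1][w2] += 1

print(matrix["machine"]["learning"])     # high count: a frequent phrase
print(matrix["powers"].get("data", 0))   # 0: most word pairs never co-occur
```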
The co-occurrence matrix reveals the structure hidden in bigram counts. The cell at row "machine" and column "learning" shows a high count because "machine learning" is a frequent phrase. Most cells are zero, reflecting the sparsity inherent in natural language: most word pairs never occur together.
The Vocabulary Explosion Problem
The theoretical maximum number of distinct n-grams grows exponentially with n. If your vocabulary has V unique words, the maximum number of possible n-grams is V^n. For a modest vocabulary of 10,000 words:
- Unigrams: 10,000 possible
- Bigrams: 10,000^2 = 100 million possible
- Trigrams: 10,000^3 = 1 trillion possible
In practice, most of these combinations never occur in natural language. "The the the" is a valid trigram structurally but appears rarely. Still, the number of observed n-grams grows rapidly with n.
The vocabulary roughly doubles or triples with each increment in n. A trigram vocabulary can easily be 10 times larger than the unigram vocabulary. This has practical implications for memory usage, model complexity, and the amount of training data needed.
N-gram Frequency Distributions and Zipf's Law
Individual words follow Zipf's law: the frequency of a word is inversely proportional to its rank. A few words appear extremely often while most words appear rarely. N-grams follow the same pattern, but even more extremely.
Zipf's law states that in a corpus of natural language, the frequency of any word is inversely proportional to its rank in the frequency table. The most frequent word appears roughly twice as often as the second most frequent, three times as often as the third, and so on.
The hapax ratio (proportion of n-grams appearing only once) increases dramatically with n. For trigrams, the vast majority appear exactly once. This sparsity is the core challenge of n-gram models: most n-grams you encounter in new text won't exist in your training data.
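The hapax ratio is easy to measure once you have n-gram counts. A minimal sketch, assuming whitespace tokenization and a tiny stand-in corpus:

```python
from collections import Counter

def hapax_ratio(docs, n):
    counts = Counter()
    for doc in docs:
        tokens = doc.split()
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    hapaxes = sum(1 for c in counts.values() if c == 1)  # n-grams seen exactly once
    return hapaxes / len(counts)

docs = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat chased the dog",
]
for n in (1, 2, 3):
    # The ratio climbs steeply with n, even on a toy corpus
    print(f"n={n}: hapax ratio {hapax_ratio(docs, n):.2f}")
```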
These hapax statistics make the sparsity problem concrete. With trigrams, you're essentially building a model where roughly 4 out of 5 features appear only once in training. This severely limits the model's ability to generalize to new text.
Character N-grams for Robustness
Word n-grams assume clean, correctly spelled text. But real-world text contains typos, spelling variations, and out-of-vocabulary words. Character n-grams offer an alternative that's robust to these issues.
A character n-gram is a contiguous sequence of n characters from a text, including spaces and punctuation. Character n-grams can capture subword patterns, making them robust to spelling variations and useful for languages with rich morphology.
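A character n-gram extractor fits in a line or two, and comparing a word against a misspelling shows why the representation is forgiving; the example words and the Jaccard comparison below are illustrative choices:

```python
def char_ngrams(text, n=3):
    # Contiguous character windows, spaces included
    return {text[i:i + n] for i in range(len(text) - n + 1)}

correct, typo = "language model", "langauge model"   # 'u' and 'a' transposed
a, b = char_ngrams(correct), char_ngrams(typo)

print(sorted(a & b))                                 # trigrams the two strings share
print(f"Jaccard similarity: {len(a & b) / len(a | b):.2f}")
```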
Despite the typo, most character trigrams still match. This partial matching is why character n-grams excel at:
- Spelling correction: Finding similar words despite typos
- Language identification: Different languages have characteristic character patterns
- Author attribution: Writing style shows up in character-level patterns
- Morphologically rich languages: Capturing word stems and affixes
Each language has distinctive character trigram signatures. German shows "sch" and "ber", French has "ard" and "eux", and Spanish features "rro" and "bre". These patterns form the basis of language detection algorithms.
Skip-grams: Flexible Context Windows
Standard n-grams require strict adjacency: every word must be immediately next to its neighbors. Skip-grams relax this constraint by allowing gaps between words. A skip-gram with k skips can have up to k words between any two selected words.
A skip-gram is a generalization of n-grams that allows non-adjacent words to form a sequence. A k-skip-n-gram includes all subsequences of n words where the gap between any two consecutive selected words is at most k.
Skip-grams capture relationships between words that aren't immediately adjacent. In "the cat sat on the mat", regular bigrams miss the relationship between "cat" and "on", but skip-bigrams capture it. This is useful when:
- Word order varies: Different phrasings of the same idea
- Modifiers intervene: "the big fluffy cat" vs "the cat"
- Long-distance dependencies: Subject-verb agreement across clauses
The trade-off is a larger vocabulary. Skip-grams with k skips produce more combinations than regular n-grams, exacerbating the vocabulary explosion.
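A minimal skip-bigram sketch follows; max_skip is a hypothetical parameter name (echoed in the reference section below) controlling how many words may be skipped between the two selected words:

```python
from collections import Counter

def skip_bigrams(tokens, max_skip=2):
    # Pair each word with every word up to max_skip positions further ahead
    pairs = []
    for i, w1 in enumerate(tokens):
        for j in range(i + 1, min(i + 2 + max_skip, len(tokens))):
            pairs.append((w1, tokens[j]))
    return pairs

tokens = "the cat sat on the mat".split()
print(Counter(skip_bigrams(tokens, max_skip=0)))   # identical to regular bigrams
print(len(set(skip_bigrams(tokens, max_skip=3))))  # many more unique pairs than the 5 regular bigrams
```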
The vocabulary growth with skip distance is substantial. Even with a small corpus, skip-3 bigrams produce significantly more unique pairs than regular bigrams. In larger corpora, this effect compounds, making high-skip models memory-intensive.
N-gram Indexing for Search
N-grams enable efficient approximate string matching and search. By indexing documents by their n-grams, you can quickly find documents containing similar phrases even with minor variations.
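Here is a rough sketch of a character-trigram inverted index with a fuzzy lookup; the documents, the search helper, and the min_overlap threshold are all illustrative assumptions:

```python
from collections import defaultdict

def char_trigrams(text):
    return {text[i:i + 3] for i in range(len(text) - 2)}

docs = {1: "natural language processing", 2: "language models", 3: "computer vision"}

# Inverted index: trigram -> set of document ids containing it
index = defaultdict(set)
for doc_id, text in docs.items():
    for gram in char_trigrams(text):
        index[gram].add(doc_id)

def search(query, min_overlap=3):
    # Candidate documents must share at least min_overlap trigrams with the query
    hits = defaultdict(int)
    for gram in char_trigrams(query):
        for doc_id in index.get(gram, ()):
            hits[doc_id] += 1
    return [d for d, c in sorted(hits.items(), key=lambda kv: -kv[1]) if c >= min_overlap]

print(search("natural langauge"))   # the misspelled query still retrieves document 1
```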
This inverted index approach is the foundation of many search engines and fuzzy matching systems. The n-gram index trades space for speed: storing all n-grams requires more memory, but queries become fast set intersection operations.
Using NLTK for N-gram Extraction
While we've implemented n-gram extraction from scratch, NLTK provides optimized utilities:
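The original code listing isn't included here; a brief example of the standard NLTK helpers is sketched below, with a stand-in sentence:

```python
from nltk import ngrams, bigrams, everygrams

tokens = "the dog bit the man".split()

print(list(bigrams(tokens)))                 # [('the', 'dog'), ('dog', 'bit'), ('bit', 'the'), ('the', 'man')]
print(list(ngrams(tokens, 3)))               # the three trigrams in this sentence
print(list(everygrams(tokens, max_len=2)))   # unigrams and bigrams combined in one list
```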
NLTK also provides FreqDist for counting and analyzing n-gram frequencies:
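For example (the sentence is again a stand-in):

```python
from nltk import FreqDist, bigrams

tokens = "the cat sat on the mat and the cat slept".split()

fdist = FreqDist(bigrams(tokens))
print(fdist.most_common(3))       # the most frequent bigrams first
print(fdist[("the", "cat")])      # count for one specific bigram
```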
Practical Considerations
When working with n-grams, several practical considerations affect your results:
Choosing n: Larger n captures more context but creates sparser data. For most applications:
- Unigrams (n=1): Baseline, loses all word order
- Bigrams (n=2): Good balance for most tasks
- Trigrams (n=3): Richer context, common in language modeling
- n > 3: Rarely used due to sparsity
Vocabulary pruning: Remove n-grams that appear too rarely (min_df) or too frequently (max_df). Rare n-grams add noise; frequent n-grams (like "of the") add little discriminative value.
Padding: For some applications, you may want to add special start and end tokens to capture sentence boundaries:
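For instance, using the padding options of nltk.ngrams; the <s> and </s> symbols are a common convention rather than a requirement:

```python
from nltk import ngrams

tokens = "the dog barks".split()

padded = list(ngrams(tokens, 2, pad_left=True, pad_right=True,
                     left_pad_symbol="<s>", right_pad_symbol="</s>"))
print(padded)
# [('<s>', 'the'), ('the', 'dog'), ('dog', 'barks'), ('barks', '</s>')]
```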
Memory efficiency: For large corpora, store n-grams as hashed integers rather than string tuples. Use sparse matrix representations when building document-term matrices.
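One way to do this is the hashing trick: map each n-gram to a bucket in a fixed-size integer space and count buckets instead of strings. A minimal sketch using Python's built-in hash (a production system would prefer a stable hash such as MurmurHash, since Python randomizes string hashing per process):

```python
def hashed_ngram_counts(tokens, n=2, num_buckets=2**20):
    counts = {}
    for i in range(len(tokens) - n + 1):
        bucket = hash(tokens[i:i + n]) % num_buckets   # n-gram tuple -> integer bucket
        counts[bucket] = counts.get(bucket, 0) + 1
    return counts

# Tokens as a tuple so that slices are hashable
tokens = tuple("the cat sat on the mat".split())
print(hashed_ngram_counts(tokens))
```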
Limitations and Trade-offs
N-grams improve on bag of words by capturing local context, but they have significant limitations:
Vocabulary explosion: The number of unique n-grams grows exponentially with n, making high-order n-grams impractical for most applications.
Sparsity: Most n-grams appear rarely. In new text, you'll frequently encounter n-grams not seen during training.
Fixed context window: N-grams capture exactly n consecutive words, missing both shorter and longer patterns. The phrase "not very good" is meaningful as a trigram, but its sentiment depends on understanding that "not" negates "good" across the intervening word.
No semantic understanding: "Excellent movie" and "great film" share no n-grams despite having similar meanings. N-grams are purely syntactic.
Data requirements: Higher-order n-grams require exponentially more training data to estimate reliably. Trigram language models need millions of words; 5-gram models need billions.
Impact and Applications
Despite these limitations, n-grams remain foundational in NLP:
Language modeling: N-gram language models estimate the probability of word sequences. Before neural networks, trigram models dominated speech recognition and machine translation.
Text classification: Adding bigrams and trigrams to bag-of-words features often improves classification accuracy by capturing phrases.
Spell checking and autocomplete: Character n-gram similarity identifies likely corrections. Word n-grams predict the next word.
Plagiarism detection: Matching n-gram fingerprints identifies copied text even with minor modifications.
Information retrieval: N-gram indexing enables fast fuzzy matching and phrase search.
The transition from n-grams to neural models didn't make n-grams obsolete. Modern subword tokenizers like BPE and WordPiece are essentially learned character n-gram vocabularies. Understanding n-grams provides intuition for why these newer methods work.
Key Functions and Parameters
When working with n-grams in Python, these are the essential functions and their most important parameters:
nltk.ngrams(sequence, n, pad_left=False, pad_right=False)
- sequence: The input tokens (list of words or characters)
- n: The size of the n-gram (2 for bigrams, 3 for trigrams, etc.)
- pad_left, pad_right: Whether to add padding symbols at boundaries. Useful for language modeling where start/end context matters
nltk.bigrams(sequence) and nltk.trigrams(sequence)
- Convenience functions equivalent to ngrams(sequence, 2) and ngrams(sequence, 3)
- Return generators, so wrap in list() if you need to iterate multiple times
nltk.everygrams(sequence, min_len=1, max_len=-1)
- min_len: Minimum n-gram size to include
- max_len: Maximum n-gram size (-1 means use the sequence length)
- Useful when you want to combine multiple n-gram orders in a single feature set
collections.Counter(iterable)
- Essential for counting n-gram frequencies
- .most_common(n): Returns the n most frequent items
- Supports arithmetic operations for combining counts across documents
Custom extraction parameters:
- max_skip: For skip-grams, controls how many words can be skipped between selected words. Higher values capture more distant relationships but increase vocabulary size
- pad_symbol: The token used for boundary padding (commonly <s>, </s>, or <PAD>)
Summary
N-grams extend bag of words by preserving local word order. Key takeaways:
- N-grams are contiguous sequences of n tokens (words or characters)
- Bigrams capture immediate word relationships; trigrams add more context
- Vocabulary explosion: Unique n-grams grow exponentially with n
- Zipf's law applies to n-grams, with most appearing only once
- Character n-grams provide robustness to typos and work across languages
- Skip-grams relax adjacency requirements to capture flexible patterns
- N-gram indexing enables fast approximate search and matching
- Trade-offs: More context requires more data and creates sparser representations
N-grams bridge the gap between treating words as isolated units and understanding them in context. In the next chapter, we'll explore TF-IDF, which adds statistical weighting to distinguish informative terms from common ones.