Master n-gram text representations including bigrams, trigrams, character n-grams, and skip-grams. Learn extraction techniques, vocabulary explosion challenges, Zipf's law, and practical applications in NLP.

This article is part of the free-to-read Language AI Handbook
N-grams
Bag of words treats every word as an isolated unit, discarding all information about word order. But word order matters. "The dog bit the man" means something entirely different from "The man bit the dog." N-grams capture this local context by representing text as sequences of consecutive tokens rather than individual words. An n-gram is simply a contiguous sequence of n items from a text, where items can be words, characters, or any other units.
This chapter explores how n-grams preserve local word order, why vocabulary size explodes as n increases, and how the statistical properties of n-grams follow predictable patterns. You'll learn to extract n-grams efficiently, understand when character n-grams outperform word n-grams, and see how skip-grams relax the strict adjacency requirement.
From Words to Word Sequences
The bag of words model creates a vocabulary of individual words and counts how often each appears. N-grams extend this by creating a vocabulary of word sequences. A bigram (2-gram) is a sequence of two consecutive words. A trigram (3-gram) is three consecutive words. The general term n-gram covers any sequence length.
An n-gram is a contiguous sequence of n items from a text. When the items are words, we call them word n-grams. When the items are characters, we call them character n-grams. Common special cases include unigrams (n=1), bigrams (n=2), and trigrams (n=3).
Let's see how the same sentence produces different representations at each n-gram level:
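One way to extract these is a plain sliding window; a minimal sketch (the helper name extract_ngrams is ours):

```python
def extract_ngrams(tokens, n):
    """Slide a window of size n over the token list and collect each window as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the quick brown fox jumps over the lazy dog".split()

print(f"Tokens: {len(tokens)}")
print(f"Unigrams ({len(tokens)}):")
print(tokens)  # unigrams are just the individual tokens
for n, name in [(2, "Bigrams"), (3, "Trigrams")]:
    grams = extract_ngrams(tokens, n)
    print(f"{name} ({len(grams)}):")
    print(grams)
```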
Sentence: 'the quick brown fox jumps over the lazy dog'
Tokens: 9
Unigrams (9):
['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
Bigrams (8):
[('the', 'quick'), ('quick', 'brown'), ('brown', 'fox'), ('fox', 'jumps'), ('jumps', 'over'), ('over', 'the'), ('the', 'lazy'), ('lazy', 'dog')]
Trigrams (7):
[('the', 'quick', 'brown'), ('quick', 'brown', 'fox'), ('brown', 'fox', 'jumps'), ('fox', 'jumps', 'over'), ('jumps', 'over', 'the'), ('over', 'the', 'lazy'), ('the', 'lazy', 'dog')]
Notice the pattern: a sentence with $T$ tokens produces $T - n + 1$ n-grams of size $n$. With 9 tokens, we get 9 unigrams, 8 bigrams, and 7 trigrams. As $n$ increases, we extract fewer n-grams from each document, but each n-gram carries more contextual information.
Bigrams and Trigrams in Practice
Bigrams are the most commonly used n-grams in practice. They capture immediate word relationships while keeping vocabulary size manageable. Trigrams provide richer context but at the cost of sparser data.
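Counting bigrams takes only a few lines with collections.Counter. A sketch on an illustrative corpus (the exact corpus behind the printed counts below is not reproduced here):

```python
from collections import Counter

# Illustrative corpus; the exact corpus behind the printed counts is not reproduced here.
corpus = [
    "i love machine learning",
    "machine learning is powerful",
    "i love deep learning",
    "deep learning is a subset of machine learning",
]

bigram_counts = Counter()
for doc in corpus:
    tokens = doc.split()
    bigram_counts.update(zip(tokens, tokens[1:]))  # consecutive word pairs

print("Corpus bigram frequencies:")
for (w1, w2), count in bigram_counts.most_common(10):
    print(f"{w1 + ' ' + w2:<20} {count}")

print("Total bigram tokens:", sum(bigram_counts.values()))
print("Unique bigrams:", len(bigram_counts))
```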
Corpus bigram frequencies:
----------------------------------------
machine learning     4
i love               2
deep learning        2
learning is          2
love machine         1
love deep            1
is powerful          1
is a                 1
a subset             1
subset of            1

Total bigram tokens: 21
Unique bigrams: 15
The bigram "machine learning" appears frequently because it's a meaningful phrase in this corpus. This is the key insight: common n-grams often correspond to meaningful multi-word expressions, collocations, or phrases.
Unigram frequencies (top 10):
learning     6
machine      4
i            3
love         2
deep         2
is           2
powerful     1
a            1
subset       1
of           1

Unique unigrams: 13
Unique bigrams: 15
Vocabulary explosion factor: 1.2x
Even in this tiny corpus, the bigram vocabulary is larger than the unigram vocabulary. This vocabulary explosion becomes severe as n increases or corpus size grows.

The co-occurrence matrix reveals the structure hidden in bigram counts. The cell at row "machine" and column "learning" shows a high count because "machine learning" is a frequent phrase. Most cells are zero, reflecting the sparsity inherent in natural language: most word pairs never occur together.
The Vocabulary Explosion Problem
The theoretical maximum number of distinct n-grams grows exponentially with n. If your vocabulary has $V$ unique words, the maximum number of possible n-grams is $V^n$. For a modest vocabulary of 10,000 words:
- Unigrams: $10^4$ possible
- Bigrams: $10^8$ possible
- Trigrams: $10^{12}$ possible
In practice, most of these combinations never occur in natural language. "The the the" is a valid trigram structurally but appears rarely. Still, the number of observed n-grams grows rapidly with n.
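Numbers like those below can be produced from NLTK's Gutenberg corpus. A sketch, assuming the corpus is downloaded and tokens are lowercased alphabetic words (other preprocessing choices shift the counts slightly):

```python
import nltk
from nltk.corpus import gutenberg
from nltk.util import ngrams

nltk.download("gutenberg", quiet=True)

# Lowercased alphabetic tokens only; other preprocessing choices shift the counts slightly.
tokens = [w.lower() for w in gutenberg.words("austen-emma.txt") if w.isalpha()]
print(f"Total tokens: {len(tokens):,}")

unigram_vocab = len(set(tokens))
print("Vocabulary size by n-gram order:")
for n in range(1, 6):
    vocab = len(set(ngrams(tokens, n)))
    print(f"{n}-grams: {vocab:,} ({vocab / unigram_vocab:.1f}x unigrams)")
```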
Text: Emma by Jane Austen
Total tokens: 158,167

Vocabulary size by n-gram order:
-----------------------------------
Unigrams    16,945  (1.0x unigrams)
Bigrams     81,586  (4.8x unigrams)
Trigrams   136,444  (8.1x unigrams)
4-grams    154,062  (9.1x unigrams)
5-grams    157,344  (9.3x unigrams)
The vocabulary grows sharply with each increment in n: here the bigram vocabulary is nearly five times the size of the unigram vocabulary, and a trigram vocabulary can easily be 8 to 10 times larger. This has practical implications for memory usage, model complexity, and the amount of training data needed.

N-gram Frequency Distributions and Zipf's Law
Individual words follow Zipf's law: the frequency of a word is inversely proportional to its rank. A few words appear extremely often while most words appear rarely. N-grams follow the same pattern, only more extreme.
Zipf's law states that in a corpus of natural language, the frequency of any word is inversely proportional to its rank in the frequency table. The most frequent word appears roughly twice as often as the second most frequent, three times as often as the third, and so on.
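Statistics like those below can be computed with a Counter over the n-grams; a sketch using the same Emma preprocessing as above (hapax legomena are simply types with count 1):

```python
from collections import Counter

from nltk.corpus import gutenberg
from nltk.util import ngrams

tokens = [w.lower() for w in gutenberg.words("austen-emma.txt") if w.isalpha()]

for n, label in [(1, "Unigrams"), (2, "Bigrams"), (3, "Trigrams")]:
    counts = Counter(ngrams(tokens, n))
    hapax = sum(1 for c in counts.values() if c == 1)  # types that occur exactly once
    print(f"{label}:")
    print(f"  Total tokens: {sum(counts.values()):,}")
    print(f"  Unique types: {len(counts):,}")
    print(f"  Hapax legomena (appear once): {hapax:,} ({hapax / len(counts):.1%})")
    for gram, c in counts.most_common(5):
        print(f"  {' '.join(gram)!r}: {c:,}")
```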
Frequency Distribution Statistics:
============================================================
Unigrams:
Total tokens: 158,167
Unique types: 16,945
Hapax legomena (appear once): 10,232 (60.4%)
Top 5 most frequent:
'the': 5,120
'to': 5,079
'and': 4,445
'of': 4,196
'a': 3,055
Bigrams:
Total tokens: 158,166
Unique types: 81,586
Hapax legomena (appear once): 65,609 (80.4%)
Top 5 most frequent:
'to be': 566
'of the': 558
'in the': 441
'it was': 387
'she had': 313
Trigrams:
Total tokens: 158,165
Unique types: 136,444
Hapax legomena (appear once): 126,523 (92.7%)
Top 5 most frequent:
'i do not': 94
'i am sure': 75
'she could not': 65
'a great deal': 56
'it would be': 56
The hapax ratio (proportion of n-grams appearing only once) increases dramatically with n. For trigrams, the vast majority appear exactly once. This sparsity is the core challenge of n-gram models: most n-grams you encounter in new text won't exist in your training data.

The visualization above makes the sparsity problem concrete. With trigrams, you're essentially building a model where more than 9 out of 10 features appear only once in training. This severely limits the model's ability to generalize to new text.

Character N-grams for Robustness
Word n-grams assume clean, correctly spelled text. But real-world text contains typos, spelling variations, and out-of-vocabulary words. Character n-grams offer an alternative that's robust to these issues.
A character n-gram is a contiguous sequence of n characters from a text, including spaces and punctuation. Character n-grams can capture subword patterns, making them robust to spelling variations and useful for languages with rich morphology.
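A minimal sketch of character n-gram extraction and the typo-overlap comparison (the helper name char_ngrams is ours):

```python
def char_ngrams(word, n):
    """Return the set of character n-grams of length n in a word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

correct, typo = "learning", "lerning"
tri_correct = char_ngrams(correct, 3)
tri_typo = char_ngrams(typo, 3)
overlap = tri_correct & tri_typo

print("Character bigrams: ", sorted(char_ngrams(correct, 2)))
print("Correct trigrams:  ", sorted(tri_correct))
print("Typo trigrams:     ", sorted(tri_typo))
print(f"Overlap: {sorted(overlap)} ({len(overlap)}/{len(tri_correct)})")
```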
Word: 'learning'
Character bigrams:  ['le', 'ea', 'ar', 'rn', 'ni', 'in', 'ng']
Character trigrams: ['lea', 'ear', 'arn', 'rni', 'nin', 'ing']

Typo robustness example:
Correct: 'learning' → trigrams: ['arn', 'ear', 'ing', 'lea', 'nin', 'rni']
Typo:    'lerning'  → trigrams: ['ern', 'ing', 'ler', 'nin', 'rni']
Overlap: ['ing', 'nin', 'rni'] (3/6 = 50%)
Despite the typo, most character trigrams still match. This partial matching is why character n-grams excel at:
- Spelling correction: Finding similar words despite typos
- Language identification: Different languages have characteristic character patterns
- Author attribution: Writing style shows up in character-level patterns
- Morphologically rich languages: Capturing word stems and affixes
Top 5 character trigrams by language:
--------------------------------------------------
English : 'the', 'he ', 'e q', ' qu', 'qui'
French  : 'le ', ' pa', 'par', 'ess', 'e r'
German  : 'er ', 'en ', 'der', 'r s', ' sc'
Spanish : 'el ', 'rro', 'ro ', ' pe', 'per'
Each language has distinctive character trigram signatures: German shows 'der' and the start of 'sch', French favors 'le ' and 'par', and Spanish features 'el ' and 'rro'. These patterns form the basis of language detection algorithms.

Skip-grams: Flexible Context Windows
Standard n-grams require strict adjacency: every word must be immediately next to its neighbors. Skip-grams relax this constraint by allowing gaps between words. A skip-gram with $k$ skips can have up to $k$ words between any two selected words.
A skip-gram is a generalization of n-grams that allows non-adjacent words to form a sequence. A $k$-skip-$n$-gram includes all subsequences of $n$ words where the gap between any two consecutive selected words is at most $k$.
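A minimal sketch of a skip-bigram extractor (skip_bigrams is our own helper name); it reproduces the counts in the example that follows:

```python
def skip_bigrams(tokens, max_skip):
    """All ordered pairs with at most max_skip words between the two selected tokens."""
    pairs = []
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + max_skip + 2, len(tokens))):
            pairs.append((tokens[i], tokens[j]))
    return pairs

tokens = "the cat sat on the mat".split()
print("Regular bigrams:", len(skip_bigrams(tokens, 0)))  # 5
print("Skip-1 bigrams: ", len(skip_bigrams(tokens, 1)))  # 9
print("Skip-2 bigrams: ", len(skip_bigrams(tokens, 2)))  # 12
```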
Sentence: 'the cat sat on the mat'
Regular bigrams (5):
[('the', 'cat'), ('cat', 'sat'), ('sat', 'on'), ('on', 'the'), ('the', 'mat')]
Skip-1 bigrams (9):
[('the', 'cat'), ('the', 'sat'), ('cat', 'sat'), ('cat', 'on'), ('sat', 'on'), ('sat', 'the'), ('on', 'the'), ('on', 'mat'), ('the', 'mat')]
Skip-2 bigrams (12):
[('the', 'cat'), ('the', 'sat'), ('the', 'on'), ('cat', 'sat'), ('cat', 'on'), ('cat', 'the'), ('sat', 'on'), ('sat', 'the'), ('sat', 'mat'), ('on', 'the'), ('on', 'mat'), ('the', 'mat')]
Skip-grams capture relationships between words that aren't immediately adjacent. In "the cat sat on the mat", regular bigrams miss the relationship between "cat" and "on", but skip-bigrams capture it. This is useful when:
- Word order varies: Different phrasings of the same idea
- Modifiers intervene: "the big fluffy cat" vs "the cat"
- Long-distance dependencies: Subject-verb agreement across clauses
The trade-off is a larger vocabulary. Skip-grams with $k > 0$ skips produce substantially more combinations than regular n-grams, exacerbating the vocabulary explosion.
Vocabulary size by skip distance:
----------------------------------------
Regular bigrams   15 unique
Skip-1 bigrams    29 unique
Skip-2 bigrams    38 unique
Skip-3 bigrams    44 unique

The vocabulary growth with skip distance is substantial. Even with a small corpus, skip-3 bigrams produce significantly more unique pairs than regular bigrams. In larger corpora, this effect compounds, making high-skip models memory-intensive.
N-gram Indexing for Search
N-grams enable efficient approximate string matching and search. By indexing documents by their n-grams, you can quickly find documents containing similar phrases even with minor variations.
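A sketch of a word-bigram inverted index with a simple match-count score (all names here are ours; real systems add normalization and ranking):

```python
from collections import defaultdict

def word_bigrams(text):
    tokens = text.split()
    return list(zip(tokens, tokens[1:]))

docs = [
    "machine learning algorithms",
    "deep learning neural networks",
    "machine learning for natural language processing",
    "statistical machine translation",
    "reinforcement learning agents",
]

# Inverted index: bigram -> set of document ids that contain it
index = defaultdict(set)
for doc_id, doc in enumerate(docs):
    for bg in word_bigrams(doc):
        index[bg].add(doc_id)

def search(query):
    """Score each document by how many of the query's bigrams it contains."""
    scores = defaultdict(int)
    for bg in word_bigrams(query):
        for doc_id in index.get(bg, set()):
            scores[doc_id] += 1
    return sorted(scores.items(), key=lambda item: -item[1])

for query in ["machine learning", "deep learning", "natural language"]:
    print(f"Query: '{query}'")
    for doc_id, score in search(query):
        print(f"  Score {score}: [{doc_id}] {docs[doc_id]}")
```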
Indexed documents:
[0] machine learning algorithms
[1] deep learning neural networks
[2] machine learning for natural language processing
[3] statistical machine translation
[4] reinforcement learning agents

Query: 'machine learning'
  Score 1: [0] machine learning algorithms
  Score 1: [2] machine learning for natural language processing

Query: 'deep learning'
  Score 1: [1] deep learning neural networks

Query: 'natural language'
  Score 1: [2] machine learning for natural language processing
This inverted index approach is the foundation of many search engines and fuzzy matching systems. The n-gram index trades space for speed: storing all n-grams requires more memory, but queries become fast set intersection operations.
Using NLTK for N-gram Extraction
While we've implemented n-gram extraction from scratch, NLTK provides optimized utilities:
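The calls behind output like the following look roughly like this (the print formatting is ours):

```python
from nltk.util import ngrams, everygrams

tokens = "the quick brown fox jumps over the lazy dog".split()

print("Bigrams: ", list(ngrams(tokens, 2))[:4], "...")
print("Trigrams:", list(ngrams(tokens, 3))[:3], "...")
print("4-grams: ", list(ngrams(tokens, 4))[:2], "...")

every = list(everygrams(tokens, min_len=1, max_len=3))
print(f"Everygrams (1-3): {len(every)} total")
print("Sample:", every[:5], "...")
```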
NLTK n-gram extraction:
Bigrams: [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox'), ('fox', 'jumps')]...
Trigrams: [('the', 'quick', 'brown'), ('quick', 'brown', 'fox'), ('brown', 'fox', 'jumps')]...
4-grams: [('the', 'quick', 'brown', 'fox'), ('quick', 'brown', 'fox', 'jumps')]...
Everygrams (1-3): 24 total
Sample: [('the',), ('the', 'quick'), ('the', 'quick', 'brown'), ('quick',), ('quick', 'brown')]...
NLTK also provides FreqDist for counting and analyzing n-gram frequencies:
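A sketch of FreqDist over bigrams; the corpus below merely stands in for the small example corpus used earlier, so the printed numbers that follow come from that corpus, not this one:

```python
from nltk import FreqDist, bigrams

# Stand-in for the small example corpus used earlier; its exact contents are not reproduced here.
corpus = [
    "i love machine learning",
    "machine learning is powerful",
    "deep learning is a subset of machine learning",
]

all_bigrams = [bg for doc in corpus for bg in bigrams(doc.split())]

fdist = FreqDist(all_bigrams)
print("Total bigrams:", fdist.N())
print("Unique bigrams:", fdist.B())
print("Most common:", fdist.most_common(5))
print("Frequency of ('machine', 'learning'):", fdist[('machine', 'learning')])
```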
NLTK FreqDist analysis:
Total bigrams: 21
Unique bigrams: 15
Most common: [(('machine', 'learning'), 4), (('i', 'love'), 2), (('deep', 'learning'), 2), (('learning', 'is'), 2), (('love', 'machine'), 1)]
Frequency of ('machine', 'learning'): 4
Practical Considerations
When working with n-grams, several practical considerations affect your results:
Choosing n: Larger n captures more context but creates sparser data. For most applications:
- Unigrams (n=1): Baseline, loses all word order
- Bigrams (n=2): Good balance for most tasks
- Trigrams (n=3): Richer context, common in language modeling
- n > 3: Rarely used due to sparsity
Vocabulary pruning: Remove n-grams that appear too rarely (min_df) or too frequently (max_df). Rare n-grams add noise; frequent n-grams (like "of the") add little discriminative value.
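With scikit-learn's CountVectorizer, both thresholds are constructor parameters; a minimal sketch with illustrative values:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "i love machine learning",
    "machine learning is powerful",
    "deep learning is amazing",
    "statistical methods are powerful",
]

# ngram_range=(1, 2): unigrams and bigrams; min_df drops rare n-grams, max_df drops ubiquitous ones.
vectorizer = CountVectorizer(ngram_range=(1, 2), min_df=2, max_df=0.9)
X = vectorizer.fit_transform(docs)

print("Kept features:", list(vectorizer.get_feature_names_out()))
print("Document-term matrix shape:", X.shape)
```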
Padding: For some applications, you may want to add special start and end tokens to capture sentence boundaries:
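With NLTK, padding is controlled by keyword arguments on ngrams (keyword names as in recent NLTK versions; the '<PAD>' symbol matches the output shown next):

```python
from nltk.util import ngrams

tokens = ["the", "cat", "sat"]

# n-1 padding symbols are added on each side when pad_left/pad_right are enabled.
pad_kwargs = dict(pad_left=True, pad_right=True,
                  left_pad_symbol="<PAD>", right_pad_symbol="<PAD>")

print("Padded bigrams: ", list(ngrams(tokens, 2, **pad_kwargs)))
print("Padded trigrams:", list(ngrams(tokens, 3, **pad_kwargs)))
```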
Padded n-grams capture sentence boundaries:
Tokens: ['the', 'cat', 'sat']
Padded bigrams: [('<PAD>', 'the'), ('the', 'cat'), ('cat', 'sat'), ('sat', '<PAD>')]
Padded trigrams: [('<PAD>', '<PAD>', 'the'), ('<PAD>', 'the', 'cat'), ('the', 'cat', 'sat'), ('cat', 'sat', '<PAD>'), ('sat', '<PAD>', '<PAD>')]
Memory efficiency: For large corpora, store n-grams as hashed integers rather than string tuples. Use sparse matrix representations when building document-term matrices.
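A hedged sketch of the hashing idea using Python's built-in hash (which is salted per process; a production system would use a stable hash such as MurmurHash, or sklearn's HashingVectorizer):

```python
NUM_BUCKETS = 2 ** 20  # fixed-size feature space instead of an ever-growing string vocabulary

def ngram_to_bucket(gram):
    """Map an n-gram tuple to an integer bucket. Python's hash() is salted per process,
    so a real system would use a stable hash (e.g. MurmurHash) instead."""
    return hash(" ".join(gram)) % NUM_BUCKETS

print(ngram_to_bucket(("machine", "learning")))
print(ngram_to_bucket(("deep", "learning")))
```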
Limitations and Trade-offs
N-grams improve on bag of words by capturing local context, but they have significant limitations:
Vocabulary explosion: The number of unique n-grams grows exponentially with n, making high-order n-grams impractical for most applications.
Sparsity: Most n-grams appear rarely. In new text, you'll frequently encounter n-grams not seen during training.
Fixed context window: N-grams capture exactly n consecutive words, missing both shorter and longer patterns. The phrase "not very good" is meaningful as a trigram, but its sentiment depends on understanding that "not" negates "good" across the intervening word.
No semantic understanding: "Excellent movie" and "great film" share no n-grams despite having similar meanings. N-grams are purely syntactic.
Data requirements: Higher-order n-grams require exponentially more training data to estimate reliably. Trigram language models need millions of words; 5-gram models need billions.
Impact and Applications
Despite these limitations, n-grams remain foundational in NLP:
Language modeling: N-gram language models estimate the probability of word sequences. Before neural networks, trigram models dominated speech recognition and machine translation.
Text classification: Adding bigrams and trigrams to bag-of-words features often improves classification accuracy by capturing phrases.
Spell checking and autocomplete: Character n-gram similarity identifies likely corrections. Word n-grams predict the next word.
Plagiarism detection: Matching n-gram fingerprints identifies copied text even with minor modifications.
Information retrieval: N-gram indexing enables fast fuzzy matching and phrase search.
The transition from n-grams to neural models didn't make n-grams obsolete. Modern subword tokenizers like BPE and WordPiece are essentially learned character n-gram vocabularies. Understanding n-grams provides intuition for why these newer methods work.
Key Functions and Parameters
When working with n-grams in Python, these are the essential functions and their most important parameters:
nltk.ngrams(sequence, n, pad_left=False, pad_right=False)
- sequence: The input tokens (list of words or characters)
- n: The size of the n-gram (2 for bigrams, 3 for trigrams, etc.)
- pad_left, pad_right: Whether to add padding symbols at boundaries. Useful for language modeling where start/end context matters
nltk.bigrams(sequence) and nltk.trigrams(sequence)
- Convenience functions equivalent to ngrams(sequence, 2) and ngrams(sequence, 3)
- Return generators, so wrap in list() if you need to iterate multiple times
nltk.everygrams(sequence, min_len=1, max_len=-1)
- min_len: Minimum n-gram size to include
- max_len: Maximum n-gram size (-1 means use sequence length)
- Useful when you want to combine multiple n-gram orders in a single feature set
collections.Counter(iterable)
- Essential for counting n-gram frequencies
- .most_common(n): Returns the n most frequent items
- Supports arithmetic operations for combining counts across documents
Custom extraction parameters:
- max_skip: For skip-grams, controls how many words can be skipped between selected words. Higher values capture more distant relationships but increase vocabulary size
- pad_symbol: The token used for boundary padding (commonly <s>, </s>, or <PAD>)
Summary
N-grams extend bag of words by preserving local word order. Key takeaways:
- N-grams are contiguous sequences of n tokens (words or characters)
- Bigrams capture immediate word relationships; trigrams add more context
- Vocabulary explosion: Unique n-grams grow exponentially with n
- Zipf's law applies to n-grams, with most appearing only once
- Character n-grams provide robustness to typos and work across languages
- Skip-grams relax adjacency requirements to capture flexible patterns
- N-gram indexing enables fast approximate search and matching
- Trade-offs: More context requires more data and creates sparser representations
N-grams bridge the gap between treating words as isolated units and understanding them in context. In the next chapter, we'll explore TF-IDF, which adds statistical weighting to distinguish informative terms from common ones.