Master n-gram text representations including bigrams, trigrams, character n-grams, and skip-grams. Learn extraction techniques, vocabulary explosion challenges, Zipf's law, and practical applications in NLP.

This article is part of the free-to-read Language AI Handbook
N-grams
Bag of words treats every word as an isolated unit, discarding all information about word order. But word order matters. "The dog bit the man" means something entirely different from "The man bit the dog." N-grams capture this local context by representing text as sequences of consecutive tokens rather than individual words. An n-gram is simply a contiguous sequence of n items from a text, where items can be words, characters, or any other units.
This chapter explores how n-grams preserve local word order, why vocabulary size explodes as n increases, and how the statistical properties of n-grams follow predictable patterns. You'll learn to extract n-grams efficiently, understand when character n-grams outperform word n-grams, and see how skip-grams relax the strict adjacency requirement.
From Words to Word Sequences
The bag of words model creates a vocabulary of individual words and counts how often each appears. N-grams extend this by creating a vocabulary of word sequences. A bigram (2-gram) is a sequence of two consecutive words. A trigram (3-gram) is three consecutive words. The general term n-gram covers any sequence length.
An n-gram is a contiguous sequence of n items from a text. When the items are words, we call them word n-grams. When the items are characters, we call them character n-grams. Common special cases include unigrams (n=1), bigrams (n=2), and trigrams (n=3).
Let's see how the same sentence produces different representations at each n-gram level:
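The chapter's interactive example isn't reproduced here, so the sketch below shows the idea from scratch. The nine-token sentence and the extract_ngrams helper are illustrative stand-ins rather than the handbook's own code:

```python
def extract_ngrams(tokens, n):
    # Slide a window of length n across the token list
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# A nine-token stand-in sentence
tokens = "the quick brown fox jumps over the lazy dog".split()

for n in (1, 2, 3):
    grams = extract_ngrams(tokens, n)
    print(f"{n}-grams: {len(grams)} total, first three: {grams[:3]}")
```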
Notice the pattern: a sentence with N tokens produces N - n + 1 n-grams. The formula for the number of n-grams is:
count = N - n + 1
where:
- N: the total number of tokens in the sentence
- n: the n-gram size (1 for unigrams, 2 for bigrams, 3 for trigrams, etc.)
- count: the resulting number of n-grams
This formula works because each n-gram starts at a different position, and the last valid starting position is N - n + 1 (since we need n consecutive tokens). With 9 tokens, we get 9 unigrams (9 - 1 + 1 = 9), 8 bigrams (9 - 2 + 1 = 8), and 7 trigrams (9 - 3 + 1 = 7). As n increases, we extract fewer n-grams from each document, but each n-gram carries more contextual information.
Bigrams and Trigrams in Practice
Bigrams are the most commonly used n-grams in practice. They capture immediate word relationships while keeping vocabulary size manageable. Trigrams provide richer context but at the cost of sparser data.
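The corpus used in the chapter isn't reproduced here, so the following sketch counts bigrams over a small hypothetical corpus to show how a recurring phrase rises to the top:

```python
from collections import Counter

# Hypothetical toy corpus, not the chapter's original one
corpus = [
    "machine learning powers machine translation",
    "deep learning extends machine learning",
    "machine learning needs data and deep learning needs more data",
]

unigram_counts, bigram_counts = Counter(), Counter()
for doc in corpus:
    tokens = doc.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))  # consecutive word pairs

print(bigram_counts.most_common(3))             # ('machine', 'learning') comes out on top
print(len(unigram_counts), len(bigram_counts))  # the bigram vocabulary is already larger
```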
The bigram "machine learning" appears frequently because it's a meaningful phrase in this corpus. This is the key insight: common n-grams often correspond to meaningful multi-word expressions, collocations, or phrases.
Even in this tiny corpus, the bigram vocabulary is larger than the unigram vocabulary. This vocabulary explosion becomes severe as n increases or corpus size grows.
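The co-occurrence matrix from the original isn't reproduced here; this is a rough sketch of how one can be assembled from bigram counts (rows index the first word, columns the second), reusing the same hypothetical corpus:

```python
from collections import defaultdict

corpus = [
    "machine learning powers machine translation",
    "deep learning extends machine learning",
    "machine learning needs data and deep learning needs more data",
]

# matrix[first_word][second_word] = how often the bigram occurs
matrix = defaultdict(lambda: defaultdict(int))
for doc in corpus:
    tokens = doc.split()
    for w1, w2 in zip(tokens, tokens[1:]):
        matrix[w1][w2] += 1

print(matrix["machine"]["learning"])     # high count: a frequent phrase
print(matrix["powers"].get("data", 0))   # 0: most word pairs never co-occur
```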
The co-occurrence matrix reveals the structure hidden in bigram counts. The cell at row "machine" and column "learning" shows a high count because "machine learning" is a frequent phrase. Most cells are zero, reflecting the sparsity inherent in natural language: most word pairs never occur together.
The Vocabulary Explosion Problem
The theoretical maximum number of distinct n-grams grows exponentially with n. If your vocabulary has V unique words, the maximum number of possible n-grams is V^n. For a modest vocabulary of 10,000 words:
- Unigrams: 10,000 possible
- Bigrams: 10,000^2 = 100 million possible
- Trigrams: 10,000^3 = 1 trillion possible
In practice, most of these combinations never occur in natural language. "The the the" is a valid trigram structurally but appears rarely. Still, the number of observed n-grams grows rapidly with n.
The vocabulary roughly doubles or triples with each increment in n. A trigram vocabulary can easily be 10 times larger than the unigram vocabulary. This has practical implications for memory usage, model complexity, and the amount of training data needed.
N-gram Frequency Distributions and Zipf's Law
Individual words follow Zipf's law: the frequency of a word is inversely proportional to its rank. A few words appear extremely often while most words appear rarely. N-grams follow the same pattern, but even more extremely.
Zipf's law states that in a corpus of natural language, the frequency of any word is inversely proportional to its rank in the frequency table. The most frequent word appears roughly twice as often as the second most frequent, three times as often as the third, and so on.
The hapax ratio (proportion of n-grams appearing only once) increases dramatically with n. For trigrams, the vast majority appear exactly once. This sparsity is the core challenge of n-gram models: most n-grams you encounter in new text won't exist in your training data.
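The hapax ratio is easy to measure once you have n-gram counts. A minimal sketch, assuming whitespace tokenization and a tiny stand-in corpus:

```python
from collections import Counter

def hapax_ratio(docs, n):
    counts = Counter()
    for doc in docs:
        tokens = doc.split()
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    hapaxes = sum(1 for c in counts.values() if c == 1)  # n-grams seen exactly once
    return hapaxes / len(counts)

docs = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat chased the dog",
]
for n in (1, 2, 3):
    # The ratio climbs steeply with n, even on a toy corpus
    print(f"n={n}: hapax ratio {hapax_ratio(docs, n):.2f}")
```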
These hapax statistics make the sparsity problem concrete. With trigrams, you're essentially building a model where roughly 4 out of 5 features appear only once in training. This severely limits the model's ability to generalize to new text.
Character N-grams for Robustness
Word n-grams assume clean, correctly spelled text. But real-world text contains typos, spelling variations, and out-of-vocabulary words. Character n-grams offer an alternative that's robust to these issues.
A character n-gram is a contiguous sequence of n characters from a text, including spaces and punctuation. Character n-grams can capture subword patterns, making them robust to spelling variations and useful for languages with rich morphology.
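A character n-gram extractor fits in a line or two, and comparing a word against a misspelling shows why the representation is forgiving; the example words and the Jaccard comparison below are illustrative choices:

```python
def char_ngrams(text, n=3):
    # Contiguous character windows, spaces included
    return {text[i:i + n] for i in range(len(text) - n + 1)}

correct, typo = "language model", "langauge model"   # 'u' and 'a' transposed
a, b = char_ngrams(correct), char_ngrams(typo)

print(sorted(a & b))                                 # trigrams the two strings share
print(f"Jaccard similarity: {len(a & b) / len(a | b):.2f}")
```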
Despite the typo, most character trigrams still match. This partial matching is why character n-grams excel at:
- Spelling correction: Finding similar words despite typos
- Language identification: Different languages have characteristic character patterns
- Author attribution: Writing style shows up in character-level patterns
- Morphologically rich languages: Capturing word stems and affixes
Each language has distinctive character trigram signatures. German shows "sch" and "ber", French has "ard" and "eux", and Spanish features "rro" and "bre". These patterns form the basis of language detection algorithms.
Skip-grams: Flexible Context Windows
Standard n-grams require strict adjacency: every word must be immediately next to its neighbors. Skip-grams relax this constraint by allowing gaps between words. A skip-gram with k skips can have up to k words between any two selected words.
A skip-gram is a generalization of n-grams that allows non-adjacent words to form a sequence. A k-skip-n-gram includes all subsequences of n words where the gap between any two consecutive selected words is at most k.
Skip-grams capture relationships between words that aren't immediately adjacent. In "the cat sat on the mat", regular bigrams miss the relationship between "cat" and "on", but skip-bigrams capture it. This is useful when:
- Word order varies: Different phrasings of the same idea
- Modifiers intervene: "the big fluffy cat" vs "the cat"
- Long-distance dependencies: Subject-verb agreement across clauses
The trade-off is a larger vocabulary. Skip-grams with k skips produce more combinations than regular n-grams, exacerbating the vocabulary explosion.
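A minimal skip-bigram sketch follows; max_skip is a hypothetical parameter name (echoed in the reference section below) controlling how many words may be skipped between the two selected words:

```python
from collections import Counter

def skip_bigrams(tokens, max_skip=2):
    # Pair each word with every word up to max_skip positions further ahead
    pairs = []
    for i, w1 in enumerate(tokens):
        for j in range(i + 1, min(i + 2 + max_skip, len(tokens))):
            pairs.append((w1, tokens[j]))
    return pairs

tokens = "the cat sat on the mat".split()
print(Counter(skip_bigrams(tokens, max_skip=0)))   # identical to regular bigrams
print(len(set(skip_bigrams(tokens, max_skip=3))))  # many more unique pairs than the 5 regular bigrams
```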
The vocabulary growth with skip distance is substantial. Even with a small corpus, skip-3 bigrams produce significantly more unique pairs than regular bigrams. In larger corpora, this effect compounds, making high-skip models memory-intensive.
N-gram Indexing for Search
N-grams enable efficient approximate string matching and search. By indexing documents by their n-grams, you can quickly find documents containing similar phrases even with minor variations.
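Here is a rough sketch of a character-trigram inverted index with a fuzzy lookup; the documents, the search helper, and the min_overlap threshold are all illustrative assumptions:

```python
from collections import defaultdict

def char_trigrams(text):
    return {text[i:i + 3] for i in range(len(text) - 2)}

docs = {1: "natural language processing", 2: "language models", 3: "computer vision"}

# Inverted index: trigram -> set of document ids containing it
index = defaultdict(set)
for doc_id, text in docs.items():
    for gram in char_trigrams(text):
        index[gram].add(doc_id)

def search(query, min_overlap=3):
    # Candidate documents must share at least min_overlap trigrams with the query
    hits = defaultdict(int)
    for gram in char_trigrams(query):
        for doc_id in index.get(gram, ()):
            hits[doc_id] += 1
    return [d for d, c in sorted(hits.items(), key=lambda kv: -kv[1]) if c >= min_overlap]

print(search("natural langauge"))   # the misspelled query still retrieves document 1
```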
This inverted index approach is the foundation of many search engines and fuzzy matching systems. The n-gram index trades space for speed: storing all n-grams requires more memory, but queries become fast set intersection operations.
Using NLTK for N-gram Extraction
While we've implemented n-gram extraction from scratch, NLTK provides optimized utilities:
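The original code listing isn't included here; a brief example of the standard NLTK helpers is sketched below, with a stand-in sentence:

```python
from nltk import ngrams, bigrams, everygrams

tokens = "the dog bit the man".split()

print(list(bigrams(tokens)))                 # [('the', 'dog'), ('dog', 'bit'), ('bit', 'the'), ('the', 'man')]
print(list(ngrams(tokens, 3)))               # the three trigrams in this sentence
print(list(everygrams(tokens, max_len=2)))   # unigrams and bigrams combined in one list
```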
NLTK also provides FreqDist for counting and analyzing n-gram frequencies:
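For example (the sentence is again a stand-in):

```python
from nltk import FreqDist, bigrams

tokens = "the cat sat on the mat and the cat slept".split()

fdist = FreqDist(bigrams(tokens))
print(fdist.most_common(3))       # the most frequent bigrams first
print(fdist[("the", "cat")])      # count for one specific bigram
```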
Practical Considerations
When working with n-grams, several practical considerations affect your results:
Choosing n: Larger n captures more context but creates sparser data. For most applications:
- Unigrams (n=1): Baseline, loses all word order
- Bigrams (n=2): Good balance for most tasks
- Trigrams (n=3): Richer context, common in language modeling
- n > 3: Rarely used due to sparsity
Vocabulary pruning: Remove n-grams that appear too rarely (min_df) or too frequently (max_df). Rare n-grams add noise; frequent n-grams (like "of the") add little discriminative value.
Padding: For some applications, you may want to add special start and end tokens to capture sentence boundaries:
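For instance, using the padding options of nltk.ngrams; the <s> and </s> symbols are a common convention rather than a requirement:

```python
from nltk import ngrams

tokens = "the dog barks".split()

padded = list(ngrams(tokens, 2, pad_left=True, pad_right=True,
                     left_pad_symbol="<s>", right_pad_symbol="</s>"))
print(padded)
# [('<s>', 'the'), ('the', 'dog'), ('dog', 'barks'), ('barks', '</s>')]
```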
Memory efficiency: For large corpora, store n-grams as hashed integers rather than string tuples. Use sparse matrix representations when building document-term matrices.
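One way to do this is the hashing trick: map each n-gram to a bucket in a fixed-size integer space and count buckets instead of strings. A minimal sketch using Python's built-in hash (a production system would prefer a stable hash such as MurmurHash, since Python randomizes string hashing per process):

```python
def hashed_ngram_counts(tokens, n=2, num_buckets=2**20):
    counts = {}
    for i in range(len(tokens) - n + 1):
        bucket = hash(tokens[i:i + n]) % num_buckets   # n-gram tuple -> integer bucket
        counts[bucket] = counts.get(bucket, 0) + 1
    return counts

# Tokens as a tuple so that slices are hashable
tokens = tuple("the cat sat on the mat".split())
print(hashed_ngram_counts(tokens))
```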
Limitations and Trade-offs
N-grams improve on bag of words by capturing local context, but they have significant limitations:
Vocabulary explosion: The number of unique n-grams grows exponentially with n, making high-order n-grams impractical for most applications.
Sparsity: Most n-grams appear rarely. In new text, you'll frequently encounter n-grams not seen during training.
Fixed context window: N-grams capture exactly n consecutive words, missing both shorter and longer patterns. The phrase "not very good" is meaningful as a trigram, but its sentiment depends on understanding that "not" negates "good" across the intervening word.
No semantic understanding: "Excellent movie" and "great film" share no n-grams despite having similar meanings. N-grams are purely syntactic.
Data requirements: Higher-order n-grams require exponentially more training data to estimate reliably. Trigram language models need millions of words; 5-gram models need billions.
Impact and Applications
Despite these limitations, n-grams remain foundational in NLP:
Language modeling: N-gram language models estimate the probability of word sequences. Before neural networks, trigram models dominated speech recognition and machine translation.
Text classification: Adding bigrams and trigrams to bag-of-words features often improves classification accuracy by capturing phrases.
Spell checking and autocomplete: Character n-gram similarity identifies likely corrections. Word n-grams predict the next word.
Plagiarism detection: Matching n-gram fingerprints identifies copied text even with minor modifications.
Information retrieval: N-gram indexing enables fast fuzzy matching and phrase search.
The transition from n-grams to neural models didn't make n-grams obsolete. Modern subword tokenizers like BPE and WordPiece are essentially learned character n-gram vocabularies. Understanding n-grams provides intuition for why these newer methods work.
Key Functions and Parameters
When working with n-grams in Python, these are the essential functions and their most important parameters:
nltk.ngrams(sequence, n, pad_left=False, pad_right=False)
- sequence: The input tokens (list of words or characters)
- n: The size of the n-gram (2 for bigrams, 3 for trigrams, etc.)
- pad_left, pad_right: Whether to add padding symbols at boundaries. Useful for language modeling where start/end context matters
nltk.bigrams(sequence) and nltk.trigrams(sequence)
- Convenience functions equivalent to ngrams(sequence, 2) and ngrams(sequence, 3)
- Return generators, so wrap in list() if you need to iterate multiple times
nltk.everygrams(sequence, min_len=1, max_len=-1)
- min_len: Minimum n-gram size to include
- max_len: Maximum n-gram size (-1 means use the sequence length)
- Useful when you want to combine multiple n-gram orders in a single feature set
collections.Counter(iterable)
- Essential for counting n-gram frequencies
- .most_common(n): Returns the n most frequent items
- Supports arithmetic operations for combining counts across documents
Custom extraction parameters:
- max_skip: For skip-grams, controls how many words can be skipped between selected words. Higher values capture more distant relationships but increase vocabulary size
- pad_symbol: The token used for boundary padding (commonly <s>, </s>, or <PAD>)
Summary
N-grams extend bag of words by preserving local word order. Key takeaways:
- N-grams are contiguous sequences of n tokens (words or characters)
- Bigrams capture immediate word relationships; trigrams add more context
- Vocabulary explosion: Unique n-grams grow exponentially with n
- Zipf's law applies to n-grams, with most appearing only once
- Character n-grams provide robustness to typos and work across languages
- Skip-grams relax adjacency requirements to capture flexible patterns
- N-gram indexing enables fast approximate search and matching
- Trade-offs: More context requires more data and creates sparser representations
N-grams bridge the gap between treating words as isolated units and understanding them in context. In the next chapter, we'll explore TF-IDF, which adds statistical weighting to distinguish informative terms from common ones.