N-grams: Capturing Word Order in Text with Bigrams, Trigrams & Skip-grams

Michael Brenndoerfer · Updated March 22, 2025 · 23 min read

Master n-gram text representations including bigrams, trigrams, character n-grams, and skip-grams. Learn extraction techniques, vocabulary explosion challenges, Zipf's law, and practical applications in NLP.

N-grams

Bag of words treats every word as an isolated unit, discarding all information about word order. But word order matters. "The dog bit the man" means something entirely different from "The man bit the dog." N-grams capture this local context by representing text as sequences of consecutive tokens rather than individual words. An n-gram is simply a contiguous sequence of n items from a text, where items can be words, characters, or any other units.

This chapter explores how n-grams preserve local word order, why vocabulary size explodes as n increases, and how the statistical properties of n-grams follow predictable patterns. You'll learn to extract n-grams efficiently, understand when character n-grams outperform word n-grams, and see how skip-grams relax the strict adjacency requirement.

From Words to Word Sequences

The bag of words model creates a vocabulary of individual words and counts how often each appears. N-grams extend this by creating a vocabulary of word sequences. A bigram (2-gram) is a sequence of two consecutive words. A trigram (3-gram) is three consecutive words. The general term n-gram covers any sequence length.

N-gram

An n-gram is a contiguous sequence of n items from a text. When the items are words, we call them word n-grams. When the items are characters, we call them character n-grams. Common special cases include unigrams (n=1), bigrams (n=2), and trigrams (n=3).

Let's see how the same sentence produces different representations at each n-gram level:

In[2]:
Code
def extract_ngrams(tokens, n):
    """Extract n-grams from a list of tokens."""
    return [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]


# Sample sentence
sentence = "the quick brown fox jumps over the lazy dog"
tokens = sentence.split()

# Extract different n-gram sizes
unigrams = extract_ngrams(tokens, 1)
bigrams = extract_ngrams(tokens, 2)
trigrams = extract_ngrams(tokens, 3)
Out[3]:
Console
Sentence: 'the quick brown fox jumps over the lazy dog'
Tokens: 9

Unigrams (9):
  ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']

Bigrams (8):
  [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox'), ('fox', 'jumps'), ('jumps', 'over'), ('over', 'the'), ('the', 'lazy'), ('lazy', 'dog')]

Trigrams (7):
  [('the', 'quick', 'brown'), ('quick', 'brown', 'fox'), ('brown', 'fox', 'jumps'), ('fox', 'jumps', 'over'), ('jumps', 'over', 'the'), ('over', 'the', 'lazy'), ('the', 'lazy', 'dog')]

Notice the pattern: a sentence with k tokens produces a predictable number of n-grams for each sequence length. The formula for the number of n-grams is:

count = k - n + 1

where:

  • k: the total number of tokens in the sentence
  • n: the n-gram size (1 for unigrams, 2 for bigrams, 3 for trigrams, etc.)
  • count: the resulting number of n-grams

This formula works because each n-gram is identified by its starting position, and the last valid starting position is k - n + 1 (we need n consecutive tokens from that point onward). With 9 tokens, we get 9 unigrams (9 - 1 + 1 = 9), 8 bigrams (9 - 2 + 1 = 8), and 7 trigrams (9 - 3 + 1 = 7). As n increases, we extract fewer n-grams from each document, but each n-gram carries more contextual information.
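
As a quick sanity check, the formula can be verified directly against the extraction function (a minimal sketch that reuses the extract_ngrams helper defined above):

Code
# Confirm count = k - n + 1 for the 9-token example sentence
tokens = "the quick brown fox jumps over the lazy dog".split()
k = len(tokens)
for n in range(1, 4):
    observed = len(extract_ngrams(tokens, n))
    expected = k - n + 1
    print(f"n={n}: observed {observed}, expected {expected}")
    assert observed == expected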

Bigrams and Trigrams in Practice

Bigrams are the most commonly used n-grams in practice. They capture immediate word relationships while keeping vocabulary size manageable. Trigrams provide richer context but at the cost of sparser data.

In[4]:
Code
from collections import Counter

# A small corpus to analyze
corpus = [
    "I love machine learning",
    "I love deep learning",
    "machine learning is powerful",
    "deep learning is a subset of machine learning",
    "I study machine learning every day",
]

# Tokenize and extract bigrams from entire corpus
all_bigrams = []
for doc in corpus:
    tokens = doc.lower().split()
    bigrams = extract_ngrams(tokens, 2)
    all_bigrams.extend(bigrams)

# Count bigram frequencies
bigram_counts = Counter(all_bigrams)
Out[5]:
Console
Corpus bigram frequencies:
----------------------------------------
  machine learning          4
  i love                    2
  deep learning             2
  learning is               2
  love machine              1
  love deep                 1
  is powerful               1
  is a                      1
  a subset                  1
  subset of                 1

Total bigram tokens: 21
Unique bigrams: 15

The bigram "machine learning" appears frequently because it's a meaningful phrase in this corpus. This is the key insight: common n-grams often correspond to meaningful multi-word expressions, collocations, or phrases.
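
One common way to surface such collocations is pointwise mutual information (PMI), which scores how much more often two words co-occur than their individual frequencies would predict. The sketch below is illustrative, assuming the corpus and bigram_counts variables from the cell above:

Code
import math
from collections import Counter

# Word frequencies across the corpus (needed for the PMI denominator)
word_counts = Counter(w for doc in corpus for w in doc.lower().split())
total_words = sum(word_counts.values())
total_bigrams = sum(bigram_counts.values())


def pmi(bigram):
    """Pointwise mutual information of a bigram under unigram independence."""
    w1, w2 = bigram
    p_pair = bigram_counts[bigram] / total_bigrams
    p_w1 = word_counts[w1] / total_words
    p_w2 = word_counts[w2] / total_words
    return math.log2(p_pair / (p_w1 * p_w2))


for bigram, count in bigram_counts.most_common(5):
    print(f"{' '.join(bigram):20s} count={count}  PMI={pmi(bigram):.2f}")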

In[6]:
Code
# Compare unigram vs bigram representations
all_unigrams = []
for doc in corpus:
    tokens = doc.lower().split()
    all_unigrams.extend(tokens)

unigram_counts = Counter(all_unigrams)
Out[7]:
Console
Unigram frequencies (top 10):
  learning        6
  machine         4
  i               3
  love            2
  deep            2
  is              2
  powerful        1
  a               1
  subset          1
  of              1

Unique unigrams: 13
Unique bigrams: 15
Vocabulary explosion factor: 1.2x

Even in this tiny corpus, the bigram vocabulary is larger than the unigram vocabulary. This vocabulary explosion becomes severe as n increases or corpus size grows.

Out[8]:
Visualization
Bigram co-occurrence matrix showing which words follow which in our sample corpus. Each cell represents how often the column word follows the row word. The matrix is sparse, with most combinations never occurring. High-frequency pairs like 'machine→learning' stand out, revealing meaningful collocations that unigrams would miss.

The co-occurrence matrix reveals the structure hidden in bigram counts. The cell at row "machine" and column "learning" shows a high count because "machine learning" is a frequent phrase. Most cells are zero, reflecting the sparsity inherent in natural language: most word pairs never occur together.
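
A matrix like this can be assembled directly from the bigram counts. The sketch below assumes pandas is available and reuses bigram_counts from the earlier cell:

Code
import pandas as pd

# Rows are first words, columns are the words that follow them
vocab = sorted({w for bigram in bigram_counts for w in bigram})
cooc = pd.DataFrame(0, index=vocab, columns=vocab)
for (w1, w2), count in bigram_counts.items():
    cooc.loc[w1, w2] = count

# The 'machine' -> 'learning' cell carries the highest count in this corpus
print(cooc.loc["machine", "learning"])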

The Vocabulary Explosion Problem

The theoretical maximum number of distinct n-grams grows exponentially with n. If your vocabulary has V unique words, the maximum number of possible n-grams is V^n. For a modest vocabulary of 10,000 words:

  • Unigrams: 10,000 possible
  • Bigrams: 100,000,000 possible
  • Trigrams: 1,000,000,000,000 possible

In practice, most of these combinations never occur in natural language. "The the the" is a valid trigram structurally but appears rarely. Still, the number of observed n-grams grows rapidly with n.

In[9]:
Code
import nltk

# Download sample text
nltk.download("gutenberg", quiet=True)
from nltk.corpus import gutenberg

# Use a substantial text for realistic statistics
text = gutenberg.raw("austen-emma.txt")
tokens = text.lower().split()

# Count unique n-grams for different values of n
ngram_vocab_sizes = {}
for n in range(1, 6):
    ngrams = extract_ngrams(tokens, n)
    unique_count = len(set(ngrams))
    ngram_vocab_sizes[n] = unique_count

The vocabulary roughly doubles or triples with each increment in n. A trigram vocabulary can easily be 10 times larger than the unigram vocabulary. This has practical implications for memory usage, model complexity, and the amount of training data needed.
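
To make the growth explicit, you could print the ratio between successive vocabulary sizes (a small sketch, assuming the ngram_vocab_sizes dictionary computed above):

Code
# Relative growth of the n-gram vocabulary as n increases
for n in sorted(ngram_vocab_sizes):
    size = ngram_vocab_sizes[n]
    prev = ngram_vocab_sizes.get(n - 1)
    ratio = f"{size / prev:.2f}x previous" if prev else "baseline"
    print(f"n={n}: {size:>9,} unique n-grams ({ratio})")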

N-gram Frequency Distributions and Zipf's Law

Individual words follow Zipf's law: the frequency of a word is inversely proportional to its rank. A few words appear extremely often while most words appear rarely. N-grams follow the same pattern, but even more extremely.

Zipf's Law

Zipf's law states that in a corpus of natural language, the frequency of any word is inversely proportional to its rank in the frequency table. The most frequent word appears roughly twice as often as the second most frequent, three times as often as the third, and so on.
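
A rough empirical check is to multiply each top word's frequency by its rank; under Zipf's law the products stay in the same ballpark. The sketch below assumes the Emma tokens from the previous cell:

Code
# rank * frequency is roughly constant if Zipf's law holds
word_freqs = Counter(tokens)
for rank, (word, freq) in enumerate(word_freqs.most_common(10), start=1):
    print(f"rank {rank:2d}  {word:10s} freq={freq:6,d}  rank*freq={rank * freq:,}")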

In[19]:
Code
# Analyze frequency distribution for different n-gram orders
def analyze_frequency_distribution(tokens, n, top_k=10):
    """Compute frequency statistics for n-grams."""
    ngrams = extract_ngrams(tokens, n)
    counts = Counter(ngrams)

    # Get frequency of frequencies
    freq_of_freq = Counter(counts.values())

    # Calculate statistics
    total = len(ngrams)
    unique = len(counts)
    hapax = sum(1 for c in counts.values() if c == 1)  # N-grams appearing once

    return {
        "total": total,
        "unique": unique,
        "hapax": hapax,
        "hapax_ratio": hapax / unique,
        "top_k": counts.most_common(top_k),
        "freq_of_freq": freq_of_freq,
    }


# Analyze unigrams, bigrams, and trigrams
stats = {n: analyze_frequency_distribution(tokens, n) for n in [1, 2, 3]}

The hapax ratio (proportion of n-grams appearing only once) increases dramatically with n. For trigrams, the vast majority appear exactly once. This sparsity is the core challenge of n-gram models: most n-grams you encounter in new text won't exist in your training data.
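
A short sketch (using the stats dictionary computed above) makes the trend visible:

Code
# How the hapax ratio grows with n
for n, s in stats.items():
    print(
        f"n={n}: {s['unique']:>8,} unique n-grams, "
        f"{s['hapax']:>8,} hapax ({s['hapax_ratio']:.0%} appear only once)"
    )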

These numbers make the sparsity problem concrete. With trigrams, you're essentially building a model where roughly four out of five features appear only once in training, which severely limits the model's ability to generalize to new text.

Character N-grams for Robustness

Word n-grams assume clean, correctly spelled text. But real-world text contains typos, spelling variations, and out-of-vocabulary words. Character n-grams offer an alternative that's robust to these issues.

Character N-gram

A character n-gram is a contiguous sequence of n characters from a text, including spaces and punctuation. Character n-grams can capture subword patterns, making them robust to spelling variations and useful for languages with rich morphology.

In[26]:
Code
def extract_char_ngrams(text, n):
    """Extract character n-grams from text."""
    return [text[i : i + n] for i in range(len(text) - n + 1)]


# Compare word and character n-grams
word = "learning"
char_bigrams = extract_char_ngrams(word, 2)
char_trigrams = extract_char_ngrams(word, 3)

# Show how typos affect matching
correct = "learning"
typo = "lerning"  # Missing 'a'

correct_trigrams = set(extract_char_ngrams(correct, 3))
typo_trigrams = set(extract_char_ngrams(typo, 3))
overlap = correct_trigrams & typo_trigrams

Despite the typo, a large share of the character trigrams still match, even though the word-level tokens do not match at all. The sketch after this list turns that overlap into a similarity score. This partial matching is why character n-grams excel at:

  • Spelling correction: Finding similar words despite typos
  • Language identification: Different languages have characteristic character patterns
  • Author attribution: Writing style shows up in character-level patterns
  • Morphologically rich languages: Capturing word stems and affixes
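
To turn the trigram overlap into a single similarity score, one option is the Jaccard index over trigram sets (a minimal sketch, reusing extract_char_ngrams from above):

Code
def trigram_jaccard(a, b):
    """Jaccard similarity between the character trigram sets of two strings."""
    set_a = set(extract_char_ngrams(a, 3))
    set_b = set(extract_char_ngrams(b, 3))
    return len(set_a & set_b) / len(set_a | set_b)


# The misspelling keeps a substantial share of the correct word's trigrams,
# while an unrelated word shares none.
print(trigram_jaccard("learning", "lerning"))
print(trigram_jaccard("learning", "networks"))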
In[29]:
Code
# Language identification using character trigrams
samples = {
    "English": "The quick brown fox jumps over the lazy dog",
    "French": "Le renard brun rapide saute par-dessus le chien paresseux",
    "German": "Der schnelle braune Fuchs springt über den faulen Hund",
    "Spanish": "El rápido zorro marrón salta sobre el perro perezoso",
}

# Extract character trigram profiles
profiles = {}
for lang, text in samples.items():
    trigrams = extract_char_ngrams(text.lower(), 3)
    profiles[lang] = Counter(trigrams).most_common(10)

Each language has distinctive character trigram signatures. German shows "sch" and "ber", French has "ard" and "eux", and Spanish features "rro" and "bre". These patterns form the basis of language detection algorithms.
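
A toy classifier can score an unseen sentence against each profile by counting how many of its trigrams appear in that language's top list. With profiles built from single sentences the scores are noisy, but the same idea scales to large training corpora. A sketch reusing the profiles dictionary from above:

Code
def identify_language(text, profiles):
    """Return the language whose top-trigram profile overlaps the text most."""
    text_trigrams = set(extract_char_ngrams(text.lower(), 3))
    scores = {
        lang: len(text_trigrams & {tri for tri, _ in top})
        for lang, top in profiles.items()
    }
    return max(scores, key=scores.get), scores


best, scores = identify_language("Der Hund springt über den Zaun", profiles)
print(best, scores)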

Skip-grams: Flexible Context Windows

Standard n-grams require strict adjacency: every word must be immediately next to its neighbors. Skip-grams relax this constraint by allowing gaps between words. A skip-gram with k skips can have up to k words between any two selected words.

Skip-gram

A skip-gram is a generalization of n-grams that allows non-adjacent words to form a sequence. A k-skip-n-gram includes all subsequences of n words where the gap between any two consecutive selected words is at most k.

In[34]:
Code
def extract_skip_bigrams(tokens, max_skip):
    """Extract skip-bigrams with up to max_skip words between."""
    skip_bigrams = []
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + max_skip + 2, len(tokens))):
            skip_bigrams.append((tokens[i], tokens[j]))
    return skip_bigrams


# Compare regular bigrams with skip-bigrams
sentence = "the cat sat on the mat"
tokens = sentence.split()

regular_bigrams = extract_ngrams(tokens, 2)
skip_1_bigrams = extract_skip_bigrams(tokens, 1)
skip_2_bigrams = extract_skip_bigrams(tokens, 2)

Skip-grams capture relationships between words that aren't immediately adjacent. In "the cat sat on the mat", regular bigrams miss the relationship between "cat" and "on", but skip-bigrams capture it. This is useful when:

  • Word order varies: Different phrasings of the same idea
  • Modifiers intervene: "the big fluffy cat" vs "the cat"
  • Long-distance dependencies: Subject-verb agreement across clauses

The trade-off is a larger vocabulary. Skip-grams with k skips produce more combinations than regular n-grams, exacerbating the vocabulary explosion.

In[37]:
Code
# Count vocabulary sizes for different skip values
skip_vocab_sizes = {}
for max_skip in range(4):
    if max_skip == 0:
        all_skipgrams = []
        for doc in corpus:
            doc_tokens = doc.lower().split()
            all_skipgrams.extend(extract_ngrams(doc_tokens, 2))
    else:
        all_skipgrams = []
        for doc in corpus:
            doc_tokens = doc.lower().split()
            all_skipgrams.extend(extract_skip_bigrams(doc_tokens, max_skip))

    skip_vocab_sizes[max_skip] = len(set(all_skipgrams))

The vocabulary growth with skip distance is substantial. Even with a small corpus, skip-3 bigrams produce significantly more unique pairs than regular bigrams. In larger corpora, this effect compounds, making high-skip models memory-intensive.

N-gram Indexing for Search

N-grams enable efficient approximate string matching and search. By indexing documents by their n-grams, you can quickly find documents containing similar phrases even with minor variations.

In[41]:
Code
from collections import defaultdict


class NGramIndex:
    """Simple n-gram based search index."""

    def __init__(self, n=3):
        self.n = n
        self.index = defaultdict(set)
        self.documents = {}

    def add_document(self, doc_id, text):
        """Index a document by its n-grams."""
        self.documents[doc_id] = text
        tokens = text.lower().split()
        ngrams = extract_ngrams(tokens, self.n)
        for ngram in ngrams:
            self.index[ngram].add(doc_id)

    def search(self, query, min_matches=1):
        """Find documents containing query n-grams."""
        query_tokens = query.lower().split()
        query_ngrams = extract_ngrams(query_tokens, self.n)

        # Count matches per document
        doc_scores = Counter()
        for ngram in query_ngrams:
            for doc_id in self.index.get(ngram, []):
                doc_scores[doc_id] += 1

        # Filter by minimum matches
        results = [
            (doc_id, score)
            for doc_id, score in doc_scores.items()
            if score >= min_matches
        ]
        return sorted(results, key=lambda x: -x[1])


# Build index
index = NGramIndex(n=2)
documents = [
    "machine learning algorithms",
    "deep learning neural networks",
    "machine learning for natural language processing",
    "statistical machine translation",
    "reinforcement learning agents",
]

for i, doc in enumerate(documents):
    index.add_document(i, doc)

This inverted index approach is the foundation of many search engines and fuzzy matching systems. The n-gram index trades space for speed: storing all n-grams requires more memory, but queries become fast set intersection operations.
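
A quick query against the index shows the scoring in action (the query string is an arbitrary example):

Code
# Documents sharing more query bigrams rank higher
for doc_id, score in index.search("machine learning systems"):
    print(f"score={score}  {documents[doc_id]}")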

Using NLTK for N-gram Extraction

While we've implemented n-gram extraction from scratch, NLTK provides optimized utilities:

In[44]:
Code
from nltk import ngrams, bigrams, trigrams
from nltk.util import everygrams

# NLTK's n-gram functions
text = "the quick brown fox jumps over the lazy dog"
tokens = text.split()

# Extract using NLTK
nltk_bigrams = list(bigrams(tokens))
nltk_trigrams = list(trigrams(tokens))
nltk_ngrams_4 = list(ngrams(tokens, 4))

# everygrams extracts all n-grams from 1 to max_len
all_grams = list(everygrams(tokens, min_len=1, max_len=3))

NLTK also provides FreqDist for counting and analyzing n-gram frequencies:

In[47]:
Code
from nltk import FreqDist

# Count bigram frequencies across a corpus
all_corpus_bigrams = []
for doc in corpus:
    doc_tokens = doc.lower().split()
    all_corpus_bigrams.extend(bigrams(doc_tokens))

bigram_freq = FreqDist(all_corpus_bigrams)

Practical Considerations

When working with n-grams, several practical considerations affect your results:

Choosing n: Larger n captures more context but creates sparser data. For most applications:

  • Unigrams (n=1): Baseline, loses all word order
  • Bigrams (n=2): Good balance for most tasks
  • Trigrams (n=3): Richer context, common in language modeling
  • n > 3: Rarely used due to sparsity

Vocabulary pruning: Remove n-grams that appear too rarely (min_df) or too frequently (max_df). Rare n-grams add noise; frequent n-grams (like "of the") add little discriminative value.
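
In scikit-learn, for example, these thresholds correspond to the min_df and max_df parameters of CountVectorizer, combined with ngram_range to include bigrams (a minimal sketch, assuming scikit-learn is installed and reusing the small corpus from earlier):

Code
from sklearn.feature_extraction.text import CountVectorizer

# Keep unigrams and bigrams that appear in at least 2 documents
# but in no more than 90% of them
vectorizer = CountVectorizer(ngram_range=(1, 2), min_df=2, max_df=0.9)
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
print(X.shape)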

Padding: For some applications, you may want to add special start and end tokens to capture sentence boundaries:

In[50]:
Code
def extract_padded_ngrams(tokens, n, pad_symbol="<PAD>"):
    """Extract n-grams with padding for boundary awareness."""
    padded = [pad_symbol] * (n - 1) + tokens + [pad_symbol] * (n - 1)
    return extract_ngrams(padded, n)


sentence_tokens = ["the", "cat", "sat"]
padded_bigrams = extract_padded_ngrams(sentence_tokens, 2)
padded_trigrams = extract_padded_ngrams(sentence_tokens, 3)

Memory efficiency: For large corpora, store n-grams as hashed integers rather than string tuples. Use sparse matrix representations when building document-term matrices.
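
Feature hashing is one way to avoid storing the n-gram vocabulary at all. scikit-learn's HashingVectorizer, for instance, maps each n-gram to one of a fixed number of columns in a sparse matrix (a sketch under the same assumption that scikit-learn is available):

Code
from sklearn.feature_extraction.text import HashingVectorizer

# Hash word uni- and bigrams into a fixed-size sparse feature space;
# no vocabulary dictionary is kept, so memory stays bounded.
hasher = HashingVectorizer(ngram_range=(1, 2), n_features=2**18)
X = hasher.transform(corpus)

print(X.shape)  # (n_documents, 2**18)
print(X.nnz)    # number of non-zero entries actually stored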

Limitations and Trade-offs

N-grams improve on bag of words by capturing local context, but they have significant limitations:

Vocabulary explosion: The number of unique n-grams grows exponentially with n, making high-order n-grams impractical for most applications.

Sparsity: Most n-grams appear rarely. In new text, you'll frequently encounter n-grams not seen during training.

Fixed context window: N-grams capture exactly n consecutive words, missing both shorter and longer patterns. The phrase "not very good" is meaningful as a trigram, but its sentiment depends on understanding that "not" negates "good" across the intervening word.

No semantic understanding: "Excellent movie" and "great film" share no n-grams despite having similar meanings. N-grams are purely syntactic.

Data requirements: Higher-order n-grams require exponentially more training data to estimate reliably. Trigram language models need millions of words; 5-gram models need billions.

Impact and Applications

Despite these limitations, n-grams remain foundational in NLP:

Language modeling: N-gram language models estimate the probability of word sequences. Before neural networks, trigram models dominated speech recognition and machine translation.

Text classification: Adding bigrams and trigrams to bag-of-words features often improves classification accuracy by capturing phrases.

Spell checking and autocomplete: Character n-gram similarity identifies likely corrections. Word n-grams predict the next word.

Plagiarism detection: Matching n-gram fingerprints identifies copied text even with minor modifications.

Information retrieval: N-gram indexing enables fast fuzzy matching and phrase search.

The transition from n-grams to neural models didn't make n-grams obsolete. Modern subword tokenizers like BPE and WordPiece are essentially learned character n-gram vocabularies. Understanding n-grams provides intuition for why these newer methods work.

Key Functions and Parameters

When working with n-grams in Python, these are the essential functions and their most important parameters:

nltk.ngrams(sequence, n, pad_left=False, pad_right=False)

  • sequence: The input tokens (list of words or characters)
  • n: The size of the n-gram (2 for bigrams, 3 for trigrams, etc.)
  • pad_left, pad_right: Whether to add padding symbols at boundaries. Useful for language modeling where start/end context matters

nltk.bigrams(sequence) and nltk.trigrams(sequence)

  • Convenience functions equivalent to ngrams(sequence, 2) and ngrams(sequence, 3)
  • Return generators, so wrap in list() if you need to iterate multiple times

nltk.everygrams(sequence, min_len=1, max_len=-1)

  • min_len: Minimum n-gram size to include
  • max_len: Maximum n-gram size (-1 means use sequence length)
  • Useful when you want to combine multiple n-gram orders in a single feature set

collections.Counter(iterable)

  • Essential for counting n-gram frequencies
  • .most_common(n): Returns the n most frequent items
  • Supports arithmetic operations for combining counts across documents

Custom extraction parameters:

  • max_skip: For skip-grams, controls how many words can be skipped between selected words. Higher values capture more distant relationships but increase vocabulary size
  • pad_symbol: The token used for boundary padding (commonly <s>, </s>, or <PAD>)
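
Putting a few of these together in one place (a short usage sketch; the padding symbols are arbitrary choices):

Code
from nltk import ngrams
from nltk.util import everygrams

sent = ["the", "cat", "sat"]

# Bigrams with explicit sentence-boundary padding
padded = list(ngrams(sent, 2, pad_left=True, pad_right=True,
                     left_pad_symbol="<s>", right_pad_symbol="</s>"))
print(padded)  # includes ('<s>', 'the') and ('sat', '</s>')

# All n-grams from length 1 through 3 in a single pass
print(list(everygrams(sent, min_len=1, max_len=3)))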

Summary

N-grams extend bag of words by preserving local word order. Key takeaways:

  • N-grams are contiguous sequences of n tokens (words or characters)
  • Bigrams capture immediate word relationships; trigrams add more context
  • Vocabulary explosion: Unique n-grams grow exponentially with n
  • Zipf's law applies to n-grams, with most appearing only once
  • Character n-grams provide robustness to typos and work across languages
  • Skip-grams relax adjacency requirements to capture flexible patterns
  • N-gram indexing enables fast approximate search and matching
  • Trade-offs: More context requires more data and creates sparser representations

N-grams bridge the gap between treating words as isolated units and understanding them in context. In the next chapter, we'll explore TF-IDF, which adds statistical weighting to distinguish informative terms from common ones.
