Special Tokens in Transformers: CLS, SEP, PAD, MASK & More

Michael Brenndoerfer · December 15, 2025 · 27 min read

Learn how special tokens like [CLS], [SEP], [PAD], and [MASK] structure transformer inputs. Understand token type IDs, attention masks, and custom tokens.


Special Tokens

Introduction

Special tokens are the invisible orchestrators of modern language models. While subword tokenization handles the vocabulary problem for regular words, special tokens serve a fundamentally different purpose: they provide structural signals that guide how models process, understand, and generate text. These tokens do not represent words or subwords from the training corpus. Instead, they carry meta-information about sequence boundaries, padding, masking, and task-specific markers.

When you feed text into BERT, GPT, or any transformer-based model, the tokenizer doesn't just convert words to numbers. It wraps your input in a scaffold of special tokens. The [CLS] token tells BERT where to look for a sentence-level representation. The [SEP] token marks boundaries between segments. The [PAD] token fills sequences to uniform length. Without these structural markers, transformers would struggle to know where sentences begin, where they end, and which parts to attend to.

This chapter explores the taxonomy of special tokens, their roles in different architectures, and how to work with them effectively. You'll learn not just what each token does, but why it exists and how its design reflects the underlying model's training objectives.

Technical Deep Dive

To understand special tokens, we must first recognize a fundamental problem: transformer models operate on sequences of embeddings, but they have no inherent understanding of where a sentence begins, ends, or how multiple sentences relate to each other. Raw token sequences carry content but lack structure. Special tokens solve this by embedding structural information directly into the input. They are reserved vocabulary positions that carry meta-information rather than linguistic content.

Unlike subword tokens that emerge organically from frequency-based algorithms like BPE or WordPiece, special tokens are manually defined before training begins. They occupy fixed positions in the vocabulary (typically at the very beginning) and are never merged or split during tokenization. This deliberate design ensures they retain their intended structural meaning throughout training and inference.

The Core Special Tokens

Modern language models share a common vocabulary of structural markers, though the exact symbols vary between architectures. Let's examine each one, understanding not just what it does, but why it's needed.

The Classification Token [CLS]

Consider the challenge of sequence classification. You have a sentence like "I loved this movie!" and need a single vector representation to feed into a classifier. But transformers produce one embedding per token. Where should you look for the "meaning" of the whole sentence?

Classification Token [CLS]

A special token prepended to every input sequence. The hidden state corresponding to this token is used as the aggregate sequence representation for classification tasks. During pre-training, the model learns to encode sentence-level information into this position.

The [CLS] token provides a dedicated aggregation point. Because self-attention allows every token to attend to every other token, information from the entire sequence flows into the [CLS] position. By the final layer, the [CLS] embedding has "seen" the whole sentence and can represent its overall meaning. This is more principled than alternatives like averaging all token embeddings, because the model learns during pre-training exactly what information to aggregate.
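As a minimal sketch of how this plays out in practice (using the Hugging Face transformers library, with a randomly initialized linear head standing in for whatever classifier you would actually fine-tune), the [CLS] representation is simply the final hidden state at position 0:

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("I loved this movie!", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# The [CLS] token is always at position 0; its final hidden state
# serves as the aggregate sequence representation.
cls_embedding = outputs.last_hidden_state[:, 0, :]   # shape: (1, 768)

# A hypothetical classification head on top of the [CLS] vector.
classifier = torch.nn.Linear(model.config.hidden_size, 2)
logits = classifier(cls_embedding)                    # shape: (1, 2)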

The Separator Token [SEP]

Many NLP tasks involve comparing two pieces of text: question answering (question + context), natural language inference (premise + hypothesis), or semantic similarity (sentence A + sentence B). How does the model know where one ends and the other begins?

Separator Token [SEP]

Marks boundaries between segments in multi-segment inputs. For tasks like question answering or natural language inference, the input contains two distinct text segments. The [SEP] token tells the model where one segment ends and another begins.

The [SEP] token acts as a visible boundary marker. When the model sees [CLS] question tokens [SEP] context tokens [SEP], it can learn that tokens before the first [SEP] are the question and tokens after are the context. This explicit boundary enables the model to learn different attention patterns for cross-segment reasoning.

The Padding Token [PAD]

GPU computation is most efficient when processing batches of sequences simultaneously. But what if your batch contains sentences of different lengths? You can't have a jagged tensor: all sequences must have the same length.

Padding Token [PAD]

Fills sequences to a uniform length within a batch. Since transformers process batches of sequences in parallel, all sequences must have the same length. Shorter sequences are padded, and attention masks ensure the model ignores these padding positions.

The [PAD] token fills shorter sequences to match the longest one in the batch. But padding introduces a problem: we don't want the model to attend to these meaningless positions. This is where attention masks become essential (we'll explore this mechanism shortly).

The Mask Token [MASK]

How do you train a language model without labeled data? BERT's innovation was masked language modeling: hide some words and train the model to predict them from context. But you need a placeholder for the hidden words.

Mask Token [MASK]

Used exclusively during masked language modeling pre-training. A percentage of input tokens are replaced with [MASK], and the model learns to predict the original tokens. This creates a self-supervised learning signal without requiring labeled data.

The [MASK] token signals "predict what goes here." During pre-training, about 15% of tokens are selected for prediction, most of them replaced with [MASK], and the model's objective is to reconstruct the original vocabulary items. This forces the model to build rich contextual representations, since it must understand the surrounding context deeply enough to infer the missing word.
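The masking step itself is easy to sketch. The snippet below is a simplified illustration rather than BERT's exact recipe (which, for the selected positions, sometimes keeps the original token or swaps in a random one instead of always using [MASK]):

import random
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
input_ids = tokenizer.encode("The quick brown fox jumps over the lazy dog.")

masked_ids = list(input_ids)
labels = [-100] * len(input_ids)  # -100 is ignored by the loss in Hugging Face

for i, token_id in enumerate(input_ids):
    if token_id in tokenizer.all_special_ids:
        continue  # never mask [CLS], [SEP], etc.
    if random.random() < 0.15:
        labels[i] = token_id                   # the model must predict this token
        masked_ids[i] = tokenizer.mask_token_id

print(tokenizer.decode(masked_ids))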

The Unknown Token [UNK]

What happens when the tokenizer encounters something it cannot represent? While subword tokenization can theoretically decompose any string into known pieces, edge cases exist.

Unknown Token [UNK]

A fallback for tokens that cannot be represented by the vocabulary. While subword tokenization can theoretically handle any input by breaking it into smaller pieces, some tokenizers may encounter truly unknown characters or sequences.

The [UNK] token is a fallback for characters or sequences the tokenizer cannot handle. Modern byte-level tokenizers rarely need it, but it remains a safety net.
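You can probe the fallback directly. Whether a particular character actually triggers [UNK] depends on the vocabulary, so treat the emoji below as a plausible candidate for bert-base-uncased rather than a guaranteed result:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# A character outside the WordPiece vocabulary likely falls back to [UNK];
# ordinary English text should tokenize without it.
print(tokenizer.tokenize("hello 🦜"))
print(tokenizer.tokenize("hello world"))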

Input Formats Across Architectures

Different model architectures use special tokens in characteristic patterns. Understanding these patterns helps you correctly format inputs for any model.

Encoder-only models (BERT, RoBERTa, ALBERT)

These models process text bidirectionally, attending to both past and future tokens. Their input format follows a strict template:

For single-segment tasks (like sentiment classification):

[\text{CLS}] + \text{tokens} + [\text{SEP}]

For two-segment tasks (like question answering):

[\text{CLS}] + \text{segment}_1 + [\text{SEP}] + \text{segment}_2 + [\text{SEP}]

The [CLS] always appears first, providing the aggregation point. Each segment ends with [SEP]. This consistent structure allows the model to learn reliable positional expectations during pre-training.

Decoder-only models (GPT-2, GPT-3, LLaMA)

Autoregressive models generate text left-to-right and need different markers. They typically use:

  • Beginning-of-sequence (BOS): Signals the start of generation
  • End-of-sequence (EOS): Signals when to stop generating

The naming varies across implementations:

Beginning and end-of-sequence tokens across model families. GPT-2 uses the same token for both roles.

Model      BOS Token        EOS Token
GPT-2      <|endoftext|>    <|endoftext|>
LLaMA      <s>              </s>
Generic    <bos>            <eos>

GPT-2's use of the same token for both beginning and end reflects its design: text documents are simply concatenated with <|endoftext|> between them during training.
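A minimal sketch of that convention, with two made-up documents, looks like this:

from transformers import GPT2Tokenizer

gpt2_tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Illustrative documents; in pre-training these would be full web documents.
documents = ["First document text.", "Second document text."]

# Concatenate documents into one token stream, with <|endoftext|>
# acting as the boundary between (and after) documents.
stream = gpt2_tokenizer.eos_token.join(documents) + gpt2_tokenizer.eos_token
token_ids = gpt2_tokenizer.encode(stream)
print(gpt2_tokenizer.decode(token_ids))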

Encoder-decoder models (T5, BART)

These models combine conventions: the encoder uses separator-style tokens, while the decoder uses beginning/end tokens. T5 introduces additional sentinel tokens (<extra_id_0>, <extra_id_1>, etc.) for its span corruption pre-training objective.
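To make the sentinel idea concrete, here is a small sketch of what a span-corruption pair might look like for T5. The sentence and spans are invented for illustration, and loading t5-small requires the sentencepiece package:

from transformers import T5Tokenizer

t5_tokenizer = T5Tokenizer.from_pretrained("t5-small")

# Span corruption: masked spans in the input are replaced by sentinels,
# and the target lists each sentinel followed by the span it replaced.
corrupted_input = "The <extra_id_0> sat on the <extra_id_1>."
target = "<extra_id_0> cat <extra_id_1> mat <extra_id_2>"

print(t5_tokenizer.tokenize(corrupted_input))  # sentinels stay intact as single tokens
print(t5_tokenizer.convert_tokens_to_ids(["<extra_id_0>", "<extra_id_1>"]))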

Token Type IDs: Distinguishing Segments

Knowing where segments end (via [SEP]) isn't enough. The model also needs to know which segment each token belongs to. Token type IDs (also called segment IDs) provide this information explicitly.

Consider processing "The cat sat." and "A dog barked." as a sentence pair:

Tokens:          [CLS]  The  cat  sat  .  [SEP]  A  dog  barked  .  [SEP]
Token Type IDs:    0     0    0    0   0    0    1   1     1     1    1

The pattern is straightforward:

\text{token\_type\_ids} = [0, 0, \ldots, 0, 0, 1, 1, \ldots, 1, 1]

where:

  • All tokens in segment 1 (including [CLS] and the first [SEP]) receive type ID 0
  • All tokens in segment 2 (including the final [SEP]) receive type ID 1

These IDs are converted to learned embeddings, giving the model explicit information about segment membership. This allows the model to learn different behaviors for tokens depending on which segment they belong to, which matters for tasks like determining if one sentence entails another.

The Complete Input Embedding

With all the pieces in place, we can now understand how a token's position, identity, and segment membership combine into a single representation. For each position i in the input sequence, the model computes a total embedding by summing three components:

E_i = E_{\text{token}}(x_i) + E_{\text{position}}(i) + E_{\text{segment}}(\text{type}_i)

where:

  • E_{\text{token}}(x_i): the embedding for token x_i, looked up from the vocabulary embedding table. This captures the token's semantic meaning.
  • E_{\text{position}}(i): the positional embedding for position i. This encodes where the token appears in the sequence.
  • E_{\text{segment}}(\text{type}_i): the segment embedding for the token type at position i. This distinguishes which segment the token belongs to.

All three embedding tables are learned during pre-training. The additive combination allows each type of information to influence the representation while keeping the model architecture simple.

Out[3]:
Visualization
Composition of BERT input embeddings. Each token position receives the sum of three learned embeddings: the token embedding (semantic meaning), the position embedding (sequential position), and the segment embedding (which segment the token belongs to). This additive combination creates a rich representation before any transformer layers process the sequence.

The visualization shows how each token's input representation combines three distinct signals. Token embeddings (blue) carry semantic meaning and vary based on what word appears at each position. Position embeddings (red) encode sequential order, gradually increasing as we move through the sequence. Segment embeddings (green) distinguish the two segments, with a visible boundary at position 4 where the second segment begins.
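The same sum can be reproduced directly from the embedding tables inside Hugging Face's BertModel. The attribute names below (model.embeddings.word_embeddings and friends) follow the current transformers implementation of BERT; treat them as an assumption if your version differs:

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

enc = tokenizer("The cat sat.", "A dog barked.", return_tensors="pt")
input_ids = enc["input_ids"]
token_type_ids = enc["token_type_ids"]
position_ids = torch.arange(input_ids.size(1)).unsqueeze(0)

emb = model.embeddings
total = (
    emb.word_embeddings(input_ids)                 # E_token
    + emb.position_embeddings(position_ids)        # E_position
    + emb.token_type_embeddings(token_type_ids)    # E_segment
)
# BERT additionally applies LayerNorm and dropout to this sum
# before the first transformer layer.
print(total.shape)  # (1, sequence_length, hidden_size)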

Attention Masks: Neutralizing Padding

Padding solves the variable-length problem but creates a new one: how do we prevent the model from attending to these meaningless positions? The answer is the attention mask, a binary vector that marks which tokens are real:

\text{attention\_mask} = [1, 1, 1, 1, 1, 0, 0, 0]

where:

  • 1 indicates a real token that should participate in attention
  • 0 indicates padding that should be ignored

The mechanism works by modifying the attention computation. Recall that self-attention computes a weighted combination of values based on query-key compatibility. Before applying softmax to the attention scores, we add a mask matrix MM:

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}} + M\right)V

where:

  • Q, K, V are the query, key, and value matrices
  • d_k is the key dimension (for scaling stability)
  • M is the mask matrix with values of 0 for real tokens and -\infty for padding

Why negative infinity? When you add -\infty to an attention score and then apply softmax, the exponential of negative infinity is zero:

\text{softmax}(-\infty) = \frac{e^{-\infty}}{\sum e^{\text{scores}}} = \frac{0}{\sum e^{\text{scores}}} = 0

This mathematically eliminates padding positions from the attention computation. Real tokens never "see" padding, ensuring that representations are computed purely from meaningful content.

Out[4]:
Visualization
Attention weights before masking. Attention is distributed across all positions including padding tokens, which would corrupt representations if left uncorrected.
Attention weights after masking with -∞. Padding positions receive exactly zero attention weight. Red dashed boxes highlight the masked regions that no longer participate in computation.

The heatmaps demonstrate the masking mechanism. Before masking (left), attention is distributed across all positions including padding. After adding -\infty to padding positions and applying softmax (right), those columns and rows become exactly zero. Real tokens only attend to other real tokens, while padding positions are completely isolated from the computation.
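A few lines of PyTorch make the arithmetic concrete; this is a standalone numeric sketch with made-up scores, not code taken from a model:

import torch
import torch.nn.functional as F

scores = torch.tensor([2.0, 1.0, 0.5, 3.0, 1.5])   # raw attention scores
attention_mask = torch.tensor([1, 1, 1, 0, 0])      # last two positions are padding

# Convert the 0/1 mask into additive form: 0 for real tokens, -inf for padding.
additive_mask = torch.where(
    attention_mask.bool(), torch.tensor(0.0), torch.tensor(float("-inf"))
)

weights = F.softmax(scores + additive_mask, dim=-1)
print(weights)  # padding positions receive exactly 0.0 attention weight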

A Worked Example

Now that we understand the theory, let's trace through the complete tokenization process step by step. We'll see exactly how special tokens, token type IDs, and attention masks come together to create a properly formatted input.

Single-Segment Example: Sentiment Classification

Consider classifying the sentiment of: "I loved this movie!"

Step 1: Subword Tokenization

First, the text is converted to subword tokens using the model's vocabulary:

["i", "loved", "this", "movie", "!"]

At this point, we have content tokens but no structure. The model wouldn't know where the sentence starts or ends.

Step 2: Adding Special Tokens

The tokenizer wraps the sequence with structural markers:

["[CLS]", "i", "loved", "this", "movie", "!", "[SEP]"]

Now the model has:

  • A dedicated position ([CLS]) for aggregating sequence-level meaning
  • A clear boundary marker ([SEP]) indicating the sequence is complete

Step 3: Converting to IDs

Each token is mapped to its vocabulary ID. Special tokens occupy fixed positions at the start of the vocabulary:

[101, 1045, 2866, 2023, 3185, 999, 102]

Here, 101 is always [CLS] and 102 is always [SEP] in BERT's vocabulary. These fixed IDs ensure consistent behavior across all inputs.

Step 4: Creating Auxiliary Tensors

The tokenizer generates two additional tensors that guide the model's attention:

Auxiliary tensors for a single-segment input. No padding needed means all attention mask values are 1.
Tensor            Values                   Meaning
Attention mask    [1, 1, 1, 1, 1, 1, 1]    All positions contain real tokens
Token type IDs    [0, 0, 0, 0, 0, 0, 0]    Single segment (all zeros)

With no padding needed, the attention mask is all ones. With only one segment, all token type IDs are zero.

Two-Segment Example: Natural Language Inference

Now consider a more complex case. Natural language inference requires comparing two sentences:

  • Premise: "The cat sat on the mat."
  • Hypothesis: "A feline was resting."

The model must determine their relationship (entailment, contradiction, or neutral).

Step 1-2: Tokenization with Special Tokens

Both segments are tokenized and joined with appropriate markers:

["[CLS]", "the", "cat", "sat", "on", "the", "mat", ".", "[SEP]", "a", "fe", "##line", "was", "resting", ".", "[SEP]"]

Notice several things:

  • [CLS] starts the entire input
  • The first [SEP] separates premise from hypothesis
  • The second [SEP] marks the end of the hypothesis
  • "feline" is split into ["fe", "##line"] by WordPiece (the ## indicates continuation)

Step 3: Token Type IDs

This is where segment distinction matters:

Position:     0     1    2    3    4   5    6    7    8     9   10    11     12    13      14   15
Token:      [CLS]  the  cat  sat  on  the  mat   .  [SEP]   a   fe  ##line   was  resting   .  [SEP]
Token Type:   0     0    0    0    0   0    0    0    0     1    1     1      1     1       1    1

The first 9 positions (premise including its [SEP]) have type 0. The remaining positions (hypothesis) have type 1. This explicit segmentation enables the model to learn different reasoning patterns for cross-segment comparison.

Step 4: Attention Mask

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

All tokens are real, so all positions get attention weight 1.

Adding Padding

What if we're batching multiple inputs of different lengths? Consider adding a shorter sentence to our batch:

  • Input 1: "The cat sat on the mat." + "A feline was resting." (16 tokens)
  • Input 2: "Hello world!" (5 tokens after adding [CLS] and [SEP])

To process these together, Input 2 must be padded to length 16:

Input 2 tokens: ["[CLS]", "hello", "world", "!", "[SEP]", "[PAD]", "[PAD]", ..., "[PAD]"]
Attention mask: [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

The attention mask now contains zeros for padding positions, telling the model to ignore them during attention computation. Without this mask, the model would attend to padding and produce corrupted representations.

Code Implementation

Having understood the theory and traced through examples by hand, let's now implement and explore special tokens using the Hugging Face transformers library. We'll verify our understanding through code, examining how tokenizers handle special tokens internally and how to work with them in practice.

Examining BERT's Special Tokens

Let's start by inspecting BERT's special token configuration to confirm what we learned in the theory section:

In[5]:
Code
from transformers import BertTokenizer

# Load BERT tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Get special tokens
special_tokens = {
    "cls_token": tokenizer.cls_token,
    "sep_token": tokenizer.sep_token,
    "pad_token": tokenizer.pad_token,
    "mask_token": tokenizer.mask_token,
    "unk_token": tokenizer.unk_token,
}

# Get their IDs
special_token_ids = {
    name: tokenizer.convert_tokens_to_ids(token)
    for name, token in special_tokens.items()
}
Out[6]:
Console
BERT Special Tokens:
----------------------------------------
cls_token       [CLS]      ID: 101
sep_token       [SEP]      ID: 102
pad_token       [PAD]      ID: 0
mask_token      [MASK]     ID: 103
unk_token       [UNK]      ID: 100

As expected, BERT reserves the first vocabulary positions for special tokens. The [PAD] token at ID 0 follows a common convention: zero-padding is the default behavior in many frameworks, so placing [PAD] at index 0 aligns with standard tensor initialization.

Tokenizing with Special Tokens

Now let's observe special token addition in action. We'll encode text with and without special tokens to see exactly what the tokenizer adds:

In[7]:
Code
text = "Special tokens guide transformer attention."

# Encode with and without special tokens
tokens_with_special = tokenizer.encode(text, add_special_tokens=True)
tokens_without_special = tokenizer.encode(text, add_special_tokens=False)

# Decode to see the difference
decoded_with = tokenizer.decode(tokens_with_special)
decoded_without = tokenizer.decode(tokens_without_special)

# Get detailed encoding
encoding = tokenizer(text, return_tensors=None)
Out[8]:
Console
Original text: 'Special tokens guide transformer attention.'

Without special tokens:
  IDs: [2569, 19204, 2015, 5009, 10938, 2121, 3086, 1012]
  Decoded: 'special tokens guide transformer attention.'

With special tokens:
  IDs: [101, 2569, 19204, 2015, 5009, 10938, 2121, 3086, 1012, 102]
  Decoded: '[CLS] special tokens guide transformer attention. [SEP]'

Full encoding:
  input_ids: [101, 2569, 19204, 2015, 5009, 10938, 2121, 3086, 1012, 102]
  token_type_ids: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
  attention_mask: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

The comparison reveals exactly what the tokenizer adds: [CLS] (ID 101) at the beginning and [SEP] (ID 102) at the end. The add_special_tokens=True flag (the default) triggers this wrapping automatically. Notice that the full encoding includes all three components we discussed: input IDs, attention mask (all ones for real tokens), and token type IDs (all zeros for a single segment).

Two-Segment Encoding

Let's now verify our understanding of segment handling by encoding a sentence pair:

In[9]:
Code
premise = "The weather is sunny today."
hypothesis = "It's a beautiful day outside."

# Encode as a pair
pair_encoding = tokenizer(premise, hypothesis, return_tensors=None)

# Get token-level details
tokens = tokenizer.convert_ids_to_tokens(pair_encoding["input_ids"])

# Find segment boundaries
sep_positions = [i for i, t in enumerate(tokens) if t == "[SEP]"]
Out[10]:
Console
Two-segment encoding:
Premise: 'The weather is sunny today.'
Hypothesis: 'It's a beautiful day outside.'

Tokens: ['[CLS]', 'the', 'weather', 'is', 'sunny', 'today', '.', '[SEP]', 'it', "'", 's', 'a', 'beautiful', 'day', 'outside', '.', '[SEP]']
Input IDs: [101, 1996, 4633, 2003, 11559, 2651, 1012, 102, 2009, 1005, 1055, 1037, 3376, 2154, 2648, 1012, 102]
Token Type IDs: [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
Attention Mask: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

[SEP] positions: [7, 16]
Segment 1 (type 0): positions 0-7
Segment 2 (type 1): positions 8-16

The output confirms our worked example. Token type IDs switch from 0 to 1 exactly at the segment boundary, and both segments end with [SEP]. The model receives explicit information about which tokens belong to which segment, enabling it to learn appropriate cross-segment reasoning patterns.

Padding and Batching

Now let's examine the padding mechanism. We'll encode multiple sentences of different lengths to see how the tokenizer handles batching:

In[11]:
Code
sentences = [
    "Short sentence.",
    "This is a medium length sentence with more words.",
    "Tiny.",
]

# Pad to longest sequence in batch
batch_encoding = tokenizer(sentences, padding=True, return_tensors=None)

# Analyze padding
sequence_lengths = [sum(mask) for mask in batch_encoding["attention_mask"]]
max_length = len(batch_encoding["input_ids"][0])
Out[12]:
Console
Batch encoding with padding:
Max sequence length: 12

Sentence 1: 'Short sentence.'
  Real tokens: 5, Padding: 7
  Attention mask: [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
  Tokens: ['[CLS]', 'short', 'sentence', '.', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']

Sentence 2: 'This is a medium length sentence with more words.'
  Real tokens: 12, Padding: 0
  Attention mask: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
  Tokens: ['[CLS]', 'this', 'is', 'a', 'medium', 'length', 'sentence', 'with', 'more', 'words', '.', '[SEP]']

Sentence 3: 'Tiny.'
  Real tokens: 4, Padding: 8
  Attention mask: [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
  Tokens: ['[CLS]', 'tiny', '.', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]']

The output demonstrates the padding mechanism in action. Each sequence is padded to match the longest one (the medium sentence). The attention mask precisely tracks which positions are real (1) versus padding (0). Shorter sentences like "Tiny." have mostly zeros in their attention masks, ensuring the model ignores all those [PAD] tokens during attention computation.

The table below quantifies the computational cost of padding in our batch:

Out[13]:
Console
Sentence                                               Real tokens   Padding tokens   Wasted computation
'Short sentence.'                                           5              7                58%
'This is a medium length sentence with more words.'        12              0                 0%
'Tiny.'                                                     4              8                67%

This table reveals a hidden cost of padding. While the attention mask prevents padding from corrupting representations, the model still processes every padding token through all its layers, consuming memory and computation for no benefit. Sentence 3, with only 4 real tokens padded to length 12, wastes two-thirds of its allocated computation. Techniques like dynamic batching (grouping similar-length sequences) and sequence packing (concatenating multiple sequences) can significantly reduce this overhead in production systems.
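One simple mitigation is length-based bucketing: sort inputs by tokenized length before forming batches so that sequences of similar size end up together. The example texts below are made up for illustration; this is a sketch of the idea rather than a production dynamic-batching implementation:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

texts = [
    "Tiny.",
    "Short sentence.",
    "This is a medium length sentence with more words.",
    "Another fairly long example sentence for this batch.",
]

# Sort by tokenized length so similar-length texts land in the same batch,
# which keeps the amount of padding per batch small.
texts_by_length = sorted(texts, key=lambda t: len(tokenizer.encode(t)))

batch_size = 2
for i in range(0, len(texts_by_length), batch_size):
    batch = tokenizer(
        texts_by_length[i : i + batch_size], padding=True, return_tensors="pt"
    )
    mask = batch["attention_mask"]
    padding_fraction = 1.0 - mask.float().mean().item()
    print(f"batch shape {tuple(mask.shape)}, padding fraction {padding_fraction:.0%}")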

Comparing Tokenizers Across Models

We noted earlier that different architectures use different special token conventions. Let's verify this by comparing BERT, GPT-2, and T5:

In[14]:
Code
from transformers import GPT2Tokenizer, T5Tokenizer

# Load tokenizers
gpt2_tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
t5_tokenizer = T5Tokenizer.from_pretrained("t5-small")

# Collect special token info
tokenizers_info = {
    "BERT": {
        "tokenizer": tokenizer,
        "bos": getattr(tokenizer, "cls_token", None),
        "eos": getattr(tokenizer, "sep_token", None),
        "pad": tokenizer.pad_token,
        "unk": tokenizer.unk_token,
    },
    "GPT-2": {
        "tokenizer": gpt2_tokenizer,
        "bos": gpt2_tokenizer.bos_token,
        "eos": gpt2_tokenizer.eos_token,
        "pad": gpt2_tokenizer.pad_token,
        "unk": gpt2_tokenizer.unk_token,
    },
    "T5": {
        "tokenizer": t5_tokenizer,
        "bos": getattr(t5_tokenizer, "bos_token", None),
        "eos": t5_tokenizer.eos_token,
        "pad": t5_tokenizer.pad_token,
        "unk": t5_tokenizer.unk_token,
    },
}
Out[15]:
Console
Special Token Comparison Across Models:
-------------------------------------------------------
Model      BOS/CLS         EOS/SEP         PAD        UNK       
-------------------------------------------------------
BERT       [CLS]           [SEP]           [PAD]      [UNK]     
GPT-2      <|endoftext|>   <|endoftext|>   None       <|endoftext|>
T5         None            </s>            <pad>      <unk>     

The comparison reveals fundamental differences in architecture design. BERT uses [CLS] for sequence representation and [SEP] for boundaries, reflecting its bidirectional encoder nature. GPT-2 lacks a dedicated padding token, reflecting its original design for unpadded text generation. T5 uses </s> as both separator and end marker, consistent with its encoder-decoder design.

When fine-tuning GPT-2, you'll often need to set a padding token explicitly (commonly by reusing the EOS token, since GPT-2 wasn't designed for batched training).

Masked Language Modeling

Having explored structural tokens, let's now see the [MASK] token in action. This token is central to BERT's pre-training. Let's verify that a trained model can actually predict masked words:

In[16]:
Code
from transformers import BertForMaskedLM
import torch

# Load BERT for MLM
mlm_model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# Create a masked sentence
text = "The capital of France is Paris."
masked_text = "The capital of France is [MASK]."

# Tokenize
inputs = tokenizer(masked_text, return_tensors="pt")

# Get predictions
with torch.no_grad():
    outputs = mlm_model(**inputs)
    predictions = outputs.logits

# Find the masked position
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[
    1
]

# Get top predictions
mask_logits = predictions[0, mask_token_index, :]
top_tokens = torch.topk(mask_logits, 5, dim=1)
top_token_ids = top_tokens.indices[0].tolist()
top_token_probs = torch.softmax(mask_logits, dim=1)[0][top_token_ids].tolist()
Out[17]:
Console
Original: 'The capital of France is Paris.'
Masked: 'The capital of France is [MASK].'

Top 5 predictions for [MASK]:
  paris           probability: 0.4168
  lille           probability: 0.0714
  lyon            probability: 0.0634
  marseille       probability: 0.0444
  tours           probability: 0.0303
Out[18]:
Visualization
Top 10 predictions for the masked position in 'The capital of France is [MASK].' The model assigns high probability to 'paris' based on contextual understanding learned during pre-training. Geographic and factual knowledge emerges from seeing millions of similar patterns.

The model correctly predicts "paris" with high confidence. The probability distribution reveals how BERT has internalized factual knowledge through pre-training: it strongly favors the correct capital while assigning lower probabilities to plausible alternatives. This pattern of learning factual associations through self-supervised prediction is the foundation of how language models acquire world knowledge.

Adding Custom Special Tokens

Sometimes you need special tokens beyond the standard set. For example, a dialogue system might need speaker markers:

In[19]:
Code
from transformers import BertTokenizer

# Create a fresh tokenizer
custom_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Define custom special tokens
custom_tokens = {
    "additional_special_tokens": [
        "[SPEAKER1]",
        "[SPEAKER2]",
        "[SYSTEM]",
        "[ACTION]",
    ]
}

# Add them to the tokenizer
num_added = custom_tokenizer.add_special_tokens(custom_tokens)

# Test tokenization with custom tokens
dialogue = "[SPEAKER1] Hello! [SPEAKER2] Hi there! [ACTION] waves"
custom_tokens_list = custom_tokenizer.tokenize(dialogue)
custom_ids = custom_tokenizer.encode(dialogue, add_special_tokens=False)
Out[20]:
Console
Added 4 new special tokens
New vocabulary size: 30526

Custom special tokens and their IDs:
  [SPEAKER1]: 30522
  [SPEAKER2]: 30523
  [SYSTEM]: 30524
  [ACTION]: 30525

Dialogue: '[SPEAKER1] Hello! [SPEAKER2] Hi there! [ACTION] waves'
Tokens: ['[SPEAKER1]', 'hello', '!', '[SPEAKER2]', 'hi', 'there', '!', '[ACTION]', 'waves']
IDs: [30522, 7592, 999, 30523, 7632, 2045, 999, 30525, 5975]

The custom tokens receive IDs at the end of the existing vocabulary (30522 onwards in BERT's case). Each custom token is treated atomically during tokenization. Notice how [SPEAKER1] remains intact rather than being split.

One critical detail: when using a model with custom tokens, you must resize its embedding layer to accommodate the new vocabulary size. Otherwise, token IDs beyond the original vocabulary size will cause index errors:

In[33]:
Code
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")
# Required when using custom tokens with a model
model.resize_token_embeddings(len(custom_tokenizer))

Visualizing Special Token Positions

To consolidate our understanding, let's create a comprehensive visualization showing how all the components (tokens, token type IDs, and attention masks) work together:

Out[21]:
Visualization
Structure of a BERT input sequence showing special tokens, real tokens, and padding. The [CLS] token provides the sequence representation, [SEP] marks segment boundaries, and [PAD] fills to uniform length. Token type IDs distinguish the two segments.

The visualization brings together everything we've learned. The top row shows the actual tokens with color-coding: red for [CLS], orange for [SEP], blue for segment 1, green for segment 2, and gray for padding. The middle row displays token type IDs. Notice how they switch from 0 to 1 at the segment boundary. The bottom row shows the attention mask with clear 1s for real tokens and 0s for padding.

This three-component structure (tokens, token types, and attention mask) is the complete input specification that transformers expect. Every input you send to BERT or similar models must include all three.

Special Tokens in Generation

We've focused on encoder models, but special tokens play equally important roles in generation. The EOS token, in particular, signals when the model should stop producing output:

In[22]:
Code
from transformers import GPT2LMHeadModel

# Load GPT-2 model
gpt2_model = GPT2LMHeadModel.from_pretrained("gpt2")

# Set padding token for generation
gpt2_tokenizer.pad_token = gpt2_tokenizer.eos_token

# Generate text with explicit EOS handling
prompt = "The future of artificial intelligence is"
inputs = gpt2_tokenizer(prompt, return_tensors="pt")

# Generate with EOS token as stopping criterion
generated = gpt2_model.generate(
    **inputs,
    max_new_tokens=30,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    pad_token_id=gpt2_tokenizer.eos_token_id,
    eos_token_id=gpt2_tokenizer.eos_token_id,
)

generated_text = gpt2_tokenizer.decode(generated[0], skip_special_tokens=True)
generated_with_special = gpt2_tokenizer.decode(
    generated[0], skip_special_tokens=False
)
Out[23]:
Console
Prompt: 'The future of artificial intelligence is'

Generated (without special tokens):
  The future of artificial intelligence is already set to change, with artificial intelligence being born into the world's top universities and colleges.

A growing number of companies are looking to develop

Generated (with special tokens visible):
  The future of artificial intelligence is already set to change, with artificial intelligence being born into the world's top universities and colleges.

A growing number of companies are looking to develop

EOS token: '<|endoftext|>' (ID: 50256)

Two decoding options are shown: with and without special tokens visible. For user-facing output, skip_special_tokens=True produces clean text. For debugging, seeing the special tokens helps verify that the model terminated properly (at <|endoftext|>) and didn't hit the maximum length cutoff.

The eos_token_id parameter defines the stopping condition for generation. Without it, the model would continue generating until hitting max_new_tokens, potentially producing incomplete or rambling output.

Limitations & Impact

While special tokens are essential for transformer architectures, they come with tradeoffs in terms of computational efficiency, model flexibility, and cross-architecture compatibility. Understanding these limitations helps practitioners make informed design decisions.

Limitations

Special tokens introduce subtle but important challenges. The [CLS] token, while convenient, forces sequence-level information into a single position. For long documents or complex reasoning tasks, this bottleneck can limit model performance. Some researchers have proposed using pooled representations from all tokens instead, or learning multiple aggregate representations.
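For comparison, a common alternative aggregate is masked mean pooling over all token states; the sketch below (using a Hugging Face BERT encoder) averages only over real tokens:

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

enc = tokenizer(["I loved this movie!", "Tiny."], padding=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**enc).last_hidden_state          # (batch, seq_len, hidden)

mask = enc["attention_mask"].unsqueeze(-1).float()   # (batch, seq_len, 1)
# Average only over real tokens, ignoring [PAD] positions.
mean_pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(mean_pooled.shape)                             # (2, 768)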

The fixed vocabulary of special tokens can also be limiting. When adapting a pre-trained model to a new domain with new structural requirements (like code with specific delimiters, or legal documents with citation markers), you must add custom special tokens. This requires resizing the embedding layer and potentially fine-tuning to help the model learn useful representations for these new tokens. Models pre-trained without exposure to your custom tokens start with random embeddings, which can slow convergence.

Padding introduces computational inefficiency. Even with attention masks that prevent padding from influencing representations, the model still processes padding tokens through its layers, consuming memory and computation. Techniques like dynamic batching (grouping similar-length sequences) and sequence packing (concatenating multiple sequences with separators) help mitigate this, but add implementation complexity.

The reliance on specific special token formats creates compatibility challenges. A model trained with BERT-style tokens ([CLS], [SEP]) expects exactly that format during inference. Using the wrong special tokens, or forgetting them entirely, leads to degraded performance. This has led to the creation of standardized formats like the ChatML template for conversational models, but fragmentation persists across the ecosystem.

Impact

Despite these limitations, special tokens have become essential to modern NLP. The [CLS] token enabled BERT's approach to transfer learning, allowing a single pre-trained model to be fine-tuned for dozens of different tasks. The [MASK] token made self-supervised pre-training on unlabeled text possible at scale, eliminating the need for expensive labeled datasets.

The segment separation mechanism (using [SEP] and token type IDs) enabled models to jointly reason about multiple text pieces, unlocking tasks like question answering, natural language inference, and semantic similarity. Before this, models typically processed each input independently.

Custom special tokens have enabled specialized applications. Code models use tokens for different programming constructs. Dialogue systems use speaker tokens. Retrieval-augmented models use document boundary tokens. The flexibility to extend the special token vocabulary has made transformers adaptable to an enormous range of tasks beyond their original design.

Special tokens also established a convention for structuring model inputs that the entire field now follows. This standardization enabled the creation of shared benchmarks, reproducible research, and interoperable tools. When you use any modern NLP library, you're building on the foundation that special tokens provide.

Summary

Special tokens are the structural backbone of modern language models, providing essential signals that guide how transformers process text:

  • [CLS] aggregates sequence-level information for classification tasks
  • [SEP] marks boundaries between segments in multi-input tasks
  • [PAD] enables batched processing by filling sequences to uniform length
  • [MASK] enables self-supervised pre-training through masked language modeling
  • [UNK] handles out-of-vocabulary items when subword tokenization isn't sufficient

Beyond these core tokens, models use:

  • Token type IDs to distinguish segments within a sequence
  • Attention masks to prevent padding from influencing real token representations
  • Beginning/end tokens to mark sequence boundaries in generative models
  • Custom special tokens for domain-specific structural needs

The design of special tokens reflects each model's training objectives and intended use cases. BERT's [CLS] and [SEP] support its bidirectional encoder architecture and sentence-pair tasks. GPT's simpler <|endoftext|> matches its autoregressive generation paradigm. T5's sentinel tokens enable its span corruption pre-training.

When working with special tokens, remember to add them during encoding (add_special_tokens=True), handle them appropriately during decoding (skip_special_tokens=True for clean output), and resize model embeddings when adding custom tokens. These small details often determine whether a model performs as expected or produces nonsensical results.

