Learn how global tokens solve the information bottleneck in sparse attention by creating communication hubs that reduce path length from O(n/w) to just 2 hops.

Global Tokens
Sparse attention patterns like sliding windows solve the quadratic complexity problem but introduce a new challenge: how does information travel across the full sequence? If each token only attends to nearby neighbors, a token at position 0 cannot directly communicate with a token at position 1000. Information must hop through many intermediate windows, risking degradation along the way. Global tokens solve this by designating certain positions as communication hubs that can attend to and be attended by every position in the sequence.
The Information Bottleneck in Local Attention
Sliding window attention limits each token to a local neighborhood. While efficient, this creates a path length problem reminiscent of RNNs. To transfer information from the start of a document to the end, data must flow through overlapping windows one hop at a time.
Consider a sequence of 4096 tokens with a window size of 512. A token at position 0 can directly see tokens up to position 256 (half the window). To reach position 4000, information must traverse roughly 16 hops through successive windows. Each hop introduces potential for information loss or distortion.
Path length measures the number of attention operations needed for one position to influence another. Standard self-attention has path length 1 (direct connection). Sliding window attention has path length proportional to sequence length divided by window size.
Global tokens restore direct connections by serving as intermediaries. Any token can communicate with any other token by passing through a global token, reducing the maximum path length to 2.
The difference is clear. Sliding window attention requires 16 hops to connect distant positions, while adding just one global token reduces this to 2. This improvement in connectivity comes with minimal computational overhead since only a few tokens become global.
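As a rough check on these numbers, here is a small Python sketch. It assumes a symmetric window, so information advances about half a window per hop; the exact count depends on how the windows are defined.

```python
import math

def sliding_window_hops(distance: int, window_size: int) -> int:
    """Approximate attention hops needed to move information across
    `distance` positions when each hop covers about half a symmetric window."""
    return math.ceil(distance / (window_size // 2))

print(sliding_window_hops(4000, 512))  # 16: sliding window only
# With a global token the path is local -> global -> local: always 2 hops,
# no matter how far apart the two positions are.
```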
CLS Token as Global Attention
The idea of global tokens predates efficient transformers. BERT's [CLS] token, placed at the start of every sequence, was designed to aggregate sentence-level information. During fine-tuning, the [CLS] representation feeds into classification heads because it has learned to summarize the entire input.
Longformer and BigBird formalized this intuition by giving the [CLS] token bidirectional global attention. In standard BERT, [CLS] only receives information through the normal attention mechanism. In Longformer, [CLS] can attend to every token and every token can attend to [CLS].
The attention mask reveals the structure. The cross pattern emanating from position 0 shows the CLS token's global connectivity. The diagonal band represents local window attention for all other positions. This combination ensures that every token is at most 2 hops from any other token: local token → CLS → distant local token.
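A minimal NumPy sketch makes the pattern concrete. The helper `make_global_local_mask` below is illustrative rather than Longformer's actual implementation: it marks a symmetric sliding window for every token, then opens the full row and column for each global position.

```python
import numpy as np

def make_global_local_mask(seq_len: int, window_size: int, global_positions) -> np.ndarray:
    """Boolean mask: True where attention is allowed. Combines a symmetric
    sliding window with full rows/columns for the designated global positions."""
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    half = window_size // 2

    # Local sliding window: each token attends to its neighborhood
    for i in range(seq_len):
        mask[i, max(0, i - half): min(seq_len, i + half + 1)] = True

    # Global tokens: attend to everything and are attended by everything
    for g in global_positions:
        mask[g, :] = True   # global token sees all positions
        mask[:, g] = True   # all positions see the global token

    return mask

# Longformer-style CLS attention: position 0 is global
cls_mask = make_global_local_mask(seq_len=4096, window_size=512, global_positions=[0])
print(cls_mask[0].sum())     # 4096: the CLS row attends everywhere
print(cls_mask[2048].sum())  # 514: the ±256 window plus the CLS column
```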
Task-Specific Global Tokens
While CLS provides classification-focused global attention, different tasks benefit from different global token configurations. Question answering, for instance, naturally has two distinct segments: the question and the context passage. Making all question tokens global ensures that every context token can directly attend to the question.
The mask shows a block pattern. The upper-left quadrant is fully connected (question tokens attending to each other). The left columns and top rows extend this connectivity to context tokens. The lower-right shows the sliding window pattern among context tokens.
This task-specific design makes semantic sense. When answering a question, every context word should be able to consider the question directly. Without global question tokens, a context word might need to hop through multiple windows before accessing question information.
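Using the same hypothetical mask builder from the earlier sketch, making the question tokens global is just a matter of passing their index range as the global positions:

```python
# Layout for extractive QA: [question tokens][context tokens]
# Reuses make_global_local_mask from the sketch in the CLS section above.
num_question_tokens = 32  # illustrative question length

qa_mask = make_global_local_mask(
    seq_len=4096,
    window_size=512,
    global_positions=range(num_question_tokens),
)

# Every context token now attends to all question tokens directly,
# and every question token attends to the full sequence.
print(qa_mask[:num_question_tokens].all())        # True: question rows are fully connected
print(qa_mask[3000, :num_question_tokens].all())  # True: a distant context token sees the whole question
```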
Learned Global Tokens
CLS and task-specific tokens are fixed by design. An alternative approach introduces learnable global tokens that the model discovers during training. These tokens don't correspond to any input words but serve purely as memory and communication buffers.
Perceiver and Perceiver IO use this idea extensively. They introduce a small set of latent tokens (often 256-512) that cross-attend to a much longer input sequence. The latent tokens then self-attend among themselves efficiently, and finally cross-attend back to produce outputs.
The advantage of learned latents is flexibility. The model can discover what information to store in global memory rather than being constrained to predefined positions. The latent tokens act as a compressed representation of the full sequence, enabling efficient long-range communication.
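The following PyTorch sketch captures the shape of this idea rather than Perceiver's published architecture: a small set of learned latent vectors cross-attends to a long input, then self-attends among themselves, so the quadratic cost applies only to the latents. The class and parameter names are illustrative.

```python
import torch
import torch.nn as nn

class LatentBottleneck(nn.Module):
    """Toy Perceiver-style block: learned latents gather information from a
    long input via cross-attention, then mix it with self-attention.
    A sketch of the idea, not the published Perceiver architecture."""

    def __init__(self, d_model: int = 256, num_latents: int = 256, num_heads: int = 4):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, d_model) * 0.02)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

    def forward(self, inputs: torch.Tensor) -> torch.Tensor:
        # inputs: (batch, seq_len, d_model); seq_len can be very long
        batch = inputs.shape[0]
        lat = self.latents.unsqueeze(0).expand(batch, -1, -1)

        # Latents read from the full sequence: cost O(num_latents * seq_len)
        lat, _ = self.cross_attn(query=lat, key=inputs, value=inputs)

        # Latents talk to each other: cost O(num_latents^2), independent of seq_len
        lat, _ = self.self_attn(query=lat, key=lat, value=lat)
        return lat  # compressed global representation

block = LatentBottleneck()
out = block(torch.randn(2, 4096, 256))
print(out.shape)  # torch.Size([2, 256, 256])
```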
Global Token Count
How many global tokens do you need? This depends on the task and sequence length. Too few global tokens create an information bottleneck. Too many defeat the purpose of sparse attention by reintroducing quadratic interactions.
For classification tasks with a single CLS token, one global token often suffices. The CLS token aggregates a summary representation, and the classification head only needs this single vector.
For tasks requiring fine-grained outputs (like question answering or named entity recognition), more global tokens help. Making all question tokens global in QA provides richer context for every answer span candidate.
Even with 64 global tokens, we maintain over 84% savings compared to full attention. The marginal cost of additional global tokens is linear in sequence length, not quadratic. This means you can afford to be generous with global tokens for complex tasks without sacrificing efficiency.
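A back-of-the-envelope count, assuming the running example of a 4096-token sequence with a 512-token window and ignoring the small overlap between window and global connections, shows how gently the cost grows:

```python
def attention_pairs(seq_len: int, window_size: int, num_global: int) -> int:
    """Approximate number of query-key pairs in a global+local pattern:
    window connections for every token plus a full row and column per
    global token (boundary effects and the small overlap are ignored)."""
    local = seq_len * window_size
    global_extra = 2 * num_global * seq_len
    return local + global_extra

n, w = 4096, 512
full = n * n
for g in [1, 8, 64]:
    sparse = attention_pairs(n, w, g)
    print(f"g={g:3d}: {1 - sparse / full:.1%} savings vs. full attention")
# g=  1: 87.5% savings vs. full attention
# g=  8: 87.1% savings vs. full attention
# g= 64: 84.4% savings vs. full attention
```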
Global-Local Attention Mixing
The interplay between global and local attention creates interesting information flow patterns. In a single attention layer, information moves in two streams. The local stream propagates through sliding windows, carrying fine-grained positional information. The global stream broadcasts through designated tokens, providing sequence-wide context.
The visualization reveals how quickly information propagates. After just one layer, connectivity is limited to local neighborhoods plus the global token. By layer 2, the global token has served its purpose as a relay: nearly all positions can reach each other through the path local → global → local. By layer 3, full connectivity is achieved.
This suggests that even with aggressive sparsity, a few layers of global-local attention can match the connectivity of full attention. The key insight is that global tokens provide "shortcuts" that reduce the effective diameter of the attention graph.
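One way to check this numerically is to treat the attention mask as the adjacency matrix of a graph and compose it with itself once per layer. The sketch below reuses the hypothetical `make_global_local_mask` helper from earlier; the exact layer at which connectivity saturates depends on how the window and global links are defined, but with a single global token it happens within two to three layers.

```python
import numpy as np

def reachability(mask: np.ndarray, num_layers: int) -> np.ndarray:
    """Boolean matrix of pairs (i, j) such that information at position j
    can influence position i within `num_layers` attention layers,
    treating the attention mask as a directed adjacency matrix."""
    reach = mask.copy()
    for _ in range(num_layers - 1):
        reach = (reach.astype(int) @ mask.astype(int)) > 0
    return reach

# 256 tokens, window of 16, one global token at position 0
# (reuses make_global_local_mask from the earlier sketch).
mask = make_global_local_mask(seq_len=256, window_size=16, global_positions=[0])
for layers in (1, 2, 3):
    connected = reachability(mask, layers).mean()
    print(f"layer {layers}: {connected:.0%} of position pairs connected")
```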
Implementation Strategies
Implementing global tokens efficiently requires care in how the attention computation itself is structured. The naive approach computes global and local attention separately, then merges the results. A more efficient approach uses a single sparse attention kernel.
The mask determines which attention scores to compute. Setting masked positions to negative infinity before softmax effectively zeros their contribution to the weighted average. The global tokens see the full sequence in their rows, while local tokens see only their windows plus global positions.
Let's verify the implementation with a small example:
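Here is a self-contained sketch of that check, assuming a 16-token sequence, a window of 5, and global tokens at positions 0 and 8 (the setup described below). The mask construction and the negative-infinity trick mirror the description above, but the code itself is illustrative.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Small example: 16 tokens, window of 5, global tokens at positions 0 and 8
seq_len, window, global_positions = 16, 5, [0, 8]
half = window // 2

mask = np.zeros((seq_len, seq_len), dtype=bool)
for i in range(seq_len):
    mask[i, max(0, i - half): i + half + 1] = True  # local window
for g in global_positions:
    mask[g, :] = True  # global token attends everywhere
    mask[:, g] = True  # everyone attends to the global tokens

print(mask.sum(axis=1))  # connections per row: 16 for the globals, mostly 6-7 elsewhere

# Masked attention: set disallowed scores to -inf before the softmax
d_k = 8
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((seq_len, d_k)) for _ in range(3))
scores = Q @ K.T / np.sqrt(d_k)
scores = np.where(mask, scores, -np.inf)
weights = softmax(scores, axis=-1)

assert np.allclose(weights.sum(axis=-1), 1.0)
assert np.all(weights[~mask] == 0.0)  # masked positions get exactly zero weight
output = weights @ V
print(output.shape)  # (16, 8)
```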
Global positions connect to all 16 tokens. Local positions connect to their window (4-5 tokens) plus the 2 global tokens, totaling around 6-7 connections each.
The attention weight heatmap shows the characteristic pattern. The global token rows (0 and 8) have distributed attention across all positions. The local token rows show concentrated attention within their windows, with visible bumps at columns 0 and 8 where they attend to global tokens.
Comparison: Different Global Token Strategies
Different architectures use global tokens in distinct ways. Understanding these variations helps when choosing or designing efficient attention mechanisms.
| Strategy | Global Tokens | Use Case | Advantages |
|---|---|---|---|
| CLS-only (Longformer) | First token | Classification | Simple, single aggregation point |
| Task-specific (QA) | Question tokens | Extractive tasks | Semantic alignment with task structure |
| Periodic (BigBird) | Every k-th token | General | Uniform coverage, no special tokens needed |
| Learned latents (Perceiver) | Separate buffer | Multimodal, long sequences | Flexible, task-agnostic |
Each strategy makes trade-offs. CLS-only is simplest but creates a single bottleneck. Task-specific requires knowing the task structure. Periodic global tokens offer even coverage but may waste capacity on uninformative positions. Learned latents are most flexible but add trainable parameters.
All strategies maintain significant savings compared to full attention. The periodic strategy with 8 global tokens has higher density (more connections) but still achieves over 80% savings. The choice depends on whether you need uniform global coverage (periodic) or task-aligned global positions (CLS, question tokens).
Limitations and Impact
Global tokens solve the long-range dependency problem in sparse attention, but they come with limitations.
The primary limitation is the potential information bottleneck. A single CLS token must compress an entire document's worth of information into one vector. For long documents with diverse content, this creates pressure on the token's representation capacity. Tasks requiring fine-grained global reasoning may need multiple global tokens, increasing complexity.
Global tokens also introduce asymmetry in the attention pattern. The CLS token sees everything, while other tokens see only windows plus global positions. This asymmetry can lead to uneven gradient flow during training, with global token parameters receiving disproportionately large updates. Careful initialization and learning rate scheduling help mitigate this issue.
Despite these limitations, global tokens work well in practice. Longformer achieved state-of-the-art results on long-document tasks like WikiHop and TriviaQA. The ability to process 4096+ tokens efficiently opened new applications in document understanding, long-form question answering, and summarization. BigBird extended this further with theoretical guarantees about universal approximation.
The conceptual impact may be even more significant. Global tokens demonstrate that carefully designed sparse patterns can match or exceed the performance of full attention while being much more efficient. This principle has influenced subsequent architectures, from Perceiver's latent arrays to modern retrieval-augmented models that retrieve relevant context rather than attending to everything.
Key Parameters
When implementing global-local attention, several parameters control the trade-off between efficiency and expressiveness:
- window_size: The number of tokens each position can attend to locally. Larger windows capture more local context but increase computation. Typical values range from 256 to 512 for long-document models. The window should be large enough to capture phrase-level dependencies.
- global_positions: A list of token indices designated as global. Common strategies include:
  - First position only (CLS token) for classification
  - First and last positions (CLS + SEP) for sequence-pair tasks
  - All question tokens for extractive QA
  - Every k-th position for uniform coverage
- num_global: The count of global tokens. More global tokens reduce the information bottleneck but add O(g · n) connections, where g is the global count and n is the sequence length. Start with 1-4 global tokens and increase if task performance plateaus.
- d_k: The dimension used for scaling attention scores. Standard practice sets this equal to the model's head dimension (typically 64). Scaling by √d_k keeps attention logits in a stable range regardless of dimensionality. The configuration sketch after this list shows these parameters together.
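A hypothetical configuration object ties these knobs together; the field names mirror the parameters above and are illustrative rather than taken from any specific library.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class GlobalLocalAttentionConfig:
    """Illustrative settings for a global+local attention layer."""
    seq_len: int = 4096
    window_size: int = 512                 # local neighborhood per token
    global_positions: List[int] = field(default_factory=lambda: [0])  # CLS-only default
    d_k: int = 64                          # per-head dimension, used for 1/sqrt(d_k) scaling

    @property
    def num_global(self) -> int:
        return len(self.global_positions)

# Extractive QA variant: make the first 32 (question) positions global
qa_config = GlobalLocalAttentionConfig(global_positions=list(range(32)))
print(qa_config.num_global)  # 32
```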
Summary
Global tokens bridge efficient local attention and effective long-range modeling. By designating certain positions as communication hubs with full attention span, they reduce the maximum path length between any two positions to just 2 hops.
Key takeaways:
- The bottleneck problem: Sliding window attention limits direct communication to local neighborhoods, requiring many hops for long-range information flow.
- Global tokens as hubs: Designated positions that can attend to and be attended by all positions, serving as relay points for sequence-wide communication.
- CLS token attention: BERT's classification token naturally fits the global attention role, aggregating sequence information for downstream tasks.
- Task-specific globals: Question tokens in QA, separator tokens in sentence pairs, or any semantically meaningful positions can serve as global tokens.
- Learned latents: Trainable global tokens that discover what information to aggregate, offering flexibility at the cost of additional parameters.
- Efficient scaling: Global token count scales linearly, so even many global tokens preserve the efficiency gains of sparse attention.
- Reduced path length: With global tokens, maximum path length drops from O(n/w) to 2, where n is the sequence length and w is the window size. This matches full attention's connectivity properties.
The next chapter examines Longformer, which combines sliding window attention with global tokens into a complete architecture. We'll see how these components work together to achieve strong performance on document-level NLP tasks while maintaining linear complexity in sequence length.