Attention Sinks: Enabling Infinite-Length LLM Generation with StreamingLLM

Michael Brenndoerfer · Updated July 4, 2025 · 38 min read

Learn why the first tokens in transformer sequences absorb excess attention weight, how this causes streaming inference failures, and how StreamingLLM preserves these attention sinks for unlimited text generation.

Attention Sinks

When you deploy a language model for streaming generation, something strange happens. After processing a few thousand tokens, the model's output quality degrades. It starts repeating itself, loses coherence, and eventually produces gibberish. This failure mode isn't about running out of memory or hitting a hard context limit. It's about a subtle architectural quirk: the first few tokens in a sequence absorb a disproportionate amount of attention weight, regardless of their semantic relevance.

These overloaded positions are called attention sinks. Understanding why they exist and how to work around them unlocks true streaming inference: models that can generate coherent text indefinitely without quality degradation.

The Discovery: Why First Tokens Hoard Attention

The attention sink phenomenon was first systematically analyzed in the StreamingLLM paper by Xiao et al. (2023). Researchers observed that in autoregressive language models, the very first tokens in a sequence consistently receive high attention weights from all later positions, even when those initial tokens carry no special semantic meaning.

Attention Sink

An attention sink is a token position that accumulates disproportionately high attention weights across many query positions, typically regardless of the token's actual content or relevance. In autoregressive transformers, the first few tokens often serve as attention sinks due to softmax normalization requirements.

Consider what happens when a model processes a long sequence. For each new token, the self-attention mechanism computes attention weights over all previous positions. These weights must sum to 1 due to the softmax normalization. When the model encounters positions that aren't particularly relevant to the current prediction, it still needs to distribute some probability mass. The first tokens become a convenient "dump" for this excess attention.

This behavior emerges from training dynamics, not explicit design. During pretraining, models learn that attending to initial tokens is "safe" because they're always present and their representations stabilize early. The first token position essentially becomes a learned bias term that absorbs attention weight that would otherwise need to be spread across irrelevant positions.
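To see this constraint concretely, here is a minimal sketch with synthetic scores (not taken from a trained model). Even when only the first position has an elevated score, softmax assigns strictly positive weight everywhere, and that one position soaks up most of the probability mass:

Code
import torch
import torch.nn.functional as F

# Toy attention scores from one query to 16 key positions: mostly irrelevant
# (near zero), with a modestly elevated score at position 0, mimicking a
# trained sink key. The numbers are illustrative, not measured.
torch.manual_seed(0)
scores = 0.1 * torch.randn(16)
scores[0] += 3.0

weights = F.softmax(scores, dim=-1)

print(f"Weight on position 0: {weights[0]:.3f}")
print(f"Smallest weight:      {weights.min():.4f} (never exactly zero)")
print(f"Sum of all weights:   {weights.sum():.3f}")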

Out[2]:
Visualization
Line plot showing attention weight distribution with a sharp spike at position 0 and 1, then low uniform weights for middle positions.
Attention weights from a late sequence position to all previous positions. The first few tokens receive disproportionately high attention regardless of content, demonstrating the attention sink phenomenon. This pattern persists across different input texts and model architectures.

The visualization shows the characteristic shape: a sharp spike at the beginning of the sequence (the sink tokens), low uniform attention across the middle positions, and slightly elevated attention on recent tokens that provide immediate context.

To understand just how much attention concentrates in the first few positions, let's look at the cumulative attention distribution. By summing the attention weights from left to right, we can see what fraction of total attention is captured by the first $k$ tokens.

Out[3]:
Visualization
Line plot showing cumulative attention rising steeply for the first few positions, then flattening out before rising again for recent tokens.
Cumulative attention distribution showing the fraction of total attention weight captured by the first k positions. The first 4 tokens capture over 40% of all attention, demonstrating how strongly the sink phenomenon concentrates attention at the beginning of sequences.

The cumulative view highlights the sink phenomenon. The first 4 tokens alone capture over 40% of all attention weight, despite representing only 4% of the sequence positions. This concentration explains why removing these tokens causes such dramatic failures: you're removing the positions that the model relies on most heavily.
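The same kind of concentration is easy to reproduce in a few lines. The sketch below builds a synthetic attention row with a sink-like spike at the first two positions (illustrative values, not measurements from a real model) and reports the cumulative fraction captured by the first $k$ positions:

Code
import torch
import torch.nn.functional as F

# Synthetic attention row over 100 positions with a sink-like spike at the
# first two positions (illustrative values only).
torch.manual_seed(0)
scores = 0.1 * torch.randn(100)
scores[0] += 4.0
scores[1] += 3.0
weights = F.softmax(scores, dim=-1)

# Fraction of total attention captured by the first k positions
cumulative = torch.cumsum(weights, dim=-1)
for k in (1, 2, 4, 8, 16):
    print(f"first {k:2d} positions: {cumulative[k - 1]:.1%} of attention")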

Why Removing Initial Tokens Breaks Generation

This is where attention sinks reveal their importance. Suppose you want to do streaming inference: process a very long document one chunk at a time, discarding old tokens to save memory. A natural approach is to keep a sliding window of the most recent $L$ tokens.

But this fails catastrophically. When you remove the first tokens, the model loses its attention sinks. The excess attention weight that previously flowed to position 0 must now go somewhere else. The model wasn't trained with this attention distribution, so it produces outputs that don't match any pattern it learned during pretraining.

Out[4]:
Visualization
Bar chart showing attention weights with high bar at position 0 and moderate bars at recent positions.
Normal attention with sink tokens present. Excess attention flows to position 0, which was trained to absorb it. The model generates coherent output.
Bar chart showing attention weights without position 0, with irregular distribution across remaining positions.
Sliding window without sink tokens. When position 0 is removed, attention redistributes to positions the model wasn't trained to handle. Output quality degrades rapidly.

The contrast is clear. With sink tokens present (left), attention follows the pattern the model learned during training: high weight on the initial positions, with the rest distributed sensibly across recent context. Without sink tokens (right), the attention distribution becomes erratic. The model tries to find new sinks among the remaining tokens, but these positions weren't trained to serve that role.

In practice, this manifests as:

  • Increased perplexity on downstream tokens
  • Repetitive or looping generation
  • Loss of coherence over long generations
  • Complete degeneration into nonsense after enough tokens

The StreamingLLM Solution

StreamingLLM proposes a simple fix: instead of a pure sliding window, always keep the first few tokens. The attention mechanism becomes:

$$\text{Context} = \{\text{sink tokens}\} \cup \{\text{recent window}\}$$

By preserving just 1 to 4 initial tokens (the attention sinks), plus a sliding window of recent tokens, you maintain the attention distribution the model expects while bounding memory usage. The generation can continue indefinitely without quality degradation.
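As a quick sketch of which positions survive (a simplified helper for illustration, not the paper's implementation), the cache at generation step t is just the union of the fixed sink positions and the most recent window:

Code
def cached_positions(t, num_sink_tokens=4, window_size=8):
    """Positions retained in a StreamingLLM-style cache at generation step t."""
    sinks = list(range(min(num_sink_tokens, t)))
    window = list(range(max(num_sink_tokens, t - window_size), t))
    return sinks + window


# Early on, everything fits; later, only the sinks plus the recent window remain.
print(cached_positions(t=6))   # [0, 1, 2, 3, 4, 5]
print(cached_positions(t=30))  # sinks [0, 1, 2, 3] plus positions 22 through 29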

Out[5]:
Visualization
Diagram showing token sequence with first 4 positions marked as fixed sink tokens and last portion marked as sliding window.
StreamingLLM cache structure showing the combination of fixed sink tokens and a sliding window. The sink tokens (positions 0-3) are always retained, while the window slides to include only the most recent L tokens. This preserves the attention pattern while enabling infinite-length generation.

The key insight is that you don't need the entire history. You need:

  1. The attention sinks to absorb excess attention weight
  2. Recent context for actual language modeling

Everything in between can be safely discarded without affecting generation quality.

Mathematical Analysis

To understand why attention sinks emerge and why they're essential for streaming inference, we need to examine the mathematics of attention itself. Walking from the basic attention formula to the StreamingLLM solution reveals a core tension in transformer design: the softmax function that makes attention work also creates the problem that sinks must solve.

The Attention Mechanism: A Probability Distribution Problem

At its core, self-attention is a weighted averaging mechanism. For each position in a sequence, the model asks: "Which previous positions should I draw information from, and how much from each?" The answer comes as a set of weights that sum to 1, forming a probability distribution over the context.

In a standard autoregressive transformer, the attention computation at position $t$ produces:

$$\text{Attention}(Q_t, K_{1:t}, V_{1:t}) = \text{softmax}\left(\frac{Q_t K_{1:t}^\top}{\sqrt{d_k}}\right) V_{1:t}$$

where:

  • $Q_t$: the query vector at position $t$, representing what information this position is looking for
  • $K_{1:t}$: the key matrix containing key vectors from positions 1 through $t$, where each key represents what information that position offers
  • $V_{1:t}$: the value matrix containing value vectors from positions 1 through $t$, holding the actual content to be aggregated
  • $d_k$: the dimension of the key vectors, used to scale the dot products and prevent them from growing too large

The softmax function is the critical piece. It takes the raw attention scores (dot products between queries and keys) and converts them into a probability distribution:

$$\sum_{i=1}^{t} \alpha_{t,i} = 1$$

where each individual weight is:

$$\alpha_{t,i} = \frac{\exp(Q_t K_i^\top / \sqrt{d_k})}{\sum_{j=1}^{t} \exp(Q_t K_j^\top / \sqrt{d_k})}$$

where:

  • $\alpha_{t,i}$: the attention weight from query position $t$ to key position $i$, indicating how much position $t$ attends to position $i$
  • $Q_t K_i^\top$: the dot product between query $t$ and key $i$, measuring their compatibility
  • $\exp(\cdot)$: the exponential function, which ensures all values are positive
  • The denominator sums over all $t$ positions, normalizing the weights to form a valid probability distribution
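To confirm this formula matches what an attention layer actually computes, the sketch below evaluates the weights explicitly from random queries and keys and checks them against a library softmax (PyTorch assumed; the dimensions are arbitrary):

Code
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_k, t = 64, 10
Q_t = torch.randn(d_k)      # query at position t
K = torch.randn(t, d_k)     # keys for the t previous positions

# Scaled scores and the explicit softmax from the formula above
s = (K @ Q_t) / d_k**0.5
alpha_manual = torch.exp(s) / torch.exp(s).sum()

# Identical to the library implementation, and the weights sum to 1
print(torch.allclose(alpha_manual, F.softmax(s, dim=-1)))  # True
print(f"sum of weights: {alpha_manual.sum():.3f}")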

This is where the tension arises. The exponential function has a mathematical property that creates the sink phenomenon: $\exp(x) > 0$ for all finite $x$. No matter how irrelevant a position is, the model cannot assign it exactly zero attention. The softmax constraint forces the model to distribute some probability mass everywhere.

When the current token genuinely needs information from only a handful of positions, what happens to the attention weight that must go to the other 995 tokens in a 1000-token sequence? The model needs somewhere to put this "excess" attention, somewhere that won't disrupt the computation. During training, the first tokens become that destination.

How Sink Tokens Emerge During Training

The emergence of sink tokens is not designed but learned. During pretraining on billions of tokens, the model discovers that attending to early positions is "safe" for several reasons:

  1. They're always present: Position 0 exists in every training example, so the model can reliably use it
  2. Their representations stabilize early: By the time later positions attend to them, their hidden states have passed through many layers
  3. Attending to them causes minimal harm: Mixing in a small amount of irrelevant but stable information is better than spreading that attention across many varying positions

Let $K_0$ denote the key vector at position 0 after training. This vector develops a distinctive property: for most query vectors $Q_t$, the dot product $Q_t K_0^\top$ is relatively high compared to random positions. The model has learned to make $K_0$ a universal "attractor" in key space.
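Here is a toy illustration of this "attractor" property (synthetic vectors, not weights from a trained model): a key aligned with the average query direction scores consistently high against many different queries, while a random key does not.

Code
import torch

torch.manual_seed(0)
d_k = 64

# Hypothetical direction that query vectors tend to share after training
query_mean = torch.randn(d_k)
queries = query_mean + 0.5 * torch.randn(100, d_k)

# A sink-like key aligned with that direction vs. an ordinary random key
k_sink = 0.5 * query_mean
k_random = torch.randn(d_k)

print(f"mean scaled score vs sink key:   {(queries @ k_sink).mean() / d_k**0.5:.2f}")
print(f"mean scaled score vs random key: {(queries @ k_random).mean() / d_k**0.5:.2f}")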

Out[6]:
Visualization
Heatmap with 6 rows (layers) and 32 columns (positions), showing bright spots in the first few columns across all rows.
Attention weights across 6 transformer layers, each showing attention from the last token to all 32 previous positions. The first few positions consistently receive high attention across all layers, demonstrating that the sink phenomenon is not limited to a single layer but emerges as a global pattern throughout the network.

The heatmap shows that sink behavior is not an artifact of any single layer. Across all transformer layers, the first few positions consistently attract high attention weights. This cross-layer consistency suggests that sink tokens serve a core role in how the model processes information, not just a quirk of one particular layer's learned weights.

We can write the attention weight on position 0 using simplified notation:

$$\alpha_{t,0} = \frac{\exp(s_{t,0})}{\sum_{j=0}^{t-1} \exp(s_{t,j})}$$

where:

  • $\alpha_{t,0}$: the attention weight from query position $t$ to position 0 (the first token)
  • $s_{t,j} = Q_t K_j^\top / \sqrt{d_k}$: the scaled attention score between query $t$ and key $j$
  • $\exp(s_{t,0})$: the numerator, which grows exponentially with how well query $t$ matches key 0
  • $\sum_{j=0}^{t-1} \exp(s_{t,j})$: the normalization constant summing over all previous positions

For $\alpha_{t,0}$ to be consistently high across different queries and contexts, $s_{t,0}$ must be consistently large. The key vector $K_0$ evolves during training to project strongly onto the directions that queries typically occupy, regardless of the actual content at position 0.

The Mathematics of Failure: What Happens When Sinks Disappear

Understanding why removing sink tokens breaks the model requires following the math carefully. When you remove position 0 from the key-value cache, the attention computation changes in a specific and damaging way.

The new attention weights become:

$$\alpha'_{t,j} = \frac{\exp(s_{t,j})}{\sum_{k \neq 0} \exp(s_{t,k})}$$

where:

  • $\alpha'_{t,j}$: the modified attention weight after removing position 0
  • The denominator now sums only over positions $k \neq 0$, excluding the removed sink

The critical change is in the denominator. By removing $\exp(s_{t,0})$ from the sum, we've made the denominator smaller. Let's trace through what this means:

  1. Numerators stay the same: For any remaining position $j$, its numerator $\exp(s_{t,j})$ hasn't changed
  2. Denominator shrinks: The sum that divides the numerator is now missing the term $\exp(s_{t,0})$
  3. All weights increase: Since numerator/smaller-denominator > numerator/larger-denominator, every remaining position receives more attention

The probability mass that previously went to position 0 must redistribute across the remaining positions. If the sink absorbed 30% of attention ($\alpha_{t,0} = 0.3$), that 30% now spreads across positions that weren't trained to receive it.
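The redistribution is easy to verify numerically. In the sketch below (synthetic scores again), dropping position 0 from the softmax inflates every remaining weight by the same factor $1 / (1 - \alpha_{t,0})$:

Code
import torch
import torch.nn.functional as F

torch.manual_seed(0)
scores = 0.1 * torch.randn(20)
scores[0] += 3.0  # sink-like score at position 0

with_sink = F.softmax(scores, dim=-1)
without_sink = F.softmax(scores[1:], dim=-1)  # position 0 evicted

print(f"attention absorbed by the sink:   {with_sink[0]:.2f}")
print(f"largest non-sink weight (before): {with_sink[1:].max():.3f}")
print(f"largest non-sink weight (after):  {without_sink.max():.3f}")
# Every remaining weight grows by the same factor 1 / (1 - sink weight)
print(f"inflation factor:                 {1 / (1 - with_sink[0]):.2f}x")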

This redistribution causes two compounding problems:

  1. Distribution mismatch: The new attention distribution doesn't match what the model saw during training. Each attention head has learned to produce useful representations given a specific expected distribution of weights. When that distribution changes, the representations become unreliable, leading to out-of-distribution hidden states that propagate through subsequent layers.

  2. Cascade of new sinks: The model may try to use other early positions as sinks, but they lack the specialized key representation of position 0. As generation continues, the model keeps shifting which positions absorb excess attention, never stabilizing into a pattern it can use effectively.

The StreamingLLM Solution: Preserving the Distribution

StreamingLLM's insight is that you don't need the full history to maintain the attention distribution the model expects. You need exactly two things: the sink tokens to absorb excess attention, and recent context for actual language modeling. Everything in between can be discarded.

Formally, we maintain positions $\{0, 1, \ldots, k-1\}$ as sinks plus a sliding window $\{t-L, \ldots, t-1\}$ for recent context. The attention weight for any position $i$ in this combined set becomes:

$$\alpha_{t,i} = \frac{\exp(s_{t,i})}{\sum_{j \in \text{sinks}} \exp(s_{t,j}) + \sum_{j \in \text{window}} \exp(s_{t,j})}$$

where:

  • $\alpha_{t,i}$: the attention weight from query position $t$ to position $i$
  • $s_{t,i}$: the attention score between query $t$ and key $i$
  • $\text{sinks} = \{0, 1, \ldots, k-1\}$: the set of $k$ preserved sink token positions
  • $\text{window} = \{t-L, \ldots, t-1\}$: the sliding window of the most recent $L$ positions
  • The denominator sums over both sets, ensuring the attention weights form a valid probability distribution

The key insight is that the denominator structure matches what the model saw during training. The sink tokens contribute their $\exp(s_{t,j})$ terms to the normalization, absorbing their usual share of attention weight. The window tokens receive the context-relevant attention for actual language modeling. Because the sink tokens are present, the model operates in its training distribution rather than an out-of-distribution state.

This is why StreamingLLM works: it doesn't try to eliminate the sink phenomenon or work around it. Instead, it embraces the sink as a necessary component of how the model has learned to distribute attention, and simply ensures that component is always present.

Implementation

With the mathematical foundation in place, let's translate these concepts into working code. The implementation centers on a key insight from our analysis: the cache must always contain the sink tokens, so we need a data structure that preserves specific positions while sliding others.

Our StreamingLLMCache class manages this by treating the cache as two distinct regions: a fixed sink region at the beginning that never changes, and a sliding window region that shifts as new tokens arrive. When the cache fills up, we evict the oldest window tokens while the sink tokens remain untouched.

In[7]:
Code
import torch


class StreamingLLMCache:
    """
    KV cache for streaming inference that preserves attention sinks.

    The cache maintains:
    - First `num_sink_tokens` positions (attention sinks)
    - Most recent `window_size` positions (sliding window)
    """

    def __init__(
        self,
        num_sink_tokens=4,
        window_size=1020,
        num_layers=12,
        num_heads=12,
        head_dim=64,
    ):
        self.num_sink_tokens = num_sink_tokens
        self.window_size = window_size
        self.max_cache_size = num_sink_tokens + window_size
        self.num_layers = num_layers
        self.num_heads = num_heads
        self.head_dim = head_dim

        # Initialize empty caches for keys and values
        # Shape: (num_layers, batch_size, num_heads, cache_len, head_dim)
        self.k_cache = None
        self.v_cache = None
        self.cache_len = 0

    def update(self, layer_idx, new_k, new_v):
        """
        Add new key-value pairs to the cache, evicting old non-sink tokens if needed.

        Args:
            layer_idx: Which transformer layer this update is for
            new_k: New key tensor of shape (batch, heads, new_len, head_dim)
            new_v: New value tensor of shape (batch, heads, new_len, head_dim)

        Returns:
            Combined key and value tensors for attention computation
        """
        batch_size = new_k.shape[0]
        new_len = new_k.shape[2]

        # Initialize cache on first update
        if self.k_cache is None:
            self.k_cache = torch.zeros(
                self.num_layers,
                batch_size,
                self.num_heads,
                self.max_cache_size,
                self.head_dim,
                device=new_k.device,
                dtype=new_k.dtype,
            )
            self.v_cache = torch.zeros_like(self.k_cache)

        # If cache has room, simply append
        if self.cache_len + new_len <= self.max_cache_size:
            self.k_cache[
                layer_idx, :, :, self.cache_len : self.cache_len + new_len
            ] = new_k
            self.v_cache[
                layer_idx, :, :, self.cache_len : self.cache_len + new_len
            ] = new_v
            # Length of the valid cache region after this update
            effective_len = self.cache_len + new_len
        else:
            # Cache is full: preserve sinks, slide window
            # Keep first num_sink_tokens, evict oldest non-sink tokens
            sink_k = self.k_cache[layer_idx, :, :, : self.num_sink_tokens]
            sink_v = self.v_cache[layer_idx, :, :, : self.num_sink_tokens]

            # Calculate how many window tokens we can keep
            tokens_to_keep = self.window_size - new_len

            if tokens_to_keep > 0:
                # Keep the most recent tokens from the valid cache region.
                # Clone so the write-back below cannot alias the source slice.
                window_k = self.k_cache[
                    layer_idx, :, :, self.cache_len - tokens_to_keep : self.cache_len
                ].clone()
                window_v = self.v_cache[
                    layer_idx, :, :, self.cache_len - tokens_to_keep : self.cache_len
                ].clone()

                # Reconstruct cache: sinks + kept window + new tokens
                self.k_cache[layer_idx, :, :, : self.num_sink_tokens] = sink_k
                self.k_cache[
                    layer_idx,
                    :,
                    :,
                    self.num_sink_tokens : self.num_sink_tokens
                    + tokens_to_keep,
                ] = window_k
                self.k_cache[
                    layer_idx,
                    :,
                    :,
                    self.num_sink_tokens + tokens_to_keep : self.num_sink_tokens
                    + tokens_to_keep
                    + new_len,
                ] = new_k

                self.v_cache[layer_idx, :, :, : self.num_sink_tokens] = sink_v
                self.v_cache[
                    layer_idx,
                    :,
                    :,
                    self.num_sink_tokens : self.num_sink_tokens
                    + tokens_to_keep,
                ] = window_v
                self.v_cache[
                    layer_idx,
                    :,
                    :,
                    self.num_sink_tokens + tokens_to_keep : self.num_sink_tokens
                    + tokens_to_keep
                    + new_len,
                ] = new_v
            else:
                # Window is smaller than new tokens, just keep sinks and new
                self.k_cache[layer_idx, :, :, : self.num_sink_tokens] = sink_k
                self.k_cache[
                    layer_idx,
                    :,
                    :,
                    self.num_sink_tokens : self.num_sink_tokens + new_len,
                ] = new_k

                self.v_cache[layer_idx, :, :, : self.num_sink_tokens] = sink_v
                self.v_cache[
                    layer_idx,
                    :,
                    :,
                    self.num_sink_tokens : self.num_sink_tokens + new_len,
                ] = new_v

            # Length of the valid cache region after this update
            effective_len = (
                self.num_sink_tokens + max(0, tokens_to_keep) + new_len
            )

        # Commit the shared length counter only after the last layer's update,
        # so every layer processes this step from the same starting offset
        if layer_idx == self.num_layers - 1:
            self.cache_len = effective_len

        # Return the active portion of the cache for this layer, including
        # the tokens just written (even before cache_len is committed)
        return (
            self.k_cache[layer_idx, :, :, :effective_len],
            self.v_cache[layer_idx, :, :, :effective_len],
        )

    def get_cache_length(self):
        return self.cache_len

The cache management is straightforward: when adding new tokens would exceed the maximum cache size, we preserve the sink tokens, keep as many recent tokens as possible, and append the new tokens. Let's test the cache behavior by adding tokens in batches:

In[8]:
Code
# Test the cache behavior
cache = StreamingLLMCache(
    num_sink_tokens=4, window_size=12, num_layers=2, num_heads=4, head_dim=32
)

# Simulate adding tokens in batches
batch_size = 1
cache_lengths = []

for step in range(5):
    new_k = torch.randn(batch_size, 4, 4, 32)  # 4 new tokens per step
    new_v = torch.randn(batch_size, 4, 4, 32)

    for layer in range(2):
        k, v = cache.update(layer, new_k, new_v)

    cache_lengths.append(cache.get_cache_length())
Out[9]:
Console
StreamingLLM Cache Demonstration
==================================================
Sink tokens: 4
Window size: 12
Max cache size: 16

Step 1: Added 4 tokens, cache length = 4
Step 2: Added 4 tokens, cache length = 8
Step 3: Added 4 tokens, cache length = 12
Step 4: Added 4 tokens, cache length = 16
Step 5: Added 4 tokens, cache length = 16

The cache grows from 4 to 16 tokens over the first four steps, then stabilizes at 16 (the maximum cache size of 4 sink tokens + 12 window tokens). Once the cache is full, the window slides forward while always preserving the sink tokens.

Now let's implement the attention mechanism that uses this cache:

In[10]:
Code
import torch.nn.functional as F


def streaming_attention(query, cache, layer_idx, new_k, new_v, scale=None):
    """
    Compute attention with StreamingLLM cache.

    Args:
        query: Query tensor of shape (batch, heads, 1, head_dim) for single token
        cache: StreamingLLMCache instance
        layer_idx: Current layer index
        new_k: Key for new token(s)
        new_v: Value for new token(s)
        scale: Attention scale factor (default: 1/sqrt(head_dim))

    Returns:
        Attention output and updated cache
    """
    head_dim = query.shape[-1]
    if scale is None:
        scale = 1.0 / (head_dim**0.5)

    # Update cache and get all keys/values
    keys, values = cache.update(layer_idx, new_k, new_v)

    # Compute attention scores: (batch, heads, query_len, cache_len)
    scores = torch.matmul(query, keys.transpose(-2, -1)) * scale

    # Apply softmax to get attention weights
    attn_weights = F.softmax(scores, dim=-1)

    # Compute output
    output = torch.matmul(attn_weights, values)

    return output, attn_weights
In[11]:
Code
# Demonstrate attention weight distribution
torch.manual_seed(42)

cache = StreamingLLMCache(
    num_sink_tokens=4, window_size=20, num_layers=1, num_heads=1, head_dim=64
)

# Fill cache with some initial tokens
initial_k = torch.randn(1, 1, 20, 64)
initial_v = torch.randn(1, 1, 20, 64)

# Make first few tokens have distinctive key patterns (simulating trained sinks)
initial_k[:, :, 0, :] = initial_k[:, :, 0, :] * 2 + 0.5  # Sink token 0
initial_k[:, :, 1, :] = initial_k[:, :, 1, :] * 1.5 + 0.3  # Sink token 1

cache.update(0, initial_k, initial_v)

# Now add a new query token
query = torch.randn(1, 1, 1, 64)
new_k = torch.randn(1, 1, 1, 64)
new_v = torch.randn(1, 1, 1, 64)

output, attn_weights = streaming_attention(query, cache, 0, new_k, new_v)
Out[12]:
Visualization
Bar chart showing attention weights across cache positions with higher values for sink token positions.
Attention weights from StreamingLLM inference. The first few positions (sink tokens) receive elevated attention weight. This is the expected distribution that enables stable long-sequence generation.

The attention weights show the expected pattern: elevated attention on the first few positions (the sink tokens) with the remaining attention distributed across the window tokens.

Designing Effective Sink Tokens

Not all initial tokens make equally good sinks. The effectiveness of an attention sink depends on what token occupies that position during training. Several strategies can improve sink behavior:

Beginning-of-Sequence Token

Most tokenizers include a dedicated <BOS> (beginning-of-sequence) or <s> token that always appears at position 0. Because this token appears at the start of every training sequence, it develops a strong sink representation. The model learns that this token is always present and always available to absorb excess attention.
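A common way to use this in practice is simply to make sure the BOS token occupies position 0 of whatever you feed the model. The sketch below assumes a Hugging Face-style tokenizer exposing bos_token_id and encode; adapt it to your tokenizer's API.

Code
def encode_with_bos_sink(tokenizer, text):
    """Tokenize text, ensuring the BOS token occupies position 0.

    Assumes a Hugging Face-style tokenizer with `bos_token_id` and `encode`;
    purely illustrative.
    """
    token_ids = tokenizer.encode(text, add_special_tokens=False)
    if tokenizer.bos_token_id is not None:
        token_ids = [tokenizer.bos_token_id] + token_ids
    return token_ids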

System Prompt Tokens

For instruction-tuned models, the system prompt often provides natural sink tokens. Phrases like "You are a helpful assistant" appear at the start of every conversation, making them reliable candidates for sink positions. The model has seen these tokens occupying the first twenty or so positions millions of times, so their key representations have thoroughly adapted to the sink role.

Dedicated Sink Tokens

Some models are trained with explicit sink tokens. These are special tokens inserted at the beginning of sequences specifically to serve as attention sinks. During training, the model learns to route excess attention to these tokens rather than developing the behavior organically.

In[13]:
Code
def count_sink_tokens_needed(model_perplexity_data, threshold=1.02):
    """
    Determine the optimal number of sink tokens based on perplexity analysis.

    When removing initial tokens, perplexity increases. We find the minimum
    number of tokens to preserve such that perplexity stays within threshold.

    Args:
        model_perplexity_data: Dict mapping num_preserved_tokens -> perplexity
        threshold: Maximum acceptable perplexity ratio vs. baseline

    Returns:
        Recommended number of sink tokens to preserve
    """
    if not model_perplexity_data:
        return 4  # Default recommendation from StreamingLLM paper

    # Baseline = best (lowest) perplexity, reached once enough sinks are preserved
    baseline_perplexity = min(model_perplexity_data.values())

    for num_tokens in sorted(model_perplexity_data.keys()):
        ppl = model_perplexity_data[num_tokens]
        if ppl <= baseline_perplexity * threshold:
            return num_tokens

    return max(model_perplexity_data.keys())
In[14]:
Code
# Simulated perplexity data showing effect of sink token count
simulated_data = {
    0: 45.2,  # No sinks: very high perplexity
    1: 18.7,  # 1 sink: still elevated
    2: 12.3,  # 2 sinks: getting better
    3: 10.1,  # 3 sinks: close to baseline
    4: 9.8,  # 4 sinks: matches baseline
    8: 9.7,  # 8 sinks: marginal improvement
}

recommended = count_sink_tokens_needed(simulated_data)
Out[15]:
Console
Perplexity vs Number of Preserved Sink Tokens
=============================================
  0 sink tokens: perplexity = 45.2
  1 sink tokens: perplexity = 18.7
  2 sink tokens: perplexity = 12.3
  3 sink tokens: perplexity = 10.1
  4 sink tokens: perplexity = 9.8 <-- recommended
  8 sink tokens: perplexity = 9.7

The data shows that perplexity drops sharply when preserving the first few tokens, then plateaus. With zero sink tokens, perplexity explodes to 45.2, indicating severe model degradation. Adding just one sink token cuts perplexity by more than half. By four sink tokens, perplexity reaches 9.8, nearly matching baseline performance. Additional sinks beyond four provide diminishing returns, matching the StreamingLLM paper's recommendation.

Streaming Inference Pipeline

Let's put everything together into a complete streaming inference pipeline:

In[16]:
Code
class StreamingInferenceEngine:
    """
    Complete streaming inference engine with attention sink preservation.
    """

    def __init__(self, model, tokenizer, num_sink_tokens=4, window_size=2044):
        self.model = model
        self.tokenizer = tokenizer
        self.num_sink_tokens = num_sink_tokens
        self.window_size = window_size

        # Extract model configuration
        config = model.config
        self.num_layers = config.num_hidden_layers
        self.num_heads = config.num_attention_heads
        self.head_dim = config.hidden_size // config.num_attention_heads

        # Initialize cache
        self.cache = None
        self.total_tokens_generated = 0

    def reset(self):
        """Reset the cache for a new generation session."""
        self.cache = None
        self.total_tokens_generated = 0

    def _init_cache(self, batch_size, device, dtype):
        """Initialize the streaming cache."""
        self.cache = StreamingLLMCache(
            num_sink_tokens=self.num_sink_tokens,
            window_size=self.window_size,
            num_layers=self.num_layers,
            num_heads=self.num_heads,
            head_dim=self.head_dim,
        )

    def generate_token(self, input_ids, past_key_values=None):
        """
        Generate a single token using the streaming cache.

        In a real implementation, this would integrate with the model's
        forward pass to use the streaming cache instead of standard KV cache.
        """
        # This is a simplified demonstration
        # Real implementation would modify the model's attention layers
        pass

Let's demonstrate the memory savings achieved by StreamingLLM compared to a full KV cache:

In[17]:
Code
# Simulate a long generation scenario
total_tokens = 10000
sink_tokens = 4
window_size = 2044
max_cache_size = sink_tokens + window_size

# Calculate memory usage (assuming 768-dim embeddings, 12 layers, float32)
# KV cache stores keys and values (2x) for each layer
full_cache_memory = total_tokens * 768 * 12 * 2 * 4 / (1024**2)  # MB
streaming_cache_memory = max_cache_size * 768 * 12 * 2 * 4 / (1024**2)  # MB
memory_saved = full_cache_memory - streaming_cache_memory
savings_percent = (1 - streaming_cache_memory / full_cache_memory) * 100
Out[18]:
Console
Streaming Inference Demonstration
==================================================

Generating 10,000 tokens with:
  - 4 sink tokens (always preserved)
  - 2,044 window tokens (sliding)
  - Maximum cache size: 2,048 tokens

Memory usage comparison:
  Full KV cache: 703.1 MB
  StreamingLLM cache: 144.0 MB
  Memory saved: 559.1 MB (79.5%)

The streaming approach enables generation of arbitrarily long sequences while using fixed memory. In this example, generating 10,000 tokens with a full cache would require over 700 MB, but StreamingLLM uses only about 144 MB, a roughly 80% reduction.

Out[19]:
Visualization
Line plot showing linear memory growth for full cache versus flat horizontal line for StreamingLLM cache.
Memory usage comparison between full KV cache and StreamingLLM as sequence length increases. Full cache grows linearly (eventually exhausting GPU memory), while StreamingLLM stays constant regardless of generation length. At 100K tokens, full caching would require 7 GB while StreamingLLM still uses only 145 MB.

The visualization makes the scaling difference clear. Full KV caching follows a diagonal line that quickly exceeds typical GPU memory limits. At 100,000 tokens, a full cache would require nearly 7 GB. StreamingLLM's flat line at the bottom shows constant memory usage, enabling generation of arbitrarily long sequences without hitting memory limits.
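The same arithmetic as the snippet above, wrapped in a small helper so you can compute the full curve yourself (same assumptions: 768-dimensional embeddings, 12 layers, float32):

Code
def kv_cache_mb(num_tokens, hidden_size=768, num_layers=12, bytes_per_value=4):
    """Approximate KV-cache size in MB (keys and values across all layers)."""
    return num_tokens * hidden_size * num_layers * 2 * bytes_per_value / (1024**2)


for n in (1_000, 10_000, 100_000):
    full = kv_cache_mb(n)
    streaming = kv_cache_mb(4 + 2044)  # sinks + window, independent of n
    print(f"{n:>7,} tokens: full = {full:8.1f} MB, streaming = {streaming:.1f} MB")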

Empirical Validation

Let's examine how perplexity changes as we vary the number of sink tokens and window size:

Out[20]:
Visualization
Line plot showing perplexity decreasing sharply from 0 to 4 sink tokens, then plateauing.
Perplexity versus number of preserved sink tokens. Model performance stabilizes after preserving 4 sink tokens, with minimal improvement from additional sinks.
Line plot showing perplexity decreasing as window size increases from 128 to 2048.
Perplexity versus window size with 4 sink tokens. Larger windows improve perplexity until the window captures sufficient local context, typically around 1024-2048 tokens.

The left plot confirms that preserving 4 sink tokens is sufficient for most models. Additional sinks provide diminishing returns. The right plot shows that window size matters too: larger windows give the model more recent context to work with, improving perplexity until it approaches the full-context baseline.

Beyond StreamingLLM: Learnable Sink Tokens

The original StreamingLLM approach preserves the first tokens from the input sequence. A more principled approach is to train models with explicit, learnable sink tokens. These tokens are:

  1. Added to the vocabulary as special tokens
  2. Always prepended to the input sequence during training
  3. Optimized end-to-end to serve as effective attention sinks

Models trained this way develop more efficient sink representations. The sink tokens learn key vectors that are optimally positioned to capture excess attention, rather than relying on whatever tokens happen to appear at the start of the sequence.

Learnable Sink Tokens

Learnable sink tokens are special tokens added to a model's vocabulary and trained end-to-end to serve as attention sinks. Unlike implicit sinks that emerge from training on natural text, learnable sinks are explicitly optimized to absorb excess attention weight.

The training procedure is straightforward:

  1. Add $k$ sink tokens to the tokenizer vocabulary
  2. Prepend these tokens to every training sequence
  3. Train the model normally, allowing the sink token embeddings to learn

During inference, you prepend the same sink tokens to the input, and they immediately serve their purpose without needing to preserve specific input tokens.

In[21]:
Code
def prepare_input_with_sink_tokens(tokenizer, text, num_sink_tokens=4):
    """
    Prepare input with explicit sink tokens prepended.

    Args:
        tokenizer: Tokenizer with sink tokens in vocabulary
        text: Input text to tokenize
        num_sink_tokens: Number of sink tokens to prepend

    Returns:
        Token IDs with sink tokens at the beginning
    """
    # Assume sink tokens are named [SINK_0], [SINK_1], etc.
    sink_token_ids = [
        tokenizer.convert_tokens_to_ids(f"[SINK_{i}]")
        for i in range(num_sink_tokens)
    ]

    # Tokenize the actual text
    text_tokens = tokenizer.encode(text, add_special_tokens=False)

    # Combine: sinks first, then text
    return sink_token_ids + text_tokens

Key Parameters

When implementing StreamingLLM, the following parameters have the greatest impact on generation quality and memory usage:

  • num_sink_tokens: Number of initial tokens to preserve as attention sinks. The default of 4 works well for most models, but you may need fewer (1-2) for smaller models or more (8-16) for models with unusual attention patterns. Start with 4 and adjust based on perplexity measurements.

  • window_size: Number of recent tokens to keep in the sliding window. Larger windows capture more context but require more memory. Common values range from 512 to 2048. The right choice depends on the typical dependency length in your generation task: conversational AI may need 512-1024, while document summarization benefits from 2048+.

  • max_cache_size: Total cache size, computed as num_sink_tokens + window_size. This determines the fixed memory footprint. For a 7B parameter model with 32 layers and a hidden size of 4096, each cached token uses approximately 1 MB in float32, so a cache of 2048 tokens requires about 2 GB (see the configuration sketch after this list).

  • position_encoding_strategy: How to handle position encodings for window tokens. Options include keeping original positions (works with RoPE), resetting positions when sliding (may cause distribution mismatch), or using relative encodings (most robust). The choice depends on the base model's positional encoding scheme.
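As a convenience, these settings can be bundled into a small configuration object. The sketch below is illustrative (not part of any library API) and derives the cache size and an approximate memory footprint from the parameters above:

Code
from dataclasses import dataclass


@dataclass
class StreamingConfig:
    """Illustrative parameter bundle for a StreamingLLM-style cache."""
    num_sink_tokens: int = 4
    window_size: int = 2044
    num_layers: int = 32
    hidden_size: int = 4096
    bytes_per_value: int = 4  # float32

    @property
    def max_cache_size(self) -> int:
        return self.num_sink_tokens + self.window_size

    def cache_memory_gb(self) -> float:
        # Keys and values for every layer, for every cached token
        per_token = 2 * self.num_layers * self.hidden_size * self.bytes_per_value
        return self.max_cache_size * per_token / (1024**3)


config = StreamingConfig()
print(config.max_cache_size)                 # 2048
print(f"{config.cache_memory_gb():.1f} GB")  # about 2 GB for the 7B example above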

Limitations and Considerations

While attention sinks and StreamingLLM enable streaming inference, they come with limitations you should understand.

The fundamental limitation is that StreamingLLM trades context for memory. When you slide the window forward, you permanently lose access to information in the discarded tokens. If the model needs to recall a specific detail from 5,000 tokens ago but that token is no longer in the cache, the information is simply gone. For tasks requiring precise long-range recall (like answering questions about specific facts from early in a document), StreamingLLM will struggle compared to full-context approaches.

The attention sink phenomenon is also somewhat architecture-dependent. While it appears consistently across decoder-only transformer models trained autoregressively, the exact number of sink tokens needed and their effectiveness varies. Models with different training procedures, position encodings, or attention patterns may exhibit different sink behaviors. The 4-sink recommendation from the original paper is a good starting point, but you may need to tune this for your specific model.

There's also a subtle issue with positional encoding. When tokens slide out of the window, the remaining tokens don't automatically adjust their position encodings. A token at original position 5000 keeps its position-5000 encoding even when it becomes the "first" window token. This works reasonably well with relative position encodings like RoPE, but can cause issues with absolute position encodings. Some implementations recompute position encodings for the active cache, but this adds complexity and may not match the model's training distribution.

Finally, StreamingLLM doesn't help with the initial context window. You still can't attend to more than your window allows at any single step. If understanding a passage requires simultaneously seeing tokens 1-1000 and tokens 5000-6000, you're limited by what fits in the window plus sinks. StreamingLLM enables infinite generation, not infinite context.

Summary

Attention sinks are an emergent property of autoregressive transformers: the first few tokens in a sequence learn to absorb excess attention weight that would otherwise need to be distributed across irrelevant positions. This behavior emerges from the softmax normalization requirement that attention weights sum to 1.

StreamingLLM exploits this phenomenon for practical benefit. By preserving a small number of sink tokens alongside a sliding window of recent context, you can generate text of arbitrary length with fixed memory usage. The key insights are:

  • Attention sinks are essential: Removing the first tokens breaks the attention distribution the model learned during training, causing quality degradation
  • A few sinks suffice: Four sink tokens typically achieve baseline-equivalent performance, with diminishing returns from additional sinks
  • Window size matters: Larger windows capture more local context, improving generation quality until the window is large enough to cover typical dependency ranges
  • Memory is constant: Unlike full KV caching that grows linearly with sequence length, StreamingLLM uses fixed memory regardless of generation length

In practice, you can deploy language models for continuous generation tasks (chatbots, streaming summaries, real-time translation) without worrying about memory growth. The model maintains coherent output quality for thousands or millions of tokens, constrained only by compute time rather than memory.

Understanding attention sinks also provides insight into transformer behavior more broadly. The fact that models spontaneously learn to use certain positions as "attention dumps" shows how they manage the constraint that attention weights must form a probability distribution. This has implications for model interpretability, efficient architecture design, and the development of even longer-context language models.

