PagedAttention: Solving LLM KV Cache Memory Fragmentation

Michael Brenndoerfer · January 8, 2026 · 45 min read

Learn how PagedAttention uses virtual memory paging to eliminate KV cache fragmentation, enabling 5x better memory utilization in LLM serving systems.


Paged Attention

In the previous two chapters, we explored how KV caches accelerate autoregressive generation by avoiding redundant computation, and we examined the substantial memory costs this optimization incurs. As we saw, a single 70B parameter model can require over 2.5 GB of KV cache memory per sequence at full context length. When serving multiple concurrent requests, this memory demand becomes the primary bottleneck limiting throughput.

But the raw memory consumption is only part of the story. Traditional KV cache implementations also face memory fragmentation. Even when the GPU has sufficient total free memory, the way that memory is allocated and deallocated can leave it unusable. This fragmentation can waste 60-80% of available KV cache memory, dramatically reducing the number of sequences a server can process simultaneously.

PagedAttention, introduced by the vLLM project in 2023, solves this fragmentation problem by borrowing a fundamental concept from operating systems: virtual memory with paging. Instead of allocating contiguous memory blocks for each sequence's KV cache, PagedAttention breaks the cache into fixed-size pages that can be stored anywhere in memory. This simple but powerful idea enables near-optimal memory utilization and has become the foundation for most modern LLM serving systems.

The Memory Fragmentation Problem

To understand why fragmentation occurs, we need to examine how traditional KV cache systems allocate memory. When a new request arrives, the system must reserve memory for the KV cache that will grow as tokens are generated. But here's the challenge: the system doesn't know in advance how long the final sequence will be.

The Contiguous Allocation Dilemma

Traditional systems handle this uncertainty by pre-allocating memory based on the maximum possible sequence length. If your model supports 4,096 tokens, the system reserves space for all 4,096 KV pairs for each sequence, even if most requests only generate a few hundred tokens.

In[2]:
Code
import numpy as np


# Simulating traditional KV cache allocation
def traditional_allocation_simulation(
    max_seq_len: int = 4096, num_requests: int = 8, actual_lengths: list = None
):
    """
    Simulate memory allocation with contiguous pre-allocation.
    Returns wasted memory statistics.
    """
    if actual_lengths is None:
        # Realistic distribution: most sequences are shorter than max
        np.random.seed(42)
        actual_lengths = np.random.exponential(scale=500, size=num_requests)
        actual_lengths = np.clip(actual_lengths, 50, max_seq_len).astype(int)

    # Each request reserves max_seq_len slots
    total_allocated = num_requests * max_seq_len
    total_used = sum(actual_lengths)

    return {
        "max_seq_len": max_seq_len,
        "num_requests": num_requests,
        "allocated": total_allocated,
        "used": total_used,
        "wasted": total_allocated - total_used,
        "utilization": total_used / total_allocated,
        "actual_lengths": actual_lengths,
    }


# Run simulation
stats = traditional_allocation_simulation()
Out[3]:
Console
Traditional Contiguous KV Cache Allocation
=============================================
Max sequence length: 4,096 tokens
Number of requests: 8

Actual sequence lengths: [np.int64(234), np.int64(1505), np.int64(658), np.int64(456), np.int64(84), np.int64(84), np.int64(50), np.int64(1005)]

Total slots allocated: 32,768
Total slots actually used: 4,076
Slots wasted: 28,692
Memory utilization: 12.4%

With realistic sequence length distributions, traditional allocation achieves only about 12% memory utilization in this simulation. This means nearly 88% of the reserved GPU memory sits unused, unable to serve additional requests.

Out[4]:
Visualization
Horizontal bar chart showing eight requests with actual token usage in green and wasted pre-allocated space in red, demonstrating low memory utilization.
Comparison of used versus allocated token slots in a traditional KV cache. Pre-allocation for the maximum sequence length results in roughly 88% memory waste, as the actually used tokens (green) occupy only a small fraction of the reserved capacity (red).

External Fragmentation

Even more problematic is external fragmentation, which occurs when requests complete and free their memory. Consider this scenario:

In[5]:
Code
def visualize_fragmentation():
    """
    Visualize how external fragmentation develops over time.
    """
    # Memory represented as slots (simplified)
    memory_size = 40  # Total slots available
    max_seq_alloc = 10  # Each request needs 10 contiguous slots

    # Timeline of events
    events = [
        ("alloc", "A"),  # Request A starts
        ("alloc", "B"),  # Request B starts
        ("alloc", "C"),  # Request C starts
        ("alloc", "D"),  # Request D starts
        ("free", "A"),  # Request A completes
        ("free", "C"),  # Request C completes
        ("alloc", "E"),  # Request E arrives - needs 10 contiguous slots
    ]

    # Track allocations: dict mapping request_id -> (start, end)
    allocations = {}
    next_free = 0
    snapshots = []

    for event_type, request_id in events:
        if event_type == "alloc":
            if (
                request_id not in allocations
                and next_free + max_seq_alloc <= memory_size
            ):
                allocations[request_id] = (next_free, next_free + max_seq_alloc)
                next_free += max_seq_alloc
        else:  # free
            if request_id in allocations:
                del allocations[request_id]

        # Create memory snapshot
        snapshot = ["." for _ in range(memory_size)]
        for req_id, (start, end) in allocations.items():
            for i in range(start, end):
                snapshot[i] = req_id

        snapshots.append(
            (f"{event_type} {request_id}", "".join(snapshot), dict(allocations))
        )

    return snapshots, max_seq_alloc


snapshots, req_size = visualize_fragmentation()

# Pre-calculate display data
display_data = []
for event, memory, _ in snapshots:
    display_data.append((event, memory, memory.count(".")))

total_free = snapshots[-1][1].count(".")
Out[6]:
Console
Memory Fragmentation Timeline
=======================================================
Legend: . = free, A/B/C/D/E = allocated to request
Each request needs 10 contiguous slots

After alloc A : [AAAAAAAAAA..............................]  Free: 30
After alloc B : [AAAAAAAAAABBBBBBBBBB....................]  Free: 20
After alloc C : [AAAAAAAAAABBBBBBBBBBCCCCCCCCCC..........]  Free: 10
After alloc D : [AAAAAAAAAABBBBBBBBBBCCCCCCCCCCDDDDDDDDDD]  Free: 0
After free A  : [..........BBBBBBBBBBCCCCCCCCCCDDDDDDDDDD]  Free: 10
After free C  : [..........BBBBBBBBBB..........DDDDDDDDDD]  Free: 20
After alloc E : [..........BBBBBBBBBB..........DDDDDDDDDD]  Free: 20

-------------------------------------------------------
After freeing A and C, we have 20 free slots total,
but this simple allocator never reuses freed regions,
so request E cannot be placed.
Out[7]:
Visualization
Series of horizontal bar charts showing memory allocation state at each step, with colored blocks representing allocated requests and gray blocks representing free memory.
A timeline of memory allocation and deallocation demonstrating the emergence of external fragmentation. Although 20 slots are free after requests A and C finish, they are split into two non-contiguous 10-slot regions that the simple end-of-memory allocator cannot hand to request E.

This is external fragmentation: 20 slots are free, but they are split into two separate 10-slot holes. The simple allocator in this simulation only hands out memory from the end of the address space, so it cannot reuse those holes at all and rejects request E outright. Even a smarter allocator that tracked the holes would hit the same wall as soon as a request needed more contiguous space than the largest hole. In practice, with many concurrent requests starting and completing at different times, this fragmentation can become severe.
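
A quick standalone check (not part of the simulation above) makes the point concrete: with memory split into two 10-slot holes, any request needing more than 10 contiguous slots fails even though 20 slots are free in total.

# Two 10-slot holes remain after A and C are freed
free_holes = [10, 10]

for request_size in (10, 15, 20):
    fits = any(hole >= request_size for hole in free_holes)
    print(
        f"request for {request_size:2d} contiguous slots: "
        f"{'fits in a hole' if fits else 'cannot be placed'} "
        f"(total free = {sum(free_holes)})"
    )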

Internal Fragmentation

Internal fragmentation compounds the problem. When we allocate based on maximum sequence length but the actual sequence is shorter, the unused portion within the allocation is wasted:

In[8]:
Code
import matplotlib.pyplot as plt

plt.rcParams.update(
    {
        "figure.figsize": (3.0, 2.5),
        "figure.dpi": 300,
        "figure.constrained_layout.use": True,
        "font.family": "sans-serif",
        "font.sans-serif": [
            "Noto Sans CJK SC",
            "Apple SD Gothic Neo",
            "DejaVu Sans",
            "Arial",
        ],
        "font.size": 10,
        "axes.titlesize": 11,
        "axes.titleweight": "bold",
        "axes.titlepad": 8,
        "axes.labelsize": 10,
        "axes.labelpad": 4,
        "xtick.labelsize": 9,
        "ytick.labelsize": 9,
        "legend.fontsize": 9,
        "legend.title_fontsize": 10,
        "legend.frameon": True,
        "legend.loc": "best",
        "lines.linewidth": 1.5,
        "lines.markersize": 5,
        "axes.grid": True,
        "grid.alpha": 0.3,
        "grid.linestyle": "--",
        "axes.spines.top": False,
        "axes.spines.right": False,
        "axes.prop_cycle": plt.cycler(
            color=["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#7f7f7f"]
        ),
    }
)

# Plot 1: Internal Fragmentation
plt.figure()
allocations = [
    ("Request A", 0, 1000, 300),  # allocated, actual used
    ("Request B", 1000, 2000, 750),
    ("Request C", 2000, 3000, 200),
    ("Request D", 3000, 4000, 600),
]
for label, start, end, used in allocations:
    # Used portion
    plt.barh(
        label,
        used,
        left=start,
        color="#2ecc71",
        edgecolor="black",
        linewidth=0.5,
    )
    # Wasted portion (internal fragmentation)
    plt.barh(
        label,
        (end - start) - used,
        left=start + used,
        color="#e74c3c",
        alpha=0.5,
        edgecolor="black",
        linewidth=0.5,
    )
plt.xlabel("Memory Slots")
plt.title("Internal Fragmentation\n(Unused space within allocations)")
plt.legend(["Used", "Wasted"], loc="upper right")
plt.xlim(0, 4500)
plt.show()

# Plot 2: External Fragmentation
plt.figure()
memory_blocks = [
    ("B: 1000", "#2ecc71"),  # In use
    ("Free: 1000", "#e74c3c"),  # External fragment
    ("D: 1000", "#2ecc71"),  # In use
    ("Free: 1000", "#e74c3c"),  # External fragment
]
bottom = 0
for label, color in memory_blocks:
    size = 1000
    plt.bar(
        ["Memory"],
        size,
        bottom=bottom,
        color=color,
        edgecolor="black",
        linewidth=0.5,
        alpha=0.7 if "Free" in label else 1.0,
    )
    plt.text(0, bottom + size / 2, label, ha="center", va="center")
    bottom += size
plt.ylabel("Memory Slots")
plt.title("External Fragmentation\n(Scattered free blocks)")
plt.ylim(0, 4500)
plt.show()
Out[8]:
Visualization
Two bar charts comparing fragmentation types. The first shows unused space within allocated blocks (internal). The second shows free memory blocks scattered non-contiguously (external).
Visualization of internal fragmentation within traditional KV cache allocations. Wasted space (red) occurs inside individual reserved blocks when the actual sequence length is shorter than the pre-allocated capacity.
Visualization of external fragmentation where free memory blocks are scattered across the address space. Even though the total free memory is sufficient, the lack of contiguous slots prevents new requests from being served.

The combination of internal and external fragmentation means traditional serving systems achieve only 20-40% memory efficiency. For a GPU with 40 GB of KV cache capacity, this translates to effectively having only 8-16 GB available for actual use.

Page-Based Memory Allocation

The solution comes from a technique that operating systems have used for decades: virtual memory with paging. Instead of requiring contiguous physical memory, paging allows data to be stored in fixed-size blocks (pages) scattered anywhere in memory, with a page table tracking where each piece actually resides. This technique fundamentally changed operating system design in the 1960s, and it proves equally effective for managing KV caches in modern LLM serving systems.

The Paging Abstraction

To understand how paging solves fragmentation, we must distinguish between logical and physical memory organization. A sequence's KV cache appears contiguous from a logical perspective, with tokens numbered 0, 1, 2, and so on. However, the actual physical storage need not mirror this logical arrangement. As long as we maintain a mapping that tells us where each piece of data resides in physical memory, we can store the data anywhere we like.

In the context of KV caches, paging works through four key concepts:

  • Physical blocks: GPU memory is divided into fixed-size blocks, each capable of holding a small number of KV pairs (typically 16 tokens worth). These blocks represent the actual memory locations where data is stored.
  • Logical blocks: Each sequence's KV cache is divided into logical blocks of the same size. These represent the abstract, contiguous view of the sequence's cache.
  • Block table: A mapping from logical block indices to physical block locations. This data structure is the key that enables the translation between the sequence's logical view and the scattered physical reality.
  • Dynamic allocation: Physical blocks are allocated only when needed, not pre-allocated. This on-demand approach means we never reserve memory for tokens that haven't been generated yet.

This approach decouples how we think about the data (logically contiguous per sequence) from how we store it (potentially scattered across physical memory). Let's implement a simulator to see these concepts in action:

In[9]:
Code
class PagedKVCacheSimulator:
    """
    Simulates paged KV cache allocation to demonstrate the concept.
    """

    def __init__(self, total_physical_blocks: int, block_size: int = 16):
        self.total_blocks = total_physical_blocks
        self.block_size = block_size  # Tokens per block

        # Physical memory: list of blocks, None if free
        self.physical_memory = [None] * total_physical_blocks

        # Block tables: sequence_id -> list of physical block indices
        self.block_tables = {}

        # Free block list
        self.free_blocks = list(range(total_physical_blocks))

    def allocate_block(self, sequence_id: int) -> int:
        """Allocate a single physical block for a sequence."""
        if not self.free_blocks:
            raise MemoryError("No free blocks available")

        # Get a free physical block
        physical_idx = self.free_blocks.pop(0)
        self.physical_memory[physical_idx] = sequence_id

        # Add to sequence's block table
        if sequence_id not in self.block_tables:
            self.block_tables[sequence_id] = []
        self.block_tables[sequence_id].append(physical_idx)

        return physical_idx

    def free_sequence(self, sequence_id: int):
        """Free all blocks belonging to a sequence."""
        if sequence_id in self.block_tables:
            for physical_idx in self.block_tables[sequence_id]:
                self.physical_memory[physical_idx] = None
                self.free_blocks.append(physical_idx)
            del self.block_tables[sequence_id]

    def add_tokens(self, sequence_id: int, num_tokens: int):
        """Add tokens to a sequence, allocating blocks as needed."""
        if sequence_id not in self.block_tables:
            self.block_tables[sequence_id] = []

        current_tokens = len(self.block_tables[sequence_id]) * self.block_size
        tokens_needed = num_tokens

        while tokens_needed > 0:
            # Check if current block has space
            current_blocks = len(self.block_tables[sequence_id])
            if current_blocks == 0 or current_tokens % self.block_size == 0:
                # Need a new block
                self.allocate_block(sequence_id)

            # Fill current block
            space_in_block = self.block_size - (
                current_tokens % self.block_size
            )
            tokens_to_add = min(space_in_block, tokens_needed)
            tokens_needed -= tokens_to_add
            current_tokens += tokens_to_add

        return current_tokens

    def get_utilization(self) -> dict:
        """Calculate memory utilization statistics."""
        used_blocks = self.total_blocks - len(self.free_blocks)
        return {
            "total_blocks": self.total_blocks,
            "used_blocks": used_blocks,
            "free_blocks": len(self.free_blocks),
            "utilization": used_blocks / self.total_blocks,
            "sequences": len(self.block_tables),
        }


# Demonstrate paged allocation
paged_cache = PagedKVCacheSimulator(total_physical_blocks=200, block_size=16)
In[10]:
Code
# Create a fresh paged cache with enough capacity for this demonstration
# Total tokens needed: A(300) + B(750) + C(200) + D(600) + E(500) = 2350 tokens
# At 16 tokens per block, we need at least 147 blocks, use 200 for headroom
paged_cache = PagedKVCacheSimulator(total_physical_blocks=200, block_size=16)

# Define request sizes
req_a = 300
req_b = 750
req_c = 200
req_d = 600
req_e = 500

# Simulate the same scenario that caused fragmentation earlier
# Request A
paged_cache.add_tokens(sequence_id=0, num_tokens=req_a)
# Request B
paged_cache.add_tokens(sequence_id=1, num_tokens=req_b)
# Request C
paged_cache.add_tokens(sequence_id=2, num_tokens=req_c)
# Request D
paged_cache.add_tokens(sequence_id=3, num_tokens=req_d)

before_free = paged_cache.get_utilization()

# Free requests A and C (simulating completion)
paged_cache.free_sequence(0)
paged_cache.free_sequence(2)

after_free = paged_cache.get_utilization()

# Now allocate a new request E
paged_cache.add_tokens(sequence_id=4, num_tokens=req_e)

after_new = paged_cache.get_utilization()
Out[11]:
Console
Paged KV Cache Allocation Demonstration
==================================================
Block size: 16 tokens per block
Total physical blocks: 200

After allocating requests A(300), B(750), C(200), D(600):
  Used blocks: 117
  Utilization: 58.5%

After freeing requests A and C:
  Used blocks: 85
  Free blocks: 115

After allocating new request E(500 tokens):
  Used blocks: 117
  Sequences active: 3
  Utilization: 58.5%

No fragmentation! Request E was served immediately from the free pool, and the non-contiguous blocks freed by requests A and C sit in that same pool, ready for any future request.

The key insight is that request E's blocks don't need to be contiguous in physical memory. The block table maintains the logical ordering, while physical blocks can be scattered anywhere. This eliminates external fragmentation entirely. When the simulator freed sequences A and C, their physical blocks returned to the free pool. Request E then claimed blocks from this pool, assembling its logical sequence from whatever physical locations happened to be available. From request E's perspective, it has a perfectly contiguous cache; the block table handles all the translation behind the scenes.

Block Table Structure

The block table is the critical data structure that enables this flexibility. For each sequence, it maintains an ordered list of physical block indices. The position in this list corresponds to the logical block number, while the value at that position indicates where in physical memory that block actually resides.

To visualize this concept, consider a sequence whose KV cache spans three logical blocks. The block table might contain the entries [7, 2, 15], indicating that logical block 0 is stored in physical block 7, logical block 1 is stored in physical block 2, and logical block 2 is stored in physical block 15. When the attention mechanism needs to access the key for token 20 (which falls in logical block 1, since each block holds 16 tokens), it consults the block table, finds that logical block 1 maps to physical block 2, and reads the data from that location.
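
To make the translation concrete, here is a minimal sketch of the lookup, using the hypothetical block table [7, 2, 15] from the example above and 16 tokens per block.

block_size = 16
block_table = [7, 2, 15]  # logical block index -> physical block index


def locate_token(token_idx: int) -> tuple[int, int]:
    logical_block = token_idx // block_size  # which logical block holds the token
    offset = token_idx % block_size          # position inside that block
    return block_table[logical_block], offset


print(locate_token(20))  # -> (2, 4): token 20 lives in physical block 2, slot 4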

In[12]:
Code
def visualize_block_tables(cache: PagedKVCacheSimulator):
    """Show the block table mappings for active sequences."""
    print("Block Tables (Logical → Physical mapping)")
    print("-" * 45)

    for seq_id, blocks in sorted(cache.block_tables.items()):
        logical_indices = list(range(len(blocks)))
        print(f"\nSequence {seq_id}:")
        print(f"  Logical blocks:  {logical_indices}")
        print(f"  Physical blocks: {blocks}")

        # Show the mapping
        mapping = [f"L{l}→P{p}" for l, p in enumerate(blocks)]
        print(
            f"  Mapping: {', '.join(mapping[:5])}{'...' if len(mapping) > 5 else ''}"
        )
Out[13]:
Console
Block Tables (Logical → Physical mapping)
---------------------------------------------

Sequence 1:
  Logical blocks:  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46]
  Physical blocks: [19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65]
  Mapping: L0→P19, L1→P20, L2→P21, L3→P22, L4→P23...

Sequence 3:
  Logical blocks:  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37]
  Physical blocks: [79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116]
  Mapping: L0→P79, L1→P80, L2→P81, L3→P82, L4→P83...

Sequence 4:
  Logical blocks:  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]
  Physical blocks: [117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148]
  Mapping: L0→P117, L1→P118, L2→P119, L3→P120, L4→P121...

Notice that a sequence's physical blocks can live anywhere in GPU memory: sequence 4 (request E) simply took the next blocks from the free pool, while the non-contiguous blocks freed by sequences 0 and 2 (physical blocks 0-18 and 66-78) sit in that same pool, ready for whichever request arrives next. From each sequence's perspective, it has a contiguous logical address space. The block table handles the translation.

Out[14]:
Visualization
Two-panel diagram showing logical block ordering per sequence on the left and scattered physical block placement on the right, with colored blocks indicating ownership.
The logical representation of a paged KV cache where sequences appear as contiguous blocks. This abstraction allows the attention mechanism to treat the cache as a linear sequence regardless of where blocks are physically stored.
The physical layout of memory blocks in a paged system. Physical blocks can sit anywhere in GPU memory, and blocks freed by completed requests return to a shared pool that any new sequence can draw from, keeping utilization near-optimal.

This indirection through the block table introduces a small amount of overhead: every memory access must first look up the physical location. However, this cost is negligible compared to the enormous benefits of eliminating fragmentation. The block table itself is small, typically fitting entirely in fast GPU memory or cache, making lookups extremely fast.
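
To put the overhead in perspective, here is a back-of-the-envelope calculation (assuming 4-byte block indices; real layouts differ in detail) of how small the table actually is.

# Rough size of a block table: one integer entry per block
max_tokens = 4096
block_size = 16
bytes_per_entry = 4

entries_per_sequence = max_tokens // block_size          # 256 entries
table_bytes = entries_per_sequence * bytes_per_entry     # 1,024 bytes per sequence

print(f"{table_bytes} bytes per sequence")
print(f"{256 * table_bytes / 1024:.0f} KiB for 256 concurrent sequences")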

The PagedAttention Algorithm

With KV cache data potentially scattered across non-contiguous memory locations, we need to modify the attention computation. The PagedAttention algorithm handles this by fetching keys and values according to the block table during attention calculation. This modification preserves the mathematical correctness of attention while accommodating the new memory layout.

Standard Attention vs PagedAttention

In standard attention, keys and values are stored in contiguous tensors. For a sequence of length $n$, the attention mechanism computes a weighted sum of values based on the similarity between the query and keys:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

where:

  • $Q$: the query matrix (typically for the current token being generated)
  • $K$: the key matrix containing keys for all tokens in the sequence (contiguous tensor of shape $n \times d_k$)
  • $V$: the value matrix containing values for all tokens in the sequence (contiguous tensor of shape $n \times d_v$)
  • $d_k$: the dimension of the key vectors
  • $d_v$: the dimension of the value vectors
  • $n$: the sequence length
  • $\text{softmax}$: the function that normalizes similarity scores into probabilities

The formula consists of three computational stages. First, the dot product $QK^T$ measures the similarity between the query and each key. Intuitively, this operation asks "how relevant is each previous token to the token we're currently generating?" Keys that are more similar to the query receive higher scores, indicating greater relevance. The result is a vector of raw similarity scores, one for each position in the sequence.

The second stage applies the $\text{softmax}$ function to these raw scores. This normalization transforms the arbitrary similarity values into a proper probability distribution that sums to 1. The softmax operation has an important property: it accentuates differences, making high scores relatively higher and low scores relatively lower. The output is the attention weights, which tell us how much to attend to each previous position.

The third stage multiplies these attention weights by the value matrix $V$, computing a weighted average of the value vectors. Tokens with higher attention weights contribute more to the final output, while tokens with lower weights contribute less. This mechanism allows the model to selectively focus on the most relevant information from the context.

The division by $\sqrt{d_k}$ in the formula ensures numerical stability. Without this scaling factor, the dot products between queries and keys tend to grow large as the dimension $d_k$ increases. Large dot products cause the softmax function to saturate, producing attention distributions that are nearly one-hot (all attention on one token). The scaling factor keeps the dot products in a reasonable range, allowing the softmax to produce more nuanced attention patterns and improving numerical stability during training and inference.
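
A quick numerical illustration (synthetic values, one query against 64 random keys) shows the effect of the scaling factor:

import torch

torch.manual_seed(0)
d_k = 128
q = torch.randn(d_k)
K = torch.randn(64, d_k)  # 64 random keys

scores = K @ q  # raw dot products grow with d_k (std ~ sqrt(d_k))
unscaled = torch.softmax(scores, dim=-1)
scaled = torch.softmax(scores / d_k**0.5, dim=-1)

print(f"largest weight without scaling: {unscaled.max():.3f}")  # typically near 1.0
print(f"largest weight with scaling:    {scaled.max():.3f}")    # far less peaked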

In PagedAttention, $K$ and $V$ are stored in non-contiguous blocks rather than contiguous tensors. The attention computation must gather the relevant blocks according to the block table. While the mathematical operation remains identical, the implementation must navigate the scattered physical layout:

In[15]:
Code
import torch
import torch.nn.functional as F


def paged_attention_reference(
    query: torch.Tensor,  # Shape: (batch, num_heads, 1, head_dim) for single token
    key_cache: torch.Tensor,  # Shape: (num_blocks, block_size, num_heads, head_dim)
    value_cache: torch.Tensor,  # Shape: (num_blocks, block_size, num_heads, head_dim)
    block_tables: torch.Tensor,  # Shape: (batch, max_blocks) - physical block indices
    context_lens: torch.Tensor,  # Shape: (batch,) - actual sequence lengths
    block_size: int,
    scale: float,
) -> torch.Tensor:
    """
    Reference implementation of PagedAttention for understanding.
    Not optimized - real implementations use custom CUDA kernels.
    """
    batch_size, num_heads, _, head_dim = query.shape
    output = torch.zeros_like(query)

    for b in range(batch_size):
        seq_len = context_lens[b].item()
        num_blocks_used = (seq_len + block_size - 1) // block_size

        # Gather keys and values from physical blocks
        keys_list = []
        values_list = []

        for block_idx in range(num_blocks_used):
            physical_block = block_tables[b, block_idx].item()

            # Calculate how many tokens to use from this block
            start_pos = block_idx * block_size
            end_pos = min(start_pos + block_size, seq_len)
            tokens_in_block = end_pos - start_pos

            # Fetch from physical memory location
            keys_list.append(key_cache[physical_block, :tokens_in_block])
            values_list.append(value_cache[physical_block, :tokens_in_block])

        # Concatenate gathered keys and values
        # Shape: (seq_len, num_heads, head_dim)
        keys = torch.cat(keys_list, dim=0)
        values = torch.cat(values_list, dim=0)

        # Reshape for attention computation
        # keys: (num_heads, seq_len, head_dim)
        keys = keys.permute(1, 0, 2)
        values = values.permute(1, 0, 2)

        # Query shape: (num_heads, 1, head_dim)
        q = query[b]  # (num_heads, 1, head_dim)

        # Compute attention scores
        # (num_heads, 1, head_dim) @ (num_heads, head_dim, seq_len) -> (num_heads, 1, seq_len)
        attn_scores = torch.matmul(q, keys.transpose(-2, -1)) * scale
        attn_probs = F.softmax(attn_scores, dim=-1)

        # Apply attention to values
        # (num_heads, 1, seq_len) @ (num_heads, seq_len, head_dim) -> (num_heads, 1, head_dim)
        attn_output = torch.matmul(attn_probs, values)
        output[b] = attn_output

    return output

This reference implementation shows the core idea: before computing attention, we gather the keys and values from their physical locations according to the block table. The attention computation itself remains unchanged. The gathering process iterates through logical block indices, looks up the corresponding physical block in the block table, and fetches the data from that physical location. Once all the keys and values are assembled into contiguous tensors, the standard attention formula applies.

Demonstration with Non-Contiguous Blocks

Let's verify that PagedAttention produces correct results even when blocks are scattered:

In[16]:
Code
# Set up a small example
torch.manual_seed(42)

num_blocks = 8
block_size = 4
num_heads = 2
head_dim = 8
batch_size = 2

# Physical KV cache - blocks can be used by any sequence
key_cache = torch.randn(num_blocks, block_size, num_heads, head_dim)
value_cache = torch.randn(num_blocks, block_size, num_heads, head_dim)

# Block tables - note non-contiguous physical blocks
# Sequence 0 uses physical blocks [0, 3, 5] (scattered!)
# Sequence 1 uses physical blocks [1, 7]
block_tables = torch.tensor(
    [
        [
            0,
            3,
            5,
            0,
            0,
            0,
            0,
            0,
        ],  # Sequence 0: 3 blocks (12 tokens max, using 10)
        [
            1,
            7,
            0,
            0,
            0,
            0,
            0,
            0,
        ],  # Sequence 1: 2 blocks (8 tokens max, using 7)
    ]
)

# Actual sequence lengths
context_lens = torch.tensor([10, 7])

# Query for current token being generated
query = torch.randn(batch_size, num_heads, 1, head_dim)
scale = 1.0 / (head_dim**0.5)

# Run paged attention
output = paged_attention_reference(
    query, key_cache, value_cache, block_tables, context_lens, block_size, scale
)

# Extract block mappings for visualization
sequence_mappings = []
for i, seq_len in enumerate(context_lens):
    # Calculate number of blocks used by this sequence
    n_blocks = (seq_len.item() + block_size - 1) // block_size
    # Slice the block table to get only the used physical blocks
    blocks = block_tables[i, :n_blocks].tolist()
    sequence_mappings.append((i, seq_len.item(), blocks))
Out[17]:
Console
PagedAttention Demonstration
=============================================
Block size: 4 tokens
Number of physical blocks: 8

Block table mappings:
  Sequence 0 (len=10): uses physical blocks [0, 3, 5]
  Sequence 1 (len=7): uses physical blocks [1, 7]

Note: blocks are non-contiguous in physical memory!

Output shape: torch.Size([2, 2, 1, 8])
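
To confirm the result is correct and not just well-shaped, a quick check (a sketch reusing the tensors from the cell above, not part of the original demo) gathers each sequence's keys and values contiguously and compares against PyTorch's built-in scaled dot-product attention:

# Verify the paged output against standard attention on contiguously gathered K/V
for b in range(batch_size):
    seq_len = context_lens[b].item()
    n_blocks = (seq_len + block_size - 1) // block_size

    keys = torch.cat(
        [
            key_cache[block_tables[b, i].item(), : min(block_size, seq_len - i * block_size)]
            for i in range(n_blocks)
        ],
        dim=0,
    ).permute(1, 0, 2)  # (num_heads, seq_len, head_dim)
    values = torch.cat(
        [
            value_cache[block_tables[b, i].item(), : min(block_size, seq_len - i * block_size)]
            for i in range(n_blocks)
        ],
        dim=0,
    ).permute(1, 0, 2)

    # PyTorch's default scale is 1/sqrt(head_dim), which matches `scale` above
    expected = F.scaled_dot_product_attention(query[b], keys, values)
    assert torch.allclose(output[b], expected, atol=1e-5)

print("Paged output matches standard attention for all sequences")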

Optimized CUDA Kernels

The reference implementation above gathers blocks into contiguous memory before computing attention. This defeats some of the purpose of paging, as it requires temporary memory allocation. Production implementations like vLLM use custom CUDA kernels that compute attention directly on the paged data structure, avoiding the explicit gather step entirely.

The key optimization insight is that attention can be computed incrementally as blocks are loaded, rather than requiring all data to be assembled first. This approach leverages the online softmax algorithm, the same trick that underlies FlashAttention (Part XIV, Chapter 8), to maintain numerical stability while processing blocks one at a time. The online softmax keeps a running maximum and a running sum, updating them as each new block is processed. This allows the kernel to compute the final attention output without ever needing to store all the keys and values in a contiguous buffer.
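
The following standalone NumPy sketch (single attention head, illustrative only, not vLLM's kernel) shows that processing K/V block by block with a running maximum and running sum reproduces the result of gathering everything first:

import numpy as np


def blockwise_attention(q, k_blocks, v_blocks, scale):
    """Attention over K/V supplied block by block, using online softmax."""
    m = -np.inf                            # running maximum of the scores
    denom = 0.0                            # running softmax denominator
    acc = np.zeros(v_blocks[0].shape[1])   # running weighted sum of values
    for k_blk, v_blk in zip(k_blocks, v_blocks):
        scores = (k_blk @ q) * scale       # scores for this block's tokens
        m_new = max(m, scores.max())
        rescale = np.exp(m - m_new)        # correct previously accumulated state
        weights = np.exp(scores - m_new)
        denom = denom * rescale + weights.sum()
        acc = acc * rescale + weights @ v_blk
        m = m_new
    return acc / denom


# Compare against the naive "gather everything first" computation
rng = np.random.default_rng(0)
head_dim, block_size, n_blocks = 8, 4, 3
q = rng.normal(size=head_dim)
k_blocks = [rng.normal(size=(block_size, head_dim)) for _ in range(n_blocks)]
v_blocks = [rng.normal(size=(block_size, head_dim)) for _ in range(n_blocks)]

K, V = np.concatenate(k_blocks), np.concatenate(v_blocks)
scores = (K @ q) / np.sqrt(head_dim)
ref_weights = np.exp(scores - scores.max())
ref_weights /= ref_weights.sum()
reference = ref_weights @ V

out = blockwise_attention(q, k_blocks, v_blocks, 1 / np.sqrt(head_dim))
assert np.allclose(out, reference)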

In[18]:
Code
# Pseudocode for optimized PagedAttention kernel
# This shows the conceptual approach - actual CUDA code is more complex


def paged_attention_kernel_pseudocode():
    """
    Conceptual flow of the optimized PagedAttention CUDA kernel.

    Key optimizations:
    1. No explicit gather step - compute attention directly from blocks
    2. Each thread block processes a portion of the attention
    3. Leverage shared memory for frequently accessed block table entries
    """
    kernel_description = """
    CUDA Kernel: paged_attention_v1
    
    Grid: (num_heads, batch_size, 1)
    Block: (NUM_THREADS,)
    
    For each (head, batch) pair:
        1. Load query vector for this head into registers
        2. Initialize running softmax numerator and denominator
        
        For each logical block in sequence:
            a. Look up physical block index from block_table
            b. Load key block from global memory (coalesced access)
            c. Compute partial attention scores for this block
            d. Update running softmax using online softmax trick
            
        For each logical block in sequence:
            a. Load value block from global memory
            b. Weight by final attention probabilities
            c. Accumulate to output
        
        Write output for this (batch, head, token)
    """
    return kernel_description


kernel_pseudocode = paged_attention_kernel_pseudocode()
Out[19]:
Console

    CUDA Kernel: paged_attention_v1

    Grid: (num_heads, batch_size, 1)
    Block: (NUM_THREADS,)

    For each (head, batch) pair:
        1. Load query vector for this head into registers
        2. Initialize running softmax numerator and denominator

        For each logical block in sequence:
            a. Look up physical block index from block_table
            b. Load key block from global memory (coalesced access)
            c. Compute partial attention scores for this block
            d. Update running softmax using online softmax trick

        For each logical block in sequence:
            a. Load value block from global memory
            b. Weight by final attention probabilities
            c. Accumulate to output

        Write output for this (batch, head, token)
    

The critical optimization is computing attention incrementally as blocks are loaded, rather than gathering all blocks first. By processing blocks sequentially and maintaining running softmax statistics, the kernel achieves the same mathematical result as assembling all keys and values first, but with dramatically reduced memory requirements and no temporary contiguous buffers.

vLLM Implementation

vLLM is the open-source serving framework that introduced PagedAttention. Understanding its architecture reveals how paging integrates into a complete serving system.

Memory Manager

The vLLM BlockSpaceManager handles physical block allocation:

In[20]:
Code
from dataclasses import dataclass
from typing import Dict, List
from collections import deque


@dataclass
class PhysicalBlock:
    """Represents a physical block of GPU memory."""

    block_number: int
    ref_count: int = 0  # For copy-on-write sharing


@dataclass
class LogicalBlock:
    """Represents a logical block in a sequence."""

    block_number: int
    num_tokens: int = 0


class BlockSpaceManager:
    """
    Manages allocation of physical blocks to sequences.
    Simplified version of vLLM's BlockSpaceManager.
    """

    def __init__(
        self,
        block_size: int,
        num_gpu_blocks: int,
        num_cpu_blocks: int = 0,  # For CPU offloading
    ):
        self.block_size = block_size
        self.num_gpu_blocks = num_gpu_blocks
        self.num_cpu_blocks = num_cpu_blocks

        # Free block pools
        self.gpu_free_blocks: deque = deque(range(num_gpu_blocks))
        self.cpu_free_blocks: deque = deque(range(num_cpu_blocks))

        # Block tables for each sequence
        self.block_tables: Dict[int, List[int]] = {}

        # Track block reference counts for sharing
        self.block_ref_counts: Dict[int, int] = {}

    def can_allocate(self, num_blocks: int) -> bool:
        """Check if we can allocate the requested number of blocks."""
        return len(self.gpu_free_blocks) >= num_blocks

    def allocate(self, seq_id: int, num_blocks: int) -> List[int]:
        """Allocate blocks for a new sequence."""
        if not self.can_allocate(num_blocks):
            raise MemoryError(f"Cannot allocate {num_blocks} blocks")

        allocated = []
        for _ in range(num_blocks):
            block = self.gpu_free_blocks.popleft()
            allocated.append(block)
            self.block_ref_counts[block] = 1

        self.block_tables[seq_id] = allocated
        return allocated

    def append_block(self, seq_id: int) -> int:
        """Allocate one more block for an existing sequence."""
        if not self.gpu_free_blocks:
            raise MemoryError("No free blocks available")

        block = self.gpu_free_blocks.popleft()
        self.block_tables[seq_id].append(block)
        self.block_ref_counts[block] = 1
        return block

    def free(self, seq_id: int) -> List[int]:
        """Free all blocks belonging to a sequence."""
        if seq_id not in self.block_tables:
            return []

        freed = []
        for block in self.block_tables[seq_id]:
            self.block_ref_counts[block] -= 1
            if self.block_ref_counts[block] == 0:
                self.gpu_free_blocks.append(block)
                del self.block_ref_counts[block]
                freed.append(block)

        del self.block_tables[seq_id]
        return freed

    def get_block_table(self, seq_id: int) -> List[int]:
        """Get the block table for a sequence."""
        return self.block_tables.get(seq_id, [])

    def get_num_free_blocks(self) -> int:
        """Get the number of available blocks."""
        return len(self.gpu_free_blocks)
In[21]:
Code
# Simulate vLLM-style block management
manager = BlockSpaceManager(block_size=16, num_gpu_blocks=100)

# Sequence arrives with prompt
seq_0_prompt_len = 45  # tokens
initial_blocks_needed = (seq_0_prompt_len + 16 - 1) // 16  # Ceiling division
manager.allocate(seq_id=0, num_blocks=initial_blocks_needed)

# Another sequence arrives
seq_1_prompt_len = 128
blocks_needed = (seq_1_prompt_len + 16 - 1) // 16
manager.allocate(seq_id=1, num_blocks=blocks_needed)

status_after_prompts = {
    "free_blocks": manager.get_num_free_blocks(),
    # Copy the tables: get_block_table returns the live list, which mutates later
    "seq_0_blocks": list(manager.get_block_table(0)),
    "seq_1_blocks": list(manager.get_block_table(1)),
}

# Generation proceeds, each sequence needs more blocks
for _ in range(3):  # Generate 3 more blocks worth of tokens for seq 0
    manager.append_block(seq_id=0)

# Seq 1 completes
manager.free(seq_id=1)

status_final = {
    "free_blocks": manager.get_num_free_blocks(),
    "seq_0_blocks": manager.get_block_table(0),
}
Out[22]:
Console
vLLM Block Space Manager Simulation
==================================================
Total GPU blocks: 100
Block size: 16 tokens

After processing prompts:
  Sequence 0 (45 tokens): blocks [0, 1, 2]
  Sequence 1 (128 tokens): blocks [3, 4, 5, 6, 7, 8, 9, 10]
  Free blocks remaining: 89

After seq 0 generates more + seq 1 completes:
  Sequence 0 blocks: [0, 1, 2, 11, 12, 13]
  Free blocks: 94

Freed blocks from sequence 1 are available for reuse!

The simulation illustrates the manager's efficiency: when Sequence 1 completes, its blocks are immediately returned to the free pool. These reclaimed blocks are then available for Sequence 0's expansion or new requests, preventing the memory fragmentation issues seen in static allocation.

One elegant feature of paged memory is that it enables copy-on-write (CoW) semantics for parallel decoding strategies such as beam search. Copy-on-write addresses a fundamental challenge: multiple candidate sequences (beams) that share a common prefix would traditionally require duplicating the KV cache data for that prefix, even though it is identical across all beams.

With copy-on-write, when multiple beams share a common prefix, they can share the same physical blocks for that prefix rather than duplicating the data. The block manager tracks reference counts for each physical block. When a block has multiple references, it means multiple sequences are sharing that block. Only when a sequence needs to modify a shared block does the system create a copy, hence the name "copy-on-write."

To understand why this matters, consider a beam search with width 4. Without copy-on-write, each beam would need its own complete copy of the KV cache, requiring 4x the memory. With copy-on-write, all four beams share the physical blocks for their common prefix. Memory is only allocated separately when the beams diverge and generate different tokens.
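
Before looking at the numbers, here is a minimal sketch of what fork and copy-on-write could look like on top of the simplified BlockSpaceManager above. The fork and write_block helpers are hypothetical illustrations, not vLLM's actual API:

def fork(manager: BlockSpaceManager, parent_id: int, child_id: int) -> None:
    """Give the child the parent's block table and bump reference counts."""
    manager.block_tables[child_id] = list(manager.block_tables[parent_id])
    for block in manager.block_tables[child_id]:
        manager.block_ref_counts[block] += 1


def write_block(manager: BlockSpaceManager, seq_id: int, logical_idx: int) -> int:
    """Copy a shared block before writing so sibling sequences are unaffected."""
    block = manager.block_tables[seq_id][logical_idx]
    if manager.block_ref_counts[block] > 1:  # shared -> copy on write
        new_block = manager.gpu_free_blocks.popleft()
        manager.block_ref_counts[block] -= 1
        manager.block_ref_counts[new_block] = 1
        manager.block_tables[seq_id][logical_idx] = new_block
        # (a real system would also copy the KV data to the new block here)
        return new_block
    return block


cow_manager = BlockSpaceManager(block_size=16, num_gpu_blocks=16)
cow_manager.allocate(seq_id=0, num_blocks=3)       # shared prompt: blocks 0, 1, 2
fork(cow_manager, parent_id=0, child_id=1)         # beam 1 shares the prompt
print(cow_manager.block_ref_counts)                # {0: 2, 1: 2, 2: 2}
write_block(cow_manager, seq_id=1, logical_idx=2)  # beam 1 diverges on its last block
print(cow_manager.block_tables[1])                 # [0, 1, 3]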

In[23]:
Code
def demonstrate_copy_on_write():
    """
    Show how PagedAttention enables memory-efficient beam search
    through block sharing.
    """
    # Parameters
    prompt_tokens = 48
    block_size = 16
    num_beams = 4

    # Initial sequence (3 blocks at 16 tokens each)
    num_prompt_blocks = prompt_tokens // block_size
    prompt_blocks = [
        f"P{i}" for i in range(num_prompt_blocks)
    ]  # Physical blocks for prompt

    # Beam search with 4 beams
    # All beams share the prompt blocks initially
    beams = {}
    for i in range(num_beams):
        beams[f"beam_{i}"] = {
            "blocks": list(prompt_blocks),
            "ref_count_contribution": 1,
        }

    # Reference counts: each prompt block has 4 references
    block_refs = {block: 4 for block in prompt_blocks}

    # After generating 16 tokens, each beam diverges
    # Each beam gets its own new block
    for i, beam in enumerate(beams.keys()):
        new_block = f"G{i}"  # Generation block
        beams[beam]["blocks"].append(new_block)
        block_refs[new_block] = 1

    return beams, block_refs, prompt_tokens, block_size, num_beams


beams, refs, prompt_tokens, block_size, beam_width = demonstrate_copy_on_write()

# Calculate memory savings statistics
num_beams = len(beams)
blocks_per_beam = len(next(iter(beams.values()))["blocks"])
naive_blocks = num_beams * blocks_per_beam

# Count unique physical blocks used (all active blocks are in refs)
cow_blocks = len(refs)
savings_pct = (1 - cow_blocks / naive_blocks) * 100
Out[24]:
Console
Copy-on-Write with Beam Search
==================================================
Prompt: 48 tokens in 3 blocks
Beam width: 4

Block sharing (after 16 generated tokens):
  beam_0: ['P0', 'P1', 'P2', 'G0']
  beam_1: ['P0', 'P1', 'P2', 'G1']
  beam_2: ['P0', 'P1', 'P2', 'G2']
  beam_3: ['P0', 'P1', 'P2', 'G3']

Block reference counts:
  P0: 4 references (shared by all beams)
  P1: 4 references (shared by all beams)
  P2: 4 references (shared by all beams)
  G0: 1 references
  G1: 1 references
  G2: 1 references
  G3: 1 references

Memory comparison:
  Without CoW: 16 blocks
  With CoW: 7 blocks
  Savings: 56%
Out[25]:
Visualization
Two diagrams comparing memory usage in beam search. Left shows 16 separate blocks across 4 beams. Right shows 3 shared prompt blocks plus 4 unique generation blocks.
Memory allocation for beam search without copy-on-write optimizations. Each beam maintains its own redundant copy of the shared prompt blocks, resulting in a total of 16 blocks allocated.
Efficient memory usage for beam search using copy-on-write. All beams share the common physical blocks for the prompt prefix, only branching into unique blocks during the generation phase to achieve a 56% reduction in total memory footprint.

This 56% memory reduction highlights the efficiency of copy-on-write. By allowing multiple beams to share the same physical blocks for their common prefix, we significantly reduce memory footprint compared to duplicating the data for each beam. The savings become even more dramatic with longer prompts or wider beam widths, as the shared prefix represents a larger fraction of the total cache.

Working with vLLM in Practice

Let's see how PagedAttention benefits manifest when using vLLM for inference:

In[39]:
Code
# This code shows vLLM usage patterns (requires GPU and vllm installed)
from vllm import LLM, SamplingParams

# vLLM automatically uses PagedAttention
llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    # Block size can be configured (default is typically 16)
    block_size=16,
    # GPU memory utilization for KV cache
    gpu_memory_utilization=0.9,
    # Maximum number of sequences to batch together
    max_num_seqs=256,
    # CPU swap space in GB (for swapping out preempted requests)
    swap_space=4,
)

# Multiple requests can be batched efficiently
prompts = [
    "Explain quantum computing in simple terms:",
    "Write a haiku about programming:",
    "What are the benefits of exercise?",
    # ... many more prompts
]

sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=256)

# vLLM handles dynamic batching and memory management automatically
outputs = llm.generate(prompts, sampling_params)

Key Parameters

The key parameters for vLLM related to PagedAttention are listed below, with a rough sizing sketch after the list:

  • block_size: Number of tokens per physical block. Smaller blocks reduce internal fragmentation but increase block table overhead.
  • gpu_memory_utilization: Fraction of GPU memory reserved for KV cache. Higher values enable more concurrent sequences.
  • max_num_seqs: Maximum concurrent sequences. PagedAttention enables much higher limits than traditional approaches.
  • swap_space: CPU memory for block swapping when GPU memory is exhausted.
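
To get a feel for how these parameters interact, here is a back-of-the-envelope sizing sketch. The estimate_num_blocks helper and the model dimensions are illustrative assumptions (roughly a 7B-class model in fp16), not vLLM's internal accounting, and the calculation ignores activation memory and other overheads:

def estimate_num_blocks(
    gpu_memory_gb: float,
    weights_gb: float,
    gpu_memory_utilization: float = 0.9,
    num_layers: int = 32,
    num_kv_heads: int = 32,
    head_dim: int = 128,
    block_size: int = 16,
    dtype_bytes: int = 2,  # fp16 / bf16
) -> int:
    # Keys and values, for every layer, for every token slot in the block
    bytes_per_block = (
        2 * num_layers * num_kv_heads * head_dim * block_size * dtype_bytes
    )
    usable_bytes = gpu_memory_gb * gpu_memory_utilization * 1e9 - weights_gb * 1e9
    return int(usable_bytes // bytes_per_block)


# Example: ~14 GB of fp16 weights on a 40 GB GPU
num_blocks = estimate_num_blocks(gpu_memory_gb=40, weights_gb=14)
print(f"~{num_blocks:,} blocks, enough to cache ~{num_blocks * 16:,} tokens")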

Measuring Fragmentation Reduction

Let's quantify the improvement PagedAttention provides over traditional allocation:

In[26]:
Code
import numpy as np


def compare_allocation_strategies(
    num_sequences: int = 50,
    max_seq_len: int = 2048,
    block_size: int = 16,
    seed: int = 42,
):
    """
    Compare traditional contiguous vs paged allocation.
    """
    np.random.seed(seed)

    # Generate realistic sequence lengths (biased toward shorter)
    actual_lengths = np.random.exponential(scale=300, size=num_sequences)
    actual_lengths = np.clip(actual_lengths, 50, max_seq_len).astype(int)

    # Simulate completion order (random)
    completion_order = np.random.permutation(num_sequences)

    results = {
        "traditional": {
            "peak_memory": 0,
            "avg_utilization": [],
            "fragmentation_events": 0,
        },
        "paged": {
            "peak_memory": 0,
            "avg_utilization": [],
            "fragmentation_events": 0,  # Should be 0
        },
    }

    # Traditional: pre-allocate max_seq_len for each
    trad_total_allocated = num_sequences * max_seq_len
    trad_total_used = sum(actual_lengths)
    results["traditional"]["max_seq_len"] = max_seq_len
    results["traditional"]["peak_memory"] = trad_total_allocated
    results["traditional"]["avg_utilization"] = (
        trad_total_used / trad_total_allocated
    )

    # Paged: allocate only what's needed (+ partial block overhead)
    paged_blocks_used = sum(
        (l + block_size - 1) // block_size for l in actual_lengths
    )
    paged_tokens_capacity = paged_blocks_used * block_size
    results["paged"]["peak_memory"] = paged_tokens_capacity
    results["paged"]["avg_utilization"] = (
        trad_total_used / paged_tokens_capacity
    )

    return results, actual_lengths


results, lengths = compare_allocation_strategies()

# Calculate improvement metrics
trad_peak = results["traditional"]["peak_memory"]
paged_peak = results["paged"]["peak_memory"]
memory_reduction = 1 - (paged_peak / trad_peak)
capacity_multiplier = 1 / (1 - memory_reduction)
Out[27]:
Console
Allocation Strategy Comparison
==================================================
Sequences: 50
Max sequence length: 2,048 tokens
Actual lengths: min=50, max=1051, avg=258

Traditional Contiguous Allocation:
  Peak memory: 102,400 token slots
  Actual data: 12,884 tokens
  Utilization: 12.6%

Paged Allocation:
  Peak memory: 13,280 token slots
  Actual data: 12,884 tokens
  Utilization: 97.0%

Memory reduction: 87.0%
Could serve 7.7x more sequences with same memory
In[28]:
Code
import matplotlib.pyplot as plt
import numpy as np

plt.rcParams.update(
    {
        "figure.figsize": (6.0, 4.0),
        "figure.dpi": 300,
        "figure.constrained_layout.use": True,
        "font.family": "sans-serif",
        "font.sans-serif": [
            "Noto Sans CJK SC",
            "Apple SD Gothic Neo",
            "DejaVu Sans",
            "Arial",
        ],
        "font.size": 10,
        "axes.titlesize": 11,
        "axes.titleweight": "bold",
        "axes.titlepad": 8,
        "axes.labelsize": 10,
        "axes.labelpad": 4,
        "xtick.labelsize": 9,
        "ytick.labelsize": 9,
        "legend.fontsize": 9,
        "legend.title_fontsize": 10,
        "legend.frameon": True,
        "legend.loc": "best",
        "lines.linewidth": 1.5,
        "lines.markersize": 5,
        "axes.grid": True,
        "grid.alpha": 0.3,
        "grid.linestyle": "--",
        "axes.spines.top": False,
        "axes.spines.right": False,
        "axes.prop_cycle": plt.cycler(
            color=["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#7f7f7f"]
        ),
    }
)

plt.figure()
plt.hist(lengths, bins=30, edgecolor="black", alpha=0.7, color="#3498db")
plt.axvline(
    x=2048,
    color="#e74c3c",
    linestyle="--",
    linewidth=2,
    label="Max length (allocation)",
)
plt.axvline(
    x=np.mean(lengths),
    color="#2ecc71",
    linestyle="-",
    linewidth=2,
    label=f"Mean length ({np.mean(lengths):.0f})",
)
plt.xlabel("Sequence Length (tokens)")
plt.ylabel("Count")
plt.title("Actual Sequence Length Distribution")
plt.legend()
plt.show()
Out[28]:
Visualization
Histogram of sequence lengths showing exponential distribution.
Histogram of sequence length distribution compared to maximum allocation capacity. Real-world sequences follow an exponential-like pattern, with most falling far below the 2,048-token limit (red dashed line).
In[29]:
Code
import matplotlib.pyplot as plt

plt.rcParams.update(
    {
        "figure.figsize": (6.0, 4.0),
        "figure.dpi": 300,
        "figure.constrained_layout.use": True,
        "font.family": "sans-serif",
        "font.sans-serif": [
            "Noto Sans CJK SC",
            "Apple SD Gothic Neo",
            "DejaVu Sans",
            "Arial",
        ],
        "font.size": 10,
        "axes.titlesize": 11,
        "axes.titleweight": "bold",
        "axes.titlepad": 8,
        "axes.labelsize": 10,
        "axes.labelpad": 4,
        "xtick.labelsize": 9,
        "ytick.labelsize": 9,
        "legend.fontsize": 9,
        "legend.title_fontsize": 10,
        "legend.frameon": True,
        "legend.loc": "best",
        "lines.linewidth": 1.5,
        "lines.markersize": 5,
        "axes.grid": True,
        "grid.alpha": 0.3,
        "grid.linestyle": "--",
        "axes.spines.top": False,
        "axes.spines.right": False,
        "axes.prop_cycle": plt.cycler(
            color=["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#7f7f7f"]
        ),
    }
)

plt.figure()
strategies = ["Traditional\n(Contiguous)", "Paged\n(vLLM)"]
utilizations = [
    results["traditional"]["avg_utilization"] * 100,
    results["paged"]["avg_utilization"] * 100,
]
colors = ["#e74c3c", "#2ecc71"]

bars = plt.bar(strategies, utilizations, color=colors, edgecolor="black")
plt.ylabel("Memory Utilization (%)")
plt.title("KV Cache Memory Utilization")
plt.ylim(0, 100)

## Add percentage labels
for bar, util in zip(bars, utilizations):
    plt.annotate(
        f"{util:.1f}%",
        xy=(bar.get_x() + bar.get_width() / 2, bar.get_height() + 2),
        ha="center",
        fontsize=9,
        fontweight="bold",
    )
plt.show()
Out[29]:
Visualization
Bar chart comparing memory utilization between traditional and paged allocation strategies.
Comparison of memory utilization between allocation strategies. Paged allocation achieves about 97% utilization versus roughly 13% for traditional pre-allocation in this simulation, allowing nearly 8x more sequences to fit in the same memory.

These results highlight the massive inefficiency of contiguous allocation. By switching to paged allocation, we cut peak KV cache memory by 87% in this simulation, allowing us to serve roughly 7.7 times as many sequences with the same hardware.

Benefits and Performance Impact

PagedAttention provides several interconnected benefits that compound to deliver substantial throughput improvements.

Higher Batch Sizes

The most direct benefit is the ability to serve more concurrent sequences. With 5x better memory utilization, a serving system can batch 5x more requests simultaneously. Since decoding small batches is memory-bandwidth-bound and leaves most of the GPU's compute idle, batching more sequences amortizes the cost of reading the model weights and improves GPU utilization:

In[30]:
Code
import numpy as np


def estimate_throughput_improvement(
    traditional_batch_size: int = 8,
    memory_improvement: float = 5.0,
    compute_efficiency_curve: callable = None,
):
    """
    Estimate throughput improvement from better memory utilization.
    """
    if compute_efficiency_curve is None:
        # Typical curve: efficiency improves with batch size up to a point
        def compute_efficiency_curve(batch_size):
            # Logarithmic improvement, saturating around batch 64
            return min(1.0, 0.3 + 0.7 * np.log2(batch_size + 1) / np.log2(65))

    paged_batch_size = int(traditional_batch_size * memory_improvement)

    trad_efficiency = compute_efficiency_curve(traditional_batch_size)
    paged_efficiency = compute_efficiency_curve(paged_batch_size)

    # Throughput = batch_size × efficiency
    trad_throughput = traditional_batch_size * trad_efficiency
    paged_throughput = paged_batch_size * paged_efficiency

    return {
        "traditional": {
            "batch_size": traditional_batch_size,
            "compute_efficiency": trad_efficiency,
            "relative_throughput": trad_throughput,
        },
        "paged": {
            "batch_size": paged_batch_size,
            "compute_efficiency": paged_efficiency,
            "relative_throughput": paged_throughput,
        },
        "throughput_multiplier": paged_throughput / trad_throughput,
    }


throughput_results = estimate_throughput_improvement()
Out[31]:
Console
Throughput Improvement Estimate
=============================================
Traditional allocation:
  Batch size: 8
  GPU compute efficiency: 67%
  Relative throughput: 5.3

Paged allocation:
  Batch size: 40
  GPU compute efficiency: 92%
  Relative throughput: 36.9

Throughput improvement: 6.9x
Out[32]:
Visualization
Line plot showing throughput and GPU efficiency as functions of batch size, with markers indicating operating points for traditional and paged allocation strategies.
The relationship between batch size, GPU efficiency, and total throughput. Paged allocation allows for much larger batches (green point) than traditional methods (red point), achieving significantly higher aggregate throughput despite the diminishing marginal returns in per-sequence compute efficiency.

The throughput improvement demonstrates that memory efficiency directly translates to performance. By fitting more sequences into the GPU, we operate in a regime of higher compute intensity, extracting more value from the hardware.

Dynamic Batching Support

Paged memory enables continuous batching (which we'll explore in an upcoming chapter), where new requests can join an ongoing batch as old ones complete. With traditional contiguous allocation, adding a new request requires finding a large enough contiguous memory region. With paging, new requests simply acquire free blocks as needed:

In[33]:
Code
import numpy as np


def simulate_dynamic_batching(
    arrival_rate: float = 2.0,  # requests per time unit
    service_rate: float = 1.5,  # completions per time unit
    duration: int = 100,
    max_concurrent_traditional: int = 8,
    max_concurrent_paged: int = 40,
):
    """
    Simulate request handling with dynamic batching.
    """
    np.random.seed(42)

    results = {"traditional": [], "paged": []}

    for strategy, max_concurrent in [
        ("traditional", max_concurrent_traditional),
        ("paged", max_concurrent_paged),
    ]:
        active_requests = 0
        queued = 0
        total_processed = 0
        queue_times = []

        queue_depths = []

        for t in range(duration):
            # Arrivals (Poisson process)
            arrivals = np.random.poisson(arrival_rate)

            # Process completions
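            # Note: in this simplified model the aggregate completion rate
            # scales with the fraction of slots in use and caps at
            # `service_rate` once the batch is full, so both strategies share
            # the same service ceiling; only admission capacity differs.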
            if active_requests > 0:
                completions = min(
                    np.random.poisson(
                        service_rate * active_requests / max_concurrent
                    ),
                    active_requests,
                )
                active_requests -= completions
                total_processed += completions

            # Handle new arrivals
            for _ in range(arrivals):
                if active_requests < max_concurrent:
                    active_requests += 1
                else:
                    queued += 1

            # Process queue
            while queued > 0 and active_requests < max_concurrent:
                queued -= 1
                active_requests += 1
                queue_times.append(t)

            queue_depths.append(queued)

        results[strategy] = {
            "arrival_rate": arrival_rate,
            "service_rate": service_rate,
            "total_processed": total_processed,
            "avg_queue_depth": np.mean(queue_depths) if queue_depths else 0,
            "max_concurrent": max_concurrent,
        }

    return results


batching_results = simulate_dynamic_batching()
Out[34]:
Console
Dynamic Batching Simulation
=============================================
Request arrival rate: 2.0/time unit
Service rate: 1.5/time unit

Traditional:
  Max concurrent: 8
  Total processed: 143

Paged:
  Max concurrent: 40
  Total processed: 114

In this simulation, the paged allocator's advantage shows up in admission rather than in raw completions: with a concurrency ceiling of 40, it hands freed blocks to new arrivals immediately, while the traditional allocator fills its 8 contiguous slots almost at once and pushes every later request into a growing queue. Total completions are similar for the two strategies (the traditional run even finishes more) only because this simplified model caps the aggregate service rate regardless of batch size. In a real serving system, the larger batches that paging enables also complete more requests per unit time, as estimated above, so the reduced queueing translates directly into shorter wait times and higher sustained throughput.

Preemption and Priority Scheduling

With paged memory, preempting a low-priority request to serve a high-priority one becomes straightforward. The preempted request's blocks can either be swapped to CPU memory or freed outright, with the KV cache recomputed from the sequence's tokens when it resumes; a swapped request simply reloads its blocks into whatever physical slots are free and continues. Traditional contiguous allocation cannot support this easily, because a resumed request would need to find a contiguous region of exactly the right size again. The result is support for sophisticated scheduling policies such as priority-based preemption, sketched below.
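
The following toy example illustrates the bookkeeping involved, assuming a greatly simplified block manager. The class and method names (ToyBlockManager, swap_out, swap_in) are hypothetical and do not correspond to vLLM's internal API; a real implementation must also move the actual KV tensors between GPU and CPU memory.

Code
class ToyBlockManager:
    """Greatly simplified block manager used only to illustrate preemption.

    GPU memory is modeled as a pool of fixed-size blocks; preempting a
    sequence copies its block contents to a CPU-side store so the GPU
    blocks can be handed to another request.
    """

    def __init__(self, num_gpu_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_gpu_blocks = list(range(num_gpu_blocks))
        self.gpu_block_tables = {}  # seq_id -> list of physical GPU block ids
        self.cpu_store = {}  # seq_id -> saved block contents
        self.gpu_blocks = [None] * num_gpu_blocks  # stand-in for real KV tensors

    def allocate(self, seq_id: str, num_blocks: int) -> bool:
        """Grab free GPU blocks for a sequence; fail if not enough are free."""
        if len(self.free_gpu_blocks) < num_blocks:
            return False
        blocks = [self.free_gpu_blocks.pop() for _ in range(num_blocks)]
        self.gpu_block_tables[seq_id] = blocks
        for b in blocks:
            self.gpu_blocks[b] = f"kv-data-of-{seq_id}"
        return True

    def swap_out(self, seq_id: str) -> None:
        """Preempt: save the sequence's blocks to CPU, free the GPU blocks."""
        blocks = self.gpu_block_tables.pop(seq_id)
        self.cpu_store[seq_id] = [self.gpu_blocks[b] for b in blocks]
        for b in blocks:
            self.gpu_blocks[b] = None
            self.free_gpu_blocks.append(b)

    def swap_in(self, seq_id: str) -> bool:
        """Resume: copy saved contents into whatever GPU blocks are free now."""
        saved = self.cpu_store[seq_id]
        if len(self.free_gpu_blocks) < len(saved):
            return False
        blocks = [self.free_gpu_blocks.pop() for _ in range(len(saved))]
        for b, contents in zip(blocks, saved):
            self.gpu_blocks[b] = contents
        self.gpu_block_tables[seq_id] = blocks
        del self.cpu_store[seq_id]
        return True


# A high-priority request arrives while all GPU blocks are taken:
mgr = ToyBlockManager(num_gpu_blocks=4)
mgr.allocate("low-priority", 4)
assert not mgr.allocate("high-priority", 2)  # no free blocks left
mgr.swap_out("low-priority")                 # preempt the low-priority request
assert mgr.allocate("high-priority", 2)      # freed blocks are reused immediately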

Limitations and Practical Considerations

While PagedAttention represents a major advancement in LLM serving efficiency, it comes with trade-offs and limitations worth understanding.

The most significant overhead is the block table indirection. Every attention computation must look up physical block locations from the block table, adding memory access latency. For very short sequences, this overhead can outweigh the benefits of paging. In practice, the crossover point occurs around 50-100 tokens; shorter sequences may actually perform better with contiguous allocation. Most serving workloads involve longer sequences where the memory efficiency gains far exceed the indexing overhead.
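
The indirection itself is a single extra index translation per token, as the sketch below shows with a hypothetical block_table; the real cost comes from performing this gather for every key and value read inside the attention kernel.

Code
# Hypothetical block table: logical block index -> physical block index.
block_size = 16
block_table = [7, 2, 9]  # example mapping for one sequence


def physical_slot(token_pos: int) -> int:
    """Translate a token's logical position into a physical KV cache slot."""
    logical_block = token_pos // block_size
    offset = token_pos % block_size
    return block_table[logical_block] * block_size + offset


# Token 37 sits in logical block 2 at offset 5, which the table maps to
# physical block 9, so the slot is 9 * 16 + 5 = 149.
print(physical_slot(37))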

Block size selection presents a tuning challenge. Smaller blocks reduce internal fragmentation (waste from partially-filled blocks) but increase the block table size and lookup overhead. Larger blocks are more efficient for the attention computation but waste more memory on partial fills. The default block size of 16 tokens in vLLM represents a carefully chosen balance, but optimal values vary with workload characteristics. Workloads with highly variable sequence lengths benefit from smaller blocks, while uniform-length workloads can use larger blocks.
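
A quick back-of-the-envelope comparison makes the trade-off concrete. For an assumed 517-token sequence, internal fragmentation is confined to the final, partially filled block (about half a block per sequence on average), while the number of block table entries shrinks as blocks grow:

Code
# Block-size trade-off for an assumed 517-token sequence: smaller blocks
# waste fewer token slots in the final partial block but require more
# block table entries (and lookups).
seq_len = 517

for block_size in [8, 16, 32, 64, 128]:
    num_blocks = -(-seq_len // block_size)  # ceiling division
    wasted = num_blocks * block_size - seq_len  # internal fragmentation
    print(
        f"block_size={block_size:>3}: {num_blocks:>3} table entries, "
        f"{wasted:>3} wasted token slots"
    )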

Implementing PagedAttention is complex. Custom CUDA kernels must handle the non-contiguous memory access patterns efficiently, maintaining the performance characteristics of optimized attention implementations like FlashAttention while adding paging support. This is why you should use established frameworks like vLLM rather than implementing PagedAttention from scratch.
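
In practice this means relying on a serving framework. The sketch below shows roughly how a model is served with vLLM, which applies PagedAttention automatically; exact parameter names such as block_size and gpu_memory_utilization may vary between vLLM versions, and the model name is only a placeholder.

Code
# Sketch of serving a model with vLLM, which applies PagedAttention
# automatically. Parameter names may differ slightly between versions,
# and the model name below is only a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-7b-hf",  # any supported model
    gpu_memory_utilization=0.90,  # fraction of GPU memory for weights + KV blocks
    block_size=16,  # tokens per KV cache block
)

params = SamplingParams(temperature=0.8, max_tokens=256)
outputs = llm.generate(["Explain PagedAttention in one sentence."], params)
print(outputs[0].outputs[0].text)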

Finally, PagedAttention addresses memory capacity constraints but not the fundamental memory bandwidth limitations discussed in earlier chapters. The KV cache still requires significant memory bandwidth to read during attention computation. Techniques we'll explore in upcoming chapters, such as KV cache compression and quantization, complement PagedAttention by reducing the size of the data that must be transferred.

Summary

PagedAttention transforms LLM inference efficiency by applying virtual memory principles to KV cache management. The key insights and techniques include:

  • The fragmentation problem arises because traditional KV cache allocation reserves contiguous memory based on maximum sequence length. This causes internal fragmentation (unused space within allocations) and external fragmentation (scattered free blocks that cannot be combined). Together, these issues can waste 60-80% of available memory.

  • Page-based allocation divides memory into fixed-size blocks that can be assigned to any sequence. A block table maps each sequence's logical blocks to their physical locations. This approach eliminates external fragmentation entirely and reduces internal fragmentation to at most one partial block per sequence.

  • The PagedAttention algorithm modifies attention computation to work with non-contiguous KV storage. During attention, keys and values are gathered according to the block table. Optimized CUDA kernels perform this gathering efficiently using online softmax techniques to avoid explicit memory reordering.

  • Copy-on-write semantics enable memory-efficient beam search and other parallel decoding strategies. Sequences can share physical blocks for common prefixes, with new blocks allocated only when sequences diverge.

  • Practical benefits include 2-4x higher throughput from increased batch sizes, support for continuous batching with dynamic request handling, and preemption capabilities for priority scheduling. These improvements have made PagedAttention the foundation of modern LLM serving systems like vLLM, TensorRT-LLM, and others.

Quiz

Ready to test your understanding? Take this quick quiz to reinforce what you've learned about PagedAttention and memory management for LLM serving.

