Continuous Batching: Optimizing LLM Inference Throughput

Michael Brenndoerfer · January 18, 2026 · 42 min read

Discover how continuous batching achieves 2-3x throughput gains in LLM inference through iteration-level scheduling, eliminating static batch inefficiencies.

Continuous Batching

Large language model inference presents a unique challenge that sets it apart from nearly every other machine learning workload: requests arrive at different times, require different numbers of output tokens, and finish at unpredictable moments. Traditional batch processing, borrowed from training workflows where uniformity is the norm, handles this reality poorly. A batch must wait for its slowest member, leaving GPU resources idle while fast requests sit completed but unable to exit. The result is a system that wastes expensive compute cycles and frustrates users expecting snappy responses.

Continuous batching solves this fundamental mismatch by rethinking how we group work together. Rather than treating a batch as a monolithic unit processed as a single group, continuous batching treats each decoding iteration as an opportunity to reshuffle which requests are actively using compute. Completed requests exit immediately, and waiting requests join without delay.

This chapter explores why static batching creates inefficiency, how continuous batching addresses it through iteration-level scheduling, and the mechanisms for handling request completion. The throughput gains from this approach are substantial, often doubling or tripling the requests per second a single GPU can handle. Understanding continuous batching is essential when deploying language models in production, as it has become the standard technique in modern serving frameworks.

The Batching Problem in LLM Inference

Batching multiple requests together is essential for efficient GPU utilization. Modern GPUs contain thousands of processing cores designed to execute operations in parallel, and feeding them work one request at a time leaves most of this capability unused. As we discussed in the KV Cache chapter, autoregressive generation produces one token per forward pass. Each forward pass involves substantial overhead: loading model weights from memory, executing attention computations, and applying the feed-forward network. Without batching, each forward pass processes a single request, leaving most of the GPU's parallel processing capability unused. Matrix multiplications that could operate on batch sizes of 32 or 64 instead operate on a single row, significantly underutilizing the thousands of available cores.
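
To see why small batches underuse the hardware, consider the arithmetic intensity of a single weight-matrix multiply during decoding. The short sketch below uses an assumed hidden size of 4096 and fp16 weights (illustrative numbers, not tied to any specific model) to count how many floating-point operations each loaded weight byte supports at different batch sizes.

Code
# Rough arithmetic-intensity estimate for one decode-step matmul of shape
# (batch, hidden) @ (hidden, hidden). Hidden size and fp16 weights are
# illustrative assumptions, not taken from a specific model.
hidden = 4096
bytes_per_weight = 2  # fp16

weight_bytes = hidden * hidden * bytes_per_weight  # bytes read from memory

for batch in [1, 8, 32, 64]:
    flops = 2 * batch * hidden * hidden  # multiply-adds in the matmul
    print(f"batch={batch:3d}: {flops / weight_bytes:5.1f} FLOPs per weight byte")

The ratio grows linearly with batch size, which is why grouping requests is the first lever for keeping the GPU busy: a modern accelerator needs on the order of a hundred or more FLOPs per byte of memory traffic before compute, rather than memory bandwidth, becomes the limiting factor.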

Training workflows batch naturally because all samples pass through the same computation graph with identical sequence lengths (after padding). Every example in a training batch starts together, processes together, and finishes together. This uniformity makes scheduling trivial. Inference is different in three critical ways that break this simple model:

  • Variable input lengths: Requests arrive with prompts of different sizes. One user might submit a 10-token question while another provides a 2000-token document for summarization.
  • Variable output lengths: Some responses are 10 tokens, others are 1000. A simple factual answer completes quickly, while a detailed explanation continues for many iterations.
  • Asynchronous arrivals: Requests come in continuously, not in neat batches. You submit queries at unpredictable times, and the system cannot wait indefinitely to form perfect batches.

These differences create a fundamental tension between batching efficiency and response latency. We want large batches to maximize GPU utilization, but we also want to return responses quickly without waiting for other requests.

Static Batching

The simplest approach to batching inference requests is static batching, where we collect a fixed group of requests, process them together until all complete, and then start the next batch. This approach mirrors how training works and requires minimal changes to existing infrastructure.

How Static Batching Works

In static batching, the system follows a straightforward process that prioritizes simplicity over flexibility. The core idea is to treat a batch as an indivisible unit that processes from start to finish without any changes to its composition. This makes implementation easy but creates significant inefficiencies when requests have different characteristics.

The static batching algorithm proceeds through five distinct phases:

  1. Accumulate requests. Wait until a batch of $B$ requests arrives or a timeout expires. During this phase, incoming requests queue up while the system decides whether enough work has accumulated to justify starting a new batch.
  2. Pad to uniform length. Pad all input sequences to match the longest input in the batch. Since matrix operations require uniform dimensions, shorter sequences receive padding tokens that consume compute but produce no useful output.
  3. Process together. Run all requests through every decoding iteration. The entire batch moves in lockstep, with each request generating one token per forward pass.
  4. Wait for completion. Continue until the longest output sequence finishes. Even if some requests complete after just a few tokens, they remain in the batch, consuming resources while producing only padding.
  5. Return results. Send all responses back and start the next batch. Only after every request has finished can the system collect a new batch and begin again.
In[2]:
Code
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    """Simulated inference request."""

    id: int
    input_length: int
    output_length: int  # How many tokens this request will generate


def simulate_static_batch(requests: List[Request]) -> dict:
    """Simulate static batching behavior."""
    if not requests:
        return {
            "total_iterations": 0,
            "wasted_iterations": 0,
            "max_output": 0,
            "useful_iterations": 0,
            "efficiency": 0,
        }

    # All requests must wait for the longest output
    max_output = max(r.output_length for r in requests)

    # Calculate wasted computation
    total_iterations = len(requests) * max_output
    useful_iterations = sum(r.output_length for r in requests)
    wasted_iterations = total_iterations - useful_iterations

    return {
        "max_output": max_output,
        "total_iterations": total_iterations,
        "useful_iterations": useful_iterations,
        "wasted_iterations": wasted_iterations,
        "efficiency": useful_iterations / total_iterations,
    }

Let's see how static batching performs with requests of varying output lengths. This simulation demonstrates the core problem: when requests differ significantly in how many tokens they need to generate, the short requests end up waiting for the long ones.

In[3]:
Code
import numpy as np

# Create a batch with highly variable output lengths
np.random.seed(42)
static_batch = [
    Request(id=i, input_length=50, output_length=np.random.randint(10, 200))
    for i in range(8)
]

# Show the output lengths
output_lengths = [r.output_length for r in static_batch]
Out[4]:
Console
Static Batch Request Output Lengths:
  Request 0: 112 tokens ██████████████████████
  Request 1: 189 tokens █████████████████████████████████████
  Request 2: 102 tokens ████████████████████
  Request 3:  24 tokens ████
  Request 4: 116 tokens ███████████████████████
  Request 5:  81 tokens ████████████████
  Request 6: 198 tokens ███████████████████████████████████████
  Request 7:  30 tokens ██████
In[5]:
Code
static_results = simulate_static_batch(static_batch)
Out[6]:
Console

Static Batching Analysis:
  Maximum output length: 198 tokens
  Total iterations computed: 1584
  Useful iterations: 852
  Wasted iterations: 732
  Compute efficiency: 53.8%

The Efficiency Problem

The analysis reveals the core problem with static batching: requests that finish early continue consuming compute resources. In our example, some requests needed only a few dozen tokens while the batch waited for requests generating nearly 200 tokens. Those short requests sat idle for over 150 iterations, tying up computational resources that could have been serving other requests.

To understand why this matters, consider what happens inside the GPU during those wasted iterations. The completed request still occupies a slot in every tensor operation. The attention mechanism still computes over its position. The feed-forward network still processes its hidden state. The output is discarded as padding, but the compute is very real. This is not idle waiting; it is inefficient computation.

This inefficiency compounds across several dimensions:

  • Memory waste. Completed requests retain their KV cache entries until the batch finishes. As we covered in the KV Cache Memory chapter, each request's cache can consume gigabytes of GPU memory. A 7B parameter model might need 1-2 GB per request at full context length. Keeping completed requests' caches prevents new requests from starting, creating a bottleneck even when compute capacity exists.
  • Latency inflation. Short requests experience artificially high latency because they must wait for longer siblings in their batch. A 20-token response that takes 100ms to generate might not be returned for 1000ms while waiting for a 200-token sibling. From your perspective, this feels like unexplained slowness, damaging your experience even though the system technically completed your request quickly.
  • GPU underutilization. As requests complete, effective batch size shrinks. A batch starting with 8 requests might have only 2 active requests for its final iterations, dramatically reducing throughput. The GPU cores that could be serving new requests instead sit mostly idle, processing padding tokens for the few remaining active sequences.
Out[7]:
Visualization
Static batching execution timeline showing active and wasted computation. Short requests finish early but must wait for the longest request to complete, resulting in significant idle time (gray) where the GPU processes padding tokens instead of useful data.

The gray regions in this visualization represent wasted computation: the GPU performs forward passes that produce padding tokens, while the finished requests cannot release their resources. The shorter the original request, the larger the fraction of its total batch time spent waiting.
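
To put the memory-waste point in perspective, the quick estimate below computes the KV cache footprint of a single request, assuming Llama-7B-style dimensions (32 layers, a 4096-dimensional hidden state, fp16 values); treat the exact figures as an illustration rather than a measurement.

Code
# Back-of-the-envelope KV cache size per request, assuming Llama-7B-like
# dimensions. The numbers are illustrative, not measured.
num_layers = 32
hidden_size = 4096   # width of the key and value vectors per layer
bytes_per_value = 2  # fp16

# Keys and values are both cached for every layer
kv_bytes_per_token = 2 * num_layers * hidden_size * bytes_per_value

for context_length in [512, 2048, 4096]:
    total_gb = kv_bytes_per_token * context_length / 1e9
    print(f"{context_length:5d} tokens -> {total_gb:.2f} GB of KV cache")

At roughly half a megabyte of cache per token, a few finished requests held at full context length can tie up gigabytes of GPU memory that waiting requests cannot use.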

Continuous Batching

Continuous batching, also called iteration-level scheduling or in-flight batching, takes a fundamentally different approach to managing concurrent requests. Instead of treating a batch as a fixed unit that lives and dies together, it treats each decoding iteration as an independent scheduling decision. This seemingly simple change significantly improves both throughput and latency.

The Core Insight

The key insight behind continuous batching emerges from a careful examination of how autoregressive generation actually works. Each decoding step produces exactly one token per request. Between any two decoding iterations, the model completes one forward pass and begins the next. At this boundary, we have complete freedom to reorganize which requests participate in the next iteration. The model does not care whether the same requests continue; it only needs valid input tokens and their associated KV caches.

This observation reveals that the rigid batch structure of static batching is entirely artificial. There is no technical requirement that forces us to keep completed requests around. We imposed that constraint for implementation simplicity, not because the underlying computation demands it.

With this insight, continuous batching introduces three powerful capabilities:

  • Remove completed requests. Immediately release slots when requests hit their end token. The moment a request generates its end-of-sequence marker, it exits the batch and its resources become available.
  • Add new requests. Insert waiting requests into freed slots. New requests join the batch at whatever iteration happens to have space, rather than waiting for a complete batch turnover.
  • Maintain continuous flow. Keep the batch saturated with active work. The goal is to maintain maximum occupancy at all times, treating the batch like a continuously flowing river rather than a series of discrete buckets.

This transforms batch processing from a rigid structure into a dynamic queue where requests flow through the system at their natural pace. Short requests complete quickly and exit. Long requests continue as needed. New requests enter as soon as space permits.

How Continuous Batching Works

The continuous batching algorithm operates in a perpetual loop that never truly stops between batches. Instead of collecting requests, processing them, and collecting again, it maintains a persistent set of active requests that evolves iteration by iteration. Each cycle through the loop represents a single forward pass through the model.

The algorithm follows these steps in each iteration:

  1. Check for completions. After each iteration, identify requests that generated an end-of-sequence token or reached their maximum length. These requests have finished their work and no longer need to participate.
  2. Evict completed requests. Remove finished requests from the active batch, freeing their slots and KV cache memory. This happens immediately, not at some future batch boundary.
  3. Check waiting queue. Look for pending requests that arrived while the batch was processing. New requests accumulate in a queue whenever all slots are occupied.
  4. Insert new requests. Fill empty slots with waiting requests, running their prefill phase to build initial KV caches. These new requests join alongside existing decode phase requests.
  5. Execute iteration. Run one decoding step for all active requests. Every active request generates exactly one token.
  6. Return completed results. Send finished responses back to clients immediately. You receive your results the moment generation completes, not when some arbitrary batch finishes.
In[8]:
Code
from collections import deque


@dataclass
class ContinuousRequest:
    """Request with tracking for continuous batching."""

    id: int
    input_length: int
    output_length: int  # Total tokens to generate
    tokens_generated: int = 0
    start_iteration: int = 0
    end_iteration: int = 0

    @property
    def is_complete(self) -> bool:
        return self.tokens_generated >= self.output_length


class ContinuousBatchSimulator:
    """Simulates continuous batching behavior."""

    def __init__(self, max_batch_size: int = 8):
        self.max_batch_size = max_batch_size
        self.active_batch: List[ContinuousRequest] = []
        self.waiting_queue: deque = deque()
        self.completed: List[ContinuousRequest] = []
        self.iteration = 0
        self.utilization_history = []

    def submit_request(self, request: ContinuousRequest):
        """Add a request to the waiting queue."""
        self.waiting_queue.append(request)

    def _fill_batch(self):
        """Fill empty slots with waiting requests."""
        while (
            len(self.active_batch) < self.max_batch_size and self.waiting_queue
        ):
            request = self.waiting_queue.popleft()
            request.start_iteration = self.iteration
            self.active_batch.append(request)

    def _process_completions(self):
        """Remove completed requests from the batch."""
        still_active = []
        for request in self.active_batch:
            if request.is_complete:
                request.end_iteration = self.iteration
                self.completed.append(request)
            else:
                still_active.append(request)
        self.active_batch = still_active

    def step(self):
        """Execute one iteration of the continuous batch."""
        # First, fill any empty slots
        self._fill_batch()

        if not self.active_batch:
            return False  # Nothing to process

        # Record utilization
        self.utilization_history.append(
            len(self.active_batch) / self.max_batch_size
        )

        # Generate one token for each active request
        for request in self.active_batch:
            request.tokens_generated += 1

        self.iteration += 1

        # Check for completions
        self._process_completions()

        return True

    def run_until_complete(self):
        """Run until all requests are processed."""
        while self.active_batch or self.waiting_queue:
            self.step()

Let's simulate the same requests with continuous batching to see the difference in practice:

In[9]:
Code
# Convert our static batch to continuous batch format
continuous_requests = [
    ContinuousRequest(
        id=r.id, input_length=r.input_length, output_length=r.output_length
    )
    for r in static_batch
]

# Run continuous batching simulation
simulator = ContinuousBatchSimulator(max_batch_size=8)
for request in continuous_requests:
    simulator.submit_request(request)

simulator.run_until_complete()
Out[10]:
Console
Continuous Batching Results:
  Total iterations: 198
  Average utilization: 53.8%
  Theoretical minimum iterations: 106
Out[11]:
Visualization
Continuous batching execution timeline. Requests are scheduled dynamically, with completed sequences exiting immediately to free slots for waiting requests. This packing strategy maintains high GPU utilization by eliminating the waiting periods seen in static batching.

With only these eight requests and no waiting queue behind them, the simulation still takes 198 iterations at 53.8% average utilization, matching static batching: slots freed by early completions simply sit empty because there is nothing to put in them. The theoretical minimum of 106 iterations (total tokens divided by the batch size) is only approached when new requests keep arriving to fill vacated slots. That steady stream of arrivals is exactly the scenario continuous batching is designed for, and it is what the larger workload simulation later in this chapter demonstrates.

Iteration-Level Scheduling

The power of continuous batching comes from making scheduling decisions at every iteration. Rather than committing to a fixed plan when a batch starts, the system reevaluates its choices thousands of times per second. This fine-grained control enables sophisticated policies beyond simple FIFO (first-in-first-out) ordering and allows the system to adapt dynamically to changing conditions.

Scheduling Decisions

At each iteration boundary, the scheduler faces several interconnected decisions that determine both throughput and fairness. These decisions happen in microseconds and must balance competing objectives.

Which requests to evict: Besides naturally completed requests, the scheduler might preempt requests to free memory for higher-priority work. This creates interesting tradeoffs between fairness and throughput. A request that has already consumed significant resources represents sunk cost, but continuing it also delays new requests. The scheduler must weigh these factors according to the system's priorities.

Which waiting requests to admit: When multiple requests are waiting, the scheduler chooses which to start first. This decision significantly impacts your perceived latency and overall system efficiency. Options include:

  • FIFO. First-come, first-served, which is simple and fair. Every request eventually gets its turn in the order it arrived. This approach provides predictable behavior and prevents starvation.
  • Shortest Job First. Prioritize requests with shorter expected outputs to minimize average latency. Short requests complete quickly, improving your experience. However, this requires predicting output length, which is not always possible.
  • Priority-based. Honor explicit priority levels attached to requests. Interactive users or time-sensitive applications might receive preferential treatment. This approach requires clear policies about what constitutes priority.

How to handle memory pressure: When the KV cache approaches capacity, the scheduler must decide whether to wait for completions or preempt active requests. Waiting reduces throughput but preserves work in progress. Preemption frees resources immediately but wastes the computation already invested in the preempted request.
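
To make the admission policies above concrete, here is a minimal sketch showing how the same waiting queue can be ordered under FIFO, shortest-job-first, and priority scheduling. The class name, the estimated-length field, and the priority field are illustrative assumptions rather than the API of any particular framework.

Code
from dataclasses import dataclass
from typing import List


@dataclass
class WaitingRequest:
    """Hypothetical queued request carrying the fields the policies need."""

    id: int
    arrival_time: float
    estimated_output_length: int  # usable only if a length predictor exists
    priority: int = 0             # higher value means more important


def order_queue(queue: List[WaitingRequest], policy: str) -> List[WaitingRequest]:
    """Return the queue in the order requests would be admitted."""
    if policy == "fifo":
        return sorted(queue, key=lambda r: r.arrival_time)
    if policy == "shortest_job_first":
        return sorted(queue, key=lambda r: r.estimated_output_length)
    if policy == "priority":
        # Higher priority first, FIFO within the same priority level
        return sorted(queue, key=lambda r: (-r.priority, r.arrival_time))
    raise ValueError(f"unknown policy: {policy}")


queue = [
    WaitingRequest(id=1, arrival_time=0.0, estimated_output_length=300),
    WaitingRequest(id=2, arrival_time=0.5, estimated_output_length=20, priority=1),
    WaitingRequest(id=3, arrival_time=1.0, estimated_output_length=80),
]

for policy in ["fifo", "shortest_job_first", "priority"]:
    print(policy, "->", [r.id for r in order_queue(queue, policy)])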

Prefill and Decode Phases

Continuous batching must handle requests in different phases of their lifecycle. As we discussed in the KV Cache chapter, generation has two distinct phases with fundamentally different computational characteristics.

  • Prefill phase. Process all input tokens in parallel to build the initial KV cache. This phase is compute-bound and benefits from large batch sizes of tokens. The GPU performs dense matrix multiplications over potentially thousands of input tokens, fully utilizing its computational capacity. Memory bandwidth is not the bottleneck during prefill because the computation is so intensive.
  • Decode phase. Generate tokens one at a time, reading from the KV cache. This phase is memory-bandwidth-bound and benefits from large batch sizes of requests. Each decode step reads the entire KV cache but performs relatively little computation per byte loaded. The GPU spends most of its time waiting for memory, not computing.

These phases have different computational characteristics that create challenges when mixing them in the same batch. Prefill processes many tokens per request; decode processes one token per request. Mixing them creates load imbalance: prefill requests demand heavy compute while decode requests sit largely idle, waiting for memory operations to complete.

In[12]:
Code
@dataclass
class PhasedRequest:
    """Request tracking both prefill and decode phases."""

    id: int
    input_length: int
    output_length: int
    phase: str = "prefill"  # "prefill" or "decode"
    tokens_generated: int = 0
    prefill_complete: bool = False


class PhasedBatchSimulator:
    """Simulator showing prefill/decode phase handling."""

    def __init__(self, max_batch_size: int = 8):
        self.max_batch_size = max_batch_size
        self.active_batch: List[PhasedRequest] = []
        self.waiting_queue: deque = deque()
        self.completed: List[PhasedRequest] = []
        self.iteration = 0

        # Track phase composition over time
        self.phase_history = []

    def submit_request(self, request: PhasedRequest):
        self.waiting_queue.append(request)

    def _fill_batch(self):
        """Add waiting requests to available slots."""
        while (
            len(self.active_batch) < self.max_batch_size and self.waiting_queue
        ):
            request = self.waiting_queue.popleft()
            request.phase = "prefill"
            self.active_batch.append(request)

    def step(self):
        """Execute one iteration."""
        self._fill_batch()

        if not self.active_batch:
            return False

        # Count phases
        prefill_count = sum(
            1 for r in self.active_batch if r.phase == "prefill"
        )
        decode_count = len(self.active_batch) - prefill_count
        self.phase_history.append(
            {
                "iteration": self.iteration,
                "prefill": prefill_count,
                "decode": decode_count,
            }
        )

        # Process each request based on phase
        still_active = []
        for request in self.active_batch:
            if request.phase == "prefill":
                # Prefill completes in one iteration (simplified)
                request.prefill_complete = True
                request.phase = "decode"
                still_active.append(request)
            else:  # decode phase
                request.tokens_generated += 1
                if request.tokens_generated >= request.output_length:
                    self.completed.append(request)
                else:
                    still_active.append(request)

        self.active_batch = still_active
        self.iteration += 1
        return True

    def run_until_complete(self):
        while self.active_batch or self.waiting_queue:
            self.step()

Chunked Prefill

Modern continuous batching systems often use chunked prefill to better balance the two phases and smooth out the computational load. Instead of processing the entire prompt in one iteration, which can cause latency spikes when long prompts arrive, long prompts are split into manageable chunks.

The chunking strategy divides a long prompt into pieces that can be processed incrementally. This allows decode iterations for other requests to proceed between prefill chunks, keeping the system responsive. A request with a 3000-token prompt might be split into six chunks of roughly 500 tokens each, with each chunk processed in a separate iteration.

In[13]:
Code
def chunk_prefill_tokens(input_length: int, chunk_size: int = 512) -> List[int]:
    """Split a prompt into prefill chunks."""
    chunks = []
    remaining = input_length
    while remaining > 0:
        chunk = min(remaining, chunk_size)
        chunks.append(chunk)
        remaining -= chunk
    return chunks
In[14]:
Code
# Example chunking
chunk_size = 512
example_lengths = [128, 512, 1500, 3000]
chunking_results = []

for length in example_lengths:
    chunks = chunk_prefill_tokens(length, chunk_size)
    chunking_results.append((length, chunks))
Out[15]:
Console
Prefill Chunking Examples (chunk_size=512):
  128 tokens -> 1 chunks: [128]
  512 tokens -> 1 chunks: [512]
  1500 tokens -> 3 chunks: [512, 512, 476]
  3000 tokens -> 6 chunks: [512, 512, 512, 512, 512, 440]
Out[16]:
Visualization
Interleaved execution of prefill chunks and decode steps. A long prompt (Request C) is processed in multiple prefill chunks (orange), allowing decode steps for existing requests A and B (blue) to occur in between. This time-slicing prevents the system from freezing during heavy prefill operations.

Chunked prefill enables interleaving decode tokens between prefill chunks, reducing latency spikes when long prompts arrive. Without chunking, a 3000-token prompt would monopolize the GPU for a significant duration, blocking all decode operations for other requests. With chunking, decode operations proceed smoothly while the long prompt is processed incrementally.
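
The interleaving in the figure can be sketched directly. The toy loop below, which reuses chunk_prefill_tokens from above, simply alternates one prefill chunk of a newly arrived long prompt with a decode step for each request already in flight; a real scheduler would weigh token budgets per iteration, but the alternating shape is the same.

Code
def interleave_schedule(prefill_chunks: List[int], decode_requests: List[str]) -> List[str]:
    """Alternate one prefill chunk with one decode step per in-flight request."""
    schedule = []
    for chunk_tokens in prefill_chunks:
        schedule.append(f"prefill chunk ({chunk_tokens} tokens)")
        for name in decode_requests:
            schedule.append(f"decode step for {name}")
    return schedule


# Request C arrives with a 1500-token prompt while A and B are mid-decode
chunks = chunk_prefill_tokens(1500, chunk_size=512)
for step in interleave_schedule(chunks, ["request A", "request B"]):
    print(" ", step)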

Request Completion Handling

Handling completed requests efficiently is crucial for continuous batching performance. The system must quickly detect completions, release resources, and fill the vacated slots. Any delay in this process reduces the benefits of continuous batching by leaving slots empty when work is available.

Completion Detection

A request completes when it generates a special end-of-sequence token or reaches a maximum length limit. The completion check happens after each decode step and must be fast enough to avoid becoming a bottleneck. In practice, this means simple integer comparisons that execute in nanoseconds.

The end-of-sequence token is a special marker that the model has learned to generate when it believes the response is complete. Different models use different token IDs for this purpose, but the concept is universal. The maximum length limit serves as a safety valve, preventing runaway generation that would consume resources indefinitely.

In[17]:
Code
def check_completion(
    generated_token: int,
    tokens_generated: int,
    max_length: int,
    eos_token_id: int = 2,
) -> tuple[bool, str]:
    """
    Check if a request has completed.

    Returns:
        (is_complete, reason)
    """
    if generated_token == eos_token_id:
        return True, "eos_token"
    if tokens_generated >= max_length:
        return True, "max_length"
    return False, "ongoing"

Resource Release

When a request completes, the system must release several resources promptly to make room for new work. Each resource type has its own release mechanism and timing considerations.

  • KV cache memory. The cached key-value tensors for all layers must be freed. With paged attention, as discussed in the Paged Attention chapter, this means returning memory blocks to the free pool. The blocks become immediately available for allocation to new requests, maximizing memory utilization.
  • Batch slot. The position in the batch tensor becomes available for a new request. This slot represents one concurrent generation stream, and freeing it allows the waiting queue to advance.
  • Request metadata. Tracking structures for the completed request can be cleaned up. This includes timing information, token histories, and any other per-request state maintained by the scheduler.
In[18]:
Code
@dataclass
class SlotState:
    """Tracks the state of a batch slot."""

    slot_id: int
    request_id: int | None = None
    is_free: bool = True
    kv_blocks_allocated: int = 0


class SlotManager:
    """Manages batch slots for continuous batching."""

    def __init__(self, num_slots: int, blocks_per_slot: int = 16):
        self.slots = [SlotState(slot_id=i) for i in range(num_slots)]
        self.blocks_per_slot = blocks_per_slot
        self.total_blocks = num_slots * blocks_per_slot
        self.free_blocks = self.total_blocks

    def allocate_slot(self, request_id: int, num_blocks: int) -> int | None:
        """Allocate a slot for a new request."""
        if num_blocks > self.free_blocks:
            return None  # Not enough memory

        for slot in self.slots:
            if slot.is_free:
                slot.is_free = False
                slot.request_id = request_id
                slot.kv_blocks_allocated = num_blocks
                self.free_blocks -= num_blocks
                return slot.slot_id

        return None  # No free slots

    def release_slot(self, slot_id: int) -> int:
        """Release a slot and return freed blocks."""
        slot = self.slots[slot_id]
        freed_blocks = slot.kv_blocks_allocated

        slot.is_free = True
        slot.request_id = None
        slot.kv_blocks_allocated = 0
        self.free_blocks += freed_blocks

        return freed_blocks

    def get_active_count(self) -> int:
        return sum(1 for s in self.slots if not s.is_free)
In[19]:
Code
# Demonstrate slot management
manager = SlotManager(num_slots=4, blocks_per_slot=16)

# Define request requirements
req1_id, req1_blocks = 100, 8
req2_id, req2_blocks = 101, 12
req3_id, req3_blocks = 102, 6

# Allocate some slots
slot1 = manager.allocate_slot(request_id=req1_id, num_blocks=req1_blocks)
slot2 = manager.allocate_slot(request_id=req2_id, num_blocks=req2_blocks)
slot3 = manager.allocate_slot(request_id=req3_id, num_blocks=req3_blocks)
Out[20]:
Console
Slot Management Demo:
  Initial free blocks: 64
  Allocated slot 0 for request 100 (8 blocks)
  Allocated slot 1 for request 101 (12 blocks)
  Allocated slot 2 for request 102 (6 blocks)
  Free blocks remaining: 38
  Active requests: 3

The manager successfully allocates blocks for three requests, decrementing the free pool accordingly. This tracking ensures we never overcommit the available KV cache memory, which would lead to out-of-memory errors during generation.

In[21]:
Code
# Complete request 101 and free its resources
freed = manager.release_slot(slot2)
Out[22]:
Console

Released slot 1 (request 101)
  Freed 12 blocks
  Free blocks now: 50
  Active requests: 2

Releasing the slot immediately returns blocks to the free pool, making them available for waiting requests in the very next iteration. This rapid turnover is what enables continuous batching to maintain high utilization.

Immediate Response Delivery

Unlike static batching, continuous batching can return completed responses immediately. A request does not wait for the others in its batch; its result goes back the moment generation finishes. This eliminates the latency inflation common in static batching and makes the system feel far more responsive.

The immediate delivery mechanism requires careful coordination between the generation loop and the response handling system. Completed results must be extracted and transmitted without blocking the main processing loop, often using asynchronous I/O or separate response threads.
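
One way to decouple delivery from the generation loop is an asynchronous queue: the loop pushes finished results as they appear, and a separate task drains the queue and sends each response without blocking further iterations. The sketch below is a toy illustration of that pattern using Python's asyncio, not the mechanism of any particular serving framework.

Code
import asyncio


async def delivery_worker(results: asyncio.Queue) -> None:
    """Send each completed response as soon as it is available."""
    while True:
        item = await results.get()
        if item is None:  # sentinel: generation has finished
            break
        request_id, text = item
        print(f"delivered response for request {request_id}: {text!r}")


async def generation_loop(results: asyncio.Queue) -> None:
    """Stand-in for the decode loop; requests finish at different iterations."""
    finish_at = {1: 2, 2: 5, 3: 3}  # request id -> iteration it completes
    for iteration in range(1, 6):
        await asyncio.sleep(0)  # yield control, as a real async loop would
        for request_id, done_iteration in finish_at.items():
            if done_iteration == iteration:
                await results.put((request_id, f"output after {iteration} steps"))
    await results.put(None)


async def main() -> None:
    results: asyncio.Queue = asyncio.Queue()
    await asyncio.gather(delivery_worker(results), generation_loop(results))


asyncio.run(main())  # in a notebook cell, use: await main()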

In[23]:
Code
@dataclass
class TimedRequest:
    """Request with timing information."""

    id: int
    output_length: int
    submit_time: float = 0.0
    complete_time: float = 0.0

    @property
    def latency(self) -> float:
        return self.complete_time - self.submit_time


class LatencyTracker:
    """Tracks per-request latencies."""

    def __init__(self):
        self.requests: dict[int, TimedRequest] = {}

    def submit(self, request: TimedRequest, current_time: float):
        request.submit_time = current_time
        self.requests[request.id] = request

    def complete(self, request_id: int, current_time: float):
        if request_id in self.requests:
            self.requests[request_id].complete_time = current_time

    def get_latencies(self) -> List[float]:
        return [
            r.latency for r in self.requests.values() if r.complete_time > 0
        ]

Throughput Analysis

Let's compare the throughput of static and continuous batching quantitatively. Understanding the mathematical relationship between these approaches helps explain why continuous batching delivers such significant improvements and under what conditions those improvements are largest.

Theoretical Analysis

To understand the efficiency difference mathematically, we need to formalize what happens in each batching strategy. The key insight is that static batching allocates computational capacity based on the worst case, while continuous batching adapts to the actual workload.

For static batching with a fixed batch size $B$ containing requests with output lengths $L_1, L_2, \ldots, L_B$, the total computational time is determined by the longest request. Every request in the batch must wait for this slowest member, regardless of whether it has long since completed. The total computational capacity allocated, measured in token-slots (one slot processing one token), is:

$$C_{\text{static}} = B \times \max(L_1, L_2, \ldots, L_B)$$

where:

  • $C_{\text{static}}$: the total number of processing slots consumed (iterations $\times$ batch size)
  • $B$: the fixed batch size
  • $L_i$: the output length of the $i$-th request
  • $\max(L_1, \ldots, L_B)$: the length of the longest sequence in the batch

This formula captures the fundamental problem: we allocate capacity based on the maximum, not the sum.

We can calculate the compute efficiency $\eta_{\text{static}}$ as the ratio of useful work to total capacity. Useful work is the actual tokens generated, while total capacity is the processing slots allocated:

$$\eta_{\text{static}} = \frac{\sum_{i=1}^{B} L_i}{B \times \max_{j} L_j}$$

where:

  • $\eta_{\text{static}}$: the computational efficiency (between 0 and 1)
  • $\sum_{i=1}^{B} L_i$: the total number of tokens actually generated (useful work)
  • $B \times \max_{j} L_j$: the total capacity allocated during the batch's lifetime

Notice that efficiency depends heavily on the distribution of output lengths. If all requests have the same length, the numerator equals the denominator and efficiency is 100%. But if one request is much longer than the others, efficiency drops dramatically. For example, if seven requests each need 20 tokens and one request needs 200 tokens, efficiency is only $(7 \times 20 + 200) / (8 \times 200) = 340/1600 \approx 21\%$.

For continuous batching, the situation is fundamentally different. Assuming the request queue remains populated (perfect slot filling), the system avoids waiting for long requests. When a request completes, its slot immediately fills with a new request. The total iterations required approach the theoretical minimum:

$$T_{\text{continuous}} \approx \frac{\sum_{i=1}^{N} L_i}{B}$$

where:

  • $T_{\text{continuous}}$: the approximate total number of iterations needed
  • $N$: the total number of requests in the workload
  • $L_i$: the output length of the $i$-th request
  • $B$: the maximum batch size (concurrency limit)
  • $\sum_{i=1}^{N} L_i$: the total number of tokens to be generated across all requests

This formula shows that continuous batching distributes work evenly across the available batch slots. The total iteration count is simply the total number of tokens divided by the parallelism factor.

We can derive the efficiency $\eta_{\text{continuous}}$ by comparing useful work to allocated capacity:

$$\begin{aligned} \eta_{\text{continuous}} &= \frac{\text{Useful Work}}{\text{Total Capacity}} \\ &= \frac{\sum_{i=1}^{N} L_i}{T_{\text{continuous}} \times B} \\ &\approx \frac{\sum_{i=1}^{N} L_i}{\left( \frac{\sum_{i=1}^{N} L_i}{B} \right) \times B} \\ &= 1 \end{aligned}$$

This demonstrates that efficiency approaches 100% because the total capacity used matches the useful work. Every slot-iteration produces a meaningful output token rather than padding. The approximation becomes exact when the queue is always full and requests are perfectly divisible across slots.

Empirical Comparison

Let's run a more comprehensive simulation comparing the two approaches across a realistic workload. Theory tells us continuous batching should win, but simulation reveals the magnitude of improvement and how it varies with workload characteristics.

In[24]:
Code
def simulate_workload(
    num_requests: int, batch_size: int, output_lengths: List[int], method: str
) -> dict:
    """Simulate a workload with either static or continuous batching."""

    if method == "static":
        # Process in fixed batches
        total_iterations = 0
        total_latency = 0

        for batch_start in range(0, num_requests, batch_size):
            batch_end = min(batch_start + batch_size, num_requests)
            batch_lengths = output_lengths[batch_start:batch_end]
            max_len = max(batch_lengths)

            total_iterations += max_len
            # All requests in batch have same completion time
            for length in batch_lengths:
                total_latency += max_len  # Wait for slowest

        return {
            "total_iterations": total_iterations,
            "average_latency": total_latency / num_requests,
            "throughput": num_requests / total_iterations,
        }

    else:  # continuous
        # Simulate continuous batching
        remaining = list(output_lengths)
        active = []
        waiting_idx = 0
        iteration = 0
        completion_times = []

        while active or waiting_idx < len(remaining):
            # Fill batch with waiting requests
            while len(active) < batch_size and waiting_idx < len(remaining):
                active.append(
                    {
                        "idx": waiting_idx,
                        "remaining": remaining[waiting_idx],
                        "start": iteration,
                    }
                )
                waiting_idx += 1

            if not active:
                break

            iteration += 1

            # Decrement all active and check completions
            still_active = []
            for req in active:
                req["remaining"] -= 1
                if req["remaining"] <= 0:
                    completion_times.append(iteration - req["start"])
                else:
                    still_active.append(req)
            active = still_active

        return {
            "total_iterations": iteration,
            "average_latency": np.mean(completion_times),
            "throughput": num_requests / iteration,
        }
In[25]:
Code
# Generate a realistic workload with variable output lengths
np.random.seed(123)
num_requests = 100
batch_size = 8

# Output lengths following a log-normal distribution (common in real workloads)
output_lengths = np.random.lognormal(
    mean=4.0, sigma=0.8, size=num_requests
).astype(int)
output_lengths = np.clip(output_lengths, 10, 500).tolist()

static_results = simulate_workload(
    num_requests, batch_size, output_lengths, "static"
)
continuous_results = simulate_workload(
    num_requests, batch_size, output_lengths, "continuous"
)
Out[26]:
Console
Workload: 100 requests, batch size 8
Output length range: 10 - 370 tokens
Mean output length: 82.2 tokens

Static Batching:
  Total iterations: 2722
  Average latency: 214.8 iterations
  Throughput: 0.037 requests/iteration

Continuous Batching:
  Total iterations: 1148
  Average latency: 82.2 iterations
  Throughput: 0.087 requests/iteration

Improvement:
  Throughput gain: 2.37x
  Latency reduction: 61.7%

Continuous batching achieves significantly higher throughput and lower average latency. While the static batch is held back by the longest sequence in each group, continuous batching processes requests at their natural speed, returning short responses quickly while long ones continue processing.

In[27]:
Code
# Test with different variance levels
variances = [0.2, 0.5, 0.8, 1.2]
static_throughputs = []
continuous_throughputs = []
static_latencies = []
continuous_latencies = []

for var in variances:
    lengths = np.random.lognormal(mean=4.0, sigma=var, size=100).astype(int)
    lengths = np.clip(lengths, 10, 500).tolist()

    static = simulate_workload(100, 8, lengths, "static")
    continuous = simulate_workload(100, 8, lengths, "continuous")

    static_throughputs.append(static["throughput"])
    continuous_throughputs.append(continuous["throughput"])
    static_latencies.append(static["average_latency"])
    continuous_latencies.append(continuous["average_latency"])
Out[28]:
Visualization
Bar chart comparing throughput and latency for static vs continuous batching.
Throughput comparison across output length variances. Continuous batching (blue) maintains high throughput as variance increases, whereas static batching (orange) efficiency drops due to padding overhead.
Latency comparison across output length variances. Average latency remains stable for continuous batching but degrades significantly for static batching as the gap between shortest and longest requests widens.

The improvement from continuous batching grows with output length variance. When all requests generate similar lengths, static batching loses less to waiting because the maximum is close to the average. But real workloads have high variance: some users ask for single-sentence summaries while others request long, detailed responses, making continuous batching essential for practical deployment.

Implementation Considerations

Building a production continuous batching system requires careful attention to several practical concerns that go beyond the basic algorithm. The theory is elegant, but the engineering details determine whether the system achieves its theoretical potential.

Memory Management Integration

Continuous batching works best when paired with paged attention, as covered in the Paged Attention chapter. Without it, memory fragmentation limits how many requests can be active simultaneously. Traditional contiguous allocation reserves maximum-length KV cache space for each request, even if most requests use far less. Paged attention allocates memory in small blocks on demand, enabling much higher concurrency.

In[29]:
Code
@dataclass
class MemoryConfig:
    """Configuration for KV cache memory."""

    total_blocks: int  # Total memory blocks available
    block_size: int  # Tokens per block


def estimate_blocks_needed(seq_length: int, block_size: int) -> int:
    """Estimate memory blocks needed for a sequence."""
    return (seq_length + block_size - 1) // block_size  # Ceiling division


class MemoryAwareScheduler:
    """Scheduler that respects memory constraints."""

    def __init__(self, config: MemoryConfig, max_batch_size: int):
        self.config = config
        self.max_batch_size = max_batch_size
        self.free_blocks = config.total_blocks
        self.active_requests: dict[int, int] = {}  # request_id -> blocks_used

    def can_admit(self, request_length: int) -> bool:
        """Check if we have memory for a new request."""
        needed = estimate_blocks_needed(request_length, self.config.block_size)
        return (
            needed <= self.free_blocks
            and len(self.active_requests) < self.max_batch_size
        )

    def admit_request(self, request_id: int, initial_length: int) -> bool:
        """Admit a request if resources are available."""
        if not self.can_admit(initial_length):
            return False

        blocks = estimate_blocks_needed(initial_length, self.config.block_size)
        self.active_requests[request_id] = blocks
        self.free_blocks -= blocks
        return True

    def grow_request(self, request_id: int, new_tokens: int = 1) -> bool:
        """Allocate additional blocks as sequence grows."""
        if request_id not in self.active_requests:
            return False

        # Check if we need a new block
        current_blocks = self.active_requests[request_id]
        current_capacity = current_blocks * self.config.block_size

        # Simplified: assume sequence length = current_capacity + new_tokens
        new_length = current_capacity + new_tokens
        needed_blocks = estimate_blocks_needed(
            new_length, self.config.block_size
        )

        additional = needed_blocks - current_blocks
        if additional > 0:
            if additional > self.free_blocks:
                return False  # Out of memory
            self.free_blocks -= additional
            self.active_requests[request_id] = needed_blocks

        return True

    def release_request(self, request_id: int):
        """Release all blocks for a completed request."""
        if request_id in self.active_requests:
            self.free_blocks += self.active_requests[request_id]
            del self.active_requests[request_id]

Handling Preemption

When memory pressure is high, the scheduler might need to preempt active requests to make room for higher-priority work. This involves saving the request's state, including the KV cache and generated tokens, for later resumption. Preemption is a powerful tool for managing resources but adds significant complexity.

The decision of which request to preempt involves tradeoffs between fairness, throughput, and implementation complexity. Preempting a request that has generated many tokens wastes the computation invested in it. Preempting a request that has been preempted before may seem unfair. Different policies balance these concerns differently.

In[30]:
Code
from enum import Enum


class RequestStatus(Enum):
    WAITING = "waiting"
    RUNNING = "running"
    PREEMPTED = "preempted"
    COMPLETED = "completed"


@dataclass
class PreemptibleRequest:
    """Request that can be preempted and resumed."""

    id: int
    status: RequestStatus = RequestStatus.WAITING
    tokens_generated: int = 0
    preempt_count: int = 0


class PreemptionPolicy:
    """Policy for choosing which requests to preempt."""

    @staticmethod
    def longest_first(
        requests: List[PreemptibleRequest],
    ) -> PreemptibleRequest | None:
        """Preempt the request that has generated the most tokens."""
        running = [r for r in requests if r.status == RequestStatus.RUNNING]
        if not running:
            return None
        return max(running, key=lambda r: r.tokens_generated)

    @staticmethod
    def most_preempted_last(
        requests: List[PreemptibleRequest],
    ) -> PreemptibleRequest | None:
        """Prefer preempting requests that haven't been preempted before."""
        running = [r for r in requests if r.status == RequestStatus.RUNNING]
        if not running:
            return None
        return min(
            running, key=lambda r: (r.preempt_count, -r.tokens_generated)
        )
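
The policies above only choose a victim; the scheduler also has to checkpoint and later restore that victim. A minimal sketch of the two operations follows, built on the PreemptibleRequest and RequestStatus types just defined. It simplifies aggressively: preemption releases the victim's KV blocks and keeps only its token count, so the prefix must be recomputed (or reloaded from a swap area) on resume.

Code
def preempt(request: PreemptibleRequest, kv_blocks_held: int,
            free_blocks: int, waiting: deque) -> int:
    """Checkpoint a running request and release its KV cache blocks.

    Returns the updated free-block count. Only the generated-token count is
    kept; a real system would also save or swap the KV cache contents.
    """
    request.status = RequestStatus.PREEMPTED
    request.preempt_count += 1
    waiting.appendleft(request)          # resume before newly arrived work
    return free_blocks + kv_blocks_held  # KV blocks return to the pool


def resume(waiting: deque, free_blocks: int, blocks_needed: int):
    """Re-admit the next waiting request if memory allows."""
    if not waiting or blocks_needed > free_blocks:
        return None, free_blocks
    request = waiting.popleft()
    request.status = RequestStatus.RUNNING
    return request, free_blocks - blocks_needed


# Toy walk-through of a preempt/resume cycle
waiting_queue: deque = deque()
victim = PreemptibleRequest(id=7, status=RequestStatus.RUNNING, tokens_generated=42)
free = preempt(victim, kv_blocks_held=6, free_blocks=2, waiting=waiting_queue)
print(f"after preemption: free blocks = {free}, status = {victim.status.value}")

resumed, free = resume(waiting_queue, free_blocks=free, blocks_needed=6)
print(f"after resume: request {resumed.id} running again, free blocks = {free}")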

Batched Token Operations

For efficiency, token generation for all active requests should happen in a single batched forward pass. This requires careful tensor management to handle requests at different sequence positions. Unlike training, where all sequences in a batch are padded to the same length, continuous batching must handle sequences of varying lengths efficiently.

The key challenge is that different requests have different position embeddings and attention masks. Position embeddings tell the model where each token sits in its sequence. Attention masks prevent tokens from attending to future positions. Both vary by request, but both must be batched together for efficient GPU utilization.

In[31]:
Code
import torch


def prepare_batch_tensors(
    active_requests: List[dict], vocab_size: int = 32000
) -> dict:
    """
    Prepare input tensors for a continuous batch.

    Each request has:
        - 'last_token': The most recently generated token
        - 'seq_length': Current sequence length (for position embedding)
    """
    batch_size = len(active_requests)

    # Each decode step processes just the last token
    input_ids = torch.tensor(
        [r["last_token"] for r in active_requests], dtype=torch.long
    ).unsqueeze(1)  # Shape: [batch, 1]

    # Position IDs vary by request
    position_ids = torch.tensor(
        [r["seq_length"] for r in active_requests], dtype=torch.long
    ).unsqueeze(1)  # Shape: [batch, 1]

    return {
        "input_ids": input_ids,
        "position_ids": position_ids,
        "batch_size": batch_size,
    }
In[32]:
Code
# Example batch preparation
example_requests = [
    {"last_token": 1234, "seq_length": 50},  # Short sequence
    {"last_token": 5678, "seq_length": 200},  # Medium sequence
    {"last_token": 9012, "seq_length": 15},  # Very short
    {"last_token": 3456, "seq_length": 500},  # Long sequence
]

tensors = prepare_batch_tensors(example_requests)
Out[33]:
Console
Batch Tensor Preparation:
  Input IDs shape: torch.Size([4, 1])
  Input IDs: [1234, 5678, 9012, 3456]
  Position IDs: [50, 200, 15, 500]

The key insight is that despite requests being at different points in their generation, they can share a single forward pass because attention masks and position embeddings handle the differences. The input is just one token per request, but each token gets its own position embedding corresponding to its place in that request's sequence.
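
Position IDs are only half of the story: the attention kernel also needs to know how many cached tokens each request may attend to. A minimal sketch of that bookkeeping is below; it builds a boolean mask in which each row covers only that request's own cached positions, padded to the longest sequence in the batch. The function name and the dense mask layout are illustrative (paged attention kernels use block tables rather than dense masks), but the per-request length information is the same.

Code
def build_decode_attention_mask(seq_lengths: List[int]) -> torch.Tensor:
    """Boolean mask of shape [batch, max_len]: True where the new token may
    attend to a cached position, False over the padded tail."""
    max_len = max(seq_lengths)
    positions = torch.arange(max_len).unsqueeze(0)    # [1, max_len]
    lengths = torch.tensor(seq_lengths).unsqueeze(1)  # [batch, 1]
    return positions < lengths                        # [batch, max_len]


mask = build_decode_attention_mask([50, 200, 15, 500])
print("mask shape:", tuple(mask.shape))
print("valid cached positions per request:", mask.sum(dim=1).tolist())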

Key Parameters

The key parameters for the continuous batching implementation are:

  • max_batch_size. Maximum number of concurrent requests. Higher values improve throughput but require more memory. The optimal value depends on model size, available GPU memory, and expected request characteristics.
  • block_size. Number of tokens per memory block in paged attention. Typically 16 or 32. Smaller blocks reduce fragmentation but increase bookkeeping overhead.
  • chunk_size. Number of tokens processed per iteration during prefill. Helps balance latency and throughput. Smaller chunks provide more responsive decode but may reduce prefill efficiency.
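
A hypothetical configuration object tying these knobs together might look like the sketch below; the defaults are illustrative starting points, not recommendations from any particular serving framework.

Code
@dataclass
class BatchingConfig:
    """Illustrative knobs for a continuous batching scheduler."""

    max_batch_size: int = 32  # concurrent requests, bounded by KV cache memory
    block_size: int = 16      # tokens per paged-attention block
    chunk_size: int = 512     # prompt tokens processed per prefill chunk

    def max_cached_tokens(self, total_blocks: int) -> int:
        """Upper bound on cached tokens across all active requests."""
        return total_blocks * self.block_size


config = BatchingConfig()
print(config)
print("KV capacity with 4096 blocks:", config.max_cached_tokens(4096), "tokens")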

Real-World Performance

Production continuous batching systems like vLLM, TensorRT-LLM, and Text Generation Inference report significant improvements over static batching. These systems combine continuous batching with paged attention, optimized kernels, and sophisticated scheduling to achieve impressive throughput on commodity hardware.

In[34]:
Code
# Simulated benchmark data based on published results
benchmark_data = {
    "batch_size": [1, 4, 8, 16, 32],
    "static_throughput": [15, 45, 70, 95, 110],  # tokens/sec
    "continuous_throughput": [15, 58, 120, 210, 350],  # tokens/sec
    "static_latency_p50": [65, 85, 110, 150, 220],  # ms
    "continuous_latency_p50": [65, 70, 75, 85, 100],  # ms
}
Out[35]:
Visualization
Line plot showing throughput vs batch size for both batching strategies.
Throughput scaling with batch size. Continuous batching (blue) scales throughput efficiently with batch size, significantly outperforming static batching (orange) which plateaus due to synchronization overhead.
Latency scaling with batch size. Continuous batching maintains stable low P50 latency even as batch size grows, whereas static batching latency increases linearly.

The throughput advantage of continuous batching compounds at larger batch sizes because GPU utilization stays high even as requests complete at different times. Static batching's throughput plateaus as the "waiting for slowest" overhead dominates, while continuous batching continues to scale.

Limitations and Impact

Continuous batching, while transformative for LLM serving, comes with its own set of challenges and tradeoffs that practitioners must understand. No optimization is free, and the benefits of continuous batching come with real engineering costs.

The primary complexity lies in implementation. A continuous batching system requires tight integration between the scheduler, memory manager, and model execution engine. The scheduler must make split-second decisions about which requests to run, potentially thousands of times per second. Any inefficiency in this hot path directly reduces throughput. Furthermore, debugging becomes harder when requests can be preempted, resumed, and interleaved in complex patterns. Reproducing issues requires capturing the exact sequence of scheduling decisions, which may be difficult in a highly dynamic system.

Memory management presents another significant challenge. While paged attention solves fragmentation, the combination of variable-length requests and preemption creates complex allocation patterns. Memory pressure can cascade: preempting one request to admit another might itself require preemption of a third request, leading to difficult-to-predict behavior. Production systems need careful tuning of admission policies and preemption thresholds to avoid thrashing.

There are also fairness considerations. Without careful policy design, long requests might be repeatedly preempted in favor of short ones, leading to starvation. Conversely, prioritizing completion of long requests might inflate latency for short queries. Finding the right balance depends on the specific application's requirements and its users' expectations.

Despite these challenges, continuous batching has fundamentally changed LLM deployment economics. Systems that previously needed multiple GPUs to serve moderate traffic can now handle the same load on a single device. This efficiency improvement makes LLMs practical for a much wider range of applications and organizations. The technique has become standard in production serving frameworks, and understanding its principles is essential when deploying language models at scale.

As we'll explore in the next chapter on Inference Serving, continuous batching is just one component of a complete serving system. Real deployments must also handle request routing, load balancing, model replication, and failure recovery, all while maintaining the low latency you expect.

Summary

Continuous batching transforms LLM inference from rigid, inefficient batch processing into a dynamic, high-utilization system. The key ideas we covered include:

  • Static batching inefficiency. Traditional batching forces all requests to wait for the slowest, wasting compute on completed requests and inflating latencies
  • Iteration-level scheduling. Treating each decode step as a scheduling opportunity allows immediate eviction of completed requests and admission of waiting ones
  • Phase handling. Managing the distinct prefill and decode phases, potentially with chunked prefill, balances compute characteristics across the batch
  • Resource management. Tight integration with KV cache memory management, including paged attention, maximizes the number of concurrent requests
  • Throughput gains. Real-world improvements of 2-3x throughput are common, with better latency characteristics especially for short requests

Continuous batching is now standard practice for production LLM serving. Combined with the memory optimizations from paged attention and the speculative decoding techniques we covered earlier, it enables serving LLMs at scale with reasonable hardware costs.
