LLM Inference Serving: Architecture, Routing & Auto-Scaling

Michael Brenndoerfer · January 19, 2026 · 51 min read

Master LLM inference serving architecture, token-aware load balancing, and auto-scaling. Optimize time-to-first-token and throughput for production systems.

Inference Serving

Deploying a language model involves much more than loading weights and running forward passes. Production systems must handle thousands of concurrent users, maintain low latency under varying load, and do so cost-effectively. Inference serving encompasses the architecture, routing, and scaling strategies that transform a trained model into a reliable, performant service.

The previous chapters in this part covered techniques that make individual inference requests faster and more memory-efficient: KV caching reduces redundant computation, quantization shrinks memory footprints, speculative decoding accelerates token generation, and continuous batching maximizes GPU utilization. This chapter focuses on the layer above: how to orchestrate these capabilities into a production system that serves many users simultaneously while meeting latency and throughput requirements.

Inference Server Architecture

An inference server sits between client applications and the underlying model, handling request management, batching, scheduling, and response streaming. Modern LLM inference servers have evolved specialized architectures to address the unique challenges of autoregressive generation. Understanding these architectural patterns is essential for anyone building or operating production language model services.

Core Components

The architecture of an LLM inference server differs from traditional ML serving systems in fundamental ways that stem from the nature of autoregressive text generation. Classification models process requests independently with fixed compute costs: an image classifier performs the same computation regardless of the input image's content. Language models, by contrast, generate tokens iteratively, with each request potentially requiring hundreds of forward passes and memory that grows with sequence length. This iterative, variable-cost nature demands specialized architectural components that can efficiently manage the unique resource patterns of text generation.

A typical inference server contains several interconnected components, each playing a distinct role in the request processing pipeline:

  • Request handler: Accepts incoming requests over HTTP/gRPC, validates inputs, and manages response streaming. For LLMs, this often involves Server-Sent Events (SSE) or WebSocket connections to stream tokens as they're generated. The request handler must maintain long-lived connections efficiently, as a single request may take several seconds to complete while tokens are progressively returned to the client.

  • Tokenizer: Converts text to token IDs on input and back to text on output. While fast compared to model inference, tokenization can become a bottleneck at high throughput because it runs on the CPU while the model runs on the GPU. Servers often parallelize tokenization across CPU cores to prevent this component from limiting overall system throughput.

  • Scheduler: Determines which requests to process together in each batch. As we discussed in the previous chapter on continuous batching, sophisticated schedulers dynamically add and remove requests from running batches. The scheduler must balance multiple objectives: maximizing GPU utilization, maintaining fair request ordering, and meeting latency targets for high-priority requests.

  • Model executor: Manages the actual model inference, including memory allocation for KV caches, attention computation, and integration with optimized kernels like FlashAttention. This component coordinates the GPU computation and ensures that the model weights, activations, and cached attention states are efficiently arranged in memory.

  • Memory manager: Tracks KV cache allocations, implements paged attention for efficient memory use, and handles memory pressure through eviction policies. Because KV cache memory requirements grow with sequence length and concurrent request count, intelligent memory management is critical for maintaining high throughput without running out of GPU memory.

Several frameworks have emerged to handle LLM serving at scale, each with different strengths and design philosophies:

vLLM implements PagedAttention for efficient KV cache management, enabling high throughput with memory-constrained hardware. By treating KV cache memory like virtual memory pages, vLLM can handle more concurrent requests than naive implementations that allocate contiguous memory blocks. It supports continuous batching and integrates with popular model architectures, making it a popular choice for production deployments.

Text Generation Inference (TGI) from Hugging Face provides production-ready serving with tensor parallelism for multi-GPU deployments, quantization support, and optimized attention implementations. Its tight integration with the Hugging Face ecosystem makes it particularly convenient for teams already using Transformers for model development.

TensorRT-LLM from NVIDIA offers highly optimized inference for NVIDIA GPUs, with custom CUDA kernels and support for quantization schemes like FP8 and INT4. When running on NVIDIA hardware, TensorRT-LLM often achieves the highest raw performance, though it requires more effort to set up than framework-agnostic alternatives.

Triton Inference Server provides a model-agnostic serving platform that can host multiple models with different backends, useful for ensemble architectures or multi-model deployments. Its flexibility makes it well-suited for complex inference pipelines that combine multiple models or processing steps.

In[2]:
Code
# Example: Basic inference server configuration concepts
# This demonstrates the key parameters inference servers expose

server_config = {
    # Model configuration
    "model_name": "meta-llama/Llama-2-7b-chat-hf",
    "tensor_parallel_size": 1,  # GPUs for tensor parallelism
    "dtype": "float16",
    # Memory management
    "gpu_memory_utilization": 0.90,  # Reserve 90% of GPU memory
    "max_num_seqs": 256,  # Maximum concurrent sequences
    "max_model_len": 4096,  # Maximum sequence length
    # Batching configuration
    "max_num_batched_tokens": 8192,  # Tokens per batch
    "enable_chunked_prefill": True,  # Chunk long prefills
    # Request handling
    "max_log_len": 1000,  # Truncate logged prompts
    "disable_log_stats": False,  # Enable throughput logging
}
Out[3]:
Console
Server Configuration Summary:
  Model: meta-llama/Llama-2-7b-chat-hf
  GPU Memory Target: 90%
  Max Concurrent Sequences: 256
  Max Tokens per Batch: 8192

The configuration balances memory allocation against concurrency in a careful trade-off. Higher gpu_memory_utilization allows more KV cache space for concurrent requests, but leaves less headroom for activation memory during computation. Setting this value too high can cause out-of-memory errors during prefill of long sequences; setting it too low wastes expensive GPU memory that could support additional concurrent requests.
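
To make this trade-off concrete, the sketch below estimates how many concurrent sequences a given memory budget can hold in KV cache. The GPU size, model footprint, and cache layout are illustrative assumptions in the spirit of the configuration above, not measurements of a specific deployment.

# Back-of-the-envelope KV cache budget (illustrative, assumed numbers)
gpu_memory_gb = 80                 # assumed 80 GB accelerator
gpu_memory_utilization = 0.90      # fraction the server may claim
weights_gb = 14                    # ~7B parameters in FP16

# Per-token KV cache for a 7B-class model (assumed architecture):
# 32 layers x 32 heads x 128 head_dim x 2 (K and V) x 2 bytes (FP16)
kv_bytes_per_token = 32 * 32 * 128 * 2 * 2

usable_gb = gpu_memory_gb * gpu_memory_utilization
kv_budget_gb = usable_gb - weights_gb
max_cached_tokens = int(kv_budget_gb * 1e9 / kv_bytes_per_token)

print(f"KV cache budget: {kv_budget_gb:.1f} GB")
print(f"Cached tokens supported: {max_cached_tokens:,}")
print(f"Concurrent 4096-token sequences: {max_cached_tokens // 4096}")

Raising gpu_memory_utilization grows the cache budget roughly linearly, which is why this single parameter has such a direct effect on achievable concurrency.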

Key Parameters

The key parameters for inference server configuration are:

  • gpu_memory_utilization: Fraction of GPU memory to reserve for the model weights and KV cache.
  • max_num_seqs: Maximum number of concurrent sequences the server can handle.
  • enable_chunked_prefill: Whether to split the prefill phase into chunks to prevent blocking the decode phase of other requests.

Request Lifecycle

Understanding the journey of a request through the server helps identify optimization opportunities and debug performance issues. Each stage presents different bottlenecks and optimization levers:

  1. Arrival: The request arrives via HTTP POST with prompt text and generation parameters (temperature, max tokens, etc.). The server validates the request format and assigns it a unique identifier for tracking.

  2. Preprocessing: The server tokenizes the prompt and validates that the resulting sequence fits within model limits. Long prompts may be rejected or truncated depending on server configuration.

  3. Scheduling: The scheduler decides when to begin processing. The request may queue if all batch slots are occupied. Priority and fairness policies determine ordering among waiting requests.

  4. Prefill: The model processes all prompt tokens in parallel, populating the KV cache with attention states for all input positions. This phase is compute-bound because it performs dense matrix multiplications across the entire prompt length.

  5. Decode Loop: Tokens are generated one at a time, with each new token streamed to the client. This phase is memory-bandwidth-bound because each forward pass must read the entire KV cache from GPU memory while producing only a single output token.

  6. Completion: Generation stops when the model produces an end token or reaches the maximum length. The server releases KV cache memory and closes the response stream, making capacity available for new requests.

In[4]:
Code
import time
from dataclasses import dataclass, field
from typing import Optional
from enum import Enum


class RequestState(Enum):
    QUEUED = "queued"
    PREFILLING = "prefilling"
    DECODING = "decoding"
    COMPLETED = "completed"
    FAILED = "failed"


@dataclass
class InferenceRequest:
    """Tracks a request through its lifecycle."""

    request_id: str
    prompt: str
    max_tokens: int = 256
    temperature: float = 1.0

    # Timing metrics
    arrival_time: float = field(default_factory=time.time)
    prefill_start: Optional[float] = None
    decode_start: Optional[float] = None
    completion_time: Optional[float] = None

    # State tracking
    state: RequestState = RequestState.QUEUED
    tokens_generated: int = 0
    prompt_tokens: int = 0

    def time_to_first_token(self) -> Optional[float]:
        """Latency from arrival to first generated token."""
        if self.decode_start is not None:
            return self.decode_start - self.arrival_time
        return None

    def total_latency(self) -> Optional[float]:
        """Total request latency."""
        if self.completion_time is not None:
            return self.completion_time - self.arrival_time
        return None
In[5]:
Code
# Simulate a request lifecycle
req = InferenceRequest(
    request_id="req-001", prompt="Explain quantum computing", max_tokens=100
)
req.prompt_tokens = 4  # Simulated

# Simulate timing
req.prefill_start = req.arrival_time + 0.015  # 15ms queue time
req.decode_start = req.prefill_start + 0.050  # 50ms prefill
req.state = RequestState.DECODING
req.tokens_generated = 100
req.completion_time = req.decode_start + 2.0  # 2s decode
Out[6]:
Console
Request req-001 Metrics:
  Prompt tokens: 4
  Generated tokens: 100
  Time to first token: 65.0ms
  Total latency: 2065ms
  Decode throughput: 50.0 tokens/sec

The metrics distinguish between time-to-first-token (TTFT) and total latency because these capture fundamentally different aspects of user experience. TTFT measures user-perceived responsiveness in streaming applications: when you send a message, how quickly does the response start to appear? Total latency matters more for batch processing, where results are consumed only after generation completes. Optimizing these metrics often involves different trade-offs. For example, aggressive batching improves total throughput but can increase TTFT because requests wait for batch formation.

Request Routing

When multiple model instances serve requests, a routing layer decides which instance handles each request. Effective request routing maximizes throughput while maintaining consistent latency. The routing layer must make rapid decisions with incomplete information, balancing immediate load distribution against longer-term efficiency considerations.

Model Routing Strategies

Simple deployments route all requests to a single model version, treating the routing problem as pure load balancing. Production systems often need more sophisticated routing that considers request characteristics and business requirements:

Version-based routing directs requests to different model versions based on request metadata. This enables A/B testing by directing a fraction of traffic to a new model version, gradual rollouts that progressively shift traffic to updated models, and fallback to stable versions if new deployments exhibit problems. For example, a system might route 5% of traffic to a new model version while monitoring quality metrics before expanding the rollout.

Capability-based routing matches requests to appropriate models based on task complexity or domain. A lightweight model might handle simple queries like basic factual questions, while complex reasoning tasks route to larger, more capable models. This approach reduces cost without sacrificing quality where it matters: simple questions get fast, cheap answers while challenging problems receive appropriate computational resources.

Priority routing ensures high-priority requests (paying customers, critical applications) get preferential treatment in terms of both queue position and endpoint selection. Lower-priority requests may queue longer or route to less powerful instances. This enables tiered service levels without requiring completely separate infrastructure for each tier.

In[7]:
Code
from dataclasses import dataclass
from typing import Dict, List, Optional
import random


@dataclass
class ModelEndpoint:
    """Represents a model instance that can serve requests."""

    endpoint_id: str
    model_name: str
    capacity: int  # Max concurrent requests
    current_load: int = 0
    is_healthy: bool = True

    @property
    def available_capacity(self) -> int:
        return self.capacity - self.current_load if self.is_healthy else 0


class ModelRouter:
    """Routes requests to appropriate model endpoints."""

    def __init__(self):
        self.endpoints: Dict[str, List[ModelEndpoint]] = {}

    def register_endpoint(self, model_name: str, endpoint: ModelEndpoint):
        if model_name not in self.endpoints:
            self.endpoints[model_name] = []
        self.endpoints[model_name].append(endpoint)

    def route_request(
        self, model_name: str, priority: int = 0
    ) -> Optional[ModelEndpoint]:
        """
        Find the best endpoint for a request.
        Higher priority requests can preempt lower priority ones.
        """
        candidates = self.endpoints.get(model_name, [])

        # Filter to healthy endpoints with capacity
        available = [e for e in candidates if e.available_capacity > 0]

        if not available:
            return None

        # For high priority, pick least loaded
        # For low priority, pick randomly among available
        if priority > 5:
            return min(available, key=lambda e: e.current_load)
        else:
            return random.choice(available)
In[8]:
Code
# Demonstrate routing behavior
router = ModelRouter()

# Register multiple endpoints for the same model
for i in range(3):
    endpoint = ModelEndpoint(
        endpoint_id=f"llama-{i}",
        model_name="llama-7b",
        capacity=100,
        current_load=random.randint(20, 80),
    )
    router.register_endpoint("llama-7b", endpoint)

# Route some requests at different priorities
routing_results = []
for priority in [1, 5, 8]:
    selected = router.route_request("llama-7b", priority=priority)
    routing_results.append((priority, selected))
Out[9]:
Console
Registered Endpoints:
  llama-0: load=76/100
  llama-1: load=75/100
  llama-2: load=25/100

Routing Decisions:
  Priority 1 -> llama-2 (load: 25)
  Priority 5 -> llama-0 (load: 76)
  Priority 8 -> llama-2 (load: 25)

The router considers both capacity and priority when making routing decisions. High-priority requests route to the least-loaded endpoint to minimize queuing time and ensure fast response. Lower-priority requests distribute randomly among available endpoints to prevent creating hotspots while still avoiding overloaded instances. This differentiated treatment allows the system to provide better service to priority traffic without completely starving lower-priority requests.

Health Checks and Failover

Effective routing depends on accurate health information about each endpoint. If the router sends requests to an unhealthy endpoint, those requests will fail or experience long delays. Servers implement multiple health check levels to detect different kinds of problems:

  • Liveness checks verify the process is running and responding to basic probes. A hanging process fails liveness checks even if it hasn't technically crashed. These checks detect catastrophic failures quickly.

  • Readiness checks verify the server can accept new requests. A server in the process of loading a model passes liveness (the process is running) but fails readiness (it cannot yet serve requests). This distinction prevents routing traffic to instances that are starting up.

  • Deep health checks verify end-to-end functionality by running test inferences. These checks catch subtle issues like corrupted model weights, CUDA driver problems, or memory leaks that wouldn't be detected by simpler checks. The trade-off is that deep checks consume GPU resources and take longer to complete.

In[10]:
Code
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class HealthStatus:
    is_live: bool = True
    is_ready: bool = True
    last_check: float = 0.0
    consecutive_failures: int = 0
    error_message: Optional[str] = None


class HealthChecker:
    """Monitors endpoint health with configurable checks."""

    def __init__(
        self,
        check_interval: float = 10.0,
        failure_threshold: int = 3,
        recovery_threshold: int = 2,
    ):
        self.check_interval = check_interval
        self.failure_threshold = failure_threshold
        self.recovery_threshold = recovery_threshold
        self.endpoint_health: Dict[str, HealthStatus] = {}

    def check_endpoint(
        self,
        endpoint_id: str,
        liveness_fn: Callable[[], bool],
        readiness_fn: Callable[[], bool],
    ) -> HealthStatus:
        """Perform health check and update status."""
        status = self.endpoint_health.get(endpoint_id, HealthStatus())

        try:
            status.is_live = liveness_fn()
            status.is_ready = readiness_fn() if status.is_live else False

            if status.is_live and status.is_ready:
                status.consecutive_failures = 0
                status.error_message = None
            else:
                status.consecutive_failures += 1

        except Exception as e:
            status.is_live = False
            status.is_ready = False
            status.consecutive_failures += 1
            status.error_message = str(e)

        status.last_check = time.time()
        self.endpoint_health[endpoint_id] = status

        return status

    def should_remove_from_rotation(self, endpoint_id: str) -> bool:
        """Check if endpoint should be removed due to failures."""
        status = self.endpoint_health.get(endpoint_id)
        if status is None:
            return False
        return status.consecutive_failures >= self.failure_threshold
In[11]:
Code
# Simulate health checking
checker = HealthChecker(failure_threshold=3)

# Simulate checks for different endpoint states
scenarios = [
    ("healthy-endpoint", lambda: True, lambda: True),
    ("loading-endpoint", lambda: True, lambda: False),
    ("failing-endpoint", lambda: False, lambda: False),
]

check_results = []
for endpoint_id, live_fn, ready_fn in scenarios:
    status = checker.check_endpoint(endpoint_id, live_fn, ready_fn)
    check_results.append((endpoint_id, status))
Out[12]:
Console
Health Check Results:
  healthy-endpoint:
    Live: True, Ready: True
    Failures: 0
  loading-endpoint:
    Live: True, Ready: False
    Failures: 1
  failing-endpoint:
    Live: False, Ready: False
    Failures: 1

Health checks enable automatic failover without human intervention. When an endpoint exceeds the failure threshold (typically 3 consecutive failures), the router removes it from rotation and stops sending new requests. This prevents user-facing errors from accumulating while the underlying issue is investigated. After the underlying issue resolves, consecutive successful checks restore the endpoint to service. The recovery threshold ensures the endpoint is genuinely healthy before receiving traffic again, preventing oscillation between healthy and unhealthy states.

Load Balancing

Load balancing distributes requests across multiple model instances to maximize throughput and maintain consistent latency. While load balancing is a well-studied problem in distributed systems, LLM serving presents unique challenges that traditional load balancing algorithms don't address well. The variable cost of requests, the long-running nature of generation, and the importance of memory locality all complicate the problem.

Traditional Algorithms

Standard load balancing approaches provide a foundation for understanding the problem, even if they don't fully address LLM-specific challenges:

Round-robin distributes requests sequentially across endpoints in a fixed order. The first request goes to endpoint 0, the second to endpoint 1, and so on, cycling back to 0 after reaching the last endpoint. This approach is simple, predictable, and fair in terms of request count. However, it ignores actual load, which matters when requests have varying computational costs. A long generation request and a short one count equally, even though they impose vastly different burdens.

Least connections routes each new request to the endpoint with the fewest active requests. This approach adapts better to variable-duration requests because endpoints processing long requests naturally accumulate fewer connections over time. However, it still treats all requests as equal cost, which fails to capture the difference between generating 10 tokens and generating 1000 tokens.

Weighted distribution assigns requests proportionally to endpoint capacity, which is useful when endpoints have different hardware. An H100 GPU might receive twice as many requests as an A100 because it can process them twice as fast. Weights can be configured manually based on hardware specs or adjusted dynamically based on observed performance.

In[13]:
Code
from typing import List, Optional
import random
from collections import defaultdict


class LoadBalancer:
    """Implements various load balancing strategies."""

    def __init__(self, endpoints: List[ModelEndpoint]):
        self.endpoints = endpoints
        self.round_robin_index = 0
        self.request_counts = defaultdict(int)

    def round_robin(self) -> Optional[ModelEndpoint]:
        """Simple round-robin selection."""
        healthy = [e for e in self.endpoints if e.is_healthy]
        if not healthy:
            return None

        selected = healthy[self.round_robin_index % len(healthy)]
        self.round_robin_index += 1
        return selected

    def least_connections(self) -> Optional[ModelEndpoint]:
        """Select endpoint with lowest current load."""
        healthy = [e for e in self.endpoints if e.is_healthy]
        if not healthy:
            return None

        return min(healthy, key=lambda e: e.current_load)

    def weighted_random(self) -> Optional[ModelEndpoint]:
        """Random selection weighted by available capacity."""
        healthy = [e for e in self.endpoints if e.is_healthy]
        if not healthy:
            return None

        total_capacity = sum(e.available_capacity for e in healthy)
        if total_capacity == 0:
            return None

        weights = [e.available_capacity / total_capacity for e in healthy]
        return random.choices(healthy, weights=weights)[0]
In[14]:
Code
# Create endpoints with varying capacities and loads
endpoints = [
    ModelEndpoint("gpu-0", "llama", capacity=100, current_load=30),
    ModelEndpoint("gpu-1", "llama", capacity=100, current_load=70),
    ModelEndpoint("gpu-2", "llama", capacity=150, current_load=50),
]

balancer = LoadBalancer(endpoints)

# Simulate 1000 routing decisions with each strategy
strategies = {
    "Round Robin": balancer.round_robin,
    "Least Connections": balancer.least_connections,
    "Weighted Random": balancer.weighted_random,
}

results = {}
for name, strategy in strategies.items():
    # Reset state
    balancer.round_robin_index = 0
    counts = defaultdict(int)

    for _ in range(1000):
        selected = strategy()
        if selected:
            counts[selected.endpoint_id] += 1

    results[name] = dict(counts)
Out[15]:
Console
Request Distribution (1000 requests):
--------------------------------------------------

Round Robin:
  gpu-0: 334 (33.4%)
  gpu-1: 333 (33.3%)
  gpu-2: 333 (33.3%)

Least Connections:
  gpu-0: 1000 (100.0%)
  gpu-1: 0 (0.0%)
  gpu-2: 0 (0.0%)

Weighted Random:
  gpu-0: 360 (36.0%)
  gpu-1: 152 (15.2%)
  gpu-2: 488 (48.8%)
Out[16]:
Visualization
Request distribution across multiple endpoints using different load balancing strategies. Round-robin distributes traffic evenly regardless of current load, while Least Connections prioritizes the least-loaded endpoint (gpu-0). Weighted Random distributes requests based on available capacity, balancing utilization across heterogeneous hardware instances.

The results illustrate the behavioral differences between strategies. Round-robin distributes evenly regardless of load, giving each endpoint roughly 33% of requests. Least connections concentrates traffic on the least-loaded endpoint (gpu-0), which may improve its latency but could overload it if requests are long-running. Weighted random accounts for available capacity, directing more requests to endpoints with headroom while maintaining some randomness to avoid perfect synchronization effects.

Token-Aware Load Balancing

Traditional metrics like connection count fail to capture LLM workload characteristics because they treat all requests as having equal cost. In reality, a request generating 1000 tokens consumes far more GPU resources and time than one generating 10 tokens, even though both count as single connections. Similarly, a request with a 4000-token prompt requires substantially more prefill computation than one with a 100-token prompt.

Token-aware load balancing accounts for the actual computational cost of requests by tracking token counts rather than just request counts. This provides a much more accurate picture of actual endpoint load:

In[17]:
Code
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TokenAwareEndpoint:
    """Endpoint that tracks load in tokens, not just requests."""

    endpoint_id: str
    max_tokens_per_second: int  # Throughput capacity
    current_prompt_tokens: int = 0  # Tokens in prefill queue
    current_decode_tokens: int = 0  # Tokens being decoded
    active_requests: int = 0

    def estimated_load(self) -> float:
        """
        Estimate current load as fraction of capacity.
        Weights decode tokens more heavily since decode is the bottleneck.
        """
        # Prefill is compute-bound, decode is memory-bound
        # Decode throughput is typically 10-50x lower than prefill
        prefill_cost = self.current_prompt_tokens * 0.1
        decode_cost = self.current_decode_tokens * 1.0

        total_cost = prefill_cost + decode_cost
        return total_cost / self.max_tokens_per_second


class TokenAwareBalancer:
    """Load balancer that considers token counts."""

    def __init__(self, endpoints: List[TokenAwareEndpoint]):
        self.endpoints = endpoints

    def select_endpoint(
        self, prompt_tokens: int, estimated_output_tokens: int
    ) -> Optional[TokenAwareEndpoint]:
        """Select endpoint considering the incoming request's token cost."""
        healthy = [e for e in self.endpoints if e.estimated_load() < 0.95]

        if not healthy:
            return None

        # Find endpoint that will have lowest load after adding this request
        def projected_load(endpoint: TokenAwareEndpoint) -> float:
            current = endpoint.estimated_load()
            added_cost = prompt_tokens * 0.1 + estimated_output_tokens
            added_load = added_cost / endpoint.max_tokens_per_second
            return current + added_load

        return min(healthy, key=projected_load)
In[18]:
Code
# Demonstrate token-aware balancing
token_endpoints = [
    TokenAwareEndpoint(
        "gpu-0",
        max_tokens_per_second=100,
        current_prompt_tokens=500,
        current_decode_tokens=20,
    ),
    TokenAwareEndpoint(
        "gpu-1",
        max_tokens_per_second=100,
        current_prompt_tokens=100,
        current_decode_tokens=60,
    ),
]

token_balancer = TokenAwareBalancer(token_endpoints)

# Route requests with different characteristics
requests = [
    (100, 50),  # Short prompt, short output
    (2000, 100),  # Long prompt, medium output
    (100, 500),  # Short prompt, long output
]

routing_decisions = []
for prompt_len, output_len in requests:
    selected = token_balancer.select_endpoint(prompt_len, output_len)
    routing_decisions.append((prompt_len, output_len, selected))
Out[19]:
Console
Endpoint Token Loads:
  gpu-0: 70.0% utilized
    Prefill queue: 500 tokens
    Active decode: 20 tokens
  gpu-1: 70.0% utilized
    Prefill queue: 100 tokens
    Active decode: 60 tokens

Routing Decisions:
  100 prompt + 50 output -> gpu-0
  2000 prompt + 100 output -> gpu-0
  100 prompt + 500 output -> gpu-0

Token-aware balancing makes smarter routing decisions by considering the actual work each endpoint is performing. In this example, gpu-0 has more prompt tokens queued for prefill, while gpu-1 has more active decode tokens. Because decode is the bottleneck phase (each token requires reading the entire KV cache), the cost model weights decode tokens ten times more heavily than prefill tokens, leaving the two endpoints with identical estimated loads despite very different request mixes, and the tie breaks toward gpu-0. In a live system the balancer would also update token counts as requests are assigned, steering subsequent long-output requests away from whichever endpoint becomes decode-heavy. This weighting prevents one endpoint from becoming decode-saturated while another sits idle during its compute-bound prefill phase.

Session Affinity Considerations

Some applications benefit from routing subsequent requests to the same endpoint to enable optimization opportunities. If a conversation spans multiple API calls with the same system prompt or shared context, routing to the same instance enables KV cache reuse through prefix caching, as discussed earlier in this part. Rather than recomputing attention states for the repeated prefix, the server can reuse cached values, significantly accelerating prefill.

However, session affinity conflicts with optimal load balancing because it constrains routing decisions. If a particular session's preferred endpoint becomes overloaded, the system must choose between maintaining affinity (and accepting higher latency) or breaking affinity (and losing cache benefits). A simple compromise routes requests with known session IDs to consistent endpoints using hash-based assignment, while distributing new sessions based on current load:

In[20]:
Code
import hashlib
from typing import List, Optional


class AffinityBalancer:
    """Balancer with optional session affinity."""

    def __init__(self, endpoints: List[ModelEndpoint]):
        self.endpoints = endpoints

    def select_endpoint(
        self, session_id: Optional[str] = None
    ) -> Optional[ModelEndpoint]:
        healthy = [e for e in self.endpoints if e.is_healthy]
        if not healthy:
            return None

        if session_id:
            # Hash session to consistently select same endpoint
            hash_val = int(hashlib.md5(session_id.encode()).hexdigest(), 16)
            index = hash_val % len(healthy)
            return healthy[index]
        else:
            # No session: use least connections
            return min(healthy, key=lambda e: e.current_load)
In[21]:
Code
affinity_balancer = AffinityBalancer(endpoints)

# Same session always routes to same endpoint
session = "user-123-conv-456"
session_routes = []
for i in range(3):
    selected = affinity_balancer.select_endpoint(session_id=session)
    session_routes.append((i + 1, selected))

# Different sessions may route differently
users = ["user-100", "user-200", "user-300"]
user_routes = []
for user in users:
    selected = affinity_balancer.select_endpoint(session_id=user)
    user_routes.append((user, selected))
Out[22]:
Console
Session 'user-123-conv-456' routes to:
  Request 1: gpu-0
  Request 2: gpu-0
  Request 3: gpu-0

Different sessions:
  user-100: gpu-0
  user-200: gpu-1
  user-300: gpu-0

Hash-based assignment ensures the same session always reaches the same endpoint, assuming the endpoint pool remains stable. This deterministic mapping enables KV cache hits for repeated prompts within a conversation. The simple modulo scheme above remaps most sessions whenever an endpoint is added or removed; production systems typically use consistent hashing, which changes the mapping for only a small fraction of sessions when the pool changes. More sophisticated approaches combine affinity preferences with load awareness, preferring the affinity endpoint when it has capacity but falling back to other endpoints when necessary.
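
A minimal consistent-hash ring, sketched below under assumed choices for the hash function and virtual-node count, shows how the remapping problem can be contained: removing one endpoint only reassigns the sessions that hashed to its virtual nodes.

import bisect
import hashlib
from typing import List


def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


class ConsistentHashRing:
    """Maps session IDs to endpoints with minimal remapping on pool changes."""

    def __init__(self, endpoint_ids: List[str], vnodes: int = 100):
        # Many virtual nodes per endpoint smooth out the distribution
        self.ring = sorted(
            (_hash(f"{eid}#{i}"), eid)
            for eid in endpoint_ids
            for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    def lookup(self, session_id: str) -> str:
        # Walk clockwise to the first virtual node at or after the session hash
        idx = bisect.bisect(self.keys, _hash(session_id)) % len(self.ring)
        return self.ring[idx][1]


ring = ConsistentHashRing(["gpu-0", "gpu-1", "gpu-2"])
print(ring.lookup("user-123-conv-456"))  # stable until the pool itself changes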

Auto-Scaling

Static deployments with a fixed number of instances struggle with variable demand. Traffic to inference services often varies dramatically by time of day, day of week, and in response to external events. Auto-scaling adjusts the number of model instances based on load, reducing costs during low-traffic periods while maintaining performance during peaks. This elastic capacity is one of the key advantages of cloud deployments over fixed on-premises infrastructure.

Scaling Metrics

Choosing the right metrics to trigger scaling is crucial because different metrics capture different aspects of system health and have different response characteristics. Common options for LLM serving include:

Queue depth: The number of waiting requests provides a direct measure of demand exceeding capacity. High queue depth indicates insufficient capacity and directly predicts increased latency for arriving requests. This metric responds quickly to load changes because queues grow immediately when arrivals exceed processing rate.

Latency percentiles: The P50, P95, or P99 request latency directly measures user experience. Scaling based on latency targets addresses user experience directly, which is ultimately what matters. However, latency increases only after the system is already overloaded, making this a lagging indicator. By the time latency rises, users have already experienced degraded service.

GPU utilization: The fraction of GPU compute being used indicates how efficiently the current capacity is being used. Low utilization suggests over-provisioning and wasted cost. Consistently high utilization suggests under-provisioning and risk of latency degradation as load increases further.

Tokens per second: The aggregate throughput across instances measures actual work being done. Unlike request count, this metric accounts for variable request sizes. A spike in tokens per second indicates increased load even if request count is stable (perhaps because users are submitting longer prompts).

In[23]:
Code
from dataclasses import dataclass


@dataclass
class ScalingMetrics:
    """Metrics used for auto-scaling decisions."""

    queue_depth: int
    active_requests: int
    latency_p50_ms: float
    latency_p95_ms: float
    gpu_utilization: float
    tokens_per_second: float


@dataclass
class ScalingConfig:
    """Thresholds for scaling decisions."""

    # Scale up triggers
    max_queue_depth: int = 100
    max_latency_p95_ms: float = 5000
    max_gpu_utilization: float = 0.85

    # Scale down triggers
    min_queue_depth: int = 10
    min_gpu_utilization: float = 0.30

    # Cooldown to prevent thrashing
    scale_up_cooldown_seconds: float = 60
    scale_down_cooldown_seconds: float = 300

    # Limits
    min_replicas: int = 1
    max_replicas: int = 10


class AutoScaler:
    """Determines scaling actions based on metrics."""

    def __init__(self, config: ScalingConfig):
        self.config = config
        self.last_scale_up = 0.0
        self.last_scale_down = 0.0
        self.current_replicas = 1

    def evaluate(self, metrics: ScalingMetrics, current_time: float) -> int:
        """
        Returns desired replica count.
        Applies asymmetric cooldowns to prevent rapid oscillation.
        """
        # Check scale up conditions
        should_scale_up = (
            metrics.queue_depth > self.config.max_queue_depth
            or metrics.latency_p95_ms > self.config.max_latency_p95_ms
            or metrics.gpu_utilization > self.config.max_gpu_utilization
        )

        # Check scale down conditions
        should_scale_down = (
            metrics.queue_depth < self.config.min_queue_depth
            and metrics.gpu_utilization < self.config.min_gpu_utilization
        )

        # Apply cooldowns
        time_since_up = current_time - self.last_scale_up
        time_since_down = current_time - self.last_scale_down

        if (
            should_scale_up
            and time_since_up > self.config.scale_up_cooldown_seconds
        ):
            new_replicas = min(
                self.current_replicas + 1, self.config.max_replicas
            )
            if new_replicas > self.current_replicas:
                self.last_scale_up = current_time
                self.current_replicas = new_replicas

        elif (
            should_scale_down
            and time_since_down > self.config.scale_down_cooldown_seconds
        ):
            new_replicas = max(
                self.current_replicas - 1, self.config.min_replicas
            )
            if new_replicas < self.current_replicas:
                self.last_scale_down = current_time
                self.current_replicas = new_replicas

        return self.current_replicas
In[24]:
Code
# Simulate auto-scaling over time
scaling_config = ScalingConfig(
    max_queue_depth=50, max_latency_p95_ms=3000, min_replicas=1, max_replicas=5
)
scaler = AutoScaler(scaling_config)

# Simulate increasing load
scenarios = [
    (0, ScalingMetrics(10, 5, 500, 1000, 0.3, 100)),  # Low load
    (120, ScalingMetrics(60, 30, 1000, 2500, 0.7, 200)),  # Growing
    (240, ScalingMetrics(100, 50, 2000, 4000, 0.9, 250)),  # Overloaded
    (360, ScalingMetrics(80, 40, 1500, 3500, 0.85, 400)),  # Still high
    (600, ScalingMetrics(20, 10, 500, 1000, 0.4, 300)),  # Recovered
    (1000, ScalingMetrics(5, 2, 300, 500, 0.2, 100)),  # Low load
]

scaling_results = []
for time_sec, metrics in scenarios:
    replicas = scaler.evaluate(metrics, time_sec)
    scaling_results.append((time_sec, metrics, replicas))
Out[25]:
Console
Auto-scaling Simulation:
------------------------------------------------------------
t=   0s | queue= 10 | p95=1000ms | gpu=30% | replicas=1
t= 120s | queue= 60 | p95=2500ms | gpu=70% | replicas=2
t= 240s | queue=100 | p95=4000ms | gpu=90% | replicas=3
t= 360s | queue= 80 | p95=3500ms | gpu=85% | replicas=4
t= 600s | queue= 20 | p95=1000ms | gpu=40% | replicas=4
t=1000s | queue=  5 | p95=500ms | gpu=20% | replicas=3
Out[26]:
Visualization
Queue depth over 1000 seconds. A sharp spike at t=120s indicates load exceeding capacity, triggering the auto-scaler.

The simulation illustrates key auto-scaling behaviors. The system scales up quickly when queue depth spikes, adding capacity before latency degrades too severely. It then gradually scales down as load decreases, but with a longer cooldown to avoid oscillation. The asymmetric cooldowns (60 seconds up, 300 seconds down) prevent thrashing while ensuring fast response to traffic increases. Scaling up quickly prevents user-facing degradation, while scaling down slowly ensures that brief dips in traffic don't cause premature capacity reduction that would require expensive cold starts when traffic rebounds.

Key Parameters

The key parameters for the auto-scaling logic are:

  • max_queue_depth: Threshold for the number of queued requests that triggers a scale-up event.
  • max_latency_p95_ms: P95 latency threshold that triggers scaling when exceeded.
  • scale_up_cooldown_seconds: Minimum time interval between scale-up actions to prevent flapping.
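
Step policies like the AutoScaler above add or remove one replica per decision. An alternative is target tracking in the style of the Kubernetes Horizontal Pod Autoscaler, which sizes the fleet proportionally to how far a metric sits from its target. The sketch below applies that formula to per-replica queue depth; the target value is an assumption for illustration.

import math


def target_tracking_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
) -> int:
    """desired = ceil(current_replicas * current_metric / target_metric)"""
    desired = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, desired))


# 3 replicas each seeing ~40 queued requests against a target of 20
# asks for 6 replicas in one step instead of stepping up one at a time
print(target_tracking_replicas(current_replicas=3, current_metric=40, target_metric=20))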

Cold Start Challenges

LLM instances take significant time to become ready for serving. Loading a 7B-parameter model requires transferring approximately 14GB (in FP16) from storage to GPU memory, initializing CUDA kernels, compiling any JIT-compiled operations, and allocating KV cache space. This cold start latency, often 30-120 seconds for typical deployments, complicates auto-scaling because new capacity isn't immediately available when scaling decisions are made.
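
A quick back-of-the-envelope estimate shows why weight loading alone accounts for much of this delay. The bandwidth figures below are rough assumptions that vary widely across storage backends.

# Rough cold-start estimate for a 7B model in FP16 (assumed bandwidths)
params = 7e9
weights_gb = params * 2 / 1e9  # 2 bytes per parameter -> ~14 GB

bandwidths_gb_per_s = {
    "local NVMe": 3.0,
    "network volume": 0.5,
    "object storage": 0.2,
}

for source, bw in bandwidths_gb_per_s.items():
    print(f"{source}: ~{weights_gb / bw:.0f}s just to read the weights")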

Several strategies help mitigate cold start impact:

Warm pool: Maintain a pool of pre-initialized but idle instances that have already loaded model weights and initialized their CUDA context. New traffic immediately activates these instances rather than waiting for cold start. The trade-off is paying for idle GPU time, as warm pool instances consume resources even when not serving requests.

Predictive scaling: Use historical patterns to anticipate demand and scale preemptively. If traffic typically peaks at 9 AM on weekdays, begin scaling at 8:50 AM so new instances are ready before the rush. Machine learning models can predict traffic patterns more accurately than simple rules, but require historical data and ongoing maintenance.

Gradual traffic shift: When bringing new instances online, ramp up their traffic gradually rather than immediately directing full load. This prevents overwhelming a fresh instance while its various caches (file system cache, KV prefix cache, CUDA kernel cache) are cold. Gradual ramp-up also allows detecting problems with new instances before they impact a large fraction of traffic.
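
Gradual traffic shift is often implemented as a ramp weight that grows with the instance's age, so the router sends a new instance only a fraction of its fair share at first. The ramp duration below is an assumed parameter.

def ramp_weight(seconds_since_activation: float, ramp_seconds: float = 120.0) -> float:
    """Fraction of its fair traffic share a newly activated instance receives."""
    return min(1.0, max(0.0, seconds_since_activation / ramp_seconds))


# A fresh instance carries ~10% of its share after 12s and its full share at 2 minutes
for t in (0, 12, 60, 120, 300):
    print(f"t={t:>3}s -> weight {ramp_weight(t):.2f}")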

In[27]:
Code
import time
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class WarmPoolInstance:
    """Instance in warm pool, ready for activation."""

    instance_id: str
    created_time: float
    state: str = "warm"  # warm, activating, active, draining


class WarmPoolManager:
    """Manages a pool of pre-warmed instances."""

    def __init__(
        self,
        min_warm: int = 2,
        max_warm: int = 5,
        warm_instance_ttl: float = 3600,  # 1 hour
    ):
        self.min_warm = min_warm
        self.max_warm = max_warm
        self.warm_instance_ttl = warm_instance_ttl
        self.warm_pool: List[WarmPoolInstance] = []
        self.active_instances: List[WarmPoolInstance] = []

    def activate_instance(self) -> Optional[WarmPoolInstance]:
        """
        Activate a warm instance for serving.
        Returns None if pool is empty.
        """
        if not self.warm_pool:
            return None

        instance = self.warm_pool.pop(0)
        instance.state = "activating"
        self.active_instances.append(instance)
        return instance

    def deactivate_instance(self, instance_id: str):
        """Return instance to warm pool instead of terminating."""
        for i, inst in enumerate(self.active_instances):
            if inst.instance_id == instance_id:
                inst.state = "warm"
                inst.created_time = time.time()  # Reset TTL
                self.active_instances.pop(i)
                if len(self.warm_pool) < self.max_warm:
                    self.warm_pool.append(inst)
                return

    def pool_status(self) -> dict:
        return {
            "warm_count": len(self.warm_pool),
            "active_count": len(self.active_instances),
            "warm_ids": [i.instance_id for i in self.warm_pool],
        }
In[28]:
Code
# Demonstrate warm pool behavior
pool = WarmPoolManager(min_warm=2, max_warm=4)

# Pre-populate warm pool
for i in range(3):
    pool.warm_pool.append(
        WarmPoolInstance(instance_id=f"inst-{i}", created_time=time.time())
    )

initial_status = pool.pool_status()

# Traffic spike: activate instances quickly
activated_instances = []
for i in range(2):
    inst = pool.activate_instance()
    if inst:
        activated_instances.append(inst.instance_id)

spike_status = pool.pool_status()

# Traffic subsides: return to warm pool
pool.deactivate_instance("inst-0")
subside_status = pool.pool_status()
Out[29]:
Console
Initial State:
  {'warm_count': 3, 'active_count': 0, 'warm_ids': ['inst-0', 'inst-1', 'inst-2']}

Traffic spike - activating instances:
  Activated: inst-0
  Activated: inst-1
  Pool: {'warm_count': 1, 'active_count': 2, 'warm_ids': ['inst-2']}

Traffic subsides - returning instances:
  Deactivated: inst-0
  Pool: {'warm_count': 2, 'active_count': 1, 'warm_ids': ['inst-2', 'inst-0']}

The warm pool enables near-instant scaling for traffic spikes. When traffic increases, instances activate in seconds (just the time to update routing tables) rather than the minute or more required for cold start. When traffic drops, instances return to the warm pool rather than terminating completely, preserving their warmed state for the next spike. This approach trades some ongoing cost (idle warm instances) for dramatically improved responsiveness to load changes.

Latency Optimization

Meeting latency targets requires understanding the distinct phases of LLM inference and optimizing each appropriately. Because the prefill and decode phases have fundamentally different computational characteristics, they require different optimization strategies. A one-size-fits-all approach will inevitably leave performance on the table.

Prefill vs Decode Optimization

As we've seen throughout this part, LLM inference has two distinct phases with different performance characteristics that demand different optimization approaches:

Prefill phase: Processing the input prompt involves computing attention across all prompt tokens simultaneously. This phase is compute-bound because it performs dense matrix multiplications with operations proportional to the square of the prompt length (due to attention) times the model dimension. The GPU's compute units are the bottleneck, and memory bandwidth is typically not limiting. Optimization strategies include:

  • Chunked prefill to avoid blocking the decode phase of other requests, allowing interleaved progress on multiple requests
  • Tensor parallelism to distribute compute across GPUs, reducing wall-clock time for long prompts
  • Quantized attention for faster matrix operations, trading some precision for speed

Decode phase: Generating output tokens one at a time means each forward pass produces only a single token while reading the entire KV cache from memory. This phase is memory-bandwidth-bound because the ratio of computation to memory access is low: each token requires reading billions of bytes of model weights and cached attention states while performing relatively few arithmetic operations, as the rough estimate after the list below makes concrete. Optimization strategies include:

  • KV cache compression to reduce memory reads, shrinking the data that must be transferred each step
  • Speculative decoding to generate multiple tokens per forward pass, amortizing memory access cost
  • Continuous batching to share the memory bandwidth cost across multiple requests, improving effective utilization
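
The rough roofline-style estimate below, using assumed hardware figures, shows why decode lands in the tens of milliseconds per token: every step must stream the model weights and KV cache through GPU memory regardless of how little arithmetic it performs.

# Why decode is memory-bandwidth-bound (assumed, illustrative numbers)
weights_gb = 14.0              # 7B parameters in FP16
kv_cache_gb = 2.0              # assumed KV cache for the active batch
memory_bandwidth_gb_s = 2000   # roughly high-end HBM bandwidth

bytes_read_per_step_gb = weights_gb + kv_cache_gb
min_ms_per_token = bytes_read_per_step_gb / memory_bandwidth_gb_s * 1000

print(f"Lower bound per decode step: ~{min_ms_per_token:.0f} ms")
# Batching N requests reuses the same weight reads, amortizing this cost

Real per-token times are higher once attention reads and kernel overheads are included, which is why the simulation that follows assumes 20ms per decoded token.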
In[30]:
Code
# Simulate timing for different request profiles
# Assumptions: 1000-token prompt, varying output lengths
# Prefill: ~0.1ms per token (compute-bound)
# Decode: ~20ms per token (memory-bound)

prompt_tokens = 1000
prefill_time_per_token = 0.1  # ms
decode_time_per_token = 20  # ms

output_lengths = [10, 50, 100, 200, 500]
prefill_times = [
    prompt_tokens * prefill_time_per_token / 1000 for _ in output_lengths
]  # seconds
decode_times = [
    out * decode_time_per_token / 1000 for out in output_lengths
]  # seconds
Out[31]:
Visualization
Time distribution between prefill and decode phases for requests of varying output length (10 to 500 tokens). Short outputs are dominated by the compute-bound prefill phase, while long outputs are dominated by the memory-bound decode phase, illustrating the shift in optimization targets.

For a typical prompt of 1000 tokens, prefill completes in roughly 100ms. A 10-token response adds 200ms of decode time, making prefill a significant fraction (33%) of total latency. A 500-token response adds 10 seconds of decode time, making prefill negligible (less than 1%). This dramatic shift means optimization priorities should align with your actual request distribution. If most requests generate short outputs, optimizing prefill matters more. If most requests generate long outputs, decode optimization dominates.

Time-to-First-Token Optimization

Many applications stream responses to users, making time-to-first-token (TTFT) the primary latency metric that determines perceived responsiveness. When a user sends a message in a chat interface, TTFT measures the time from submission until the first word of the response appears. Users perceive systems with low TTFT as fast and responsive, even if total generation time is similar.

TTFT measures the time from request arrival until the first generated token is available to the client. This encompasses several distinct stages, each offering optimization opportunities:

  • Queue Time: How long the request waits before processing begins. Reducing queue time requires either more capacity (scaling) or smarter scheduling (prioritization).
  • Tokenization: Converting the prompt text to token IDs. While typically fast, this step runs on CPU and can bottleneck under high load.
  • Prefill Duration: Processing all prompt tokens to populate the KV cache. For long prompts, this dominates TTFT.
  • First Decode Step: Generating the first output token. This is typically fast since it's a single forward pass.
In[32]:
Code
from dataclasses import dataclass
from typing import List
import numpy as np


@dataclass
class TTFTBreakdown:
    """Breakdown of time-to-first-token components."""

    queue_time_ms: float
    tokenization_ms: float
    prefill_ms: float
    first_decode_ms: float

    @property
    def total_ms(self) -> float:
        return (
            self.queue_time_ms
            + self.tokenization_ms
            + self.prefill_ms
            + self.first_decode_ms
        )

    def bottleneck(self) -> str:
        components = {
            "queue": self.queue_time_ms,
            "tokenization": self.tokenization_ms,
            "prefill": self.prefill_ms,
            "decode": self.first_decode_ms,
        }
        return max(components, key=components.get)


def analyze_ttft_samples(samples: List[TTFTBreakdown]) -> dict:
    """Compute TTFT statistics across samples."""
    totals = [s.total_ms for s in samples]
    bottlenecks = [s.bottleneck() for s in samples]

    return {
        "p50_ms": np.percentile(totals, 50),
        "p95_ms": np.percentile(totals, 95),
        "p99_ms": np.percentile(totals, 99),
        "mean_ms": np.mean(totals),
        "bottleneck_distribution": {
            b: bottlenecks.count(b) / len(bottlenecks) for b in set(bottlenecks)
        },
    }
In[33]:
Code
# Simulate TTFT measurements under different load conditions
np.random.seed(42)

# Low load scenario
low_load_samples = [
    TTFTBreakdown(
        queue_time_ms=np.random.exponential(5),
        tokenization_ms=np.random.normal(10, 2),
        prefill_ms=np.random.normal(100, 20),
        first_decode_ms=np.random.normal(20, 5),
    )
    for _ in range(100)
]

# High load scenario
high_load_samples = [
    TTFTBreakdown(
        queue_time_ms=np.random.exponential(200),  # Much higher queue times
        tokenization_ms=np.random.normal(10, 2),
        prefill_ms=np.random.normal(100, 20),
        first_decode_ms=np.random.normal(20, 5),
    )
    for _ in range(100)
]

low_load_stats = analyze_ttft_samples(low_load_samples)
high_load_stats = analyze_ttft_samples(high_load_samples)
Out[34]:
Console
TTFT Analysis:
--------------------------------------------------

Low Load:
  P50: 138ms
  P95: 165ms
  P99: 183ms
  Bottleneck distribution:
    prefill: 100%

High Load:
  P50: 248ms
  P95: 719ms
  P99: 1339ms
  Bottleneck distribution:
    prefill: 42%
    queue: 58%
Out[35]:
Visualization
Composition of Time-to-First-Token (TTFT) components under low and high load conditions. While tokenization, prefill, and first decode steps remain constant, queue time grows disproportionately under high load, becoming the primary bottleneck and degrading responsiveness.

The analysis reveals how bottlenecks shift with load. Under low load, prefill dominates TTFT because queue times are minimal and requests begin processing immediately upon arrival. Optimizing prefill speed through better hardware or chunking would improve TTFT in this regime. Under high load, queue time becomes the dominant bottleneck: requests spend most of their time waiting, not processing. In this regime, improving prefill speed won't help much because requests are already waiting; adding capacity or improving scheduling would be more effective. This analysis guides optimization investments toward the actual bottleneck rather than optimizing components that aren't limiting.

Latency-Throughput Trade-offs

Optimizing for latency often conflicts with optimizing for throughput because the techniques that improve one typically harm the other. Larger batches improve GPU utilization and throughput: processing 32 requests in one batch uses GPU compute more efficiently than processing them one at a time. However, larger batches increase latency for individual requests because each request waits for the batch to form and then shares compute time with other requests in the batch.

The optimal operating point depends on your SLA requirements and cost constraints:

In[36]:
Code
import numpy as np

# Simulated relationship between batch size and performance
batch_sizes = np.array([1, 2, 4, 8, 16, 32, 64])

# Throughput increases with batch size but diminishes
# GPU utilization improves until we hit memory/compute limits
throughput = 50 * np.log2(batch_sizes + 1)  # tokens/sec
throughput = np.minimum(throughput, 300)  # Cap at 300 tok/s

# Latency increases linearly with batch size (simplified model)
# Real relationship is more complex due to batching and scheduling
base_latency = 100  # ms for batch_size=1
latency = base_latency + batch_sizes * 15  # ms
Out[37]:
Visualization
System throughput across varying batch sizes. Throughput increases logarithmically as larger batches improve GPU utilization, eventually plateauing as the hardware reaches its compute or memory bandwidth limits.
P95 request latency as a function of batch size. Latency grows linearly with batch size and exceeds the 500ms service level agreement threshold at a batch size of 32, highlighting the trade-off between throughput and responsiveness.

If your SLA requires P95 latency under 500ms, this scenario constrains you to batch sizes of roughly 26 or fewer (the calculation below makes this exact). Larger batches would improve throughput (and thus cost efficiency) but violate the latency requirement. The continuous batching techniques from the previous chapter help navigate this trade-off by allowing different requests to enter and exit batches dynamically, improving throughput while limiting how long any individual request must wait.
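
Under the simplified linear model used in the simulation, the largest SLA-compliant batch size falls out of a one-line calculation:

# Largest batch size meeting a 500ms P95 SLA under the linear model above
base_latency_ms = 100
per_request_ms = 15
sla_ms = 500

max_batch = int((sla_ms - base_latency_ms) / per_request_ms)
print(f"Max SLA-compliant batch size: {max_batch}")  # (500 - 100) / 15 -> 26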

Monitoring and Observability

Production inference systems require comprehensive monitoring to maintain service quality and debug issues. Without visibility into system behavior, operators cannot distinguish between capacity problems, model issues, and infrastructure failures. Good observability enables proactive intervention before users experience degraded service.

Key Metrics

Effective monitoring tracks metrics at multiple levels, from individual request performance to aggregate system health to business outcomes:

Request-level metrics:

  • Time to first token (TTFT): measures perceived responsiveness for streaming applications
  • Total latency: measures end-to-end completion time for batch processing use cases
  • Tokens per request (prompt and generated): indicates request complexity and resource consumption
  • Error rates by error type: distinguishes between timeouts, out-of-memory, and other failure modes

System-level metrics:

  • GPU utilization and memory usage: indicates whether hardware is being used efficiently
  • Queue depth over time: reveals demand patterns and capacity mismatches
  • Active batch size: shows how effectively batching is working
  • KV cache utilization: indicates memory pressure and potential for cache eviction

Business-level metrics:

  • Requests per second: measures overall demand and system scale
  • Tokens per second: measures actual work being done, accounting for request size variation
  • Cost per 1000 tokens: combines infrastructure cost with throughput for economic analysis
  • SLA compliance rate: the ultimate measure of whether the system meets its commitments
In[38]:
Code
from dataclasses import dataclass, field
from typing import Dict, Optional
from collections import deque
import time
import numpy as np


@dataclass
class MetricsCollector:
    """Collects and aggregates inference metrics."""

    window_seconds: float = 60.0
    latency_samples: deque = field(default_factory=lambda: deque(maxlen=10000))
    ttft_samples: deque = field(default_factory=lambda: deque(maxlen=10000))
    token_counts: deque = field(default_factory=lambda: deque(maxlen=10000))
    error_counts: Dict[str, int] = field(default_factory=dict)

    def record_request(
        self,
        latency_ms: float,
        ttft_ms: float,
        prompt_tokens: int,
        generated_tokens: int,
        error: Optional[str] = None,
    ):
        timestamp = time.time()
        self.latency_samples.append((timestamp, latency_ms))
        self.ttft_samples.append((timestamp, ttft_ms))
        self.token_counts.append((timestamp, prompt_tokens + generated_tokens))

        if error:
            self.error_counts[error] = self.error_counts.get(error, 0) + 1

    def get_current_metrics(self) -> dict:
        current_time = time.time()
        cutoff = current_time - self.window_seconds

        # Filter to recent samples
        recent_latencies = [v for t, v in self.latency_samples if t > cutoff]
        recent_ttft = [v for t, v in self.ttft_samples if t > cutoff]
        recent_tokens = [v for t, v in self.token_counts if t > cutoff]

        if not recent_latencies:
            return {"status": "no_data"}

        return {
            "latency_p50_ms": np.percentile(recent_latencies, 50),
            "latency_p95_ms": np.percentile(recent_latencies, 95),
            "latency_p99_ms": np.percentile(recent_latencies, 99),
            "ttft_p50_ms": np.percentile(recent_ttft, 50),
            "ttft_p95_ms": np.percentile(recent_ttft, 95),
            "requests_per_minute": len(recent_latencies),
            "tokens_per_minute": sum(recent_tokens),
            "error_counts": dict(self.error_counts),
        }
In[39]:
Code
# Simulate metrics collection
collector = MetricsCollector(window_seconds=60)
np.random.seed(42)

# Simulate 200 requests over the last minute
for _ in range(200):
    latency = np.random.lognormal(6, 0.5)  # Log-normal distribution
    ttft = np.random.lognormal(5, 0.3)
    prompt_tokens = np.random.randint(100, 2000)
    generated_tokens = np.random.randint(50, 500)

    # 2% error rate
    error = "timeout" if np.random.random() < 0.02 else None

    collector.record_request(
        latency, ttft, prompt_tokens, generated_tokens, error
    )

metrics = collector.get_current_metrics()
Out[40]:
Console
Current Metrics (last 60 seconds):
----------------------------------------
  Requests: 200
  Tokens processed: 254,959
  Latency P50: 401ms
  Latency P95: 947ms
  Latency P99: 1333ms
  TTFT P50: 143ms
  TTFT P95: 241ms
  Errors: {'timeout': 2}
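
The business-level metrics listed earlier can be derived from the same window. As a back-of-the-envelope sketch, assuming a single GPU serving the window at a hypothetical price of $2.50 per hour (substitute your actual instance cost), cost per 1,000 tokens follows directly from the tokens processed:

# Back-of-the-envelope cost per 1,000 tokens from the 60-second window above.
# gpu_hourly_usd is a hypothetical price, not a quoted rate.
gpu_hourly_usd = 2.50
window_hours = collector.window_seconds / 3600

tokens_in_window = metrics["tokens_per_minute"]  # tokens over the 60s window
cost_in_window = gpu_hourly_usd * window_hours   # cost of one GPU for that window

cost_per_1k_tokens = cost_in_window / (tokens_in_window / 1000)
print(f"Tokens in window:      {tokens_in_window:,}")
print(f"Cost for the window:   ${cost_in_window:.4f}")
print(f"Cost per 1,000 tokens: ${cost_per_1k_tokens:.5f}")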

SLOs and Alerting

Service Level Objectives (SLOs) define target performance thresholds that the system should maintain. SLOs translate business requirements (users expect responsive service) into measurable technical targets (P95 latency under 5 seconds). Common SLOs for inference services include:

  • Availability: 99.9% of requests succeed (allows approximately 8.7 hours of downtime per year). This SLO accounts for both planned maintenance and unexpected failures.
  • Latency: P95 latency below 5 seconds ensures that even unlucky requests complete in reasonable time.
  • TTFT: P95 time-to-first-token below 500ms ensures users perceive the system as responsive.
  • Error rate: Less than 0.1% of requests return errors, ensuring high reliability for applications built on the inference service.

Alerting should trigger when metrics approach SLO thresholds, not just when they're breached. Alerting only on violations means users have already experienced degraded service by the time operators learn of the problem:

In[41]:
Code
from typing import List


@dataclass
class SLOConfig:
    """Service Level Objective configuration."""

    name: str
    target_percentile: float  # e.g., 0.95 for P95
    threshold_ms: float  # SLO threshold
    warning_fraction: float = 0.8  # Alert when at 80% of threshold


class SLOMonitor:
    """Monitors metrics against SLOs and generates alerts."""

    def __init__(self, slos: List[SLOConfig]):
        self.slos = slos
        self.alerts: List[dict] = []

    def check_slos(self, metrics: dict) -> List[dict]:
        """Check current metrics against SLOs."""
        alerts = []

        for slo in self.slos:
            # Map SLO names to metric keys
            metric_key = f"{slo.name}_p{int(slo.target_percentile * 100)}_ms"
            if metric_key not in metrics:
                continue

            current_value = metrics[metric_key]
            warning_threshold = slo.threshold_ms * slo.warning_fraction

            if current_value > slo.threshold_ms:
                alerts.append(
                    {
                        "severity": "critical",
                        "slo": slo.name,
                        "message": f"{slo.name} P{int(slo.target_percentile * 100)} "
                        f"({current_value:.0f}ms) exceeds SLO ({slo.threshold_ms:.0f}ms)",
                    }
                )
            elif current_value > warning_threshold:
                alerts.append(
                    {
                        "severity": "warning",
                        "slo": slo.name,
                        "message": f"{slo.name} P{int(slo.target_percentile * 100)} "
                        f"({current_value:.0f}ms) approaching SLO ({slo.threshold_ms:.0f}ms)",
                    }
                )

        return alerts
In[42]:
Code
# Define SLOs and check against current metrics
slos = [
    SLOConfig("latency", target_percentile=0.95, threshold_ms=3000),
    SLOConfig("latency", target_percentile=0.99, threshold_ms=5000),
    SLOConfig("ttft", target_percentile=0.95, threshold_ms=500),
]

monitor = SLOMonitor(slos)
alerts = monitor.check_slos(metrics)
Out[43]:
Console
SLO Status:
--------------------------------------------------
  ✓ latency P95: 947ms (target: 3000ms)
  ✓ latency P99: 1333ms (target: 5000ms)
  ✓ ttft P95: 241ms (target: 500ms)

  No active alerts

The monitor verifies that all performance metrics currently satisfy their service level objectives. By implementing both warning thresholds (at 80% of the SLO) and critical thresholds (at 100%), the system provides early indications of degrading performance before actual SLO violations occur. Warning alerts give operators time to investigate and potentially scale up or address issues before users experience violations. Critical alerts indicate that the SLO has already been breached and immediate action is required.

Putting It Together: A Complete Serving Pipeline

Let's implement a simplified inference serving pipeline that combines the concepts from this chapter. This example demonstrates how request handling, streaming responses, concurrency limits, and metrics collection work together in a complete system:

In[44]:
Code
import asyncio
import time
import random
from dataclasses import dataclass, field
from typing import List, Optional, AsyncIterator
from collections import deque


@dataclass
class InferenceRequest:
    """A request to the inference server."""

    request_id: str
    prompt: str
    max_tokens: int
    priority: int = 0
    arrival_time: float = field(default_factory=time.time)


@dataclass
class InferenceResponse:
    """Response from inference, supports streaming."""

    request_id: str
    tokens: List[str]
    finished: bool
    ttft_ms: Optional[float] = None
    total_ms: Optional[float] = None


class MockModel:
    """Simulates model inference with realistic timing."""

    def __init__(self, tokens_per_second: float = 50):
        self.tokens_per_second = tokens_per_second

    async def generate(
        self, prompt: str, max_tokens: int
    ) -> AsyncIterator[str]:
        """Simulate token-by-token generation."""
        # Simulate prefill time based on prompt length
        prompt_tokens = len(prompt.split())
        prefill_time = prompt_tokens * 0.0001  # 0.1ms per token
        await asyncio.sleep(prefill_time)

        # Simulate decode, yielding tokens
        words = [
            "The",
            "quick",
            "brown",
            "fox",
            "jumps",
            "over",
            "the",
            "lazy",
            "dog",
            ".",
            "This",
            "is",
            "a",
            "test",
            ".",
        ]

        for i in range(min(max_tokens, 50)):
            await asyncio.sleep(1.0 / self.tokens_per_second)
            yield random.choice(words)


class InferenceServer:
    """Simple inference server with request handling."""

    def __init__(self, model: MockModel, max_concurrent: int = 10):
        self.model = model
        self.max_concurrent = max_concurrent
        self.active_requests = 0
        self.request_queue: deque = deque()
        self.metrics = MetricsCollector()

    async def handle_request(
        self, request: InferenceRequest
    ) -> AsyncIterator[InferenceResponse]:
        """Process a request and yield streaming responses."""

        # Wait if at capacity
        while self.active_requests >= self.max_concurrent:
            await asyncio.sleep(0.01)

        self.active_requests += 1
        tokens = []
        first_token_time = None

        try:
            async for token in self.model.generate(
                request.prompt, request.max_tokens
            ):
                if first_token_time is None:
                    first_token_time = time.time()

                tokens.append(token)
                yield InferenceResponse(
                    request_id=request.request_id,
                    tokens=tokens.copy(),
                    finished=False,
                    ttft_ms=(first_token_time - request.arrival_time) * 1000,
                )

            # Final response
            completion_time = time.time()
            ttft_ms = (first_token_time - request.arrival_time) * 1000
            total_ms = (completion_time - request.arrival_time) * 1000

            self.metrics.record_request(
                latency_ms=total_ms,
                ttft_ms=ttft_ms,
                prompt_tokens=len(request.prompt.split()),
                generated_tokens=len(tokens),
            )

            yield InferenceResponse(
                request_id=request.request_id,
                tokens=tokens,
                finished=True,
                ttft_ms=ttft_ms,
                total_ms=total_ms,
            )

        finally:
            self.active_requests -= 1
In[45]:
Code
async def simulate_traffic(
    server: InferenceServer,
    num_requests: int = 20,
    arrival_rate: float = 5.0,  # requests per second
):
    """Simulate concurrent traffic to the server."""

    async def process_request(request: InferenceRequest):
        final_response = None
        async for response in server.handle_request(request):
            if response.finished:
                final_response = response
        return final_response

    # Create requests with Poisson arrivals
    tasks = []
    for i in range(num_requests):
        request = InferenceRequest(
            request_id=f"req-{i:03d}",
            prompt=f"Request number {i}: "
            + " ".join(["word"] * random.randint(10, 100)),
            max_tokens=random.randint(20, 50),
        )
        task = asyncio.create_task(process_request(request))
        tasks.append(task)

        # Random inter-arrival time
        await asyncio.sleep(random.expovariate(arrival_rate))

    # Wait for all requests to complete
    responses = await asyncio.gather(*tasks)
    return responses
In[46]:
Code
# Run the simulation
model = MockModel(tokens_per_second=50)
server = InferenceServer(model, max_concurrent=5)

# Run async simulation
responses = await simulate_traffic(server, num_requests=30, arrival_rate=3.0)

ttfts = [r.ttft_ms for r in responses if r.ttft_ms]
totals = [r.total_ms for r in responses if r.total_ms]

# Server metrics
server_metrics = server.metrics.get_current_metrics()
Out[47]:
Console
Simulation Results:
--------------------------------------------------
Completed requests: 30

Time to First Token:
  P50: 27ms
  P95: 43ms

Total Latency:
  P50: 753ms
  P95: 995ms

Server Metrics:
  Total tokens processed: 2619

This simulation demonstrates the key components working together in a realistic pattern: requests arrive at varying rates following a Poisson process, queue when the server is at capacity (limited to 5 concurrent requests), stream responses as tokens generate, and metrics accumulate for monitoring. The async design allows concurrent request processing while respecting capacity limits, and the streaming response pattern provides tokens to clients as soon as they're available rather than waiting for complete generation.

Key Parameters

The key parameters for the inference server simulation are:

  • max_concurrent: Maximum number of requests the server processes simultaneously. Requests beyond this limit are queued, which increases their TTFT but prevents overwhelming the model or running out of GPU memory.
  • tokens_per_second: The rate at which the model generates tokens during the decode phase. Higher values indicate faster hardware or more efficient implementation.
  • arrival_rate: The average number of requests arriving per second. When the arrival rate exceeds the processing rate, queues build up and latency increases; the rough capacity sketch below relates these three parameters.
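
As a rough capacity check (illustrative only, ignoring prefill time and queueing variance), the simulation's parameters imply a service rate well above the offered arrival rate, which is why queues stayed bounded in the run above:

# Rough capacity check from the simulation's parameters; illustrative only.
max_concurrent = 5
tokens_per_second = 50        # decode rate per active request
avg_generated_tokens = 35     # midpoint of randint(20, 50)

avg_service_time_s = avg_generated_tokens / tokens_per_second   # ~0.7 s per request
service_rate = max_concurrent / avg_service_time_s              # requests/s the server can finish
arrival_rate = 3.0                                               # requests/s offered

utilization = arrival_rate / service_rate
status = "queues stay bounded" if utilization < 1 else "queue grows without bound"
print(f"Service rate: {service_rate:.1f} req/s, arrival rate: {arrival_rate:.1f} req/s")
print(f"Utilization:  {utilization:.0%} -> {status}")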

Limitations and Practical Considerations

Inference serving for LLMs remains challenging despite sophisticated techniques. Several fundamental tensions persist that operators must navigate based on their specific requirements:

Cost vs Latency: GPU instances are expensive, but maintaining low latency requires spare capacity. Running at high utilization saves money but increases queue times and latency variance because there's no buffer for traffic spikes. Organizations must balance these competing priorities based on their SLAs and budget constraints. Auto-scaling helps but introduces its own costs (warm pool instances consume resources even when idle) and latency (cold starts delay capacity increases).

Batching Trade-offs: Continuous batching improves throughput significantly by amortizing GPU memory bandwidth across multiple requests, but it complicates request prioritization and debugging. When requests share batches, attributing latency to individual requests becomes harder because each request's timing depends on what else is in the batch. Strict SLA requirements may force smaller batches, sacrificing efficiency for predictability.

Tail Latency: While median latency is often acceptable, P99 latency can be dramatically higher due to garbage collection pauses, occasional long requests that monopolize resources, and resource contention with other workloads. Achieving consistent tail latency requires overprovisioning and careful attention to outlier requests that could delay others.

Model Updates: Deploying new model versions without downtime requires careful orchestration. Blue-green deployments maintain two complete environments and switch traffic between them. Canary releases gradually shift traffic to new versions while monitoring for regressions. Both approaches require infrastructure support and add operational complexity. Rollbacks must be fast when new versions underperform.
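
To make the canary pattern concrete, here is a minimal sketch of weighted traffic splitting between a stable and a new model version. The version names, weights, and ramp steps are hypothetical; a production system would drive ramp-up and rollback from the monitoring metrics discussed above rather than manually.

# Minimal sketch of canary traffic splitting between two model versions.
# Version names, weights, and the ramp step are hypothetical.
import random


class CanaryRouter:
    def __init__(self, stable: str, canary: str, canary_weight: float = 0.05):
        self.stable = stable
        self.canary = canary
        self.canary_weight = canary_weight  # fraction of traffic sent to the new version

    def pick_version(self) -> str:
        """Route a request to the canary with probability canary_weight."""
        return self.canary if random.random() < self.canary_weight else self.stable

    def ramp_up(self, step: float = 0.10, cap: float = 1.0):
        """Shift more traffic to the canary once metrics look healthy."""
        self.canary_weight = min(cap, self.canary_weight + step)

    def rollback(self):
        """Send all traffic back to the stable version."""
        self.canary_weight = 0.0


router = CanaryRouter(stable="model-v1", canary="model-v2", canary_weight=0.05)
print([router.pick_version() for _ in range(10)])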

Multi-Tenancy: Shared infrastructure serving multiple customers introduces isolation challenges. One customer's burst of long requests shouldn't degrade service for others who are paying for the same SLAs. Token-aware load balancing and request queuing help, but perfect isolation requires separate instances, which increases cost.
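
One common mitigation short of full isolation is a per-tenant token budget. The sketch below shows a minimal token-bucket admission check keyed by tenant; the capacities and refill rates are hypothetical, and real systems typically enforce this per API key at the router.

# Minimal per-tenant token-bucket sketch for multi-tenant isolation.
# Budgets and refill rates are hypothetical values for illustration.
import time


class TenantTokenBucket:
    def __init__(self, capacity_tokens: int, refill_per_second: float):
        self.capacity = capacity_tokens
        self.refill_per_second = refill_per_second
        self.tokens = float(capacity_tokens)
        self.last_refill = time.time()

    def allow(self, requested_tokens: int) -> bool:
        """Admit the request only if the tenant still has budget for its tokens."""
        now = time.time()
        self.tokens = min(
            self.capacity,
            self.tokens + (now - self.last_refill) * self.refill_per_second,
        )
        self.last_refill = now
        if requested_tokens <= self.tokens:
            self.tokens -= requested_tokens
            return True
        return False


buckets = {
    "tenant-a": TenantTokenBucket(capacity_tokens=10_000, refill_per_second=200),
    "tenant-b": TenantTokenBucket(capacity_tokens=10_000, refill_per_second=200),
}
print(buckets["tenant-a"].allow(requested_tokens=1500))  # True: within budget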

The techniques covered in this part (KV caching, quantization, speculative decoding, continuous batching) address many of these challenges, but their interaction with serving infrastructure requires careful tuning. Optimizing a single request's inference speed differs from optimizing system-wide throughput under concurrent load. A technique that improves single-request latency might harm throughput under load, or vice versa.

Summary

This chapter covered the infrastructure layer that transforms trained language models into production services. The key concepts include:

Inference server architecture comprises specialized components including request handlers, schedulers, model executors, and memory managers. Modern servers like vLLM and TGI implement sophisticated features like PagedAttention and continuous batching out of the box.

Request routing determines which model instance handles each request, supporting version-based routing for A/B testing, capability-based routing for cost optimization, and priority routing for SLA differentiation. Health checks ensure routing decisions reflect actual endpoint availability.

Load balancing for LLMs requires token-aware approaches that account for variable request costs. Traditional least-connections balancing fails to capture the difference between short and long generations. Session affinity enables KV cache reuse but conflicts with optimal load distribution.

Auto-scaling adjusts capacity based on metrics like queue depth, latency percentiles, and GPU utilization. Cold start latency for LLM instances (often 30-120 seconds) necessitates warm pools and predictive scaling to maintain responsiveness.

Latency optimization requires understanding the distinct prefill (compute-bound) and decode (memory-bound) phases. Time-to-first-token matters for interactive applications, while total latency matters for batch processing. These often require different optimization strategies.

Monitoring tracks request-level metrics (TTFT, latency), system-level metrics (GPU utilization, queue depth), and business metrics (tokens per second, cost). SLO-based alerting provides early warning before service degradation affects users.

These serving concepts build on the inference optimization techniques from earlier chapters in this part. Together, they enable deploying language models that scale from prototype to production, handling thousands of concurrent users while meeting latency and cost targets.

Quiz

Ready to test your understanding? Take this quick quiz to reinforce what you've learned about Inference Serving.

