KV Cache Memory: Calculating GPU Requirements for LLM Inference

Michael Brenndoerfer · January 7, 2026 · 38 min read

Learn to calculate KV cache memory requirements for transformer models. Covers batch size, context length, GQA optimization, and GPU deployment planning.

KV Cache Memory

In the previous chapter, we saw how the KV cache eliminates redundant computation during autoregressive generation by storing keys and values from previous tokens. This memory-compute tradeoff is fundamental to efficient inference. However, as models grow larger and context windows expand, the KV cache can consume enormous amounts of GPU memory.

Understanding exactly how much memory the KV cache requires is essential for deployment planning. A seemingly simple question, "Can I run this model on my GPU?", requires knowing not just the model weight size but also how much memory the cache will consume during generation. The answer depends on the model architecture, the sequence length, the batch size, and the data type. This chapter provides the tools to calculate these requirements precisely.

Anatomy of KV Cache Storage

Before calculating memory requirements, we need to understand exactly what the cache stores. Recall from the previous chapter that during each forward pass, we cache the key and value projections for reuse in subsequent tokens. This caching strategy transforms what would be repeated linear projections into simple memory lookups. However, every lookup requires that the data exist in GPU memory. Understanding the precise shape and structure of this stored data is the first step toward calculating its memory footprint.

The keys and values we cache are not arbitrary tensors. They have very specific shapes determined by the attention mechanism's design. When attention processes a token, it creates a query vector to ask "what should I attend to?" Corresponding key and value vectors for that token answer questions from future tokens. The query is used immediately and discarded, but the keys and values must persist because future tokens will compare against them.

For a single attention layer processing a single token, the projections produce:

  • Key tensor: shape [batch_size, num_kv_heads, 1, head_dim]
  • Value tensor: shape [batch_size, num_kv_heads, 1, head_dim]

The "1" in these shapes represents the single token being processed during generation. This single-token slice is what gets appended to the existing cache with each forward pass. The batch dimension allows processing multiple independent sequences simultaneously, the head dimension captures the multi-head structure of attention, and the head_dim represents the dimensionality of each head's representation space.

After processing $t$ tokens, the accumulated cache for that layer has:

  • Key cache: shape [batch_size, num_kv_heads, t, head_dim]
  • Value cache: shape [batch_size, num_kv_heads, t, head_dim]

Notice how the sequence dimension grows from 1 to $t$ as tokens accumulate. This growth is the source of the linear memory scaling that dominates long-context inference. Each new token adds one slice along the sequence dimension, which must be stored for all subsequent generation steps.
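
To make this growth concrete, here is a minimal PyTorch sketch, with arbitrary example dimensions and no claim to match any particular framework's implementation, that appends one token's key and value slice to an existing per-layer cache:

import torch

## Hypothetical dimensions, chosen only for illustration
batch_size, num_kv_heads, head_dim = 1, 32, 128

## Per-layer cache already holding t = 10 tokens
k_cache = torch.zeros(batch_size, num_kv_heads, 10, head_dim, dtype=torch.float16)
v_cache = torch.zeros(batch_size, num_kv_heads, 10, head_dim, dtype=torch.float16)

## Key/value projections for the single token being generated
k_new = torch.randn(batch_size, num_kv_heads, 1, head_dim).half()
v_new = torch.randn(batch_size, num_kv_heads, 1, head_dim).half()

## Append along the sequence dimension (dim=2): t grows from 10 to 11
k_cache = torch.cat([k_cache, k_new], dim=2)
v_cache = torch.cat([v_cache, v_new], dim=2)

print(k_cache.shape)  # torch.Size([1, 32, 11, 128])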

The full KV cache spans all transformer layers. A model with $L$ layers maintains $L$ key caches and $L$ value caches. This multiplicative effect of layers is crucial: a 32-layer model requires 32 times the cache of a single layer, making the number of layers one of the primary drivers of total cache size. Each layer's cache is completely independent, containing that layer's unique representation of the sequence history.

The data type determines bytes per element. Most modern inference uses half-precision (FP16 or BF16) at 2 bytes per element, though some systems use FP32 (4 bytes) or quantized representations like INT8 (1 byte) or INT4 (0.5 bytes). This choice creates a direct tradeoff between precision and memory consumption. Halving the bytes per element halves the cache size, which is why quantization techniques have become essential for deploying large models.
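
To put numbers on this tradeoff, the short sketch below computes the per-token cache cost of a LLaMA-7B-style configuration (32 layers, 32 KV heads, head dimension 128, dimensions taken from the model table later in this chapter) under different data types:

## Bytes per element for common cache data types
dtype_bytes = {"FP32": 4, "FP16/BF16": 2, "INT8": 1, "INT4": 0.5}

## LLaMA-7B-style dimensions (see the model specifications below)
num_layers, num_kv_heads, head_dim = 32, 32, 128

for name, nbytes in dtype_bytes.items():
    # Two tensors (K and V) per layer, one slice per token
    per_token_bytes = 2 * num_layers * num_kv_heads * head_dim * nbytes
    print(f"{name:>10}: {per_token_bytes / 1024**2:.3f} MB per token")

## FP16/BF16 works out to 0.500 MB per token, matching the figure below;
## INT4 cuts this to 0.125 MB per token.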

Out[2]:
Visualization
Line plot showing cumulative KV cache memory growing linearly with number of generated tokens.
KV cache memory growth during LLaMA 7B token generation across 4,096 tokens with batch size 1 in FP16 precision. The graph shows cumulative memory increasing linearly from 0 to approximately 2.0 GB, with each token adding 0.50 MB across all 32 layers. This linear growth pattern demonstrates the fundamental memory cost of autoregressive generation, where each new token requires storing its keys and values for all subsequent tokens in the sequence.

Cache Size Calculation

With a clear understanding of what the cache stores, we can now derive the formula for total memory consumption. The calculation follows directly from multiplying the dimensions of the cached tensors by the number of bytes each element requires. This derivation provides the foundation for all deployment planning decisions.

Consider the structure we described above. For each layer, we store both keys and values. Each of these has shape [batch_size, num_kv_heads, seq_len, head_dim]. The total number of elements in one such tensor is the product of these dimensions. Since we store two tensors (keys and values) per layer across all layers, the total element count becomes:

$$\text{Total Elements} = 2 \times L \times B \times T \times H_{\text{kv}} \times D_h$$

Converting elements to bytes requires multiplying by the bytes per element, giving us the complete formula:

$$\text{KV Cache Memory} = 2 \times L \times B \times T \times H_{\text{kv}} \times D_h \times \text{bytes}$$

where:

  • $2$ accounts for storing both key and value matrices
  • $L$: the number of transformer layers
  • $B$: the batch size
  • $T$: the sequence length (number of tokens cached)
  • $H_{\text{kv}}$: the number of key-value heads
  • $D_h$: the head dimension
  • $\text{bytes}$: the number of bytes per element (2 for FP16, 4 for FP32)

This formula reveals the linear dependencies that govern cache memory. Doubling any of the multiplicative factors doubles the total memory. This linearity is both a blessing and a curse. It makes calculations simple and predictable, but it also means there are no economies of scale. Processing twice as many tokens always requires twice as much cache memory, with no possibility of compression through clever algorithms (unless we explicitly introduce approximation techniques covered in later chapters).

For standard Multi-Head Attention (MHA), $H_{\text{kv}}$ equals the number of attention heads $H$, and $H \times D_h = D_{\text{model}}$. This relationship exists because each head operates on a slice of the model dimension, and together all heads span the full representation space. This architectural constraint allows us to simplify our formula:

$$\text{KV Cache Memory (MHA)} = 2 \times L \times B \times T \times D_{\text{model}} \times \text{bytes}$$

where:

  • $D_{\text{model}}$: the model dimension (equal to $H \times D_h$, representing the total embedding size)
  • $L$: the number of transformer layers
  • $B$: the batch size
  • $T$: the sequence length
  • $\text{bytes}$: the number of bytes per element

This simplified form is particularly useful because $D_{\text{model}}$ is typically reported in model documentation, making quick mental calculations possible. For example, knowing that a model has dimension 4096, 32 layers, and you want to cache 4096 tokens in FP16, you can quickly compute $2 \times 32 \times 1 \times 4096 \times 4096 \times 2$ bytes, which equals roughly 2 GB.
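
As a quick sanity check, the same arithmetic in Python:

## 2 (K+V) × 32 layers × batch 1 × 4096 tokens × d_model 4096 × 2 bytes (FP16)
cache_bytes = 2 * 32 * 1 * 4096 * 4096 * 2
print(f"{cache_bytes / 1024**3:.1f} GB")  # 2.0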

For models using Grouped Query Attention (GQA), which we covered in Part XIX, the number of KV heads is smaller than the number of query heads. This architectural innovation allows multiple query heads to share the same key-value pairs, dramatically reducing the cache footprint. GQA achieves this without proportionally harming model quality. If a model has $H$ query heads but only $H_{\text{kv}}$ key-value heads, the cache size reduces proportionally:

$$\text{Reduction Factor} = \frac{H}{H_{\text{kv}}}$$

where:

  • $H$ is the number of query attention heads
  • $H_{\text{kv}}$ is the number of key-value heads

This reduction factor quantifies the memory savings from GQA. A model with 64 query heads and 8 KV heads achieves an 8× reduction in cache size compared to standard MHA. This 8× reduction makes large models feasible on single GPUs instead of requiring expensive multi-GPU clusters.
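
The sketch below applies this reduction factor to a LLaMA-2-70B-style configuration (80 layers, 64 query heads, 8 KV heads, head dimension 128, matching the model specifications used later in this chapter) at a 4,096-token context in FP16:

num_layers, seq_len, head_dim, dtype_bytes = 80, 4096, 128, 2
num_query_heads, num_kv_heads = 64, 8

## Cache size if every query head kept its own K/V (standard MHA)
mha_bytes = 2 * num_layers * seq_len * num_query_heads * head_dim * dtype_bytes
## Cache size with grouped KV heads (GQA)
gqa_bytes = 2 * num_layers * seq_len * num_kv_heads * head_dim * dtype_bytes

print(f"MHA cache: {mha_bytes / 1024**3:.2f} GB")  # 10.00
print(f"GQA cache: {gqa_bytes / 1024**3:.2f} GB")  # 1.25
print(f"Reduction factor: {num_query_heads // num_kv_heads}x")  # 8x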

Now let's examine concrete examples with popular model architectures. First, we define our model specifications and a calculation function:

In[3]:
Code
def calculate_kv_cache_memory(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    batch_size: int = 1,
    seq_len: int = 2048,
    dtype_bytes: int = 2,  # FP16 default
) -> dict:
    """
    Calculate KV cache memory requirements.

    Returns memory in bytes, MB, and GB.
    """
    # Total elements: 2 (K+V) × layers × batch × seq × heads × head_dim
    total_elements = (
        2 * num_layers * batch_size * seq_len * num_kv_heads * head_dim
    )

    # Convert to bytes
    memory_bytes = total_elements * dtype_bytes
    memory_mb = memory_bytes / (1024**2)
    memory_gb = memory_bytes / (1024**3)

    return {
        "elements": total_elements,
        "bytes": memory_bytes,
        "MB": memory_mb,
        "GB": memory_gb,
    }


## Model specifications

models = {
    "GPT-2 (124M)": {
        "num_layers": 12,
        "num_kv_heads": 12,
        "head_dim": 64,
        "d_model": 768,
        "total_params": 124_000_000,
    },
    "LLaMA 7B": {
        "num_layers": 32,
        "num_kv_heads": 32,  # MHA
        "head_dim": 128,
        "d_model": 4096,
        "total_params": 7_000_000_000,
    },
    "LLaMA 2 70B": {
        "num_layers": 80,
        "num_kv_heads": 8,  # GQA with 8 KV heads
        "head_dim": 128,
        "d_model": 8192,
        "total_params": 70_000_000_000,
    },
    "LLaMA 3 70B": {
        "num_layers": 80,
        "num_kv_heads": 8,  # GQA
        "head_dim": 128,
        "d_model": 8192,
        "total_params": 70_000_000_000,
    },
}
In[4]:
Code
model_results = {}
for model_name, specs in models.items():
    cache = calculate_kv_cache_memory(
        num_layers=specs["num_layers"],
        num_kv_heads=specs["num_kv_heads"],
        head_dim=specs["head_dim"],
        batch_size=1,
        seq_len=4096,
        dtype_bytes=2,
    )

    weight_gb = specs["total_params"] * 2 / (1024**3)
    cache_pct = (cache["GB"] / weight_gb) * 100

    model_results[model_name] = {
        "weight_gb": weight_gb,
        "cache_gb": cache["GB"],
        "cache_pct": cache_pct,
    }
Out[5]:
Console

GPT-2 (124M):
  Model weights: 0.2 GB
  KV cache:      0.14 GB (60.9% of weights)

LLaMA 7B:
  Model weights: 13.0 GB
  KV cache:      2.00 GB (15.3% of weights)

LLaMA 2 70B:
  Model weights: 130.4 GB
  KV cache:      1.25 GB (1.0% of weights)

LLaMA 3 70B:
  Model weights: 130.4 GB
  KV cache:      1.25 GB (1.0% of weights)
Out[6]:
Visualization
Grouped bar chart comparing KV cache memory for MHA versus GQA configurations.
Comparison of KV cache memory requirements for Multi-Head Attention (MHA) versus Grouped Query Attention (GQA) in LLaMA 2 70B with 4,096-token context in FP16 precision and batch size 1. Standard MHA with 64 KV heads would require about 10 GB, while GQA with 8 KV heads requires only 1.25 GB, achieving an 8-fold reduction. This dramatic difference demonstrates how the architectural choice of GQA makes large model deployment feasible within typical GPU memory constraints.

These results reveal several important patterns about KV cache scaling across different architectures. For GPT-2, the cache is tiny in absolute terms (0.14 GB), although it amounts to roughly 60% of the model's equally tiny weight footprint. For LLaMA 7B, the cache reaches 2 GB at a 4K context, about 15% of the weight memory and already a measurable share of the deployment budget. The most striking pattern appears with the LLaMA 2 and 3 70B models. Despite being 10× larger than LLaMA 7B, their use of Grouped Query Attention with only 8 KV heads means their cache (1.25 GB) is actually smaller than LLaMA 7B's, an 8× reduction compared to what standard multi-head attention would require. This architectural choice makes the difference between practical deployment and memory constraints that would require multiple GPUs.

Sequence Length Effects

Sequence length has a linear effect on KV cache size. This relationship emerges directly from our formula. The term TT (sequence length) appears as a simple multiplicative factor, meaning that doubling the context window doubles the cache memory. Unlike attention score computation, which grows quadratically with sequence length, cache memory grows strictly linearly. This linear relationship enables straightforward capacity planning and creates unavoidable memory costs for long contexts.

This linear relationship becomes particularly problematic for long-context models. The trend in language model development has been toward ever-longer context windows, from 512 tokens in early BERT models to 128K or even 1M token windows in modern systems. Every tenfold increase in context length requires tenfold more cache memory, creating a fundamental tension between capability and resource requirements.

In[7]:
Code
## LLaMA 7B specifications
llama_7b = models["LLaMA 7B"]

## Calculate cache sizes for various context lengths
context_lengths = [512, 1024, 2048, 4096, 8192, 16384, 32768, 65536, 131072]
cache_sizes_gb = []

for ctx_len in context_lengths:
    cache = calculate_kv_cache_memory(
        num_layers=llama_7b["num_layers"],
        num_kv_heads=llama_7b["num_kv_heads"],
        head_dim=llama_7b["head_dim"],
        batch_size=1,
        seq_len=ctx_len,
        dtype_bytes=2,
    )
    cache_sizes_gb.append(cache["GB"])

## Model weight size for comparison
weight_gb = llama_7b["total_params"] * 2 / (1024**3)
Out[8]:
Visualization
Line plot showing KV cache memory increasing linearly with context length, crossing model weight threshold.
Linear scaling of KV cache memory for LLaMA 7B in FP16 with batch size 1 across context lengths from 512 to 131,072 tokens. Memory grows from 0.25 GB at 512 tokens to 64 GB at 128K tokens, with a critical crossover point near 27K tokens where cache memory equals model weight memory. Beyond this threshold, the cache becomes the dominant memory consumer, emphasizing why optimization techniques become essential for long-context deployment.

The crossover point, where KV cache exceeds model weights, is critical for deployment planning. This visualization clearly shows the linear relationship between sequence length and memory consumption, with the cache becoming the dominant memory consumer at longer contexts. Let's find it exactly:

In[9]:
Code
## Find crossover point for LLaMA 7B
crossover_seq_len = (
    weight_gb
    * (1024**3)
    / (
        2
        * llama_7b["num_layers"]
        * llama_7b["num_kv_heads"]
        * llama_7b["head_dim"]
        * 2  # bytes
    )
)
Out[10]:
Console
LLaMA 7B crossover point: 26,702 tokens
At this context length, KV cache equals model weights: 13.0 GB

This crossover point represents a fundamental shift in the memory profile of inference. Below this threshold, model weights dominate memory consumption and the cache is a relatively small overhead. Above it, the cache becomes the primary memory consumer, growing linearly with every additional token while the weights remain constant. For LLaMA 7B, any context window beyond roughly 27K tokens means the KV cache consumes more memory than the model itself, fundamentally changing deployment requirements and making cache optimization techniques essential rather than optional.

For long-context applications such as document analysis, multi-turn conversations, and code completion, the KV cache dominates memory usage. A 128K context window, as supported by models like Claude and GPT-4, would require 32× more cache memory than a 4K window. This scaling explains why long-context inference often requires specialized infrastructure and optimization techniques that would be unnecessary for short-context applications.
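
For example, reusing the calculate_kv_cache_memory function defined earlier, the jump from a 4K to a 128K context for LLaMA 7B works out to a 32-fold increase:

for ctx_len in (4096, 131072):
    gb = calculate_kv_cache_memory(
        num_layers=32, num_kv_heads=32, head_dim=128,
        batch_size=1, seq_len=ctx_len, dtype_bytes=2,
    )["GB"]
    print(f"{ctx_len:>7} tokens: {gb:.1f} GB")
## 4096 tokens: 2.0 GB, 131072 tokens: 64.0 GB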

Batch Size Effects

Batch size also scales linearly with cache memory, following the same multiplicative relationship we observed with sequence length. Each sequence in the batch maintains its own independent cache. Tokens in one sequence do not attend to tokens in another sequence, so cached values are not shared across batch elements. Each sequence maintains independent keys and values accessed only by its own future tokens.

This independence has important implications for memory planning. If you want to process 8 concurrent requests, you need 8 complete copies of the KV cache (one for each sequence). The batch dimension in our cache tensors, represented by BB in our formula, is not a clever abstraction that enables sharing; it simply organizes multiple independent caches for efficient parallel processing.

In[11]:
Code
## Calculate cache for various batch sizes
batch_sizes = [1, 2, 4, 8, 16, 32, 64]
cache_by_batch = []

for batch in batch_sizes:
    cache = calculate_kv_cache_memory(
        num_layers=llama_7b["num_layers"],
        num_kv_heads=llama_7b["num_kv_heads"],
        head_dim=llama_7b["head_dim"],
        batch_size=batch,
        seq_len=4096,
        dtype_bytes=2,
    )
    cache_by_batch.append(cache["GB"])
Out[12]:
Visualization
Bar chart showing KV cache memory increasing with batch size from 1 to 64.
Linear scaling of KV cache memory for LLaMA 7B at 4,096-token context in FP16 across batch sizes from 1 to 64. Memory increases from 2.00 GB with a single sequence to 128 GB with 64 concurrent sequences. Each sequence maintains its own independent cache because tokens in one sequence do not attend to tokens in another. At batch size 8, the cumulative cache memory already exceeds the model's weight memory, illustrating the throughput-memory tradeoff in batched inference.

This visualization clearly demonstrates the linear relationship between batch size and KV cache memory. The batch size tradeoff is fundamental to inference economics, as it directly determines the throughput your system can achieve in production deployment. Doubling batch size doubles memory requirements, but it also doubles the number of requests you can process in parallel. This creates a direct tension between throughput and memory availability:

  • Single-sequence inference minimizes cache memory but leaves GPU compute underutilized
  • Large batches maximize throughput (tokens per second) but require proportionally more memory
  • The optimal batch size is typically the largest that fits in available memory after accounting for model weights and activation memory

GPU compute is expensive, and leaving it idle while processing a single sequence is wasteful. Batching amortizes the fixed costs of loading model weights and allows the GPU's parallel processing capabilities to be fully utilized. But this amortization is only possible if you have sufficient memory to hold multiple caches simultaneously.

In[13]:
Code
## What batch size fits in 80GB (A100) for LLaMA 7B?
total_gpu_memory = 80  # GB
model_weights = weight_gb
activation_overhead = 2  # GB (rough estimate for intermediate computations)
available_for_cache = total_gpu_memory - model_weights - activation_overhead

## Cache per sequence at 4096 tokens
cache_per_seq = calculate_kv_cache_memory(
    num_layers=llama_7b["num_layers"],
    num_kv_heads=llama_7b["num_kv_heads"],
    head_dim=llama_7b["head_dim"],
    batch_size=1,
    seq_len=4096,
    dtype_bytes=2,
)["GB"]

max_batch = int(available_for_cache / cache_per_seq)
Out[14]:
Console
GPU Memory Budget: 80 GB
  Model weights:     13.0 GB
  Activation buffer: 2.0 GB
  Available for KV:  65.0 GB
  Cache per sequence: 2.00 GB
  Maximum batch size: 32

This calculation shows strong throughput potential for LLaMA 7B on high-end hardware. On an A100 80GB GPU at a 4096-token context, roughly 65 GB remains for the cache after weights and activations, enough for a batch of 32 concurrent sequences.

Combined Effects: The Memory Budget

In practice, we must account for all memory consumers simultaneously. The GPU has a fixed amount of memory, and every component of inference (model weights, temporary buffers, cache, and overhead) must fit within this budget. Omitting any component from memory accounting causes out-of-memory errors and crashes inference. Accurate memory estimation is essential for reliable deployment.

The total GPU memory budget must accommodate:

  1. Model weights: Fixed once loaded. These parameters define the model's learned behavior and remain constant regardless of input or batch size.
  2. KV cache: Grows with batch size and sequence length. This is the dynamic component that varies most dramatically during inference.
  3. Activation memory: Temporary buffers during computation. These hold intermediate results like attention scores and feed-forward layer outputs before they are consumed.
  4. Framework overhead: CUDA context, PyTorch allocator buffers. These fixed costs exist regardless of what computation you perform.

The interplay between these components creates a constrained optimization problem. Model weights cannot be reduced without using a smaller model or applying quantization. Framework overhead is similarly fixed. This leaves the KV cache and activation memory as the variables that respond to your choices about batch size and sequence length.

Let's build a comprehensive memory estimator that accounts for all these factors.

In[15]:
Code
def estimate_inference_memory(
    model_params: int,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    batch_size: int,
    seq_len: int,
    dtype_bytes: int = 2,
    activation_factor: float = 0.1,  # Fraction of weights
) -> dict:
    """
    Estimate total GPU memory for inference.

    activation_factor: Rough multiplier for activation memory (varies by implementation)
    """
    # Model weights
    weight_bytes = model_params * dtype_bytes
    weight_gb = weight_bytes / (1024**3)

    # KV cache
    cache = calculate_kv_cache_memory(
        num_layers, num_kv_heads, head_dim, batch_size, seq_len, dtype_bytes
    )

    # Activation memory (rough estimate)
    activation_gb = weight_gb * activation_factor

    # Framework overhead (rough estimate)
    overhead_gb = 0.5  # CUDA context, allocator buffers

    total_gb = weight_gb + cache["GB"] + activation_gb + overhead_gb

    return {
        "weights_gb": weight_gb,
        "kv_cache_gb": cache["GB"],
        "activation_gb": activation_gb,
        "overhead_gb": overhead_gb,
        "total_gb": total_gb,
        "kv_cache_pct": (cache["GB"] / total_gb) * 100,
    }
In[16]:
Code
## Memory breakdown for different configurations
configs = [
    {"name": "Short context (2K)", "batch": 1, "seq": 2048},
    {"name": "Medium context (8K)", "batch": 1, "seq": 8192},
    {"name": "Long context (32K)", "batch": 1, "seq": 32768},
    {"name": "Batched (4K × 8)", "batch": 8, "seq": 4096},
]

memory_results = []
for config in configs:
    mem = estimate_inference_memory(
        model_params=llama_7b["total_params"],
        num_layers=llama_7b["num_layers"],
        num_kv_heads=llama_7b["num_kv_heads"],
        head_dim=llama_7b["head_dim"],
        batch_size=config["batch"],
        seq_len=config["seq"],
    )
    memory_results.append((config["name"], mem))
Out[17]:
Console
LLaMA 7B Memory Breakdown (FP16)
======================================================================

Short context (2K):
  Weights:      13.0 GB
  KV Cache:     1.00 GB (6% of total)
  Activation:    1.3 GB
  Overhead:      0.5 GB
  TOTAL:        15.8 GB

Medium context (8K):
  Weights:      13.0 GB
  KV Cache:     4.00 GB (21% of total)
  Activation:    1.3 GB
  Overhead:      0.5 GB
  TOTAL:        18.8 GB

Long context (32K):
  Weights:      13.0 GB
  KV Cache:    16.00 GB (52% of total)
  Activation:    1.3 GB
  Overhead:      0.5 GB
  TOTAL:        30.8 GB

Batched (4K × 8):
  Weights:      13.0 GB
  KV Cache:    16.00 GB (52% of total)
  Activation:    1.3 GB
  Overhead:      0.5 GB
  TOTAL:        30.8 GB
Out[18]:
Visualization
Stacked horizontal bar chart showing memory allocation for weights, KV cache, activation, and overhead across four configurations.
Memory allocation breakdown for LLaMA 7B across four representative inference configurations in FP16 precision. Stacked horizontal bars show the composition of weights (blue), KV cache (coral), activation memory (green), and framework overhead (gray) for each scenario. At a short 2K context, weights dominate at roughly 82% of total memory and the cache is minimal. Longer contexts shift priorities dramatically, with the cache reaching about 52% of total memory at 32K tokens, and batched inference with 8 concurrent sequences at 4K context produces the same 16 GB cache and the same share. These varying allocation patterns highlight how configuration choices fundamentally reshape the memory profile of inference workloads.

This breakdown demonstrates how dramatically memory requirements shift across different usage patterns. For short 2K contexts, model weights dominate at roughly 82% of total memory, with the KV cache representing only about 6% of requirements. As the context expands to 32K tokens, the cache grows to roughly 52% of total memory, overtaking the weights as the largest single component. Batching 8 sequences at 4K context each yields the same 16 GB cache and the same shift. This transition from weight-dominated to cache-dominated memory profiles has profound implications for deployment strategies and hardware selection, and it explains why cache optimization becomes critical for long-context applications.

The Memory Bottleneck

When KV cache memory dominates, we encounter a fundamental constraint: the memory bottleneck, which differs from the compute bottleneck that traditionally limits deep learning. Understanding this distinction is essential because the remedies for each bottleneck are completely different.

In a compute-bound scenario, performance is limited by how fast the GPU can perform arithmetic operations. The solution is to use faster GPUs with more compute units, or to reduce the number of operations through algorithmic improvements. In a memory-bound scenario, performance is limited by how fast data can move between memory and compute units. The GPU sits idle while waiting for data to arrive, regardless of how many arithmetic units it has available. Upgrading to a faster GPU may not improve inference speed if the bottleneck is memory bandwidth.

During token generation, the model performs:

  1. Attention computation (memory-bound): Uses cached keys and values. The attention mechanism reads the entire KV cache to compute attention scores, then reads it again to compute the weighted value sum. This requires moving massive amounts of data from GPU memory to compute units.
  2. Feed-forward computation (compute-bound): Dense matrix multiplications. The feed-forward layers perform arithmetic-heavy operations where the same weight matrix is applied to the input, allowing high arithmetic intensity.

Modern GPUs have enormous compute capacity (for example, A100 with 312 TFLOPS for FP16) but limited memory bandwidth (for example, A100 with 2 TB/s). The ratio between these capabilities determines the threshold between compute-bound and memory-bound operation. The arithmetic intensity, measured as the number of floating point operations performed per byte of data transferred, determines which resource is the bottleneck. We can formalize this relationship using the roofline model:

$$\mathcal{I} = \frac{\text{FLOPs}}{\text{Bytes}}$$

where:

  • $\mathcal{I}$: the arithmetic intensity (measured in operations per byte)
  • $\text{FLOPs}$: the total floating point operations performed
  • $\text{Bytes}$: the total data transferred from memory, dominated by KV cache reads

If $\mathcal{I}$ is lower than the GPU's ratio of peak compute to peak bandwidth, the process is memory-bound. For an A100, this threshold is approximately 156 FLOPs per byte (312 TFLOPS divided by 2 TB/s). Operations with lower arithmetic intensity cannot fully utilize GPU compute because they wait for memory transfers.

The practical implication is that for single-sequence generation at typical context lengths, the attention computation is memory-bound. Each token generation requires reading the entire KV cache from memory. This memory transfer takes longer than the actual computation, which is why techniques like FlashAttention and KV cache quantization provide significant speedups without reducing the number of arithmetic operations.
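
As a rough illustration, assuming the A100 figures quoted above, the ridge point of the roofline model and a classification of a single generation step can be sketched as follows:

## Assumed A100 specifications from the text above
peak_flops = 312e12      # FP16 tensor-core FLOPs per second
peak_bandwidth = 2.0e12  # HBM bytes per second

## Ridge point: arithmetic intensity at which compute and memory balance
ridge = peak_flops / peak_bandwidth
print(f"Ridge point: {ridge:.0f} FLOPs/byte")  # ~156

def classify(arithmetic_intensity: float) -> str:
    """Classify an operation as memory- or compute-bound on this GPU."""
    return "memory-bound" if arithmetic_intensity < ridge else "compute-bound"

## Single-token generation over a 4K cache has an intensity of roughly
## 9 FLOPs/byte (see the analysis below), far below the ridge point.
print(classify(9.0))  # memory-bound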

In[19]:
Code
## Compute vs Memory bound analysis
def analyze_bottleneck(
    batch_size: int,
    seq_len: int,
    d_model: int,
    num_layers: int,
    gpu_tflops: float = 312,  # A100 FP16
    gpu_bandwidth_tb: float = 2.0,  # A100 HBM
):
    """
    Analyze whether generation is compute-bound or memory-bound.

    For generation, we process one token at a time.
    Attention: O(seq_len) memory reads, O(seq_len × d_model) compute.
    FFN: O(d_model^2) memory reads, O(d_model^2) compute.
    """
    # FLOPs for generating one token (simplified)
    # Attention: 4 × d_model × seq_len (Q×K, softmax, ×V, output proj)
    attention_flops = 4 * d_model * seq_len * batch_size * num_layers

    # FFN: 8 × d_model × d_ffn ≈ 8 × 4 × d_model^2 (for d_ffn = 4 × d_model)
    ffn_flops = 8 * 4 * d_model * d_model * batch_size * num_layers

    total_flops = attention_flops + ffn_flops

    # Memory reads (bytes) - dominated by KV cache for attention
    # KV cache read: 2 × seq_len × d_model × bytes per layer
    kv_read_bytes = 2 * seq_len * d_model * 2 * num_layers * batch_size

    # Weight reads: roughly model size (but often cached in compute units)
    # Simplified: assume weights stay in L2/registers for batch

    # Time estimates (seconds)
    compute_time = total_flops / (gpu_tflops * 1e12)
    memory_time = kv_read_bytes / (gpu_bandwidth_tb * 1e12)

    arithmetic_intensity = total_flops / kv_read_bytes

    return {
        "total_flops": total_flops,
        "kv_read_bytes": kv_read_bytes,
        "compute_time_us": compute_time * 1e6,
        "memory_time_us": memory_time * 1e6,
        "arithmetic_intensity": arithmetic_intensity,
        "bottleneck": "memory" if memory_time > compute_time else "compute",
    }
In[20]:
Code
## Analyze for different sequence lengths
seq_lengths_to_analyze = [1024, 4096, 16384, 65536]
bottleneck_results = []

for seq_len in seq_lengths_to_analyze:
    analysis = analyze_bottleneck(
        batch_size=1, seq_len=seq_len, d_model=4096, num_layers=32
    )
    bottleneck_results.append((seq_len, analysis))
Out[21]:
Console
Bottleneck Analysis: LLaMA 7B Token Generation (A100 GPU)
=================================================================

Context: 1,024 tokens
  Compute time: 56.8 µs
  Memory time:  268.4 µs
  Arithmetic intensity: 33.0 FLOPs/byte
  Bottleneck: MEMORY

Context: 4,096 tokens
  Compute time: 61.9 µs
  Memory time:  1073.7 µs
  Arithmetic intensity: 9.0 FLOPs/byte
  Bottleneck: MEMORY

Context: 16,384 tokens
  Compute time: 82.6 µs
  Memory time:  4295.0 µs
  Arithmetic intensity: 3.0 FLOPs/byte
  Bottleneck: MEMORY

Context: 65,536 tokens
  Compute time: 165.2 µs
  Memory time:  17179.9 µs
  Arithmetic intensity: 1.5 FLOPs/byte
  Bottleneck: MEMORY
Out[22]:
Visualization
Grouped bar chart comparing compute time versus memory time across different context lengths.
Comparison of compute time versus memory transfer time per token for LLaMA 7B on an A100 GPU across four context lengths (1K to 64K tokens). At every context length, memory transfer time dominates computation time, indicating memory-bound operation where the GPU waits for KV cache data. The gap widens as the context grows: memory time scales linearly with cache size while per-token compute grows far more slowly, so arithmetic intensity falls from 33 FLOPs/byte at 1K tokens to 1.5 FLOPs/byte at 64K tokens. Token generation therefore becomes more, not less, memory-bound at extreme context lengths.

These results reveal a critical pattern in transformer inference performance. At every tested context length, memory transfer time dominates compute time: the GPU spends far more time reading KV cache data from high-bandwidth memory (HBM) than performing arithmetic. The imbalance grows with context length, as arithmetic intensity falls from 33 FLOPs/byte at 1K tokens to just 1.5 FLOPs/byte at 64K tokens, far below the A100's threshold of roughly 156. Even for typical production scenarios with 4K-8K contexts, memory bandwidth is the primary constraint, which explains why FlashAttention and KV cache quantization provide significant speedups by reducing memory traffic. Increasing batch size helps amortize the cost of reading model weights across sequences, but each sequence's cache must still be stored and read separately, so larger batches also increase total memory requirements, creating a constrained optimization problem.

Memory Visualization Across Models

Let's visualize how memory allocation differs across model sizes and architectures:

Out[23]:
Visualization
Stacked bar chart comparing memory allocation between model weights and KV cache across four model sizes.
Stacked bar chart comparing memory allocation between model weights (blue) and KV cache (coral) for four models at 8K context in FP16 with batch size 1. For LLaMA 7B, the 4 GB cache is already a substantial fraction of its 13 GB of weights. For the 70B models, Grouped Query Attention with 8 KV heads keeps the cache to roughly 2.5 GB, a small share of the 130 GB weight footprint, compared with roughly 20 GB for an equivalent standard multi-head attention configuration, an 8-fold reduction. This architectural difference makes large model deployment practical on modern hardware with constrained memory.

Without GQA, LLaMA 70B would require 8× more KV cache memory, making long-context inference impractical on most hardware. The visualization clearly demonstrates how Grouped Query Attention keeps cache proportions manageable at the 70B scale, with the cache remaining a smaller fraction of total memory compared to what standard MHA would require.

Practical Memory Planning

Before deploying a model, estimate memory requirements for your target configuration. Here is a practical workflow:

In[24]:
Code
def can_fit_on_gpu(
    gpu_memory_gb: float,
    model_config: dict,
    batch_size: int,
    max_seq_len: int,
    dtype_bytes: int = 2,
    safety_margin: float = 0.9,  # Leave 10 percent buffer
) -> dict:
    """
    Check if a configuration fits in GPU memory.

    Returns feasibility and utilization details.
    """
    available = gpu_memory_gb * safety_margin

    mem = estimate_inference_memory(
        model_params=model_config["total_params"],
        num_layers=model_config["num_layers"],
        num_kv_heads=model_config["num_kv_heads"],
        head_dim=model_config["head_dim"],
        batch_size=batch_size,
        seq_len=max_seq_len,
        dtype_bytes=dtype_bytes,
    )

    fits = mem["total_gb"] <= available
    utilization = (mem["total_gb"] / available) * 100

    return {
        "fits": fits,
        "required_gb": mem["total_gb"],
        "available_gb": available,
        "utilization_pct": utilization,
    }
In[25]:
Code
## Test configurations on common GPUs
gpus = {"RTX 4090": 24, "A100 40GB": 40, "A100 80GB": 80, "H100 80GB": 80}

test_configs = [
    {"model": "LLaMA 7B", "batch": 1, "seq": 4096},
    {"model": "LLaMA 7B", "batch": 8, "seq": 4096},
    {"model": "LLaMA 2 70B", "batch": 1, "seq": 8192},
]

feasibility_results = []
for config in test_configs:
    model = models[config["model"]]
    config_results = []
    for gpu_name, gpu_mem in gpus.items():
        result = can_fit_on_gpu(gpu_mem, model, config["batch"], config["seq"])
        config_results.append((gpu_name, result))
    feasibility_results.append((config, config_results))
Out[26]:
Console
Deployment Feasibility Check
======================================================================

LLaMA 7B | batch=1 | seq=4096
--------------------------------------------------
  RTX 4090: ✓ (16.8/21.6 GB)
  A100 40GB: ✓ (16.8/36.0 GB)
  A100 80GB: ✓ (16.8/72.0 GB)
  H100 80GB: ✓ (16.8/72.0 GB)

LLaMA 7B | batch=8 | seq=4096
--------------------------------------------------
  RTX 4090: ✗ (30.8/21.6 GB)
  A100 40GB: ✓ (30.8/36.0 GB)
  A100 80GB: ✓ (30.8/72.0 GB)
  H100 80GB: ✓ (30.8/72.0 GB)

LLaMA 2 70B | batch=1 | seq=8192
--------------------------------------------------
  RTX 4090: ✗ (146.4/21.6 GB)
  A100 40GB: ✗ (146.4/36.0 GB)
  A100 80GB: ✗ (146.4/72.0 GB)
  H100 80GB: ✗ (146.4/72.0 GB)
Out[27]:
Visualization
Heatmap showing memory utilization percentages for different model configurations across GPU types.
Heatmap showing GPU memory utilization across three model-configuration combinations and four GPU types (RTX 4090, A100 40GB, A100 80GB, H100 80GB), with green indicating safe utilization under 90%, yellow indicating tight packing between 90-100%, and red indicating out-of-memory conditions.

These results illustrate the hardware requirements for different deployment scenarios. LLaMA 7B fits comfortably on all tested GPUs for single-sequence inference at 4K context, with even the 24GB RTX 4090 providing adequate headroom. However, scaling to batch size 8 pushes requirements beyond consumer hardware, requiring datacenter GPUs like the A100. The 70B model at 8K context does not fit on any single tested GPU: its FP16 weights alone occupy 130 GB, so models of this scale require either multi-GPU setups or aggressive optimization techniques like quantization and cache compression. These constraints directly inform deployment decisions: smaller models offer flexibility across hardware tiers, while frontier models demand specialized infrastructure.

The feasibility analysis demonstrates the practical constraints of different GPU configurations. For production deployment, understanding these memory limits is essential for selecting appropriate hardware.

The memory requirements of KV caching create several practical challenges:

  • Long context support requires substantial memory investment. A 128K context window for LLaMA 7B requires 64 GB of KV cache alone, exceeding most consumer GPU memory. This explains why long-context models often require cloud deployment or specialized hardware.

  • Batch size is constrained by available memory. High-throughput serving, handling many concurrent requests, requires either smaller models, shorter contexts, or memory optimization techniques. The tradeoff between latency (single-sequence speed) and throughput (requests per second) is fundamentally a memory allocation problem.

  • Linear scaling creates predictable but inflexible costs. Unlike attention computation, where techniques like FlashAttention reduce memory overhead, the KV cache is fundamentally required for correct generation. Every token's keys and values must be stored somewhere.

These constraints have driven significant research into cache optimization. The next chapter explores Paged Attention, which applies virtual memory concepts to manage KV cache more efficiently. Subsequent chapters cover cache compression techniques that reduce memory by quantizing or pruning cached values. Together, these techniques enable longer contexts and larger batches while keeping memory requirements manageable.

Key Parameters

The following parameters are essential for KV cache memory calculation (a usage example combining them follows the list):

  • num_layers: Number of transformer layers in the model. Each layer maintains its own key and value cache, so cache memory scales linearly with the number of layers.
  • num_kv_heads: Number of key-value attention heads, also written as $H_{\text{kv}}$. In standard Multi-Head Attention, this equals the number of query heads. Grouped Query Attention uses fewer KV heads to reduce memory.
  • head_dim: Dimension of each attention head, also written as $D_h$. Typically 64, 96, or 128. Combined with the number of heads, this determines the model dimension.
  • batch_size: Number of sequences processed simultaneously. Each sequence maintains its own independent cache, so memory scales linearly with batch size.
  • seq_len: Context length in tokens. Cache memory grows linearly as more tokens are processed and cached.
  • dtype_bytes: Bytes per element based on data type. FP16/BF16 use 2 bytes, FP32 uses 4 bytes, INT8 uses 1 byte, INT4 uses 0.5 bytes.
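
Tying these parameters together, here is a usage example of the calculate_kv_cache_memory function from earlier in this chapter, with a hypothetical GQA configuration chosen purely for illustration:

cache = calculate_kv_cache_memory(
    num_layers=48,    # transformer layers
    num_kv_heads=8,   # GQA: fewer KV heads than query heads
    head_dim=128,     # dimension of each attention head
    batch_size=4,     # four concurrent sequences
    seq_len=8192,     # 8K-token context
    dtype_bytes=2,    # FP16/BF16
)
print(f"KV cache: {cache['GB']:.2f} GB")  # 6.00 GB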

Summary

KV cache memory follows a simple formula: $2 \times L \times B \times T \times H_{\text{kv}} \times D_h \times \text{bytes}$. This linear scaling with batch size and sequence length means the cache can easily exceed model weights for long-context or batched inference.

Key takeaways:

  • Model size scales differently than cache size: A 10x larger model does not require 10x more cache when using GQA
  • Grouped Query Attention is crucial: LLaMA 2's GQA reduces 70B cache by 8x compared to standard MHA
  • Long contexts are memory-expensive: Beyond roughly 27K tokens, LLaMA 7B's KV cache exceeds its model weight memory
  • Generation is memory-bound: GPU compute utilization is limited by memory bandwidth when reading cached keys and values
  • Deployment planning requires careful calculation: Before serving a model, estimate total memory including weights, cache, and overhead

Understanding these memory dynamics is essential. The cache size calculator provides a foundation for capacity planning. The bottleneck analysis explains why memory optimization techniques, covered in upcoming chapters, are critical for efficient inference.

