Inference Scaling: Optimizing LLMs for Production Deployment

Michael Brenndoerfer · October 27, 2025 · 42 min read

Learn why Chinchilla-optimal models are inefficient for deployment. Master over-training strategies and cost modeling for inference-heavy LLM systems.

Inference Scaling

The scaling laws we explored in previous chapters, from Kaplan's initial discoveries to Chinchilla's compute-optimal ratios, share a common assumption: they optimize for training efficiency. Given a fixed compute budget, how do you allocate it between model size and training data to minimize loss? This framing makes sense from a research perspective, where training costs dominate and the goal is to push capabilities forward as quickly as possible.

But production deployments face a fundamentally different optimization problem. A model is trained once, then serves inference requests continuously, potentially billions of times over its operational lifetime. When inference compute dwarfs training compute by orders of magnitude, optimizing for training efficiency can be precisely the wrong strategy.

This chapter examines how to think about scaling when inference costs matter. We'll see why the compute-optimal models from Chinchilla are actually inefficient for deployment, explore the practice of "over-training" smaller models for inference efficiency, and develop frameworks for modeling total deployment costs. The key insight is that the optimal model for training and the optimal model for serving are often very different. Understanding this distinction is essential for practical LLM deployment.

The Training-Inference Asymmetry

Consider the lifecycle of a large language model. Training is an intensive, one-time process that might consume millions of GPU-hours over several weeks. Once complete, the model enters production, where it handles inference requests: generating text, answering questions, or processing documents. This serving phase can extend for months or years.

The asymmetry becomes stark when you examine actual compute consumption. To appreciate why this matters so much: think about what happens each time you query a language model. Every single token the model generates requires a forward pass through the entire network. Billions of parameters are activated, multiplied, and transformed. When a model serves millions of users, each generating hundreds or thousands of tokens per interaction, these individual operations compound into astronomical compute requirements.

To quantify this rigorously, we can calculate the total FLOPs consumed during inference and compare it to training cost. A training run might use $C_{train} = 10^{24}$ FLOPs (roughly what it takes to train a 70B parameter model on 2 trillion tokens). If that model then serves 1 billion inference requests, each generating an average of 500 tokens, the total inference compute is:

$$C_{inference} = N_{requests} \times L_{output} \times \text{FLOPs per token}$$

where:

  • $C_{inference}$: total compute required for all inference requests over the model's deployment
  • $N_{requests}$: number of inference requests served
  • $L_{output}$: average number of output tokens generated per request
  • $\text{FLOPs per token}$: floating-point operations required to generate one token

This formula captures a fundamental insight about deployment economics: inference cost grows linearly with each of these three factors. More users, longer responses, or larger models all increase the compute burden proportionally. The multiplicative structure means that even modest increases in any factor can dramatically shift the overall compute balance.

For a forward pass through an $N$-parameter model, inference requires approximately $2N$ FLOPs per token (one multiply-add per parameter). This $2N$ figure deserves attention because it reveals something important about transformer efficiency: every parameter in your model must be activated for every token you generate. There's no shortcut, no way to skip over unused portions of the network. The full weight matrix participates in every prediction. For our 70B model:

$$C_{inference} = 10^9 \times 500 \times (2 \times 70 \times 10^9) = 7 \times 10^{22} \text{ FLOPs}$$

This is about 7% of training compute: significant, but training still dominates in this scenario. At this scale, the traditional wisdom of optimizing for training efficiency still holds. However, if the model serves 100 billion requests:

$$C_{inference} = 10^{11} \times 500 \times (2 \times 70 \times 10^9) = 7 \times 10^{24} \text{ FLOPs}$$

Now inference compute exceeds training compute by a factor of 7. This crossover point, where inference begins to dominate, is the key threshold that determines whether training-optimal or inference-optimal strategies should guide model selection. Understanding where your deployment falls relative to this threshold is crucial for making economically sound decisions about model architecture and training strategy.
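The arithmetic behind this crossover is simple enough to script. The sketch below reproduces the numbers from this section; the request volumes are illustrative, and the 2N FLOPs-per-token approximation is the same one used throughout this chapter.

def inference_flops(n_params: float, n_requests: float, tokens_per_request: float) -> float:
    """Total inference FLOPs, using the 2N FLOPs-per-token approximation."""
    return n_requests * tokens_per_request * 2 * n_params


C_TRAIN = 1e24            # training budget from the text above
N_PARAMS = 70e9           # 70B-parameter model
TOKENS_PER_REQUEST = 500  # average output length assumed in the text

for requests in [1e9, 1e10, 1e11]:
    c_inf = inference_flops(N_PARAMS, requests, TOKENS_PER_REQUEST)
    print(f"{requests:.0e} requests: {c_inf:.1e} inference FLOPs "
          f"({c_inf / C_TRAIN:.2f}x training)")

# Request volume at which inference compute equals training compute
crossover_requests = C_TRAIN / (TOKENS_PER_REQUEST * 2 * N_PARAMS)
print(f"Crossover at roughly {crossover_requests:.1e} requests")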

Out[2]:
Visualization
Training vs inference compute as request volume increases. The horizontal line shows one-time training cost, while inference compute grows linearly with requests. The crossover point determines when inference-optimal strategies become necessary.

For the most widely deployed models serving trillions of requests, inference can consume 100× or more compute than training. Consider a model like GPT-4 or Claude serving millions of users worldwide, each engaging in multiple conversations per day over the course of years. The cumulative inference load dwarfs even the most expensive training runs.

This asymmetry fundamentally changes the optimization calculus. When training dominates, you want the most capable model you can train within your compute budget. When inference dominates, you want the smallest model that achieves acceptable quality, even if training it costs more per unit of capability gained. This shift in perspective—from training-centric to deployment-centric optimization—represents one of the most important practical insights in applied machine learning.

Why Chinchilla is Training-Optimal, Not Deployment-Optimal

Recall from the Chinchilla scaling laws chapter that compute-optimal training allocates compute roughly equally between model size and data, following the relationship $D \approx 20N$, where $D$ is tokens trained and $N$ is parameters. This minimizes training FLOPs to reach a target loss. The elegance of this ratio, roughly 20 tokens per parameter, emerged from careful empirical study and provides a clear recipe for efficient training.

But the Chinchilla framework optimizes for a specific objective: minimizing the compute required to achieve the lowest possible loss during training. It treats training as the end goal rather than as a means to an end. When we recognize that the true objective is efficient deployment, not efficient training, the optimal strategy shifts dramatically.

Consider two models trained to the same final loss:

Model A (Chinchilla-optimal):

  • 70B parameters
  • 1.4T tokens
  • Training compute: $C_A = 6ND = 6 \times 70 \times 10^9 \times 1.4 \times 10^{12} \approx 5.9 \times 10^{23}$ FLOPs

The $6ND$ approximation comes from the fact that each training token requires a forward pass ($2N$ operations) plus a backward pass (approximately $4N$ operations for computing gradients), totaling $6N$ operations per token across $D$ tokens. This 3-to-1 ratio between training and inference compute per token is another fundamental constant worth internalizing. It explains why training is so much more expensive than inference on a per-token basis, but also why inference costs accumulate so quickly when serving many users.

where:

  • $N$: number of model parameters
  • $D$: number of training tokens
  • $6ND$: total FLOPs, accounting for forward pass (2 operations per parameter per token) plus backward pass (roughly 4 operations per parameter per token)

Model B (Over-trained):

  • 7B parameters
  • 4T tokens (requires longer training to match Model A's loss)
  • Training compute: $C_B = 6 \times 7 \times 10^9 \times 4 \times 10^{12} = 1.68 \times 10^{23}$ FLOPs

Model B actually uses less training compute to reach the same loss. How can this be? The key is that Chinchilla optimality assumes you're pushing to the frontier of what's achievable at a given compute level. If you're willing to accept a specific target loss rather than minimizing loss for fixed compute, smaller models trained longer can sometimes be more efficient even for training.

This counterintuitive result arises because Chinchilla optimality answers a different question than the one we're asking. Chinchilla asks: "Given a fixed compute budget, how do I get the lowest possible loss?" The deployment question asks: "Given a target quality level, how do I minimize total cost over the model's lifetime?" These are fundamentally different optimization problems with different optimal solutions.

The real advantage of smaller models emerges at inference time. Model A requires $2 \times 70 \times 10^9 = 1.4 \times 10^{11}$ FLOPs per output token, while Model B requires $2 \times 7 \times 10^9 = 1.4 \times 10^{10}$ FLOPs per token, 10× fewer inference FLOPs for equivalent quality.

This 10× reduction in per-token cost compounds with every inference request. Model B saves $1.26 \times 10^{11}$ FLOPs per token generated; over a billion requests averaging 500 tokens each, that adds up to roughly $6.3 \times 10^{22}$ FLOPs. This compute savings translates directly to reduced electricity bills, fewer GPUs required, and lower operational costs.

For high-volume deployments, this 10× inference efficiency can translate to massive cost savings that dwarf any training cost difference.
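A few lines of arithmetic make the cumulative savings concrete. This sketch uses the Model A and Model B figures from above with an illustrative request volume:

N_A, N_B = 70e9, 7e9           # Model A (Chinchilla-optimal) vs. Model B (over-trained)
REQUESTS = 1e9                 # illustrative deployment volume
TOKENS_PER_REQUEST = 500

savings_per_token = 2 * N_A - 2 * N_B          # FLOPs saved per generated token
total_tokens = REQUESTS * TOKENS_PER_REQUEST

print(f"Savings per token: {savings_per_token:.2e} FLOPs")
print(f"Savings over {REQUESTS:.0e} requests: {savings_per_token * total_tokens:.2e} FLOPs")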

Out[3]:
Visualization
Training compute comparison: Chinchilla-optimal 70B vs over-trained 7B model.
Inference cost per token comparison showing 10× cost reduction with smaller model.

Over-Training for Deployment Efficiency

"Over-training" refers to training a model on more tokens than Chinchilla optimality would suggest. The term is slightly misleading, as it's only "over" training relative to compute-optimal. It's exactly the right amount of training for inference-optimal deployment. Think of it not as excessive training but as investing additional training compute to reduce inference costs later.

The LLaMA Philosophy

The LLaMA model family explicitly adopted this philosophy. The LLaMA paper states: "We only consider the performance of the resulting model, not the training compute budget." They trained 7B, 13B, 33B, and 65B parameter models on up to 1.4 trillion tokens. This is far more than Chinchilla ratios would suggest for the smaller models.

This statement represents a philosophical break from the Chinchilla paradigm. Rather than asking how to minimize training compute, Meta's researchers asked how to maximize the value delivered per inference FLOP. This reframing led to models that were "inefficient" to train but were much more efficient to deploy.

For the 7B model, Chinchilla optimality would suggest around 140B tokens. Training on 1.4T tokens represents 10× "over-training." This extra training compute is an investment that pays dividends at inference time through reduced per-token cost. Every additional training token helps the smaller model close the capability gap with larger alternatives, making it suitable for tasks that would otherwise require a more expensive model.

The Over-Training Ratio

We can quantify over-training with a simple ratio that captures how far a model departs from Chinchilla-optimal training:

$$\text{Over-training ratio} = \frac{D_{actual}}{D_{optimal}} = \frac{D_{actual}}{20N}$$

where:

  • $D_{actual}$: the number of tokens the model is actually trained on
  • $D_{optimal}$: the Chinchilla-optimal number of tokens for a model of size $N$
  • $N$: the number of model parameters
  • $20N$: the Chinchilla approximation that optimal training uses roughly 20 tokens per parameter

This ratio provides a simple metric for characterizing how inference-optimized a model's training regime is. A ratio of 1 means the model was trained according to Chinchilla-optimal principles. Ratios greater than 1 indicate over-training for inference efficiency, with higher values representing more aggressive investment in training compute to reduce inference costs.
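The calculation is a one-liner; the sketch below applies it to the LLaMA-7B figures quoted earlier.

def over_training_ratio(params: float, training_tokens: float) -> float:
    """Actual training tokens divided by the ~20 tokens-per-parameter Chinchilla rule."""
    return training_tokens / (20 * params)


# LLaMA-7B figures quoted above: 7B parameters, 1.4T training tokens
print(f"LLaMA-7B over-training ratio: {over_training_ratio(7e9, 1.4e12):.0f}x")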

The table below shows typical over-training ratios for inference-optimized models:

Over-training ratios for inference-optimized models compared to Chinchilla-optimal training.
Model        Parameters   Tokens   Chinchilla Optimal   Over-training Ratio
LLaMA-7B     7B           1.4T     140B                 10×
LLaMA-13B    13B          1.4T     260B                 5.4×
Mistral-7B   7B           ~8T      140B                 ~57×
Phi-2        2.7B         1.4T     54B                  26×

Modern inference-optimized models routinely use over-training ratios of 10× to 50× or more. The trend toward higher ratios reflects growing recognition that inference costs dominate for production systems. Mistral's 57× over-training ratio on their 7B model demonstrates how far this philosophy has been pushed: they invested roughly 57 times more training compute than Chinchilla would recommend, betting that the resulting inference efficiency would more than compensate.

Out[4]:
Visualization
Over-training ratios for inference-optimized models. Chinchilla-optimal corresponds to a ratio of 1 (dashed line). Modern models routinely train on 10-50× more data than Chinchilla prescribes, investing in training compute to reduce inference costs.

When Over-Training Pays Off

Over-training makes economic sense when total inference compute exceeds the extra training compute required. Let's formalize this break-even analysis, which provides the mathematical foundation for deployment decisions.

Suppose you have two options:

  • Option 1: Train a large model $N_1$ on $D_1$ tokens (Chinchilla-optimal)
  • Option 2: Train a smaller model $N_2 < N_1$ on $D_2 > D_1$ tokens to match Option 1's loss

The additional training cost for Option 2 is:

$$\Delta C_{train} = 6N_2D_2 - 6N_1D_1$$

where:

  • $\Delta C_{train}$: the difference in training compute between the two options
  • $N_1, N_2$: parameter counts for Option 1 (larger) and Option 2 (smaller) respectively
  • $D_1, D_2$: training tokens for Option 1 and Option 2 respectively

This difference can be positive or negative depending on the specific model sizes and training durations. If the smaller model requires enough additional training to offset the reduced parameter count, $\Delta C_{train}$ will be positive, representing an upfront investment that must be recouped through inference savings.

The inference savings per token generated are:

$$\Delta C_{inference} = 2(N_1 - N_2)$$

Here, $\Delta C_{inference}$ represents the FLOPs saved per output token by using the smaller model (since each forward pass requires $2N$ FLOPs per token).

This quantity is always positive when $N_1 > N_2$, meaning smaller models always save inference compute regardless of how they were trained. The magnitude of these savings grows linearly with the difference in model sizes: a 63B parameter reduction means 63 billion fewer multiply-add operations (about $1.26 \times 10^{11}$ fewer FLOPs) per token.

Over-training pays off when the total inference token count $T_{inference}$ satisfies:

$$T_{inference} \times \Delta C_{inference} > \Delta C_{train}$$

$$T_{inference} > \frac{6N_2D_2 - 6N_1D_1}{2(N_1 - N_2)}$$

This break-even analysis determines whether the inference savings justify the training investment. The inequality has a clear interpretation: the left side represents cumulative inference savings over the model's deployment, while the right side represents the upfront training cost differential. When expected inference volume exceeds the break-even point, the smaller over-trained model becomes the economically rational choice.

Worked Example: Break-Even Analysis

Let's work through a concrete comparison between a 70B Chinchilla-optimal model and a 7B over-trained model, both achieving similar loss. This example illustrates the practical magnitude of the cost differences and helps build intuition for deployment decisions.

Option 1: 70B model, 1.4T tokens

  • Training: $6 \times 70\text{B} \times 1.4\text{T} = 5.9 \times 10^{23}$ FLOPs
  • Inference: $2 \times 70\text{B} = 1.4 \times 10^{11}$ FLOPs/token

Option 2: 7B model, 4T tokens (assume this matches 70B model's loss)

  • Training: $6 \times 7\text{B} \times 4\text{T} = 1.68 \times 10^{23}$ FLOPs
  • Inference: $2 \times 7\text{B} = 1.4 \times 10^{10}$ FLOPs/token

In this case, Option 2 actually costs less to train. This can happen when targeting a specific loss level rather than frontier performance. The inference savings are:

$$\Delta C = 1.4 \times 10^{11} - 1.4 \times 10^{10} = 1.26 \times 10^{11} \text{ FLOPs/token}$$

Every inference token saves $1.26 \times 10^{11}$ FLOPs, and there's no training penalty, only savings. This represents a best-case scenario where the over-trained model wins on both training and inference efficiency, highlighting how the conventional wisdom about "larger models being better" can be misleading for deployment contexts.

Even when over-training does cost more, break-even typically occurs at modest inference volumes. Suppose the 7B model instead required $10^{24}$ FLOPs to train, an additional $4.1 \times 10^{23}$ FLOPs over the $5.9 \times 10^{23}$ FLOPs of the 70B option. The break-even point would be:

$$T_{\text{break-even}} = \frac{4.1 \times 10^{23}}{1.26 \times 10^{11}} \approx 3.3 \times 10^{12} \text{ tokens}$$

At 500 tokens per request, that's about 6.5 billion inference requests, easily exceeded by any widely-deployed model. To put this in perspective, a service handling 10 million requests per day would reach this break-even point in under two years, and consumer-scale deployments serving hundreds of millions of requests per day would reach it within a couple of months of launch.
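As a quick check of this arithmetic, the sketch below evaluates the break-even formula for the hypothetical numbers above; the helper name and inputs are illustrative.

def break_even_tokens(delta_train_flops: float, n_large: float, n_small: float) -> float:
    """Inference tokens needed before the extra training compute is recouped."""
    savings_per_token = 2 * (n_large - n_small)
    return delta_train_flops / savings_per_token


# Hypothetical from the text: the 7B model costs 1e24 FLOPs to train,
# versus 5.9e23 FLOPs for the Chinchilla-optimal 70B option.
delta_train = 1e24 - 5.9e23
t_break_even = break_even_tokens(delta_train, n_large=70e9, n_small=7e9)
print(f"Break-even: {t_break_even:.1e} tokens "
      f"(~{t_break_even / 500:.1e} requests at 500 tokens each)")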

Out[5]:
Visualization
Break-even analysis for over-training investment. The cumulative inference savings (blue) grow linearly with tokens generated. Once savings exceed the additional training investment (red dashed line), the over-trained model becomes more economical.

Inference-Optimal Model Selection

Given expected inference demand, we can determine the optimal model size directly. This requires understanding how loss scales with both model size and training data, and then reformulating the optimization problem to account for total lifecycle costs rather than just training costs.

The Inference-Optimal Scaling Law

Building on the Chinchilla loss function from earlier chapters:

$$L(N, D) = \frac{A}{N^\alpha} + \frac{B}{D^\beta} + E$$

where:

  • $L(N, D)$: the model's loss as a function of size and training data
  • $N$: the number of model parameters
  • $D$: the number of training tokens
  • $A, B$: empirically fitted scaling coefficients
  • $\alpha, \beta$: empirically fitted exponents (typically around 0.34 and 0.28)
  • $E$: the irreducible error floor representing the entropy of natural language

This functional form captures two distinct pathways to reducing loss: adding more parameters (the first term) or training on more data (the second term). The irreducible error floor $E$ represents the fundamental uncertainty in language; even a perfect model cannot predict with certainty what word comes next in human text. Understanding this structure is essential for reasoning about the trade-offs between model size and training duration.

We can reformulate the optimization problem for inference-dominant scenarios. Rather than minimizing loss for a fixed training budget (the Chinchilla approach), we want to minimize loss for a fixed total budget that includes both training and expected inference. This reformulation shifts the optimization target from training efficiency to deployment efficiency.

The total compute over the model's lifecycle is:

$$C_{total} = C_{train} + C_{inference} = 6ND + 2N \cdot T_{inference}$$

where:

  • $C_{total}$: total compute over the model's entire lifecycle
  • $C_{train} = 6ND$: training compute (approximately 6 FLOPs per parameter per token)
  • $C_{inference} = 2N \cdot T_{inference}$: total inference compute (2 FLOPs per parameter per token generated)
  • $T_{inference}$: expected total tokens generated during the model's deployment

This formulation captures the key insight: training cost is paid once, but inference cost accumulates with every token generated. The training term depends on both $N$ and $D$ (model size times data), while the inference term depends on $N$ and $T_{inference}$ (model size times usage). This asymmetric structure, where model size appears in both terms but data appears only in training, is what drives the preference for smaller models in high-inference scenarios.

To minimize loss subject to a total compute budget, we take derivatives and solve. The resulting optimal allocation shifts toward smaller models as expected inference volume increases:

$$N^* \propto C_{total}^{a} \cdot T_{inference}^{-b}$$

where:

  • $N^*$: the optimal model size (number of parameters) for the given deployment scenario
  • $C_{total}$: the total compute budget available across training and inference
  • $T_{inference}$: expected total tokens to be generated during deployment
  • $a, b$: positive exponents derived from the scaling law parameters $\alpha$ and $\beta$. The exact values depend on the specific scaling law coefficients, but typical values yield $a \approx 0.5$ and $b \approx 0.3$

The negative exponent on $T_{inference}$ is the key result: it shows mathematically that optimal model size decreases as expected inference volume increases. This relationship emerges directly from the calculus of constrained optimization. When inference tokens grow, the cost of each parameter (which must be activated for every token) grows proportionally, pushing the optimum toward fewer parameters.

The intuition is clear: as inference demand grows, the optimal model size shrinks because each parameter incurs a cost at every inference, while training cost is amortized. A parameter that costs billions of FLOPs to train well might seem expensive, but if that parameter is then used trillions of times at inference, the training cost becomes negligible compared to the cumulative inference burden.
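We can make this intuition concrete with a small numerical experiment. The sketch below grid-searches the model size that minimizes loss under a fixed lifecycle budget covering both training and serving. The scaling-law coefficients are the published Chinchilla fits used purely for illustration, and the budget and inference volumes are assumptions, so treat the exact numbers as indicative only.

import numpy as np

# Illustrative Chinchilla-style coefficients (Hoffmann et al. fits); assumed values.
A, B, ALPHA, BETA, E = 406.4, 410.7, 0.34, 0.28, 1.69


def loss(n: float, d: float) -> float:
    """Chinchilla-form loss as a function of parameters n and training tokens d."""
    return A / n**ALPHA + B / d**BETA + E


def lifecycle_optimal_size(c_total: float, t_inference: float):
    """Grid-search the model size that minimizes loss for a fixed lifecycle budget.

    Budget constraint: 6*N*D (training) + 2*N*t_inference (serving) <= c_total.
    """
    best_n, best_loss = None, float("inf")
    for n in np.logspace(9, 12, 300):                  # 1B to 1T parameters
        d = (c_total - 2 * n * t_inference) / (6 * n)  # tokens affordable after serving cost
        if d <= 0:                                     # serving alone exhausts the budget
            continue
        l = loss(n, d)
        if l < best_loss:
            best_n, best_loss = n, l
    return best_n, best_loss


# Same budget, increasing expected inference volume: the optimal size shrinks.
for t_inf in [0.0, 1e12, 1e14]:
    n_opt, l_opt = lifecycle_optimal_size(c_total=1e24, t_inference=t_inf)
    print(f"T_inference = {t_inf:.0e}: optimal N ~ {n_opt:.2e} params (loss {l_opt:.3f})")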

Practical Model Size Selection

For deployment planning, a simplified heuristic helps guide model selection. If you expect to generate $T$ inference tokens over the model's lifetime, compare the inference compute to training compute:

$$\text{Inference factor} = \frac{2N \cdot T}{6ND} = \frac{T}{3D}$$

where:

  • $T$: total inference tokens expected over the model's deployment lifetime
  • $D$: number of tokens used for training
  • $N$: model parameters (which cancel out in the simplification)

Notice that $N$ cancels out. The inference factor depends only on the ratio of inference tokens to training tokens. This cancellation is mathematically elegant and practically useful because you can evaluate whether inference will dominate without even knowing the model size. When this factor exceeds 1, inference compute dominates. The larger this factor, the more aggressively you should favor smaller, over-trained models.

The inference factor provides a simple decision rule. Values below 1 indicate training-dominated scenarios where Chinchilla-optimal sizing remains appropriate. Values above 1 signal that inference costs will outweigh training costs, justifying investment in smaller, more heavily trained models. Values above 10 suggest aggressive over-training is warranted. Values above 100 indicate that inference efficiency should be the primary design consideration.

In[6]:
Code
def compute_inference_factor(
    training_tokens: float, inference_tokens: float
) -> float:
    """Calculate ratio of inference to training compute."""
    return inference_tokens / (3 * training_tokens)


def recommend_model_strategy(
    training_tokens: float, inference_tokens: float
) -> str:
    """Recommend training strategy based on expected inference load."""
    factor = compute_inference_factor(training_tokens, inference_tokens)

    if factor < 0.1:
        return "Training-dominated: use Chinchilla-optimal sizing"
    elif factor < 1:
        return "Balanced: moderate over-training beneficial"
    elif factor < 10:
        return "Inference-heavy: significant over-training recommended"
    else:
        return "Inference-dominated: maximize over-training ratio"
Out[7]:
Console
Deployment Scenario Analysis
=================================================================

Research prototype:
  Training tokens: 1e+12
  Inference tokens: 1e+11
  Inference factor: 0.0x
  Recommendation: Training-dominated: use Chinchilla-optimal sizing

Internal tool:
  Training tokens: 1e+12
  Inference tokens: 1e+13
  Inference factor: 3.3x
  Recommendation: Inference-heavy: significant over-training recommended

Consumer product:
  Training tokens: 1e+12
  Inference tokens: 1e+15
  Inference factor: 333.3x
  Recommendation: Inference-dominated: maximize over-training ratio

Major API service:
  Training tokens: 1e+12
  Inference tokens: 1e+17
  Inference factor: 33333.3x
  Recommendation: Inference-dominated: maximize over-training ratio

The output reveals how dramatically the optimal strategy shifts based on expected usage. The research prototype shows an inference factor of just 0.03×, meaning training compute still dominates, so Chinchilla-optimal sizing remains appropriate. But as we move to consumer products (333×) and major API services (33,333×), inference compute overwhelms training by orders of magnitude. At these scales, every parameter becomes a recurring cost multiplied across trillions of tokens, making aggressive over-training the only economically rational choice. A research prototype with limited deployment can safely follow Chinchilla scaling, while a consumer-facing API should aggressively over-train smaller models.

Out[8]:
Visualization
Inference factor across deployment scenarios (log scale). The dashed line at factor=1 marks the transition from training-dominated to inference-dominated regimes. Consumer products and API services fall deep into inference-dominated territory.

Deployment Cost Modeling

Real deployment decisions involve more than FLOP counting. Hardware costs, memory constraints, latency requirements, and energy consumption all factor into the total cost of ownership. Moving from theoretical compute analysis to practical cost modeling requires understanding how FLOPs translate to dollars and how hardware constraints shape achievable throughput.

Components of Deployment Cost

A comprehensive cost model includes:

  • Hardware amortization: The cost of training and inference hardware amortized over useful life
  • Compute costs: Electricity and cooling, proportional to FLOPs executed
  • Memory costs: Larger models require more expensive hardware configurations
  • Latency penalties: Slower inference may reduce user engagement or transaction value
  • Opportunity costs: Resources committed to one model can't serve others

The Cost-Per-Token Model

For a concrete cost analysis, we model the cost to generate one inference token. This simple economic model divides fixed hourly costs by throughput: fewer tokens per hour means each token must bear a larger share of the infrastructure cost:

$$\text{Cost per token} = \frac{\text{Hardware cost/hour} + \text{Energy cost/hour}}{\text{Tokens per hour}}$$

This formula reveals the fundamental economic structure of inference: you're paying for GPU time, and the question is how many tokens you can squeeze out of each hour of that expensive hardware. Higher throughput means lower per-token costs, which is why inference optimization focuses so heavily on maximizing tokens per second.

Tokens per hour depends on model throughput, which is constrained by either compute or memory bandwidth:

$$\text{Throughput} = \min\left(\frac{\text{FLOPS capacity}}{2N},\ \frac{\text{Memory bandwidth}}{N \times \text{bytes per param}}\right)$$

where:

  • $\text{Throughput}$: tokens generated per second
  • $\text{FLOPS capacity}$: the GPU's floating-point operations per second
  • $2N$: FLOPs required per token (two operations per parameter for the forward pass)
  • $\text{Memory bandwidth}$: bytes per second the GPU can read from memory
  • $\text{bytes per param}$: memory footprint per parameter (e.g., 2 bytes for FP16), so $N \times \text{bytes per param}$ is the total weight memory that must be read for each generated token

The minimum captures a fundamental hardware constraint: generation speed is limited by whichever resource (compute or memory bandwidth) is exhausted first. This bottleneck analysis is crucial for understanding real-world inference performance.

The first term ($\text{FLOPS capacity}/2N$) represents the theoretical maximum if compute were the only constraint, dividing total available operations by operations needed per token. This would be the limit if weights could be loaded from memory instantaneously.

The second term captures the memory bottleneck: each token generation requires loading all model weights from memory, so throughput cannot exceed memory bandwidth divided by model size. The autoregressive generation process must stream the entire weight matrix through the GPU's memory bus for every single token produced.

For large models, memory bandwidth typically dominates, making smaller models even more advantageous than pure FLOP analysis suggests. Modern GPUs have tremendous compute capacity but relatively limited memory bandwidth. An A100 can perform 312 trillion floating-point operations per second but can only move 2 trillion bytes per second from memory. For large models, the weights cannot be loaded fast enough to keep the compute units busy.

Implementing a Cost Calculator

The following code implements a deployment cost model incorporating these factors:

In[9]:
Code
from dataclasses import dataclass
from typing import Optional


@dataclass
class HardwareSpec:
    """Specifications for inference hardware."""

    name: str
    flops: float  # FP16 FLOPS
    memory_bandwidth: float  # bytes/second
    cost_per_hour: float  # USD
    power_watts: float  # for energy cost calculation


@dataclass
class ModelSpec:
    """Model specifications for cost calculation."""

    name: str
    parameters: float  # number of parameters
    bytes_per_param: float = 2.0  # FP16 = 2 bytes

    @property
    def memory_bytes(self) -> float:
        return self.parameters * self.bytes_per_param


@dataclass
class DeploymentScenario:
    """Expected deployment characteristics."""

    tokens_per_request: float
    requests_per_month: float
    target_latency_ms: Optional[float] = None
In[10]:
Code
def calculate_throughput(model: ModelSpec, hardware: HardwareSpec) -> float:
    """Calculate max tokens per second (batch size 1)."""
    # Compute-bound throughput
    flops_per_token = 2 * model.parameters
    compute_throughput = hardware.flops / flops_per_token

    # Memory-bound throughput
    bytes_per_token = model.memory_bytes  # Load all weights per token
    memory_throughput = hardware.memory_bandwidth / bytes_per_token

    # Actual throughput is minimum of both
    return min(compute_throughput, memory_throughput)


def calculate_cost_per_million_tokens(
    model: ModelSpec, hardware: HardwareSpec, electricity_cost_kwh: float = 0.10
) -> dict:
    """Calculate cost breakdown per million output tokens."""
    throughput = calculate_throughput(model, hardware)
    tokens_per_hour = throughput * 3600

    # Hardware cost per million tokens
    hardware_cost = (hardware.cost_per_hour / tokens_per_hour) * 1e6

    # Energy cost per million tokens
    hours_per_million = 1e6 / tokens_per_hour
    kwh_per_million = (hardware.power_watts / 1000) * hours_per_million
    energy_cost = kwh_per_million * electricity_cost_kwh

    return {
        "throughput_tokens_sec": throughput,
        "hardware_cost_per_M": hardware_cost,
        "energy_cost_per_M": energy_cost,
        "total_cost_per_M": hardware_cost + energy_cost,
        "bottleneck": "memory"
        if throughput == hardware.memory_bandwidth / model.memory_bytes
        else "compute",
    }
Out[11]:
Console
Inference Cost Analysis (A100-80GB)
======================================================================
Model      Throughput   Hardware     Energy      Total Bottleneck
              (tok/s)  ($/M tok)  ($/M tok)  ($/M tok)
----------------------------------------------------------------------
7B                143 $   7.7778 $   0.0778 $   7.8556     memory
13B                77 $  14.4444 $   0.1444 $  14.5889     memory
34B                29 $  37.7778 $   0.3778 $  38.1556     memory
70B                14 $  77.7778 $   0.7778 $  78.5556     memory

The analysis shows inference cost per token scaling in direct proportion to model size in this memory-bound, single-GPU setting (in practice, the largest models also require multi-GPU serving, which adds further overhead). The 7B model achieves roughly 143 tokens per second at batch size 1 with a total cost around $7.86 per million tokens, while the 70B model drops to about 14 tokens per second at approximately $78.56 per million tokens. Notably, all models in this analysis are memory-bound rather than compute-bound, meaning the limiting factor is how fast weights can be loaded from GPU memory, not the GPU's raw computational capacity. This memory bottleneck is why smaller models gain such dramatic throughput advantages, and it reinforces the value of smaller, over-trained models for production deployments.

Out[12]:
Visualization
Inference throughput by model size on A100-80GB. Throughput decreases linearly with model size in memory-bound regime.
Inference cost by model size. The 10x size difference between 7B and 70B models translates to 10x cost difference.

Monthly Deployment Cost Estimation

For capacity planning, we can project monthly costs based on expected request volume:

In[13]:
Code
def estimate_monthly_cost(
    model: ModelSpec,
    hardware: HardwareSpec,
    requests_per_month: float,
    tokens_per_request: float,
    electricity_cost_kwh: float = 0.10,
) -> dict:
    """Estimate monthly deployment costs."""
    costs = calculate_cost_per_million_tokens(
        model, hardware, electricity_cost_kwh
    )

    total_tokens = requests_per_month * tokens_per_request
    tokens_in_millions = total_tokens / 1e6

    throughput = costs["throughput_tokens_sec"]
    hours_needed = total_tokens / (throughput * 3600)
    gpus_needed = hours_needed / (24 * 30)  # GPUs for continuous service

    return {
        "monthly_tokens_M": tokens_in_millions,
        "monthly_cost": tokens_in_millions * costs["total_cost_per_M"],
        "gpu_hours_needed": hours_needed,
        "gpus_for_coverage": max(1, int(gpus_needed) + 1),
    }
Out[14]:
Console

Monthly Cost Projection (10M requests, 500 tokens each)
=======================================================
Model      Tokens (M)   Monthly Cost  GPUs Needed
-------------------------------------------------------
7B              5,000 $    39,277.78           14
13B             5,000 $    72,944.44           26
34B             5,000 $   190,777.78           66
70B             5,000 $   392,777.78          136

The cost differential between model sizes is substantial. For this scenario of 10 million requests generating 5 billion tokens monthly, the 7B model costs approximately $39,000 per month while the 70B model costs around $393,000, a 10× difference. The 7B model needs roughly 14 GPUs for continuous coverage at this volume, compared to about 136 for the 70B, so each dollar spent also buys far more headroom for traffic spikes. Choosing a 7B over-trained model instead of a 70B compute-optimal model can reduce monthly costs by 90% while maintaining comparable quality for many use cases.

Out[15]:
Visualization
Monthly deployment cost comparison for 10M requests at 500 tokens each. The 10× model size difference translates directly to 10× cost difference, demonstrating the compounding effect of per-token costs at scale.

Trade-offs and Practical Considerations

Inference scaling decisions involve more than raw compute costs. Several practical factors influence the optimal model size.

Quality Degradation Limits

Over-training has diminishing returns. As you train a smaller model on more data, it eventually hits capacity limits, and additional training provides minimal benefit. The loss curve for a fixed model size approaches an asymptote:

$$L(N, D) \to \frac{A}{N^\alpha} + E \quad \text{as } D \to \infty$$

This equation shows the fundamental capacity limit of any fixed-size model: no amount of additional training data can push the loss below the floor set by model size and the irreducible entropy of language.

where:

  • $L(N, D)$: the model's loss as a function of parameters $N$ and training data $D$
  • $A$: empirically fitted scaling coefficient for model capacity
  • $N^\alpha$: model size raised to the scaling exponent (typically $\alpha \approx 0.34$)
  • $E$: irreducible error floor (entropy of natural language)
  • $D \to \infty$: the limit as training data approaches infinity

This limit shows what happens when a fixed-size model is trained on infinite data: the data-dependent term $B/D^\beta$ vanishes, leaving only the model capacity term $A/N^\alpha$ and the irreducible error floor $E$. The model capacity term represents limitations inherent to the architecture. A 7B model simply cannot represent certain complex patterns that a 70B model can capture. These architectural constraints set hard limits on how much over-training can compensate for fewer parameters.

Research suggests that models can be productively over-trained by roughly 10-50× before returns diminish significantly. Beyond that, you're paying for training compute that yields minimal quality improvement.
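The sketch below illustrates these diminishing returns for a fixed 7B model, again using illustrative Chinchilla-style coefficients as an assumption; the exact loss values matter less than the shrinking gap to the capacity floor.

# Loss for a fixed 7B model as training data grows (illustrative coefficients).
A, B, ALPHA, BETA, E = 406.4, 410.7, 0.34, 0.28, 1.69
N = 7e9
capacity_floor = A / N**ALPHA + E   # loss floor as D -> infinity

for ratio in [1, 10, 50, 500]:      # multiples of the 20-tokens-per-parameter rule
    d = ratio * 20 * N
    l = A / N**ALPHA + B / d**BETA + E
    print(f"{ratio:>4}x Chinchilla data: loss {l:.3f} (floor {capacity_floor:.3f})")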

Out[16]:
Visualization
Diminishing returns from over-training. Each curve shows loss for a fixed model size as training tokens increase. Smaller models reach their capacity limits faster and at higher loss floors. The shaded region indicates the practical over-training range (10-50×).

Task-Specific Considerations

Different tasks have different sensitivity to model size. Smaller models may:

  • Struggle with complex reasoning chains
  • Have weaker few-shot learning compared to larger models
  • Show degraded performance on rare or specialized domains
  • Exhibit less robustness to adversarial or unusual inputs

For applications requiring these capabilities, accepting higher inference costs may be necessary. The next chapter on predicting model performance will provide frameworks for estimating these task-specific capabilities.

Memory Constraints

Larger models require proportionally more GPU memory. A 70B FP16 model needs 140GB just for weights, exceeding single-GPU capacity and requiring model parallelism. This introduces:

  • Higher latency from cross-GPU communication
  • Reduced batch efficiency
  • More complex deployment infrastructure

Smaller models that fit on a single GPU often achieve better effective throughput despite needing to load parameters for every token.
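A quick capacity check makes the constraint concrete. The sketch below compares weight memory against an assumed 80 GB of GPU memory; it ignores activation and KV-cache memory, which consume additional headroom in practice.

# Weight memory vs. an assumed 80 GB single-GPU capacity (activations and
# KV cache are ignored here, so real headroom is smaller).
GPU_MEMORY_GB = 80

for name, params in [("7B", 7e9), ("13B", 13e9), ("34B", 34e9), ("70B", 70e9)]:
    for precision, bytes_per_param in [("FP16", 2), ("INT8", 1)]:
        weights_gb = params * bytes_per_param / 1e9
        verdict = "fits on one GPU" if weights_gb <= GPU_MEMORY_GB else "needs model parallelism"
        print(f"{name} {precision}: {weights_gb:>5.0f} GB of weights -> {verdict}")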

Batching Efficiency

Production inference servers batch multiple requests together, amortizing the cost of loading model weights across many tokens. Batching efficiency improves with model size to a point, but larger models have smaller maximum batch sizes due to memory constraints.

The optimal batch size trades off:

  • Throughput (higher batches = better GPU utilization)
  • Latency (larger batches = longer queue times)
  • Memory (each request consumes activation memory)

Key Parameters

The key parameters for inference-optimal deployment are:

  • Over-training ratio: The factor by which actual training tokens exceed Chinchilla-optimal tokens ($D_{actual}/20N$). Ratios of 10-50× are common for inference-optimized models.
  • Inference factor: The ratio of expected inference tokens to training tokens ($T/3D$). When this exceeds 1, inference costs dominate and smaller models become preferable.
  • Memory bandwidth: Often the limiting factor for large model inference throughput, not raw compute capacity.
  • Bytes per parameter: Determines model memory footprint (2 bytes for FP16, 1 byte for INT8). Smaller values enable larger batch sizes and better throughput.

Limitations and Impact

Inference scaling analysis provides valuable guidance but has important limitations. The cost models presented here use simplified assumptions: they assume that throughput is constrained by either compute or memory bandwidth in isolation. Real systems exhibit more complex behavior where both constraints interact, speculative decoding changes the compute-per-token relationship, and quantization techniques (which we'll cover in later chapters) can significantly shift the efficiency landscape.

The "tokens generated" framing also obscures the distinction between the prefill phase (processing the input prompt) and the decode phase (generating output tokens). Prefill is compute-bound and can be batched efficiently, while decode is typically memory-bound. Models serving long prompts with short outputs have different cost profiles than those generating lengthy responses from brief prompts.

Despite these limitations, inference scaling has profoundly influenced how the field develops and deploys models. The LLaMA model family demonstrated that inference-optimized training could produce models competitive with much larger alternatives. This work catalyzed the open-source LLM ecosystem. Organizations now routinely train smaller models on more data than Chinchilla would suggest, accepting higher training costs to achieve better deployment economics.

This shift has also influenced architecture design. Techniques like grouped-query attention, which we covered in the LLaMA components chapter, reduce memory bandwidth requirements and improve inference efficiency. The focus on inference has accelerated research into quantization, speculative decoding, and other techniques that reduce the effective cost per token.

Summary

This chapter explored how scaling laws change when inference costs dominate training costs, as they do for any widely-deployed model:

  • The training-inference asymmetry means that models are trained once but may serve trillions of inference requests. When inference compute exceeds training compute, optimizing for training efficiency becomes counterproductive.

  • Chinchilla is training-optimal, not deployment-optimal. The compute-optimal ratios minimize training cost to reach a given loss, but don't account for the per-token inference cost that accumulates over deployment.

  • Over-training refers to training smaller models on more data than Chinchilla ratios suggest. This invests additional training compute to achieve lower inference costs—a trade-off that pays off rapidly for high-volume deployments.

  • The break-even analysis determines when over-training becomes economical, specifically when the total inference savings exceed the additional training investment. For production systems, break-even often occurs at modest request volumes.

  • Deployment cost modeling must account for hardware constraints, memory bandwidth limitations, energy costs, and the superlinear relationship between model size and inference cost.

  • Practical model selection depends on expected inference volume, quality requirements, latency constraints, and infrastructure capabilities. The optimal choice balances these factors rather than optimizing any single metric.

The inference scaling perspective completes our picture of how to allocate compute across the model development lifecycle. Combined with the training-focused scaling laws from earlier chapters, you now have frameworks for optimizing both phases of the compute budget. The next chapter extends these ideas to predict how model performance varies across different capabilities and benchmarks.

Quiz

Ready to test your understanding? Take this quick quiz to reinforce what you've learned about inference scaling and deployment optimization.

