
Kaplan Scaling Laws
In January 2020, a team at OpenAI published findings that would change how the field thinks about training large language models. Their paper, "Scaling Laws for Neural Language Models," presented empirical evidence that language model performance follows predictable patterns as you increase model size, dataset size, and compute budget. These patterns, now known as the Kaplan scaling laws, provided a rigorous framework for predicting how much improvement you could expect from scaling up neural language models.
Building on our understanding of power laws from the previous chapter, we now examine how these mathematical relationships manifest specifically in language model training. The Kaplan team didn't just observe that performance improves with scale; they quantified the exponents governing these improvements and derived practical formulas for optimal resource allocation. Their central claim was clear: when you have a fixed compute budget, you should prioritize making models larger rather than training them on more data. This guidance shaped the development of GPT-3 and influenced the broader direction of large language model research.
The Experimental Foundation
Before examining the specific scaling laws, it helps to understand how these relationships were discovered. The Kaplan team trained hundreds of transformer language models across a wide range of sizes, from 768 to 1.5 billion non-embedding parameters. They varied three key quantities systematically:
- Model parameters (N): The total number of trainable weights in the network, excluding embedding parameters
- Dataset size (D): The number of tokens in the training corpus
- Compute budget (C): The total floating-point operations used for training, measured in PetaFLOP-days
All models used the same architecture (decoder-only transformers), the same dataset (WebText2), and the same training procedure. This careful experimental design allowed them to isolate the effect of each variable and identify clean power-law relationships.
Kaplan et al. counted only non-embedding parameters when measuring model size. They excluded both input embeddings and output embeddings (weight tying was used) because embedding parameters scale differently with vocabulary size. When comparing to their equations, ensure you're using the same counting convention.
Loss vs. Parameters
The first major finding concerns how test loss decreases as models grow larger. This relationship answers a basic question practitioners face: if I make my model bigger, how much better will it perform? Before scaling laws, researchers had only rough intuitions. More parameters generally helped, but by how much? And for how long?
To understand why a power law might govern this relationship, consider what happens as you add parameters to a neural network. Early parameters capture broad, common patterns in language: basic grammar, frequent word associations, and simple semantic relationships. These patterns appear consistently across training data, so the model learns them reliably and they contribute substantially to reducing loss. As you continue adding parameters, however, the model must capture increasingly subtle and rare phenomena: unusual grammatical constructions, domain-specific terminology, and nuanced contextual dependencies. Each successive layer of complexity appears less frequently in the data and contributes less to overall performance improvement. This diminishing marginal return is exactly what power laws describe mathematically.
When training models to convergence with effectively unlimited data, the relationship follows a power law:
$$L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}$$
where:
- $L(N)$: the cross-entropy loss on held-out text, measuring how well the model predicts the next token
- $N$: the number of non-embedding parameters (trainable weights excluding embeddings)
- $N_c$: a fitted constant representing the "characteristic scale" of parameters (approximately $8.8 \times 10^{13}$)
- $\alpha_N$: the scaling exponent (approximately 0.076), which determines how quickly loss decreases with more parameters
- $N_c/N$: the ratio of the characteristic scale to the actual model size; when $N$ is small compared to $N_c$, this ratio is large, yielding higher loss
The structure of this formula rewards careful examination. The characteristic scale $N_c$ represents a kind of reference point: an astronomically large number of parameters (roughly 88 trillion) that serves as a normalizing constant. You can think of $N_c$ as the scale at which the ratio $N_c/N$ equals 1, making the loss contribution from this term equal to 1 as well. Since no practical model approaches this scale, the ratio is always much greater than 1 for real models, producing loss values that decrease predictably as $N$ grows.
The power-law form emerges because each additional parameter provides diminishing marginal improvement: the ratio $N_c/N$ captures how far the model is from some theoretical limit, and raising it to a small power yields the smooth, predictable decay observed empirically. Kaplan et al. found:
$$\alpha_N \approx 0.076, \qquad N_c \approx 8.8 \times 10^{13} \text{ non-embedding parameters}$$
The exponent $\alpha_N$ is important because it quantifies exactly how much you gain from scale. This small value tells us that each 10x increase in parameters multiplies the loss $(N_c/N)^{\alpha_N}$ by a factor of $10^{-0.076} \approx 0.84$, or equivalently yields about a 16% decrease in the excess loss. Translating to more intuitive terms: doubling the model size reduces loss by about 5%, since $2^{-0.076} \approx 0.95$. This might seem modest, but remember that loss improvements compound multiplicatively and that perplexity is the exponential of cross-entropy loss. Small loss reductions can translate to meaningfully better text generation.
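To make these numbers concrete, here is a minimal sketch in plain Python (an illustration, not the paper's code) that evaluates the parameter law with the approximate constants quoted above:

```python
# Parameter scaling law: L(N) = (N_c / N) ** alpha_N, with Kaplan's approximate constants
ALPHA_N = 0.076
N_C = 8.8e13  # characteristic scale, in non-embedding parameters

def loss_from_params(n_params: float) -> float:
    """Predicted cross-entropy loss for a model with n_params non-embedding parameters."""
    return (N_C / n_params) ** ALPHA_N

base = loss_from_params(1e9)
double = loss_from_params(2e9)
tenfold = loss_from_params(1e10)
print(f"1B params:  loss {base:.3f}")
print(f"2B params:  loss {double:.3f}  ({1 - double / base:.1%} lower)")    # ~5% per doubling
print(f"10B params: loss {tenfold:.3f}  ({1 - tenfold / base:.1%} lower)")  # ~16% per 10x
```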
The linearity on log-log axes confirms the power-law relationship. This was not obvious beforehand. The loss could have plateaued, shown diminishing returns, or shown more complex behavior. Instead, the relationship holds remarkably cleanly across four orders of magnitude in model size. The straight line you observe when plotting both axes on logarithmic scales is the defining signature of a power law, and its consistency across such a wide range suggests that the underlying phenomenon reflects something fundamental about how neural networks learn from data.
Loss vs. Data
The second scaling law describes how performance improves with more training data. This addresses a complementary question to parameter scaling: if you gather more text to train on, how much better will your model become? Understanding this relationship is crucial because data collection has its own costs—curating high-quality training corpora requires substantial effort, and some domains have limited data availability.
The intuition behind data scaling parallels parameter scaling but works through a different mechanism. When you train on a small dataset, the model encounters only a limited sample of language patterns. Common constructions appear frequently enough to learn well, but rarer phenomena (unusual vocabulary, domain-specific expressions, complex reasoning patterns) may appear too infrequently for reliable learning. As the dataset grows, these rare patterns appear more often, giving the model opportunities to learn them. Eventually, however, even very large datasets become "saturated" in some sense. The most common patterns have already been thoroughly learned, and additional data primarily provides more examples of already-captured regularities rather than genuinely new information.
When the model is large enough that it won't overfit (effectively infinite capacity), the loss follows:
$$L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}$$
where:
- $L(D)$: the cross-entropy loss on held-out text
- $D$: the number of tokens in the training set (the total amount of text the model learns from)
- $D_c$: a fitted constant representing the characteristic data scale (approximately $5.4 \times 10^{13}$ tokens)
- $\alpha_D$: the data scaling exponent (approximately 0.095), governing how quickly loss improves with more data
The formula's structure mirrors the parameter scaling law, with $D_c$ serving as the characteristic data scale analogous to how $N_c$ served as the characteristic parameter scale. This parallel structure is not coincidental; it reflects the symmetric role that model capacity and data quantity play in determining what a neural network can learn. The characteristic scale $D_c \approx 5.4 \times 10^{13}$ tokens represents roughly 54 trillion tokens, an enormous amount of text that far exceeds any practical training corpus. This large value ensures that the ratio $D_c/D$ remains well above 1 for realistic datasets, producing the predictable decay in loss as data increases.
The intuition here parallels the parameter scaling: each additional training token provides diminishing returns, but the slightly larger exponent ($\alpha_D > \alpha_N$) indicates that loss is somewhat more sensitive to data increases than parameter increases when measured in isolation.
The fitted parameters are:
$$\alpha_D \approx 0.095, \qquad D_c \approx 5.4 \times 10^{13} \text{ tokens}$$
Notice that $\alpha_D > \alpha_N$ (0.095 vs 0.076). This numerical comparison reveals something important about the relative sensitivity of loss to each resource. With a larger exponent, each multiplicative increase in data produces a bigger drop in loss than the same multiplicative increase in parameters. Specifically, for the same multiplicative increase (say, 10x), data scaling reduces loss by a factor of $10^{-0.095} \approx 0.80$ while parameter scaling reduces it by only $10^{-0.076} \approx 0.84$. Loss improves faster per unit increase in data than per unit increase in parameters when measured in isolation. However, as we'll see shortly, this doesn't mean you should prefer more data over larger models when compute is constrained, because the cost of processing more data must be factored into the optimization.
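A quick check of that comparison, sketched in the same illustrative style as before:

```python
# Per-10x loss reduction implied by each exponent, ignoring the compute cost of either resource
for name, alpha in [("parameters", 0.076), ("data", 0.095)]:
    factor = 10 ** (-alpha)  # the loss is multiplied by this factor per 10x increase
    print(f"10x more {name}: loss x{factor:.3f} ({1 - factor:.1%} reduction)")
```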
Loss vs. Compute
The third scaling law relates loss directly to the compute budget. This is perhaps the most practically important relationship because compute (measured in GPU-hours, dollars, or energy consumption) is the fundamental constraint that organizations face when training models. Unlike parameters, which can be chosen freely, or data, which can often be collected or synthesized, compute represents a hard resource limit.
Understanding how loss scales with compute means recognizing that compute is a derived resource rather than a fundamental one. You don't directly "spend" compute; instead, you spend it indirectly by choosing a model size (which determines compute per forward-backward pass) and a number of training steps (which determines how many passes you perform). The compute budget constrains the product of these choices, which creates a tradeoff. You can train a large model for few steps or a small model for many steps, but you cannot train a large model for many steps within a fixed compute budget.
When training is optimally allocated between model size and training steps, the loss follows:
$$L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}$$
where:
- $L(C)$: the cross-entropy loss on held-out text
- $C$: the total compute budget measured in PetaFLOP-days (one PetaFLOP-day equals $10^{15}$ floating-point operations per second sustained for one day, or about $8.64 \times 10^{19}$ operations)
- $C_c$: a fitted constant representing the characteristic compute scale (approximately $3.1 \times 10^{8}$ PetaFLOP-days)
- $\alpha_C$: the compute scaling exponent (approximately 0.050)
The characteristic compute scale $C_c \approx 3.1 \times 10^{8}$ PetaFLOP-days represents an enormous amount of computation, roughly 310 million PetaFLOP-days. To put this in perspective, training GPT-3 required approximately 3,640 PetaFLOP-days, which is about five orders of magnitude smaller than $C_c$. This large characteristic scale ensures that practical training runs operate in the regime where the ratio $C_c/C$ is much greater than 1, producing meaningful loss values that decrease predictably with more compute.
The compute exponent $\alpha_C$ is smaller than both $\alpha_N$ and $\alpha_D$ because compute is a "derived" resource. It gets split between making the model larger (more parameters) and training longer (more data passes). The compound effect means each 10x increase in compute yields less loss reduction than a 10x increase in parameters or data alone.
The fitted parameters are:
$$\alpha_C \approx 0.050, \qquad C_c \approx 3.1 \times 10^{8} \text{ PetaFLOP-days}$$
To understand why the compute scaling works, consider that total compute is approximately $C \approx 6ND$. The factor of 6 accounts for roughly two operations per parameter per token in the forward pass and four in the backward pass (computing gradients for both activations and weights). This relationship emerges from the arithmetic of matrix multiplications: each parameter participates in a multiply-add during both the forward and backward passes, with the backward pass requiring roughly twice the computation of the forward pass to compute gradients. The factor of 6 is an approximation that holds reasonably well across typical transformer architectures and training configurations, though the exact multiplier can vary with specific implementation details. This means compute constrains the product of model size and data, forcing a tradeoff between the two.
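As a sanity check on the $C \approx 6ND$ approximation, this small sketch estimates the compute for a GPT-3-scale run (175B parameters, 300B tokens, the figures cited earlier) and converts it to PetaFLOP-days:

```python
# Estimate training compute from C ~= 6 * N * D and convert to PetaFLOP-days
FLOPS_PER_PF_DAY = 1e15 * 86_400  # 10^15 FLOP/s sustained for 24 hours

def training_flops(n_params: float, n_tokens: float) -> float:
    return 6.0 * n_params * n_tokens

c = training_flops(175e9, 300e9)                      # GPT-3-scale run
print(f"total FLOPs:   {c:.2e}")                      # ~3.2e23
print(f"PetaFLOP-days: {c / FLOPS_PER_PF_DAY:,.0f}")  # ~3,600, matching the figure quoted above
```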
Why is $\alpha_C$ the smallest of the three exponents? Compute is spent on both larger models and more training steps, so its effect is spread across multiple factors. When you increase compute, you gain both from having more parameters (benefiting from the $\alpha_N$ exponent) and from seeing more data (benefiting from the $\alpha_D$ exponent), but neither benefit is realized fully because the compute is divided between them. The key insight is that this relationship only holds when you allocate compute optimally, which brings us to Kaplan's most consequential finding.
The Unified Scaling Law
The individual scaling laws describe performance in idealized settings: infinite data, infinite compute, or optimal allocation. In practice, models train with finite resources across all dimensions. A model might have limited parameters due to hardware constraints, limited data due to collection challenges, and limited compute due to budget restrictions. The unified scaling law addresses this realistic scenario by combining model capacity and data quantity into a single predictive equation (with compute implied by their product).
The challenge in formulating such a law is figuring out how different constraints interact. If you have a large model but limited data, the model may overfit or fail to use its full capacity. If you have abundant data but a small model, the model may underfit and fail to capture the complexity in the data. The unified law must capture these interactions while reducing to the individual laws in the appropriate limits.
Kaplan et al. proposed a unified equation that combines model size and dataset size:
$$L(N, D) = \left[\left(\frac{N_c}{N}\right)^{\alpha_N/\alpha_D} + \frac{D_c}{D}\right]^{\alpha_D}$$
where:
- $L(N, D)$: the predicted loss as a function of both model size and data size
- $N$: non-embedding parameters
- $D$: dataset size in tokens
- $N_c, D_c$: the characteristic scales from the individual scaling laws
- $\alpha_N, \alpha_D$: the scaling exponents (0.076 and 0.095 respectively)
- $\alpha_N/\alpha_D$: the ratio of exponents (approximately 0.8), which balances the relative contributions of model and data limitations
- The brackets $[\,\cdot\,]^{\alpha_D}$: the outer exponent $\alpha_D$ applied after summing the two limitation terms, converting the combined constraint into a loss prediction
The formula's structure reveals how parameter and data constraints combine mathematically. Consider the expression inside the brackets: it sums two terms, each representing a different kind of limitation. The term $(N_c/N)^{\alpha_N/\alpha_D}$ captures the model's capacity limitation, representing how far the current model size is from the characteristic scale, adjusted by the ratio of exponents to ensure consistent units when adding to the data term. The second term, $D_c/D$, captures the data limitation in a parallel fashion. Adding them inside the brackets before applying the outer exponent $\alpha_D$ means that whichever constraint is more binding will dominate; you cannot compensate for too little data with a larger model, or vice versa.
This additive structure inside the brackets behaves like a smooth version of taking the worse of the two constraints. When one term is much larger than the other, the sum is dominated by that term, and the loss depends primarily on the binding constraint. When both terms are comparable, both constraints matter, and the loss reflects a genuine interaction between model capacity and data availability. This matches our intuition about how training should behave: you need both sufficient model capacity and sufficient data to achieve good performance.
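A minimal sketch of the unified form, using the approximate constants above; holding data fixed while growing the model shows the data term becoming the binding constraint:

```python
# Unified Kaplan law: L(N, D) = [ (N_c/N)**(alpha_N/alpha_D) + D_c/D ] ** alpha_D
ALPHA_N, ALPHA_D = 0.076, 0.095
N_C, D_C = 8.8e13, 5.4e13

def unified_loss(n_params: float, n_tokens: float) -> float:
    model_term = (N_C / n_params) ** (ALPHA_N / ALPHA_D)  # capacity limitation
    data_term = D_C / n_tokens                            # data limitation
    return (model_term + data_term) ** ALPHA_D

# With the dataset fixed at 10B tokens, each 10x in parameters helps less and less:
for n in [1e8, 1e9, 1e10, 1e11]:
    print(f"N = {n:.0e}, D = 1e10 tokens -> predicted loss {unified_loss(n, 1e10):.3f}")
```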
This formula captures an important insight: there's an effective data-to-model ratio that determines performance. A large model on too little data leads to overfitting (the model memorizes its limited training set rather than generalizing). Training a small model on too much data leads to a different kind of inefficiency: the model saturates and additional data provides diminishing returns.
The unified law also raises the question of irreducible loss. Even with infinite parameters and infinite data, some entropy remains in natural language that no model can predict; this represents the fundamental unpredictability of language itself. Kaplan's form omits an explicit constant for this irreducible component, an omission that later formulations (such as Chinchilla's, covered in the next chapter) would address.
Optimal Compute Allocation
Perhaps the most influential finding from Kaplan's work was their prescription for how to allocate a fixed compute budget between model size and training data. This question is very practical: given a budget of one million dollars worth of GPU time, should you train a modest model for a long time on lots of data, or should you train a very large model for a shorter time on less data? Before scaling laws, practitioners relied on intuition and trial-and-error. Kaplan's analysis provided a principled answer backed by empirical evidence.
The derivation of optimal allocation starts from the fact that compute constrains the product of model size and data: $C \approx 6ND$. This means that for a fixed compute budget, increasing $N$ necessarily decreases $D$ (the number of training tokens seen). The question becomes: what split between $N$ and $D$ minimizes loss for a given $C$?
Given a compute budget $C$, the optimal model size and dataset size follow:
$$N_{\text{opt}} \propto C^{0.73}, \qquad D_{\text{opt}} \propto C^{0.27}$$
where:
- $N_{\text{opt}}$: the optimal number of parameters for a given compute budget
- $D_{\text{opt}}$: the optimal number of training tokens for that same budget
- $C$: the total compute budget in PetaFLOP-days
- $\propto$: denotes proportionality; the left side scales with the right side up to a constant factor
- The exponent 0.73: derived from fitting optimal configurations, it indicates parameters should grow faster than the square root of compute
- The exponent 0.27: the complementary exponent for data scaling, since $0.73 + 0.27 = 1$ reflects the constraint $C \propto N \cdot D$
These exponents sum to 1.0, which follows from the compute relationship $C \approx 6ND$. Taking logarithms: $\log C \approx \log 6 + \log N + \log D$. If we write $N_{\text{opt}} \propto C^{a}$ and $D_{\text{opt}} \propto C^{b}$, then $\log N + \log D \approx (a + b)\log C + \text{const}$, requiring $a + b = 1$. This mathematical constraint ensures consistency: you cannot independently choose how both parameters and data scale with compute, because compute is defined as their product. Notice the stark imbalance: 73% of the compute scaling goes toward larger models, while only 27% goes toward more data.
The asymmetry in these exponents reflects Kaplan's central finding: larger models are more sample-efficient. A bigger model extracts more learning from each training token it sees. This means that as you scale up compute, you should primarily invest in model size rather than training duration. The intuition is that a larger model has more capacity to represent complex patterns, so it can learn those patterns from fewer examples. A smaller model, by contrast, needs to see the same patterns many times before it can reliably capture them.
When your compute budget increases by 10x, you should increase model size by approximately $10^{0.73} \approx 5.4\times$ and training tokens by only $10^{0.27} \approx 1.9\times$. Larger models are more sample-efficient. They extract more learning from each token they see.
This recommendation drove concrete decisions. For GPT-3, OpenAI chose a 175B parameter model trained on 300B tokens. Following Kaplan's guidance meant prioritizing a very large model over training on more data.
The contrast is striking. As compute increases 10,000x, optimal model size increases by roughly 800x ($10{,}000^{0.73}$) while optimal dataset size increases by only about 12x ($10{,}000^{0.27}$). This asymmetry reflects Kaplan's core finding: bigger models learn more efficiently.
Worked Example: Allocating a Compute Budget
Let's work through a concrete example to see how these scaling laws guide practical decisions. Suppose you have 10,000 PetaFLOP-days of compute available, a substantial budget representing perhaps several million dollars worth of GPU time. How should you allocate it between model size and training data?
First, establish a baseline calibration. The scaling laws provide proportionalities rather than absolute values, so we need reference points to anchor our calculations. From Kaplan's experiments, a reasonable reference is that 100 PetaFLOP-days is roughly appropriate for training a 10 million parameter model on 1 billion tokens. These numbers serve as our calibration baseline. We'll scale up from this known point using the power-law exponents.
Using the scaling exponents, we scale up from this reference point:
$$N_{\text{opt}}(C) = N_{\text{ref}} \left(\frac{C}{C_{\text{ref}}}\right)^{0.73}, \qquad D_{\text{opt}}(C) = D_{\text{ref}} \left(\frac{C}{C_{\text{ref}}}\right)^{0.27}$$
where:
- $N_{\text{opt}}(C)$: the optimal parameter count for compute budget $C$
- $D_{\text{opt}}(C)$: the optimal token count for compute budget $C$
- $N_{\text{ref}}, D_{\text{ref}}$: reference values at a known compute level (here, 10M parameters and 1B tokens)
- $C_{\text{ref}}$: the reference compute level (here, 100 PetaFLOP-days)
- The ratio $C/C_{\text{ref}}$: represents how many times larger your budget is compared to the reference
- The exponents 0.73 and 0.27: the Kaplan optimal allocation exponents, determining how aggressively each resource scales with compute
The calculation proceeds by first computing the ratio of our budget to the baseline budget: $10{,}000 / 100 = 100$. This tells us we have 100 times more compute than the baseline case. We then raise this ratio to each power-law exponent to determine how much each resource should scale. For parameters, $100^{0.73} \approx 29$, meaning we should use approximately 29 times more parameters than the baseline. For data, $100^{0.27} \approx 3.5$, meaning we should use approximately 3.5 times more data than the baseline.
With 10,000 PetaFLOP-days, this calibration recommends training roughly a 290 million parameter model on about 3.5 billion tokens. The tokens-per-parameter ratio is particularly telling: it has already fallen from 100 at the reference point to about 12, and under Kaplan's allocation it keeps falling as the budget grows (the ratio scales as $C^{0.27 - 0.73} = C^{-0.46}$). This shrinking ratio reflects the "train big, don't overtrain" philosophy that emerged from this work. Traditional machine learning wisdom suggested that models need many samples per parameter to avoid overfitting, but Kaplan's analysis showed that for language models, the opposite approach works better: make the model large and stop training relatively early. GPT-3, trained on fewer than 2 tokens per parameter, is the most visible product of this philosophy.
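The same arithmetic as a short sketch; the reference point (100 PetaFLOP-days for 10M parameters and 1B tokens) is the assumed calibration used in this chapter's example, not a constant from the paper:

```python
# Scale the assumed calibration point up to a 10,000 PetaFLOP-day budget
C_REF, N_REF, D_REF = 100.0, 10e6, 1e9  # assumed reference: PF-days, params, tokens

def kaplan_allocation(compute_pf_days: float) -> tuple[float, float]:
    ratio = compute_pf_days / C_REF
    return N_REF * ratio ** 0.73, D_REF * ratio ** 0.27

n_opt, d_opt = kaplan_allocation(10_000)
print(f"params: {n_opt:.2e}  tokens: {d_opt:.2e}  tokens/param: {d_opt / n_opt:.1f}")
# -> roughly 2.9e8 parameters, 3.5e9 tokens, about 12 tokens per parameter
```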
Comparing Scaling Behavior
To appreciate how the three scaling laws interact and to develop intuition for their relative magnitudes, it helps to visualize them together on a common scale. This comparison shows the hierarchy of exponents and explains why optimal compute allocation favors model size over data.
When we plot all three scaling relationships on the same axes, we normalize each curve to pass through the point (1, 1), representing a baseline case. This allows us to compare slopes: a steeper slope means faster loss improvement per unit increase in the resource. The steepness of each curve directly reflects its corresponding exponent—larger exponents produce steeper downward slopes on log-log axes.
At first glance, you might think this figure contradicts the recommendation to scale models over data. After all, the data curve drops fastest. Remember, though, that the constraint is compute, not the resource itself. This is a subtle but important distinction: the x-axis shows multiplicative increases in each resource in isolation, but in practice you cannot scale either resource without consuming compute, since $C \approx 6ND$ ties the two together.
The resolution lies in sample efficiency. The isolated laws describe idealized limits (unlimited data or unlimited capacity), whereas the optimal-allocation analysis asks how loss falls during training as a function of both model size and the number of tokens actually processed. In that joint analysis, a larger model reaches any given loss after seeing fewer tokens, so a unit of compute spent on parameters buys more improvement than the same unit spent on additional training data. The optimal allocation calculation accounts for these costs and concludes that investing in model size yields better returns per unit of compute than investing in more training data.
Predicting Performance at Scale
One of the most useful applications of scaling laws is extrapolation. If the power laws hold, you can predict performance at scales you haven't yet trained. This turns scaling laws from retrospective descriptions into planning tools. Rather than running many expensive experiments to find the best configuration, you can train a few small models, fit the scaling parameters, and extrapolate to predict how larger models will perform.
Kaplan et al. used this approach to estimate the performance of models orders of magnitude larger than any that existed at the time. This extrapolation informed the decision to build GPT-3 and provided confidence that the investment would yield proportional returns.
These predictions show how loss decreases as models scale from 1 billion to 1 trillion parameters, with improvements measured relative to a 100 million parameter baseline. Each 10x increase in parameters yields approximately a 16% relative improvement, consistent with the exponent $\alpha_N \approx 0.076$.
At 1 trillion parameters, the compounded effect of those four 10x steps is a predicted loss roughly half that of the 100M baseline ($10^{4 \times (-0.076)} \approx 0.50$). The per-decade gain of 16% might sound modest, but remember that perplexity is the exponential of cross-entropy loss: even moderate loss reductions translate to substantial reductions in perplexity, which correlates with noticeably better text generation quality. This finding influenced strategic decisions about how far to push scale.
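A small sketch of that extrapolation, reusing the parameter law and the approximate constants from earlier:

```python
# Extrapolate the parameter scaling law upward from a 100M-parameter baseline
ALPHA_N, N_C = 0.076, 8.8e13

def loss_from_params(n: float) -> float:
    return (N_C / n) ** ALPHA_N

baseline = loss_from_params(1e8)
for n in [1e9, 1e10, 1e11, 1e12]:
    loss = loss_from_params(n)
    print(f"{n:.0e} params: predicted loss {loss:.3f} "
          f"({1 - loss / baseline:.0%} below the 100M baseline)")
```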
How reliable these extrapolations are depends on whether the power-law relationships continue to hold at larger scales. If there are phase transitions, saturation effects, or architectural limitations that emerge at scale, the predictions could be systematically wrong. This uncertainty motivated ongoing research into whether scaling laws exhibit any departures from pure power-law behavior.
Implementation: Scaling Law Calculator
Let's build a practical tool for working with Kaplan scaling laws. This implementation encapsulates all the key relationships we've discussed into a reusable class that can predict losses and compute optimal allocations:
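The original listing isn't reproduced here, so what follows is a minimal sketch of such a calculator in Python, using the approximate constants quoted throughout this chapter; the class name, method names, and the calibration point inside `optimal_allocation` are illustrative choices rather than anything from the paper:

```python
from dataclasses import dataclass

@dataclass
class KaplanScalingLaws:
    """Sketch of a Kaplan scaling-law calculator (approximate published constants)."""
    alpha_n: float = 0.076  # parameter scaling exponent
    alpha_d: float = 0.095  # data scaling exponent
    alpha_c: float = 0.050  # compute scaling exponent
    n_c: float = 8.8e13     # characteristic parameter scale (non-embedding)
    d_c: float = 5.4e13     # characteristic data scale (tokens)
    c_c: float = 3.1e8      # characteristic compute scale (PetaFLOP-days)

    def loss_from_params(self, n: float) -> float:
        return (self.n_c / n) ** self.alpha_n

    def loss_from_data(self, d: float) -> float:
        return (self.d_c / d) ** self.alpha_d

    def loss_from_compute(self, c_pf_days: float) -> float:
        return (self.c_c / c_pf_days) ** self.alpha_c

    def unified_loss(self, n: float, d: float) -> float:
        term_n = (self.n_c / n) ** (self.alpha_n / self.alpha_d)
        return (term_n + self.d_c / d) ** self.alpha_d

    def optimal_allocation(self, c_pf_days: float, c_ref: float = 100.0,
                           n_ref: float = 10e6, d_ref: float = 1e9) -> tuple[float, float]:
        """Split a budget via N ~ C^0.73, D ~ C^0.27, anchored to an assumed reference point."""
        ratio = c_pf_days / c_ref
        return n_ref * ratio ** 0.73, d_ref * ratio ** 0.27

laws = KaplanScalingLaws()
for name, n in [("GPT-2 Small (117M)", 117e6),
                ("GPT-2 XL   (1.5B)", 1.5e9),
                ("GPT-3      (175B)", 175e9)]:
    # Note: these are total parameter counts; Kaplan's fits use non-embedding counts (see above).
    print(f"{name}: predicted loss {laws.loss_from_params(n):.3f}")
```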
The predicted losses demonstrate how the scaling law captures performance differences across model sizes. GPT-2 Small's higher loss compared to GPT-2 XL reflects the consistent power-law improvement. The much larger GPT-3 shows further gains that align with the α_N = 0.076 exponent. These predictions allow researchers to estimate whether building a larger model is worth the investment before committing the resources.
Now let's use the optimal allocation function to explore different compute budgets:
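Continuing the sketch with the same class and the same assumed calibration point:

```python
# Sweep compute budgets and tabulate the Kaplan-style allocation (assumed calibration)
laws = KaplanScalingLaws()  # class defined in the previous listing

print(f"{'PF-days':>10} {'params':>12} {'tokens':>12} {'tokens/param':>14}")
for c in [1e2, 1e3, 1e4, 1e5, 1e6]:
    n_opt, d_opt = laws.optimal_allocation(c)
    print(f"{c:>10.0e} {n_opt:>12.2e} {d_opt:>12.2e} {d_opt / n_opt:>14.1f}")
```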
This table shows how Kaplan's optimal allocation shifts the balance toward larger models as compute increases: across the sweep, the recommended parameter count grows by nearly three orders of magnitude while the recommended token count grows by barely one. This balance seemed counterintuitive before scaling laws were established. The conventional wisdom had been that models needed many examples per parameter to avoid overfitting, but Kaplan's analysis showed that for language models, the optimal strategy is the opposite: build very large models and train them for relatively few steps.
The tokens-per-parameter ratio falls steadily as compute grows, scaling as $C^{0.27 - 0.73} = C^{-0.46}$, and it quickly drops far below what many practitioners assumed was necessary before scaling laws were established.
Limitations and Impact
The Kaplan scaling laws helped us understand neural language models, but they had significant limitations that later research would address.
The most important limitation was the experimental methodology for measuring optimal allocation. Kaplan et al. estimated optimal model and data sizes by training models for varying numbers of steps and observing which configurations achieved the best loss for a given compute budget. However, they did not train any models fully to convergence. They relied on early training dynamics to extrapolate optimal configurations. This choice led to systematic bias toward larger models and shorter training runs. The Chinchilla scaling laws, which we'll explore in the next chapter, would later demonstrate that training models much longer on more data is actually more compute-efficient than Kaplan's analysis suggested.
Another limitation was the narrow scope of experiments. All experiments used a single architecture family (decoder-only transformers), a single dataset (WebText2), and a single domain (English web text). The fitted constants and even the exponents may not generalize to encoder models, different languages, or specialized domains like code or scientific text. Subsequent work has shown that optimal scaling behavior does vary across these dimensions.
The treatment of compute was also simplified. Kaplan measured compute in PetaFLOP-days, aggregating all operations equally. In practice, different operations have different hardware efficiency, and memory bandwidth often matters more than raw FLOPS for large models. The relationship $C \approx 6ND$ that Kaplan used as an approximation is reasonable but not exact across all architectures and batch sizes.
Despite these limitations, Kaplan's work had major impact. For the first time, researchers had quantitative guidance for resource allocation decisions. The framework of power-law scaling became standard vocabulary in the field, and the methodology of fitting scaling laws to predict large-model performance became essential for compute-limited research planning. GPT-3's architecture and training decisions were directly informed by these scaling laws, demonstrating their practical influence on billion-dollar training runs.
The work also raised questions that motivate ongoing research: Why do power laws emerge at all? Are there phase transitions or inflection points hidden at larger scales? What determines the exponents, and can we design architectures with better scaling? These questions relate to the emergence phenomena we'll explore in later chapters.
Summary
The Kaplan scaling laws showed that language model performance follows predictable power-law relationships across three dimensions:
- Loss vs. parameters: $L(N) = (N_c/N)^{\alpha_N}$ with $\alpha_N \approx 0.076$, meaning doubling model size reduces loss by about 5%
- Loss vs. data: $L(D) = (D_c/D)^{\alpha_D}$ with $\alpha_D \approx 0.095$, a slightly steeper improvement per unit of data
- Loss vs. compute: $L(C) = (C_c/C)^{\alpha_C}$ with $\alpha_C \approx 0.050$, assuming optimal allocation
To interpret these exponents: a 10x increase in any factor multiplies the loss by $10^{-\alpha}$, so parameters give about a 16% improvement ($10^{-0.076} \approx 0.84$), data about 20% ($10^{-0.095} \approx 0.80$), and compute about 11% ($10^{-0.050} \approx 0.89$).
The key practical finding was the optimal allocation formula: when scaling up compute, increase model size as $C^{0.73}$ and data as $C^{0.27}$. This prescription (scale models faster than data) shaped the development of GPT-3 and showed that larger models are more sample-efficient.
These relationships held cleanly across four orders of magnitude in the original experiments, allowing extrapolation to scales that hadn't yet been trained. However, the methodology for determining optimal allocation had systematic limitations that later work would revise. The next chapter examines the Chinchilla scaling laws, which challenged Kaplan's recommendations and showed that models like GPT-3 were actually undertrained relative to their size.
Key Parameters
The key parameters for Kaplan scaling laws are:
- α_N (0.076): Parameter scaling exponent. Determines how quickly loss decreases as model size increases.
- α_D (0.095): Data scaling exponent. Governs the rate of loss improvement with more training tokens.
- α_C (0.050): Compute scaling exponent. Controls loss reduction when compute budget increases with optimal allocation.
- N_c, D_c, C_c: Characteristic scale constants. Fitted values representing the scale at which each resource reaches a reference loss level.
- Allocation exponents (0.73, 0.27): Optimal compute allocation. Specify how to split compute budget between larger models and more data.