A comprehensive guide covering the 2020 scaling laws discovered by Kaplan et al. Learn how power-law relationships predict model performance from scale, enabling informed resource allocation, how scaling laws transformed model development planning, and their profound impact on GPT-3 and subsequent large language models.

This article is part of the free-to-read History of Language AI
2020: Scaling Laws for Neural Language Models
In 2020, a team of researchers at OpenAI led by Jared Kaplan published findings that would fundamentally shape how researchers and organizations approach the development of large language models. Their paper, "Scaling Laws for Neural Language Models," established mathematical relationships that predicted how model performance would improve as researchers increased model size, training data, or computational resources. These scaling laws revealed power-law relationships that enabled accurate predictions of model capabilities before training began, providing crucial guidance for allocating resources and planning model development. The discovery that performance scaled predictably with these factors transformed large language model development from an art into more of a science, giving researchers mathematical tools to make informed decisions about how to invest computational resources.
By 2020, the field had witnessed dramatic improvements from scaling transformer-based language models. GPT-2, released in 2019 with 1.5 billion parameters, had shown impressive capabilities. BERT, with its bidirectional architecture, had achieved strong performance across many NLP tasks. Researchers had observed that larger models generally performed better, but the relationship between scale and performance remained poorly understood. There was no systematic framework for predicting how much better a model would perform if given twice the parameters, twice the training data, or twice the compute. This uncertainty made it difficult to plan training runs, allocate resources efficiently, or understand the limits of scaling as a strategy for improving language models.
The lack of predictive scaling laws created significant problems for model development. Organizations planning to train large language models faced enormous uncertainty about what scale would be necessary to achieve target performance levels. Should they invest in a model with 10 billion parameters or 100 billion? How much training data would be needed? What computational resources would be required? Without answers to these questions, decisions about resource allocation were essentially guesses, leading to either wasted resources on over-engineered models or inadequate investment that failed to reach performance targets. The cost of training large models made these decisions particularly consequential, with failed training runs potentially wasting millions of dollars in compute time.
Kaplan and colleagues set out to systematically investigate how model performance scales with various factors. They trained transformer language models across a wide range of scales, varying model size from millions to billions of parameters, training data from millions to billions of tokens, and training compute accordingly. By measuring performance across these different configurations, they could identify mathematical relationships that predicted performance from scale alone. The goal was to discover whether simple power-law relationships could accurately predict how loss, and therefore performance, would improve with increased scale across different dimensions.
The results exceeded expectations. The researchers discovered clean power-law relationships between performance and scale across multiple dimensions. Model performance improved predictably with increased model size, with larger models achieving lower loss and better downstream task performance. Performance also improved predictably with increased training data, following similar power-law scaling. Most importantly, these relationships held across different model sizes and training configurations, suggesting that scaling laws were fundamental properties of neural language model training rather than artifacts of specific architectures or datasets. These laws enabled researchers to predict model performance before training, estimate resource requirements for target performance levels, and make informed decisions about how to allocate computational budgets across model size and training data.
The practical implications were immediate and profound. The scaling laws provided concrete guidance for planning training runs: researchers could now calculate approximately how many parameters and how much training data would be needed to reach a target loss, or predict how much better a model would perform if given additional resources. This predictive capability reduced uncertainty in model development and enabled more efficient resource allocation. The laws also revealed that performance improved smoothly with scale, suggesting that scaling was a viable long-term strategy for improving language models rather than hitting diminishing returns or fundamental limits.
The scaling laws also raised fundamental questions about the limits of scaling and the nature of intelligence that emerges from scale. If performance continued to improve smoothly with scale following power-law relationships, how far could this scaling continue? Would these relationships hold as models approached the limits of available training data or computational resources? The laws suggested that significant improvements were still possible through scaling, but they also hinted at potential limits and trade-offs that would become important as models grew larger. These questions would drive subsequent research, including the Chinchilla scaling laws that would refine understanding of compute-optimal training configurations.
The Problem
As language models grew larger and more sophisticated throughout the late 2010s, researchers faced a fundamental uncertainty: how should they allocate limited computational resources to maximize performance? The intuitive answer seemed to be scaling everything up, but without understanding the mathematical relationships between different scaling dimensions, decisions about resource allocation were essentially guesses. Should an organization invest in training a larger model, collecting more training data, or both? How much better would performance be if they doubled the model size? If they doubled the training data? If they doubled the computational budget allocated to training? Without answers to these questions, planning model development was fraught with risk.
Consider a scenario where a research team had a computational budget equivalent to training a 10 billion parameter model on 100 billion tokens. Should they train a 20 billion parameter model on 50 billion tokens? A 5 billion parameter model on 200 billion tokens? Or something else entirely? Without scaling laws, there was no principled way to answer these questions. Teams might default to making models as large as possible, assuming that bigger models would always perform better, but this assumption ignored the role of training data and the trade-offs between different scaling dimensions.
The lack of predictive scaling laws also made it difficult to estimate whether a training run would succeed before committing enormous computational resources. If a team wanted to achieve a specific performance target, how large should the model be? How much training data would be needed? Without scaling laws, these questions had no answers, forcing teams to rely on intuition, precedent from similar models, or trial and error. This uncertainty was particularly problematic given the enormous costs of training large models: a failed training run could waste weeks of compute time and thousands or millions of dollars in cloud computing costs.
The problem extended beyond individual training runs to strategic questions about the future of language model development. If researchers wanted to understand whether scaling could continue to improve performance indefinitely or whether they would eventually hit diminishing returns or fundamental limits, they needed systematic data about how performance scaled with resources. Without scaling laws, it was impossible to predict whether investing in larger models would continue to provide significant improvements or whether alternative approaches would be necessary to achieve further gains. This uncertainty made it difficult to plan long-term research directions or make strategic decisions about where to invest resources.
There was also a deeper theoretical problem: understanding why and how scaling improved performance was crucial for developing better models and architectures. If researchers could identify the mathematical relationships governing how performance improved with scale, they might be able to design more efficient architectures that achieved better performance with fewer resources, or understand which aspects of scaling mattered most for different types of capabilities. Without scaling laws, these insights remained inaccessible, limiting the ability to improve models beyond simply making them larger.
The field needed systematic empirical investigation that would reveal how performance scales across different dimensions. This investigation would need to train many models across a wide range of scales, carefully controlling for different factors, and measuring how performance changed. The goal would be to discover whether simple mathematical relationships could accurately predict performance from scale alone, enabling researchers to make informed decisions about resource allocation and predict model capabilities before training began.
The Solution
Kaplan and colleagues addressed these problems through systematic empirical investigation that trained transformer language models across a wide range of scales and measured how performance scaled with model size, training data, and computational resources. The solution involved training models ranging from 768 to 1.5 billion non-embedding parameters, training them on datasets ranging from 22 million to 23 billion tokens, and systematically varying other factors like model width, depth, and training compute. By measuring cross-entropy loss on held-out test data across all these configurations, the researchers could identify mathematical relationships that predicted performance from scale alone.
The key insight emerged from analyzing how loss scaled with different factors. The researchers discovered that test loss followed power-law relationships with model size, dataset size, and compute. Specifically, they found that test loss decreased predictably as a power-law function of model parameters, training data, and computational resources. These relationships held across different model configurations and training setups, suggesting that power-law scaling was a fundamental property of neural language model training rather than an artifact of specific choices.
A power-law relationship means that one quantity scales as a power of another. For example, if model loss $L$ scales with the number of parameters $N$ following a power law, then $L(N) = (N_c / N)^{\alpha_N}$ for some constants $N_c$ and $\alpha_N$. This means that doubling the model size reduces loss by a constant factor, independent of the starting size. The scaling laws paper discovered that such relationships held between loss and model size, loss and dataset size, and loss and compute, with a different scaling exponent for each relationship.
The researchers formulated these relationships mathematically. They found that test loss followed power-law relationships with model size $N$, dataset size $D$, and training compute $C$. Specifically, loss decreased with model size as $L(N) = (N_c / N)^{\alpha_N}$ with scaling exponent $\alpha_N \approx 0.076$, with dataset size as $L(D) = (D_c / D)^{\alpha_D}$ with $\alpha_D \approx 0.095$, and with compute as $L(C) = (C_c / C)^{\alpha_C}$ with $\alpha_C \approx 0.050$. These relationships enabled researchers to predict how much performance would improve with increased scale in any dimension.
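To make these relationships concrete, the following Python sketch evaluates the model-size and data-size laws using the approximate constants reported in the paper (roughly $N_c \approx 8.8 \times 10^{13}$ non-embedding parameters with $\alpha_N \approx 0.076$, and $D_c \approx 5.4 \times 10^{13}$ tokens with $\alpha_D \approx 0.095$). These constants depend on tokenization and architecture details, so treat the outputs as illustrative rather than exact.

```python
# A minimal sketch of the paper's separate power laws. The constants are
# approximate fits reported by Kaplan et al. and are illustrative only.

def loss_from_params(n_params, n_c=8.8e13, alpha_n=0.076):
    """Predicted test loss when model size is the binding constraint."""
    return (n_c / n_params) ** alpha_n

def loss_from_data(n_tokens, d_c=5.4e13, alpha_d=0.095):
    """Predicted test loss when dataset size is the binding constraint."""
    return (d_c / n_tokens) ** alpha_d

# Doubling model size shrinks loss by the same constant factor,
# 2 ** -alpha_n, regardless of the starting size:
for n in [1e8, 1e9, 1e10]:
    ratio = loss_from_params(2 * n) / loss_from_params(n)
    print(f"N = {n:.0e}: doubling N multiplies loss by {ratio:.4f}")

print(f"data-limited loss at 100B tokens ≈ {loss_from_data(100e9):.3f}")
```

Each doubling of model size multiplies loss by the same factor, $2^{-\alpha_N} \approx 0.95$, which is exactly what "power law" means in practice: constant proportional gains per doubling.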
The scaling laws also revealed important interactions between different scaling dimensions. The researchers found that optimal performance for a given computational budget required balancing model size and training data appropriately. If too much of the computational budget was allocated to model size with insufficient training data, the model would be undertrained. If too much was allocated to training data with too small a model, the model would lack sufficient capacity. The scaling laws provided guidance for finding optimal balances, though this aspect would be refined by subsequent Chinchilla scaling laws research.
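The paper's combined formula for loss as a function of both model size and data, $L(N, D) = \left[ (N_c / N)^{\alpha_N / \alpha_D} + D_c / D \right]^{\alpha_D}$, makes this trade-off concrete. The sketch below uses it to compare the three hypothetical budget allocations from the earlier scenario; the constants are again the paper's approximate fits, so the comparison is illustrative only.

```python
def loss_n_d(n_params, n_tokens,
             n_c=8.8e13, alpha_n=0.076,
             d_c=5.4e13, alpha_d=0.095):
    """Joint law: L(N, D) = [(N_c/N)**(a_N/a_D) + D_c/D]**a_D."""
    return ((n_c / n_params) ** (alpha_n / alpha_d)
            + d_c / n_tokens) ** alpha_d

# Three ways to spend a comparable budget (from the earlier scenario):
for n, d in [(20e9, 50e9), (10e9, 100e9), (5e9, 200e9)]:
    print(f"N = {n/1e9:>4.0f}B params, D = {d/1e9:>4.0f}B tokens "
          f"-> predicted loss ≈ {loss_n_d(n, d):.3f}")
```

Under these constants, the configurations land within a few hundredths of a nat of each other, which is precisely why a principled formula was needed: intuition alone cannot rank such close alternatives.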
The methodology involved training models systematically across different scales and measuring performance carefully. The researchers used transformer architectures similar to GPT-2, varying the number of layers, the width of layers, and the size of the training dataset. By controlling for different factors and measuring test loss across all configurations, they could isolate how each factor contributed to performance improvements. This systematic approach enabled them to discover the power-law relationships that governed scaling.
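The fitting itself is straightforward: a power law becomes a straight line in log-log space, so the exponent can be recovered with ordinary linear regression. The sketch below demonstrates the idea on synthetic loss measurements (the numbers are invented for illustration, not taken from the paper).

```python
import numpy as np

# "Measured" losses at four model sizes; synthetic values for illustration.
model_sizes = np.array([1e6, 1e7, 1e8, 1e9])   # parameters
losses = np.array([5.2, 4.4, 3.7, 3.1])        # hypothetical test losses

# log L = intercept + slope * log N, where slope = -alpha_N
slope, intercept = np.polyfit(np.log(model_sizes), np.log(losses), 1)
print(f"fitted exponent alpha_N ≈ {-slope:.3f}")

# Extrapolate the fitted line to a scale that was never trained:
predicted = np.exp(intercept + slope * np.log(1e10))
print(f"predicted loss at 10B parameters ≈ {predicted:.2f}")
```

This extrapolation step is the heart of the method's practical value: a fit made on small, cheap models yields a prediction for a model orders of magnitude larger, before any compute is committed to training it.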
The researchers also investigated how these scaling laws applied to downstream tasks, not just language modeling loss. They found that improvements in language modeling loss translated predictably to improvements on downstream tasks like question answering, reading comprehension, and other NLP benchmarks. This finding suggested that the scaling laws captured fundamental properties of how language models learn, not just properties of the language modeling objective itself. The predictable relationship between pretraining loss and downstream task performance made the scaling laws broadly applicable.
The solution provided concrete mathematical tools for planning and predicting model development. Given a target loss, teams could estimate the parameter count and token budget required to reach it, or forecast the improvement that additional resources would buy. These predictions enabled more efficient resource allocation and reduced uncertainty in model development planning. The scaling laws transformed model development from relying on intuition and guesswork to using principled mathematical predictions.
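For instance, inverting the model-size law gives a back-of-the-envelope estimate of the scale needed for a target loss, as in this sketch (again using the paper's approximate constants, so the outputs are rough estimates):

```python
# Invert L(N) = (N_c / N)**alpha_N to get N = N_c * L_target**(-1 / alpha_N).
n_c, alpha_n = 8.8e13, 0.076

def params_for_target_loss(l_target):
    return n_c * l_target ** (-1.0 / alpha_n)

for target in [2.2, 2.0, 1.8]:
    print(f"target loss {target}: roughly "
          f"{params_for_target_loss(target):.2e} parameters")
```

Because the exponent is small, modest loss targets translate into steep parameter requirements: under these constants, pushing the target from 2.0 down to 1.8 nats multiplies the required model size several times over.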
Applications and Impact
The scaling laws had immediate practical impact on how researchers and organizations approached large language model development. The ability to predict performance from scale alone enabled more efficient planning and resource allocation. Research teams could calculate optimal configurations before training began, reducing wasted computational resources and enabling more strategic decision-making about model development. This predictive capability was particularly valuable given the enormous costs of training large models.
The scaling laws directly influenced the development of GPT-3, which was trained concurrently with the scaling laws research. Understanding how performance would scale with model size and training data enabled OpenAI to make informed decisions about GPT-3's scale. The laws suggested that scaling to 175 billion parameters would provide substantial improvements over GPT-2, and these predictions were validated by GPT-3's performance. The scaling laws research helped justify the massive investment required for GPT-3 and provided confidence that the model would achieve its target capabilities.
Beyond GPT-3, the scaling laws became a fundamental tool for planning model development across the field. Researchers at other organizations used the laws to estimate resource requirements for their own models, plan training configurations, and make strategic decisions about where to invest computational resources. The laws enabled more efficient resource allocation by helping teams identify configurations that would achieve target performance levels without unnecessary overspending on model size or training data.
The scaling laws also influenced how researchers thought about the future of language model development. The power-law relationships suggested that performance would continue to improve smoothly with increased scale, at least within the ranges investigated. This finding supported the scaling hypothesis that increasing model size, training data, and compute would continue to unlock new capabilities. The laws provided mathematical justification for continued investment in larger models and larger training datasets, influencing research directions across the field.
The practical impact extended to cost estimation and budgeting. Organizations planning large training runs could now estimate computational costs more accurately by understanding how performance scaled with compute. The laws enabled teams to calculate approximate compute requirements for target performance levels, helping with budget planning and resource allocation decisions. This capability was crucial for organizations making significant investments in large language model training.
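A common approximation in the scaling-laws literature is that training a transformer costs about $6ND$ FLOPs for $N$ parameters and $D$ tokens, counting forward and backward passes. The sketch below turns that into a rough budget estimate; the hardware throughput and price per GPU-hour are assumptions chosen for illustration, not real quotes.

```python
def training_flops(n_params, n_tokens):
    # ~6 FLOPs per parameter per token (forward + backward pass)
    return 6 * n_params * n_tokens

n, d = 10e9, 100e9                  # 10B parameters on 100B tokens
flops = training_flops(n, d)

gpu_flops_per_sec = 100e12          # assumed ~100 TFLOP/s sustained per GPU
gpu_hours = flops / gpu_flops_per_sec / 3600
cost_usd = gpu_hours * 2.0          # assumed $2 per GPU-hour

print(f"~{flops:.1e} FLOPs, ~{gpu_hours:,.0f} GPU-hours, ~${cost_usd:,.0f}")
```

Estimates like this, however coarse, let organizations sanity-check a proposed training run against its budget before reserving hardware.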
The scaling laws also raised important questions that drove subsequent research. While the laws showed how performance improved with scale, their guidance on dividing a fixed computational budget between model size and training data would later be revised: Chinchilla scaling laws research in 2022 showed that compute-optimal training uses substantially more data per parameter than the 2020 analysis suggested. The 2020 scaling laws established the foundation for this subsequent work.
The impact extended beyond immediate practical applications to fundamental questions about the nature of intelligence in neural networks. The fact that performance scaled predictably following power laws suggested that there were underlying mathematical principles governing how neural language models learn and improve. This insight would influence theoretical research into neural network capabilities and limitations, contributing to understanding of what large language models can and cannot do.
The scaling laws also influenced how the field evaluated and compared different model architectures and training approaches. By providing a standardized framework for understanding how performance scales, the laws enabled more systematic comparisons between different approaches. Researchers could now ask not just whether one approach performed better than another at a given scale, but how each approach scaled, enabling deeper insights into the relative strengths of different methods.
Limitations
Despite their significant contributions, the 2020 scaling laws had important limitations that would be addressed by subsequent research. Perhaps most significantly, their prescription for allocating a fixed computational budget favored growing model size much faster than training data. Chinchilla scaling laws research would later show that this allocation was miscalibrated, and that many models trained under the earlier guidance were substantially undertrained relative to their size.
The scaling laws were derived from experiments with models up to 1.5 billion parameters, while subsequent models would scale to hundreds of billions of parameters. The question of whether the power-law relationships would hold at much larger scales remained open. While the laws provided reasonable predictions for GPT-3's scale, their applicability to even larger models required validation. Subsequent research would investigate whether scaling laws continued to hold at larger scales or whether new relationships emerged.
The laws also did not address all aspects of scaling. While they captured relationships between loss and model size, dataset size, and compute, they did not explicitly consider factors like model architecture, training procedures, or data quality. The relationships might vary for different architectures or training approaches, and the laws might not capture important nuances in how these factors affect scaling. Understanding these nuances would require additional research beyond the initial scaling laws formulation.
The scaling laws measured performance primarily through cross-entropy loss on language modeling tasks. While the researchers showed that improvements in loss translated to improvements on downstream tasks, the exact relationship might vary across different task types. Some tasks might benefit more from certain aspects of scaling than others, and the laws might not capture all the nuances of how scaling affects different capabilities. Understanding these variations would be important for applying the laws effectively.
The laws also did not address questions about the limits of scaling. While they showed smooth power-law improvements, they did not predict where these relationships might break down. Would the laws continue to hold as models approached the limits of available training data? Would computational constraints eventually limit scaling? Would there be fundamental limits to how much performance could improve through scaling alone? These questions remained open and would drive continued research.
The computational cost implications of the scaling laws raised concerns about accessibility and sustainability. If performance continued to improve with scale following power laws, achieving state-of-the-art performance would require increasingly enormous computational resources. This requirement could concentrate capabilities in the hands of organizations with massive computational budgets, limiting diversity in research directions and potentially creating barriers to entry for smaller organizations or researchers. These concerns would motivate research into more efficient architectures and training methods.
The scaling laws also did not address important aspects of model quality beyond raw performance metrics. Issues like bias, safety, reliability, and controllability were not captured by the loss metrics used in the scaling laws. A model that achieved lower loss might not necessarily be more useful or safer in practice. Understanding how these qualitative aspects of model behavior scale would require additional research beyond the initial scaling laws formulation.
The relationships discovered were statistical rather than deterministic. While the laws provided good predictions on average, individual training runs might deviate from these predictions due to factors like random initialization, data ordering, or other sources of variance. The laws were useful for planning and estimation but did not guarantee specific outcomes for individual training runs. This variability meant that the laws were best used as guides rather than precise guarantees.
Legacy and Looking Forward
The 2020 scaling laws established a foundational framework that would influence large language model development for years to come. The discovery of power-law relationships between scale and performance transformed model development planning, enabling researchers to make informed predictions about model capabilities and resource requirements. This predictive capability reduced uncertainty in model development and enabled more efficient resource allocation, influencing how researchers and organizations approached building large language models.
The scaling laws directly influenced the development and planning of GPT-3 and subsequent large language models. Understanding how performance would scale enabled organizations to make informed decisions about model size, training data, and computational investments. The laws provided mathematical justification for the massive scale of models like GPT-3, PaLM, and GPT-4, showing that scaling could continue to provide substantial improvements. This influence extended beyond individual models to shape the entire trajectory of large language model development.
The laws also established scaling as a central strategy for improving language model capabilities. The power-law relationships suggested that performance would continue to improve smoothly with increased scale, at least within investigated ranges. This finding supported continued investment in larger models and larger training datasets, influencing research directions across the field. The scaling hypothesis that increasing scale unlocks new capabilities became a guiding principle for large language model development.
The framework established by the scaling laws would be refined and extended by subsequent research. In 2022, Chinchilla scaling laws research would address the optimal balance between model size and training data, showing that many models were undertrained relative to their capacity. This refinement built directly on the 2020 scaling laws, extending understanding of how to allocate computational resources optimally. The 2020 laws provided the foundation that made this subsequent work possible.
The scaling laws also raised fundamental questions that would drive continued research. If performance continued to scale following power laws, where would the limits be? Would the relationships hold as models approached the limits of available training data or computational resources? Would there be diminishing returns or fundamental limits to scaling? These questions would motivate ongoing research into the limits of scaling and alternative approaches to improving model capabilities.
The predictive capability provided by the scaling laws influenced how the field evaluated different approaches and architectures. By providing a standardized framework for understanding performance scaling, the laws enabled more systematic comparisons between different methods. Researchers could ask not just whether one approach performed better at a given scale, but how each approach scaled, enabling deeper insights into relative strengths and trade-offs.
The scaling laws also contributed to understanding of what neural language models learn and how they improve. The fact that performance scaled predictably following mathematical relationships suggested underlying principles governing how neural networks learn from data. This insight would influence theoretical research into neural network capabilities, contributing to understanding of what large language models can achieve and how they achieve it.
The practical impact of the scaling laws continues today. Researchers and organizations planning large training runs still use scaling law predictions to estimate resource requirements and plan model configurations. The laws provide valuable guidance for making informed decisions about resource allocation, reducing uncertainty in model development planning. This practical utility ensures the laws remain relevant as the field continues to develop larger and more capable models.
The scaling laws also highlight important questions about the future of language model development. As models continue to grow larger, understanding the limits of scaling becomes increasingly important. Will power-law improvements continue indefinitely, or will diminishing returns or fundamental limits emerge? How should computational resources be allocated as models grow larger and training becomes more expensive? These questions remain active areas of research, with the 2020 scaling laws providing the foundation for continued investigation.
The 2020 scaling laws represent a crucial milestone in understanding how neural language models scale, providing mathematical tools that transformed model development from relying on intuition to using principled predictions. The discovery of power-law relationships enabled more efficient resource allocation, informed strategic decisions about model development, and established scaling as a central strategy for improving language models. The laws' influence extends beyond their immediate practical applications to fundamental questions about the nature of intelligence in neural networks and the future trajectory of large language model development. The framework they established continues to guide research and development, providing the foundation for subsequent advances in understanding and optimizing large language model training.
Reference
Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., & Amodei, D. (2020). Scaling Laws for Neural Language Models. arXiv:2001.08361.