GPT-3: Scale, Few-Shot Learning & In-Context Learning Discovery

Michael Brenndoerfer · Updated July 21, 2025 · 38 min read

Explore GPT-3's 175B parameter architecture, the emergence of few-shot learning, in-context learning mechanisms, and how scale unlocked new capabilities in large language models.

GPT-3

GPT-2 demonstrated that language models could perform tasks without task-specific training, achieving modest zero-shot performance on reading comprehension, translation, and summarization. But this capability was limited: the model often required fine-tuning to match specialized systems. GPT-3 changed this equation through sheer scale. With 175 billion parameters, over 100 times larger than GPT-2, it revealed that massive language models could learn new tasks from just a few examples provided in the prompt, a capability called few-shot learning.

This discovery reshaped how we think about language model capabilities. Instead of training separate models for each task, a single pretrained model could adapt on the fly. Prompting replaced fine-tuning for many applications. The implications extended beyond NLP: GPT-3 demonstrated that scale itself could be a source of emergent capabilities, behaviors that appear suddenly as models grow larger.

In this chapter, we explore GPT-3's architecture and training, the discovery of in-context learning, the difference between zero-shot, one-shot, and few-shot prompting, and what GPT-3 revealed about the relationship between scale and capability.

The Scale of GPT-3

GPT-3's defining characteristic is its size. At 175 billion parameters, it dwarfed all previous language models. To understand what this means in practice, let's compare it to its predecessors:

The GPT model progression shows exponential growth in scale. GPT-3 has roughly 117 times more parameters than GPT-2 and 1500 times more than GPT-1.
| Model | Parameters | Layers | Hidden Size | Heads | Context Length |
|-------|------------|--------|-------------|-------|----------------|
| GPT-1 | 117M       | 12     | 768         | 12    | 512            |
| GPT-2 | 1.5B       | 48     | 1,600       | 25    | 1,024          |
| GPT-3 | 175B       | 96     | 12,288      | 96    | 2,048          |

The jump from GPT-2 to GPT-3 represents more than a linear improvement. With 96 transformer layers, a hidden dimension of 12,288, and 96 attention heads, GPT-3 can represent far more complex patterns and relationships. The context length of 2,048 tokens means the model can consider roughly 1,500 words of context when making predictions.
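To make these ratios concrete, here is a quick sketch that reproduces the comparisons above from the table's parameter counts. The 0.75 words-per-token figure is a common rule of thumb for English BPE text, not a number from the GPT-3 paper.

```python
# Parameter counts taken from the table above
params = {"GPT-1": 117e6, "GPT-2": 1.5e9, "GPT-3": 175e9}

print(f"GPT-3 vs GPT-2: {params['GPT-3'] / params['GPT-2']:.0f}x")  # ~117x
print(f"GPT-3 vs GPT-1: {params['GPT-3'] / params['GPT-1']:.0f}x")  # ~1500x

# Rough rule of thumb: about 0.75 English words per BPE token
context_tokens = 2048
print(f"{context_tokens} tokens ≈ {int(context_tokens * 0.75)} words of context")
```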

Out[3]: Visualization
Parameter count comparison across GPT generations on a logarithmic scale. Each generation represents roughly a 10-100x increase in model size, with GPT-3 reaching 175 billion parameters.

Training Data and Compute

GPT-3's training data, while not publicly disclosed in full detail, was substantially larger than GPT-2's WebText. The training corpus included:

  • Common Crawl: A filtered version of web pages (410 billion tokens, weighted to contribute 60% of training)
  • WebText2: An expanded version of GPT-2's Reddit-curated dataset (19 billion tokens)
  • Books1 and Books2: Two internet-based book corpora (12 billion and 55 billion tokens)
  • Wikipedia: The English Wikipedia (3 billion tokens)

The total training corpus contained approximately 499 billion tokens, though the model saw roughly 300 billion tokens during training due to dataset weighting that favored higher-quality sources.
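As a quick sanity check, the raw token counts listed above sum to the reported corpus size. A small sketch (only the 60% Common Crawl weight is stated in this chapter, so no other mixture weights are assumed here):

```python
# Raw token counts in billions, from the list above
corpus_tokens_b = {
    "Common Crawl": 410,
    "WebText2": 19,
    "Books1": 12,
    "Books2": 55,
    "Wikipedia": 3,
}

total = sum(corpus_tokens_b.values())
print(f"Raw corpus size: {total}B tokens")  # 499B

# Common Crawl is ~82% of the raw data but was weighted to only 60% of training
share = corpus_tokens_b["Common Crawl"] / total
print(f"Common Crawl share of raw tokens: {share:.0%}")
```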

Out[4]: Visualization
Raw token counts by source (Common Crawl dominates with 410 billion tokens) and weighted contribution to training (despite its size, Common Crawl was downweighted to 60%).

Training Compute

GPT-3 required approximately $3.14 \times 10^{23}$ floating-point operations (FLOPs) to train, roughly 1,000 times more compute than GPT-2. At 2020 cloud computing prices, the estimated training cost exceeded $4 million.

The compute required for training grew even faster than the parameter count. This wasn't just about having more parameters; each parameter needed more training data to be utilized effectively. A rough estimate of training compute for a language model is:

$$C \approx 6 \cdot N \cdot D$$

where:

  • $C$: total floating-point operations (FLOPs) for training
  • $N$: number of model parameters
  • $D$: number of training tokens
  • The factor of 6 accounts for the forward and backward passes (approximately 2 FLOPs per parameter per token for the forward pass, 4 for the backward pass)

For GPT-3 with $N = 175 \times 10^9$ parameters trained on $D \approx 300 \times 10^9$ tokens, this yields $C \approx 3.15 \times 10^{23}$ FLOPs, matching the reported training compute. This observation, later formalized in scaling laws, showed that optimal training balances model size with data quantity and training compute.
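Here is a minimal sketch of this estimate in Python, plugging in the parameter and token counts quoted above:

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute: about 6 FLOPs per parameter per token
    (roughly 2 for the forward pass and 4 for the backward pass)."""
    return 6 * n_params * n_tokens


N = 175e9  # GPT-3 parameters
D = 300e9  # tokens actually seen during training

print(f"Estimated training compute: {training_flops(N, D):.2e} FLOPs")  # ~3.15e+23
```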

The GPT-3 Model Family

OpenAI released GPT-3 not as a single model but as a family of eight models spanning four orders of magnitude in size. This range enabled researchers to study how capabilities scale with model size:

The GPT-3 model family spans from 125 million to 175 billion parameters, enabling systematic study of scaling effects.
| Model Name   | Parameters | Layers | Hidden Size | Heads |
|--------------|------------|--------|-------------|-------|
| GPT-3 Small  | 125M       | 12     | 768         | 12    |
| GPT-3 Medium | 350M       | 24     | 1,024       | 16    |
| GPT-3 Large  | 760M       | 24     | 1,536       | 16    |
| GPT-3 XL     | 1.3B       | 24     | 2,048       | 24    |
| GPT-3 2.7B   | 2.7B       | 32     | 2,560       | 32    |
| GPT-3 6.7B   | 6.7B       | 32     | 4,096       | 32    |
| GPT-3 13B    | 13B        | 40     | 5,120       | 40    |
| GPT-3 175B   | 175B       | 96     | 12,288      | 96    |

The smallest GPT-3 model is roughly equivalent to GPT-1 in size, while the largest is over 1,000 times bigger. This range allowed researchers to study emergent capabilities: abilities that appear suddenly at certain scales rather than improving gradually.

Out[5]: Visualization
The GPT-3 model family spans nearly four orders of magnitude in parameter count, from 125 million to 175 billion parameters on a logarithmic scale. The exponential spacing between model sizes enabled systematic study of how capabilities scale, revealing emergent behaviors at larger scales.

Architectural Details

GPT-3 uses the same fundamental decoder-only transformer architecture as GPT-2. The core design remains unchanged: a stack of transformer decoder blocks with masked self-attention, each followed by a feed-forward network. Beyond sheer size, the notable architectural choices, including a few modifications for efficiency and stability at scale, are:

  • Alternating dense and sparse attention: GPT-3 alternates standard dense attention layers with locally banded sparse attention patterns, similar to the Sparse Transformer, reducing computational cost for long sequences
  • Learned position embeddings: Like GPT-2, positions are represented by learned embeddings rather than sinusoidal functions
  • Pre-layer normalization: Layer normalization is applied before each sublayer (attention and FFN) rather than after, following the Pre-LN configuration that stabilizes training for deep networks
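To make the Pre-LN arrangement concrete, here is a minimal sketch of a single decoder block in PyTorch. It uses ordinary dense causal attention only, so GPT-3's alternating sparse pattern is omitted, and the toy dimensions are illustrative rather than a faithful reimplementation:

```python
import torch
import torch.nn as nn


class PreLNDecoderBlock(nn.Module):
    """One GPT-style decoder block with pre-layer normalization."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.ln_attn = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln_ffn = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Causal mask: position i may only attend to positions <= i
        seq_len = x.size(1)
        causal_mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1
        )
        # Pre-LN: normalize *before* each sublayer, then add the residual
        h = self.ln_attn(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal_mask, need_weights=False)
        x = x + attn_out
        x = x + self.ffn(self.ln_ffn(x))
        return x


# Smoke test with toy dimensions (GPT-3 175B uses d_model=12288, n_heads=96, 96 layers)
block = PreLNDecoderBlock(d_model=64, n_heads=4, d_ff=256)
print(block(torch.randn(1, 10, 64)).shape)  # torch.Size([1, 10, 64])
```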

Parameter Count Calculation

When we say GPT-3 has 175 billion parameters, what exactly are we counting? Understanding where these parameters live reveals the model's structure and helps explain why scale matters so much for capability.

A transformer model stores its learned knowledge in weight matrices. Each weight is a single floating-point number that the model adjusts during training. The total count of these numbers determines the model's capacity to represent patterns in language. To understand GPT-3's scale, we need to trace through each component and count its parameters systematically.

The Building Blocks

A GPT-style model consists of several distinct components, each contributing parameters:

  1. Token embeddings: A lookup table that converts each vocabulary token into a dense vector
  2. Position embeddings: A separate lookup table that encodes where each token appears in the sequence
  3. Transformer layers: The repeated blocks that process and transform token representations
  4. Final layer normalization: A normalization layer before the output projection
  5. Output projection: A matrix that converts hidden states back to vocabulary probabilities

Let's formalize this. For a model with vocabulary size $V$, maximum sequence length $L$, model dimension $d$, number of layers $N$, and feed-forward dimension $d_{ff}$, the total parameter count is:

$$P_{\text{total}} = P_{\text{embed}} + P_{\text{pos}} + N \cdot P_{\text{layer}} + P_{\text{norm}} + P_{\text{output}}$$

where:

  • $P_{\text{embed}} = V \cdot d$: token embedding parameters. Each of the $V$ vocabulary tokens gets its own $d$-dimensional vector, requiring $V \times d$ parameters total.
  • $P_{\text{pos}} = L \cdot d$: position embedding parameters. Each of the $L$ possible positions gets a $d$-dimensional vector.
  • $P_{\text{layer}}$: parameters per transformer layer, which we'll derive below.
  • $P_{\text{norm}} = 2d$: the final layer normalization has a learnable scale and bias vector, each of dimension $d$.
  • $P_{\text{output}} = d \cdot V$: the output projection maps from hidden dimension $d$ back to vocabulary size $V$. This is often weight-tied with the token embeddings (sharing the same matrix), but GPT-3 uses separate weights.

The formula captures the intuition that larger vocabularies, longer sequences, and higher dimensions all increase parameter count. But the dominant term is $N \cdot P_{\text{layer}}$: the transformer layers, repeated $N$ times.

Inside a Transformer Layer

Each transformer layer consists of two sublayers, each followed by layer normalization:

  1. Multi-head self-attention: Allows tokens to gather information from other positions
  2. Feed-forward network (FFN): Applies the same transformation to each position independently

The attention sublayer requires four projection matrices: Query ($\mathbf{W}_Q$), Key ($\mathbf{W}_K$), Value ($\mathbf{W}_V$), and Output ($\mathbf{W}_O$). Each matrix has shape $d \times d$, contributing $d^2$ parameters. With four matrices, attention adds $4d^2$ parameters.

The feed-forward network consists of two linear transformations. The first expands from dimension $d$ to $d_{ff}$ (typically $4d$), and the second contracts back from $d_{ff}$ to $d$. This requires $d \times d_{ff} + d_{ff} \times d = 2 \cdot d \cdot d_{ff}$ parameters.

Each layer also has two layer normalizations (one before attention, one before the FFN in the Pre-LN configuration), each with scale and bias vectors of dimension $d$. This adds $4d$ parameters per layer.

Putting it together:

$$P_{\text{layer}} = P_{\text{attn}} + P_{\text{ffn}} + P_{\text{layer\_norms}}$$

where:

  • $P_{\text{attn}} = 4d^2$: the four attention projection matrices ($\mathbf{W}_Q$, $\mathbf{W}_K$, $\mathbf{W}_V$, $\mathbf{W}_O$)
  • $P_{\text{ffn}} = 2 \cdot d \cdot d_{ff}$: the two feed-forward linear layers. With $d_{ff} = 4d$, this becomes $8d^2$.
  • $P_{\text{layer\_norms}} = 4d$: scale and bias for two layer normalizations

Notice that the per-layer count scales with $d^2$. This quadratic dependence explains why increasing the hidden dimension has such a dramatic effect on model size: doubling $d$ quadruples the parameters in each layer.
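A few lines of Python make the quadratic dependence visible: doubling the hidden dimension roughly quadruples the per-layer parameter count, since the layer-norm terms are only linear in $d$ and therefore negligible.

```python
def params_per_layer(d, d_ff=None):
    """Per-layer parameters: 4*d^2 (attention) + 2*d*d_ff (FFN) + 4*d (layer norms)."""
    if d_ff is None:
        d_ff = 4 * d  # GPT-style expansion factor
    return 4 * d * d + 2 * d * d_ff + 4 * d


for d in (1024, 2048, 4096, 12288):
    print(f"d = {d:5d}: {params_per_layer(d):,} parameters per layer")

# Doubling d multiplies the per-layer count by almost exactly 4
print(params_per_layer(2048) / params_per_layer(1024))  # ≈ 4.0
```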

A Worked Example: GPT-3 175B

Let's apply these formulas to GPT-3's actual configuration:

  • Vocabulary size: $V = 50257$ tokens (the GPT-2 tokenizer vocabulary)
  • Maximum sequence length: $L = 2048$ tokens
  • Model dimension: $d = 12288$
  • Number of layers: $N = 96$
  • Feed-forward dimension: $d_{ff} = 4 \times 12288 = 49152$

First, we calculate the per-layer parameters:

$$\begin{aligned}
P_{\text{attn}} &= 4 \times 12288^2 = 4 \times 150{,}994{,}944 = 603{,}979{,}776 \\
P_{\text{ffn}} &= 2 \times 12288 \times 49152 = 1{,}207{,}959{,}552 \\
P_{\text{layer\_norms}} &= 4 \times 12288 = 49{,}152 \\
P_{\text{layer}} &= 603{,}979{,}776 + 1{,}207{,}959{,}552 + 49{,}152 \approx 1.81 \text{ billion}
\end{aligned}$$

Each layer contributes about 1.81 billion parameters. With 96 layers, the transformer stack alone accounts for approximately 174 billion parameters. The remaining 1+ billion come from embeddings and the output projection.

This breakdown reveals an important pattern: the transformer layers dominate the parameter count. Increasing the number of layers or the hidden dimension has a much larger effect than expanding the vocabulary or context length.

Implementation

Let's implement this calculation to verify our formulas and explore how parameters distribute across components:

In[6]:
Code
def compute_gpt3_parameters(
    vocab_size=50257,
    n_layers=96,
    d_model=12288,
    n_heads=96,
    d_ff=None,
    max_seq_len=2048,
):
    """
    Calculate parameter count for a GPT-3 style model.

    The feed-forward dimension is 4x the model dimension by default.
    """
    if d_ff is None:
        d_ff = 4 * d_model

    # Token embeddings
    token_embedding_params = vocab_size * d_model

    # Position embeddings
    position_embedding_params = max_seq_len * d_model

    # Per-layer parameters
    # Attention: Q, K, V, O projections
    attention_params_per_layer = 4 * d_model * d_model

    # FFN: two linear layers
    ffn_params_per_layer = 2 * d_model * d_ff

    # Layer norms (2 per layer, each has scale and bias)
    layernorm_params_per_layer = 4 * d_model

    params_per_layer = (
        attention_params_per_layer
        + ffn_params_per_layer
        + layernorm_params_per_layer
    )

    total_layer_params = n_layers * params_per_layer

    # Final layer norm
    final_norm_params = 2 * d_model

    # Output projection (weight-tied with embedding or separate)
    # GPT-3 uses separate output weights
    output_params = d_model * vocab_size

    total = (
        token_embedding_params
        + position_embedding_params
        + total_layer_params
        + final_norm_params
        + output_params
    )

    return {
        "token_embedding": token_embedding_params,
        "position_embedding": position_embedding_params,
        "layers": total_layer_params,
        "final_norm": final_norm_params,
        "output": output_params,
        "total": total,
    }
Out[7]:
Console
GPT-3 175B Parameter Breakdown
=============================================
Token embeddings:        0.62B
Position embeddings:     0.03B
Transformer layers:    173.95B
Final layer norm:        0.02M
Output projection:       0.62B
---------------------------------------------
Total:                 175.21B

The output confirms our manual calculation. The vast majority of parameters, over 173 billion, reside in the 96 transformer layers. This makes sense: each layer contributes about 1.8 billion parameters through its attention projections and feed-forward network, and with 96 layers, these add up quickly.

The embedding layers tell an interesting story. Despite handling a vocabulary of 50,257 tokens, the token embeddings account for only about 0.62 billion parameters, less than 1% of the total. The position embeddings are even smaller at 0.025 billion (25 million) parameters. The output projection mirrors the token embedding in size but is kept separate in GPT-3 rather than being weight-tied.

This distribution has practical implications. If you want a larger model, adding layers or increasing the hidden dimension is far more effective than expanding the vocabulary or context length. Conversely, if you want a smaller model, the layers are where you need to cut.

The visualization below makes this distribution strikingly clear:

Out[8]:
Visualization
Pie chart showing GPT-3 parameter distribution with transformer layers as the largest segment.
Parameter distribution in GPT-3 175B. The transformer layers dominate, containing over 99% of all parameters. Embedding and output projections together account for less than 2%.

The pie chart reveals just how lopsided the distribution is: the transformer layers are so dominant that the embeddings and output projection are barely visible. This concentration of parameters in the processing layers, rather than the input/output interfaces, reflects where the model's computational "thinking" happens. The embeddings convert tokens to vectors and back, but the layers are where the model transforms and reasons about those representations.

The Discovery of In-Context Learning

GPT-3's most significant contribution wasn't its architecture but what it revealed about large-scale language models: they can learn new tasks from examples provided in the prompt, without any gradient updates. This capability, called in-context learning (ICL), emerged as a surprise during evaluation.

In-Context Learning

In-context learning is the ability of a language model to perform a task by conditioning on a few examples (demonstrations) in the input prompt, without updating the model's parameters. The model "learns" the task pattern from the examples and applies it to new inputs.

Consider a sentiment classification task. Instead of fine-tuning a model on thousands of labeled examples, you can simply show GPT-3 a few examples in the prompt:

Review: "This movie was absolutely fantastic!" Sentiment: Positive Review: "I wasted two hours of my life on this garbage." Sentiment: Negative Review: "The acting was decent but the plot made no sense." Sentiment:

GPT-3 would then complete this with "Negative" or "Mixed", having inferred the classification pattern from just two examples.

Zero-Shot, One-Shot, and Few-Shot Learning

OpenAI's GPT-3 paper systematically compared three prompting strategies:

  1. Zero-shot: The model receives only a task description, with no examples
  2. One-shot: The model sees one example before the test input
  3. Few-shot: The model sees multiple examples (typically 10-100, limited by context length)
In[9]:
Code
def create_prompt(examples, test_input, task_description=None):
    """
    Create a few-shot learning prompt.

    Args:
        examples: List of (input, output) tuples for demonstrations
        test_input: The input to classify/complete
        task_description: Optional zero-shot task description

    Returns:
        Formatted prompt string
    """
    prompt_parts = []

    if task_description:
        prompt_parts.append(task_description + "\n")

    for inp, out in examples:
        prompt_parts.append(f"Input: {inp}")
        prompt_parts.append(f"Output: {out}\n")

    prompt_parts.append(f"Input: {test_input}")
    prompt_parts.append("Output:")

    return "\n".join(prompt_parts)


# Zero-shot example
zero_shot = create_prompt(
    examples=[],
    test_input="The food was cold and the service was slow.",
    task_description="Classify the sentiment of the following review as Positive or Negative.",
)

# One-shot example
one_shot = create_prompt(
    examples=[("Great product, highly recommend!", "Positive")],
    test_input="The food was cold and the service was slow.",
)

# Few-shot example
few_shot = create_prompt(
    examples=[
        ("Great product, highly recommend!", "Positive"),
        ("Terrible experience, never again.", "Negative"),
        ("Best purchase I've ever made!", "Positive"),
        ("Broken on arrival, what a waste.", "Negative"),
    ],
    test_input="The food was cold and the service was slow.",
)
Out[10]:
Console
============================================================
ZERO-SHOT PROMPT
============================================================
Classify the sentiment of the following review as Positive or Negative.

Input: The food was cold and the service was slow.
Output:

============================================================
ONE-SHOT PROMPT
============================================================
Input: Great product, highly recommend!
Output: Positive

Input: The food was cold and the service was slow.
Output:

============================================================
FEW-SHOT PROMPT (4 examples)
============================================================
Input: Great product, highly recommend!
Output: Positive

Input: Terrible experience, never again.
Output: Negative

Input: Best purchase I've ever made!
Output: Positive

Input: Broken on arrival, what a waste.
Output: Negative

Input: The food was cold and the service was slow.
Output:

The key insight from GPT-3's evaluation was that performance scaled predictably with both model size and number of examples:

Out[11]: Visualization
Illustration of how task performance typically scales with model size and number of in-context examples. Larger models benefit more from few-shot examples, while smaller models show modest improvements. This pattern, observed across many GPT-3 benchmarks, suggests that in-context learning is an emergent capability of scale.

This scaling pattern revealed something important: in-context learning is not simply pattern matching or memorization. It requires sufficient model capacity to represent the task and generalize from examples. Smaller models plateau quickly, while larger models continue to benefit from additional demonstrations.

Evaluating GPT-3's Capabilities

The GPT-3 paper evaluated the model across dozens of NLP benchmarks. Performance varied dramatically by task type, revealing both the potential and limits of in-context learning.

Strong Performance: Language Understanding

GPT-3 achieved impressive results on tasks requiring broad language understanding:

GPT-3 few-shot performance compared to fine-tuned state-of-the-art on select benchmarks. Few-shot GPT-3 sometimes matches or exceeds specialized models.
| Task                  | Dataset  | GPT-3 Few-Shot | Fine-Tuned SOTA |
|-----------------------|----------|----------------|-----------------|
| Reading Comprehension | RACE-h   | 46.8%          | 90.0%           |
| Question Answering    | TriviaQA | 71.2%          | 75.4%           |
| Commonsense Reasoning | PIQA     | 82.8%          | 79.4%           |
| Word Scrambling       | Anagrams | 66.9%          | N/A             |

On TriviaQA, GPT-3's few-shot performance (71.2%) approached the fine-tuned state-of-the-art (75.4%), despite never being explicitly trained on the task. On physical commonsense reasoning (PIQA), it actually exceeded the previous best fine-tuned model.

Emergent Capabilities

Perhaps most striking were capabilities that emerged without explicit training:

  • Arithmetic: GPT-3 could perform multi-digit addition and subtraction with reasonable accuracy, despite never being trained on a math curriculum
  • Code generation: It could write functional code snippets in Python and JavaScript from natural language descriptions
  • Translation: Few-shot GPT-3 matched or exceeded supervised baselines on some language pairs, particularly for translating into English
  • News article generation: Given headlines, it produced articles that human evaluators struggled to distinguish from real news

Out[12]: Visualization
GPT-3's accuracy on arithmetic tasks by operation type and number of digits. Performance degrades with increasing digit count, but the model demonstrates genuine computational ability rather than pure memorization.

The arithmetic results are particularly interesting because they suggest the model learned algorithmic procedures rather than memorizing specific calculations. Performance degrades gracefully with more digits, the pattern you'd expect from imperfect procedure learning rather than lookup table failure.

Weak Performance: Structured Tasks

GPT-3 struggled on tasks requiring precise, structured outputs or multi-step reasoning:

  • Reading comprehension with multiple choice: On RACE-h, GPT-3 achieved only 46.8% compared to 90% for fine-tuned models
  • Natural language inference: Performance on ANLI (adversarial NLI) remained near random chance for several rounds
  • Math word problems: Multi-step reasoning with intermediate calculations proved challenging
  • Fact verification: Distinguishing true from false claims required external knowledge verification

These weaknesses pointed to fundamental limitations: GPT-3 excels at pattern completion and surface-level understanding but struggles with tasks requiring careful logical reasoning or precise knowledge retrieval.

How In-Context Learning Works

The mechanism behind in-context learning remained mysterious. How can a model "learn" a task without updating its weights? Several hypotheses emerged from subsequent research:

Hypothesis 1: Task Recognition

One theory suggests that pretraining exposes the model to many implicit task formats. When given few-shot examples, GPT-3 recognizes the pattern from pretraining and activates the appropriate "circuit" for that task. The examples don't teach the model something new; they help it identify which of its existing capabilities to apply.

In[13]:
Code
def analyze_task_format(examples):
    """
    Analyze the format of few-shot examples to understand task structure.

    This illustrates what patterns the model might recognize.
    """
    analysis = {
        "n_examples": len(examples),
        "avg_input_length": np.mean([len(inp.split()) for inp, _ in examples]),
        "avg_output_length": np.mean([len(out.split()) for _, out in examples]),
        "output_vocabulary": set(out for _, out in examples),
    }

    # Detect if this looks like classification
    if len(analysis["output_vocabulary"]) <= 5:
        analysis["likely_task"] = "classification"
    elif analysis["avg_output_length"] > 10:
        analysis["likely_task"] = "generation"
    else:
        analysis["likely_task"] = "extraction"

    return analysis


# Analyze our sentiment examples
sentiment_examples = [
    ("Great product, highly recommend!", "Positive"),
    ("Terrible experience, never again.", "Negative"),
    ("Best purchase I've ever made!", "Positive"),
    ("Broken on arrival, what a waste.", "Negative"),
]

analysis = analyze_task_format(sentiment_examples)
Out[14]:
Console
Task Format Analysis
=============================================
Number of examples: 4
Average input length: 4.8 words
Average output length: 1.0 words
Unique outputs: {'Negative', 'Positive'}
Likely task type: classification

The analysis reveals structural patterns that a model might use to identify the task. With only two unique outputs (Positive, Negative) and short single-word responses, the format clearly signals a binary classification task. The model may recognize this pattern from similar structures encountered during pretraining, such as review datasets or labeled examples in web text.

Hypothesis 2: Implicit Fine-Tuning

Another perspective views the forward pass through GPT-3's 96 layers as an implicit optimization process. The key insight is that transformer layers can implement gradient-descent-like updates on the representations. Consider a single attention layer operating on demonstrations $\{(x_i, y_i)\}$ followed by a test input $x_{\text{test}}$. The attention mechanism computes:

$$h_{\text{test}} = h_{\text{test}}^{(0)} + \sum_{i=1}^{k} \alpha_i \cdot v_i$$

where:

  • $h_{\text{test}}^{(0)}$: the initial representation of the test input before attention
  • $\alpha_i$: the attention weight from the test input to demonstration $i$
  • $v_i$: the value vector derived from demonstration $i$
  • $k$: the number of demonstrations in context

This weighted sum resembles a gradient update: the model adjusts its representation of the test input based on the demonstrations. With 96 layers, the model has substantial "depth" to iteratively refine this adaptation. Research has shown that in-context learning and fine-tuning produce similar representational changes, even though one updates weights and the other doesn't.
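The following toy NumPy sketch illustrates the equation above with a single attention head: the test input's representation is shifted by a weighted combination of the demonstrations' value vectors, with weights given by softmax attention. The random projections stand in for learned weights; this is an illustration of the mechanism, not a claim about GPT-3's actual parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 8, 4  # toy hidden dimension and number of demonstrations

h_test = rng.normal(size=d)      # initial test-input representation h_test^(0)
demos = rng.normal(size=(k, d))  # representations of the k demonstrations

# Random projections standing in for learned W_Q, W_K, W_V
W_q, W_k, W_v = (0.1 * rng.normal(size=(d, d)) for _ in range(3))

q = h_test @ W_q
keys = demos @ W_k
values = demos @ W_v

# Softmax attention weights alpha_i from the test input to each demonstration
scores = keys @ q / np.sqrt(d)
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()

# h_test^(0) + sum_i alpha_i * v_i, the update from the equation above
h_updated = h_test + alpha @ values
print("attention weights:", np.round(alpha, 3))
print("update magnitude: ", round(float(np.linalg.norm(h_updated - h_test)), 3))
```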

Hypothesis 3: Bayesian Inference

A third hypothesis frames in-context learning as Bayesian inference over tasks. The model implicitly maintains a prior distribution $P(\text{task})$ over possible tasks from pretraining. Each demonstration $(x_i, y_i)$ updates this posterior according to Bayes' rule:

$$P(\text{task} \mid x_1, y_1, \ldots, x_k, y_k) \propto P(y_1, \ldots, y_k \mid x_1, \ldots, x_k, \text{task}) \cdot P(\text{task})$$

where:

  • $P(\text{task})$: the prior probability of each task, learned during pretraining from exposure to diverse text formats
  • $P(y_i \mid x_i, \text{task})$: the likelihood of output $y_i$ given input $x_i$ under a specific task interpretation
  • $P(\text{task} \mid \ldots)$: the posterior probability of each task after observing the demonstrations

By the time the model reaches the test input, it has narrowed down the task distribution sufficiently to make a confident prediction. This view explains why more demonstrations help: each one provides additional evidence that sharpens the posterior.
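A toy sketch of this view with two candidate task interpretations, sentiment versus topic classification, and hand-picked likelihoods chosen purely for illustration: each demonstration whose label is better explained by the sentiment reading sharpens the posterior toward it.

```python
import numpy as np

tasks = ["sentiment", "topic"]
posterior = np.array([0.5, 0.5])  # uniform prior P(task) before any demonstrations

# Illustrative likelihoods P(observed label | input, task): each demonstration's
# label is far more probable under the sentiment interpretation.
likelihood_per_demo = np.array([0.9, 0.2])

for k in range(1, 5):  # observe four demonstrations one at a time
    posterior = posterior * likelihood_per_demo  # Bayes' rule, unnormalized
    posterior = posterior / posterior.sum()      # renormalize
    print(f"after {k} demo(s): P(sentiment) = {posterior[0]:.3f}, P(topic) = {posterior[1]:.3f}")
```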

Out[15]: Visualization
Three hypotheses for how in-context learning works. Task recognition: demonstrations activate existing circuits from pretraining. Implicit fine-tuning: deep forward passes behave like gradient descent. Bayesian inference: demonstrations update a posterior over tasks.

Prompt Engineering Principles

The success of in-context learning spawned a new discipline: prompt engineering. Researchers discovered that how you format prompts dramatically affects performance.

Format Matters

Small changes in prompt format can yield large performance differences:

In[16]:
Code
# Different prompt formats for the same task
formats = {
    "simple": """Review: {review}
Sentiment: """,
    "instructional": """Task: Classify the sentiment of the review as Positive or Negative.

Review: {review}
Sentiment: """,
    "structured": """### Sentiment Classification

**Review:** {review}

**Sentiment:**""",
    "conversational": """You are a sentiment analysis expert. Read the following review and determine if it expresses a Positive or Negative sentiment.

The review is: "{review}"

Your classification:""",
}

test_review = (
    "The battery life is amazing but the camera quality is disappointing."
)
Out[17]:
Console
Prompt Format Variations
============================================================

--- SIMPLE FORMAT ---
Review: The battery life is amazing but the camera quality is disappointing.
Sentiment: 

--- INSTRUCTIONAL FORMAT ---
Task: Classify the sentiment of the review as Positive or Negative.

Review: The battery life is amazing but the camera quality is disappointing.
Sentiment: 

--- STRUCTURED FORMAT ---
### Sentiment Classification

**Review:** The battery life is amazing but the camera quality is disappointing.

**Sentiment:**

--- CONVERSATIONAL FORMAT ---
You are a sentiment analysis expert. Read the following review and determine if it expresses a Positive or Negative sentiment.

The review is: "The battery life is amazing but the camera quality is disappointing."

Your classification:

Research showed that the conversational format often performs best for GPT-3, likely because its training data contains many examples of conversational task-solving. The optimal format varies by model and task, requiring empirical experimentation.

Example Selection and Ordering

The choice and order of few-shot examples strongly affects performance:

  • Diversity: Examples should cover the range of possible inputs and outputs
  • Similarity: Examples semantically similar to the test input often help more
  • Recency: Examples closer to the end of the prompt have stronger influence
  • Balance: For classification, examples should be balanced across classes

Out[18]: Visualization
Impact of example ordering on classification performance. Examples at the end of the prompt (most recent) have stronger influence on the model's prediction due to recency effects in attention.
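As a sketch of the similarity heuristic from the list above, the snippet below ranks a pool of candidate demonstrations by bag-of-words cosine similarity to the test input and keeps the closest ones. Real systems typically use learned embeddings for this; the scoring function and the candidate pool here are deliberately simple stand-ins.

```python
import re
from collections import Counter
from math import sqrt


def bow_cosine(a, b):
    """Cosine similarity between bag-of-words vectors of two strings."""
    ca = Counter(re.findall(r"\w+", a.lower()))
    cb = Counter(re.findall(r"\w+", b.lower()))
    dot = sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def select_examples(pool, test_input, k=2):
    """Pick the k demonstrations most similar to the test input."""
    return sorted(pool, key=lambda ex: bow_cosine(ex[0], test_input), reverse=True)[:k]


pool = [
    ("Great product, highly recommend!", "Positive"),
    ("Terrible experience, never again.", "Negative"),
    ("The food was delicious and the staff was friendly.", "Positive"),
    ("Slow shipping and the item arrived broken.", "Negative"),
]

test_review = "The food was cold and the service was slow."
for text, label in select_examples(pool, test_review, k=2):
    print(f"{label}: {text}")
```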

The Label Matters (Sometimes)

A surprising finding was that the actual labels in few-shot examples don't always matter. In some experiments, randomly assigning labels (positive reviews labeled "Negative" and vice versa) still improved performance over zero-shot, suggesting the model was learning format rather than task semantics.

However, for more complex tasks, correct labels improve performance considerably. The model can leverage both the format pattern and the actual input-output mapping.

Limitations and Concerns

GPT-3's release sparked both excitement and concern. While the model demonstrated impressive capabilities, it also exposed fundamental challenges that persist in large language models today. These limitations fall into several categories: reliability of outputs, sensitivity to inputs, societal impacts, and practical constraints.

Factual Errors and Hallucinations

GPT-3 confidently produces plausible-sounding but incorrect information. It might cite nonexistent studies, invent historical events, or provide wrong answers to factual questions. The model has no mechanism to verify claims against external sources or acknowledge uncertainty.

This limitation is particularly dangerous because GPT-3's outputs are fluent and authoritative. Users may accept incorrect information because it sounds convincing. The model's confidence is orthogonal to its accuracy.

Sensitivity to Prompts

Performance is brittle with respect to prompt phrasing. Changing a single word or reordering examples can shift accuracy by 10-20 percentage points. This makes reliable deployment challenging: a prompt that works well on test examples might fail unexpectedly on edge cases.

Bias and Fairness

GPT-3 amplifies biases present in its training data. It associates certain occupations with genders, exhibits racial stereotypes, and can generate toxic content when prompted. The model learned these patterns from internet text and has no mechanism to distinguish harmful biases from useful patterns.

Out[19]: Visualization
Four persistent limitations. Hallucinations: generates plausible but factually incorrect content with high confidence. Prompt sensitivity: small prompt changes cause large performance variations. Bias and fairness: reflects and amplifies biases from training data. Reasoning limits: struggles with multi-step logical reasoning.

Context Length Constraints

With a 2,048 token context, GPT-3 cannot process long documents or maintain extended conversations. Few-shot learning is limited by how many examples fit in the context window. For tasks requiring synthesis across multiple documents, GPT-3 must work with truncated or summarized inputs.

Environmental and Access Concerns

Training GPT-3 consumed enormous computational resources with associated carbon emissions. Access was restricted to an API controlled by OpenAI, raising concerns about democratization of AI research. Only well-funded organizations could experiment with the full model, creating imbalances in who could study and critique these systems.

Impact on the Field

GPT-3's impact extended far beyond its benchmark scores. It changed how researchers and practitioners think about NLP.

Prompting as a New Paradigm

Before GPT-3, the standard approach was to fine-tune pretrained models for specific tasks. GPT-3 demonstrated that sufficiently large models could be steered through prompting alone. This shifted research attention toward:

  • Prompt engineering: Systematic study of how prompt design affects performance
  • Instruction tuning: Training models to follow natural language instructions
  • Chain-of-thought prompting: Encouraging models to show reasoning steps

Scaling Laws

GPT-3 contributed to the formalization of scaling laws: mathematical relationships between model size, training compute, dataset size, and performance. The key insight is that test loss $L$ follows a power-law relationship with each scaling factor:

$$L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}$$

where:

  • $L(N)$: the test loss as a function of model parameters
  • $N$: the number of model parameters
  • $N_c$: a constant that depends on the dataset and architecture
  • $\alpha_N \approx 0.076$: the scaling exponent for parameters (empirically determined)

Similar relationships hold for dataset size $D$ and training compute $C$:

$$L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}$$

where $\alpha_D \approx 0.095$ for dataset tokens and $\alpha_C \approx 0.050$ for compute. These power laws mean that each 10x increase in resources yields a predictable (though diminishing) improvement in loss.
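Because these are pure power laws, the benefit of a 10x increase in any one resource follows directly from its exponent: the loss is multiplied by $10^{-\alpha}$. A short sketch using only the exponents quoted above:

```python
# Scaling exponents quoted above
alpha = {"parameters": 0.076, "data": 0.095, "compute": 0.050}

for resource, a in alpha.items():
    # Relative loss after a 10x increase: L_new / L_old = 10 ** (-alpha)
    rel_loss = 10 ** (-a)
    print(f"10x more {resource:10s} -> loss x {rel_loss:.3f} "
          f"({(1 - rel_loss) * 100:.1f}% reduction)")
```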

Out[20]: Visualization
Power-law scaling on log-log axes. Parameter scaling: test loss decreases as a power law with model size. Data scaling: test loss decreases with dataset size (tokens). Compute scaling: test loss decreases with training FLOPs.

The scaling law perspective suggested that many apparent limitations might be overcome by simply training larger models on more data. This hypothesis, controversial at the time, proved partially correct with subsequent models.

Foundation Models

GPT-3 exemplified the "foundation model" concept: a single large model pretrained on diverse data that can be adapted to many downstream tasks. Rather than training separate models for translation, summarization, and question answering, a single foundation model handles all tasks through different prompts or lightweight adaptation.

This paradigm consolidated AI development around a few very large models, shifting the field toward centralized training and distributed deployment.

Summary

GPT-3 marked a turning point in language AI, demonstrating that scale itself could unlock new capabilities. Its 175 billion parameters, trained on hundreds of billions of tokens, produced a model capable of few-shot learning: adapting to new tasks from just a handful of examples in the prompt.

Key takeaways from GPT-3:

  • Scale enables emergence: In-context learning appeared as model size increased, suggesting that some capabilities require a threshold of capacity before manifesting
  • Few-shot learning works: Providing examples in the prompt can match or exceed fine-tuned models for certain tasks, without any parameter updates
  • Prompt design matters: The format, ordering, and content of prompts strongly affect performance, spawning the field of prompt engineering
  • Limitations persist: Hallucinations, bias, reasoning failures, and context constraints remain challenges that scale alone doesn't solve
  • Paradigm shift: GPT-3 accelerated the move from task-specific fine-tuning toward general-purpose foundation models

The model's release catalyzed an industry-wide race to scale, leading to GPT-4, Claude, PaLM, and other models that pushed beyond GPT-3's capabilities. But GPT-3's core insight endures: with sufficient scale, language models develop surprising and useful behaviors that weren't explicitly trained.

Key Parameters

When working with GPT-3 and similar large language models, several parameters affect performance and behavior:

  • temperature: Controls randomness in token sampling. Values range from 0.0 (deterministic, always choosing the highest probability token) to 2.0 (highly random). Common values are 0.7-1.0 for creative tasks and 0.0-0.3 for factual tasks. Lower temperatures produce more focused, predictable outputs; higher temperatures increase diversity but may reduce coherence.

  • max_tokens: The maximum number of tokens to generate in the response. Set this based on expected output length to control costs and latency. GPT-3 supports up to 2,048 tokens total (prompt + completion combined), so longer prompts leave less room for generation.

  • top_p (nucleus sampling): An alternative to temperature that samples from the smallest set of tokens whose cumulative probability exceeds the threshold. A value of 0.9 means the model considers tokens until their probabilities sum to 90%. Generally, adjust either temperature or top_p, but not both simultaneously.

  • frequency_penalty: Reduces repetition by penalizing tokens based on how often they appear in the generated text so far. Values range from 0.0 to 2.0. Higher values discourage the model from repeating the same phrases, useful for open-ended generation.

  • presence_penalty: Penalizes tokens based on whether they have appeared at all, regardless of frequency. This encourages the model to introduce new topics. Values range from 0.0 to 2.0. Useful when you want diverse, wide-ranging outputs.

  • n_examples (few-shot count): The number of demonstrations to include in the prompt. More examples generally improve performance but consume context space. Empirically, 4-8 examples often provide good results while leaving room for the test input and response. Beyond 20-30 examples, returns typically diminish.

  • stop sequences: Tokens or strings that signal generation should stop. For few-shot prompts, include delimiters like "\n\n" or "Input:" to prevent the model from generating additional examples. Proper stop sequences ensure clean, usable outputs.
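To ground the first few parameters, here is a small NumPy sketch of temperature scaling and nucleus (top-p) sampling applied to a toy next-token distribution. The vocabulary and logits are invented for illustration; production APIs implement the same ideas internally over the full vocabulary.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy vocabulary and unnormalized logits for the next token
vocab = ["Positive", "Negative", "Mixed", "Neutral", "Unknown"]
logits = np.array([3.2, 2.9, 1.0, 0.4, -1.0])


def sample_next_token(logits, temperature=1.0, top_p=1.0):
    """Sample a token index using temperature scaling and nucleus (top-p) filtering."""
    # Temperature: lower values sharpen the distribution, higher values flatten it
    scaled = logits / max(temperature, 1e-8)
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()

    # Nucleus filtering: keep the smallest set of tokens whose cumulative
    # probability reaches top_p, then renormalize over that set
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1
    kept = order[:cutoff]
    kept_probs = probs[kept] / probs[kept].sum()

    return int(rng.choice(kept, p=kept_probs))


for t in (0.2, 0.7, 1.5):
    idx = sample_next_token(logits, temperature=t, top_p=0.9)
    print(f"temperature={t}: sampled '{vocab[idx]}'")
```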

