In-Context Learning: How LLMs Learn from Examples Without Training

Michael Brenndoerfer · Updated July 27, 2025 · 51 min read

Explore how large language models learn new tasks from prompt demonstrations without weight updates. Covers example selection, scaling behavior, and theoretical explanations.

In-Context Learning

When GPT-3 was released in 2020, researchers observed something unexpected: the model could perform new tasks simply by being shown a few examples in the prompt, without any gradient updates or fine-tuning. This capability, called in-context learning (ICL), represented a fundamental shift in how we think about adapting language models to new tasks. Rather than collecting labeled datasets and training specialized models, you could now demonstrate a task through examples and let the model infer what to do.

This chapter explores in-context learning in depth. We'll examine why it works, how it compares to traditional fine-tuning, what makes some examples more effective than others, how ICL capabilities scale with model size, and the theoretical frameworks researchers have developed to explain this surprising phenomenon.

What Is In-Context Learning?

In-context learning refers to a language model's ability to learn new tasks from examples provided in the input prompt, without updating any model parameters. The model "learns" the task pattern by conditioning on the demonstrations and applies that pattern to new inputs.

In-Context Learning

In-context learning (ICL) is the ability of a pretrained language model to perform a task by conditioning on a few input-output examples (demonstrations) in the prompt, without any gradient-based training or weight updates. The model infers the task from the examples and generalizes to new inputs.

Consider a translation task. Instead of fine-tuning a model on thousands of parallel sentences, you can provide a few examples directly in the prompt:

English: The weather is beautiful today.
French: Le temps est magnifique aujourd'hui.
English: I would like a cup of coffee.
French: Je voudrais une tasse de café.
English: Where is the nearest train station?
French:

The model completes this with "Où est la gare la plus proche?" having inferred the translation pattern from just two examples. No training occurred. The model's weights remain unchanged. Yet it performs the task correctly.

This behavior was surprising because traditional machine learning assumes you need gradient updates to learn. The model must see many examples, compute a loss, and adjust its parameters. ICL breaks this assumption: learning happens within a single forward pass through the network.

The Three Regimes of Prompting

The GPT-3 paper formalized three prompting regimes based on how many examples are provided:

In[3]:
Code
def create_icl_prompt(task_description, examples, test_input):
    """
    Create an in-context learning prompt with demonstrations.

    Args:
        task_description: Optional natural language description of the task
        examples: List of (input, output) tuples as demonstrations
        test_input: The input to be completed by the model

    Returns:
        Formatted prompt string
    """
    parts = []

    if task_description:
        parts.append(task_description)
        parts.append("")

    for inp, out in examples:
        parts.append(f"Input: {inp}")
        parts.append(f"Output: {out}")
        parts.append("")

    parts.append(f"Input: {test_input}")
    parts.append("Output:")

    return "\n".join(parts)


# Zero-shot: Task description only, no examples
zero_shot = create_icl_prompt(
    task_description="Classify the sentiment as Positive or Negative.",
    examples=[],
    test_input="The movie was a complete waste of time.",
)

# One-shot: Single demonstration
one_shot = create_icl_prompt(
    task_description=None,
    examples=[("I love this product!", "Positive")],
    test_input="The movie was a complete waste of time.",
)

# Few-shot: Multiple demonstrations
few_shot = create_icl_prompt(
    task_description=None,
    examples=[
        ("I love this product!", "Positive"),
        ("Terrible customer service.", "Negative"),
        ("Exceeded all my expectations.", "Positive"),
        ("Would not recommend to anyone.", "Negative"),
    ],
    test_input="The movie was a complete waste of time.",
)
Out[4]:
Console
==================================================
ZERO-SHOT (task description only)
==================================================
Classify the sentiment as Positive or Negative.

Input: The movie was a complete waste of time.
Output:

==================================================
ONE-SHOT (1 example)
==================================================
Input: I love this product!
Output: Positive

Input: The movie was a complete waste of time.
Output:

==================================================
FEW-SHOT (4 examples)
==================================================
Input: I love this product!
Output: Positive

Input: Terrible customer service.
Output: Negative

Input: Exceeded all my expectations.
Output: Positive

Input: Would not recommend to anyone.
Output: Negative

Input: The movie was a complete waste of time.
Output:

Each regime has distinct characteristics:

  • Zero-shot: The model relies entirely on its pretrained knowledge and the task description. Works best for common tasks the model encountered during pretraining.
  • One-shot: A single example clarifies the expected format and task. Often sufficient for simple classification or formatting tasks.
  • Few-shot: Multiple examples help the model distinguish between classes, understand edge cases, and calibrate its confidence. Typically 4-32 examples, limited by context length.

The number of examples you can provide is constrained by the model's context window. With a 2,048 token limit (GPT-3) or 8,192+ tokens (later models), you must balance demonstration count against prompt length.
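
To get a sense of this trade-off before writing a prompt, a rough budget check helps. The sketch below is a minimal estimate assuming roughly four characters per token; the helper name, the reserved-token margin, and the ratio are illustrative assumptions, not properties of any particular tokenizer.

Code
def estimate_max_demonstrations(
    example_pairs, context_limit=2048, reserved_tokens=256, chars_per_token=4
):
    """
    Rough estimate of how many demonstrations fit in a context window.

    Uses a crude ~4 characters-per-token heuristic; for real budgeting,
    count tokens with the model's actual tokenizer.
    """
    budget = context_limit - reserved_tokens  # leave room for the query and completion
    used = 0
    fitted = 0
    for inp, out in example_pairs:
        formatted = f"Input: {inp}\nOutput: {out}\n\n"
        cost = len(formatted) // chars_per_token + 1
        if used + cost > budget:
            break
        used += cost
        fitted += 1
    return fitted


sentiment_examples = [("I love this product!", "Positive")] * 100
print(estimate_max_demonstrations(sentiment_examples, context_limit=2048))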

ICL vs Fine-Tuning

Traditional task adaptation requires fine-tuning: updating a pretrained model's weights on task-specific labeled data. In-context learning offers an alternative that trades training for inference. Understanding when to use each approach requires examining their fundamental differences.

The Fine-Tuning Paradigm

Fine-tuning involves several steps:

  1. Collect labeled examples for your task (typically thousands)
  2. Initialize from a pretrained model
  3. Train on your data with gradient descent
  4. Deploy the specialized model

The result is a dedicated model for your task. Its weights have been permanently modified to excel at that specific application. Fine-tuning produces strong performance but requires:

  • Labeled training data
  • Compute for training
  • Expertise to avoid overfitting
  • Separate model storage per task

The ICL Paradigm

In-context learning follows a different path:

  1. Prepare a small number of demonstrations (typically 1-32)
  2. Format them into a prompt
  3. Pass the prompt to the base model at inference time

The base model remains unchanged. The same model serves all tasks, with demonstrations specifying what to do. ICL requires no training but consumes inference compute for longer prompts.

Out[5]:
Visualization
Flowchart showing fine-tuning approach to task adaptation.
Fine-tuning creates task-specific models through gradient updates on labeled data.
Flowchart showing ICL approach to task adaptation.
In-context learning uses a single base model with task-specific prompts containing examples.

Comparing Performance

How do these approaches compare in practice? The answer depends on several factors:

Trade-offs between fine-tuning and in-context learning approaches.
Aspect            | Fine-Tuning                         | In-Context Learning
Data required     | 100s-1000s of examples              | 1-32 examples
Training time     | Hours to days                       | None
Inference cost    | Standard                            | Higher (longer prompts)
Task switching    | Load new model                      | Change prompt
Peak performance  | Generally higher                    | Competitive for many tasks
Model updates     | New base model requires retraining  | Automatic improvement

Fine-tuning typically achieves higher accuracy when sufficient training data is available. The model has many gradient steps to learn task-specific patterns. ICL is limited to what can be demonstrated in a prompt and what the model learned during pretraining.

However, ICL excels in scenarios where fine-tuning is impractical:

  • Rapid prototyping: Test ideas without training infrastructure
  • Low-resource tasks: Limited labeled data available
  • Dynamic tasks: Task definition changes frequently
  • Multi-task deployment: Single model serves many tasks
Out[6]:
Visualization
Line plot comparing fine-tuning and few-shot ICL performance as a function of training examples.
Illustrative comparison of task performance between fine-tuning and ICL. Fine-tuning (solid line) typically achieves higher accuracy with sufficient data, while ICL (dashed lines) offers strong performance with minimal examples. The shaded region shows where ICL is competitive.

The crossover point varies by task. For simple classification, ICL often matches fine-tuning with just a few examples. For complex reasoning or structured prediction, fine-tuning's advantage is more pronounced.

When Representations Converge

Recent research shows that ICL and fine-tuning produce similar internal representations despite their different mechanisms. When you examine the hidden states of a model performing a task via ICL versus a fine-tuned version of that model, the representations are often highly correlated.

This suggests that both approaches activate similar "circuits" in the network. Fine-tuning strengthens these circuits through weight updates. ICL activates them through attention over demonstrations. The end result, in terms of what the model computes, is quite similar.
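
One way to probe this similarity is to compare representation matrices directly. The sketch below uses linear centered kernel alignment (CKA), a common representation-similarity measure; the random matrices stand in for hidden states that would, in practice, be extracted from an ICL prompt and from a fine-tuned model, so the specific numbers here are purely illustrative.

Code
import numpy as np


def linear_cka(X, Y):
    """
    Linear centered kernel alignment between two representation matrices.

    X, Y: arrays of shape (n_examples, hidden_dim), with rows paired.
    Returns a similarity in [0, 1]; 1 means the representations span
    essentially the same structure.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    numerator = np.linalg.norm(Y.T @ X, "fro") ** 2
    denominator = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return numerator / denominator


# Stand-ins for hidden states; in practice these would come from the model
rng = np.random.default_rng(0)
icl_states = rng.normal(size=(64, 256))
finetuned_states = icl_states + 0.1 * rng.normal(size=(64, 256))  # nearly identical
print(f"CKA similarity: {linear_cka(icl_states, finetuned_states):.3f}")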

Example Selection Strategies

Not all demonstrations are equally effective. Research has shown that the choice of examples can swing performance by 20-30 percentage points. Understanding what makes examples effective has become a critical skill in prompt engineering.

Factors That Influence Example Quality

Several properties affect how useful an example is as a demonstration:

  • Relevance: Examples similar to the test input help more than dissimilar ones
  • Diversity: Examples should cover the range of possible inputs and outputs
  • Clarity: Unambiguous examples with clear input-output relationships work best
  • Correctness: Erroneous labels can mislead the model
  • Ordering: Examples at the end of the prompt have stronger influence

Let's examine each factor and how to optimize for it.

Semantic Similarity Selection

One of the most effective strategies is selecting examples that are semantically similar to the test input. The intuition is that nearby examples in embedding space share relevant features that help the model generalize.

Similarity is typically measured using cosine similarity between embedding vectors:

\text{sim}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \|\mathbf{b}\|}

where:

  • a and b: embedding vectors for two text inputs
  • a · b: the dot product of the two vectors
  • ‖a‖ and ‖b‖: the Euclidean norms (lengths) of each vector

Cosine similarity ranges from -1 (opposite directions) to 1 (identical directions), with 0 indicating orthogonal (unrelated) vectors. For text embeddings, higher similarity indicates semantically related content.

In[7]:
Code
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


def select_similar_examples(test_input, candidate_pool, embedder, k=4):
    """
    Select k examples most similar to the test input.

    Args:
        test_input: The input to find similar examples for
        candidate_pool: List of (input, output) example tuples
        embedder: Function that maps text to embedding vectors
        k: Number of examples to select

    Returns:
        List of k most similar examples
    """
    # Embed the test input
    test_embedding = embedder([test_input])

    # Embed all candidates
    candidate_inputs = [inp for inp, _ in candidate_pool]
    candidate_embeddings = embedder(candidate_inputs)

    # Compute similarities
    similarities = cosine_similarity(test_embedding, candidate_embeddings)[0]

    # Select top-k
    top_indices = np.argsort(similarities)[-k:][::-1]

    return [candidate_pool[i] for i in top_indices]


# Simulate with mock embeddings for demonstration
def mock_embedder(texts):
    """Mock embedder that returns random vectors (consistent within a run)."""
    np.random.seed(hash(str(texts)) % 2**32)
    return np.random.randn(len(texts), 64)


# Example pool
example_pool = [
    ("Great food and amazing service!", "Positive"),
    ("Waited an hour for cold food.", "Negative"),
    ("The steak was cooked perfectly.", "Positive"),
    ("Overpriced for what you get.", "Negative"),
    ("Best Italian restaurant in town!", "Positive"),
    ("Never coming back here again.", "Negative"),
    ("Friendly staff and cozy atmosphere.", "Positive"),
    ("Found a hair in my soup.", "Negative"),
]

test_input = "The pasta was absolutely delicious."
Out[8]:
Console
Test input: "The pasta was absolutely delicious."

Selecting 4 examples by semantic similarity...

Selected demonstrations:
  1. "Waited an hour for cold food." → Negative
  2. "Best Italian restaurant in town!" → Positive
  3. "Great food and amazing service!" → Positive
  4. "Friendly staff and cozy atmosphere." → Positive

The selection algorithm identifies examples that are semantically closest to the test input. In this case, the mock embedder produces consistent but randomized vectors, so the selected examples demonstrate the general approach. In practice, using a real sentence embedding model like sentence-transformers would select examples with genuinely similar meaning and vocabulary.

Similarity-based selection consistently outperforms random selection. The most similar examples contain vocabulary, style, and semantic features that help the model understand what kind of input it's processing.

Out[9]:
Visualization
Scatter plot showing test input and candidate examples in 2D embedding space with connections to selected neighbors.
Semantic similarity selection in embedding space. The test input (star) is compared against candidate examples (circles). The algorithm selects the k nearest neighbors (connected by dashed lines). Positive and negative examples naturally cluster based on semantic content.

Diversity Through Coverage

While similarity helps, relying solely on it can create blind spots. If all selected examples are too similar, the model may not learn to handle edge cases or the full range of outputs.

A balanced approach selects examples that are both relevant and diverse:

In[10]:
Code
def select_diverse_examples(
    test_input, candidate_pool, embedder, k=4, diversity_weight=0.3
):
    """
    Select examples balancing similarity and diversity.

    Uses maximal marginal relevance (MMR) to avoid redundant examples.

    Args:
        test_input: Input to select examples for
        candidate_pool: Available (input, output) examples
        embedder: Text to embedding function
        k: Number of examples to select
        diversity_weight: Balance between relevance (0) and diversity (1)

    Returns:
        List of k balanced examples
    """
    # Get embeddings
    test_emb = embedder([test_input])
    candidate_inputs = [inp for inp, _ in candidate_pool]
    candidate_embs = embedder(candidate_inputs)

    # Compute relevance scores
    relevance = cosine_similarity(test_emb, candidate_embs)[0]

    selected = []
    selected_embs = []
    remaining = list(range(len(candidate_pool)))

    for _ in range(k):
        if not remaining:
            break

        best_score = -float("inf")
        best_idx = None

        for idx in remaining:
            # Relevance term
            rel_score = relevance[idx]

            # Diversity term: max similarity to already selected
            if selected_embs:
                div_score = max(
                    cosine_similarity([candidate_embs[idx]], selected_embs)[0]
                )
            else:
                div_score = 0

            # MMR score
            score = (
                1 - diversity_weight
            ) * rel_score - diversity_weight * div_score

            if score > best_score:
                best_score = score
                best_idx = idx

        selected.append(candidate_pool[best_idx])
        selected_embs.append(candidate_embs[best_idx])
        remaining.remove(best_idx)

    return selected
Out[11]:
Console
Maximal Marginal Relevance (MMR) Selection
==================================================
Test: "The pasta was absolutely delicious."

With diversity_weight=0.3, selected examples:
  1. "Waited an hour for cold food." → Negative
  2. "Best Italian restaurant in town!" → Positive
  3. "Found a hair in my soup." → Negative
  4. "Great food and amazing service!" → Positive

Class distribution: 2 Positive, 2 Negative

Unlike pure similarity-based selection, MMR actively penalizes redundancy. The diversity weight of 0.3 means the algorithm balances 70% relevance with 30% diversity. This typically results in better class coverage, as selecting multiple very similar examples provides diminishing returns.

The maximal marginal relevance (MMR) algorithm iteratively selects examples that are relevant to the query but dissimilar to already-selected examples. At each step, MMR scores each candidate using:

\text{MMR}(d) = (1 - \lambda) \cdot \text{sim}(d, q) - \lambda \cdot \max_{s \in S} \text{sim}(d, s)

where:

  • d: a candidate demonstration being evaluated
  • q: the query (test input) we want to find examples for
  • S: the set of already-selected demonstrations
  • sim(d, q): similarity between candidate d and query q (relevance term)
  • max over s in S of sim(d, s): maximum similarity between candidate d and any already-selected example (redundancy term)
  • λ: diversity weight balancing relevance versus diversity (typically 0.3-0.5)

The first term rewards candidates similar to the query. The second term penalizes candidates similar to already-selected examples. By subtracting the redundancy term, MMR prevents selecting demonstrations that are too similar to each other, ensuring coverage of diverse aspects of the task.

Out[12]:
Visualization
Scatter plot showing pure relevance selection at lambda=0.
λ=0.0 (Pure Relevance): Selects examples closest to test input, may be redundant.
Scatter plot showing balanced selection at lambda=0.3.
λ=0.3 (Balanced): Balances relevance and diversity for better coverage.
Scatter plot showing high diversity selection at lambda=0.7.
λ=0.7 (High Diversity): Prioritizes spreading selections across the space.

Class-Balanced Selection

For classification tasks, ensuring balanced representation across classes is critical. If your demonstrations are skewed toward one class, the model may be biased toward predicting that class.

Out[13]:
Visualization
Bar chart comparing accuracy with balanced vs imbalanced few-shot examples across classes.
Impact of class balance in few-shot demonstrations. Imbalanced examples bias the model toward the majority class, while balanced selection improves overall accuracy and calibration.

The effect of class imbalance can be dramatic. In extreme cases, providing examples from only one class causes the model to predict that class for all inputs, regardless of the actual content.
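
One simple remedy is to stratify the selection: take an equal quota of demonstrations per label, filling each quota by similarity to the test input. The sketch below is one way to do that; it assumes an embedder like the one used earlier, and the per-class quota logic is an illustrative choice rather than a standard recipe.

Code
from collections import defaultdict

from sklearn.metrics.pairwise import cosine_similarity


def select_balanced_examples(test_input, candidate_pool, embedder, k=4):
    """
    Select k demonstrations with an equal quota per class, ranking candidates
    within each class by similarity to the test input.
    """
    test_emb = embedder([test_input])
    sims = cosine_similarity(test_emb, embedder([inp for inp, _ in candidate_pool]))[0]

    # Group candidate indices by their label
    by_class = defaultdict(list)
    for idx, (_, label) in enumerate(candidate_pool):
        by_class[label].append(idx)

    # Within each class, prefer the most similar candidates
    per_class = max(1, k // len(by_class))
    selected = []
    for indices in by_class.values():
        indices.sort(key=lambda i: sims[i], reverse=True)
        selected.extend(candidate_pool[i] for i in indices[:per_class])
    return selected[:k]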

Ordering and Recency Effects

The order of examples in the prompt matters more than you might expect. Examples closer to the test input (at the end of the prompt) tend to have stronger influence on the prediction. This "recency effect" emerges from the attention mechanism, where nearby tokens naturally receive more attention.

Out[14]:
Visualization
Bar chart showing increasing influence weight for examples from first to last position.
Recency effect in few-shot prompts. Examples at the end of the prompt exert stronger influence on predictions. Placing the most relevant or representative examples last can improve performance.

Practical implications of the recency effect (a small ordering helper is sketched after this list):

  • Place the most representative examples at the end of the prompt
  • When class-balancing, ensure the final examples aren't all from one class
  • For tasks with a default or majority class, counterbalance by ending with minority examples
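
Putting these guidelines into code, the helper below orders already-selected demonstrations so that the example most similar to the test input lands last; the ascending-similarity ordering is an illustrative heuristic rather than a canonical rule.

Code
from sklearn.metrics.pairwise import cosine_similarity


def order_for_recency(selected_examples, test_input, embedder):
    """
    Order demonstrations so the example most similar to the test input
    appears last in the prompt, where the recency effect is strongest.
    """
    demo_inputs = [inp for inp, _ in selected_examples]
    sims = cosine_similarity(embedder([test_input]), embedder(demo_inputs))[0]
    # Ascending similarity: least relevant first, most relevant right before the query
    ranked = sorted(zip(selected_examples, sims), key=lambda pair: pair[1])
    return [example for example, _ in ranked]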

ICL Scaling Behavior

One of the most interesting aspects of in-context learning is how it scales. Both the number of examples and the model size affect performance, but the relationship is not simply additive.

More Examples Help (Up to a Point)

Adding more demonstrations generally improves performance, but with diminishing returns. The first few examples provide the largest gains, while additional examples contribute less.

Out[15]:
Visualization
Line plot showing accuracy vs number of examples for simple, medium, and complex tasks.
Performance as a function of demonstration count for different task types. Simple tasks saturate quickly, while complex tasks benefit from more examples. All curves show diminishing returns, and performance eventually plateaus due to context length limits or task ceiling.

The optimal number of examples depends on:

  • Task complexity: Simple tasks need fewer examples
  • Model capacity: Larger models extract more from each example
  • Example quality: High-quality examples provide more information per demonstration
  • Context budget: Longer examples leave less room for more demonstrations

Model Size Enables ICL

The most dramatic scaling effect is the relationship between model size and ICL capability. Small models show minimal benefit from demonstrations. As models grow larger, they become increasingly able to leverage in-context examples.

Out[16]:
Visualization
Line plot showing ICL improvement over zero-shot for different model sizes from 100M to 175B parameters.
In-context learning capability as a function of model size. Smaller models (under 1B parameters) show minimal ICL ability, treating demonstrations almost as irrelevant context. ICL capability emerges around 1-10B parameters and continues improving with scale.

This emergence pattern has important implications. It suggests that in-context learning is not simply scaling up pattern matching but represents a qualitatively different capability that activates at sufficient scale. Models below a certain threshold don't just do ICL poorly; they essentially don't do it at all.

The Interplay of Scale and Examples

Model size and example count interact in interesting ways. Larger models:

  • Extract more information from each example
  • Continue improving with more examples for longer
  • Generalize better from diverse demonstrations
  • Are more robust to example ordering and selection
Out[17]:
Visualization
Multi-line plot showing performance curves for different model sizes as examples increase.
Interaction between model size and number of in-context examples. Larger models show steeper improvement curves and higher performance ceilings. The gap between zero-shot and few-shot widens with scale, demonstrating that larger models leverage demonstrations more effectively.

The practical takeaway: if you're working with a smaller model, investing heavily in example selection and count may not pay off. But with large models, careful prompt engineering with well-chosen examples can yield substantial gains.

Out[18]:
Visualization
Contour plot showing ICL accuracy as function of model parameters and demonstration count.
ICL performance landscape across model sizes and example counts. Contour lines show constant accuracy levels. Small models (bottom) show minimal improvement regardless of example count. Large models (top) achieve higher performance and continue improving with more examples. The diagonal pattern indicates that model scale and example count are complementary: larger models extract more value from each additional demonstration.

Theoretical Understanding of ICL

How can a model "learn" without updating its weights? This question strikes at the heart of what makes in-context learning so puzzling. In traditional machine learning, learning requires gradient descent: the model sees examples, computes errors, and adjusts its parameters to reduce those errors over many iterations. ICL appears to violate this fundamental requirement. A frozen model, with weights fixed at their pretrained values, somehow adapts its behavior based on demonstrations it has never seen before.

Understanding this phenomenon requires thinking carefully about what "learning" actually means in the context of a forward pass through a neural network. Three complementary hypotheses have emerged from theoretical research, each illuminating a different aspect of how ICL works. We'll explore each in turn, building from intuition to mathematical formalization.

Hypothesis 1: Task Location

The first hypothesis reframes the question: perhaps ICL doesn't involve learning at all in the traditional sense. Instead, demonstrations might serve as a kind of address that locates pre-existing knowledge.

Consider what happens during pretraining. The model processes billions of tokens from diverse sources: Wikipedia articles, books, code repositories, forums, and countless other text formats. Within this ocean of data, the model implicitly encounters many task formulations:

  • Question-answer pairs embedded in FAQ pages and forums
  • Translation examples in multilingual documents
  • Sentiment expressions in reviews and social media
  • Factual statements in encyclopedic content
  • Code with natural language comments and docstrings

The key insight is that pretraining doesn't just teach the model to predict the next token. It teaches the model to recognize patterns and activate appropriate processing strategies. When you provide few-shot demonstrations, you're not teaching the model something new. Instead, you're helping it identify which of its existing capabilities to apply.

Think of it like a library with millions of books but no catalog. The model has all the knowledge, but it needs help finding the right book. Demonstrations serve as search queries that locate the relevant "section" of the model's capabilities.

Out[19]:
Visualization
Conceptual diagram showing task regions in parameter space activated by demonstrations.
Task location hypothesis visualization. The model's parameter space contains regions specialized for different tasks. Demonstrations activate the appropriate region, effectively selecting among existing capabilities rather than learning new ones.

Several lines of evidence support the task location hypothesis:

  • ICL works best for familiar tasks: Performance is highest on tasks similar to those encountered during pretraining, suggesting the model is retrieving existing capabilities rather than learning new ones.
  • Novel operations fail: Tasks requiring genuinely new logical operations that couldn't have been learned from text data tend to fail at ICL, even with many demonstrations.
  • Task descriptions can substitute for examples: Sometimes a natural language description of the task works as well as demonstrations, which makes sense if the goal is location rather than learning.

The task location view is intuitive, but it doesn't fully explain the mechanics. How does the model actually use demonstrations to activate the right capabilities? This leads us to our second hypothesis.

Hypothesis 2: Implicit Gradient Descent

While the task location hypothesis tells us what ICL accomplishes, the implicit gradient descent hypothesis explains how the transformer architecture actually implements this process. The key insight is that the attention mechanism, when viewed mathematically, performs operations that closely resemble gradient descent.

To understand this connection, we need to examine what happens during a single attention layer. When the model processes a prompt containing demonstrations followed by a test input, the attention mechanism allows information to flow from the demonstrations to the test input.

Consider the query input (the test case we want the model to complete). Before attention, it has some initial representation h_query^(0). The attention mechanism then updates this representation by aggregating information from the demonstrations:

h_{\text{query}} = h_{\text{query}}^{(0)} + \sum_{i=1}^{k} \alpha_i \cdot v_i

where:

  • h_query: the updated representation of the query input after attention
  • h_query^(0): the initial representation of the query input before this attention layer
  • α_i: the attention weight assigned to demonstration i, computed via softmax over query-key dot products
  • v_i: the value vector derived from demonstration i, encoding information about that example
  • k: the total number of demonstrations in the prompt

This formula has the same structure as a gradient descent update. In standard gradient descent, we update parameters by:

\theta \leftarrow \theta + \eta \cdot \nabla L

where θ represents the current parameters, η is the learning rate, and ∇L is the gradient of the loss. The analogy becomes clear when we make the following correspondences:

Correspondence between gradient descent and attention mechanism components.
Gradient Descent        | Attention Mechanism
Current parameters θ    | Initial representation h_query^(0)
Learning rate η         | Attention weights α_i
Gradient ∇L             | Value vectors v_i
Updated parameters      | Updated representation h_query

The attention weights α_i act as adaptive, example-specific learning rates. They automatically assign more weight to demonstrations that are more relevant (those with higher query-key similarity). This is actually more sophisticated than standard gradient descent, where the learning rate is typically fixed.

Research has substantiated this connection with increasingly strong results:

  1. Constructive proofs: Researchers have shown that transformers can be explicitly constructed to implement gradient descent algorithms.
  2. Linear attention equivalence: For linear attention layers (without the softmax nonlinearity), the forward pass is mathematically equivalent to one step of gradient descent on a regression problem.
  3. Multi-step optimization: With 96 layers like GPT-3, the model can implement many sequential "optimization steps," allowing for sophisticated adaptation to the demonstrations.
In[20]:
Code
def attention_as_gradient_step(
    query_embedding, demo_embeddings, demo_labels, learning_rate=0.1
):
    """
    Illustrate how attention over demonstrations resembles gradient descent.

    This simplified model shows the conceptual parallel between:
    - Attention weighted combination of demonstration values
    - Gradient-based update toward demonstration patterns

    Args:
        query_embedding: Vector representation of the query input
        demo_embeddings: Matrix of demonstration input embeddings
        demo_labels: Labels/outputs for each demonstration
        learning_rate: Step size for the gradient interpretation

    Returns:
        Updated representation after "learning" from demonstrations
    """
    # Compute attention scores (simplified: dot product similarity)
    attention_logits = np.dot(demo_embeddings, query_embedding)
    attention_weights = np.exp(attention_logits) / np.sum(
        np.exp(attention_logits)
    )

    # The "gradient" can be viewed as a weighted combination of
    # directions toward each demonstration
    gradient = np.zeros_like(query_embedding)
    for i, (demo_emb, weight) in enumerate(
        zip(demo_embeddings, attention_weights)
    ):
        # Direction from query toward demonstration
        direction = demo_emb - query_embedding
        gradient += weight * direction

    # Apply the "gradient update"
    updated = query_embedding + learning_rate * gradient

    return updated, attention_weights
Out[21]:
Console
Attention as Implicit Gradient Descent
==================================================

Query embedding: [1.00, 0.50, 0.20]

Demonstration attention weights:
  Demo 1 (Positive): 0.405
  Demo 2 (Negative): 0.175
  Demo 3 (Positive): 0.420

Updated embedding: [0.978, 0.501, 0.206]
Shift magnitude: 0.0227

The attention weights reveal how the model allocates "learning" across demonstrations. Demo 1 and Demo 3 (both Positive) receive higher weights because their embeddings are more similar to the query. Demo 2 (Negative) receives lower weight due to its dissimilar embedding. The updated representation shifts toward the weighted average of demonstrations, with the shift magnitude indicating how much the query representation changed.

This toy example captures the essence of the implicit gradient descent hypothesis: demonstrations don't just provide information, they actively reshape the query representation through the attention mechanism. Similar examples contribute more to this reshaping, just as training examples with larger gradients contribute more to parameter updates in standard learning.

Out[22]:
Visualization
Heatmap showing attention weights from query to each demonstration.
Attention pattern from test query to demonstrations. Higher attention weights (darker blue) indicate demonstrations that contribute more to updating the query representation. In this example, the model attends most strongly to Demo 1 and Demo 3, which are semantically similar to the query and share the same label (Positive).
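
The linear-attention equivalence mentioned earlier can also be checked numerically. In the minimal sketch below, one gradient step on a linear regression loss (starting from zero weights) yields exactly the same prediction for the query as a softmax-free attention layer whose keys are the demonstration inputs and whose values are the scaled labels; the dimensions and learning rate are arbitrary illustrative choices.

Code
import numpy as np

rng = np.random.default_rng(1)
k, d = 8, 5                      # demonstrations, embedding dimension
X = rng.normal(size=(k, d))      # demonstration inputs
y = rng.normal(size=k)           # demonstration targets
x_q = rng.normal(size=d)         # query input
eta = 0.1                        # learning rate

# One gradient descent step on L(w) = 0.5 * sum_i (w @ x_i - y_i)^2, starting from w = 0
grad_at_zero = -X.T @ y          # gradient of the loss evaluated at w = 0
w_one_step = -eta * grad_at_zero
pred_gd = w_one_step @ x_q

# Linear (softmax-free) attention: scores are raw dot products, values are eta * y_i
scores = X @ x_q
pred_attention = scores @ (eta * y)

print(pred_gd, pred_attention)   # identical up to floating point error
assert np.isclose(pred_gd, pred_attention)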

Hypothesis 3: Bayesian Inference

The first two hypotheses focus on mechanism: what capabilities exist (task location) and how attention implements adaptation (implicit gradient descent). Our third hypothesis takes a probabilistic perspective, asking: how does the model's uncertainty about the task change as it sees more demonstrations?

Imagine you're trying to guess what game someone is playing by watching them take actions. Each action provides evidence about which game it might be. Early on, many games are plausible. As you observe more actions, your beliefs concentrate on the correct game. This is exactly what Bayesian inference formalizes.

Applied to ICL, the model starts with a prior distribution over possible tasks, learned during pretraining. This prior reflects how often different task types appeared in the training data. When you provide demonstrations, each one updates this distribution according to Bayes' rule:

P(\text{task} \mid D_1, \ldots, D_k) \propto P(D_1, \ldots, D_k \mid \text{task}) \cdot P(\text{task})

where:

  • D_i = (x_i, y_i): the i-th demonstration, consisting of an input x_i and its corresponding output y_i
  • P(task): the prior probability of each task type, reflecting how often the model encountered similar patterns during pretraining
  • P(D_1, …, D_k | task): the likelihood of observing these specific demonstrations if the task is as hypothesized (demonstrations consistent with the task have high likelihood)
  • P(task | D_1, …, D_k): the posterior probability of each task after observing all k demonstrations
  • ∝: indicates proportionality (the left side equals the right side divided by a normalizing constant)

Let's trace through this formula step by step to build intuition; a toy numerical sketch follows the list:

  1. The prior P(task) encodes pretraining knowledge. If the model saw many sentiment classification examples during pretraining, sentiment analysis has a higher prior probability than, say, a specialized domain-specific task.

  2. The likelihood P(D_1, …, D_k | task) evaluates consistency. For each possible task interpretation, how likely is it that we would observe these specific demonstrations? If the demonstrations show text mapped to "Positive" and "Negative" labels, the likelihood is high under "sentiment classification" and low under "translation."

  3. The posterior concentrates probability. Multiplying prior and likelihood (and normalizing) yields a posterior that combines our initial beliefs with the evidence from demonstrations. As more demonstrations are added, the posterior becomes sharper, concentrating on the task interpretation that best explains all the evidence.
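
A toy numerical version of this update makes the sharpening concrete. The sketch below assumes three candidate task interpretations with made-up prior probabilities and a made-up per-demonstration likelihood, then applies Bayes' rule once per demonstration; none of these numbers come from an actual model.

Code
import numpy as np

tasks = ["sentiment", "topic classification", "translation"]
prior = np.array([0.40, 0.35, 0.25])  # hypothetical pretraining-induced prior

# Hypothetical likelihood of one sentiment-labeled demonstration under each task
likelihood_per_demo = np.array([0.8, 0.3, 0.05])

posterior = prior.copy()
for step in range(1, 5):
    posterior = posterior * likelihood_per_demo   # multiply in the new evidence
    posterior = posterior / posterior.sum()       # normalize to a distribution
    summary = ", ".join(f"{t}: {p:.3f}" for t, p in zip(tasks, posterior))
    print(f"after {step} demo(s): {summary}")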

This Bayesian view explains several empirical observations about ICL:

Out[23]:
Visualization
Bar chart showing uniform prior distribution over tasks.
Prior P(task): Uniform distribution before seeing any demonstrations.
Bar chart showing posterior after one demonstration.
After 1 Example: Posterior begins concentrating on sentiment classification.
Bar chart showing peaked posterior after four demonstrations.
After 4 Examples: High confidence in the correct task interpretation.

The visualization illustrates how the posterior evolves. Initially, probability is spread uniformly across task interpretations. After one demonstration showing an input mapped to "Positive," the posterior shifts toward sentiment classification. After four demonstrations consistently showing sentiment labels, the model has high confidence in the task identity.

Evidence supporting the Bayesian view comes from several empirical observations:

  • Inconsistent demonstrations hurt performance. If you mix correct and incorrect labels, the likelihood term is reduced for all task interpretations, leading to a diffuse posterior and poor predictions.
  • Ambiguous prompts yield uncertain outputs. When demonstrations could plausibly come from multiple tasks, the model's predictions are less confident, exactly as a diffuse posterior would predict.
  • More examples help even when redundant. Providing 16 similar examples rather than 4 can improve performance, because each example sharpens the posterior even if it doesn't add new information.

Synthesizing the Theories

These three hypotheses are not competing explanations but rather complementary perspectives on the same phenomenon. Each answers a different question:

The three ICL hypotheses address complementary questions about the phenomenon.
Hypothesis                 | Question Answered
Task Location              | What capabilities can ICL access?
Implicit Gradient Descent  | How does attention implement adaptation?
Bayesian Inference         | How does uncertainty reduce with evidence?

A unified view might work like this: During pretraining, the model develops specialized circuits for different tasks (task location). When demonstrations are provided, the attention mechanism adapts the query representation toward these demonstrations (implicit gradient descent), which has the effect of concentrating probability on task interpretations consistent with the evidence (Bayesian inference).

Recent theoretical work has made this unified view more concrete. Researchers have shown that transformers, through their architecture, learn to implement meta-learning algorithms during pretraining. When presented with demonstrations at inference time, they execute these algorithms to infer the task and generate appropriate predictions. The specific algorithm implemented may vary by layer, attention head, and task type, with different components of the network contributing different aspects of the adaptation process.

Improving ICL Performance

Researchers have developed numerous techniques to enhance in-context learning beyond simple few-shot prompting.

Chain-of-Thought Prompting

For reasoning tasks, showing intermediate steps significantly improves performance. Instead of just input-output pairs, demonstrations include the reasoning process:

In[24]:
Code
# Standard few-shot
standard_prompt = """
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 balls. How many tennis balls does he have now?
A: 11

Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many do they have?
A:"""

# Chain-of-thought few-shot
cot_prompt = """
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 balls. How many tennis balls does he have now?
A: Roger started with 5 balls. He bought 2 cans with 3 balls each, so 2 * 3 = 6 new balls. Total: 5 + 6 = 11.

Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many do they have?
A:"""
Out[25]:
Console
Standard Few-Shot:
----------------------------------------

Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 balls. How many tennis balls does he have now?
A: 11

Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many do they have?
A:

Chain-of-Thought Few-Shot:
----------------------------------------

Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 balls. How many tennis balls does he have now?
A: Roger started with 5 balls. He bought 2 cans with 3 balls each, so 2 * 3 = 6 new balls. Total: 5 + 6 = 11.

Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many do they have?
A:

Chain-of-thought (CoT) prompting guides the model to show its work. The reasoning trace helps the model:

  • Break complex problems into steps
  • Catch and correct errors mid-reasoning
  • Align intermediate computations with the final answer

CoT is particularly effective for arithmetic, logical reasoning, and multi-step problems where the direct answer is hard to compute in a single step.

Self-Consistency

Rather than generating a single answer, self-consistency samples multiple reasoning paths and aggregates them through majority voting. The procedure works as follows:

  1. Generate N different chain-of-thought responses using temperature T > 0 to introduce diversity
  2. Extract the final answer a_i from each response i
  3. Return the most common answer (majority vote)

Formally, the self-consistency prediction is:

\hat{a} = \arg\max_{a} \sum_{i=1}^{N} \mathbf{1}[a_i = a]

where:

  • â: the final predicted answer
  • N: the number of sampled reasoning paths
  • a_i: the answer extracted from the i-th reasoning path
  • 1[a_i = a]: an indicator function that equals 1 if answer a_i matches a, and 0 otherwise
  • arg max over a: selects the answer a that maximizes the vote count

This approach leverages the intuition that correct reasoning paths, while potentially different in details, converge on the same answer. Incorrect paths are more likely to diverge, producing different wrong answers that split the vote. With enough samples, the correct answer accumulates more votes than any single incorrect answer.
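
The aggregation step itself is a simple tally. A minimal sketch, assuming the final answers have already been parsed out of the sampled completions:

Code
from collections import Counter


def self_consistency_vote(sampled_answers):
    """Majority vote over answers extracted from sampled reasoning paths."""
    counts = Counter(sampled_answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes, counts


# Answers parsed from 10 hypothetical chain-of-thought samples
answers = ["42", "42", "38", "42", "42", "45", "42", "38", "42", "42"]
best, votes, counts = self_consistency_vote(answers)
print(f"Selected answer: {best} ({votes}/{len(answers)} votes), counts: {dict(counts)}")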

Out[26]:
Visualization
Diagram showing multiple reasoning paths converging through majority vote.
Self-consistency improves reliability by sampling multiple reasoning paths. While individual paths may contain errors, the majority vote typically converges on the correct answer. This technique is especially valuable for complex reasoning tasks.
Out[27]:
Visualization
Bar chart showing vote counts for different answer values from self-consistency sampling.
Self-consistency vote distribution from 10 sampled reasoning paths. The correct answer (42) receives the most votes despite individual reasoning errors in some paths. Wrong answers scatter across different values, diluting their vote share. This demonstrates how aggregation improves reliability.

Calibration and Confidence

ICL predictions can be overconfident or poorly calibrated. Several techniques address this (a calibration sketch follows the list):

  • Probability calibration: Adjust output probabilities using held-out validation data
  • Verbalizing uncertainty: Ask the model to express confidence in its answer
  • Null prompts: Compare predictions with and without demonstrations to detect when the model is guessing
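
As one concrete instance of probability calibration, contextual calibration rescales the model's label probabilities by its output on a content-free input (such as "N/A") placed in the same prompt. The sketch below shows only the rescaling arithmetic; obtaining the raw probabilities from a model, and the specific numbers used, are assumptions for illustration.

Code
import numpy as np


def contextual_calibration(label_probs, content_free_probs):
    """
    Rescale label probabilities by the model's bias on a content-free input.

    label_probs: probabilities assigned to each label for the real input
    content_free_probs: probabilities for the same labels on a content-free input
    """
    calibrated = np.asarray(label_probs) / np.asarray(content_free_probs)
    return calibrated / calibrated.sum()


# Hypothetical numbers: the prompt biases the model toward "Positive"
raw = [0.70, 0.30]        # P(Positive), P(Negative) on the real input
baseline = [0.65, 0.35]   # same probabilities when the input is "N/A"
print(contextual_calibration(raw, baseline))  # bias largely removed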

Retrieval-Augmented ICL

Instead of using a fixed set of demonstrations, retrieval-augmented approaches dynamically select examples for each test input:

  1. Embed the test input
  2. Retrieve similar examples from a large pool
  3. Use retrieved examples as demonstrations

This combines the benefits of similarity-based selection with the scalability of large example pools. The model effectively has access to thousands of potential demonstrations, selecting the most relevant ones at inference time.
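
Combining the retrieval step with the prompt construction shown earlier in the chapter, a minimal sketch might look like the following; it reuses select_similar_examples and create_icl_prompt defined above and assumes example_store is a large list of (input, output) pairs.

Code
def retrieval_augmented_prompt(test_input, example_store, embedder, k=8):
    """
    Build an ICL prompt by retrieving the k stored examples most similar
    to the test input, then formatting them as demonstrations.
    """
    demonstrations = select_similar_examples(
        test_input, example_store, embedder, k=k
    )
    return create_icl_prompt(
        task_description=None,
        examples=demonstrations,
        test_input=test_input,
    )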

Out[28]:
Visualization
Grouped bar chart comparing accuracy of random vs retrieval-based example selection.
Performance comparison of random vs. retrieval-based example selection across tasks. Retrieval-augmented ICL (using semantic similarity to select demonstrations) consistently outperforms random selection, with the largest gains on tasks where relevant examples significantly differ from irrelevant ones.

Limitations of In-Context Learning

ICL has fundamental limitations that practitioners should understand.

Context Length Constraints

The number of demonstrations is bounded by the model's context window. With 4,096 tokens, you might fit 10-20 examples depending on their length. This hard limit constrains what ICL can learn from a single prompt.

Long-context models (32K, 100K+ tokens) partially address this, but attention over very long contexts introduces its own challenges: slower inference, potential loss of focus, and quadratic memory scaling.

Out[29]:
Visualization
Dual-axis line plot showing accuracy and inference time vs number of demonstrations.
Context length trade-offs in ICL. More examples improve accuracy (solid blue), but inference time increases with context length (dashed orange). At some point, the marginal accuracy gain from additional examples no longer justifies the added latency, especially for real-time applications.

Sensitivity to Irrelevant Details

ICL is sensitive to aspects of the prompt that shouldn't matter:

  • Formatting: Changing delimiters or spacing affects predictions
  • Order: Shuffling examples changes accuracy
  • Wording: Synonymous task descriptions yield different results
  • Irrelevant tokens: Even a small amount of added noise degrades performance

This brittleness makes reliable deployment challenging. A prompt optimized on development data may fail unexpectedly on edge cases.

Out[30]:
Visualization
Bar chart showing accuracy drops from various prompt perturbations.
ICL sensitivity to prompt variations. Small changes that shouldn't affect the task definition can cause significant accuracy swings. This brittleness requires careful prompt engineering and validation.

No Permanent Learning

ICL provides only temporary task adaptation. Each new inference requires providing demonstrations again. The model doesn't retain any information between queries. For frequently repeated tasks, this overhead can be significant compared to a fine-tuned model that encodes the task permanently.

Limited Task Complexity

Some tasks are too complex to demonstrate in a few examples. Tasks requiring:

  • Extensive domain knowledge
  • Multi-step procedures with many steps
  • Precise output formats (structured data, code)
  • Consistency across a long document

often exceed what ICL can reliably achieve. Fine-tuning or more sophisticated architectures may be necessary.

Summary

In-context learning has changed how we adapt language models to new tasks. Rather than training specialized models, we demonstrate tasks through examples and let the model infer what to do.

Key takeaways:

  • ICL is weight-free learning: The model performs new tasks by conditioning on demonstrations, without any gradient updates. This enables rapid task switching and prototyping.

  • Example selection matters enormously: Choosing relevant, diverse, and balanced examples can swing performance by 20+ percentage points. Strategies like similarity-based selection and MMR help optimize demonstration quality.

  • ICL scales with model size: Small models show minimal ICL capability. The ability to learn from demonstrations emerges around 1-10 billion parameters and continues improving with scale.

  • Multiple theories explain ICL: Task location (activating existing capabilities), implicit gradient descent (attention as optimization), and Bayesian inference (updating task posteriors) offer complementary perspectives on why ICL works.

  • Techniques can enhance ICL: Chain-of-thought prompting, self-consistency, and retrieval augmentation extend what ICL can achieve, particularly for reasoning tasks.

  • Limitations persist: Context length constraints, prompt sensitivity, lack of permanent learning, and task complexity bounds constrain what ICL can reliably accomplish.

In-context learning has changed how NLP is practiced. Tasks that once required careful dataset curation and model training can now be prototyped in minutes with a few well-chosen examples. Understanding both its capabilities and limitations is important for effective application.

Key Parameters

When designing in-context learning prompts, several parameters significantly affect performance:

  • k (number of demonstrations): The count of input-output examples included in the prompt. Start with 4-8 examples for most tasks. Simple classification may need only 2-4, while complex reasoning benefits from 16-32. More examples generally help but with diminishing returns, and you're constrained by context length.

  • diversity_weight (λ): In MMR-based example selection, this parameter balances relevance (similarity to test input) versus diversity (dissimilarity to already-selected examples). Values of 0.3-0.5 typically work well. Higher values (0.5-0.7) prioritize coverage across the input space; lower values (0.1-0.3) prioritize relevance to the specific test case.

  • temperature: When using self-consistency or sampling multiple outputs, temperature controls randomness. Use 0.0 for deterministic outputs, 0.7-1.0 for diverse reasoning paths in self-consistency. Higher temperatures increase diversity but may reduce coherence.

  • N (self-consistency samples): The number of reasoning paths to sample before majority voting. Common values are 5-40. More samples improve reliability but increase cost linearly. For critical applications, 20-40 samples provide robust estimates; for rapid prototyping, 5-10 may suffice.

  • example ordering: Place the most representative or highest-quality examples at the end of the prompt (closest to the test input) due to recency effects. For classification, ensure the final 2-3 examples aren't all from the same class.

  • prompt format: The delimiter style, spacing, and structure of demonstrations. Consistent formatting across examples helps the model recognize the pattern. Common formats include "Input: X / Output: Y" or "Q: X / A: Y". Match the format to what the model likely saw during pretraining.
