In-Context Learning: How LLMs Learn from Examples Without Training

Michael Brenndoerfer · Updated July 27, 2025 · 51 min read

Explore how large language models learn new tasks from prompt demonstrations without weight updates. Covers example selection, scaling behavior, and theoretical explanations.

In-Context Learning

When GPT-3 was released in 2020, researchers observed something unexpected: the model could perform new tasks simply by being shown a few examples in the prompt, without any gradient updates or fine-tuning. This capability, called in-context learning (ICL), represented a fundamental shift in how we think about adapting language models to new tasks. Rather than collecting labeled datasets and training specialized models, you could now demonstrate a task through examples and let the model infer what to do.

This chapter explores in-context learning in depth. We'll examine why it works, how it compares to traditional fine-tuning, what makes some examples more effective than others, how ICL capabilities scale with model size, and the theoretical frameworks researchers have developed to explain this surprising phenomenon.

What Is In-Context Learning?

In-context learning refers to a language model's ability to learn new tasks from examples provided in the input prompt, without updating any model parameters. The model "learns" the task pattern by conditioning on the demonstrations and applies that pattern to new inputs.

In-Context Learning

In-context learning (ICL) is the ability of a pretrained language model to perform a task by conditioning on a few input-output examples (demonstrations) in the prompt, without any gradient-based training or weight updates. The model infers the task from the examples and generalizes to new inputs.

Consider a translation task. Instead of fine-tuning a model on thousands of parallel sentences, you can provide a few examples directly in the prompt:

English: The weather is beautiful today.
French: Le temps est magnifique aujourd'hui.
English: I would like a cup of coffee.
French: Je voudrais une tasse de café.
English: Where is the nearest train station?
French:

The model completes this with "Où est la gare la plus proche?" having inferred the translation pattern from just two examples. No training occurred. The model's weights remain unchanged. Yet it performs the task correctly.

This behavior was surprising because traditional machine learning assumes you need gradient updates to learn. The model must see many examples, compute a loss, and adjust its parameters. ICL breaks this assumption: learning happens within a single forward pass through the network.

The Three Regimes of Prompting

The GPT-3 paper formalized three prompting regimes based on how many examples are provided:

In[3]:
Code
def create_icl_prompt(task_description, examples, test_input):
    """
    Create an in-context learning prompt with demonstrations.

    Args:
        task_description: Optional natural language description of the task
        examples: List of (input, output) tuples as demonstrations
        test_input: The input to be completed by the model

    Returns:
        Formatted prompt string
    """
    parts = []

    if task_description:
        parts.append(task_description)
        parts.append("")

    for inp, out in examples:
        parts.append(f"Input: {inp}")
        parts.append(f"Output: {out}")
        parts.append("")

    parts.append(f"Input: {test_input}")
    parts.append("Output:")

    return "\n".join(parts)


# Zero-shot: Task description only, no examples
zero_shot = create_icl_prompt(
    task_description="Classify the sentiment as Positive or Negative.",
    examples=[],
    test_input="The movie was a complete waste of time.",
)

# One-shot: Single demonstration
one_shot = create_icl_prompt(
    task_description=None,
    examples=[("I love this product!", "Positive")],
    test_input="The movie was a complete waste of time.",
)

# Few-shot: Multiple demonstrations
few_shot = create_icl_prompt(
    task_description=None,
    examples=[
        ("I love this product!", "Positive"),
        ("Terrible customer service.", "Negative"),
        ("Exceeded all my expectations.", "Positive"),
        ("Would not recommend to anyone.", "Negative"),
    ],
    test_input="The movie was a complete waste of time.",
)
Out[4]:
Console
==================================================
ZERO-SHOT (task description only)
==================================================
Classify the sentiment as Positive or Negative.

Input: The movie was a complete waste of time.
Output:

==================================================
ONE-SHOT (1 example)
==================================================
Input: I love this product!
Output: Positive

Input: The movie was a complete waste of time.
Output:

==================================================
FEW-SHOT (4 examples)
==================================================
Input: I love this product!
Output: Positive

Input: Terrible customer service.
Output: Negative

Input: Exceeded all my expectations.
Output: Positive

Input: Would not recommend to anyone.
Output: Negative

Input: The movie was a complete waste of time.
Output:

Each regime has distinct characteristics:

  • Zero-shot: The model relies entirely on its pretrained knowledge and the task description. Works best for common tasks the model encountered during pretraining.
  • One-shot: A single example clarifies the expected format and task. Often sufficient for simple classification or formatting tasks.
  • Few-shot: Multiple examples help the model distinguish between classes, understand edge cases, and calibrate its confidence. Typically 4-32 examples, limited by context length.

The number of examples you can provide is constrained by the model's context window. With a 2,048 token limit (GPT-3) or 8,192+ tokens (later models), you must balance demonstration count against prompt length.
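
To get a sense of this trade-off before writing a prompt, a rough budget check helps. The sketch below is a minimal estimate assuming roughly four characters per token; the helper name, the reserved-token margin, and the ratio are illustrative assumptions, not properties of any particular tokenizer.

Code
def estimate_max_demonstrations(
    example_pairs, context_limit=2048, reserved_tokens=256, chars_per_token=4
):
    """
    Rough estimate of how many demonstrations fit in a context window.

    Uses a crude ~4 characters-per-token heuristic; for real budgeting,
    count tokens with the model's actual tokenizer.
    """
    budget = context_limit - reserved_tokens  # leave room for the query and completion
    used = 0
    fitted = 0
    for inp, out in example_pairs:
        formatted = f"Input: {inp}\nOutput: {out}\n\n"
        cost = len(formatted) // chars_per_token + 1
        if used + cost > budget:
            break
        used += cost
        fitted += 1
    return fitted


sentiment_examples = [("I love this product!", "Positive")] * 100
print(estimate_max_demonstrations(sentiment_examples, context_limit=2048))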

ICL vs Fine-Tuning

Traditional task adaptation requires fine-tuning: updating a pretrained model's weights on task-specific labeled data. In-context learning offers an alternative that trades training for inference. Understanding when to use each approach requires examining their fundamental differences.

The Fine-Tuning Paradigm

Fine-tuning involves several steps:

  1. Collect labeled examples for your task (typically thousands)
  2. Initialize from a pretrained model
  3. Train on your data with gradient descent
  4. Deploy the specialized model

The result is a dedicated model for your task. Its weights have been permanently modified to excel at that specific application. Fine-tuning produces strong performance but requires:

  • Labeled training data
  • Compute for training
  • Expertise to avoid overfitting
  • Separate model storage per task

The ICL Paradigm

In-context learning follows a different path:

  1. Prepare a small number of demonstrations (typically 1-32)
  2. Format them into a prompt
  3. Pass the prompt to the base model at inference time

The base model remains unchanged. The same model serves all tasks, with demonstrations specifying what to do. ICL requires no training but consumes inference compute for longer prompts.

Out[5]:
Visualization
Flowchart showing fine-tuning approach to task adaptation.
Fine-tuning creates task-specific models through gradient updates on labeled data.
Flowchart showing ICL approach to task adaptation.
In-context learning uses a single base model with task-specific prompts containing examples.

Comparing Performance

How do these approaches compare in practice? The answer depends on several factors:

Trade-offs between fine-tuning and in-context learning approaches.
Aspect            | Fine-Tuning                         | In-Context Learning
Data required     | 100s-1000s of examples              | 1-32 examples
Training time     | Hours to days                       | None
Inference cost    | Standard                            | Higher (longer prompts)
Task switching    | Load new model                      | Change prompt
Peak performance  | Generally higher                    | Competitive for many tasks
Model updates     | New base model requires retraining  | Automatic improvement

Fine-tuning typically achieves higher accuracy when sufficient training data is available. The model has many gradient steps to learn task-specific patterns. ICL is limited to what can be demonstrated in a prompt and what the model learned during pretraining.

However, ICL excels in scenarios where fine-tuning is impractical:

  • Rapid prototyping: Test ideas without training infrastructure
  • Low-resource tasks: Limited labeled data available
  • Dynamic tasks: Task definition changes frequently
  • Multi-task deployment: Single model serves many tasks
Out[6]:
Visualization
Line plot comparing fine-tuning and few-shot ICL performance as a function of training examples.
Illustrative comparison of task performance between fine-tuning and ICL. Fine-tuning (solid line) typically achieves higher accuracy with sufficient data, while ICL (dashed lines) offers strong performance with minimal examples. The shaded region shows where ICL is competitive.

The crossover point varies by task. For simple classification, ICL often matches fine-tuning with just a few examples. For complex reasoning or structured prediction, fine-tuning's advantage is more pronounced.

When Representations Converge

Recent research shows that ICL and fine-tuning produce similar internal representations despite their different mechanisms. When you examine the hidden states of a model performing a task via ICL versus a fine-tuned version of that model, the representations are often highly correlated.

This suggests that both approaches activate similar "circuits" in the network. Fine-tuning strengthens these circuits through weight updates. ICL activates them through attention over demonstrations. The end result, in terms of what the model computes, is quite similar.
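
One way to probe this similarity is to compare representation matrices directly. The sketch below uses linear centered kernel alignment (CKA), a common representation-similarity measure; the random matrices stand in for hidden states that would, in practice, be extracted from an ICL prompt and from a fine-tuned model, so the specific numbers here are purely illustrative.

Code
import numpy as np


def linear_cka(X, Y):
    """
    Linear centered kernel alignment between two representation matrices.

    X, Y: arrays of shape (n_examples, hidden_dim), with rows paired.
    Returns a similarity in [0, 1]; 1 means the representations span
    essentially the same structure.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    numerator = np.linalg.norm(Y.T @ X, "fro") ** 2
    denominator = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return numerator / denominator


# Stand-ins for hidden states; in practice these would come from the model
rng = np.random.default_rng(0)
icl_states = rng.normal(size=(64, 256))
finetuned_states = icl_states + 0.1 * rng.normal(size=(64, 256))  # nearly identical
print(f"CKA similarity: {linear_cka(icl_states, finetuned_states):.3f}")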

Example Selection Strategies

Not all demonstrations are equally effective. Research has shown that the choice of examples can swing performance by 20-30 percentage points. Understanding what makes examples effective has become a critical skill in prompt engineering.

Factors That Influence Example Quality

Several properties affect how useful an example is as a demonstration:

  • Relevance: Examples similar to the test input help more than dissimilar ones
  • Diversity: Examples should cover the range of possible inputs and outputs
  • Clarity: Unambiguous examples with clear input-output relationships work best
  • Correctness: Erroneous labels can mislead the model
  • Ordering: Examples at the end of the prompt have stronger influence

Let's examine each factor and how to optimize for it.

Semantic Similarity Selection

One of the most effective strategies is selecting examples that are semantically similar to the test input. The intuition is that nearby examples in embedding space share relevant features that help the model generalize.

Similarity is typically measured using cosine similarity between embedding vectors:

\text{sim}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \|\mathbf{b}\|}

where:

  • a and b: embedding vectors for two text inputs
  • a · b: the dot product of the two vectors
  • ‖a‖ and ‖b‖: the Euclidean norms (lengths) of each vector

Cosine similarity ranges from -1 (opposite directions) to 1 (identical directions), with 0 indicating orthogonal (unrelated) vectors. For text embeddings, higher similarity indicates semantically related content.

In[7]:
Code
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


def select_similar_examples(test_input, candidate_pool, embedder, k=4):
    """
    Select k examples most similar to the test input.

    Args:
        test_input: The input to find similar examples for
        candidate_pool: List of (input, output) example tuples
        embedder: Function that maps text to embedding vectors
        k: Number of examples to select

    Returns:
        List of k most similar examples
    """
    # Embed the test input
    test_embedding = embedder([test_input])

    # Embed all candidates
    candidate_inputs = [inp for inp, _ in candidate_pool]
    candidate_embeddings = embedder(candidate_inputs)

    # Compute similarities
    similarities = cosine_similarity(test_embedding, candidate_embeddings)[0]

    # Select top-k
    top_indices = np.argsort(similarities)[-k:][::-1]

    return [candidate_pool[i] for i in top_indices]


# Simulate with mock embeddings for demonstration
def mock_embedder(texts):
    """Mock embedder that returns random vectors (consistent within a run)."""
    np.random.seed(hash(str(texts)) % 2**32)
    return np.random.randn(len(texts), 64)


# Example pool
example_pool = [
    ("Great food and amazing service!", "Positive"),
    ("Waited an hour for cold food.", "Negative"),
    ("The steak was cooked perfectly.", "Positive"),
    ("Overpriced for what you get.", "Negative"),
    ("Best Italian restaurant in town!", "Positive"),
    ("Never coming back here again.", "Negative"),
    ("Friendly staff and cozy atmosphere.", "Positive"),
    ("Found a hair in my soup.", "Negative"),
]

test_input = "The pasta was absolutely delicious."
Out[8]:
Console
Test input: "The pasta was absolutely delicious."

Selecting 4 examples by semantic similarity...

Selected demonstrations:
  1. "Waited an hour for cold food." → Negative
  2. "Best Italian restaurant in town!" → Positive
  3. "Great food and amazing service!" → Positive
  4. "Friendly staff and cozy atmosphere." → Positive

The selection algorithm identifies examples that are semantically closest to the test input. In this case, the mock embedder produces consistent but randomized vectors, so the selected examples demonstrate the general approach. In practice, using a real sentence embedding model like sentence-transformers would select examples with genuinely similar meaning and vocabulary.

Similarity-based selection consistently outperforms random selection. The most similar examples contain vocabulary, style, and semantic features that help the model understand what kind of input it's processing.

Out[9]:
Visualization
Scatter plot showing test input and candidate examples in 2D embedding space with connections to selected neighbors.
Semantic similarity selection in embedding space. The test input (star) is compared against candidate examples (circles). The algorithm selects the k nearest neighbors (connected by dashed lines). Positive and negative examples naturally cluster based on semantic content.

Diversity Through Coverage

While similarity helps, relying solely on it can create blind spots. If all selected examples are too similar, the model may not learn to handle edge cases or the full range of outputs.

A balanced approach selects examples that are both relevant and diverse:

In[10]:
Code
def select_diverse_examples(
    test_input, candidate_pool, embedder, k=4, diversity_weight=0.3
):
    """
    Select examples balancing similarity and diversity.

    Uses maximal marginal relevance (MMR) to avoid redundant examples.

    Args:
        test_input: Input to select examples for
        candidate_pool: Available (input, output) examples
        embedder: Text to embedding function
        k: Number of examples to select
        diversity_weight: Balance between relevance (0) and diversity (1)

    Returns:
        List of k balanced examples
    """
    # Get embeddings
    test_emb = embedder([test_input])
    candidate_inputs = [inp for inp, _ in candidate_pool]
    candidate_embs = embedder(candidate_inputs)

    # Compute relevance scores
    relevance = cosine_similarity(test_emb, candidate_embs)[0]

    selected = []
    selected_embs = []
    remaining = list(range(len(candidate_pool)))

    for _ in range(k):
        if not remaining:
            break

        best_score = -float("inf")
        best_idx = None

        for idx in remaining:
            # Relevance term
            rel_score = relevance[idx]

            # Diversity term: max similarity to already selected
            if selected_embs:
                div_score = max(
                    cosine_similarity([candidate_embs[idx]], selected_embs)[0]
                )
            else:
                div_score = 0

            # MMR score
            score = (
                1 - diversity_weight
            ) * rel_score - diversity_weight * div_score

            if score > best_score:
                best_score = score
                best_idx = idx

        selected.append(candidate_pool[best_idx])
        selected_embs.append(candidate_embs[best_idx])
        remaining.remove(best_idx)

    return selected
Out[11]:
Console
Maximal Marginal Relevance (MMR) Selection
==================================================
Test: "The pasta was absolutely delicious."

With diversity_weight=0.3, selected examples:
  1. "Waited an hour for cold food." → Negative
  2. "Best Italian restaurant in town!" → Positive
  3. "Found a hair in my soup." → Negative
  4. "Great food and amazing service!" → Positive

Class distribution: 2 Positive, 2 Negative

Unlike pure similarity-based selection, MMR actively penalizes redundancy. The diversity weight of 0.3 means the algorithm balances 70% relevance with 30% diversity. This typically results in better class coverage, as selecting multiple very similar examples provides diminishing returns.

The maximal marginal relevance (MMR) algorithm iteratively selects examples that are relevant to the query but dissimilar to already-selected examples. At each step, MMR scores each candidate using:

\text{MMR}(d) = (1 - \lambda) \cdot \text{sim}(d, q) - \lambda \cdot \max_{s \in S} \text{sim}(d, s)

where:

  • d: a candidate demonstration being evaluated
  • q: the query (test input) we want to find examples for
  • S: the set of already-selected demonstrations
  • sim(d, q): similarity between candidate d and query q (relevance term)
  • max over s in S of sim(d, s): maximum similarity between candidate d and any already-selected example (redundancy term)
  • λ: diversity weight balancing relevance versus diversity (typically 0.3-0.5)

The first term rewards candidates similar to the query. The second term penalizes candidates similar to already-selected examples. By subtracting the redundancy term, MMR prevents selecting demonstrations that are too similar to each other, ensuring coverage of diverse aspects of the task.

Out[12]:
Visualization
Scatter plot showing pure relevance selection at lambda=0.
λ=0.0 (Pure Relevance): Selects examples closest to test input, may be redundant.
Scatter plot showing balanced selection at lambda=0.3.
λ=0.3 (Balanced): Balances relevance and diversity for better coverage.
Scatter plot showing high diversity selection at lambda=0.7.
λ=0.7 (High Diversity): Prioritizes spreading selections across the space.

Class-Balanced Selection

For classification tasks, ensuring balanced representation across classes is critical. If your demonstrations are skewed toward one class, the model may be biased toward predicting that class.

Out[13]:
Visualization
Bar chart comparing accuracy with balanced vs imbalanced few-shot examples across classes.
Impact of class balance in few-shot demonstrations. Imbalanced examples bias the model toward the majority class, while balanced selection improves overall accuracy and calibration.

The effect of class imbalance can be dramatic. In extreme cases, providing examples from only one class causes the model to predict that class for all inputs, regardless of the actual content.
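
One simple remedy is to stratify the selection: take an equal quota of demonstrations per label, filling each quota by similarity to the test input. The sketch below is one way to do that; it assumes an embedder like the one used earlier, and the per-class quota logic is an illustrative choice rather than a standard recipe.

Code
from collections import defaultdict

from sklearn.metrics.pairwise import cosine_similarity


def select_balanced_examples(test_input, candidate_pool, embedder, k=4):
    """
    Select k demonstrations with an equal quota per class, ranking candidates
    within each class by similarity to the test input.
    """
    test_emb = embedder([test_input])
    sims = cosine_similarity(test_emb, embedder([inp for inp, _ in candidate_pool]))[0]

    # Group candidate indices by their label
    by_class = defaultdict(list)
    for idx, (_, label) in enumerate(candidate_pool):
        by_class[label].append(idx)

    # Within each class, prefer the most similar candidates
    per_class = max(1, k // len(by_class))
    selected = []
    for indices in by_class.values():
        indices.sort(key=lambda i: sims[i], reverse=True)
        selected.extend(candidate_pool[i] for i in indices[:per_class])
    return selected[:k]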

Ordering and Recency Effects

The order of examples in the prompt matters more than you might expect. Examples closer to the test input (at the end of the prompt) tend to have stronger influence on the prediction. This "recency effect" emerges from the attention mechanism, where nearby tokens naturally receive more attention.

Out[14]:
Visualization
Bar chart showing increasing influence weight for examples from first to last position.
Recency effect in few-shot prompts. Examples at the end of the prompt exert stronger influence on predictions. Placing the most relevant or representative examples last can improve performance.

Practical implications of the recency effect (a small ordering helper is sketched after this list):

  • Place the most representative examples at the end of the prompt
  • When class-balancing, ensure the final examples aren't all from one class
  • For tasks with a default or majority class, counterbalance by ending with minority examples
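
Putting these guidelines into code, the helper below orders already-selected demonstrations so that the example most similar to the test input lands last; the ascending-similarity ordering is an illustrative heuristic rather than a canonical rule.

Code
from sklearn.metrics.pairwise import cosine_similarity


def order_for_recency(selected_examples, test_input, embedder):
    """
    Order demonstrations so the example most similar to the test input
    appears last in the prompt, where the recency effect is strongest.
    """
    demo_inputs = [inp for inp, _ in selected_examples]
    sims = cosine_similarity(embedder([test_input]), embedder(demo_inputs))[0]
    # Ascending similarity: least relevant first, most relevant right before the query
    ranked = sorted(zip(selected_examples, sims), key=lambda pair: pair[1])
    return [example for example, _ in ranked]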

ICL Scaling Behavior

One of the most interesting aspects of in-context learning is how it scales. Both the number of examples and the model size affect performance, but the relationship is not simply additive.

More Examples Help (Up to a Point)

Adding more demonstrations generally improves performance, but with diminishing returns. The first few examples provide the largest gains, while additional examples contribute less.

Out[15]:
Visualization
Line plot showing accuracy vs number of examples for simple, medium, and complex tasks.
Performance as a function of demonstration count for different task types. Simple tasks saturate quickly, while complex tasks benefit from more examples. All curves show diminishing returns, and performance eventually plateaus due to context length limits or task ceiling.

The optimal number of examples depends on:

  • Task complexity: Simple tasks need fewer examples
  • Model capacity: Larger models extract more from each example
  • Example quality: High-quality examples provide more information per demonstration
  • Context budget: Longer examples leave less room for more demonstrations

Model Size Enables ICL

The most dramatic scaling effect is the relationship between model size and ICL capability. Small models show minimal benefit from demonstrations. As models grow larger, they become increasingly able to leverage in-context examples.

Out[16]:
Visualization
Line plot showing ICL improvement over zero-shot for different model sizes from 100M to 175B parameters.
In-context learning capability as a function of model size. Smaller models (under 1B parameters) show minimal ICL ability, treating demonstrations almost as irrelevant context. ICL capability emerges around 1-10B parameters and continues improving with scale.

This emergence pattern has important implications. It suggests that in-context learning is not simply scaling up pattern matching but represents a qualitatively different capability that activates at sufficient scale. Models below a certain threshold don't just do ICL poorly; they essentially don't do it at all.

The Interplay of Scale and Examples

Model size and example count interact in interesting ways. Larger models:

  • Extract more information from each example
  • Continue improving with more examples for longer
  • Generalize better from diverse demonstrations
  • Are more robust to example ordering and selection
Out[17]:
Visualization
Multi-line plot showing performance curves for different model sizes as examples increase.
Interaction between model size and number of in-context examples. Larger models show steeper improvement curves and higher performance ceilings. The gap between zero-shot and few-shot widens with scale, demonstrating that larger models leverage demonstrations more effectively.

The practical takeaway: if you're working with a smaller model, investing heavily in example selection and count may not pay off. But with large models, careful prompt engineering with well-chosen examples can yield substantial gains.

Out[18]:
Visualization
Contour plot showing ICL accuracy as function of model parameters and demonstration count.
ICL performance landscape across model sizes and example counts. Contour lines show constant accuracy levels. Small models (bottom) show minimal improvement regardless of example count. Large models (top) achieve higher performance and continue improving with more examples. The diagonal pattern indicates that model scale and example count are complementary: larger models extract more value from each additional demonstration.

Theoretical Understanding of ICL

How can a model "learn" without updating its weights? This question strikes at the heart of what makes in-context learning so puzzling. In traditional machine learning, learning requires gradient descent: the model sees examples, computes errors, and adjusts its parameters to reduce those errors over many iterations. ICL appears to violate this fundamental requirement. A frozen model, with weights fixed at their pretrained values, somehow adapts its behavior based on demonstrations it has never seen before.

Understanding this phenomenon requires thinking carefully about what "learning" actually means in the context of a forward pass through a neural network. Three complementary hypotheses have emerged from theoretical research, each illuminating a different aspect of how ICL works. We'll explore each in turn, building from intuition to mathematical formalization.

Hypothesis 1: Task Location

The first hypothesis reframes the question: perhaps ICL doesn't involve learning at all in the traditional sense. Instead, demonstrations might serve as a kind of address that locates pre-existing knowledge.

Consider what happens during pretraining. The model processes billions of tokens from diverse sources: Wikipedia articles, books, code repositories, forums, and countless other text formats. Within this ocean of data, the model implicitly encounters many task formulations:

  • Question-answer pairs embedded in FAQ pages and forums
  • Translation examples in multilingual documents
  • Sentiment expressions in reviews and social media
  • Factual statements in encyclopedic content
  • Code with natural language comments and docstrings

The key insight is that pretraining doesn't just teach the model to predict the next token. It teaches the model to recognize patterns and activate appropriate processing strategies. When you provide few-shot demonstrations, you're not teaching the model something new. Instead, you're helping it identify which of its existing capabilities to apply.

Think of it like a library with millions of books but no catalog. The model has all the knowledge, but it needs help finding the right book. Demonstrations serve as search queries that locate the relevant "section" of the model's capabilities.

Out[19]:
Visualization
Conceptual diagram showing task regions in parameter space activated by demonstrations.
Task location hypothesis visualization. The model's parameter space contains regions specialized for different tasks. Demonstrations activate the appropriate region, effectively selecting among existing capabilities rather than learning new ones.

Several lines of evidence support the task location hypothesis:

  • ICL works best for familiar tasks: Performance is highest on tasks similar to those encountered during pretraining, suggesting the model is retrieving existing capabilities rather than learning new ones.
  • Novel operations fail: Tasks requiring genuinely new logical operations that couldn't have been learned from text data tend to fail at ICL, even with many demonstrations.
  • Task descriptions can substitute for examples: Sometimes a natural language description of the task works as well as demonstrations, which makes sense if the goal is location rather than learning.

The task location view is intuitive, but it doesn't fully explain the mechanics. How does the model actually use demonstrations to activate the right capabilities? This leads us to our second hypothesis.

Hypothesis 2: Implicit Gradient Descent

While the task location hypothesis tells us what ICL accomplishes, the implicit gradient descent hypothesis explains how the transformer architecture actually implements this process. The key insight is that the attention mechanism, when viewed mathematically, performs operations that closely resemble gradient descent.

To understand this connection, we need to examine what happens during a single attention layer. When the model processes a prompt containing demonstrations followed by a test input, the attention mechanism allows information to flow from the demonstrations to the test input.

Consider the query input (the test case we want the model to complete). Before attention, it has some initial representation h_query^(0). The attention mechanism then updates this representation by aggregating information from the demonstrations:

h_{\text{query}} = h_{\text{query}}^{(0)} + \sum_{i=1}^{k} \alpha_i \cdot v_i

where:

  • h_query: the updated representation of the query input after attention
  • h_query^(0): the initial representation of the query input before this attention layer
  • α_i: the attention weight assigned to demonstration i, computed via softmax over query-key dot products
  • v_i: the value vector derived from demonstration i, encoding information about that example
  • k: the total number of demonstrations in the prompt

This formula has the same structure as a gradient descent update. In standard gradient descent, we update parameters by:

\theta \leftarrow \theta + \eta \cdot \nabla L

where θ represents the current parameters, η is the learning rate, and ∇L is the gradient of the loss. The analogy becomes clear when we make the following correspondences:

Correspondence between gradient descent and attention mechanism components.
Gradient Descent        | Attention Mechanism
Current parameters θ    | Initial representation h_query^(0)
Learning rate η         | Attention weights α_i
Gradient ∇L             | Value vectors v_i
Updated parameters      | Updated representation h_query

The attention weights α_i act as adaptive, example-specific learning rates. They automatically assign more weight to demonstrations that are more relevant (those with higher query-key similarity). This is actually more sophisticated than standard gradient descent, where the learning rate is typically fixed.

Research has substantiated this connection with increasingly strong results:

  1. Constructive proofs: Researchers have shown that transformers can be explicitly constructed to implement gradient descent algorithms.
  2. Linear attention equivalence: For linear attention layers (without the softmax nonlinearity), the forward pass is mathematically equivalent to one step of gradient descent on a regression problem.
  3. Multi-step optimization: With 96 layers like GPT-3, the model can implement many sequential "optimization steps," allowing for sophisticated adaptation to the demonstrations.
In[20]:
Code
def attention_as_gradient_step(
    query_embedding, demo_embeddings, demo_labels, learning_rate=0.1
):
    """
    Illustrate how attention over demonstrations resembles gradient descent.

    This simplified model shows the conceptual parallel between:
    - Attention weighted combination of demonstration values
    - Gradient-based update toward demonstration patterns

    Args:
        query_embedding: Vector representation of the query input
        demo_embeddings: Matrix of demonstration input embeddings
        demo_labels: Labels/outputs for each demonstration
        learning_rate: Step size for the gradient interpretation

    Returns:
        Updated representation after "learning" from demonstrations
    """
    # Compute attention scores (simplified: dot product similarity)
    attention_logits = np.dot(demo_embeddings, query_embedding)
    attention_weights = np.exp(attention_logits) / np.sum(
        np.exp(attention_logits)
    )

    # The "gradient" can be viewed as a weighted combination of
    # directions toward each demonstration
    gradient = np.zeros_like(query_embedding)
    for i, (demo_emb, weight) in enumerate(
        zip(demo_embeddings, attention_weights)
    ):
        # Direction from query toward demonstration
        direction = demo_emb - query_embedding
        gradient += weight * direction

    # Apply the "gradient update"
    updated = query_embedding + learning_rate * gradient

    return updated, attention_weights
Out[21]:
Console
Attention as Implicit Gradient Descent
==================================================

Query embedding: [1.00, 0.50, 0.20]

Demonstration attention weights:
  Demo 1 (Positive): 0.405
  Demo 2 (Negative): 0.175
  Demo 3 (Positive): 0.420

Updated embedding: [0.978, 0.501, 0.206]
Shift magnitude: 0.0227

The attention weights reveal how the model allocates "learning" across demonstrations. Demo 1 and Demo 3 (both Positive) receive higher weights because their embeddings are more similar to the query. Demo 2 (Negative) receives lower weight due to its dissimilar embedding. The updated representation shifts toward the weighted average of demonstrations, with the shift magnitude indicating how much the query representation changed.

This toy example captures the essence of the implicit gradient descent hypothesis: demonstrations don't just provide information, they actively reshape the query representation through the attention mechanism. Similar examples contribute more to this reshaping, just as training examples with larger gradients contribute more to parameter updates in standard learning.

Out[22]:
Visualization
Heatmap showing attention weights from query to each demonstration.
Attention pattern from test query to demonstrations. Higher attention weights (darker blue) indicate demonstrations that contribute more to updating the query representation. In this example, the model attends most strongly to Demo 1 and Demo 3, which are semantically similar to the query and share the same label (Positive).
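
The linear-attention equivalence mentioned earlier can also be checked numerically. In the minimal sketch below, one gradient step on a linear regression loss (starting from zero weights) yields exactly the same prediction for the query as a softmax-free attention layer whose keys are the demonstration inputs and whose values are the scaled labels; the dimensions and learning rate are arbitrary illustrative choices.

Code
import numpy as np

rng = np.random.default_rng(1)
k, d = 8, 5                      # demonstrations, embedding dimension
X = rng.normal(size=(k, d))      # demonstration inputs
y = rng.normal(size=k)           # demonstration targets
x_q = rng.normal(size=d)         # query input
eta = 0.1                        # learning rate

# One gradient descent step on L(w) = 0.5 * sum_i (w @ x_i - y_i)^2, starting from w = 0
grad_at_zero = -X.T @ y          # gradient of the loss evaluated at w = 0
w_one_step = -eta * grad_at_zero
pred_gd = w_one_step @ x_q

# Linear (softmax-free) attention: scores are raw dot products, values are eta * y_i
scores = X @ x_q
pred_attention = scores @ (eta * y)

print(pred_gd, pred_attention)   # identical up to floating point error
assert np.isclose(pred_gd, pred_attention)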

Hypothesis 3: Bayesian Inference

The first two hypotheses focus on mechanism: what capabilities exist (task location) and how attention implements adaptation (implicit gradient descent). Our third hypothesis takes a probabilistic perspective, asking: how does the model's uncertainty about the task change as it sees more demonstrations?

Imagine you're trying to guess what game someone is playing by watching them take actions. Each action provides evidence about which game it might be. Early on, many games are plausible. As you observe more actions, your beliefs concentrate on the correct game. This is exactly what Bayesian inference formalizes.

Applied to ICL, the model starts with a prior distribution over possible tasks, learned during pretraining. This prior reflects how often different task types appeared in the training data. When you provide demonstrations, each one updates this distribution according to Bayes' rule:

P(\text{task} \mid D_1, \ldots, D_k) \propto P(D_1, \ldots, D_k \mid \text{task}) \cdot P(\text{task})

where:

  • D_i = (x_i, y_i): the i-th demonstration, consisting of an input x_i and its corresponding output y_i
  • P(task): the prior probability of each task type, reflecting how often the model encountered similar patterns during pretraining
  • P(D_1, …, D_k | task): the likelihood of observing these specific demonstrations if the task is as hypothesized (demonstrations consistent with the task have high likelihood)
  • P(task | D_1, …, D_k): the posterior probability of each task after observing all k demonstrations
  • ∝: indicates proportionality (the left side equals the right side divided by a normalizing constant)

Let's trace through this formula step by step to build intuition; a toy numerical sketch follows the list:

  1. The prior P(task) encodes pretraining knowledge. If the model saw many sentiment classification examples during pretraining, sentiment analysis has a higher prior probability than, say, a specialized domain-specific task.

  2. The likelihood P(D_1, …, D_k | task) evaluates consistency. For each possible task interpretation, how likely is it that we would observe these specific demonstrations? If the demonstrations show text mapped to "Positive" and "Negative" labels, the likelihood is high under "sentiment classification" and low under "translation."

  3. The posterior concentrates probability. Multiplying prior and likelihood (and normalizing) yields a posterior that combines our initial beliefs with the evidence from demonstrations. As more demonstrations are added, the posterior becomes sharper, concentrating on the task interpretation that best explains all the evidence.
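
A toy numerical version of this update makes the sharpening concrete. The sketch below assumes three candidate task interpretations with made-up prior probabilities and a made-up per-demonstration likelihood, then applies Bayes' rule once per demonstration; none of these numbers come from an actual model.

Code
import numpy as np

tasks = ["sentiment", "topic classification", "translation"]
prior = np.array([0.40, 0.35, 0.25])  # hypothetical pretraining-induced prior

# Hypothetical likelihood of one sentiment-labeled demonstration under each task
likelihood_per_demo = np.array([0.8, 0.3, 0.05])

posterior = prior.copy()
for step in range(1, 5):
    posterior = posterior * likelihood_per_demo   # multiply in the new evidence
    posterior = posterior / posterior.sum()       # normalize to a distribution
    summary = ", ".join(f"{t}: {p:.3f}" for t, p in zip(tasks, posterior))
    print(f"after {step} demo(s): {summary}")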

This Bayesian view explains several empirical observations about ICL:

Out[23]:
Visualization
Bar chart showing uniform prior distribution over tasks.
Prior P(task): Uniform distribution before seeing any demonstrations.
Bar chart showing posterior after one demonstration.
After 1 Example: Posterior begins concentrating on sentiment classification.
Bar chart showing peaked posterior after four demonstrations.
After 4 Examples: High confidence in the correct task interpretation.

The visualization illustrates how the posterior evolves. Initially, probability is spread uniformly across task interpretations. After one demonstration showing an input mapped to "Positive," the posterior shifts toward sentiment classification. After four demonstrations consistently showing sentiment labels, the model has high confidence in the task identity.

Evidence supporting the Bayesian view comes from several empirical observations:

  • Inconsistent demonstrations hurt performance. If you mix correct and incorrect labels, the likelihood term is reduced for all task interpretations, leading to a diffuse posterior and poor predictions.
  • Ambiguous prompts yield uncertain outputs. When demonstrations could plausibly come from multiple tasks, the model's predictions are less confident, exactly as a diffuse posterior would predict.
  • More examples help even when redundant. Providing 16 similar examples rather than 4 can improve performance, because each example sharpens the posterior even if it doesn't add new information.

Synthesizing the Theories

These three hypotheses are not competing explanations but rather complementary perspectives on the same phenomenon. Each answers a different question:

The three ICL hypotheses address complementary questions about the phenomenon.
Hypothesis                 | Question Answered
Task Location              | What capabilities can ICL access?
Implicit Gradient Descent  | How does attention implement adaptation?
Bayesian Inference         | How does uncertainty reduce with evidence?

A unified view might work like this: During pretraining, the model develops specialized circuits for different tasks (task location). When demonstrations are provided, the attention mechanism adapts the query representation toward these demonstrations (implicit gradient descent), which has the effect of concentrating probability on task interpretations consistent with the evidence (Bayesian inference).

Recent theoretical work has made this unified view more concrete. Researchers have shown that transformers, through their architecture, learn to implement meta-learning algorithms during pretraining. When presented with demonstrations at inference time, they execute these algorithms to infer the task and generate appropriate predictions. The specific algorithm implemented may vary by layer, attention head, and task type, with different components of the network contributing different aspects of the adaptation process.

Improving ICL Performance

Researchers have developed numerous techniques to enhance in-context learning beyond simple few-shot prompting.

Chain-of-Thought Prompting

For reasoning tasks, showing intermediate steps significantly improves performance. Instead of just input-output pairs, demonstrations include the reasoning process:

In[24]:
Code
# Standard few-shot
standard_prompt = """
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 balls. How many tennis balls does he have now?
A: 11

Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many do they have?
A:"""

# Chain-of-thought few-shot
cot_prompt = """
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 balls. How many tennis balls does he have now?
A: Roger started with 5 balls. He bought 2 cans with 3 balls each, so 2 * 3 = 6 new balls. Total: 5 + 6 = 11.

Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many do they have?
A:"""
Out[25]:
Console
Standard Few-Shot:
----------------------------------------

Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 balls. How many tennis balls does he have now?
A: 11

Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many do they have?
A:

Chain-of-Thought Few-Shot:
----------------------------------------

Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 balls. How many tennis balls does he have now?
A: Roger started with 5 balls. He bought 2 cans with 3 balls each, so 2 * 3 = 6 new balls. Total: 5 + 6 = 11.

Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many do they have?
A:

Chain-of-thought (CoT) prompting guides the model to show its work. The reasoning trace helps the model:

  • Break complex problems into steps
  • Catch and correct errors mid-reasoning
  • Align intermediate computations with the final answer

CoT is particularly effective for arithmetic, logical reasoning, and multi-step problems where the direct answer is hard to compute in a single step.

Self-Consistency

Rather than generating a single answer, self-consistency samples multiple reasoning paths and aggregates them through majority voting. The procedure works as follows:

  1. Generate N different chain-of-thought responses using temperature T > 0 to introduce diversity
  2. Extract the final answer a_i from each response i
  3. Return the most common answer (majority vote)

Formally, the self-consistency prediction is:

\hat{a} = \arg\max_{a} \sum_{i=1}^{N} \mathbf{1}[a_i = a]

where:

  • â: the final predicted answer
  • N: the number of sampled reasoning paths
  • a_i: the answer extracted from the i-th reasoning path
  • 1[a_i = a]: an indicator function that equals 1 if answer a_i matches a, and 0 otherwise
  • arg max over a: selects the answer a that maximizes the vote count

This approach leverages the intuition that correct reasoning paths, while potentially different in details, converge on the same answer. Incorrect paths are more likely to diverge, producing different wrong answers that split the vote. With enough samples, the correct answer accumulates more votes than any single incorrect answer.
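
The aggregation step itself is a simple tally. A minimal sketch, assuming the final answers have already been parsed out of the sampled completions:

Code
from collections import Counter


def self_consistency_vote(sampled_answers):
    """Majority vote over answers extracted from sampled reasoning paths."""
    counts = Counter(sampled_answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes, counts


# Answers parsed from 10 hypothetical chain-of-thought samples
answers = ["42", "42", "38", "42", "42", "45", "42", "38", "42", "42"]
best, votes, counts = self_consistency_vote(answers)
print(f"Selected answer: {best} ({votes}/{len(answers)} votes), counts: {dict(counts)}")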

Out[26]:
Visualization
Diagram showing multiple reasoning paths converging through majority vote.
Self-consistency improves reliability by sampling multiple reasoning paths. While individual paths may contain errors, the majority vote typically converges on the correct answer. This technique is especially valuable for complex reasoning tasks.
Out[27]:
Visualization
Bar chart showing vote counts for different answer values from self-consistency sampling.
Self-consistency vote distribution from 10 sampled reasoning paths. The correct answer (42) receives the most votes despite individual reasoning errors in some paths. Wrong answers scatter across different values, diluting their vote share. This demonstrates how aggregation improves reliability.

Calibration and Confidence

ICL predictions can be overconfident or poorly calibrated. Several techniques address this (a calibration sketch follows the list):

  • Probability calibration: Adjust output probabilities using held-out validation data
  • Verbalizing uncertainty: Ask the model to express confidence in its answer
  • Null prompts: Compare predictions with and without demonstrations to detect when the model is guessing
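
As one concrete instance of probability calibration, contextual calibration rescales the model's label probabilities by its output on a content-free input (such as "N/A") placed in the same prompt. The sketch below shows only the rescaling arithmetic; obtaining the raw probabilities from a model, and the specific numbers used, are assumptions for illustration.

Code
import numpy as np


def contextual_calibration(label_probs, content_free_probs):
    """
    Rescale label probabilities by the model's bias on a content-free input.

    label_probs: probabilities assigned to each label for the real input
    content_free_probs: probabilities for the same labels on a content-free input
    """
    calibrated = np.asarray(label_probs) / np.asarray(content_free_probs)
    return calibrated / calibrated.sum()


# Hypothetical numbers: the prompt biases the model toward "Positive"
raw = [0.70, 0.30]        # P(Positive), P(Negative) on the real input
baseline = [0.65, 0.35]   # same probabilities when the input is "N/A"
print(contextual_calibration(raw, baseline))  # bias largely removed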

Retrieval-Augmented ICL

Instead of using a fixed set of demonstrations, retrieval-augmented approaches dynamically select examples for each test input:

  1. Embed the test input
  2. Retrieve similar examples from a large pool
  3. Use retrieved examples as demonstrations

This combines the benefits of similarity-based selection with the scalability of large example pools. The model effectively has access to thousands of potential demonstrations, selecting the most relevant ones at inference time.
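
Combining the retrieval step with the prompt construction shown earlier in the chapter, a minimal sketch might look like the following; it reuses select_similar_examples and create_icl_prompt defined above and assumes example_store is a large list of (input, output) pairs.

Code
def retrieval_augmented_prompt(test_input, example_store, embedder, k=8):
    """
    Build an ICL prompt by retrieving the k stored examples most similar
    to the test input, then formatting them as demonstrations.
    """
    demonstrations = select_similar_examples(
        test_input, example_store, embedder, k=k
    )
    return create_icl_prompt(
        task_description=None,
        examples=demonstrations,
        test_input=test_input,
    )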

Out[28]:
Visualization
Grouped bar chart comparing accuracy of random vs retrieval-based example selection.
Performance comparison of random vs. retrieval-based example selection across tasks. Retrieval-augmented ICL (using semantic similarity to select demonstrations) consistently outperforms random selection, with the largest gains on tasks where relevant examples significantly differ from irrelevant ones.

Limitations of In-Context Learning

ICL has fundamental limitations that practitioners should understand.

Context Length Constraints

The number of demonstrations is bounded by the model's context window. With 4,096 tokens, you might fit 10-20 examples depending on their length. This hard limit constrains what ICL can learn from a single prompt.

Long-context models (32K, 100K+ tokens) partially address this, but attention over very long contexts introduces its own challenges: slower inference, potential loss of focus, and quadratic memory scaling.

Out[29]:
Visualization
Dual-axis line plot showing accuracy and inference time vs number of demonstrations.
Context length trade-offs in ICL. More examples improve accuracy (solid blue), but inference time increases with context length (dashed orange). At some point, the marginal accuracy gain from additional examples no longer justifies the added latency, especially for real-time applications.

Sensitivity to Irrelevant Details

ICL is sensitive to aspects of the prompt that shouldn't matter:

  • Formatting: Changing delimiters or spacing affects predictions
  • Order: Shuffling examples changes accuracy
  • Wording: Synonymous task descriptions yield different results
  • Irrelevant tokens: Even a small amount of added noise degrades performance

This brittleness makes reliable deployment challenging. A prompt optimized on development data may fail unexpectedly on edge cases.

Out[30]:
Visualization
Bar chart showing accuracy drops from various prompt perturbations.
ICL sensitivity to prompt variations. Small changes that shouldn't affect the task definition can cause significant accuracy swings. This brittleness requires careful prompt engineering and validation.

No Permanent Learning

ICL provides only temporary task adaptation. Each new inference requires providing demonstrations again. The model doesn't retain any information between queries. For frequently repeated tasks, this overhead can be significant compared to a fine-tuned model that encodes the task permanently.

Limited Task Complexity

Some tasks are too complex to demonstrate in a few examples. Tasks requiring:

  • Extensive domain knowledge
  • Multi-step procedures with many steps
  • Precise output formats (structured data, code)
  • Consistency across a long document

often exceed what ICL can reliably achieve. Fine-tuning or more sophisticated architectures may be necessary.

Summary

In-context learning has changed how we adapt language models to new tasks. Rather than training specialized models, we demonstrate tasks through examples and let the model infer what to do.

Key takeaways:

  • ICL is weight-free learning: The model performs new tasks by conditioning on demonstrations, without any gradient updates. This enables rapid task switching and prototyping.

  • Example selection matters enormously: Choosing relevant, diverse, and balanced examples can swing performance by 20+ percentage points. Strategies like similarity-based selection and MMR help optimize demonstration quality.

  • ICL scales with model size: Small models show minimal ICL capability. The ability to learn from demonstrations emerges around 1-10 billion parameters and continues improving with scale.

  • Multiple theories explain ICL: Task location (activating existing capabilities), implicit gradient descent (attention as optimization), and Bayesian inference (updating task posteriors) offer complementary perspectives on why ICL works.

  • Techniques can enhance ICL: Chain-of-thought prompting, self-consistency, and retrieval augmentation extend what ICL can achieve, particularly for reasoning tasks.

  • Limitations persist: Context length constraints, prompt sensitivity, lack of permanent learning, and task complexity bounds constrain what ICL can reliably accomplish.

In-context learning has changed how NLP is practiced. Tasks that once required careful dataset curation and model training can now be prototyped in minutes with a few well-chosen examples. Understanding both its capabilities and limitations is important for effective application.

Key Parameters

When designing in-context learning prompts, several parameters significantly affect performance:

  • k (number of demonstrations): The count of input-output examples included in the prompt. Start with 4-8 examples for most tasks. Simple classification may need only 2-4, while complex reasoning benefits from 16-32. More examples generally help but with diminishing returns, and you're constrained by context length.

  • diversity_weight (λ): In MMR-based example selection, this parameter balances relevance (similarity to test input) versus diversity (dissimilarity to already-selected examples). Values of 0.3-0.5 typically work well. Higher values (0.5-0.7) prioritize coverage across the input space; lower values (0.1-0.3) prioritize relevance to the specific test case.

  • temperature: When using self-consistency or sampling multiple outputs, temperature controls randomness. Use 0.0 for deterministic outputs, 0.7-1.0 for diverse reasoning paths in self-consistency. Higher temperatures increase diversity but may reduce coherence.

  • N (self-consistency samples): The number of reasoning paths to sample before majority voting. Common values are 5-40. More samples improve reliability but increase cost linearly. For critical applications, 20-40 samples provide robust estimates; for rapid prototyping, 5-10 may suffice.

  • example ordering: Place the most representative or highest-quality examples at the end of the prompt (closest to the test input) due to recency effects. For classification, ensure the final 2-3 examples aren't all from the same class.

  • prompt format: The delimiter style, spacing, and structure of demonstrations. Consistent formatting across examples helps the model recognize the pattern. Common formats include "Input: X / Output: Y" or "Q: X / A: Y". Match the format to what the model likely saw during pretraining.
