Learn how nucleus sampling dynamically selects tokens based on cumulative probability, solving top-k limitations for coherent and creative text generation.

Nucleus Sampling
When GPT models generate text, they face a fundamental challenge: how do you sample from a probability distribution over thousands of possible next tokens in a way that produces coherent, creative, and diverse text? We've seen how temperature scaling adjusts the sharpness of the distribution and how top-k sampling restricts choices to the k most likely tokens. But top-k has a flaw: the number of reasonable next tokens varies dramatically depending on context. Sometimes only one or two tokens make sense; other times, dozens are equally valid.
Nucleus sampling, introduced by Holtzman et al. in 2020, solves this problem. Instead of fixing the number of candidate tokens, it dynamically selects the smallest set of tokens whose cumulative probability exceeds a threshold $p$. This adaptive approach captures the "nucleus" of the probability mass, keeping high-probability tokens while excluding the unreliable tail, regardless of how many tokens that requires.
The Problem with Fixed-k Sampling
Top-k sampling works by selecting the $k$ tokens with the highest probabilities and redistributing probability mass among them. This works well when the model's confidence is consistent, but language is anything but consistent.
Consider two scenarios:
- High certainty: The model predicts "The capital of France is ___" and assigns 95% probability to "Paris". Here, sampling from the top 50 tokens includes 49 tokens that collectively share only 5% of the probability mass, many of which would produce nonsensical completions.
- Low certainty: The model predicts "I had a wonderful ___" where "day", "time", "experience", "meal", "trip", and dozens of other tokens are all reasonable. With top-k=10, we might exclude perfectly valid continuations.
The core issue is that $k$ is a hyperparameter that cannot adapt to the context. What we really want is to keep all tokens that represent "reasonable" choices, and the most principled way to define "reasonable" is through probability mass.
The Top-p Formulation
The insight behind nucleus sampling is simple: instead of asking "how many tokens should I consider?", we ask "how much probability mass should I capture?" This shift in perspective leads to an adaptive algorithm that naturally handles both high-confidence and uncertain predictions.
From Intuition to Definition
Think about what we really want when sampling the next token. We want to include all tokens that have a reasonable chance of being correct, and exclude all tokens that are essentially noise. But "reasonable" depends on context. After "The sky is", the token "blue" might have 40% probability, meaning we should consider other options. After "2 + 2 =", the token "4" might have 99% probability, meaning alternatives are probably mistakes.
The key insight is that probability itself tells us what's reasonable. If we capture 90% of the total probability mass, we've included essentially all the tokens the model considers plausible. Everything in the remaining 10% is, by definition, something the model thinks is unlikely.
This leads us to define the nucleus as the smallest set of highest-probability tokens that together account for at least a fraction $p$ of the total probability. Formally, given a probability distribution $P(x \mid x_{1:t-1})$ over vocabulary $V$, the nucleus $V^{(p)}$ is the minimal set such that:

$$\sum_{x \in V^{(p)}} P(x \mid x_{1:t-1}) \geq p$$

where:
- $V^{(p)}$: the nucleus, the minimal set of highest-probability tokens we'll sample from
- $P(x \mid x_{1:t-1})$: the probability the model assigns to token $x$ given the preceding context
- $p$: the cumulative probability threshold, typically set between 0.9 and 0.95

The nucleus is the minimal set of highest-probability tokens whose cumulative probability mass meets or exceeds the threshold $p$. Tokens outside the nucleus are discarded, and the remaining probabilities are renormalized.
The Algorithm Step by Step
How do we actually find this minimal set? The algorithm is straightforward once you see the logic:
- Sort tokens by probability in descending order. This puts the most likely tokens first.
- Walk through the sorted list, accumulating probabilities. Keep adding tokens until your running sum reaches or exceeds $p$.
- Stop as soon as you cross the threshold. The tokens you've accumulated form the nucleus. Discard everything else.
- Renormalize so that the remaining probabilities sum to 1, giving you a valid distribution to sample from.
Let's express this mathematically. After sorting, we have tokens ordered so that $P(x_{(1)}) \geq P(x_{(2)}) \geq \cdots \geq P(x_{(|V|)})$, where $|V|$ is the vocabulary size. We find the smallest $k^*$ such that:

$$\sum_{i=1}^{k^*} P(x_{(i)}) \geq p$$

where:
- $x_{(i)}$: the token with the $i$-th highest probability
- $k^*$: the number of tokens in the nucleus (determined dynamically based on the distribution)
- $|V|$: the total vocabulary size

The nucleus is then $V^{(p)} = \{x_{(1)}, x_{(2)}, \ldots, x_{(k^*)}\}$, containing exactly the top $k^*$ tokens needed to reach the probability threshold. The key property of this formulation is that $k^*$ emerges from the distribution itself. A peaked distribution yields a small $k^*$; a flat distribution yields a large $k^*$.
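To make the definition concrete, here is a minimal NumPy sketch that finds $k^*$ for a given threshold. The helper name `nucleus_size` and the two example distributions are illustrative assumptions, not part of the original formulation.

```python
import numpy as np

def nucleus_size(probs: np.ndarray, p: float) -> int:
    """Return k*, the number of highest-probability tokens whose
    cumulative mass first reaches the threshold p."""
    sorted_probs = np.sort(probs)[::-1]               # most likely tokens first
    cumulative = np.cumsum(sorted_probs)              # running probability mass
    # Small tolerance guards against floating-point rounding at the boundary.
    return int(np.argmax(cumulative >= p - 1e-12)) + 1

# Illustrative distributions: one confident, one spread out.
peaked = np.array([0.90, 0.04, 0.03, 0.02, 0.01])
flat = np.array([0.25, 0.20, 0.20, 0.15, 0.12, 0.08])

print(nucleus_size(peaked, p=0.9))   # 1: a single token already carries 90% of the mass
print(nucleus_size(flat, p=0.9))     # 5: several tokens are needed to reach 90%
```

The same threshold yields a nucleus of one token in the first case and five in the second, which is exactly the adaptivity the formula describes.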
Why Renormalization Matters
After truncating to the nucleus, the probabilities no longer sum to 1. If our nucleus contains 90% of the probability mass, the probabilities inside it sum to 0.90, not 1.0. To sample correctly, we need to renormalize.
The renormalized probability for each token in the nucleus is simply its original probability divided by the total probability mass in the nucleus:

$$P'(x) = \frac{P(x)}{\sum_{x' \in V^{(p)}} P(x')}$$

where:
- $P'(x)$: the renormalized probability used for sampling
- $P(x)$: the original probability of token $x$
- $\sum_{x' \in V^{(p)}} P(x')$: the sum of original probabilities over all tokens in the nucleus, which serves as the normalizing constant
This renormalization preserves the relative ordering of tokens. If "nice" was twice as likely as "good" before, it remains twice as likely after. We're simply scaling everything up so the probabilities form a valid distribution that sums to 1.
A Worked Example
Abstract formulas come alive with concrete numbers. Let's trace through nucleus sampling step by step with a realistic example.
Setting Up the Problem
Suppose a language model predicts the next token after "The weather is" and produces the following probability distribution:
| Token | Probability |
|---|---|
| nice | 0.35 |
| good | 0.25 |
| bad | 0.15 |
| great | 0.10 |
| terrible | 0.05 |
| wonderful | 0.04 |
| cold | 0.03 |
| hot | 0.02 |
| (other tokens) | 0.01 |
This distribution is already sorted by probability. The model strongly favors positive weather descriptions, with "nice" and "good" together accounting for 60% of the mass.
Finding the Nucleus
With $p = 0.9$, we walk through the tokens from highest to lowest probability, keeping a running sum:
| Step | Token | Probability | Cumulative Sum | In Nucleus? |
|---|---|---|---|---|
| 1 | nice | 0.35 | 0.35 | ✓ |
| 2 | good | 0.25 | 0.60 | ✓ |
| 3 | bad | 0.15 | 0.75 | ✓ |
| 4 | great | 0.10 | 0.85 | ✓ |
| 5 | terrible | 0.05 | 0.90 | ✓ (threshold reached) |
| 6 | wonderful | 0.04 | — | ✗ (excluded) |
| 7 | cold | 0.03 | — | ✗ (excluded) |
At step 5, our cumulative sum hits exactly 0.90, meeting the threshold. We stop here. The nucleus is {"nice", "good", "bad", "great", "terrible"}, containing 5 tokens.
Notice what happened: we didn't need to decide in advance how many tokens to include. The distribution itself determined that 5 tokens were needed to capture 90% of the probability mass.
Renormalizing for Sampling
The five tokens in our nucleus have probabilities summing to 0.90. To sample from a valid probability distribution, we divide each by this sum:
| Token | Original | Calculation | Renormalized |
|---|---|---|---|
| nice | 0.35 | 0.35 / 0.90 | 0.389 |
| good | 0.25 | 0.25 / 0.90 | 0.278 |
| bad | 0.15 | 0.15 / 0.90 | 0.167 |
| great | 0.10 | 0.10 / 0.90 | 0.111 |
| terrible | 0.05 | 0.05 / 0.90 | 0.056 |
You can verify: $0.389 + 0.278 + 0.167 + 0.111 + 0.056 = 1.001 \approx 1$ (the small discrepancy is rounding). We now have a valid probability distribution over just five tokens.
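As a quick check on the arithmetic, a few lines of Python (reusing the example values above) reproduce the renormalized column:

```python
nucleus = {"nice": 0.35, "good": 0.25, "bad": 0.15, "great": 0.10, "terrible": 0.05}

total = sum(nucleus.values())                        # 0.90, the mass captured by the nucleus
renormalized = {token: prob / total for token, prob in nucleus.items()}

for token, prob in renormalized.items():
    print(f"{token:<9} {prob:.3f}")                  # nice 0.389, good 0.278, ...
print(f"sum = {sum(renormalized.values()):.3f}")     # 1.000
```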
The left panel shows how cumulative probability grows as we add tokens. The curve rises steeply at first (the top tokens contribute most of the mass) then flattens. We stop when we cross $p = 0.9$. The right panel shows how renormalization scales up each probability proportionally, preserving relative rankings while creating a valid distribution.
Comparing to Top-k
In this particular case, top-k=5 would include the same tokens. But consider what happens if the model were 90% confident in a single token. Say "Paris" has 0.90 probability and everything else shares the remaining 0.10.
With nucleus sampling at $p = 0.9$, we'd include just "Paris" (0.90 meets the threshold immediately). With top-k=5, we'd include "Paris" plus four tokens that collectively have only 10% probability. Those four tokens are noise that nucleus sampling correctly excludes.
This is the adaptive behavior that makes nucleus sampling effective: it contracts when the model is confident and expands when the model is uncertain.
Implementation
Now that we understand the algorithm conceptually, let's implement it. We'll build nucleus sampling from scratch, verify it works correctly, then see how to use the production-ready version in Hugging Face's transformers library.
Building the Core Algorithm
The implementation follows our algorithm directly. We'll work with logits (the raw model outputs before softmax), apply optional temperature scaling, convert to probabilities, then perform the nucleus truncation.
One subtlety deserves attention: the shift operation in the cutoff logic. When we compute cumulative_probs > p, the first position where this is True is the token that crosses the threshold. But we want to include that token in the nucleus, not exclude it. The shift ensures we remove tokens after the one that crosses the threshold, keeping the nucleus at exactly the right size.
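A minimal PyTorch sketch of this logic, with the shift handled as described above, might look as follows. The function name `nucleus_sample` and the single-sequence (1-D logits) assumption are illustrative choices, not a definitive implementation.

```python
import torch
import torch.nn.functional as F

def nucleus_sample(logits: torch.Tensor, p: float = 0.9, temperature: float = 1.0) -> int:
    """Sample one token id from a 1-D logits vector using nucleus (top-p) sampling."""
    probs = F.softmax(logits / temperature, dim=-1)           # temperature, then softmax
    sorted_probs, sorted_indices = torch.sort(probs, descending=True)
    cumulative_probs = torch.cumsum(sorted_probs, dim=-1)

    # Mark tokens that fall outside the nucleus ...
    sorted_to_remove = cumulative_probs > p
    # ... but shift right so the token that crosses the threshold stays inside.
    sorted_to_remove[1:] = sorted_to_remove[:-1].clone()
    sorted_to_remove[0] = False

    # Zero out the tail and renormalize the surviving probabilities.
    sorted_probs[sorted_to_remove] = 0.0
    sorted_probs = sorted_probs / sorted_probs.sum()

    # Sample in the sorted space, then map back to the original vocabulary index.
    choice = torch.multinomial(sorted_probs, num_samples=1)
    return int(sorted_indices[choice])
```

In production code you would typically operate on batched logits and keep everything on the GPU, but the single-sequence version keeps the shift logic easy to follow.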
Verifying Our Implementation
Let's confirm our implementation produces the expected behavior using the "weather" example:
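A sketch of such a check, assuming the `nucleus_sample` function from the previous snippet is in scope and converting the example probabilities back to logits:

```python
from collections import Counter
import torch

tokens = ["nice", "good", "bad", "great", "terrible", "wonderful", "cold", "hot"]
probs = torch.tensor([0.35, 0.25, 0.15, 0.10, 0.05, 0.04, 0.03, 0.02])
logits = torch.log(probs)      # softmax of log-probabilities recovers the distribution

counts = Counter()
for _ in range(10_000):
    counts[tokens[nucleus_sample(logits, p=0.9)]] += 1

for token, n in counts.most_common():
    print(f"{token:<9} {n / 10_000:.3f}")
```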
The samples concentrate on the five nucleus tokens, with "wonderful" appearing rarely or never. The empirical frequencies should approximate our renormalized probabilities: "nice" around 39%, "good" around 28%, and so on.
Visualizing Adaptive Nucleus Size
The real insight comes from seeing how nucleus size varies with $p$. Let's visualize this:
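A small sketch computes the underlying numbers for the example distribution; the particular thresholds swept here are an arbitrary choice.

```python
import numpy as np

probs = np.array([0.35, 0.25, 0.15, 0.10, 0.05, 0.04, 0.03, 0.02, 0.01])
cumulative = np.cumsum(np.sort(probs)[::-1])         # running mass, highest probability first

for p in [0.3, 0.5, 0.7, 0.8, 0.9, 0.95, 0.99]:
    # Tolerance guards against floating-point rounding right at the boundary.
    size = int(np.argmax(cumulative >= p - 1e-12)) + 1
    print(f"p = {p:.2f} -> nucleus contains {size} token(s)")
```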
With our example distribution, a low threshold such as $p = 0.3$ requires only 1 token (just "nice" at 35%), while a high threshold such as $p = 0.95$ needs 6 or more. The relationship is non-linear because probability mass concentrates in the top tokens: each step up in $p$ typically adds only a single token, and each additional token contributes less marginal probability than the one before.
The Adaptive Advantage: Peaked vs. Flat Distributions
The advantage of nucleus sampling becomes clear when we compare it to top-k across different distributional shapes. Consider two scenarios:
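Before looking at the comparison, here is a quick numeric sketch with two hand-picked distributions (the exact probabilities are illustrative assumptions) that shows the same contrast:

```python
import numpy as np

def nucleus_size(probs: np.ndarray, p: float = 0.9) -> int:
    """Smallest number of top tokens whose cumulative mass reaches p."""
    cumulative = np.cumsum(np.sort(probs)[::-1])
    return int(np.argmax(cumulative >= p - 1e-12)) + 1

peaked = np.array([0.75, 0.18, 0.03, 0.02, 0.01, 0.005, 0.005])                  # confident model
flat = np.array([0.14, 0.13, 0.12, 0.12, 0.11, 0.10, 0.10, 0.09, 0.05, 0.04])    # uncertain model

for name, dist in [("peaked", peaked), ("flat", flat)]:
    print(f"{name:>6}: top-p (p=0.9) keeps {nucleus_size(dist)} tokens; top-k (k=5) always keeps 5")
```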
These visualizations show why nucleus sampling outperforms top-k. When the model is confident (peaked distribution), nucleus sampling automatically tightens to just 2 tokens, while top-k=5 wastefully includes 3 near-zero tokens. When the model is uncertain (flat distribution), nucleus sampling expands to 8 tokens to capture the spread probability mass, while top-k=5 arbitrarily excludes 3 reasonable options.
Top-k makes the same decision regardless of context. Nucleus sampling reads the distribution and responds appropriately.
Production Use with Hugging Face Transformers
In practice, you won't implement nucleus sampling yourself. The transformers library provides it out of the box. Here's how to generate text with GPT-2 using top-p:
The generate() method accepts top_p directly. Set do_sample=True to enable sampling, and top_k=0 to disable top-k filtering so nucleus sampling operates alone:
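The call might look like the following sketch; the prompt, generation length, and seed are arbitrary choices, and pad_token_id is set explicitly because GPT-2 has no padding token.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The weather is", return_tensors="pt")

torch.manual_seed(42)                       # reproducible sampling for the example
outputs = model.generate(
    **inputs,
    do_sample=True,                         # enable sampling instead of greedy decoding
    top_p=0.9,                              # nucleus threshold
    top_k=0,                                # disable top-k so top-p acts alone
    max_new_tokens=30,
    num_return_sequences=3,
    pad_token_id=tokenizer.eos_token_id,    # avoid a warning; GPT-2 has no pad token
)

for seq in outputs:
    print(tokenizer.decode(seq, skip_special_tokens=True))
```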
Each completion takes a different path while remaining coherent. At every generation step, nucleus sampling includes only tokens the model considers plausible, allowing for creativity without introducing obvious errors.
Choosing the Right p Value
The probability threshold $p$ controls the trade-off between creativity and coherence. Here are practical guidelines for selecting $p$:
- $p = 0.9$ to $0.95$: General-purpose creative writing. This is the most common setting and works well for story generation, dialogue, and open-ended tasks. The nucleus is large enough for variety but excludes the unreliable tail.
- $p = 0.7$ to $0.9$: More focused generation. Useful when you want coherent output with moderate diversity, such as paraphrasing or generating multiple options for the same intent.
- Below $p = 0.7$: Very focused generation. Approaches greedy decoding behavior. Use for tasks requiring high precision, like code completion or factual responses.
- $p$ near 1.0: Very permissive. Rarely used alone because it includes many low-probability tokens. Can produce surprising but occasionally incoherent outputs.
Let's visualize how $p$ affects generation diversity:
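One way to see the effect for yourself is to sweep the threshold while holding everything else fixed. A sketch, where the prompt, seed, and thresholds are arbitrary choices:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("The old lighthouse keeper", return_tensors="pt")

for top_p in [0.5, 0.9, 1.0]:
    torch.manual_seed(0)                             # same seed, so only top_p changes
    output = model.generate(
        **inputs, do_sample=True, top_p=top_p, top_k=0,
        max_new_tokens=25, pad_token_id=tokenizer.eos_token_id,
    )
    print(f"top_p={top_p}: {tokenizer.decode(output[0], skip_special_tokens=True)}")
```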
Combining Nucleus Sampling with Temperature
Nucleus sampling and temperature scaling address different aspects of the probability distribution. Temperature reshapes the distribution, making it more peaked or more uniform. Nucleus sampling then truncates based on the reshaped distribution.
The visualization above shows how temperature and nucleus size interact. At low temperature (T=0.5), probability concentrates heavily in the top token, so the nucleus contains just 1-2 tokens. At high temperature (T=1.5), probability spreads more evenly, requiring 5+ tokens to capture 90% of the mass. This interaction lets you fine-tune generation behavior: temperature controls the shape of the distribution, while the nucleus threshold determines how much of that reshaped distribution you sample from.
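To see the interaction numerically, here is a sketch with an illustrative logits vector (the values are made up for demonstration) that computes the nucleus size at $p = 0.9$ for several temperatures:

```python
import numpy as np

logits = np.array([4.0, 3.0, 2.5, 2.0, 1.5, 1.0, 0.5, 0.0])   # illustrative raw scores

for T in [0.5, 1.0, 1.5]:
    scaled = logits / T                        # temperature reshapes the distribution
    probs = np.exp(scaled - scaled.max())      # numerically stable softmax
    probs = probs / probs.sum()
    cumulative = np.cumsum(np.sort(probs)[::-1])
    size = int(np.argmax(cumulative >= 0.9)) + 1
    print(f"T = {T}: nucleus size at p = 0.9 is {size}")
```

Lower temperatures shrink the nucleus and higher temperatures grow it, which is exactly the interaction described above.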
Using both together gives you fine-grained control:
- Low temperature + high p: Coherent output with some variety. The temperature concentrates probability mass, while high p still allows sampling from multiple tokens.
- High temperature + low p: Unusual but still controlled. Temperature spreads probability, but low p keeps only the tokens that remain relatively high after spreading.
- High temperature + high p: Maximum creativity. Use with caution as outputs can become incoherent.
The focused outputs with lower temperature tend to stay closer to common phrasings and high-probability continuations. The creative outputs with higher temperature explore more unusual word choices and directions, though they may occasionally produce less coherent passages. Finding the right balance depends on your application: conversational AI typically benefits from moderate settings, while creative writing tools can push toward higher values.
Nucleus Sampling vs. Top-k: When to Use Which
Both methods truncate the vocabulary, but they do so with different philosophies:
| Aspect | Top-k | Nucleus (Top-p) |
|---|---|---|
| Truncation criterion | Fixed count | Probability mass |
| Adapts to context | No | Yes |
| Parameter intuition | "Keep this many options" | "Keep this much probability" |
| Risk of over-truncation | Yes (when few tokens dominate) | No |
| Risk of under-truncation | Yes (when probability spreads) | No |
| Compute cost | Slightly lower (no cumsum) | Slightly higher |
In practice, nucleus sampling is preferred for creative text generation because of its adaptive behavior. Top-k remains useful when you want explicit control over the number of options or when slight computational savings matter at scale.
Some systems use both: first apply top-k as a coarse filter (e.g., k=100), then apply nucleus sampling within that set. This combines the efficiency of top-k with the adaptivity of top-p.
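With transformers, passing both parameters to generate() applies exactly this hybrid: top-k acts as the coarse filter and top-p then truncates within the surviving tokens. A brief sketch (prompt and length arbitrary):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("Once upon a time", return_tensors="pt")

# top_k filters to the 100 most likely tokens; top_p then keeps the nucleus within them.
output = model.generate(
    **inputs, do_sample=True,
    top_k=100, top_p=0.9,
    max_new_tokens=30, pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```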
Limitations and Impact
Nucleus sampling represented a significant advance in text generation quality, but it comes with limitations worth understanding.
The choice of $p$ remains a hyperparameter that requires tuning. While values around 0.9 work well in many settings, different tasks and domains may benefit from different thresholds. There's no universal value that works optimally across all contexts, and the "right" $p$ can even vary within a single generation as the model moves through different parts of a sequence. For instance, the beginning of a story might benefit from a higher $p$ to establish creative premises, while later passages benefit from a lower $p$ to maintain consistency.
Nucleus sampling also doesn't address all failure modes of autoregressive generation. Repetition loops, factual errors, and incoherent long-range structure can still occur. The method operates on local token probabilities and has no mechanism for enforcing global coherence or factual accuracy. Modern systems often combine nucleus sampling with additional techniques like repetition penalties, contrastive search, or post-hoc filtering.
The cumulative probability computation adds modest overhead compared to simpler methods like pure sampling or greedy decoding. For most applications this is negligible, but in latency-critical systems processing millions of requests, it can add up. Optimized implementations pre-sort once and reuse the sorted indices.
Despite these limitations, nucleus sampling's impact on practical text generation has been substantial. It became the default sampling method in many popular language model interfaces and APIs. The insight that truncation should adapt to distributional shape rather than use a fixed count influenced subsequent work on decoding strategies. Nucleus sampling also gave practitioners a principled way to balance coherence and creativity that was previously achieved only through extensive trial-and-error with temperature and top-k.
Key Parameters
When using nucleus sampling in Hugging Face's generate() method, the following parameters control the sampling behavior:
- `top_p` (float, 0.0 to 1.0): The cumulative probability threshold. Only tokens whose cumulative probability mass is within this threshold are considered. Higher values include more tokens, increasing diversity. Typical range: 0.9 to 0.95 for creative tasks, 0.7 to 0.9 for more focused generation.
- `temperature` (float, > 0): Scales the logits before applying softmax. Values below 1.0 sharpen the distribution (more deterministic), values above 1.0 flatten it (more random). Applied before nucleus truncation.
- `top_k` (int): When using nucleus sampling alone, set to 0 to disable top-k filtering. Can be combined with top-p for hybrid filtering.
- `do_sample` (bool): Must be set to `True` to enable any sampling strategy. When `False`, the model uses greedy decoding.
- `num_return_sequences` (int): Number of independent completions to generate. Useful for generating multiple diverse options.
Summary
Nucleus sampling addresses a fundamental limitation of top-k sampling by adapting the number of candidate tokens to the probability distribution at each step. Rather than asking "how many tokens should I consider?", nucleus sampling asks "how much probability mass should I keep?", and this reframing produces more consistent generation quality across varying contexts.
The key ideas are:
- Cumulative probability threshold: The nucleus is the smallest token set whose probabilities sum to at least $p$
- Adaptive truncation: High-confidence predictions yield small nuclei; uncertain predictions yield large nuclei
- Typical values: $p = 0.9$ to $0.95$ for general creative generation
- Combines with temperature: Temperature reshapes the distribution; nucleus sampling truncates it
- Practical default: Nucleus sampling is the most common choice for open-ended generation in modern systems
When implementing text generation, nucleus sampling should be your default starting point. Adjust based on your tolerance for creativity versus coherence, and combine with temperature scaling for finer control over the output distribution.
Reference

Holtzman, A., Buys, J., Du, L., Forbes, M., & Choi, Y. (2020). The Curious Case of Neural Text Degeneration. International Conference on Learning Representations (ICLR).
About the author: Michael Brenndoerfer
All opinions expressed here are my own and do not reflect the views of my employer.
Michael currently works as an Associate Director of Data Science at EQT Partners in Singapore, leading AI and data initiatives across private capital investments.
With over a decade of experience spanning private equity, management consulting, and software engineering, he specializes in building and scaling analytics capabilities from the ground up. He has published research in leading AI conferences and holds expertise in machine learning, natural language processing, and value creation through data.