Learn how top-k sampling truncates vocabulary to the k most probable tokens, eliminating incoherent outputs while preserving diversity in language model generation.

This article is part of the free-to-read Language AI Handbook
Top-k Sampling
When a language model generates text, it produces a probability distribution over tens of thousands of possible next tokens. Greedy decoding picks the single most likely token at each step, which leads to repetitive and predictable output. Raising the temperature flattens the distribution, but it boosts the probability of unlikely tokens, including nonsensical ones. Top-k sampling offers a different solution: keep only the k most probable tokens and sample from this truncated distribution.
The core idea is straightforward. Rather than considering all 50,000+ tokens in the vocabulary, top-k sampling zeros out the probability of everything outside the top candidates. This eliminates the long tail of unlikely tokens while preserving meaningful diversity among the plausible choices. The result is text that reads naturally without veering into incoherence.
This chapter explores top-k sampling in detail. We'll examine the mathematics behind truncation, implement the algorithm from scratch, discuss how to select appropriate values of k, and combine top-k with temperature for fine-grained control over generation quality.
The Problem with Full Vocabulary Sampling
Before diving into top-k, let's understand why sampling from the full vocabulary distribution causes problems. A trained language model assigns non-zero probability to every token, even ones that make no sense in context.
Most of the vocabulary mass sits in the long tail of tokens with tiny probabilities. While each token individually has near-zero chance of selection, collectively they can represent a substantial fraction of the total probability. Sampling from this full distribution occasionally produces bizarre tokens that derail generation.
The figure illustrates why the long tail is problematic. After the first few tokens, probability drops rapidly, but thousands of tokens retain some small probability. Summed together, this tail can be significant. If we sample proportionally, low-probability tokens occasionally win, producing outputs like "The capital of France is ????????" or "The capital of France is asdf".
How Top-k Sampling Works
Now that we understand the problem, let's build up the solution piece by piece. The core insight behind top-k sampling is simple: if most of the probability mass concentrates in a handful of tokens, why not just sample from those and ignore the rest?
The Intuition: Focus on What Matters
Think about how you complete sentences. Given "The capital of France is", you don't mentally consider every possible word in English. You immediately focus on a small set of plausible continuations: "Paris", maybe "a", "the", or "known". Your mental probability distribution isn't uniform across all words, and neither is the model's. Top-k sampling formalizes this intuition by explicitly restricting attention to the most likely candidates.
The approach works in five steps:
- Compute probabilities: The model outputs a distribution over all vocabulary tokens
- Rank by likelihood: Sort tokens from most to least probable
- Keep the top k: Select only the highest-probability tokens
- Zero out the rest: Set all other token probabilities to exactly zero
- Renormalize and sample: Rescale the remaining probabilities so they sum to 1, then sample
Top-k sampling restricts the sampling space to the k most probable tokens. After zeroing out tokens outside the top-k, the remaining probabilities are renormalized before sampling.
From Intuition to Mathematics
Let's formalize this process. At generation step $t$, the model has seen the context $x_{<t}$ (all previous tokens) and outputs a probability $P(x_t = v \mid x_{<t})$ for each possible next token $v$. This distribution spans the entire vocabulary, but we only want to sample from the top performers.
First, we need to identify which tokens make the cut. Define $V_k$ as the set containing the $k$ tokens with highest probability. If the vocabulary has 50,000 tokens and $k = 50$, then $V_k$ contains exactly the 50 most likely tokens for this specific context.
Now we need to handle a subtle issue: after removing tokens from consideration, the probabilities of the remaining tokens no longer sum to 1. A probability distribution must sum to 1 for sampling to work correctly. The solution is renormalization. We divide each remaining probability by the sum of all kept probabilities.
This gives us the formal definition of top-k sampling:

$$
P_{\text{top-}k}(x_t = v \mid x_{<t}) =
\begin{cases}
\dfrac{P(x_t = v \mid x_{<t})}{Z} & \text{if } v \in V_k \\[4pt]
0 & \text{otherwise}
\end{cases}
\qquad \text{with } Z = \sum_{v' \in V_k} P(x_t = v' \mid x_{<t})
$$

where:
- $v$: a candidate token at position $t$
- $x_{<t}$: the context (all tokens before position $t$)
- $P(x_t = v \mid x_{<t})$: the original probability the model assigns to token $v$ given the context
- $V_k$: the set of $k$ tokens with highest probability under $P(x_t = v \mid x_{<t})$
- $Z$: the normalization constant, which equals the sum of probabilities of the top-$k$ tokens
- $P_{\text{top-}k}(x_t = v \mid x_{<t})$: the renormalized probability used for sampling
Why Renormalization Preserves Relative Probabilities
A critical property of this construction is that renormalization preserves the relative likelihood of tokens within the top-$k$ set. Consider two tokens, "Paris" with original probability 0.30 and "the" with probability 0.15. In the original distribution, "Paris" is exactly twice as likely as "the".
After truncation, suppose the top-$k$ tokens have cumulative probability $Z$. The renormalized probabilities become:
- "Paris": $0.30 / Z$
- "the": $0.15 / Z$
Notice that $(0.30 / Z) \div (0.15 / Z) = 2$. The ratio is preserved! This happens because we divide both probabilities by the same constant $Z$. Renormalization scales all probabilities equally, maintaining their relative ordering and ratios.
This property matters because it means top-k sampling respects the model's preferences among plausible tokens. We're not arbitrarily reweighting tokens; we're simply restricting which tokens can be sampled while honoring the model's ranking within that restricted set.
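To see this numerically, here is a small sketch in plain PyTorch (the toy probabilities are made up for illustration) that truncates a distribution to the top k tokens, renormalizes, and checks that the 2:1 ratio between the two leading tokens survives:

```python
import torch

# Toy next-token distribution over an 8-token vocabulary (made-up values)
probs = torch.tensor([0.30, 0.15, 0.12, 0.10, 0.08, 0.10, 0.09, 0.06])
k = 4

# Keep the k most probable tokens and renormalize so they sum to 1
top_probs, top_indices = torch.topk(probs, k)
renormalized = top_probs / top_probs.sum()

print("kept token indices:", top_indices.tolist())
print("renormalized probabilities:", [round(p, 3) for p in renormalized.tolist()])

# The ratio between the two leading tokens is unchanged by renormalization
print("original ratio:    ", (probs[0] / probs[1]).item())                # ~2.0
print("renormalized ratio:", (renormalized[0] / renormalized[1]).item())  # ~2.0
```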
Visualizing the Truncation
The following visualization shows this process concretely. On the left, we see the original distribution over the top 20 tokens. On the right, we see the truncated distribution after keeping only the top-$k$ tokens and renormalizing.
Implementing Top-k Sampling
With the mathematics established, let's translate the algorithm into code. The implementation is straightforward, but walking through it step by step reveals how each line corresponds to a piece of the formula.
Building the Core Sampler
Our implementation needs to accomplish four things:
- Apply temperature scaling to the logits
- Find the k highest-scoring tokens
- Convert those scores to a valid probability distribution
- Sample one token from that distribution
The key insight is that we don't need to explicitly zero out the non-top-k tokens. Instead, we can compute softmax over only the top-k logits, which automatically gives us a properly normalized distribution over just those tokens.
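Here is a minimal sketch of such a sampler in PyTorch. The function name `top_k_sample` and its defaults are our own choices; it operates on the raw next-token logits a model produces.

```python
import torch
import torch.nn.functional as F

def top_k_sample(logits: torch.Tensor, k: int = 50, temperature: float = 1.0) -> int:
    """Sample one token ID from the top-k truncated distribution.

    logits: 1D tensor of raw scores over the vocabulary for the next token.
    """
    # 1. Temperature scaling: divide every logit by the temperature
    scaled = logits / temperature

    # 2. Keep only the k highest-scoring logits (their values and vocab positions)
    top_values, top_indices = torch.topk(scaled, k)

    # 3. Softmax over just the top-k logits -- equivalent to zeroing out the
    #    rest of the vocabulary and renormalizing the kept probabilities
    probs = F.softmax(top_values, dim=-1)

    # 4. Sample a position within the top-k list
    sampled_pos = torch.multinomial(probs, num_samples=1)

    # 5. Map that position back to the actual vocabulary token ID
    return top_indices[sampled_pos].item()
```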
Let's trace through each step:
- Temperature scaling divides each logit by the temperature, controlling how peaked or flat the distribution becomes
- `torch.topk` efficiently finds the k largest values and their positions in the vocabulary, returning both the values and their indices
- Softmax over top-k values computes the probabilities using only the kept logits, which is equivalent to the renormalization step in our formula
- `torch.multinomial` samples according to the probability distribution, returning an index into our top-k list
- Index mapping converts the sampled position (0 to k-1) back to the actual vocabulary token ID
Seeing It in Action
Let's sample multiple times from the same context to see the variety top-k produces:
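As an illustrative setup (assuming GPT-2 via Hugging Face Transformers, and reusing the `top_k_sample` function defined above), we can compute the next-token logits for a prompt once and draw several samples with k = 5:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "The capital of France is"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(input_ids).logits[0, -1]   # logits for the next token

# Draw several samples from the same truncated distribution
for _ in range(5):
    token_id = top_k_sample(logits, k=5)
    print(repr(tokenizer.decode(token_id)))
```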
Notice how all samples are reasonable completions. With k = 5, we're sampling from only the five most likely tokens, which for this factual prompt are all sensible choices. The variation comes from the probabilistic sampling, but every option is plausible.
From Single Tokens to Full Text Generation
A single sample is interesting, but the real power of top-k sampling emerges when we generate entire sequences. Each token we generate becomes part of the context for the next token, creating a chain of sampling decisions.
The generation loop follows a simple pattern: get logits, sample a token, append it to the sequence, repeat. At each step, top-k ensures we only consider reasonable continuations.
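A sketch of that loop, reusing the model, tokenizer, and `top_k_sample` defined above (no KV caching here, to keep the logic visible; the prompt is illustrative):

```python
def generate_top_k(prompt: str, max_new_tokens: int = 30,
                   k: int = 40, temperature: float = 1.0) -> str:
    """Generate text by repeatedly sampling the next token with top-k."""
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids

    for _ in range(max_new_tokens):
        with torch.no_grad():
            logits = model(input_ids).logits[0, -1]   # next-token logits
        next_id = top_k_sample(logits, k=k, temperature=temperature)
        # The sampled token becomes part of the context for the next step
        input_ids = torch.cat([input_ids, torch.tensor([[next_id]])], dim=1)

    return tokenizer.decode(input_ids[0])

print(generate_top_k("The weather today is", max_new_tokens=20, k=40))
```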
The Effect of Different k Values
Now we can see how k shapes the character of generated text. A smaller k means stricter filtering, keeping only the most confident predictions. A larger k allows more exploration of the probability space.
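For example, using the `generate_top_k` helper above (the prompt and the particular k values are illustrative):

```python
for k in (5, 50, 500):
    print(f"k = {k:3d}: {generate_top_k('Once upon a time', max_new_tokens=25, k=k)}")
```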
The outputs reveal distinct personalities. With a small k, the text tends toward common, expected phrasings. A moderate k leaves more room for varied word choices while maintaining fluency. With a large k, the model has significant freedom, though even then we've eliminated the vast majority of the vocabulary's long tail.
The heatmap reveals how k shapes sampling behavior empirically. With small k, the dark cells concentrate in the leftmost columns, meaning the top-ranked tokens receive nearly all samples. As k increases, the distribution spreads rightward, but higher-ranked tokens still dominate because they have higher probability even after renormalization.
Choosing the Right Value of k
The choice of k significantly affects generation quality. There's no single optimal value, as the best k depends on both the task and the context.
Guidelines for Selecting k
Consider these factors when choosing k:
- Task type: Creative writing benefits from a higher k (50-100) for variety. Factual responses work better with a lower k (10-30) for precision.
- Context confidence: When the model is highly confident (one token dominates), even a large k won't introduce much diversity, since the top tokens hold most of the mass anyway.
- Generation length: Longer generations may need a lower k to prevent drift. Errors accumulate over many steps, and each improbable token can push the generation off course.
- User expectations: Interactive applications often use k = 40-50 as a reasonable default that balances fluency with variety.
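One way to ground these guidelines is to measure how much probability mass the top-k tokens capture for different prompts. The sketch below reuses the GPT-2 model and tokenizer from earlier; the prompts and k values are illustrative.

```python
def top_k_coverage(prompt: str, k_values=(1, 5, 10, 50, 100, 500)) -> None:
    """Print the probability mass captured by the top-k tokens for a prompt."""
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(input_ids).logits[0, -1]
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, _ = torch.sort(probs, descending=True)
    for k in k_values:
        coverage = sorted_probs[:k].sum().item()
        print(f"  k = {k:4d}: {coverage:6.1%} of the probability mass")

for prompt in ("2 + 2 =", "Once upon a time"):
    print(prompt)
    top_k_coverage(prompt)
```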
The coverage analysis reveals an important pattern: when the model is confident, a small k captures most of the probability mass. "2 + 2 =" concentrates probability in very few tokens, so k = 10 might capture 99%+ of the mass. Open-ended prompts like "Once upon a time" spread probability more evenly, requiring a larger k to maintain diverse sampling.
This visualization makes the intuition concrete. For confident predictions like "2 + 2 =", the curve shoots up almost vertically, reaching 99% coverage with just a handful of tokens. Open-ended prompts show a gentler slope, requiring a larger k to achieve the same coverage. This explains why a fixed k works differently across contexts.
Combining Top-k with Temperature
Top-k sampling and temperature scaling complement each other. Temperature adjusts the shape of the distribution before truncation, while top-k determines how many tokens to keep. Used together, they offer fine-grained control.
The figure shows how temperature reshapes the distribution within the top-k candidates. At low temperature, the top token dominates even among the kept tokens. At high temperature, probability spreads more evenly across all of the kept tokens.
Temperature-First vs Top-k-First
The order of operations matters. Standard practice applies temperature first, then top-k:
- Temperature scaling: Divide logits by temperature
- Top-k selection: Keep the k highest values
- Softmax: Convert to probabilities
- Sample: Draw from the truncated distribution
Why temperature first? For pure top-k it is largely a convention: dividing logits by a positive temperature preserves their ranking, so the same k tokens make the cut whether you scale before or after selection, and the resulting sampling distribution is identical. The order starts to matter once probability-based truncation such as top-p enters the picture, because temperature reshapes the probabilities that determine which tokens fall inside the nucleus. Following the temperature-first convention keeps the pipeline consistent across sampling strategies.
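The sketch below checks this directly, reusing the model and tokenizer from earlier (the prompt is our own example): at each temperature it selects the top 10 tokens and prints the renormalized probability of the leading token.

```python
input_ids = tokenizer("The weather today is", return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(input_ids).logits[0, -1]

for temp in (0.5, 1.0, 1.5):
    values, indices = torch.topk(logits / temp, k=10)   # same token set at every temperature
    probs = torch.softmax(values, dim=-1)                # but different concentration within it
    kept = [tokenizer.decode(i) for i in indices.tolist()]
    print(f"T = {temp}: p(top token) = {probs[0].item():.2f}, kept tokens: {kept}")
```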
The output confirms this: the same ten tokens survive top-k at every temperature, in the same order. What changes is how sharply the renormalized probability concentrates on the top-ranked token, with low temperature letting it dominate and high temperature spreading the mass more evenly across the kept tokens.
Practical Considerations
Moving from theory to production, several practical aspects affect how top-k sampling performs in real applications. This section covers computational efficiency, edge case handling, and batched generation.
Computational Efficiency
Top-k sampling adds minimal overhead to generation. The `torch.topk` operation is efficient, running in roughly $O(V \log k)$ time, where $V$ is the vocabulary size and $k$ is the number of tokens to select. Since $k \ll V$ (typically $k$ is 50-100 while $V$ is 50,000+), this is essentially linear in the vocabulary size.
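A rough way to check this is to time `torch.topk` on a vocabulary-sized tensor of random logits (CPU timing shown here; absolute numbers will vary by hardware):

```python
import time
import torch

vocab_size = 50257                 # GPT-2's vocabulary size
logits = torch.randn(vocab_size)
n_iters = 1000

for k in (10, 50, 100, 1000):
    start = time.perf_counter()
    for _ in range(n_iters):
        torch.topk(logits, k)
    ms_per_call = (time.perf_counter() - start) / n_iters * 1e3
    print(f"k = {k:5d}: {ms_per_call:.4f} ms per call")
```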
The timings confirm that top-k selection takes only fractions of a millisecond, even for the largest values. Since a typical GPT-2 forward pass takes 10-50ms (and larger models take 100ms+), the top-k operation adds less than 1% overhead. This makes top-k sampling practical for production use without any performance concerns.
Handling Edge Cases
Several edge cases require attention in production implementations:
- k larger than vocabulary: If k ≥ the vocabulary size, top-k degenerates to full sampling
- All zero logits: Rare but possible with certain inputs; results in uniform sampling
- Very small k: k = 1 is equivalent to greedy decoding
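A defensive wrapper might handle these cases explicitly; the function name and the specific choices below are ours:

```python
import torch

def safe_top_k_sample(logits: torch.Tensor, k: int) -> int:
    """Top-k sampling with basic edge-case handling."""
    k = max(1, min(k, logits.size(-1)))      # clamp k into [1, vocab_size]
    values, indices = torch.topk(logits, k)
    probs = torch.softmax(values, dim=-1)    # all-equal logits -> uniform probabilities
    pos = torch.multinomial(probs, num_samples=1)
    return indices[pos].item()               # k = 1 reduces to greedy decoding
```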
Batched Generation
For efficiency with multiple sequences, top-k can be applied in parallel across a batch:
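A sketch of a batched sampler (the function name is ours; the random logits below stand in for the next-token logits of four sequences):

```python
import torch

def top_k_sample_batched(logits: torch.Tensor, k: int = 50,
                         temperature: float = 1.0) -> torch.Tensor:
    """Sample one token ID per sequence from a (batch_size, vocab_size) logits tensor."""
    scaled = logits / temperature
    top_values, top_indices = torch.topk(scaled, k, dim=-1)   # (B, k)
    probs = torch.softmax(top_values, dim=-1)                 # (B, k)
    sampled_pos = torch.multinomial(probs, num_samples=1)     # (B, 1), one draw per row
    return top_indices.gather(-1, sampled_pos).squeeze(-1)    # (B,) vocabulary IDs

batch_logits = torch.randn(4, 50257)   # stand-in for four sequences' next-token logits
print(top_k_sample_batched(batch_logits, k=50))
```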
The batched implementation processes all four sequences in a single operation, returning one sampled token ID per sequence. This vectorized approach is essential for efficient inference when generating multiple sequences in parallel, as it avoids the overhead of looping through sequences individually.
Limitations of Top-k Sampling
While top-k is widely used, it has notable limitations that motivated the development of alternative approaches like nucleus (top-p) sampling.
Fixed k Ignores Context
The fundamental limitation is that k is fixed regardless of context. Consider two scenarios:
- High-confidence context: "The president of the United States in 2020 was Donald", where very few tokens are reasonable next.
- Open-ended context: "I think the best way to", where many tokens could reasonably follow.
A fixed k treats both identically. If k = 50, the first case includes many implausible tokens (the model might only need k = 5). The second case might benefit from even more options.
Quality-Diversity Trade-off
No single k value works optimally across all situations. A lower k improves quality but reduces diversity. A higher k increases diversity but risks including low-quality tokens. This creates tension when the goal is both high quality and interesting variation.
The next chapter on nucleus sampling shows how adapting the truncation threshold to the probability distribution itself addresses these limitations.
Comparison with Other Decoding Methods
Let's compare top-k sampling with other decoding strategies to understand when each is most appropriate.
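One way to run such a comparison is through Hugging Face's `generate()` method, reusing the GPT-2 model and tokenizer from earlier. The prompt and settings are illustrative, and `top_k=0` is used on the assumption that it disables top-k filtering (behavior may vary across transformers versions):

```python
prompt = "The future of artificial intelligence"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

configs = {
    "greedy":           dict(do_sample=False),
    "temperature only": dict(do_sample=True, temperature=1.0, top_k=0),
    "top-k":            dict(do_sample=True, top_k=40),
    "top-k + temp":     dict(do_sample=True, top_k=40, temperature=0.8),
}

for name, kwargs in configs.items():
    out = model.generate(input_ids, max_new_tokens=25,
                         pad_token_id=tokenizer.eos_token_id, **kwargs)
    print(f"{name:16s}: {tokenizer.decode(out[0], skip_special_tokens=True)}")
```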
The comparison reveals characteristic differences. Greedy decoding produces focused but potentially repetitive text. Pure temperature sampling adds variety but can drift. Top-k restricts the sampling space while preserving diversity. Combining top-k with temperature offers the most control.
Key Parameters
When using top-k sampling for text generation, the following parameters have the greatest impact on output quality:
- k (`top_k`): The number of highest-probability tokens to keep. Values of 40-100 work well for most applications. Lower values (10-30) produce more focused, deterministic output suited for factual content. Higher values (100-200) allow more creative diversity but risk occasional incoherent tokens.
- `temperature`: Scales the logits before computing probabilities. Values below 1.0 sharpen the distribution, making the top tokens more dominant. Values above 1.0 flatten the distribution, spreading probability more evenly. Common values range from 0.7 to 1.2. Temperature is typically applied before top-k selection.
- `do_sample`: Boolean flag in Hugging Face's `generate()` method. Must be `True` to enable any sampling strategy, including top-k. When `False`, the model uses greedy decoding regardless of other parameters.
- `max_new_tokens`: Limits the number of tokens to generate. Longer generations may accumulate sampling noise, so lower values can help maintain coherence over extended outputs.
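A typical call combining these parameters might look like the following (reusing the model and tokenizer from earlier; the prompt and values are illustrative defaults, not prescriptions):

```python
inputs = tokenizer("The key to good writing is", return_tensors="pt")
output = model.generate(
    **inputs,
    do_sample=True,                        # enable sampling (otherwise decoding is greedy)
    top_k=50,                              # keep the 50 most probable tokens at each step
    temperature=0.8,                       # sharpen the distribution before truncation
    max_new_tokens=40,                     # cap the length of the continuation
    pad_token_id=tokenizer.eos_token_id,   # avoid the missing-pad-token warning for GPT-2
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```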
Summary
Top-k sampling provides a practical solution to the long-tail problem in language model decoding. By truncating the vocabulary to the most likely tokens and renormalizing, it eliminates improbable tokens while preserving meaningful diversity among plausible choices.
Key takeaways from this chapter:
- Truncation principle: Top-k zeros out all tokens outside the k highest-probability candidates, then renormalizes the remaining probabilities. This prevents sampling from the incoherent long tail while allowing variety among reasonable options.
- Temperature interaction: Temperature scaling reshapes the distribution before truncation. Low temperature concentrates probability in fewer tokens; high temperature spreads it more evenly. Because scaling by a positive temperature preserves the ranking of logits, it changes how probability is distributed within the top-k set rather than which tokens make the cut.
- Parameter selection: Common values range from k = 40 to k = 100. A lower k produces more focused text; a higher k allows more diversity. The optimal choice depends on task, context confidence, and user preferences.
- Fixed-k limitation: A constant k applies regardless of context, including too many tokens when the model is confident and potentially too few when it's uncertain. This motivates adaptive approaches like nucleus sampling.
- Computational efficiency: Top-k selection adds negligible overhead, with roughly $O(V \log k)$ time complexity, where $V$ is the vocabulary size and $k$ is the number of tokens kept. This is fast compared to the model forward pass, making it practical for production use.
Top-k sampling strikes a useful balance between greedy decoding (deterministic but repetitive) and pure sampling (diverse but sometimes incoherent). While nucleus sampling offers more adaptive truncation, top-k remains widely used due to its simplicity and interpretability. The next chapter explores nucleus sampling, which addresses the fixed-k limitation by adapting the threshold to each distribution's shape.
About the author: Michael Brenndoerfer
All opinions expressed here are my own and do not reflect the views of my employer.
Michael currently works as an Associate Director of Data Science at EQT Partners in Singapore, leading AI and data initiatives across private capital investments.
With over a decade of experience spanning private equity, management consulting, and software engineering, he specializes in building and scaling analytics capabilities from the ground up. He has published research in leading AI conferences and holds expertise in machine learning, natural language processing, and value creation through data.
Related Content

Repetition Penalties: Preventing Loops in Language Model Generation
Learn how repetition penalty, frequency penalty, presence penalty, and n-gram blocking prevent language models from getting stuck in repetitive loops during text generation.

Constrained Decoding: Grammar-Guided Generation for Structured LLM Output
Learn how constrained decoding forces language models to generate valid JSON, SQL, and regex-matching text through token masking and grammar-guided generation.

Autoregressive Generation: How GPT Generates Text Token by Token
Master the mechanics of autoregressive generation in transformers, including the generation loop, KV caching for efficiency, stopping criteria, and speed optimizations for production deployment.