Top-k Sampling: Controlling Language Model Text Generation

Michael Brenndoerfer · Updated July 29, 2025 · 30 min read

Learn how top-k sampling truncates vocabulary to the k most probable tokens, eliminating incoherent outputs while preserving diversity in language model generation.


Top-k Sampling

When a language model generates text, it produces a probability distribution over tens of thousands of possible next tokens. Greedy decoding picks the single most likely token at each step, which leads to repetitive and predictable output. Raising the temperature flattens the distribution, but in doing so it boosts the probability of unlikely tokens, including nonsensical ones. Top-k sampling offers a different solution: keep only the k most probable tokens and sample from this truncated distribution.

The core idea is straightforward. Rather than considering all 50,000+ tokens in the vocabulary, top-k sampling zeros out the probability of everything outside the top k candidates. This eliminates the long tail of unlikely tokens while preserving meaningful diversity among the plausible choices. The result is text that reads naturally without veering into incoherence.

This chapter explores top-k sampling in detail. We'll examine the mathematics behind truncation, implement the algorithm from scratch, discuss how to select appropriate values of k, and combine top-k with temperature for fine-grained control over generation quality.

The Problem with Full Vocabulary Sampling

Before diving into top-k, let's understand why sampling from the full vocabulary distribution causes problems. A trained language model assigns non-zero probability to every token, even ones that make no sense in context.

In[4]:
Code
import torch


def get_next_token_distribution(model, tokenizer, prompt):
    """Get the probability distribution over next tokens."""
    inputs = tokenizer(prompt, return_tensors="pt")

    with torch.no_grad():
        outputs = model(inputs["input_ids"])
        logits = outputs.logits[0, -1, :]
        probs = torch.softmax(logits, dim=0)

    return probs
Out[5]:
Console
Prompt: 'The capital of France is'

Vocabulary size: 50,257
Tokens with P > 1%: 14
Tokens with P > 0.1%: 109

Top 5 predictions:
  1. ' the': 0.0846
  2. ' now': 0.0479
  3. ' a': 0.0462
  4. ' France': 0.0324
  5. ' Paris': 0.0322

A large share of the probability mass sits in the long tail of tokens with tiny individual probabilities. While each of these tokens has a near-zero chance of being selected, collectively they can account for a substantial fraction of the total. Sampling from the full distribution therefore occasionally produces bizarre tokens that derail generation.

Out[6]:
Visualization
Log-scale bar chart showing token probabilities ranked by likelihood, with steep dropoff after top tokens.
Next-token probability distribution for 'The capital of France is'. The distribution is highly skewed: a few tokens dominate while thousands have negligible but non-zero probability. Sampling from the full distribution occasionally selects from this long tail, producing incoherent text.

The figure illustrates why the long tail is problematic. After the first few tokens, probability drops rapidly, but thousands of tokens retain some small probability. Summed together, this tail can be significant. If we sample proportionally, low-probability tokens occasionally win, producing outputs like "The capital of France is ????????" or "The capital of France is asdf".
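To put a number on the tail, a small check like the one below sums the probability mass that falls outside the top k tokens. It is a sketch that reuses get_next_token_distribution from above, so it assumes the same model and tokenizer objects; the exact percentages will vary by model and prompt.

# Sketch: measure how much probability mass lies outside the top-k tokens.
# Assumes `model` and `tokenizer` are the objects loaded earlier in the chapter.
probs = get_next_token_distribution(model, tokenizer, "The capital of France is")

sorted_probs, _ = torch.sort(probs, descending=True)
for k in [10, 50, 100]:
    tail_mass = sorted_probs[k:].sum().item()
    print(f"Probability mass outside the top {k} tokens: {tail_mass:.1%}")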

How Top-k Sampling Works

Now that we understand the problem, let's build up the solution piece by piece. The core insight behind top-k sampling is simple: if most of the probability mass concentrates in a handful of tokens, why not just sample from those and ignore the rest?

The Intuition: Focus on What Matters

Think about how you complete sentences. Given "The capital of France is", you don't mentally consider every possible word in English. You immediately focus on a small set of plausible continuations: "Paris", maybe "a", "the", or "known". Your mental probability distribution isn't uniform across all words, and neither is the model's. Top-k sampling formalizes this intuition by explicitly restricting attention to the most likely candidates.

The approach works in five steps:

  1. Compute probabilities: The model outputs a distribution over all vocabulary tokens
  2. Rank by likelihood: Sort tokens from most to least probable
  3. Keep the top k: Select only the k highest-probability tokens
  4. Zero out the rest: Set all other token probabilities to exactly zero
  5. Renormalize and sample: Rescale the remaining probabilities so they sum to 1, then sample
Top-k Sampling

Top-k sampling restricts the sampling space to the k most probable tokens. After zeroing out tokens outside the top k, the remaining probabilities are renormalized before sampling.

From Intuition to Mathematics

Let's formalize this process. At generation step $i$, the model has seen context $x_{<i}$ (all previous tokens) and outputs a probability $P(x_i \mid x_{<i})$ for each possible next token $x_i$. This distribution spans the entire vocabulary, but we only want to sample from the top performers.

First, we need to identify which tokens make the cut. Define $V_k$ as the set containing the k tokens with highest probability. If the vocabulary has 50,000 tokens and k = 50, then $V_k$ contains exactly the 50 most likely tokens for this specific context.

Now we need to handle a subtle issue: after removing tokens from consideration, the probabilities of the remaining tokens no longer sum to 1. A probability distribution must sum to 1 for sampling to work correctly. The solution is renormalization. We divide each remaining probability by the sum of all kept probabilities.

This gives us the formal definition of top-k sampling:

$$
P_k(x_i \mid x_{<i}) =
\begin{cases}
\dfrac{P(x_i \mid x_{<i})}{Z_k} & \text{if } x_i \in V_k \\
0 & \text{otherwise}
\end{cases}
$$

where:

  • $x_i$: a candidate token at position $i$
  • $x_{<i}$: the context (all tokens before position $i$)
  • $P(x_i \mid x_{<i})$: the original probability the model assigns to token $x_i$ given the context
  • $V_k$: the set of k tokens with highest probability under $P(\cdot \mid x_{<i})$
  • $Z_k = \sum_{x \in V_k} P(x \mid x_{<i})$: the normalization constant, which equals the sum of probabilities of the top-k tokens
  • $P_k(x_i \mid x_{<i})$: the renormalized probability used for sampling

Why Renormalization Preserves Relative Probabilities

A critical property of this construction is that renormalization preserves the relative likelihood of tokens within the top-k. Consider two tokens, "Paris" with original probability 0.30 and "the" with probability 0.15. In the original distribution, "Paris" is exactly twice as likely as "the".

After truncation, suppose the top-k tokens have cumulative probability $Z_k = 0.90$. The renormalized probabilities become:

  • "Paris": 0.30/0.90=0.3330.30 / 0.90 = 0.333
  • "the": 0.15/0.90=0.1670.15 / 0.90 = 0.167

Notice that 0.333 / 0.167 = 2. The ratio is preserved! This happens because we divide both probabilities by the same constant $Z_k$. Renormalization scales all probabilities equally, maintaining their relative ordering and ratios.

This property matters because it means top-k sampling respects the model's preferences among plausible tokens. We're not arbitrarily reweighting tokens; we're simply restricting which tokens can be sampled while honoring the model's ranking within that restricted set.
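A quick numerical check makes the property concrete. The sketch below uses a small set of made-up probabilities (illustrative values, not model output), truncates to the top 3, renormalizes, and compares the "Paris"-to-"the" ratio before and after:

import torch

# Toy example: renormalization preserves probability ratios within the top-k.
probs = torch.tensor([0.30, 0.15, 0.10, 0.05, 0.03])  # "Paris", "the", "a", ...

k = 3
top_vals, _ = torch.topk(probs, k)
renormalized = top_vals / top_vals.sum()  # divide by Z_k

print("Ratio before truncation:   ", (probs[0] / probs[1]).item())                 # ≈ 2.0
print("Ratio after renormalization:", (renormalized[0] / renormalized[1]).item())  # ≈ 2.0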

Visualizing the Truncation

The following visualization shows this process concretely. On the left, we see the original distribution over the top 20 tokens. On the right, we see the truncated distribution after keeping only k = 10 tokens and renormalizing.

Out[7]:
Visualization
Original distribution (top 20 tokens shown). Blue bars indicate tokens kept in top-10, gray bars are truncated.
Top-10 truncated distribution after renormalization. Probabilities now sum to 1.

Implementing Top-k Sampling

With the mathematics established, let's translate the algorithm into code. The implementation is straightforward, but walking through it step by step reveals how each line corresponds to a piece of the formula.

Building the Core Sampler

Our implementation needs to accomplish four things:

  1. Apply temperature scaling to the logits
  2. Find the k highest-scoring tokens
  3. Convert those scores to a valid probability distribution
  4. Sample one token from that distribution

The key insight is that we don't need to explicitly zero out the non-top-k tokens. Instead, we can compute softmax over only the top-k logits, which automatically gives us a properly normalized distribution over just those tokens.

In[8]:
Code
def top_k_sample(logits, k, temperature=1.0):
    """
    Sample from top-k truncated distribution.

    Args:
        logits: Raw logits from model, shape (vocab_size,)
        k: Number of top tokens to keep
        temperature: Temperature for scaling before truncation

    Returns:
        Sampled token index
    """
    # Apply temperature scaling
    scaled_logits = logits / temperature

    # Get top-k values and indices
    top_k_values, top_k_indices = torch.topk(scaled_logits, k)

    # Convert to probabilities
    top_k_probs = torch.softmax(top_k_values, dim=0)

    # Sample from truncated distribution
    sample_idx = torch.multinomial(top_k_probs, num_samples=1)

    # Map back to original vocabulary index
    return top_k_indices[sample_idx].item()

Let's trace through each step:

  • Temperature scaling divides each logit by the temperature, controlling how peaked or flat the distribution becomes
  • torch.topk efficiently finds the k largest values and their positions in the vocabulary, returning both the values and their indices
  • Softmax over top-k values computes $e^{z_i} / \sum_j e^{z_j}$ using only the kept tokens, which is equivalent to the renormalization step in our formula (verified in the sketch after this list)
  • torch.multinomial samples according to the probability distribution, returning an index into our top-k list
  • Index mapping converts the sampled position (0 to k-1) back to the actual vocabulary token ID
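It's worth verifying the claim that softmax over just the top-k logits matches explicit truncate-and-renormalize. The sketch below compares both routes on random stand-in logits (no model involved):

import torch

torch.manual_seed(0)
logits = torch.randn(1000)  # stand-in for model logits
k = 10

# Route 1: softmax over only the top-k logits (what top_k_sample does)
top_vals, top_idx = torch.topk(logits, k)
route_1 = torch.softmax(top_vals, dim=0)

# Route 2: full softmax, then keep the top-k probabilities and renormalize
full_probs = torch.softmax(logits, dim=0)
kept = full_probs[top_idx]
route_2 = kept / kept.sum()

print(torch.allclose(route_1, route_2, atol=1e-6))  # True: the two routes agree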

Seeing It in Action

Let's sample multiple times from the same context to see the variety top-k produces:

Out[9]:
Console
Prompt: 'The capital of France is'

Sampling with k=5 (10 samples):
  Sample 1: ' the'
  Sample 2: ' France'
  Sample 3: ' the'
  Sample 4: ' now'
  Sample 5: ' the'
  Sample 6: ' the'
  Sample 7: ' the'
  Sample 8: ' a'
  Sample 9: ' now'
  Sample 10: ' a'

Notice how all samples are reasonable completions. With k = 5, we're sampling from only the five most likely tokens, which for this factual prompt are all sensible choices. The variation comes from the probabilistic sampling, but every option is plausible.

From Single Tokens to Full Text Generation

A single sample is interesting, but the real power of top-k sampling emerges when we generate entire sequences. Each token we generate becomes part of the context for the next token, creating a chain of sampling decisions.

The generation loop follows a simple pattern: get logits, sample a token, append it to the sequence, repeat. At each step, top-k ensures we only consider reasonable continuations.

In[10]:
Code
def generate_with_top_k(
    model, tokenizer, prompt, max_tokens=30, k=50, temperature=1.0
):
    """Generate text using top-k sampling."""
    input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"]
    generated_ids = input_ids.clone()

    for _ in range(max_tokens):
        with torch.no_grad():
            outputs = model(generated_ids)
            next_token_logits = outputs.logits[0, -1, :]

        # Sample using top-k
        next_token_id = top_k_sample(
            next_token_logits, k=k, temperature=temperature
        )

        # Append and continue
        generated_ids = torch.cat(
            [generated_ids, torch.tensor([[next_token_id]])], dim=1
        )

        # Stop if EOS token
        if next_token_id == tokenizer.eos_token_id:
            break

    return tokenizer.decode(generated_ids[0], skip_special_tokens=True)

The Effect of Different k Values

Now we can see how k shapes the character of generated text. Smaller k means stricter filtering, keeping only the most confident predictions. Larger k allows more exploration of the probability space.

Out[11]:
Console
Prompt: 'Artificial intelligence will'

Generated continuations:
--------------------------------------------------
k=10: Artificial intelligence will also be more complex, and more diverse, than ever before. For example, artificial intelligence, which is already a key technology for the world's financial

k=50: Artificial intelligence will also create even more exciting challenges for our nation's workers, and many of the benefits will come from improving our ability to build our own factories and reduce

k=200: Artificial intelligence will become so advanced that, like our pets, children, pets will be doing almost any form of AI-related activity. Companies are taking our money.

The outputs reveal distinct personalities. With k = 10, the text tends toward common, expected phrasings. With k = 50, there's more room for varied word choices while maintaining fluency. With k = 200, the model has significant freedom, though even here we've eliminated the vast majority of the vocabulary's long tail.

Out[12]:
Visualization
Heatmap showing sampling counts across token ranks for different k values.
Empirical sampling frequency across 100 samples for different k values. With k=5, nearly all samples come from the top 2-3 tokens. As k increases, the distribution spreads more evenly across the kept tokens, though higher-probability tokens still dominate.

The heatmap reveals how k shapes sampling behavior empirically. With k = 5, the dark cells concentrate in the leftmost columns, meaning the top-ranked tokens receive nearly all samples. As k increases, the distribution spreads rightward, but higher-ranked tokens still dominate because they have higher probability even after renormalization.

Choosing the Right Value of k

The choice of k significantly affects generation quality. There's no single optimal value, as the best k depends on both the task and the context.

Out[13]:
Visualization
Line plot showing diversity increasing and coherence decreasing as k increases, with optimal zone highlighted.
Trade-off between diversity and coherence as k varies. Lower k values produce more focused text but risk repetition. Higher k values increase diversity but may introduce occasional off-topic tokens. The shaded region represents typical values used in practice (k=30-100).

Guidelines for Selecting k

Consider these factors when choosing k:

  • Task type: Creative writing benefits from higher k (50-100) for variety. Factual responses work better with lower k (10-30) for precision.

  • Context confidence: When the model is highly confident (one token dominates), even a large k won't introduce much diversity since the top tokens hold most of the mass anyway.

  • Generation length: Longer generations may need lower k to prevent drift. Errors accumulate over many steps, and each improbable token can push the generation off course.

  • User expectations: Interactive applications often use k = 40-50 as a reasonable default that balances fluency with variety.

In[14]:
Code
def analyze_top_k_coverage(probs, k_values):
    """Analyze what fraction of probability mass top-k captures."""
    sorted_probs, _ = torch.sort(probs, descending=True)

    coverage = {}
    for k in k_values:
        coverage[k] = sorted_probs[:k].sum().item()

    return coverage
Out[15]:
Console
Top-k Probability Coverage by Context Type:
============================================================

Prompt: 'The capital of France is'
  k=  5: 24.3% of probability mass
  k= 10: 35.9% of probability mass
  k= 20: 45.7% of probability mass
  k= 50: 54.9% of probability mass
  k=100: 62.4% of probability mass

Prompt: 'Once upon a time in a land far'
  k=  5: 68.3% of probability mass
  k= 10: 79.8% of probability mass
  k= 20: 86.2% of probability mass
  k= 50: 92.8% of probability mass
  k=100: 95.4% of probability mass

Prompt: '2 + 2 ='
  k=  5: 40.8% of probability mass
  k= 10: 52.5% of probability mass
  k= 20: 62.0% of probability mass
  k= 50: 70.5% of probability mass
  k=100: 76.0% of probability mass

The coverage analysis reveals an important pattern: how much probability mass a given k captures depends heavily on the context. When the model concentrates its prediction, a small k captures most of the mass; when the distribution is diffuse, the same k covers far less. In these runs, the story prompt is the most concentrated (k = 10 already covers roughly 80% of the mass), while "The capital of France is" is the most spread out, with even k = 100 covering only about 62%. Diffuse contexts therefore need a larger k to preserve the same share of the model's distribution.

Out[16]:
Visualization
Line plot showing cumulative probability curves for three prompts, demonstrating different coverage rates.
Cumulative probability coverage as k increases for three different contexts. Confident predictions (blue) reach near-complete coverage with tiny k values. Open-ended contexts (green) require larger k to capture the same probability mass. The dashed lines show the 90% and 99% coverage thresholds.

This visualization makes the intuition concrete. The more concentrated a distribution, the more steeply its cumulative curve rises, reaching high coverage within the first few dozen tokens; flatter distributions climb more gradually and need a much larger k to reach the same coverage. This explains why a fixed k behaves differently across contexts.
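One way to quantify this context dependence, sketched below on synthetic distributions, is to ask for the smallest k that reaches a target coverage. The helper name is ours and the inputs are stand-ins rather than model output; this per-distribution threshold is essentially the quantity that nucleus sampling adapts to automatically.

import torch

def smallest_k_for_coverage(probs, target=0.9):
    """Smallest k whose top-k tokens cover at least `target` probability mass."""
    sorted_probs, _ = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=0)
    return int((cumulative >= target).nonzero()[0].item()) + 1

# Synthetic examples: a peaked distribution vs. a nearly uniform one
peaked_logits = torch.zeros(1000)
peaked_logits[:5] = torch.tensor([10.0, 8.0, 6.0, 4.0, 2.0])
peaked = torch.softmax(peaked_logits, dim=0)
uniform = torch.softmax(torch.zeros(1000), dim=0)

print(smallest_k_for_coverage(peaked))   # a couple of tokens suffice
print(smallest_k_for_coverage(uniform))  # needs most of the 1000 tokens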

Combining Top-k with Temperature

Top-k sampling and temperature scaling complement each other. Temperature adjusts the shape of the distribution before truncation, while top-k determines how many tokens to keep. Used together, they offer fine-grained control.

Out[17]:
Visualization
Low temperature (T=0.5) concentrates probability in top tokens, making k=20 effectively sample from fewer options.
High temperature (T=1.5) flattens the distribution, spreading probability more evenly across all 20 kept tokens.

The figure shows how temperature reshapes the distribution within the top-k candidates. At low temperature, the top token dominates even among the kept tokens. At high temperature, probability spreads more evenly across all k tokens.
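The reshaping is easy to measure. The sketch below applies three temperatures to the same top-20 logits (random stand-ins, not model output) and reports how much of the kept probability mass goes to the single best token:

import torch

torch.manual_seed(0)
logits = torch.randn(50257)            # stand-in for model logits
top_vals, _ = torch.topk(logits, 20)   # keep k = 20

for T in [0.5, 1.0, 1.5]:
    probs = torch.softmax(top_vals / T, dim=0)
    print(f"T={T}: top token receives {probs.max().item():.1%} of the kept mass")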

Temperature-First vs Top-k-First

The order of operations matters. Standard practice applies temperature first, then top-k:

  1. Temperature scaling: Divide logits by temperature
  2. Top-k selection: Keep the k highest values
  3. Softmax: Convert to probabilities
  4. Sample: Draw from the truncated distribution
In[18]:
Code
def top_k_sample_with_temperature(logits, k, temperature=1.0):
    """
    Top-k sampling with temperature applied first.

    This is the standard order: temperature reshapes the distribution,
    then top-k truncates, then we sample.
    """
    # Step 1: Temperature scaling
    scaled_logits = logits / temperature

    # Step 2: Get top-k
    top_k_vals, top_k_idx = torch.topk(scaled_logits, k)

    # Step 3: Softmax over top-k only
    top_k_probs = torch.softmax(top_k_vals, dim=0)

    # Step 4: Sample
    sample_idx = torch.multinomial(top_k_probs, num_samples=1)

    return top_k_idx[sample_idx].item()

Why temperature first? For top-k on its own, the order turns out not to matter: dividing all logits by a positive temperature is a monotonic transformation, so the same k tokens are selected either way, and softmax over the scaled top-k values yields the same sampling distribution in both orders. The convention of applying temperature first becomes important when truncation depends on probability values rather than ranks, as in nucleus (top-p) sampling, where reshaping the distribution first changes which tokens clear the cumulative threshold. Following the temperature-then-truncate order for top-k keeps the pipeline consistent with those methods.

Out[19]:
Console
Prompt: 'The best programming language is'

Top-10 tokens at different temperatures:
--------------------------------------------------
T=0.5: [' Java', ' Python', ' the', ' a', ' not', ' C', ' one', ' JavaScript', ' Haskell', ' probably']
T=1.0: [' Java', ' Python', ' the', ' a', ' not', ' C', ' one', ' JavaScript', ' Haskell', ' probably']
T=2.0: [' Java', ' Python', ' the', ' a', ' not', ' C', ' one', ' JavaScript', ' Haskell', ' probably']

The output confirms this: the top-10 set is identical at every temperature, because temperature scaling preserves the ranking of the logits. What temperature does change is how probability is distributed within that set, with lower temperatures concentrating mass on the highest-ranked tokens and higher temperatures spreading it more evenly.
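A quick check makes the invariance explicit: because dividing by a positive temperature is monotonic, torch.topk returns the same indices before and after scaling. A minimal sketch on random stand-in logits:

import torch

torch.manual_seed(0)
logits = torch.randn(50257)  # stand-in for model logits
k = 10

base_idx = torch.topk(logits, k).indices
for T in [0.5, 1.0, 2.0]:
    scaled_idx = torch.topk(logits / T, k).indices
    print(f"T={T}: same top-{k} set and order -> {torch.equal(base_idx, scaled_idx)}")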

Practical Considerations

Moving from theory to production, several practical aspects affect how top-k sampling performs in real applications. This section covers computational efficiency, edge case handling, and batched generation.

Computational Efficiency

Top-k sampling adds minimal overhead to generation. The torch.topk operation is efficient, running in $O(n \log k)$ time, where n is the vocabulary size and k is the number of tokens to select. Since k is much smaller than n (typically k is 50-100 while n is 50,000+), this is essentially linear in the vocabulary size.

In[20]:
Code
import time


def benchmark_top_k(
    vocab_size=50257, k_values=[10, 50, 100, 500], n_trials=1000
):
    """Benchmark top-k selection speed."""
    logits = torch.randn(vocab_size)

    results = {}
    for k in k_values:
        start = time.perf_counter()
        for _ in range(n_trials):
            _ = torch.topk(logits, k)
        elapsed = time.perf_counter() - start
        results[k] = elapsed / n_trials * 1000  # Convert to ms

    return results
Out[21]:
Console
Top-k selection time (per call):
  k= 10: 0.0631 ms
  k= 50: 0.0716 ms
  k=100: 0.0769 ms
  k=500: 0.1319 ms

The timings confirm that top-k selection takes only fractions of a millisecond, even for the largest k values. Since a typical GPT-2 forward pass takes 10-50ms (and larger models take 100ms+), the top-k operation adds roughly 1% overhead or less. This makes top-k sampling practical for production use without any performance concerns.

Handling Edge Cases

Several edge cases require attention in production implementations:

  • k larger than vocabulary: If k ≥ the vocabulary size, top-k degenerates to full sampling
  • All zero logits: Rare but possible with certain inputs; results in uniform sampling
  • Very small k: k = 1 is equivalent to greedy decoding
In[22]:
Code
def robust_top_k_sample(logits, k, temperature=1.0, min_tokens=1):
    """
    Robust top-k sampling with edge case handling.
    """
    vocab_size = logits.size(0)

    # Clamp k to valid range
    k = max(min_tokens, min(k, vocab_size))

    # Handle temperature
    if temperature <= 0:
        # Greedy selection
        return logits.argmax().item()

    scaled_logits = logits / temperature

    # Check for numerical issues
    if torch.isnan(scaled_logits).any() or torch.isinf(scaled_logits).any():
        # Fall back to argmax
        return logits.argmax().item()

    top_k_vals, top_k_idx = torch.topk(scaled_logits, k)
    top_k_probs = torch.softmax(top_k_vals, dim=0)

    # Handle case where all probabilities become 0 or nan
    if top_k_probs.sum() == 0 or torch.isnan(top_k_probs).any():
        # Uniform sampling over top-k
        sample_idx = torch.randint(0, k, (1,))
    else:
        sample_idx = torch.multinomial(top_k_probs, num_samples=1)

    return top_k_idx[sample_idx].item()

Batched Generation

For efficiency with multiple sequences, top-k can be applied in parallel across a batch:

In[23]:
Code
def batched_top_k_sample(logits, k, temperature=1.0):
    """
    Top-k sampling for batched logits.

    Args:
        logits: Shape (batch_size, vocab_size)
        k: Number of top tokens
        temperature: Temperature for scaling

    Returns:
        Tensor of sampled token indices, shape (batch_size,)
    """
    batch_size = logits.size(0)

    scaled_logits = logits / temperature
    top_k_vals, top_k_idx = torch.topk(scaled_logits, k, dim=-1)
    top_k_probs = torch.softmax(top_k_vals, dim=-1)

    # Sample one token per sequence
    sample_indices = torch.multinomial(top_k_probs, num_samples=1).squeeze(-1)

    # Gather the actual token IDs
    selected_tokens = top_k_idx[torch.arange(batch_size), sample_indices]

    return selected_tokens
Out[24]:
Console
Batch size: 4
Sampled token IDs: [29552, 19987, 48649, 40351]

The batched implementation processes all four sequences in a single operation, returning one sampled token ID per sequence. This vectorized approach is essential for efficient inference when generating multiple sequences in parallel, as it avoids the overhead of looping through sequences individually.
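For reference, a call to the batched sampler might look like the following sketch; the logits here are random stand-ins with GPT-2's vocabulary size rather than real model output, so the sampled IDs will differ from those above.

import torch

torch.manual_seed(0)
batch_logits = torch.randn(4, 50257)  # (batch_size, vocab_size) stand-in

token_ids = batched_top_k_sample(batch_logits, k=50, temperature=0.8)
print(token_ids.shape)    # torch.Size([4]): one sampled token per sequence
print(token_ids.tolist())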

Limitations of Top-k Sampling

While top-k is widely used, it has notable limitations that motivated the development of alternative approaches like nucleus (top-p) sampling.

Fixed k Ignores Context

The fundamental limitation is that k is fixed regardless of context. Consider two scenarios:

  • High-confidence context: "The president of the United States in 2020 was Donald", where very few tokens are reasonable continuations.
  • Open-ended context: "I think the best way to", where many tokens could reasonably follow.

A fixed k treats both identically. If k = 50, the first case includes many implausible tokens (the model might only need k = 5), while the second might benefit from even more options.
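The mismatch can be made concrete by counting how many kept tokens actually carry non-negligible probability. The sketch below uses synthetic logits for the two scenarios (one dominant token versus a flat spread); the threshold and shapes are illustrative choices, not values from the chapter.

import torch

def useful_tokens_in_top_k(logits, k=50, threshold=1e-3):
    """Count kept tokens whose renormalized probability exceeds `threshold`."""
    top_vals, _ = torch.topk(logits, k)
    probs = torch.softmax(top_vals, dim=0)
    return int((probs > threshold).sum().item())

confident = torch.zeros(50257)   # high-confidence context: one token dominates
confident[0] = 12.0
open_ended = torch.randn(50257)  # open-ended context: many comparable tokens

print(useful_tokens_in_top_k(confident))   # only one or two of the 50 matter
print(useful_tokens_in_top_k(open_ended))  # nearly all 50 carry real mass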

Out[25]:
Visualization
High confidence context: k=50 includes many unnecessary low-probability tokens when the model is already certain.
Low confidence context: k=50 may exclude viable options when probability is spread more evenly.

Quality-Diversity Trade-off

No single k value works optimally across all situations. Lower k improves quality but reduces diversity. Higher k increases diversity but risks including low-quality tokens. This creates tension when the goal is both high quality and interesting variation.

The next chapter on nucleus sampling shows how adapting the truncation threshold to the probability distribution itself addresses these limitations.

Comparison with Other Decoding Methods

Let's compare top-k sampling with other decoding strategies to understand when each is most appropriate.

In[26]:
Code
def generate_comparison(model, tokenizer, prompt, max_tokens=30):
    """Generate using different decoding strategies."""
    results = {}

    # Greedy
    input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        output = model.generate(
            input_ids, max_new_tokens=max_tokens, do_sample=False
        )
    results["greedy"] = tokenizer.decode(output[0], skip_special_tokens=True)

    # Temperature sampling (T=0.8)
    with torch.no_grad():
        output = model.generate(
            input_ids,
            max_new_tokens=max_tokens,
            do_sample=True,
            temperature=0.8,
            top_k=0,  # top_k=0 disables top-k
        )
    results["temperature_0.8"] = tokenizer.decode(
        output[0], skip_special_tokens=True
    )

    # Top-k (k=50)
    with torch.no_grad():
        output = model.generate(
            input_ids,
            max_new_tokens=max_tokens,
            do_sample=True,
            top_k=50,
            temperature=1.0,
        )
    results["top_k_50"] = tokenizer.decode(output[0], skip_special_tokens=True)

    # Top-k + Temperature (k=50, T=0.8)
    with torch.no_grad():
        output = model.generate(
            input_ids,
            max_new_tokens=max_tokens,
            do_sample=True,
            top_k=50,
            temperature=0.8,
        )
    results["top_k_50_temp_0.8"] = tokenizer.decode(
        output[0], skip_special_tokens=True
    )

    return results
Out[27]:
Console
Prompt: 'The future of renewable energy'
============================================================

greedy:
  The future of renewable energy is in the hands of the people.

"We need to be able to do it in a way that is sustainable and that is sustainable for

temperature_0.8:
  The future of renewable energy (starting in 2020) is very uncertain. It is not going to be created at a quick pace. We will need to wait until the global climate

top_k_50:
  The future of renewable energy is a good one," he stressed. "It's the kind of situation where we have to look at renewables in general, because they're going to

top_k_50_temp_0.8:
  The future of renewable energy in the US is very uncertain," said Bruce Leinwell, CEO of the Energy Information Administration.

"We want to have a sustainable energy

The comparison reveals characteristic differences. Greedy decoding produces focused but potentially repetitive text. Pure temperature sampling adds variety but can drift. Top-k restricts the sampling space while preserving diversity. Combining top-k with temperature offers the most control.

Out[28]:
Visualization
Four panel diagram showing how each decoding strategy modifies the probability distribution.
Conceptual comparison of decoding strategies. Greedy takes only the most likely token. Temperature adjusts distribution shape. Top-k truncates to k best tokens. Nucleus (top-p) truncates to cumulative probability threshold. Each offers different trade-offs between quality and diversity.

Key Parameters

When using top-k sampling for text generation, the following parameters have the greatest impact on output quality:

  • k (top_k): The number of highest-probability tokens to keep. Values of 40-100 work well for most applications. Lower values (10-30) produce more focused, deterministic output suited for factual content. Higher values (100-200) allow more creative diversity but risk occasional incoherent tokens.

  • temperature: Scales the logits before computing probabilities. Values below 1.0 sharpen the distribution, making the top tokens more dominant. Values above 1.0 flatten the distribution, spreading probability more evenly. Common values range from 0.7 to 1.2. Temperature is typically applied before top-k selection.

  • do_sample: Boolean flag in Hugging Face's generate() method. Must be True to enable any sampling strategy including top-k. When False, the model uses greedy decoding regardless of other parameters.

  • max_new_tokens: Limits the number of tokens to generate. Longer generations may accumulate sampling noise, so lower k values can help maintain coherence over extended outputs. A combined usage sketch follows this list.
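For reference, a single generate() call combining these parameters might look like the sketch below, assuming the same GPT-2 model and tokenizer used earlier in the chapter; the specific values are illustrative defaults, not recommendations for every task.

import torch

prompt = "The future of renewable energy"
input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"]

with torch.no_grad():
    output = model.generate(
        input_ids,
        do_sample=True,      # enable sampling; False would mean greedy decoding
        top_k=50,            # keep only the 50 most probable tokens per step
        temperature=0.8,     # sharpen the distribution slightly before truncation
        max_new_tokens=40,   # cap the length of the generated continuation
    )

print(tokenizer.decode(output[0], skip_special_tokens=True))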

Summary

Top-k sampling provides a practical solution to the long-tail problem in language model decoding. By truncating the vocabulary to the k most likely tokens and renormalizing, it eliminates improbable tokens while preserving meaningful diversity among plausible choices.

Key takeaways from this chapter:

  • Truncation principle: Top-k zeros out all tokens outside the k highest-probability candidates, then renormalizes the remaining probabilities. This prevents sampling from the incoherent long tail while allowing variety among reasonable options.

  • Temperature interaction: Temperature scaling reshapes the distribution before truncation. Low temperature concentrates probability in fewer tokens; high temperature spreads it more evenly. Because the scaling is monotonic, it does not change which tokens make the top-k cut, only how likely each kept token is to be sampled.

  • Parameter selection: Common values range from k = 40 to k = 100. Lower k produces more focused text; higher k allows more diversity. The optimal choice depends on task, context confidence, and user preferences.

  • Fixed-k limitation: A constant k applies regardless of context, including too many tokens when the model is confident and potentially too few when it's uncertain. This motivates adaptive approaches like nucleus sampling.

  • Computational efficiency: Top-k selection adds negligible overhead, with $O(n \log k)$ time complexity, where n is the vocabulary size and k is the number of tokens kept. This is fast compared to the model forward pass, making it practical for production use.

Top-k sampling strikes a useful balance between greedy decoding (deterministic but repetitive) and pure sampling (diverse but sometimes incoherent). While nucleus sampling offers more adaptive truncation, top-k remains widely used due to its simplicity and interpretability. The next chapter explores nucleus sampling, which addresses the fixed-k limitation by adapting the threshold to each distribution's shape.

