Learn how repetition penalty, frequency penalty, presence penalty, and n-gram blocking prevent language models from getting stuck in repetitive loops during text generation.

Repetition Penalties
Language models have a peculiar tendency to get stuck in loops. Ask GPT to write a story, and without intervention, you might get output like "The cat sat on the mat. The cat sat on the mat. The cat sat on the mat..." This repetitive behavior emerges from a fundamental property of autoregressive generation: at each step, the model selects tokens that maximize likelihood given the context, and if a pattern worked once, the same pattern often has high probability again.
Repetition is not always undesirable. Certain phrases naturally repeat in language: "again and again", "more and more", or the rhythmic repetition in poetry and song lyrics. But uncontrolled repetition signals a failure mode where the model has collapsed into a degenerate loop, producing text that no human would write. This chapter explores techniques to prevent such loops while preserving natural language patterns.
We'll examine three approaches that modify the probability distribution during generation. Repetition penalty scales down the logits of previously used tokens. Frequency penalty applies increasingly strong penalties based on how many times each token has appeared. Presence penalty applies a flat penalty to any token that has appeared at all. We'll also explore n-gram blocking, a deterministic approach that prevents exact phrase repetition. Each technique offers different trade-offs between preventing loops and preserving natural repetition.
Why Models Repeat Themselves
Before diving into solutions, let's understand why repetition happens. The answer lies in how autoregressive models generate text and how they're trained.
During training, language models learn to predict the next token given all previous tokens. They're optimized to assign high probability to tokens that frequently follow specific contexts in the training data. When a particular phrase or structure appears often in training text, the model learns to give it high probability in similar contexts.
The problem manifests during generation. Suppose the model generates "The results show that" and then produces "the model performs well." Now "The results show that the model performs well" becomes part of the context. If the training data contained similar patterns of presenting multiple results, the phrase "The results show that" might again have high probability, leading to another similar sentence. Each repetition reinforces the pattern in the context, making further repetition even more likely.
Sampling strategies like temperature and nucleus sampling add randomness, but they don't specifically target repetition: a token that appeared recently keeps whatever probability the model assigns it, so the model can easily select it again, and the output can still contain repetitive phrases or sentence structures.
The Feedback Loop
Repetition often starts subtly and escalates. The model might repeat a word, then a phrase, then entire sentences. This acceleration happens because each repetition adds more of the same pattern to the context, which the model uses to predict the next token. The context becomes increasingly dominated by the repeated content, making continuation of that content more and more likely.
Consider a model generating a list. After producing "First, we need to consider..." it might generate "Second, we need to consider..." and then "Third, we need to consider...". So far, this is reasonable parallel structure. But without intervention, the model might continue with "Fourth, we need to consider..." long after the list should have ended, trapped in a pattern that keeps reinforcing itself.
The Repetition Penalty
The most direct approach to preventing repetition modifies the logits for tokens that have already appeared in the generated sequence. The repetition penalty, introduced by Keskar et al. (2019) in the CTRL paper, divides the logits of previously seen tokens by a penalty factor, making them less likely to be selected.
The repetition penalty modifies logits for tokens that appear in the existing context. For each token that has already been generated, its logit is divided by the penalty factor $\theta$ if positive, or multiplied by $\theta$ if negative. This reduces the probability of repeating tokens without eliminating them entirely.
Mathematical Formulation
To understand how repetition penalty works, we need to trace the path from model output to token selection. When a language model predicts the next token, it doesn't output probabilities directly. Instead, it produces logits: raw, unnormalized scores for each token in the vocabulary. These logits then pass through the softmax function to become probabilities, and finally, we sample from that probability distribution.
This pipeline gives us a natural intervention point. If we want to reduce the probability of certain tokens, we can modify their logits before softmax. The question is: how should we modify them?
The Core Insight
Our goal is straightforward: make previously used tokens less likely to appear again. Since softmax converts logits to probabilities, reducing a logit reduces its corresponding probability. But there's a subtlety. Logits can be positive or negative, and the operation that reduces a positive number (division) actually increases a negative number (makes it closer to zero). We need an operation that pushes logits in the "less likely" direction regardless of their sign.
The solution is a piecewise function that applies different operations based on the sign of the logit. Let $z_i$ be the logit for token $i$ and let $G$ be the set of tokens that have already appeared in the generated sequence. The repetition penalty modifies logits as follows:

$$\tilde{z}_i = \begin{cases} z_i / \theta & \text{if } i \in G \text{ and } z_i > 0 \\ z_i \cdot \theta & \text{if } i \in G \text{ and } z_i \le 0 \\ z_i & \text{otherwise} \end{cases}$$

where:
- $z_i$: the original logit for token $i$, the raw score from the model before softmax
- $\tilde{z}_i$: the modified logit after applying the penalty
- $\theta$: the repetition penalty factor, typically in the range $[1.0, 2.0]$
- $G$: the set of token indices that have appeared in the context
- $i \in G$: notation meaning "token $i$ is a member of set $G$" (i.e., token $i$ has appeared before)
The formula reads: for each token, check if it has appeared before. If it has and its logit is positive, divide by the penalty factor. If it has and its logit is zero or negative, multiply by the penalty factor. If it hasn't appeared, leave it unchanged.
Why the Asymmetric Treatment?
This asymmetry might seem arbitrary, but it follows directly from how softmax works. Recall that softmax converts a logit to a probability:

$$P(i) = \frac{e^{z_i}}{\sum_j e^{z_j}}$$

The probability depends on $e^{z_i}$, the exponential of the logit. The exponential function is strictly increasing: larger inputs produce larger outputs. So to reduce $P(i)$, we must reduce $z_i$.
For a positive logit, say $z_i = 2.0$ with $\theta = 2.0$, division gives $2.0 / 2.0 = 1.0$. Since $1.0 < 2.0$, we've reduced the logit and thus reduced the probability. Division works as intended.

For a negative logit like $z_i = -2.0$, the same division gives $-2.0 / 2.0 = -1.0$. But $-1.0 > -2.0$ (it's closer to zero), so we've actually increased the logit and the probability. Division fails here.

The fix is multiplication. Multiplying by $\theta$ gives $-2.0 \times 2.0 = -4.0$. Since $-4.0 < -2.0$, we've pushed the logit further negative, reducing the probability. Both operations, division for positive logits and multiplication for negative ones, consistently reduce probability.
The Neutral Case
When $\theta = 1.0$, the penalty has no effect. Dividing by 1 or multiplying by 1 leaves any number unchanged. This gives us a natural "off switch" for the penalty and a baseline for comparison. As $\theta$ increases above 1.0, the penalty strengthens, pushing repeated tokens further toward improbability.
The key insight is worth restating: for positive logits, division pulls values toward zero (reducing probability), while for negative logits, multiplication pushes values further negative (also reducing probability). Both transformations achieve the same goal through different arithmetic operations.
Implementation
Let's implement the repetition penalty from scratch to see exactly how it works:
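Below is a minimal NumPy sketch of the idea. The function name, toy vocabulary, logits, and penalty value are illustrative choices, not taken from any particular library:

```python
import numpy as np

def apply_repetition_penalty(logits, generated_ids, penalty):
    """CTRL-style repetition penalty: divide positive logits of previously
    seen tokens by the penalty factor, multiply negative ones by it."""
    penalized = logits.astype(float).copy()
    for i in generated_ids:
        if penalized[i] > 0:
            penalized[i] /= penalty   # positive logit: division reduces it
        else:
            penalized[i] *= penalty   # negative logit: multiplication pushes it further down
    return penalized

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

tokens = ["the", "cat", "sat", "on", "mat"]
logits = np.array([2.0, 1.5, 2.5, 0.5, -0.5])
appeared = {1, 2}  # "cat" and "sat" are already in the context
penalized = apply_repetition_penalty(logits, appeared, penalty=1.5)

for i, tok in enumerate(tokens):
    mark = "*" if i in appeared else " "
    print(f"{tok:>4}{mark}  logit {logits[i]:5.2f} -> {penalized[i]:5.2f}   "
          f"prob {softmax(logits)[i]:.3f} -> {softmax(penalized)[i]:.3f}")
```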
The function iterates through each token that has appeared in the context and applies the appropriate modification based on the sign of its logit. This is conceptually simple but reveals an important property: the penalty treats all repeated tokens equally, whether they appeared once or a hundred times.
The table shows how the repetition penalty redistributes probability mass. Tokens that appeared in the context (marked with *) have their probabilities reduced, while tokens that haven't appeared receive proportionally higher probabilities. Notice that "sat" (token 2), which had the highest original logit, is no longer the most likely token after applying the penalty.
Choosing the Penalty Value
The repetition penalty parameter requires careful tuning. Too low, and repetition persists. Too high, and the model avoids necessary repetition, producing awkward text that never uses "the" or "a" more than once.
Common values and their effects:
- $\theta = 1.0$: No penalty applied. Baseline behavior with full repetition potential.
- $\theta = 1.1$ to $1.3$: Light penalty. Gently discourages repetition while allowing natural patterns. Good starting point for most applications.
- $\theta = 1.3$ to $1.5$: Moderate penalty. Noticeably reduces repetition. May affect fluency for texts requiring repeated terms (technical writing, legal documents).
- $\theta > 1.5$: Strong penalty. Significantly suppresses any repeated token. Can produce stilted or unnatural text, especially for longer outputs.
As the penalty increases, you'll notice the text becomes more varied in vocabulary but may also become less coherent. The model is forced to find alternative words even when repetition would be natural.
An important pattern emerges when you map penalty strength against original logits: the probability reduction depends on both the penalty strength and the token's original standing. Tokens with moderate positive logits (around 1-3) experience the largest percentage reductions because they start with meaningful probability that can be substantially reduced. Very high logits still dominate even after the penalty, while very low logits have little probability to lose.
Frequency Penalty
While repetition penalty treats all repeated tokens identically, frequency penalty scales with occurrence count. A token used once receives a small penalty; a token used ten times receives ten times that penalty. This graduated approach allows some repetition while strongly discouraging excessive use of any single token.
Frequency penalty subtracts a value from each token's logit proportional to how many times that token has appeared in the generated text. The formula is $\tilde{z}_i = z_i - \alpha \cdot c_i$, where $\alpha$ is the penalty strength and $c_i$ is the number of times token $i$ has been generated.
Mathematical Formulation
Repetition penalty has a limitation: it treats a token that appeared once the same as one that appeared fifty times. Both receive the same penalty. But intuitively, we might want to allow occasional repetition (the word "the" naturally appears multiple times in most texts) while strongly discouraging tokens that have become overused. This calls for a penalty that accumulates with each occurrence.
From Binary to Graduated Penalties
Frequency penalty introduces a simple but powerful shift: instead of asking "has this token appeared?", we ask "how many times has this token appeared?" The penalty then scales proportionally with the answer.
The mechanism is additive rather than multiplicative. For each occurrence of a token, we subtract a fixed amount from its logit. Let $c_i$ denote the number of times token $i$ has appeared in the generated sequence. The frequency penalty modifies logits as:

$$\tilde{z}_i = z_i - \alpha \cdot c_i$$

where:
- $z_i$: the original logit for token $i$
- $\tilde{z}_i$: the modified logit after applying the penalty
- $\alpha$: the frequency penalty coefficient, typically in the range $[0, 2]$, controlling how strongly each occurrence is penalized
- $c_i$: the count of token $i$ in the generated sequence (0 if the token has not appeared)

The formula subtracts $\alpha$ once for each time the token has appeared. If a token appeared three times and $\alpha = 0.5$, we subtract $3 \times 0.5 = 1.5$ from its logit.
Why Additive Works
Unlike repetition penalty, frequency penalty doesn't need to handle positive and negative logits differently. Subtraction always reduces a number, regardless of its sign: subtracting $\alpha \cdot c_i$ from a positive logit makes it smaller (or even negative), and subtracting it from a negative logit makes it more negative. Both operations reduce the logit and thus reduce the probability after softmax.
This simplicity is a feature. The formula is easy to implement, easy to understand, and produces predictable behavior: each occurrence adds the same "cost" to using that token again.
Linear Scaling in Action
The linearity of frequency penalty creates a distinctive pattern. Early occurrences of a token receive mild penalties, allowing natural repetition of common words. But as a token accumulates, the penalty grows relentlessly. Consider $\alpha = 0.5$:
| Occurrences ($c_i$) | Penalty applied ($\alpha \cdot c_i$) |
|---|---|
| 0 | 0.0 |
| 1 | 0.5 |
| 2 | 1.0 |
| 5 | 2.5 |
| 10 | 5.0 |
By the tenth occurrence, the token's logit has been reduced by 5.0, a substantial penalty that makes it far less competitive against tokens that haven't been overused. This graduated pressure naturally prevents any single token from dominating the output without harsh early constraints.
Implementation
Let's see how frequency penalty differs from repetition penalty:
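A sketch with made-up counts that match the discussion below ("the" seen five times, "data" once) and $\alpha = 0.5$:

```python
from collections import Counter
import numpy as np

def apply_frequency_penalty(logits, counts, alpha):
    """Subtract alpha * count from the logit of each generated token."""
    penalized = logits.astype(float).copy()
    for token_id, count in counts.items():
        penalized[token_id] -= alpha * count
    return penalized

tokens = ["the", "model", "data", "shows", "results"]
logits = np.array([3.0, 2.0, 1.5, 1.0, 0.5])
counts = Counter({0: 5, 2: 1})  # "the" appeared 5 times, "data" once
penalized = apply_frequency_penalty(logits, counts, alpha=0.5)

for i, tok in enumerate(tokens):
    print(f"{tok:>8}  count {counts[i]}  logit {logits[i]:4.1f} -> {penalized[i]:4.1f}")
# "the": 3.0 - 5 * 0.5 = 0.5    "data": 1.5 - 1 * 0.5 = 1.0
```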
The key difference is apparent: tokens with higher counts receive proportionally stronger penalties. "The", which appeared 5 times, has its logit reduced by 2.5 (5 × 0.5), while "data", which appeared once, only loses 0.5 from its logit. This graduated penalty is particularly effective for suppressing overused words while allowing reasonable repetition of less common terms.
Presence Penalty
Presence penalty takes an even simpler approach: penalize any token that has appeared, regardless of how many times. This binary penalty treats "appeared once" and "appeared twenty times" identically.
Presence penalty applies a fixed penalty to the logit of any token that has appeared in the generated text, regardless of frequency. The formula is $\tilde{z}_i = z_i - \beta \cdot \mathbb{1}[i \in G]$, where $\beta$ is the penalty strength and $\mathbb{1}[i \in G]$ is an indicator that equals 1 if token $i$ has appeared and 0 otherwise.
Mathematical Formulation
We've now seen two approaches: repetition penalty treats all repeated tokens equally (ignoring count), while frequency penalty scales with count. Presence penalty combines one aspect of each: like repetition penalty it ignores count, but it uses frequency penalty's simpler additive mechanism.
The Binary Question
Presence penalty asks the simplest possible question about each token: "Have you appeared before, yes or no?" If yes, apply a fixed penalty. If no, leave the logit unchanged. The number of previous occurrences is irrelevant.
To express this mathematically, we use an indicator function, a standard notation for encoding yes/no conditions as 1/0 values:

$$\tilde{z}_i = z_i - \beta \cdot \mathbb{1}[i \in G]$$

where:
- $z_i$: the original logit for token $i$
- $\tilde{z}_i$: the modified logit after applying the penalty
- $\beta$: the presence penalty coefficient, controlling how strongly any prior appearance is penalized
- $\mathbb{1}[i \in G]$: the indicator function, which equals 1 if token $i$ has appeared (is in set $G$), and 0 otherwise
- $G$: the set of tokens that have appeared in the generated sequence
Reading the Indicator Function
The notation $\mathbb{1}[i \in G]$ might look intimidating, but it simply acts like a switch. The condition inside the brackets, $i \in G$ (token $i$ is in set $G$), is evaluated as true or false. The indicator function converts this to a number:
- If the condition is true (token has appeared): $\mathbb{1}[i \in G] = 1$
- If the condition is false (token has not appeared): $\mathbb{1}[i \in G] = 0$

Substituting back into the formula:
- For a token that has appeared: $\tilde{z}_i = z_i - \beta \cdot 1 = z_i - \beta$
- For a token that hasn't appeared: $\tilde{z}_i = z_i - \beta \cdot 0 = z_i$
The indicator function elegantly handles both cases in a single equation.
Presence vs. Frequency: A Comparison
The contrast with frequency penalty is instructive. Frequency penalty applies $\alpha$ for each occurrence, so a token appearing 10 times receives 10 times the penalty of one appearing once. Presence penalty applies $\beta$ exactly once regardless of occurrence count. Whether a token appeared 1 time or 100 times, it receives the same penalty $\beta$.
This makes presence penalty particularly effective for encouraging vocabulary diversity. Once a word has been used, it faces a fixed "tax" on appearing again. The model is pushed to explore alternatives, to find synonyms, to vary its phrasing. It's a blunt instrument compared to frequency penalty's graduated pressure, but for tasks like brainstorming or generating diverse lists, that bluntness can be exactly what's needed.
The contrast is stark. Presence penalty jumps to its full value after the first occurrence and stays flat, while frequency penalty starts at zero and grows steadily. For tokens that appear many times, frequency penalty eventually dominates; for tokens that appear just once or twice, presence penalty applies stronger immediate pressure.
Implementation
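Presence penalty is the simplest of the three to implement. A minimal sketch, with illustrative names and values:

```python
import numpy as np

def apply_presence_penalty(logits, appeared_ids, beta):
    """Subtract a flat beta from any token that has appeared at least once."""
    penalized = logits.astype(float).copy()
    for token_id in appeared_ids:
        penalized[token_id] -= beta  # same penalty whether it appeared once or 100 times
    return penalized

logits = np.array([3.0, 2.0, 1.5, 1.0, 0.5])
appeared = {0, 2}  # both tokens get the same flat reduction
print(apply_presence_penalty(logits, appeared, beta=0.6))
# tokens 0 and 2 each lose exactly 0.6; all others are unchanged
```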
Comparing All Three Penalties
Let's see how the three penalties differ when applied to the same scenario:
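A sketch that applies all three penalties to the same toy logits, with "the" having appeared five times and "data" once (all parameter values are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

tokens = ["the", "model", "data", "shows", "results"]
logits = np.array([3.0, 2.0, 1.5, 1.0, 0.5])
counts = {0: 5, 2: 1}  # "the" x5, "data" x1

rep = logits.copy()  # repetition penalty, theta = 1.5
for i in counts:
    rep[i] = rep[i] / 1.5 if rep[i] > 0 else rep[i] * 1.5

freq = logits.copy()  # frequency penalty, alpha = 0.5
for i, c in counts.items():
    freq[i] -= 0.5 * c

pres = logits.copy()  # presence penalty, beta = 0.5
for i in counts:
    pres[i] -= 0.5

for name, z in [("original", logits), ("repetition", rep),
                ("frequency", freq), ("presence", pres)]:
    probs = ", ".join(f"{t}={p:.2f}" for t, p in zip(tokens, softmax(z)))
    print(f"{name:>10}: {probs}")
```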
The comparison reveals each penalty's character:
- Repetition penalty uniformly reduces all appeared tokens, regardless of count
- Frequency penalty creates graduated reductions: "the" (×5) drops dramatically while "data" (×1) drops minimally
- Presence penalty applies an equal reduction to all appeared tokens, similar to repetition penalty but using additive rather than multiplicative adjustment
N-gram Blocking
The penalties we've explored so far are probabilistic: they reduce the likelihood of repetition without preventing it entirely. N-gram blocking takes a deterministic approach by absolutely forbidding the repetition of specific token sequences.
N-gram blocking prevents the model from generating any n-gram (sequence of n consecutive tokens) that has already appeared in the generated text. When the model would complete a forbidden n-gram, that token's probability is set to zero, forcing selection of a different continuation.
How N-gram Blocking Works
The penalties we've explored, repetition, frequency, and presence, all work by adjusting probabilities. They make repetition less likely but don't prevent it entirely. Sometimes a token's original probability is so high that even after penalization, it remains the most likely choice. For applications where exact repetition would be clearly wrong (legal documents, safety-critical outputs), we need a stronger guarantee.
N-gram blocking provides that guarantee through a fundamentally different mechanism: instead of adjusting probabilities, it eliminates certain tokens from consideration entirely by setting their probability to zero.
What Is an N-gram?
An n-gram is simply a contiguous sequence of n tokens. The terminology comes from computational linguistics:
- A bigram (2-gram) is a sequence of 2 tokens, like ["the", "cat"]
- A trigram (3-gram) is a sequence of 3 tokens, like ["sat", "on", "the"]
- A 4-gram is a sequence of 4 tokens, like ["the", "quick", "brown", "fox"]
N-gram blocking maintains a record of all n-grams that have appeared in the generated text. Before each token is sampled, the algorithm checks: "If I generate this token, will it complete an n-gram that already exists?" If so, that token is forbidden.
The Blocking Mechanism
Consider 3-gram blocking with the sequence "machine learning is powerful. Machine learning is". The current context ends with "Machine learning". If the model generates "is", it would create the trigram ["Machine", "learning", "is"], which already appeared earlier. N-gram blocking detects this and sets the probability of "is" to zero, forcing the model to choose a different continuation.
This is absolute prevention, not probabilistic discouragement. No matter how strongly the model wants to generate "is", it cannot. The token is masked out before sampling occurs.
Let's trace through an example:
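A sketch of the check, treating words as tokens for readability (real systems operate on subword ids); the helper name is illustrative:

```python
def get_banned_tokens(generated, n):
    """Return tokens that would complete an n-gram already present in `generated`."""
    ngrams = {tuple(generated[i:i + n]) for i in range(len(generated) - n + 1)}
    prefix = tuple(generated[-(n - 1):])  # the last n-1 tokens of the context
    return {ng[-1] for ng in ngrams if ng[:-1] == prefix}

generated = ["the", "quick", "brown", "fox", "jumps", "over",
             "the", "quick", "brown"]
trigrams = [tuple(generated[i:i + 3]) for i in range(len(generated) - 2)]
print("trigrams:", trigrams)
print("banned next tokens:", get_banned_tokens(generated, n=3))  # {'fox'}
```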
The output shows all trigrams extracted from the sentence. Since the context ends with "quick brown" and the trigram ["quick", "brown", "fox"] already exists, generating "fox" next would create an exact repetition. N-gram blocking identifies this and adds "fox" to the banned token list, forcing the model to choose a different continuation.
Implementation in Generation
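In Hugging Face Transformers, n-gram blocking is a single argument to `generate()`. A sketch using GPT-2 (the prompt and generation settings are arbitrary):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("The cat sat on the mat.", return_tensors="pt")

for n in (2, 3, 4):
    output = model.generate(
        **inputs,
        max_new_tokens=40,
        no_repeat_ngram_size=n,  # block any repeated n-gram of this size
        do_sample=False,         # greedy decoding to isolate the blocking effect
    )
    print(f"n={n}: {tokenizer.decode(output[0], skip_special_tokens=True)}")
```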
With 2-gram blocking, the text avoids repeating any two consecutive tokens, which can make the output feel choppy since common phrases like "of the" can only appear once. With 3-gram blocking, the constraint relaxes slightly, allowing more natural flow while still preventing obvious repetition. At 4-gram blocking, only longer repeated phrases are blocked, preserving most natural language patterns while catching more egregious loops.
Trade-offs of N-gram Blocking
N-gram blocking guarantees no exact phrase repetition but has notable limitations:
- Rigidity: It cannot distinguish between undesirable repetition and natural language patterns. Phrases like "on the other hand" might be blocked after first use, even when appropriate.
- Local focus: It only prevents exact matches. "The cat sat" and "A cat sat" are different trigrams, so both could appear despite semantic similarity.
- Brittleness: The blocking can force awkward workarounds when the model genuinely needs to repeat a phrase.
Smaller n values (2 or 3) are more restrictive but may hurt fluency. Larger values (4 or 5) allow more natural repetition while still preventing obvious loops.
Combining Penalties with Sampling Strategies
In practice, repetition penalties work alongside temperature, top-k, and nucleus sampling. The typical processing order, sketched in code after this list, is:
- Get raw logits from the model
- Apply repetition/frequency/presence penalties to adjust logit values
- Apply temperature scaling to control distribution sharpness
- Apply top-k or nucleus truncation to limit candidates
- Sample from the resulting distribution
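A minimal end-to-end sketch of this pipeline, combining frequency and presence penalties with temperature and top-k (all parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_next_token(logits, counts, alpha=0.5, beta=0.3, temperature=0.8, top_k=50):
    z = logits.astype(float).copy()
    # Step 2: apply frequency and presence penalties to the raw logits
    for i, c in counts.items():
        z[i] -= alpha * c  # frequency: scales with occurrence count
        z[i] -= beta       # presence: flat penalty for any appearance
    # Step 3: temperature scaling
    z = z / temperature
    # Step 4: top-k truncation, masking everything outside the k highest logits
    if top_k < len(z):
        cutoff = np.sort(z)[-top_k]
        z[z < cutoff] = -np.inf
    # Step 5: softmax and sample
    p = np.exp(z - z.max())
    p /= p.sum()
    return rng.choice(len(z), p=p)

logits = np.array([3.0, 2.0, 1.5, 1.0, 0.5])  # step 1: raw logits from the model
print(sample_next_token(logits, counts={0: 5, 2: 1}, top_k=3))
```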
Comparing the outputs reveals how each penalty type shapes generation. Without penalties, the model may fall into repetitive patterns. Repetition penalty alone provides uniform discouragement of repeated tokens. Frequency penalty creates graduated pressure that builds as tokens accumulate, while presence penalty encourages the model to continuously introduce new vocabulary. The combined approach balances these effects, often producing the most natural-sounding output by gently discouraging repetition without forcing unnatural word choices.
When to Use Each Penalty
The choice between penalty types depends on your generation task:
Repetition Penalty works well as a general-purpose solution. It's the most widely implemented (available in Hugging Face's generate() method) and handles most cases adequately. Use it when you want a simple, effective baseline for reducing repetition.
Frequency Penalty excels when you want to allow some repetition while preventing excessive use. It's particularly useful for:
- Long-form content where occasional word repetition is acceptable but word overuse is not
- Creative writing where you want natural variation without harsh constraints
- Technical writing where certain terms must repeat but shouldn't dominate
Presence Penalty promotes vocabulary diversity and topic breadth. Consider it for:
- Brainstorming and idea generation where you want many distinct concepts
- Summarization where you want to cover multiple points without redundancy
- Conversational agents that should avoid fixating on specific words
N-gram Blocking provides guaranteed protection against exact repetition. Use it when:
- Exact phrase repetition would be clearly wrong (e.g., safety-critical applications)
- You need deterministic behavior rather than probabilistic reduction
- Other penalties aren't sufficiently preventing loops
For many applications, combining a moderate repetition penalty ($\theta$ around 1.1 to 1.3) with nucleus sampling provides a good balance between preventing loops and maintaining fluent output.
Key Parameters
When implementing repetition penalties in generation, these parameters control behavior:
- `repetition_penalty` (float, typically 1.0-2.0): Multiplicative penalty applied to logits of repeated tokens. A value of 1.0 applies no penalty. Values around 1.1-1.3 provide gentle repetition reduction. Values above 1.5 strongly discourage any repetition.
- `frequency_penalty` (float, typically 0-2.0): Additive penalty proportional to token occurrence count. Higher values increasingly penalize frequently used tokens. OpenAI's API uses this parameter with typical values of 0-1.0.
- `presence_penalty` (float, typically 0-2.0): Flat additive penalty for tokens that have appeared at all. Encourages vocabulary diversity. Also used in OpenAI's API with typical values of 0-1.0.
- `no_repeat_ngram_size` (int): Size of n-grams to block from repeating. In Hugging Face's `generate()`, setting this to 2 prevents any bigram repetition, 3 prevents trigram repetition, and so on. Set to 0 to disable.
- `encoder_no_repeat_ngram_size` (int): For encoder-decoder models, prevents generating n-grams that appear in the encoder input. Useful for abstractive summarization to avoid copying source phrases.
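In practice, these parameters are passed directly to the generation call. A sketch with Hugging Face's `generate()` (the prompt and values are arbitrary; `frequency_penalty` and `presence_penalty` are OpenAI API parameters and are not part of this call):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("The results show that", return_tensors="pt")

output = model.generate(
    **inputs,
    max_new_tokens=60,
    repetition_penalty=1.2,   # multiplicative logit penalty on repeated tokens
    no_repeat_ngram_size=3,   # additionally hard-block any repeated trigram
    do_sample=True,           # sample rather than greedy decode
    top_p=0.9,
    temperature=0.8,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```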
Summary
Repetition is a common failure mode in language model generation, emerging from the autoregressive process where patterns reinforce themselves through context accumulation. Several techniques address this problem by modifying token probabilities based on prior usage.
Key takeaways:
- Repetition penalty divides logits of previously generated tokens by a penalty factor, uniformly reducing their probability regardless of occurrence count. Simple and effective for general use.
- Frequency penalty subtracts from logits proportionally to occurrence count, creating graduated penalties that scale with overuse. Allows natural repetition while strongly penalizing excessive use.
- Presence penalty applies a flat reduction to any token that has appeared, promoting vocabulary diversity. Effective for brainstorming and ensuring broad coverage.
- N-gram blocking deterministically prevents exact phrase repetition by masking tokens that would complete a previously seen n-gram. Provides guaranteed protection but can reduce fluency.
- Order matters: Apply penalties before temperature and sampling truncation. The modified logits then flow through the standard sampling pipeline.
- Context window: Consider whether penalties apply to the full context (including prompt) or only generated tokens. Most implementations penalize tokens anywhere in the sequence, which can affect prompt-relevant words.
The optimal configuration depends heavily on your use case. Start with a moderate repetition penalty ($\theta$ around 1.1 to 1.3), observe the output quality, and adjust based on whether you see too much repetition or unnaturally varied vocabulary. For production systems, A/B testing different configurations against human evaluations provides the most reliable guidance.
Reference

Keskar, N. S., McCann, B., Varshney, L. R., Xiong, C., & Socher, R. (2019). CTRL: A conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858.