Beam Search: Finding Optimal Sequences in Neural Text Generation

Michael Brenndoerfer · December 16, 2025 · 54 min read

Master beam search decoding for sequence-to-sequence models. Learn log probability scoring, length normalization, diverse beam search, and when to use sampling.


When a sequence-to-sequence model generates output, it faces a fundamental challenge: at each timestep, it must choose from thousands of possible tokens. The simplest approach, greedy decoding, picks the single most probable token at each step. But this myopic strategy often leads to suboptimal sequences because a locally good choice can paint the model into a corner, forcing poor choices later. Beam search offers a principled middle ground between exhaustive search (computationally impossible) and greedy decoding (often suboptimal), maintaining multiple candidate sequences simultaneously to find better overall translations.

The Problem with Greedy Decoding

Greedy decoding selects the highest-probability token at each timestep, building the output sequence one token at a time. While computationally efficient, this approach has a critical flaw: it cannot recover from early mistakes.

Consider translating "The cat sat on the mat" into French. At the first timestep, suppose the model assigns these probabilities:

  • "Le" (The, masculine): 0.45
  • "La" (The, feminine): 0.35
  • "Un" (A, masculine): 0.15
  • "Une" (A, feminine): 0.05

Greedy decoding picks "Le" with probability 0.45. But what if "La chatte" (the female cat) would lead to a more fluent overall translation? The greedy decoder cannot explore this path once it commits to "Le."

Out[3]:
Visualization
Tree diagram showing greedy path selecting highest probability at each node versus optimal path through lower initial probability.
Greedy decoding follows the locally optimal path (solid blue), selecting the highest-probability token at each step. However, this can miss the globally optimal sequence (dashed green), which might start with a lower-probability token but lead to a better overall translation. The greedy path achieves score 0.45 × 0.40 × 0.45 = 0.081, while the optimal path achieves 0.35 × 0.55 × 0.60 = 0.116.

The greedy path achieves a cumulative probability of $0.45 \times 0.40 \times 0.45 = 0.081$, while the alternative path through "La chatte était" achieves $0.35 \times 0.55 \times 0.60 = 0.116$. Despite starting with a lower-probability token, the second path is 43% more likely overall. Greedy decoding cannot discover this because it never explores paths that start with suboptimal first choices.

This limitation becomes more severe as sequences grow longer. Each greedy decision constrains future options, and small early mistakes compound into significantly worse final translations. The model might generate grammatically correct but awkward or semantically imprecise text simply because it committed too early to a particular phrasing.
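
The arithmetic behind this comparison is easy to verify. The short sketch below recomputes the two path probabilities from the figure; the per-token probabilities are the illustrative values used above, not outputs of a real model.

```python
# Path probabilities from the example above (illustrative values).
greedy_path = [0.45, 0.40, 0.45]       # locally best token at each step
alternative_path = [0.35, 0.55, 0.60]  # starts with a lower-probability token

greedy_prob = 1.0
for p in greedy_path:
    greedy_prob *= p

alternative_prob = 1.0
for p in alternative_path:
    alternative_prob *= p

print(f"Greedy path probability:      {greedy_prob:.3f}")       # 0.081
print(f"Alternative path probability: {alternative_prob:.3f}")  # ~0.116
print(f"Alternative is {alternative_prob / greedy_prob - 1:.0%} more likely")  # ~43%
```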

The Beam Search Algorithm

Having seen how greedy decoding can miss globally optimal sequences, we need an algorithm that explores multiple possibilities without the exponential cost of exhaustive search. Beam search provides exactly this: a principled way to maintain several candidate sequences simultaneously, hedging our bets until we have enough evidence to commit.

The Core Idea: Hedging Your Bets

Imagine you're navigating a maze where each intersection offers several paths forward. Greedy decoding is like always taking the path that looks most promising at each intersection. This is efficient, but you might walk into a dead end. Exhaustive search explores every possible route, guaranteed to find the best path, but impossibly slow for large mazes. Beam search takes a middle approach: at each intersection, you send scouts down the $k$ most promising paths, where $k$ is called the beam width.

Beam Width

The beam width $k$ controls how many candidate sequences beam search maintains at each step. A beam width of 1 reduces to greedy decoding. Typical values range from 4 to 10 for most applications, balancing exploration against computational cost.

Why does this work? Because natural language has structure. If "Le chat" (the male cat) leads to awkward phrasing, we want to discover this before committing. Keeping "La chatte" (the female cat) alive as an alternative lets us recover. The beam width determines how many "insurance policies" we maintain against early mistakes.

The Algorithm: Expand, Score, Prune, Repeat

Beam search follows a simple four-phase cycle at each timestep:

  1. Expand: For each hypothesis currently in the beam, consider all possible next tokens. If we have $k$ hypotheses and a vocabulary of $V$ tokens, this generates $k \times V$ candidate extensions.

  2. Score: Assign each candidate a score reflecting how probable the entire sequence is so far. Higher scores indicate more promising sequences.

  3. Prune: Keep only the top $k$ candidates by score. This is the key step that prevents exponential growth, ruthlessly discarding all but the best options.

  4. Terminate: If a candidate ends with the <END> token, move it to a "completed" list. Continue until all hypotheses are complete or we reach a maximum length.

The algorithm begins with a single hypothesis containing only <START> (score 0 in log space), and ends by returning the highest-scoring completed sequence.

Out[4]:
Visualization
Step-by-step diagram showing beam search expanding two hypotheses, generating candidates, and pruning back to two.
Beam search with width $k=2$ maintains two hypotheses at each step. After expanding all possibilities, it prunes to keep only the two highest-scoring candidates. Scores are log probabilities (more negative = less likely), and beam search keeps the k highest-scoring (least negative) hypotheses.

The Scoring Function: Why Log Probabilities?

To compare hypotheses during beam search, we need a principled way to quantify "how good" a partial sequence is. This section develops the scoring function from first principles, showing why the mathematical choices we make are not arbitrary but rather necessary solutions to fundamental problems.

The Goal: Measuring Sequence Quality

When beam search maintains multiple candidate sequences, it must decide which ones to keep and which to discard. The natural measure of sequence quality is probability: given an input $x$, how likely is the model to generate this particular output sequence?

For a sequence $y_1, y_2, \ldots, y_t$, we want to compute $P(y_1, y_2, \ldots, y_t \,|\, x)$, the probability of generating exactly these tokens in this order, conditioned on the input. But how do we compute this joint probability when our model only predicts one token at a time?

From Single Tokens to Full Sequences: The Chain Rule

A sequence-to-sequence model generates output autoregressively: at each step, it predicts the next token based on the input and all previously generated tokens. This means we can decompose the joint probability using the chain rule:

$$P(y_1, y_2, \ldots, y_t \,|\, x) = P(y_1 \,|\, x) \times P(y_2 \,|\, y_1, x) \times P(y_3 \,|\, y_1, y_2, x) \times \cdots \times P(y_t \,|\, y_{1:t-1}, x)$$

Each factor $P(y_i \,|\, y_{1:i-1}, x)$ is exactly what our model computes at timestep $i$: the probability distribution over the next token, given the input and all tokens generated so far. The sequence probability is simply the product of all these per-token probabilities.

This decomposition is powerful because it lets us compute sequence probabilities incrementally. When we extend a hypothesis by one token, the new sequence probability is just the old probability multiplied by the new token's probability.

The Numerical Underflow Problem

While the chain rule gives us a mathematically correct formula, it creates a serious computational problem. Consider what happens as sequences grow longer.

Suppose each token has an average probability of 0.3 (a reasonable estimate for natural language). For a 20-token sequence:

$$P(\text{sequence}) = 0.3^{20} \approx 3.5 \times 10^{-11}$$

This number is already extraordinarily small, and it keeps shrinking with every additional token. Real sequences are often 50-100 tokens long, with many tokens having probabilities well below 0.3. Multiplying dozens of small numbers together quickly produces values so tiny that standard 32-bit floating-point arithmetic rounds them to exactly zero.

Out[5]:
Visualization
Line plot showing exponential decay of sequence probability as sequence length increases, with a horizontal line marking the float32 minimum value.
Sequence probabilities shrink exponentially with length, quickly reaching values too small for computers to represent. With average token probability 0.3, a 50-token sequence has probability around 10⁻²⁶, and sequences beyond roughly 70 tokens fall below the minimum representable float32 value of approximately 10⁻³⁸. Typical sentences (20-30 tokens) already approach problematic ranges for uncertain predictions.

This isn't a theoretical concern. It's a practical bug that causes beam search to fail silently. When all sequence probabilities underflow to zero, the algorithm cannot distinguish between good and bad hypotheses.

The Solution: Working in Log Space

The solution is elegant: instead of tracking probabilities directly, we track their logarithms. This simple transformation solves the underflow problem while preserving everything we need for comparison.

Recall that for any positive numbers $a$ and $b$:

$$\log(a \times b) = \log(a) + \log(b)$$

This means that multiplying probabilities in regular space is equivalent to adding log probabilities in log space. Our problematic product becomes a well-behaved sum:

$$\log P(y_1, \ldots, y_t \,|\, x) = \log P(y_1 \,|\, x) + \log P(y_2 \,|\, y_1, x) + \cdots + \log P(y_t \,|\, y_{1:t-1}, x)$$

The benefits of this transformation are substantial:

  1. Numerical stability: Log probabilities are moderate negative numbers. For $P = 0.3$, we have $\log(0.3) \approx -1.2$. Adding 20 such numbers gives roughly $-24$, a perfectly manageable value for any floating-point format. (The short demonstration after this list makes the contrast concrete.)

  2. Computational efficiency: Addition is faster than multiplication on most hardware, and the difference compounds over millions of sequence evaluations during decoding.

  3. Preserved ordering: Since $\log$ is a monotonically increasing function, if $P(A) > P(B)$, then $\log P(A) > \log P(B)$. Maximizing log probability is equivalent to maximizing probability, so we lose nothing by working in log space.
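
Here is a minimal sketch of that numerical-stability point, multiplying identical per-token probabilities in float32 versus summing their logs (the token probability of 0.3 is the same illustrative value used above):

```python
import numpy as np

# Multiplying many small per-token probabilities eventually underflows in
# float32, while the equivalent sum of log probabilities stays well-behaved.
token_prob = 0.3

for length in (20, 50, 100):
    product = np.prod(np.full(length, token_prob, dtype=np.float32))
    log_sum = length * np.log(token_prob)  # the same quantity, in log space
    print(f"length={length:3d}  product={product:.3e}  sum of logs={log_sum:.1f}")

# Approximate output:
# length= 20  product=3.487e-11  sum of logs=-24.1
# length= 50  product=7.179e-27  sum of logs=-60.2
# length=100  product=0.000e+00  sum of logs=-120.4   <- the product underflows to zero
```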

The Beam Search Scoring Formula

We can now state the scoring formula precisely. Given a partial output sequence $y_{1:t} = (y_1, y_2, \ldots, y_t)$ generated from input $x$, the score is the cumulative log probability:

$$\text{score}(y_{1:t}) = \sum_{i=1}^{t} \log P(y_i \,|\, y_{1:i-1}, x)$$

where:

  • $y_{1:t}$: the output sequence from position 1 to $t$, representing all tokens generated so far
  • $y_i$: the token at position $i$ in the output sequence
  • $y_{1:i-1}$: the prefix of all previously generated tokens (before position $i$); for $i=1$, this is empty
  • $x$: the input sequence being translated or transformed
  • $P(y_i \,|\, y_{1:i-1}, x)$: the model's predicted probability of token $y_i$, conditioned on the input and all previous outputs
  • $\log$: the natural logarithm, mapping probabilities from $(0, 1]$ to $(-\infty, 0]$

This formula has an elegant incremental property: when we extend a hypothesis by adding token $y_{t+1}$, the new score is simply:

$$\text{score}(y_{1:t+1}) = \text{score}(y_{1:t}) + \log P(y_{t+1} \,|\, y_{1:t}, x)$$

We don't need to recompute the entire sum; we just add one term. This makes beam search efficient even for long sequences.

Interpreting Scores

Understanding how to read scores is essential for debugging and intuition. Since probabilities lie in $(0, 1]$, their logarithms are non-positive:

  • Scores are always ≤ 0: A score of 0 means probability 1 (impossible for real sequences). Typical scores range from $-5$ to $-50$ depending on sequence length.
  • Higher (less negative) scores are better: A score of $-3.5$ beats a score of $-8.2$.
  • Scores are additive costs: Each token "costs" some negative amount. Better token choices cost less (smaller negative numbers).

To convert a score back to probability, exponentiate: $P = e^{\text{score}}$. For example:

Converting log probability scores to probabilities. Exponentiate the score to recover the original probability.
| Score | Probability | Interpretation |
|-------|-------------|----------------|
| $-2.3$ | $e^{-2.3} \approx 0.10$ | 10%, quite likely |
| $-4.6$ | $e^{-4.6} \approx 0.01$ | 1%, plausible |
| $-9.2$ | $e^{-9.2} \approx 0.0001$ | 0.01%, unlikely |

The ratio of probabilities corresponds to the difference in scores. If hypothesis A has score $-3.5$ and hypothesis B has score $-8.2$, then A is $e^{(-3.5) - (-8.2)} = e^{4.7} \approx 110$ times more likely than B.
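
These conversions are a one-liner in code. A quick sketch, using the example scores above:

```python
import math

score_a = -3.5
score_b = -8.2

# Exponentiate a score to recover the sequence probability.
print(f"P(A) = {math.exp(score_a):.4f}")   # ~0.0302
print(f"P(B) = {math.exp(score_b):.5f}")   # ~0.00027

# A score difference corresponds to a probability ratio.
print(f"A is ~{math.exp(score_a - score_b):.0f}x more likely than B")  # ~110x
```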

Implementing Beam Search

With the scoring function established, we can now implement beam search. Building the algorithm from scratch will solidify your understanding of how the expand-score-prune cycle operates and how log probabilities flow through the computation.

Step 1: Representing Hypotheses

A hypothesis is a candidate sequence under consideration. During beam search, we maintain multiple hypotheses simultaneously, each representing a different path through the token space. Every hypothesis must track two pieces of information:

  1. The tokens generated so far: The actual sequence of token IDs
  2. The cumulative score: The sum of log probabilities accumulated along this path

We represent this as a simple dataclass:

In[6]:
Code
from dataclasses import dataclass
from typing import List


@dataclass
class Hypothesis:
    """A single beam search hypothesis."""

    tokens: List[int]  # Sequence of token IDs
    score: float  # Cumulative log probability

    def extend(self, token: int, log_prob: float) -> "Hypothesis":
        """Create a new hypothesis by extending this one."""
        return Hypothesis(
            tokens=self.tokens + [token], score=self.score + log_prob
        )

The extend method implements the incremental scoring property we derived earlier. When we add a new token with log probability log_prob, we create a new hypothesis with:

  • The original tokens plus the new token appended
  • The original score plus the new log probability (implementing $\text{score}(y_{1:t+1}) = \text{score}(y_{1:t}) + \log P(y_{t+1} \,|\, \ldots)$)

This immutable design, creating new hypotheses rather than modifying existing ones, is crucial. During expansion, we extend each hypothesis with every vocabulary token, generating many new candidates from a single parent. If we modified hypotheses in place, these extensions would interfere with each other.
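
A tiny usage sketch shows the incremental scoring in action; the token IDs and probabilities below are made up purely for illustration:

```python
import math

# Start from a hypothesis containing only a <START> token (ID 0 here);
# score 0.0 corresponds to log(1).
start = Hypothesis(tokens=[0], score=0.0)

# Extending returns new objects; the parent hypothesis is left untouched.
h1 = start.extend(token=2, log_prob=math.log(0.85))  # e.g. "the" with p = 0.85
h2 = h1.extend(token=3, log_prob=math.log(0.50))     # e.g. "cat" with p = 0.50

print(start.tokens, start.score)      # [0] 0.0   (unchanged)
print(h2.tokens, round(h2.score, 3))  # [0, 2, 3] -0.856
```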

Step 2: The Main Algorithm

The beam search algorithm orchestrates the expand-score-prune cycle. It takes a language model (any function that returns token probabilities) and searches for high-scoring sequences:

In[7]:
Code
import numpy as np


def beam_search(
    get_next_probs,  # Function: tokens -> probability distribution
    start_token: int,
    end_token: int,
    vocab_size: int,
    beam_width: int = 4,
    max_length: int = 50,
) -> List[Hypothesis]:
    """
    Perform beam search decoding.

    Args:
        get_next_probs: Function that takes current tokens and returns
                       probability distribution over next token
        start_token: ID of the start token
        end_token: ID of the end token
        vocab_size: Size of the vocabulary
        beam_width: Number of hypotheses to maintain
        max_length: Maximum sequence length

    Returns:
        List of completed hypotheses, sorted by score (highest first)
    """
    # Initialize with start token
    initial = Hypothesis(tokens=[start_token], score=0.0)
    beam = [initial]
    completed = []

    for step in range(max_length):
        if not beam:
            break

        # Collect all possible extensions
        all_candidates = []

        for hyp in beam:
            # Get probability distribution for next token
            probs = get_next_probs(hyp.tokens)
            log_probs = np.log(probs + 1e-10)  # Add small epsilon for stability

            # Generate all possible extensions
            for token in range(vocab_size):
                new_hyp = hyp.extend(token, log_probs[token])
                all_candidates.append(new_hyp)

        # Sort by score (highest first) and keep top k
        all_candidates.sort(key=lambda h: h.score, reverse=True)

        # Separate completed and continuing hypotheses
        beam = []
        for hyp in all_candidates:
            if hyp.tokens[-1] == end_token:
                completed.append(hyp)
            elif len(beam) < beam_width:
                beam.append(hyp)

            # Stop if we have enough hypotheses
            if len(beam) >= beam_width and len(completed) >= beam_width:
                break

    # Add any remaining hypotheses that didn't complete
    completed.extend(beam)

    # Sort by score and return
    completed.sort(key=lambda h: h.score, reverse=True)
    return completed

Let's trace through each phase of this implementation to understand how the algorithm operates:

Initialization. We begin with a single hypothesis containing only the <START> token. Its score is 0.0, which corresponds to $\log(1) = 0$, a probability of 1. This makes sense: we haven't made any probabilistic choices yet, so there's no "cost" accumulated.

Expansion phase. For each hypothesis currently in the beam, we query the language model for the probability distribution over the next token. We then create $V$ new hypotheses (one per vocabulary token), each extending the original by one token. If we have $k$ hypotheses and vocabulary size $V$, this generates $k \times V$ candidates.

Notice the + 1e-10 when computing log probabilities. This small epsilon prevents numerical errors when the model assigns exactly zero probability to a token (which would make $\log(0) = -\infty$).

Scoring phase. The scoring happens automatically during extension. The extend method adds the new token's log probability to the cumulative score. This is the incremental property in action: we don't recompute the entire sum, just add one term.

Pruning phase. We sort all $k \times V$ candidates by score (highest first) and iterate through them. Hypotheses ending with <END> are complete and go to the completed list. Others fill the beam until we have $k$ active hypotheses. The break condition implements early stopping when we have enough of both types.

Termination. The loop continues until either the beam is empty (all hypotheses completed) or we reach max_length. Hypotheses that didn't complete naturally are added to the results, allowing the caller to see partial sequences if needed.

Step 3: Testing with a Worked Example

To see beam search in action, we need a language model. Rather than loading a full neural network, we'll create a mock model that captures the essential behavior: given the tokens generated so far, it returns a probability distribution over what comes next.

Our mock model is designed to prefer the sentence "the cat sat on the mat" while assigning non-zero probability to alternatives like "the dog ran on the floor." This lets us observe how beam search explores multiple paths.

In[8]:
Code
# Define a simple vocabulary
vocab = [
    "<START>",
    "<END>",
    "the",
    "cat",
    "dog",
    "sat",
    "ran",
    "on",
    "mat",
    "floor",
]
token_to_id = {t: i for i, t in enumerate(vocab)}
id_to_token = {i: t for t, i in token_to_id.items()}


def mock_language_model(tokens: List[int]) -> np.ndarray:
    """
    A mock language model that returns probability distributions.
    Designed to prefer "the cat sat on the mat" but allow alternatives.
    """
    probs = np.ones(len(vocab)) * 0.01  # Small base probability

    last_token = id_to_token[tokens[-1]]

    if last_token == "<START>":
        probs[token_to_id["the"]] = 0.85
        probs[token_to_id["cat"]] = 0.05
        probs[token_to_id["dog"]] = 0.05
    elif last_token == "the":
        probs[token_to_id["cat"]] = 0.50
        probs[token_to_id["dog"]] = 0.35
        probs[token_to_id["mat"]] = 0.05
        probs[token_to_id["floor"]] = 0.05
    elif last_token in ["cat", "dog"]:
        probs[token_to_id["sat"]] = 0.55
        probs[token_to_id["ran"]] = 0.35
    elif last_token in ["sat", "ran"]:
        probs[token_to_id["on"]] = 0.80
        probs[token_to_id["<END>"]] = 0.10
    elif last_token == "on":
        probs[token_to_id["the"]] = 0.85
        probs[token_to_id["<END>"]] = 0.05
    elif last_token in ["mat", "floor"]:
        probs[token_to_id["<END>"]] = 0.90

    # Normalize
    probs = probs / probs.sum()
    return probs


# Run beam search
results = beam_search(
    get_next_probs=mock_language_model,
    start_token=token_to_id["<START>"],
    end_token=token_to_id["<END>"],
    vocab_size=len(vocab),
    beam_width=4,
    max_length=10,
)
Out[9]:
Console
Top 4 hypotheses from beam search:
------------------------------------------------------------
1. Score: -4.137 (prob: 0.0160)
   Sequence: <START> the cat sat on the cat sat on the cat

2. Score: -4.494 (prob: 0.0112)
   Sequence: <START> the cat sat on the dog sat on the cat

3. Score: -4.494 (prob: 0.0112)
   Sequence: <START> the cat sat on the cat sat on the dog

4. Score: -4.494 (prob: 0.0112)
   Sequence: <START> the dog sat on the cat sat on the cat

The output reveals several important properties of beam search:

Scores reflect cumulative log probability. Each score is the sum of log probabilities along the path. The top hypothesis has the highest (least negative) score because each transition ("<START>" → "the" → "cat" → "sat" → "on") follows the model's preferred path, accumulating smaller penalties at each step.

Multiple alternatives survive. Unlike greedy decoding, beam search maintains hypotheses with "dog" instead of "cat" and "ran" instead of "sat." These alternatives have lower scores but remain in the beam because they're among the top kk candidates at each step.

Probability interpretation. The scores can be converted back to probabilities by exponentiating. A score around $-4$ corresponds to a probability around $e^{-4} \approx 0.02$, or about 2%. These probabilities are small in absolute terms but represent the most likely sequences among an exponentially large space of possibilities.

The key insight from this example: beam search doesn't just find one good sequence. It finds several good sequences ranked by probability. This is valuable when we want alternatives (for user selection) or need to rerank candidates using additional criteria (like length normalization, which we'll explore next).

Visualizing Score Evolution

To build intuition for how beam search explores the hypothesis space, let's trace the scores of all candidates across decoding steps. This visualization reveals the pruning dynamics: at each step, many candidates are generated, but only the top $k$ survive.

Out[10]:
Visualization
Scatter plot showing hypothesis scores at each decoding step, with kept hypotheses highlighted and pruned hypotheses faded.
Score evolution during beam search decoding. Each dot represents a candidate hypothesis, with kept hypotheses shown as filled circles and pruned ones as faded X marks. The beam maintains the top k=4 candidates at each step, with scores becoming more negative as sequences grow longer.

The visualization shows several important dynamics:

  • Score decrease: As sequences grow, scores become more negative because each token adds another (negative) log probability term.
  • Candidate explosion: At each step, the number of candidates is $k \times V$ (beam width × vocabulary size), but only $k$ survive.
  • Score clustering: Kept hypotheses tend to cluster near the top of the score range, while pruned candidates spread across lower scores.
  • Convergence: As decoding progresses, the score gap between kept and pruned candidates often widens as the beam "commits" to promising paths.

Beam Width Selection

The beam width $k$ is the primary hyperparameter in beam search. It controls the trade-off between exploration and computational cost.

Out[11]:
Visualization
Line plot showing BLEU score increasing then plateauing with beam width.
Translation quality (BLEU score) improves rapidly with beam width up to k=4-6, then plateaus. The sweet spot for most applications is k=4-10, balancing quality gains against computational cost.
Line plot showing computation time increasing linearly with beam width.
Computational cost grows roughly linearly with beam width. Each additional beam hypothesis requires proportionally more memory and compute time.

The key observations about beam width are:

  • $k = 1$: Equivalent to greedy decoding. Fast but often produces suboptimal sequences.
  • $k = 4$ to $10$: The sweet spot for most applications. Provides significant quality improvements over greedy decoding with manageable computational overhead.
  • $k > 10$: Diminishing returns. Quality improvements become marginal while computation continues to grow linearly.

The optimal beam width depends on the task and model. Machine translation typically uses $k = 4$ to $6$. Summarization and dialogue systems might use smaller beams ($k = 2$ to $4$) since diversity matters more than finding the single best sequence. Speech recognition often uses larger beams ($k = 10$ to $20$) because the search space is more constrained by acoustic evidence.

In[12]:
Code
# Compare different beam widths
def compare_beam_widths(widths: List[int]):
    """Compare beam search results across different beam widths."""
    results = {}
    for k in widths:
        hypotheses = beam_search(
            get_next_probs=mock_language_model,
            start_token=token_to_id["<START>"],
            end_token=token_to_id["<END>"],
            vocab_size=len(vocab),
            beam_width=k,
            max_length=10,
        )
        if hypotheses:
            best = hypotheses[0]
            results[k] = {
                "sequence": " ".join(id_to_token[t] for t in best.tokens),
                "score": best.score,
                "num_hypotheses": len(hypotheses),
            }
    return results


beam_comparison = compare_beam_widths([1, 2, 4, 8])
Out[13]:
Console
Effect of Beam Width on Search Results:
----------------------------------------------------------------------
Beam Width   Best Score   # Hypotheses    Best Sequence
----------------------------------------------------------------------
1            -4.137       2               <START> the cat sat on the cat sat on the cat
2            -4.137       4               <START> the cat sat on the cat sat on the cat
4            -4.137       8               <START> the cat sat on the cat sat on the cat
8            -3.283       22              <START> the mat <END>

With beam width 1 (greedy decoding), the search follows a single path and returns the greedy sequence. As beam width increases, we explore more alternatives and may find better-scoring sequences: at width 8, the search surfaces a shorter completion with a higher score that narrower beams missed. In practice, the gains flatten out at moderate beam widths, which is why very large beams provide diminishing returns.

Length Normalization

Our basic beam search implementation has a subtle but significant flaw: it systematically prefers shorter sequences over longer ones, even when the longer sequence is a better output. Understanding why this happens and how to fix it reveals important insights about working with log probabilities.

The Length Bias Problem

Recall that each token contributes a negative log probability to the cumulative score (since $\log p < 0$ for any $p < 1$). This creates an arithmetic bias: longer sequences accumulate more negative terms, driving their scores lower regardless of quality.

Consider a concrete example:

  • Hypothesis A: "The cat sat" with score $-2.5$ (3 content tokens)
  • Hypothesis B: "The cat sat on the mat" with score $-4.2$ (6 content tokens)

Raw beam search prefers Hypothesis A because $-2.5 > -4.2$. But Hypothesis B is clearly a more complete sentence! The problem is that we're comparing apples to oranges: Hypothesis B has made more probabilistic choices, so naturally it has accumulated more "cost."

This isn't a flaw in the math. It's a flaw in what we're optimizing. Raw log probability asks "what's the probability of this exact sequence?" But what we really want is "how good is each token choice, on average?"

The Solution: Normalize by Length

To fairly compare sequences of different lengths, we divide the cumulative score by a function of the sequence length. The normalized score effectively measures average log probability per token rather than total log probability.

Length Normalization Formula

The length-normalized score divides the cumulative log probability by $t^\alpha$, where $t$ is the sequence length:

$$\text{score}_{\text{norm}}(y_{1:t}) = \frac{1}{t^\alpha} \sum_{i=1}^{t} \log P(y_i \,|\, y_{1:i-1}, x)$$

where:

  • $\text{score}_{\text{norm}}(y_{1:t})$: the length-normalized score for sequence $y_{1:t}$
  • $t$: the length of the generated sequence (number of tokens)
  • $\alpha$: the length penalty exponent, typically in the range $[0, 1]$
  • $t^\alpha$: the normalization factor, which grows with sequence length
  • $\sum_{i=1}^{t} \log P(y_i \,|\, y_{1:i-1}, x)$: the raw cumulative log probability (same as the unnormalized score)

Why the exponent $\alpha$? The parameter $\alpha$ controls how much we compensate for length:

  • $\alpha = 0$: No normalization at all. $t^0 = 1$, so we use raw scores. Short sequences are strongly favored.
  • $\alpha = 0.5$: Moderate normalization. We divide by $\sqrt{t}$, partially compensating for length.
  • $\alpha = 1.0$: Full normalization. We divide by $t$, computing the average log probability per token. This can over-compensate, sometimes favoring unnecessarily long sequences.

In practice, $\alpha = 0.6$ to $0.7$ works well for most tasks. This provides enough compensation to prevent truncated outputs without encouraging verbose padding.
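
A quick sketch makes the trade-off concrete, applying the formula to the two example hypotheses above at a few values of $\alpha$ (the scores and lengths are the illustrative numbers from the example, and exactly where the ranking flips depends on both $\alpha$ and the scores):

```python
# Length-normalized score: raw_score / t**alpha
hypotheses = {
    "The cat sat": (-2.5, 3),              # (raw score, length in tokens)
    "The cat sat on the mat": (-4.2, 6),
}

for alpha in (0.0, 0.7, 1.0):
    print(f"alpha = {alpha}")
    for text, (raw_score, length) in hypotheses.items():
        normalized = raw_score / (length ** alpha)
        print(f"  {text!r}: normalized score = {normalized:.2f}")

# With alpha = 0.0 the short hypothesis wins outright; with alpha = 1.0 the
# longer, more complete hypothesis wins (-0.70 vs -0.83 per-token average).
```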

Out[14]:
Visualization
Line plot showing length penalty curves for different alpha values from 0 to 1, demonstrating how penalty grows with sequence length.
The length penalty factor $t^\alpha$ grows with sequence length, with the rate controlled by $\alpha$. Higher $\alpha$ values apply stronger penalties to longer sequences. The commonly used range of 0.6-0.7 provides moderate compensation without over-penalizing long outputs.

Visualizing the Effect

The following table demonstrates how length normalization changes which hypothesis "wins." Without normalization, the shortest sequence has the highest (least negative) score. With normalization ($\alpha = 0.7$), sequences are compared fairly based on their average token quality.

Effect of length normalization on hypothesis ranking. Without normalization, the "Short" sequence wins with the highest (least negative) raw score of −1.5. With normalization using $\alpha = 0.7$, the "Medium" sequence achieves the best normalized score of −0.63, demonstrating how length normalization can favor appropriately-lengthed sequences over truncated ones. The normalized score is computed as $\text{score} / t^{0.7}$.
| Sequence | Length | Raw Score | Normalized Score | Winner? |
|----------|--------|-----------|------------------|---------|
| Short | 3 | −1.5 | −0.69 | Raw ✓ |
| Medium | 5 | −2.0 | −0.63 | Normalized ✓ |
| Long | 8 | −3.5 | −0.82 | |
| Very Long | 12 | −6.0 | −1.04 | |

Task-Specific Tuning

The optimal $\alpha$ depends on your task's desired output length:

Recommended length penalty values by task. Lower values favor shorter outputs, while higher values allow longer sequences.
| Task | Recommended $\alpha$ | Reasoning |
|------|----------------------|-----------|
| Summarization | 0.5 - 0.6 | Shorter outputs are often appropriate |
| Translation | 0.6 - 0.7 | Balance between brevity and completeness |
| Dialogue | 0.7 - 0.8 | Responses should be appropriately detailed |
| Story generation | 0.8 - 1.0 | Longer outputs may be desirable |

Implementation

Adding length normalization to beam search requires only a small modification: instead of sorting candidates by raw score, we sort by normalized score. The key insight is that normalization affects comparison but we still track the raw score for final reporting.

In[15]:
Code
def beam_search_with_length_norm(
    get_next_probs,
    start_token: int,
    end_token: int,
    vocab_size: int,
    beam_width: int = 4,
    max_length: int = 50,
    length_penalty_alpha: float = 0.7,
) -> List[Hypothesis]:
    """
    Beam search with length normalization.

    The length penalty is applied when comparing hypotheses:
    normalized_score = score / (length ^ alpha)
    """
    initial = Hypothesis(tokens=[start_token], score=0.0)
    beam = [initial]
    completed = []

    def get_normalized_score(hyp: Hypothesis) -> float:
        """Compute length-normalized score."""
        length = len(hyp.tokens)
        return hyp.score / (length**length_penalty_alpha)

    for step in range(max_length):
        if not beam:
            break

        all_candidates = []

        for hyp in beam:
            probs = get_next_probs(hyp.tokens)
            log_probs = np.log(probs + 1e-10)

            for token in range(vocab_size):
                new_hyp = hyp.extend(token, log_probs[token])
                all_candidates.append(new_hyp)

        # Sort by NORMALIZED score
        all_candidates.sort(key=get_normalized_score, reverse=True)

        beam = []
        for hyp in all_candidates:
            if hyp.tokens[-1] == end_token:
                completed.append(hyp)
            elif len(beam) < beam_width:
                beam.append(hyp)

            if len(beam) >= beam_width and len(completed) >= beam_width:
                break

    completed.extend(beam)
    completed.sort(key=get_normalized_score, reverse=True)
    return completed

The only change from our basic implementation is the get_normalized_score function and using it as the sort key. This small modification can dramatically improve output quality for tasks where appropriate length matters.

Diverse Beam Search

Standard beam search has an unexpected limitation: the hypotheses it returns are often nearly identical. If the top candidate is "The cat sat on the mat," the second and third candidates might be "The cat sat on the floor" and "The cat sat on the rug," differing only in the final word. This happens because beam search greedily pursues high-probability paths, and in natural language, similar beginnings tend to have similar continuations.

For many applications, this lack of diversity is a serious problem:

  • Dialogue systems need varied responses to avoid sounding robotic
  • Creative writing assistants should offer genuinely different alternatives
  • Reranking pipelines benefit from diverse candidates to choose from

Diverse beam search addresses this by explicitly encouraging hypotheses to differ from each other.

The Group-Based Approach

The key idea is to partition the beam into $G$ groups, where each group is encouraged to explore different regions of the output space. Group 1 runs standard beam search. Group 2 penalizes hypotheses that are similar to Group 1's selections, pushing it toward different word choices. Group 3 penalizes similarity to both Groups 1 and 2, and so on.

Out[16]:
Visualization
Diagram showing two beam groups with diversity penalty arrows between them encouraging different word choices.
Diverse beam search partitions the beam into groups and applies a diversity penalty to encourage different hypotheses. Group 1 contains the top hypotheses from standard beam search. Group 2 penalizes similarity to Group 1, encouraging exploration of different sentence structures, word choices, and semantic content.

The Algorithm

Diverse beam search modifies the standard algorithm by processing groups sequentially and penalizing similarity:

  1. Partition the beam into $G$ groups, each maintaining $k/G$ hypotheses
  2. Process Group 1 using standard beam search (no penalty)
  3. For each subsequent group $g$, when scoring candidates:
    • Compute the base score (log probability) as usual
    • Subtract a penalty for similarity to hypotheses in groups $1, 2, \ldots, g-1$
    • Select the top $k/G$ candidates by this penalized score
  4. Combine all groups' hypotheses for the final output

The Diversity Penalty

The penalty for hypothesis $y$ in group $g$ accumulates similarity to all hypotheses selected by earlier groups. The penalized score for a candidate hypothesis is computed by subtracting a diversity term from the base log probability score:

$$\text{penalized\_score}(y, g) = \text{score}(y) - \lambda \sum_{g' < g} \sum_{y' \in G_{g'}} \text{sim}(y, y')$$

where:

  • $\text{penalized\_score}(y, g)$: the adjusted score for hypothesis $y$ when being considered for group $g$
  • $\text{score}(y)$: the base log probability score of hypothesis $y$ (computed as usual)
  • $\lambda$: a hyperparameter controlling the strength of diversification (typical values: 0.5 to 1.0; higher values enforce more diversity)
  • $g' < g$: iteration over all groups with indices less than $g$ (i.e., groups processed earlier)
  • $G_{g'}$: the set of hypotheses selected for group $g'$
  • $y' \in G_{g'}$: iteration over all hypotheses in group $g'$
  • $\text{sim}(y, y')$: a function measuring overlap between hypotheses $y$ and $y'$ (returns higher values for more similar sequences)

Measuring similarity. Common choices for the similarity function include:

  • Hamming-based overlap: Count the number of positions where the two sequences share the same token (the complement of the Hamming distance). Simple and fast.
  • N-gram overlap: Count shared bigrams or trigrams. Captures phrase-level similarity.
  • Token-level penalty: Penalize selecting the same token at the same position as any previous group's hypothesis.

The effect. By subtracting this penalty from the score, hypotheses that share many tokens with earlier groups become less attractive. Group 2 is pushed away from Group 1's choices; Group 3 is pushed away from both Groups 1 and 2. The result is a diverse set of outputs that collectively cover more of the plausible output space.
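
As a minimal sketch of how the penalty could be computed, here is an illustrative implementation using position-wise token overlap as the similarity function. The function and variable names are placeholders, not from any particular library.

```python
from typing import List

def position_overlap(a: List[int], b: List[int]) -> int:
    """Count positions where two token sequences have the same token."""
    return sum(1 for x, y in zip(a, b) if x == y)

def penalized_score(
    base_score: float,                      # log-probability score of the candidate
    candidate: List[int],                   # candidate token sequence
    earlier_groups: List[List[List[int]]],  # hypotheses chosen by earlier groups
    diversity_strength: float = 0.5,        # lambda in the formula above
) -> float:
    """Subtract a diversity penalty for overlap with earlier groups' hypotheses."""
    penalty = sum(
        position_overlap(candidate, other)
        for group in earlier_groups
        for other in group
    )
    return base_score - diversity_strength * penalty

# A candidate that copies an earlier group's tokens is penalized more heavily
# than one that diverges, even if its base score is slightly higher.
group1 = [[2, 3, 5, 7]]                               # e.g. "the cat sat on"
print(penalized_score(-3.0, [2, 3, 5, 7], [group1]))  # -3.0 - 0.5*4 = -5.0
print(penalized_score(-3.2, [2, 4, 6, 7], [group1]))  # -3.2 - 0.5*2 = -4.2
```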

Beam Search vs Sampling

Beam search finds high-probability sequences, but high probability doesn't always mean high quality. For creative tasks like story generation or dialogue, the most probable response is often generic and boring. Sampling-based methods introduce randomness to produce more diverse and interesting outputs.

Out[17]:
Visualization
Side-by-side comparison showing beam search producing consistent outputs versus sampling producing varied outputs.
Comparison of beam search and sampling approaches. Beam search (left) deterministically finds high-probability sequences, producing consistent but potentially generic outputs with high probability and consistent quality but low diversity. Sampling (right) introduces randomness, enabling diverse and creative outputs with high diversity but variable quality.

The key sampling methods are:

Temperature sampling scales the logits before applying softmax, controlling how "peaked" or "flat" the resulting probability distribution is. The temperature-scaled probability for token $y_i$ is:

$$P(y_i) = \frac{\exp(z_i / T)}{\sum_{j=1}^{V} \exp(z_j / T)}$$

where:

  • $P(y_i)$: the probability of selecting token $y_i$ after temperature scaling
  • $z_i$: the logit (unnormalized score) for token $y_i$ from the model's output layer
  • $T$: the temperature parameter, a positive real number that controls randomness
  • $V$: the vocabulary size (total number of possible tokens)
  • $\exp(z_i / T)$: the exponentiated scaled logit, which determines the token's relative weight
  • $\sum_{j=1}^{V} \exp(z_j / T)$: the normalization constant summing over all vocabulary tokens

Why does temperature work? Dividing by $T$ before exponentiation changes the relative differences between logits:

  • $T = 1$: Standard softmax. The distribution reflects the model's learned preferences.
  • $T < 1$ (e.g., 0.5): Dividing by a small number amplifies differences between logits. If $z_1 = 2$ and $z_2 = 1$, then $z_1/0.5 = 4$ and $z_2/0.5 = 2$, so the gap widens from 1 to 2. This makes the distribution more peaked, favoring high-probability tokens.
  • $T > 1$ (e.g., 2.0): Dividing by a large number compresses differences. The same logits become $z_1/2 = 1$ and $z_2/2 = 0.5$, so the gap shrinks from 1 to 0.5. This flattens the distribution, giving lower-probability tokens more chance.

In the limit: as $T \to 0$, the distribution approaches a one-hot vector (greedy selection of the maximum logit). As $T \to \infty$, the distribution approaches uniform (random selection).
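
A small sketch of temperature scaling, using made-up logits for a four-token vocabulary:

```python
import numpy as np

def temperature_softmax(logits: np.ndarray, temperature: float) -> np.ndarray:
    """Softmax over logits scaled by 1/temperature, with max-subtraction for stability."""
    scaled = logits / temperature
    scaled = scaled - scaled.max()
    weights = np.exp(scaled)
    return weights / weights.sum()

logits = np.array([2.0, 1.0, 0.5, -1.0])  # illustrative logits, not real model outputs

for T in (0.3, 1.0, 2.0):
    print(f"T = {T}: {np.round(temperature_softmax(logits, T), 3)}")

# Low T concentrates almost all probability on the top token;
# high T flattens the distribution toward uniform.
```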

Out[18]:
Visualization
Bar chart showing sharply peaked probability distribution at low temperature.
T = 0.3 (Sharp): Low temperature amplifies differences between logits, concentrating probability on the most likely token ("the" has 97% probability).
Bar chart showing moderate probability distribution at standard temperature.
T = 1.0 (Standard): Standard softmax preserves the model's learned distribution with moderate spread across tokens.
Bar chart showing flattened probability distribution at high temperature.
T = 2.0 (Flat): High temperature compresses differences, flattening the distribution and giving unlikely tokens more chance.

Top-k sampling restricts sampling to the $k$ most probable tokens, setting all other probabilities to zero. This prevents sampling very unlikely tokens while maintaining diversity among likely options.

Top-p (nucleus) sampling dynamically selects the smallest set of tokens whose cumulative probability exceeds $p$. This adapts to the distribution's shape: when the model is confident, fewer tokens are considered; when uncertain, more options remain.
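
The two truncation rules are easy to sketch in a few lines. The following illustrative helpers filter a probability distribution and renormalize it; the example distribution is made up.

```python
import numpy as np

def top_k_filter(probs: np.ndarray, k: int) -> np.ndarray:
    """Zero out everything except the k most probable tokens, then renormalize."""
    filtered = np.zeros_like(probs)
    keep = np.argsort(probs)[-k:]            # indices of the k largest probabilities
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

def top_p_filter(probs: np.ndarray, p: float) -> np.ndarray:
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    order = np.argsort(probs)[::-1]          # tokens from most to least probable
    cumulative = np.cumsum(probs[order])
    nucleus_size = int(np.searchsorted(cumulative, p)) + 1
    filtered = np.zeros_like(probs)
    keep = order[:nucleus_size]
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

probs = np.array([0.35, 0.25, 0.20, 0.12, 0.05, 0.03])
print(np.round(top_k_filter(probs, k=3), 3))    # fixed cutoff: only 3 tokens survive
print(np.round(top_p_filter(probs, p=0.9), 3))  # nucleus: 4 tokens cover 0.92 >= 0.9
```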

Out[19]:
Visualization
Bar chart showing top-k sampling keeping fixed 5 tokens.
Top-k sampling (k=5) keeps a fixed number of tokens regardless of their probability mass. Here, the top 5 tokens capture 91% of the probability mass. The cutoff is fixed regardless of how confident the model is.
Bar chart showing top-p sampling keeping tokens until 90% cumulative probability.
Top-p (nucleus) sampling (p=0.9) adapts to the distribution shape, keeping tokens until 90% cumulative probability is reached. This requires 4 tokens here, but would use more tokens if the distribution were flatter.

The choice between beam search and sampling depends on the application:

Recommended decoding strategies for different NLP applications. The choice depends on whether accuracy or diversity is more important for the task.
| Application | Recommended Method | Reasoning |
|-------------|--------------------|-----------|
| Machine Translation | Beam search (k=4-6) | Accuracy matters most; one correct translation needed |
| Summarization | Beam search (k=4) | Faithfulness to source is critical |
| Dialogue Systems | Sampling (top-p=0.9) | Diversity prevents repetitive responses |
| Creative Writing | Sampling (T=0.8-1.2) | Creativity and surprise are valued |
| Code Generation | Beam search or low-T sampling | Correctness is essential |

A Complete Beam Search Implementation

We've now covered all the key concepts: the basic expand-score-prune algorithm, log-space computation, length normalization, and diversity. Let's consolidate these into a production-ready implementation that you could use in real applications.

Design Decisions

A practical beam search decoder needs several features beyond the basic algorithm:

  1. Length normalization to fairly compare sequences of different lengths
  2. Minimum length constraints to prevent trivially short outputs
  3. Early stopping to save computation when enough good hypotheses are found
  4. N-best output to return multiple candidates for downstream reranking

The implementation below uses the length penalty formula from Google's Neural Machine Translation paper. Rather than dividing by $t^\alpha$ directly, it uses a smoothed version that prevents excessive penalty for very short sequences:

$$\text{lp}(t) = \left( \frac{5 + t}{6} \right)^\alpha$$

where:

  • $\text{lp}(t)$: the length penalty factor for a sequence of length $t$
  • $t$: the number of tokens in the sequence
  • $\alpha$: the length penalty exponent (same as before, typically 0.6 to 0.7)
  • The constants 5 and 6 are chosen so that $\text{lp}(1) = 1$ for any $\alpha$, providing a smooth baseline

This formula behaves similarly to $t^\alpha$ for longer sequences but is gentler on short sequences. For example, with $\alpha = 0.7$: a 2-token sequence gets penalty factor $\text{lp}(2) = (7/6)^{0.7} \approx 1.11$ instead of $2^{0.7} \approx 1.62$.
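
A few lines of arithmetic confirm this behavior across a range of lengths:

```python
# Compare the GNMT length penalty ((5 + t) / 6) ** alpha with the simple t ** alpha.
alpha = 0.7

for t in (1, 2, 5, 10, 20, 50):
    simple = t ** alpha
    gnmt = ((5 + t) / 6) ** alpha
    print(f"t = {t:3d}   simple = {simple:6.2f}   gnmt = {gnmt:6.2f}")

# At t = 2 the simple formula already penalizes by ~1.62x versus ~1.11x for GNMT;
# the two curves grow at increasingly similar rates as t increases.
```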

Out[20]:
Visualization
Line plot comparing two length penalty formulas, showing GNMT formula is less aggressive for short sequences.
Comparison of the Google NMT length penalty formula $((5+t)/6)^\alpha$ versus the simple formula $t^\alpha$ with α = 0.7. The GNMT formula is gentler on short sequences: at t=2, simple gives 1.62 while GNMT gives 1.11. The two curves grow at similar rates for longer sequences.
In[21]:
Code
from typing import Callable, Tuple


class BeamSearchDecoder:
    """
    A complete beam search decoder with length normalization
    and configurable stopping criteria.
    """

    def __init__(
        self,
        vocab_size: int,
        start_token: int,
        end_token: int,
        beam_width: int = 4,
        max_length: int = 50,
        length_penalty: float = 0.7,
        min_length: int = 1,
    ):
        self.vocab_size = vocab_size
        self.start_token = start_token
        self.end_token = end_token
        self.beam_width = beam_width
        self.max_length = max_length
        self.length_penalty = length_penalty
        self.min_length = min_length

    def _get_length_penalty(self, length: int) -> float:
        """Compute length penalty factor."""
        return ((5 + length) / 6) ** self.length_penalty

    def _normalize_score(self, score: float, length: int) -> float:
        """Apply length normalization to score."""
        return score / self._get_length_penalty(length)

    def decode(
        self,
        get_next_log_probs: Callable[[List[int]], np.ndarray],
        n_best: int = 1,
    ) -> List[Tuple[List[int], float]]:
        """
        Perform beam search decoding.

        Args:
            get_next_log_probs: Function that takes token sequence and
                               returns log probabilities for next token
            n_best: Number of best hypotheses to return

        Returns:
            List of (token_sequence, score) tuples
        """
        # Initialize beam with start token
        # Each hypothesis: (normalized_score, raw_score, tokens)
        beam = [(0.0, 0.0, [self.start_token])]
        completed = []

        for step in range(self.max_length):
            if not beam:
                break

            candidates = []

            for norm_score, raw_score, tokens in beam:
                # Get log probabilities for next token
                log_probs = get_next_log_probs(tokens)

                # Expand with all possible next tokens
                for token in range(self.vocab_size):
                    new_tokens = tokens + [token]
                    new_raw_score = raw_score + log_probs[token]
                    new_length = len(new_tokens)
                    new_norm_score = self._normalize_score(
                        new_raw_score, new_length
                    )

                    candidates.append(
                        (new_norm_score, new_raw_score, new_tokens)
                    )

            # Sort by normalized score and keep top k
            candidates.sort(key=lambda x: x[0], reverse=True)

            beam = []
            for norm_score, raw_score, tokens in candidates:
                if tokens[-1] == self.end_token:
                    # Only accept if minimum length reached
                    if len(tokens) >= self.min_length:
                        completed.append((norm_score, raw_score, tokens))
                elif len(beam) < self.beam_width:
                    beam.append((norm_score, raw_score, tokens))

                # Early stopping: enough completed hypotheses
                if len(completed) >= self.beam_width * 2:
                    break

        # Add remaining beam hypotheses to completed
        completed.extend(beam)

        # Sort by normalized score and return top n_best
        completed.sort(key=lambda x: x[0], reverse=True)

        return [
            (tokens, raw_score)
            for norm_score, raw_score, tokens in completed[:n_best]
        ]

Testing the Complete Decoder

Let's verify that our implementation works correctly with the mock language model:

In[22]:
Code
# Create decoder
decoder = BeamSearchDecoder(
    vocab_size=len(vocab),
    start_token=token_to_id["<START>"],
    end_token=token_to_id["<END>"],
    beam_width=4,
    max_length=10,
    length_penalty=0.7,
)


# Wrapper to convert probabilities to log probabilities
def get_log_probs(tokens):
    probs = mock_language_model(tokens)
    return np.log(probs + 1e-10)


# Decode
results = decoder.decode(get_log_probs, n_best=4)
Out[23]:
Console
Beam Search Results (with length normalization):
----------------------------------------------------------------------
1. Score: -4.137
   Sequence: <START> the cat sat on the cat sat on the cat

2. Score: -3.283
   Sequence: <START> the mat <END>

3. Score: -3.283
   Sequence: <START> the floor <END>

4. Score: -4.797
   Sequence: <START> the <END>

Understanding the output. The decoder returns tuples of (tokens, raw_score). The raw score is the unnormalized cumulative log probability, useful for comparing to other systems or for further processing. The ranking, however, is based on normalized scores internally.

Practical features demonstrated:

  • Length normalization: The Google NMT formula $((5 + t) / 6)^\alpha$ provides smoother normalization than simple $t^\alpha$, especially for short sequences where $t^\alpha$ can be overly aggressive.
  • Minimum length enforcement: Hypotheses ending with <END> before reaching min_length are rejected, preventing degenerate single-word outputs.
  • Early stopping: Once $2k$ hypotheses have completed, we stop scanning lower-ranked candidates at each step, since further search among them is unlikely to yield better results.
  • N-best output: Returning multiple hypotheses enables downstream reranking with additional models or features.

Limitations and Impact

Beam search has been the dominant decoding algorithm for neural machine translation and other sequence generation tasks since the mid-2010s. Its success stems from finding a practical balance between the exhaustive search (which guarantees finding the optimal sequence but is computationally infeasible) and greedy decoding (which is fast but often suboptimal).

However, beam search has notable limitations. The algorithm is inherently sequential, processing one timestep at a time, which limits parallelization on modern hardware. Each step requires computing probabilities for all vocabulary tokens across all beam hypotheses, making it slower than greedy decoding by a factor roughly equal to the beam width.

More fundamentally, beam search optimizes for probability, not quality. Research has shown that the highest-probability sequence according to a language model isn't always the best output by human judgment. This "probability-quality mismatch" manifests in several ways: beam search can produce generic, repetitive, or overly safe outputs. For open-ended generation tasks like dialogue or creative writing, sampling methods often produce more engaging text despite lower probabilities.

The length bias issue, while addressable through normalization, reveals a deeper problem: sequence-level probability isn't the right objective for many tasks. A sequence might be highly probable because it's short and generic, not because it's a good translation or summary. Length normalization is a heuristic fix rather than a principled solution.

Despite these limitations, beam search remains essential for tasks where accuracy matters more than diversity. Machine translation systems, summarization models, and speech recognition systems all rely heavily on beam search. The algorithm's deterministic nature is actually an advantage in these contexts: given the same input, you get the same output, making systems predictable and debuggable.

The impact of beam search extends beyond its direct applications. The algorithm established the paradigm of maintaining multiple hypotheses during decoding, which influenced later developments like diverse beam search, constrained decoding, and various reranking methods. Understanding beam search is foundational for working with any sequence generation system.

Summary

This chapter explored beam search as a practical algorithm for sequence generation that balances quality against computational cost.

The key concepts covered:

  • Greedy decoding limitations: Selecting the highest-probability token at each step can lead to globally suboptimal sequences because early decisions constrain later options.

  • Beam search algorithm: Maintains $k$ candidate hypotheses at each step, expanding all possibilities and pruning to keep the top $k$ by cumulative score. This explores multiple paths without the exponential cost of exhaustive search.

  • Log-space computation: Working with log probabilities converts multiplication to addition, preventing numerical underflow and improving computational efficiency.

  • Beam width selection: The beam width $k$ controls the exploration-computation trade-off. Values of 4-10 typically provide good results; larger beams offer diminishing returns.

  • Length normalization: Dividing scores by $t^\alpha$ (typically $\alpha = 0.6$ to $0.7$) corrects the inherent bias toward shorter sequences.

  • Diverse beam search: Partitioning the beam into groups with diversity penalties encourages exploration of structurally different hypotheses.

  • Beam search vs sampling: Beam search finds high-probability sequences (good for translation, summarization), while sampling methods introduce randomness for more diverse outputs (better for dialogue, creative writing).

The complete beam search procedure, given a model that computes $P(y_t \,|\, y_{1:t-1}, x)$:

  1. Initialize beam with <START> token
  2. For each timestep:
    • Expand each hypothesis with all vocabulary tokens

    • Score extensions using cumulative log probability:

      $\text{score}(y_{1:t}) = \sum_{i=1}^{t} \log P(y_i \,|\, y_{1:i-1}, x)$
    • Optionally apply length normalization:

      $\text{score\_norm}(y_{1:t}) = \frac{\text{score}(y_{1:t})}{t^\alpha}$
    • Keep top $k$ hypotheses by score

    • Move completed hypotheses (ending with <END>) to finished list

  3. Return highest-scoring completed hypothesis

In the next chapter, we'll explore attention mechanisms, which address the information bottleneck in encoder-decoder models by allowing the decoder to directly access all encoder states rather than relying on a single context vector.

Key Parameters

When implementing beam search, several parameters significantly impact the quality and efficiency of generated sequences:

  • beam_width (k): The number of hypotheses maintained at each decoding step. Values of 1 reduce to greedy decoding. Values of 4-6 work well for most translation tasks, providing a good balance between quality and speed. Larger values (10-20) may help for speech recognition where acoustic constraints narrow the search space. Increasing beam width improves quality with diminishing returns while linearly increasing computation.

  • max_length: The maximum number of tokens to generate before forcing termination. Set this based on your task's expected output length, typically 1.5-2× the average target length in your training data. Too short truncates valid outputs; too long wastes computation on degenerate sequences.

  • length_penalty ($\alpha$): Controls the strength of length normalization. Values of 0.6-0.7 work well for translation. Lower values (0.5) favor shorter outputs, useful for summarization. Higher values (0.8-1.0) favor longer outputs. Set to 0 to disable length normalization entirely.

  • min_length: The minimum number of tokens required before accepting a completed hypothesis. Prevents the decoder from generating trivially short outputs like single-word responses. Set based on task requirements, typically 5-10 for translation.

  • end_token: The token ID that signals sequence completion. Hypotheses ending with this token are moved to the completed list. Ensure this matches your tokenizer's end-of-sequence token.

  • n_best: The number of top hypotheses to return. Set to 1 for single-best output. Higher values (4-10) enable reranking with additional models or returning multiple alternatives to users.


