Beam Search: Finding Optimal Sequences in Neural Text Generation

Michael Brenndoerfer · December 16, 2025 · 54 min read

Master beam search decoding for sequence-to-sequence models. Learn log probability scoring, length normalization, diverse beam search, and when to use sampling.


When a sequence-to-sequence model generates output, it faces a fundamental challenge: at each timestep, it must choose from thousands of possible tokens. The simplest approach, greedy decoding, picks the single most probable token at each step. But this myopic strategy often leads to suboptimal sequences because a locally good choice can paint the model into a corner, forcing poor choices later. Beam search offers a principled middle ground between exhaustive search (computationally impossible) and greedy decoding (often suboptimal), maintaining multiple candidate sequences simultaneously to find better overall translations.

The Problem with Greedy Decoding

Greedy decoding selects the highest-probability token at each timestep, building the output sequence one token at a time. While computationally efficient, this approach has a critical flaw: it cannot recover from early mistakes.

Consider translating "The cat sat on the mat" into French. At the first timestep, suppose the model assigns these probabilities:

  • "Le" (The, masculine): 0.45
  • "La" (The, feminine): 0.35
  • "Un" (A, masculine): 0.15
  • "Une" (A, feminine): 0.05

Greedy decoding picks "Le" with probability 0.45. But what if "La chatte" (the female cat) would lead to a more fluent overall translation? The greedy decoder cannot explore this path once it commits to "Le."

Out[3]:
Visualization
Tree diagram showing greedy path selecting highest probability at each node versus optimal path through lower initial probability.
Greedy decoding follows the locally optimal path (solid blue), selecting the highest-probability token at each step. However, this can miss the globally optimal sequence (dashed green), which might start with a lower-probability token but lead to a better overall translation. The greedy path achieves score 0.45 × 0.40 × 0.45 = 0.081, while the optimal path achieves 0.35 × 0.55 × 0.60 = 0.116.

The greedy path achieves a cumulative probability of $0.45 \times 0.40 \times 0.45 = 0.081$, while the alternative path through "La chatte était" achieves $0.35 \times 0.55 \times 0.60 = 0.116$. Despite starting with a lower-probability token, the second path is 43% more likely overall. Greedy decoding cannot discover this because it never explores paths that start with suboptimal first choices.

This limitation becomes more severe as sequences grow longer. Each greedy decision constrains future options, and small early mistakes compound into significantly worse final translations. The model might generate grammatically correct but awkward or semantically imprecise text simply because it committed too early to a particular phrasing.
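
The arithmetic behind this comparison is easy to verify. The short sketch below recomputes the two path probabilities from the figure; the per-token probabilities are the illustrative values used above, not outputs of a real model.

```python
# Path probabilities from the example above (illustrative values).
greedy_path = [0.45, 0.40, 0.45]       # locally best token at each step
alternative_path = [0.35, 0.55, 0.60]  # starts with a lower-probability token

greedy_prob = 1.0
for p in greedy_path:
    greedy_prob *= p

alternative_prob = 1.0
for p in alternative_path:
    alternative_prob *= p

print(f"Greedy path probability:      {greedy_prob:.3f}")       # 0.081
print(f"Alternative path probability: {alternative_prob:.3f}")  # ~0.116
print(f"Alternative is {alternative_prob / greedy_prob - 1:.0%} more likely")  # ~43%
```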

The Beam Search Algorithm

Having seen how greedy decoding can miss globally optimal sequences, we need an algorithm that explores multiple possibilities without the exponential cost of exhaustive search. Beam search provides exactly this: a principled way to maintain several candidate sequences simultaneously, hedging our bets until we have enough evidence to commit.

The Core Idea: Hedging Your Bets

Imagine you're navigating a maze where each intersection offers several paths forward. Greedy decoding is like always taking the path that looks most promising at each intersection. This is efficient, but you might walk into a dead end. Exhaustive search explores every possible route, guaranteed to find the best path, but impossibly slow for large mazes. Beam search takes a middle approach: at each intersection, you send scouts down the $k$ most promising paths, where $k$ is called the beam width.

Beam Width

The beam width $k$ controls how many candidate sequences beam search maintains at each step. A beam width of 1 reduces to greedy decoding. Typical values range from 4 to 10 for most applications, balancing exploration against computational cost.

Why does this work? Because natural language has structure. If "Le chat" (the male cat) leads to awkward phrasing, we want to discover this before committing. Keeping "La chatte" (the female cat) alive as an alternative lets us recover. The beam width determines how many "insurance policies" we maintain against early mistakes.

The Algorithm: Expand, Score, Prune, Repeat

Beam search follows a simple four-phase cycle at each timestep:

  1. Expand: For each hypothesis currently in the beam, consider all possible next tokens. If we have $k$ hypotheses and a vocabulary of $V$ tokens, this generates $k \times V$ candidate extensions.

  2. Score: Assign each candidate a score reflecting how probable the entire sequence is so far. Higher scores indicate more promising sequences.

  3. Prune: Keep only the top $k$ candidates by score. This is the key step that prevents exponential growth, ruthlessly discarding all but the best options.

  4. Terminate: If a candidate ends with the <END> token, move it to a "completed" list. Continue until all hypotheses are complete or we reach a maximum length.

The algorithm begins with a single hypothesis containing only <START> (score 0 in log space), and ends by returning the highest-scoring completed sequence.

Out[4]:
Visualization
Step-by-step diagram showing beam search expanding two hypotheses, generating candidates, and pruning back to two.
Beam search with width $k=2$ maintains two hypotheses at each step. After expanding all possibilities, it prunes to keep only the two highest-scoring candidates. Scores are log probabilities (more negative = less likely), and beam search keeps the k highest-scoring (least negative) hypotheses.

The Scoring Function: Why Log Probabilities?

To compare hypotheses during beam search, we need a principled way to quantify "how good" a partial sequence is. This section develops the scoring function from first principles, showing why the mathematical choices we make are not arbitrary but rather necessary solutions to fundamental problems.

The Goal: Measuring Sequence Quality

When beam search maintains multiple candidate sequences, it must decide which ones to keep and which to discard. The natural measure of sequence quality is probability: given an input $x$, how likely is the model to generate this particular output sequence?

For a sequence $y_1, y_2, \ldots, y_t$, we want to compute $P(y_1, y_2, \ldots, y_t \,|\, x)$, the probability of generating exactly these tokens in this order, conditioned on the input. But how do we compute this joint probability when our model only predicts one token at a time?

From Single Tokens to Full Sequences: The Chain Rule

A sequence-to-sequence model generates output autoregressively: at each step, it predicts the next token based on the input and all previously generated tokens. This means we can decompose the joint probability using the chain rule:

$$P(y_1, y_2, \ldots, y_t \,|\, x) = P(y_1 \,|\, x) \times P(y_2 \,|\, y_1, x) \times P(y_3 \,|\, y_1, y_2, x) \times \cdots \times P(y_t \,|\, y_{1:t-1}, x)$$

Each factor $P(y_i \,|\, y_{1:i-1}, x)$ is exactly what our model computes at timestep $i$: the probability distribution over the next token, given the input and all tokens generated so far. The sequence probability is simply the product of all these per-token probabilities.

This decomposition is powerful because it lets us compute sequence probabilities incrementally. When we extend a hypothesis by one token, the new sequence probability is just the old probability multiplied by the new token's probability.

The Numerical Underflow Problem

While the chain rule gives us a mathematically correct formula, it creates a serious computational problem. Consider what happens as sequences grow longer.

Suppose each token has an average probability of 0.3 (a reasonable estimate for natural language). For a 20-token sequence:

$$P(\text{sequence}) = 0.3^{20} \approx 3.5 \times 10^{-11}$$

This number is already extraordinarily small, and it keeps shrinking with every additional token. Real sequences are often 50-100 tokens long, with many tokens having probabilities well below 0.3. Multiplying dozens of small numbers together quickly produces values so tiny that standard 32-bit floating-point arithmetic rounds them to exactly zero.

Out[5]:
Visualization
Line plot showing exponential decay of sequence probability as sequence length increases, with a horizontal line marking the float32 minimum value.
Sequence probabilities shrink exponentially with length, quickly reaching values too small for computers to represent. With average token probability 0.3, a 50-token sequence has probability around 10⁻²⁶, and sequences beyond roughly 70 tokens fall below the minimum representable float32 value of approximately 10⁻³⁸. Typical sentences (20-30 tokens) already approach problematic ranges for uncertain predictions.

This isn't a theoretical concern. It's a practical bug that causes beam search to fail silently. When all sequence probabilities underflow to zero, the algorithm cannot distinguish between good and bad hypotheses.

The Solution: Working in Log Space

The solution is elegant: instead of tracking probabilities directly, we track their logarithms. This simple transformation solves the underflow problem while preserving everything we need for comparison.

Recall that for any positive numbers $a$ and $b$:

$$\log(a \times b) = \log(a) + \log(b)$$

This means that multiplying probabilities in regular space is equivalent to adding log probabilities in log space. Our problematic product becomes a well-behaved sum:

$$\log P(y_1, \ldots, y_t \,|\, x) = \log P(y_1 \,|\, x) + \log P(y_2 \,|\, y_1, x) + \cdots + \log P(y_t \,|\, y_{1:t-1}, x)$$

The benefits of this transformation are substantial:

  1. Numerical stability: Log probabilities are moderate negative numbers. For $P = 0.3$, we have $\log(0.3) \approx -1.2$. Adding 20 such numbers gives roughly $-24$, a perfectly manageable value for any floating-point format. (The short demonstration after this list makes the contrast concrete.)

  2. Computational efficiency: Addition is faster than multiplication on most hardware, and the difference compounds over millions of sequence evaluations during decoding.

  3. Preserved ordering: Since $\log$ is a monotonically increasing function, if $P(A) > P(B)$, then $\log P(A) > \log P(B)$. Maximizing log probability is equivalent to maximizing probability, so we lose nothing by working in log space.
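
Here is a minimal sketch of that numerical-stability point, multiplying identical per-token probabilities in float32 versus summing their logs (the token probability of 0.3 is the same illustrative value used above):

```python
import numpy as np

# Multiplying many small per-token probabilities eventually underflows in
# float32, while the equivalent sum of log probabilities stays well-behaved.
token_prob = 0.3

for length in (20, 50, 100):
    product = np.prod(np.full(length, token_prob, dtype=np.float32))
    log_sum = length * np.log(token_prob)  # the same quantity, in log space
    print(f"length={length:3d}  product={product:.3e}  sum of logs={log_sum:.1f}")

# Approximate output:
# length= 20  product=3.487e-11  sum of logs=-24.1
# length= 50  product=7.179e-27  sum of logs=-60.2
# length=100  product=0.000e+00  sum of logs=-120.4   <- the product underflows to zero
```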

The Beam Search Scoring Formula

We can now state the scoring formula precisely. Given a partial output sequence $y_{1:t} = (y_1, y_2, \ldots, y_t)$ generated from input $x$, the score is the cumulative log probability:

$$\text{score}(y_{1:t}) = \sum_{i=1}^{t} \log P(y_i \,|\, y_{1:i-1}, x)$$

where:

  • $y_{1:t}$: the output sequence from position 1 to $t$, representing all tokens generated so far
  • $y_i$: the token at position $i$ in the output sequence
  • $y_{1:i-1}$: the prefix of all previously generated tokens (before position $i$); for $i=1$, this is empty
  • $x$: the input sequence being translated or transformed
  • $P(y_i \,|\, y_{1:i-1}, x)$: the model's predicted probability of token $y_i$, conditioned on the input and all previous outputs
  • $\log$: the natural logarithm, mapping probabilities from $(0, 1]$ to $(-\infty, 0]$

This formula has an elegant incremental property: when we extend a hypothesis by adding token $y_{t+1}$, the new score is simply:

$$\text{score}(y_{1:t+1}) = \text{score}(y_{1:t}) + \log P(y_{t+1} \,|\, y_{1:t}, x)$$

We don't need to recompute the entire sum; we just add one term. This makes beam search efficient even for long sequences.

Interpreting Scores

Understanding how to read scores is essential for debugging and intuition. Since probabilities lie in $(0, 1]$, their logarithms are non-positive:

  • Scores are always ≤ 0: A score of 0 means probability 1 (impossible for real sequences). Typical scores range from $-5$ to $-50$ depending on sequence length.
  • Higher (less negative) scores are better: A score of $-3.5$ beats a score of $-8.2$.
  • Scores are additive costs: Each token "costs" some negative amount. Better token choices cost less (smaller negative numbers).

To convert a score back to probability, exponentiate: $P = e^{\text{score}}$. For example:

Converting log probability scores to probabilities. Exponentiate the score to recover the original probability.
| Score | Probability | Interpretation |
|-------|-------------|----------------|
| $-2.3$ | $e^{-2.3} \approx 0.10$ | 10%, quite likely |
| $-4.6$ | $e^{-4.6} \approx 0.01$ | 1%, plausible |
| $-9.2$ | $e^{-9.2} \approx 0.0001$ | 0.01%, unlikely |

The ratio of probabilities corresponds to the difference in scores. If hypothesis A has score $-3.5$ and hypothesis B has score $-8.2$, then A is $e^{(-3.5) - (-8.2)} = e^{4.7} \approx 110$ times more likely than B.
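
These conversions are a one-liner in code. A quick sketch, using the example scores above:

```python
import math

score_a = -3.5
score_b = -8.2

# Exponentiate a score to recover the sequence probability.
print(f"P(A) = {math.exp(score_a):.4f}")   # ~0.0302
print(f"P(B) = {math.exp(score_b):.5f}")   # ~0.00027

# A score difference corresponds to a probability ratio.
print(f"A is ~{math.exp(score_a - score_b):.0f}x more likely than B")  # ~110x
```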

Implementing Beam Search

With the scoring function established, we can now implement beam search. Building the algorithm from scratch will solidify your understanding of how the expand-score-prune cycle operates and how log probabilities flow through the computation.

Step 1: Representing Hypotheses

A hypothesis is a candidate sequence under consideration. During beam search, we maintain multiple hypotheses simultaneously, each representing a different path through the token space. Every hypothesis must track two pieces of information:

  1. The tokens generated so far: The actual sequence of token IDs
  2. The cumulative score: The sum of log probabilities accumulated along this path

We represent this as a simple dataclass:

In[6]:
Code
from dataclasses import dataclass
from typing import List


@dataclass
class Hypothesis:
    """A single beam search hypothesis."""

    tokens: List[int]  # Sequence of token IDs
    score: float  # Cumulative log probability

    def extend(self, token: int, log_prob: float) -> "Hypothesis":
        """Create a new hypothesis by extending this one."""
        return Hypothesis(
            tokens=self.tokens + [token], score=self.score + log_prob
        )

The extend method implements the incremental scoring property we derived earlier. When we add a new token with log probability log_prob, we create a new hypothesis with:

  • The original tokens plus the new token appended
  • The original score plus the new log probability (implementing $\text{score}(y_{1:t+1}) = \text{score}(y_{1:t}) + \log P(y_{t+1} \,|\, \ldots)$)

This immutable design, creating new hypotheses rather than modifying existing ones, is crucial. During expansion, we extend each hypothesis with every vocabulary token, generating many new candidates from a single parent. If we modified hypotheses in place, these extensions would interfere with each other.
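
A tiny usage sketch shows the incremental scoring in action; the token IDs and probabilities below are made up purely for illustration:

```python
import math

# Start from a hypothesis containing only a <START> token (ID 0 here);
# score 0.0 corresponds to log(1).
start = Hypothesis(tokens=[0], score=0.0)

# Extending returns new objects; the parent hypothesis is left untouched.
h1 = start.extend(token=2, log_prob=math.log(0.85))  # e.g. "the" with p = 0.85
h2 = h1.extend(token=3, log_prob=math.log(0.50))     # e.g. "cat" with p = 0.50

print(start.tokens, start.score)      # [0] 0.0   (unchanged)
print(h2.tokens, round(h2.score, 3))  # [0, 2, 3] -0.856
```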

Step 2: The Main Algorithm

The beam search algorithm orchestrates the expand-score-prune cycle. It takes a language model (any function that returns token probabilities) and searches for high-scoring sequences:

In[7]:
Code
import numpy as np


def beam_search(
    get_next_probs,  # Function: tokens -> probability distribution
    start_token: int,
    end_token: int,
    vocab_size: int,
    beam_width: int = 4,
    max_length: int = 50,
) -> List[Hypothesis]:
    """
    Perform beam search decoding.

    Args:
        get_next_probs: Function that takes current tokens and returns
                       probability distribution over next token
        start_token: ID of the start token
        end_token: ID of the end token
        vocab_size: Size of the vocabulary
        beam_width: Number of hypotheses to maintain
        max_length: Maximum sequence length

    Returns:
        List of completed hypotheses, sorted by score (highest first)
    """
    # Initialize with start token
    initial = Hypothesis(tokens=[start_token], score=0.0)
    beam = [initial]
    completed = []

    for step in range(max_length):
        if not beam:
            break

        # Collect all possible extensions
        all_candidates = []

        for hyp in beam:
            # Get probability distribution for next token
            probs = get_next_probs(hyp.tokens)
            log_probs = np.log(probs + 1e-10)  # Add small epsilon for stability

            # Generate all possible extensions
            for token in range(vocab_size):
                new_hyp = hyp.extend(token, log_probs[token])
                all_candidates.append(new_hyp)

        # Sort by score (highest first) and keep top k
        all_candidates.sort(key=lambda h: h.score, reverse=True)

        # Separate completed and continuing hypotheses
        beam = []
        for hyp in all_candidates:
            if hyp.tokens[-1] == end_token:
                completed.append(hyp)
            elif len(beam) < beam_width:
                beam.append(hyp)

            # Stop if we have enough hypotheses
            if len(beam) >= beam_width and len(completed) >= beam_width:
                break

    # Add any remaining hypotheses that didn't complete
    completed.extend(beam)

    # Sort by score and return
    completed.sort(key=lambda h: h.score, reverse=True)
    return completed

Let's trace through each phase of this implementation to understand how the algorithm operates:

Initialization. We begin with a single hypothesis containing only the <START> token. Its score is 0.0, which corresponds to $\log(1) = 0$, a probability of 1. This makes sense: we haven't made any probabilistic choices yet, so there's no "cost" accumulated.

Expansion phase. For each hypothesis currently in the beam, we query the language model for the probability distribution over the next token. We then create $V$ new hypotheses (one per vocabulary token), each extending the original by one token. If we have $k$ hypotheses and vocabulary size $V$, this generates $k \times V$ candidates.

Notice the + 1e-10 when computing log probabilities. This small epsilon prevents numerical errors when the model assigns exactly zero probability to a token (which would make $\log(0) = -\infty$).

Scoring phase. The scoring happens automatically during extension. The extend method adds the new token's log probability to the cumulative score. This is the incremental property in action: we don't recompute the entire sum, just add one term.

Pruning phase. We sort all $k \times V$ candidates by score (highest first) and iterate through them. Hypotheses ending with <END> are complete and go to the completed list. Others fill the beam until we have $k$ active hypotheses. The break condition implements early stopping when we have enough of both types.

Termination. The loop continues until either the beam is empty (all hypotheses completed) or we reach max_length. Hypotheses that didn't complete naturally are added to the results, allowing the caller to see partial sequences if needed.

Step 3: Testing with a Worked Example

To see beam search in action, we need a language model. Rather than loading a full neural network, we'll create a mock model that captures the essential behavior: given the tokens generated so far, it returns a probability distribution over what comes next.

Our mock model is designed to prefer the sentence "the cat sat on the mat" while assigning non-zero probability to alternatives like "the dog ran on the floor." This lets us observe how beam search explores multiple paths.

In[8]:
Code
# Define a simple vocabulary
vocab = [
    "<START>",
    "<END>",
    "the",
    "cat",
    "dog",
    "sat",
    "ran",
    "on",
    "mat",
    "floor",
]
token_to_id = {t: i for i, t in enumerate(vocab)}
id_to_token = {i: t for t, i in token_to_id.items()}


def mock_language_model(tokens: List[int]) -> np.ndarray:
    """
    A mock language model that returns probability distributions.
    Designed to prefer "the cat sat on the mat" but allow alternatives.
    """
    probs = np.ones(len(vocab)) * 0.01  # Small base probability

    last_token = id_to_token[tokens[-1]]

    if last_token == "<START>":
        probs[token_to_id["the"]] = 0.85
        probs[token_to_id["cat"]] = 0.05
        probs[token_to_id["dog"]] = 0.05
    elif last_token == "the":
        probs[token_to_id["cat"]] = 0.50
        probs[token_to_id["dog"]] = 0.35
        probs[token_to_id["mat"]] = 0.05
        probs[token_to_id["floor"]] = 0.05
    elif last_token in ["cat", "dog"]:
        probs[token_to_id["sat"]] = 0.55
        probs[token_to_id["ran"]] = 0.35
    elif last_token in ["sat", "ran"]:
        probs[token_to_id["on"]] = 0.80
        probs[token_to_id["<END>"]] = 0.10
    elif last_token == "on":
        probs[token_to_id["the"]] = 0.85
        probs[token_to_id["<END>"]] = 0.05
    elif last_token in ["mat", "floor"]:
        probs[token_to_id["<END>"]] = 0.90

    # Normalize
    probs = probs / probs.sum()
    return probs


# Run beam search
results = beam_search(
    get_next_probs=mock_language_model,
    start_token=token_to_id["<START>"],
    end_token=token_to_id["<END>"],
    vocab_size=len(vocab),
    beam_width=4,
    max_length=10,
)
Out[9]:
Console
Top 4 hypotheses from beam search:
------------------------------------------------------------
1. Score: -4.137 (prob: 0.0160)
   Sequence: <START> the cat sat on the cat sat on the cat

2. Score: -4.494 (prob: 0.0112)
   Sequence: <START> the cat sat on the dog sat on the cat

3. Score: -4.494 (prob: 0.0112)
   Sequence: <START> the cat sat on the cat sat on the dog

4. Score: -4.494 (prob: 0.0112)
   Sequence: <START> the dog sat on the cat sat on the cat

The output reveals several important properties of beam search:

Scores reflect cumulative log probability. Each score is the sum of log probabilities along the path. The top hypothesis has the highest (least negative) score because each transition ("<START>" → "the" → "cat" → "sat" → "on") follows the model's preferred path, accumulating smaller penalties at each step.

Multiple alternatives survive. Unlike greedy decoding, beam search maintains hypotheses with "dog" instead of "cat" and "ran" instead of "sat." These alternatives have lower scores but remain in the beam because they're among the top kk candidates at each step.

Probability interpretation. The scores can be converted back to probabilities by exponentiating. A score around $-4$ corresponds to a probability around $e^{-4} \approx 0.02$, or about 2%. These probabilities are small in absolute terms but represent the most likely sequences among an exponentially large space of possibilities.

The key insight from this example: beam search doesn't just find one good sequence. It finds several good sequences ranked by probability. This is valuable when we want alternatives (for user selection) or need to rerank candidates using additional criteria (like length normalization, which we'll explore next).

Visualizing Score Evolution

To build intuition for how beam search explores the hypothesis space, let's trace the scores of all candidates across decoding steps. This visualization reveals the pruning dynamics: at each step, many candidates are generated, but only the top $k$ survive.

Out[10]:
Visualization
Scatter plot showing hypothesis scores at each decoding step, with kept hypotheses highlighted and pruned hypotheses faded.
Score evolution during beam search decoding. Each dot represents a candidate hypothesis, with kept hypotheses shown as filled circles and pruned ones as faded X marks. The beam maintains the top k=4 candidates at each step, with scores becoming more negative as sequences grow longer.

The visualization shows several important dynamics:

  • Score decrease: As sequences grow, scores become more negative because each token adds another (negative) log probability term.
  • Candidate explosion: At each step, the number of candidates is $k \times V$ (beam width × vocabulary size), but only $k$ survive.
  • Score clustering: Kept hypotheses tend to cluster near the top of the score range, while pruned candidates spread across lower scores.
  • Convergence: As decoding progresses, the score gap between kept and pruned candidates often widens as the beam "commits" to promising paths.

Beam Width Selection

The beam width $k$ is the primary hyperparameter in beam search. It controls the trade-off between exploration and computational cost.

Out[11]:
Visualization
Line plot showing BLEU score increasing then plateauing with beam width.
Translation quality (BLEU score) improves rapidly with beam width up to k=4-6, then plateaus. The sweet spot for most applications is k=4-10, balancing quality gains against computational cost.
Line plot showing computation time increasing linearly with beam width.
Computational cost grows roughly linearly with beam width. Each additional beam hypothesis requires proportionally more memory and compute time.

The key observations about beam width are:

  • $k = 1$: Equivalent to greedy decoding. Fast but often produces suboptimal sequences.
  • $k = 4$ to $10$: The sweet spot for most applications. Provides significant quality improvements over greedy decoding with manageable computational overhead.
  • $k > 10$: Diminishing returns. Quality improvements become marginal while computation continues to grow linearly.

The optimal beam width depends on the task and model. Machine translation typically uses $k = 4$ to $6$. Summarization and dialogue systems might use smaller beams ($k = 2$ to $4$) since diversity matters more than finding the single best sequence. Speech recognition often uses larger beams ($k = 10$ to $20$) because the search space is more constrained by acoustic evidence.

In[12]:
Code
# Compare different beam widths
def compare_beam_widths(widths: List[int]):
    """Compare beam search results across different beam widths."""
    results = {}
    for k in widths:
        hypotheses = beam_search(
            get_next_probs=mock_language_model,
            start_token=token_to_id["<START>"],
            end_token=token_to_id["<END>"],
            vocab_size=len(vocab),
            beam_width=k,
            max_length=10,
        )
        if hypotheses:
            best = hypotheses[0]
            results[k] = {
                "sequence": " ".join(id_to_token[t] for t in best.tokens),
                "score": best.score,
                "num_hypotheses": len(hypotheses),
            }
    return results


beam_comparison = compare_beam_widths([1, 2, 4, 8])
Out[13]:
Console
Effect of Beam Width on Search Results:
----------------------------------------------------------------------
Beam Width   Best Score   # Hypotheses    Best Sequence
----------------------------------------------------------------------
1            -4.137       2               <START> the cat sat on the cat sat on the cat
2            -4.137       4               <START> the cat sat on the cat sat on the cat
4            -4.137       8               <START> the cat sat on the cat sat on the cat
8            -3.283       22              <START> the mat <END>

With beam width 1 (greedy decoding), the search follows a single path and returns the greedy sequence. As beam width increases, we explore more alternatives and may find better-scoring sequences: at width 8, the search surfaces a shorter completion with a higher score that narrower beams missed. In practice, the gains flatten out at moderate beam widths, which is why very large beams provide diminishing returns.

Length Normalization

Our basic beam search implementation has a subtle but significant flaw: it systematically prefers shorter sequences over longer ones, even when the longer sequence is a better output. Understanding why this happens and how to fix it reveals important insights about working with log probabilities.

The Length Bias Problem

Recall that each token contributes a negative log probability to the cumulative score (since $\log p < 0$ for any $p < 1$). This creates an arithmetic bias: longer sequences accumulate more negative terms, driving their scores lower regardless of quality.

Consider a concrete example:

  • Hypothesis A: "The cat sat" with score $-2.5$ (3 content tokens)
  • Hypothesis B: "The cat sat on the mat" with score $-4.2$ (6 content tokens)

Raw beam search prefers Hypothesis A because $-2.5 > -4.2$. But Hypothesis B is clearly a more complete sentence! The problem is that we're comparing apples to oranges: Hypothesis B has made more probabilistic choices, so naturally it has accumulated more "cost."

This isn't a flaw in the math. It's a flaw in what we're optimizing. Raw log probability asks "what's the probability of this exact sequence?" But what we really want is "how good is each token choice, on average?"

The Solution: Normalize by Length

To fairly compare sequences of different lengths, we divide the cumulative score by a function of the sequence length. The normalized score effectively measures average log probability per token rather than total log probability.

Length Normalization Formula

The length-normalized score divides the cumulative log probability by $t^\alpha$, where $t$ is the sequence length:

$$\text{score}_{\text{norm}}(y_{1:t}) = \frac{1}{t^\alpha} \sum_{i=1}^{t} \log P(y_i \,|\, y_{1:i-1}, x)$$

where:

  • $\text{score}_{\text{norm}}(y_{1:t})$: the length-normalized score for sequence $y_{1:t}$
  • $t$: the length of the generated sequence (number of tokens)
  • $\alpha$: the length penalty exponent, typically in the range $[0, 1]$
  • $t^\alpha$: the normalization factor, which grows with sequence length
  • $\sum_{i=1}^{t} \log P(y_i \,|\, y_{1:i-1}, x)$: the raw cumulative log probability (same as the unnormalized score)

Why the exponent $\alpha$? The parameter $\alpha$ controls how much we compensate for length:

  • $\alpha = 0$: No normalization at all. $t^0 = 1$, so we use raw scores. Short sequences are strongly favored.
  • $\alpha = 0.5$: Moderate normalization. We divide by $\sqrt{t}$, partially compensating for length.
  • $\alpha = 1.0$: Full normalization. We divide by $t$, computing the average log probability per token. This can over-compensate, sometimes favoring unnecessarily long sequences.

In practice, $\alpha = 0.6$ to $0.7$ works well for most tasks. This provides enough compensation to prevent truncated outputs without encouraging verbose padding.
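
A quick sketch makes the trade-off concrete, applying the formula to the two example hypotheses above at a few values of $\alpha$ (the scores and lengths are the illustrative numbers from the example, and exactly where the ranking flips depends on both $\alpha$ and the scores):

```python
# Length-normalized score: raw_score / t**alpha
hypotheses = {
    "The cat sat": (-2.5, 3),              # (raw score, length in tokens)
    "The cat sat on the mat": (-4.2, 6),
}

for alpha in (0.0, 0.7, 1.0):
    print(f"alpha = {alpha}")
    for text, (raw_score, length) in hypotheses.items():
        normalized = raw_score / (length ** alpha)
        print(f"  {text!r}: normalized score = {normalized:.2f}")

# With alpha = 0.0 the short hypothesis wins outright; with alpha = 1.0 the
# longer, more complete hypothesis wins (-0.70 vs -0.83 per-token average).
```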

Out[14]:
Visualization
Line plot showing length penalty curves for different alpha values from 0 to 1, demonstrating how penalty grows with sequence length.
The length penalty factor $t^\alpha$ grows with sequence length, with the rate controlled by $\alpha$. Higher $\alpha$ values apply stronger penalties to longer sequences. The commonly used range of 0.6-0.7 provides moderate compensation without over-penalizing long outputs.

Visualizing the Effect

The following table demonstrates how length normalization changes which hypothesis "wins." Without normalization, the shortest sequence has the highest (least negative) score. With normalization ($\alpha = 0.7$), sequences are compared fairly based on their average token quality.

Effect of length normalization on hypothesis ranking. Without normalization, the "Short" sequence wins with the highest (least negative) raw score of −1.5. With normalization using $\alpha = 0.7$, the "Medium" sequence achieves the best normalized score of −0.63, demonstrating how length normalization can favor appropriately-lengthed sequences over truncated ones. The normalized score is computed as $\text{score} / t^{0.7}$.
| Sequence | Length | Raw Score | Normalized Score | Winner? |
|----------|--------|-----------|------------------|---------|
| Short | 3 | −1.5 | −0.69 | Raw ✓ |
| Medium | 5 | −2.0 | −0.63 | Normalized ✓ |
| Long | 8 | −3.5 | −0.82 | |
| Very Long | 12 | −6.0 | −1.04 | |

Task-Specific Tuning

The optimal $\alpha$ depends on your task's desired output length:

Recommended length penalty values by task. Lower values favor shorter outputs, while higher values allow longer sequences.
| Task | Recommended $\alpha$ | Reasoning |
|------|----------------------|-----------|
| Summarization | 0.5 - 0.6 | Shorter outputs are often appropriate |
| Translation | 0.6 - 0.7 | Balance between brevity and completeness |
| Dialogue | 0.7 - 0.8 | Responses should be appropriately detailed |
| Story generation | 0.8 - 1.0 | Longer outputs may be desirable |

Implementation

Adding length normalization to beam search requires only a small modification: instead of sorting candidates by raw score, we sort by normalized score. The key insight is that normalization affects comparison but we still track the raw score for final reporting.

In[15]:
Code
def beam_search_with_length_norm(
    get_next_probs,
    start_token: int,
    end_token: int,
    vocab_size: int,
    beam_width: int = 4,
    max_length: int = 50,
    length_penalty_alpha: float = 0.7,
) -> List[Hypothesis]:
    """
    Beam search with length normalization.

    The length penalty is applied when comparing hypotheses:
    normalized_score = score / (length ^ alpha)
    """
    initial = Hypothesis(tokens=[start_token], score=0.0)
    beam = [initial]
    completed = []

    def get_normalized_score(hyp: Hypothesis) -> float:
        """Compute length-normalized score."""
        length = len(hyp.tokens)
        return hyp.score / (length**length_penalty_alpha)

    for step in range(max_length):
        if not beam:
            break

        all_candidates = []

        for hyp in beam:
            probs = get_next_probs(hyp.tokens)
            log_probs = np.log(probs + 1e-10)

            for token in range(vocab_size):
                new_hyp = hyp.extend(token, log_probs[token])
                all_candidates.append(new_hyp)

        # Sort by NORMALIZED score
        all_candidates.sort(key=get_normalized_score, reverse=True)

        beam = []
        for hyp in all_candidates:
            if hyp.tokens[-1] == end_token:
                completed.append(hyp)
            elif len(beam) < beam_width:
                beam.append(hyp)

            if len(beam) >= beam_width and len(completed) >= beam_width:
                break

    completed.extend(beam)
    completed.sort(key=get_normalized_score, reverse=True)
    return completed

The only change from our basic implementation is the get_normalized_score function and using it as the sort key. This small modification can dramatically improve output quality for tasks where appropriate length matters.

Diverse Beam Search

Standard beam search has an unexpected limitation: the hypotheses it returns are often nearly identical. If the top candidate is "The cat sat on the mat," the second and third candidates might be "The cat sat on the floor" and "The cat sat on the rug," differing only in the final word. This happens because beam search greedily pursues high-probability paths, and in natural language, similar beginnings tend to have similar continuations.

For many applications, this lack of diversity is a serious problem:

  • Dialogue systems need varied responses to avoid sounding robotic
  • Creative writing assistants should offer genuinely different alternatives
  • Reranking pipelines benefit from diverse candidates to choose from

Diverse beam search addresses this by explicitly encouraging hypotheses to differ from each other.

The Group-Based Approach

The key idea is to partition the beam into $G$ groups, where each group is encouraged to explore different regions of the output space. Group 1 runs standard beam search. Group 2 penalizes hypotheses that are similar to Group 1's selections, pushing it toward different word choices. Group 3 penalizes similarity to both Groups 1 and 2, and so on.

Out[16]:
Visualization
Diagram showing two beam groups with diversity penalty arrows between them encouraging different word choices.
Diverse beam search partitions the beam into groups and applies a diversity penalty to encourage different hypotheses. Group 1 contains the top hypotheses from standard beam search. Group 2 penalizes similarity to Group 1, encouraging exploration of different sentence structures, word choices, and semantic content.

The Algorithm

Diverse beam search modifies the standard algorithm by processing groups sequentially and penalizing similarity:

  1. Partition the beam into $G$ groups, each maintaining $k/G$ hypotheses
  2. Process Group 1 using standard beam search (no penalty)
  3. For each subsequent group $g$, when scoring candidates:
    • Compute the base score (log probability) as usual
    • Subtract a penalty for similarity to hypotheses in groups $1, 2, \ldots, g-1$
    • Select the top $k/G$ candidates by this penalized score
  4. Combine all groups' hypotheses for the final output

The Diversity Penalty

The penalty for hypothesis $y$ in group $g$ accumulates similarity to all hypotheses selected by earlier groups. The penalized score for a candidate hypothesis is computed by subtracting a diversity term from the base log probability score:

$$\text{penalized\_score}(y, g) = \text{score}(y) - \lambda \sum_{g' < g} \sum_{y' \in G_{g'}} \text{sim}(y, y')$$

where:

  • $\text{penalized\_score}(y, g)$: the adjusted score for hypothesis $y$ when being considered for group $g$
  • $\text{score}(y)$: the base log probability score of hypothesis $y$ (computed as usual)
  • $\lambda$: a hyperparameter controlling the strength of diversification (typical values: 0.5 to 1.0; higher values enforce more diversity)
  • $g' < g$: iteration over all groups with indices less than $g$ (i.e., groups processed earlier)
  • $G_{g'}$: the set of hypotheses selected for group $g'$
  • $y' \in G_{g'}$: iteration over all hypotheses in group $g'$
  • $\text{sim}(y, y')$: a function measuring overlap between hypotheses $y$ and $y'$ (returns higher values for more similar sequences)

Measuring similarity. Common choices for the similarity function include:

  • Hamming-based overlap: Count the number of positions where the two sequences share the same token (the complement of the Hamming distance). Simple and fast.
  • N-gram overlap: Count shared bigrams or trigrams. Captures phrase-level similarity.
  • Token-level penalty: Penalize selecting the same token at the same position as any previous group's hypothesis.

The effect. By subtracting this penalty from the score, hypotheses that share many tokens with earlier groups become less attractive. Group 2 is pushed away from Group 1's choices; Group 3 is pushed away from both Groups 1 and 2. The result is a diverse set of outputs that collectively cover more of the plausible output space.
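
As a minimal sketch of how the penalty could be computed, here is an illustrative implementation using position-wise token overlap as the similarity function. The function and variable names are placeholders, not from any particular library.

```python
from typing import List

def position_overlap(a: List[int], b: List[int]) -> int:
    """Count positions where two token sequences have the same token."""
    return sum(1 for x, y in zip(a, b) if x == y)

def penalized_score(
    base_score: float,                      # log-probability score of the candidate
    candidate: List[int],                   # candidate token sequence
    earlier_groups: List[List[List[int]]],  # hypotheses chosen by earlier groups
    diversity_strength: float = 0.5,        # lambda in the formula above
) -> float:
    """Subtract a diversity penalty for overlap with earlier groups' hypotheses."""
    penalty = sum(
        position_overlap(candidate, other)
        for group in earlier_groups
        for other in group
    )
    return base_score - diversity_strength * penalty

# A candidate that copies an earlier group's tokens is penalized more heavily
# than one that diverges, even if its base score is slightly higher.
group1 = [[2, 3, 5, 7]]                               # e.g. "the cat sat on"
print(penalized_score(-3.0, [2, 3, 5, 7], [group1]))  # -3.0 - 0.5*4 = -5.0
print(penalized_score(-3.2, [2, 4, 6, 7], [group1]))  # -3.2 - 0.5*2 = -4.2
```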

Beam Search vs Sampling

Beam search finds high-probability sequences, but high probability doesn't always mean high quality. For creative tasks like story generation or dialogue, the most probable response is often generic and boring. Sampling-based methods introduce randomness to produce more diverse and interesting outputs.

Out[17]:
Visualization
Side-by-side comparison showing beam search producing consistent outputs versus sampling producing varied outputs.
Comparison of beam search and sampling approaches. Beam search (left) deterministically finds high-probability sequences, producing consistent but potentially generic outputs with high probability and consistent quality but low diversity. Sampling (right) introduces randomness, enabling diverse and creative outputs with high diversity but variable quality.

The key sampling methods are:

Temperature sampling scales the logits before applying softmax, controlling how "peaked" or "flat" the resulting probability distribution is. The temperature-scaled probability for token $y_i$ is:

$$P(y_i) = \frac{\exp(z_i / T)}{\sum_{j=1}^{V} \exp(z_j / T)}$$

where:

  • $P(y_i)$: the probability of selecting token $y_i$ after temperature scaling
  • $z_i$: the logit (unnormalized score) for token $y_i$ from the model's output layer
  • $T$: the temperature parameter, a positive real number that controls randomness
  • $V$: the vocabulary size (total number of possible tokens)
  • $\exp(z_i / T)$: the exponentiated scaled logit, which determines the token's relative weight
  • $\sum_{j=1}^{V} \exp(z_j / T)$: the normalization constant summing over all vocabulary tokens

Why does temperature work? Dividing by $T$ before exponentiation changes the relative differences between logits:

  • $T = 1$: Standard softmax. The distribution reflects the model's learned preferences.
  • $T < 1$ (e.g., 0.5): Dividing by a small number amplifies differences between logits. If $z_1 = 2$ and $z_2 = 1$, then $z_1/0.5 = 4$ and $z_2/0.5 = 2$, so the gap widens from 1 to 2. This makes the distribution more peaked, favoring high-probability tokens.
  • $T > 1$ (e.g., 2.0): Dividing by a large number compresses differences. The same logits become $z_1/2 = 1$ and $z_2/2 = 0.5$, so the gap shrinks from 1 to 0.5. This flattens the distribution, giving lower-probability tokens more chance.

In the limit: as $T \to 0$, the distribution approaches a one-hot vector (greedy selection of the maximum logit). As $T \to \infty$, the distribution approaches uniform (random selection).
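
A small sketch of temperature scaling, using made-up logits for a four-token vocabulary:

```python
import numpy as np

def temperature_softmax(logits: np.ndarray, temperature: float) -> np.ndarray:
    """Softmax over logits scaled by 1/temperature, with max-subtraction for stability."""
    scaled = logits / temperature
    scaled = scaled - scaled.max()
    weights = np.exp(scaled)
    return weights / weights.sum()

logits = np.array([2.0, 1.0, 0.5, -1.0])  # illustrative logits, not real model outputs

for T in (0.3, 1.0, 2.0):
    print(f"T = {T}: {np.round(temperature_softmax(logits, T), 3)}")

# Low T concentrates almost all probability on the top token;
# high T flattens the distribution toward uniform.
```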

Out[18]:
Visualization
Bar chart showing sharply peaked probability distribution at low temperature.
T = 0.3 (Sharp): Low temperature amplifies differences between logits, concentrating probability on the most likely token ("the" has 97% probability).
Bar chart showing moderate probability distribution at standard temperature.
T = 1.0 (Standard): Standard softmax preserves the model's learned distribution with moderate spread across tokens.
Bar chart showing flattened probability distribution at high temperature.
T = 2.0 (Flat): High temperature compresses differences, flattening the distribution and giving unlikely tokens more chance.

Top-k sampling restricts sampling to the $k$ most probable tokens, setting all other probabilities to zero. This prevents sampling very unlikely tokens while maintaining diversity among likely options.

Top-p (nucleus) sampling dynamically selects the smallest set of tokens whose cumulative probability exceeds $p$. This adapts to the distribution's shape: when the model is confident, fewer tokens are considered; when uncertain, more options remain.
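
The two truncation rules are easy to sketch in a few lines. The following illustrative helpers filter a probability distribution and renormalize it; the example distribution is made up.

```python
import numpy as np

def top_k_filter(probs: np.ndarray, k: int) -> np.ndarray:
    """Zero out everything except the k most probable tokens, then renormalize."""
    filtered = np.zeros_like(probs)
    keep = np.argsort(probs)[-k:]            # indices of the k largest probabilities
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

def top_p_filter(probs: np.ndarray, p: float) -> np.ndarray:
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    order = np.argsort(probs)[::-1]          # tokens from most to least probable
    cumulative = np.cumsum(probs[order])
    nucleus_size = int(np.searchsorted(cumulative, p)) + 1
    filtered = np.zeros_like(probs)
    keep = order[:nucleus_size]
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

probs = np.array([0.35, 0.25, 0.20, 0.12, 0.05, 0.03])
print(np.round(top_k_filter(probs, k=3), 3))    # fixed cutoff: only 3 tokens survive
print(np.round(top_p_filter(probs, p=0.9), 3))  # nucleus: 4 tokens cover 0.92 >= 0.9
```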

Out[19]:
Visualization
Bar chart showing top-k sampling keeping fixed 5 tokens.
Top-k sampling (k=5) keeps a fixed number of tokens regardless of their probability mass. Here, the top 5 tokens capture 91% of the probability mass. The cutoff is fixed regardless of how confident the model is.
Bar chart showing top-p sampling keeping tokens until 90% cumulative probability.
Top-p (nucleus) sampling (p=0.9) adapts to the distribution shape, keeping tokens until 90% cumulative probability is reached. This requires 4 tokens here, but would use more tokens if the distribution were flatter.

The choice between beam search and sampling depends on the application:

Recommended decoding strategies for different NLP applications. The choice depends on whether accuracy or diversity is more important for the task.
| Application | Recommended Method | Reasoning |
|-------------|--------------------|-----------|
| Machine Translation | Beam search (k=4-6) | Accuracy matters most; one correct translation needed |
| Summarization | Beam search (k=4) | Faithfulness to source is critical |
| Dialogue Systems | Sampling (top-p=0.9) | Diversity prevents repetitive responses |
| Creative Writing | Sampling (T=0.8-1.2) | Creativity and surprise are valued |
| Code Generation | Beam search or low-T sampling | Correctness is essential |

A Complete Beam Search Implementation

We've now covered all the key concepts: the basic expand-score-prune algorithm, log-space computation, length normalization, and diversity. Let's consolidate these into a production-ready implementation that you could use in real applications.

Design Decisions

A practical beam search decoder needs several features beyond the basic algorithm:

  1. Length normalization to fairly compare sequences of different lengths
  2. Minimum length constraints to prevent trivially short outputs
  3. Early stopping to save computation when enough good hypotheses are found
  4. N-best output to return multiple candidates for downstream reranking

The implementation below uses the length penalty formula from Google's Neural Machine Translation paper. Rather than dividing by $t^\alpha$ directly, it uses a smoothed version that prevents excessive penalty for very short sequences:

$$\text{lp}(t) = \left( \frac{5 + t}{6} \right)^\alpha$$

where:

  • $\text{lp}(t)$: the length penalty factor for a sequence of length $t$
  • $t$: the number of tokens in the sequence
  • $\alpha$: the length penalty exponent (same as before, typically 0.6 to 0.7)
  • The constants 5 and 6 are chosen so that $\text{lp}(1) = 1$ for any $\alpha$, providing a smooth baseline

This formula behaves similarly to $t^\alpha$ for longer sequences but is gentler on short sequences. For example, with $\alpha = 0.7$: a 2-token sequence gets penalty factor $\text{lp}(2) = (7/6)^{0.7} \approx 1.11$ instead of $2^{0.7} \approx 1.62$.
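
A few lines of arithmetic confirm this behavior across a range of lengths:

```python
# Compare the GNMT length penalty ((5 + t) / 6) ** alpha with the simple t ** alpha.
alpha = 0.7

for t in (1, 2, 5, 10, 20, 50):
    simple = t ** alpha
    gnmt = ((5 + t) / 6) ** alpha
    print(f"t = {t:3d}   simple = {simple:6.2f}   gnmt = {gnmt:6.2f}")

# At t = 2 the simple formula already penalizes by ~1.62x versus ~1.11x for GNMT;
# the two curves grow at increasingly similar rates as t increases.
```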

Out[20]:
Visualization
Line plot comparing two length penalty formulas, showing GNMT formula is less aggressive for short sequences.
Comparison of the Google NMT length penalty formula $((5+t)/6)^\alpha$ versus the simple formula $t^\alpha$ with α = 0.7. The GNMT formula is gentler on short sequences: at t=2, simple gives 1.62 while GNMT gives 1.11. The two curves grow at similar rates for longer sequences.
In[21]:
Code
from typing import Callable, Tuple


class BeamSearchDecoder:
    """
    A complete beam search decoder with length normalization
    and configurable stopping criteria.
    """

    def __init__(
        self,
        vocab_size: int,
        start_token: int,
        end_token: int,
        beam_width: int = 4,
        max_length: int = 50,
        length_penalty: float = 0.7,
        min_length: int = 1,
    ):
        self.vocab_size = vocab_size
        self.start_token = start_token
        self.end_token = end_token
        self.beam_width = beam_width
        self.max_length = max_length
        self.length_penalty = length_penalty
        self.min_length = min_length

    def _get_length_penalty(self, length: int) -> float:
        """Compute length penalty factor."""
        return ((5 + length) / 6) ** self.length_penalty

    def _normalize_score(self, score: float, length: int) -> float:
        """Apply length normalization to score."""
        return score / self._get_length_penalty(length)

    def decode(
        self,
        get_next_log_probs: Callable[[List[int]], np.ndarray],
        n_best: int = 1,
    ) -> List[Tuple[List[int], float]]:
        """
        Perform beam search decoding.

        Args:
            get_next_log_probs: Function that takes token sequence and
                               returns log probabilities for next token
            n_best: Number of best hypotheses to return

        Returns:
            List of (token_sequence, score) tuples
        """
        # Initialize beam with start token
        # Each hypothesis: (normalized_score, raw_score, tokens)
        beam = [(0.0, 0.0, [self.start_token])]
        completed = []

        for step in range(self.max_length):
            if not beam:
                break

            candidates = []

            for norm_score, raw_score, tokens in beam:
                # Get log probabilities for next token
                log_probs = get_next_log_probs(tokens)

                # Expand with all possible next tokens
                for token in range(self.vocab_size):
                    new_tokens = tokens + [token]
                    new_raw_score = raw_score + log_probs[token]
                    new_length = len(new_tokens)
                    new_norm_score = self._normalize_score(
                        new_raw_score, new_length
                    )

                    candidates.append(
                        (new_norm_score, new_raw_score, new_tokens)
                    )

            # Sort by normalized score and keep top k
            candidates.sort(key=lambda x: x[0], reverse=True)

            beam = []
            for norm_score, raw_score, tokens in candidates:
                if tokens[-1] == self.end_token:
                    # Only accept if minimum length reached
                    if len(tokens) >= self.min_length:
                        completed.append((norm_score, raw_score, tokens))
                elif len(beam) < self.beam_width:
                    beam.append((norm_score, raw_score, tokens))

                # Early stopping: enough completed hypotheses
                if len(completed) >= self.beam_width * 2:
                    break

        # Add remaining beam hypotheses to completed
        completed.extend(beam)

        # Sort by normalized score and return top n_best
        completed.sort(key=lambda x: x[0], reverse=True)

        return [
            (tokens, raw_score)
            for norm_score, raw_score, tokens in completed[:n_best]
        ]

Testing the Complete Decoder

Let's verify that our implementation works correctly with the mock language model:

In[22]:
Code
# Create decoder
decoder = BeamSearchDecoder(
    vocab_size=len(vocab),
    start_token=token_to_id["<START>"],
    end_token=token_to_id["<END>"],
    beam_width=4,
    max_length=10,
    length_penalty=0.7,
)


# Wrapper to convert probabilities to log probabilities
def get_log_probs(tokens):
    probs = mock_language_model(tokens)
    return np.log(probs + 1e-10)


# Decode
results = decoder.decode(get_log_probs, n_best=4)
Out[23]:
Console
Beam Search Results (with length normalization):
----------------------------------------------------------------------
1. Score: -4.137
   Sequence: <START> the cat sat on the cat sat on the cat

2. Score: -3.283
   Sequence: <START> the mat <END>

3. Score: -3.283
   Sequence: <START> the floor <END>

4. Score: -4.797
   Sequence: <START> the <END>

Understanding the output. The decoder returns tuples of (tokens, raw_score). The raw score is the unnormalized cumulative log probability, useful for comparing to other systems or for further processing. The ranking, however, is based on normalized scores internally.

Practical features demonstrated:

  • Length normalization: The Google NMT formula $((5 + t) / 6)^\alpha$ provides smoother normalization than simple $t^\alpha$, especially for short sequences where $t^\alpha$ can be overly aggressive.
  • Minimum length enforcement: Hypotheses ending with <END> before reaching min_length are rejected, preventing degenerate single-word outputs.
  • Early stopping: Once $2k$ hypotheses have completed, we stop scanning lower-ranked candidates at each step, since further search among them is unlikely to yield better results.
  • N-best output: Returning multiple hypotheses enables downstream reranking with additional models or features.

Limitations and Impact

Beam search has been the dominant decoding algorithm for neural machine translation and other sequence generation tasks since the mid-2010s. Its success stems from finding a practical balance between the exhaustive search (which guarantees finding the optimal sequence but is computationally infeasible) and greedy decoding (which is fast but often suboptimal).

However, beam search has notable limitations. The algorithm is inherently sequential, processing one timestep at a time, which limits parallelization on modern hardware. Each step requires computing probabilities for all vocabulary tokens across all beam hypotheses, making it slower than greedy decoding by a factor roughly equal to the beam width.

More fundamentally, beam search optimizes for probability, not quality. Research has shown that the highest-probability sequence according to a language model isn't always the best output by human judgment. This "probability-quality mismatch" manifests in several ways: beam search can produce generic, repetitive, or overly safe outputs. For open-ended generation tasks like dialogue or creative writing, sampling methods often produce more engaging text despite lower probabilities.

The length bias issue, while addressable through normalization, reveals a deeper problem: sequence-level probability isn't the right objective for many tasks. A sequence might be highly probable because it's short and generic, not because it's a good translation or summary. Length normalization is a heuristic fix rather than a principled solution.

Despite these limitations, beam search remains essential for tasks where accuracy matters more than diversity. Machine translation systems, summarization models, and speech recognition systems all rely heavily on beam search. The algorithm's deterministic nature is actually an advantage in these contexts: given the same input, you get the same output, making systems predictable and debuggable.

The impact of beam search extends beyond its direct applications. The algorithm established the paradigm of maintaining multiple hypotheses during decoding, which influenced later developments like diverse beam search, constrained decoding, and various reranking methods. Understanding beam search is foundational for working with any sequence generation system.

Summary

This chapter explored beam search as a practical algorithm for sequence generation that balances quality against computational cost.

The key concepts covered:

  • Greedy decoding limitations: Selecting the highest-probability token at each step can lead to globally suboptimal sequences because early decisions constrain later options.

  • Beam search algorithm: Maintains $k$ candidate hypotheses at each step, expanding all possibilities and pruning to keep the top $k$ by cumulative score. This explores multiple paths without the exponential cost of exhaustive search.

  • Log-space computation: Working with log probabilities converts multiplication to addition, preventing numerical underflow and improving computational efficiency.

  • Beam width selection: The beam width $k$ controls the exploration-computation trade-off. Values of 4-10 typically provide good results; larger beams offer diminishing returns.

  • Length normalization: Dividing scores by $t^\alpha$ (typically $\alpha = 0.6$ to $0.7$) corrects the inherent bias toward shorter sequences.

  • Diverse beam search: Partitioning the beam into groups with diversity penalties encourages exploration of structurally different hypotheses.

  • Beam search vs sampling: Beam search finds high-probability sequences (good for translation, summarization), while sampling methods introduce randomness for more diverse outputs (better for dialogue, creative writing).

The complete beam search procedure, given a model that computes $P(y_t \,|\, y_{1:t-1}, x)$:

  1. Initialize beam with <START> token
  2. For each timestep:
    • Expand each hypothesis with all vocabulary tokens

    • Score extensions using cumulative log probability:

      $\text{score}(y_{1:t}) = \sum_{i=1}^{t} \log P(y_i \,|\, y_{1:i-1}, x)$
    • Optionally apply length normalization:

      $\text{score\_norm}(y_{1:t}) = \frac{\text{score}(y_{1:t})}{t^\alpha}$
    • Keep top $k$ hypotheses by score

    • Move completed hypotheses (ending with <END>) to finished list

  3. Return highest-scoring completed hypothesis

In the next chapter, we'll explore attention mechanisms, which address the information bottleneck in encoder-decoder models by allowing the decoder to directly access all encoder states rather than relying on a single context vector.

Key Parameters

When implementing beam search, several parameters significantly impact the quality and efficiency of generated sequences:

  • beam_width (k): The number of hypotheses maintained at each decoding step. Values of 1 reduce to greedy decoding. Values of 4-6 work well for most translation tasks, providing a good balance between quality and speed. Larger values (10-20) may help for speech recognition where acoustic constraints narrow the search space. Increasing beam width improves quality with diminishing returns while linearly increasing computation.

  • max_length: The maximum number of tokens to generate before forcing termination. Set this based on your task's expected output length, typically 1.5-2× the average target length in your training data. Too short truncates valid outputs; too long wastes computation on degenerate sequences.

  • length_penalty ($\alpha$): Controls the strength of length normalization. Values of 0.6-0.7 work well for translation. Lower values (0.5) favor shorter outputs, useful for summarization. Higher values (0.8-1.0) favor longer outputs. Set to 0 to disable length normalization entirely.

  • min_length: The minimum number of tokens required before accepting a completed hypothesis. Prevents the decoder from generating trivially short outputs like single-word responses. Set based on task requirements, typically 5-10 for translation.

  • end_token: The token ID that signals sequence completion. Hypotheses ending with this token are moved to the completed list. Ensure this matches your tokenizer's end-of-sequence token.

  • n_best: The number of top hypotheses to return. Set to 1 for single-best output. Higher values (4-10) enable reranking with additional models or returning multiple alternatives to users.


