Search

Search articles

Constrained Decoding: Grammar-Guided Generation for Structured LLM Output

Michael BrenndoerferUpdated July 31, 202542 min read

Learn how constrained decoding forces language models to generate valid JSON, SQL, and regex-matching text through token masking and grammar-guided generation.

Track your reading progress

Sign in to mark chapters as read and track your learning journey

Sign in →
Reading Level

Choose your expertise level to adjust how many terms are explained. Beginners see more tooltips, experts see fewer to maintain reading flow. Hover over underlined terms for instant definitions.

Constrained Decoding

Language models generate text by sampling from a probability distribution over the vocabulary at each step. This works well for creative writing, conversational responses, and open-ended generation. But what happens when you need the model to produce structured output, like valid JSON, SQL queries, or text matching a specific pattern? Standard sampling offers no guarantees. The model might produce syntactically invalid JSON, hallucinate field names, or ignore your carefully specified schema.

Constrained decoding solves this problem by modifying the generation process itself. Instead of allowing the model to choose freely from all tokens at each step, we restrict it to tokens that keep the output on a valid path. The model still leverages its learned distributions, but we guide it toward outputs that satisfy our structural requirements. This technique has become essential for building reliable AI systems that integrate with existing software infrastructure.

Why Unconstrained Generation Fails

Before diving into solutions, let's understand why asking a model nicely to follow a format often doesn't work. Language models learn patterns from training data, including the patterns of JSON, XML, and other structured formats. They can often produce valid structured output. But "often" isn't good enough for production systems.

Consider a simple scenario: you want the model to output a JSON object with specific fields. Even with careful prompting, several failure modes emerge:

  • Syntax errors: Missing commas, unmatched brackets, trailing commas that violate strict JSON
  • Schema violations: Extra fields the model invented, missing required fields, wrong data types
  • Encoding issues: Unescaped special characters, invalid Unicode sequences
  • Format drift: The model starts with JSON but drifts into natural language explanation
In[3]:
Code
import json

# Examples of common generation failures when asking for structured output
failed_outputs = [
    # Missing closing brace
    '{"name": "Alice", "age": 30',
    # Extra field not in schema
    '{"name": "Bob", "age": 25, "favorite_color": "blue"}',
    # Wrong type (string instead of number)
    '{"name": "Carol", "age": "twenty-five"}',
    # Trailing comma (invalid JSON)
    '{"name": "Dave", "age": 35,}',
    # Mixed format - starts JSON, ends with explanation
    '{"name": "Eve"} Note: I only included the name field since...',
]


def validate_json(text: str) -> tuple[bool, str]:
    """Check if text is valid JSON and return status with message."""
    try:
        json.loads(text)
        return True, "Valid JSON"
    except json.JSONDecodeError as e:
        return False, f"Invalid: {e.msg}"
Out[4]:
Console
Validation results for common model outputs:

✗ Invalid: Expecting ',' delimiter
  Output: {"name": "Alice", "age": 30

✓ Valid JSON
  Output: {"name": "Bob", "age": 25, "favorite_color": "blue...

✓ Valid JSON
  Output: {"name": "Carol", "age": "twenty-five"}

✗ Invalid: Expecting property name enclosed in double quotes
  Output: {"name": "Dave", "age": 35,}

✗ Invalid: Extra data
  Output: {"name": "Eve"} Note: I only included the name fie...

These failures aren't bugs in the model. They reflect the fundamental nature of autoregressive generation: each token is sampled independently given the context, with no mechanism to enforce global constraints. The model doesn't "know" it needs to close a brace three tokens from now. Constrained decoding adds that knowledge.

Out[5]:
Visualization
Common JSON generation failures. Most attempts produce syntactically invalid output that cannot be parsed.
Common JSON generation failures. Most attempts produce syntactically invalid output that cannot be parsed.

The Core Idea: Token Masking

To understand constrained decoding, we need to think carefully about what happens at each generation step. A language model produces a probability distribution over its entire vocabulary, perhaps 50,000 tokens or more. From this distribution, we sample a single token to continue the sequence. The challenge is that the model has no built-in knowledge of our structural requirements. It might assign high probability to tokens that would break our desired format.

The solution is simple: before sampling, we modify the distribution to exclude impossible choices. This is called token masking, and it forms the foundation of all constrained decoding techniques.

Building Intuition

Consider a concrete scenario. You're generating JSON, and so far you've produced {"name": "Alice". What tokens could come next in valid JSON?

  • A comma , to add another field
  • A closing brace } to end the object

What tokens would break the JSON?

  • A letter like a or b (no string delimiter)
  • A number without a preceding colon
  • Another opening brace { (not a valid position)

The model might assign probability to any of these tokens based on patterns it learned during training. Our job is to intercept the sampling process and remove the invalid options before the model can choose them.

Token Masking

Token masking modifies the probability distribution at each generation step by setting the probability of invalid tokens to zero (or negative infinity in log-space). The remaining valid tokens are renormalized to form a proper probability distribution for sampling.

The Constrained Generation Process

At each step tt of generation, given the tokens produced so far (x1,,xt1)(x_1, \ldots, x_{t-1}), we follow a four-step process:

  1. Query the model: Compute the probability distribution P(xtx1,,xt1)P(x_t | x_1, \ldots, x_{t-1}) over the full vocabulary
  2. Evaluate constraints: Determine which tokens are valid given our structural requirements and the current partial output
  3. Mask invalid tokens: Set the probability of every invalid token to zero
  4. Renormalize and sample: Adjust the remaining probabilities to sum to 1, then sample

The critical question is: how do we "adjust" the probabilities in step 4? We can't simply zero out invalid tokens and sample, because the remaining probabilities no longer form a valid distribution (they don't sum to 1). We need a principled approach.

The Mathematical Formulation

Let VvalidV_{\text{valid}} denote the set of tokens that satisfy our constraint given the current context. We want to construct a new probability distribution, PconstrainedP_{\text{constrained}}, that:

  • Assigns zero probability to invalid tokens
  • Preserves the relative ranking among valid tokens
  • Forms a proper probability distribution (sums to 1)

The solution is to divide each valid token's probability by the total probability mass of all valid tokens:

Pconstrained(xt)={P(xt)xVvalidP(x)if xtVvalid0otherwiseP_{\text{constrained}}(x_t) = \begin{cases} \frac{P(x_t)}{\sum_{x \in V_{\text{valid}}} P(x)} & \text{if } x_t \in V_{\text{valid}} \\ 0 & \text{otherwise} \end{cases}

where:

  • xtx_t: the candidate token at generation step tt
  • P(xt)P(x_t): the original probability assigned by the model to token xtx_t
  • VvalidV_{\text{valid}}: the set of tokens that satisfy the constraint given the current context
  • xVvalidP(x)\sum_{x \in V_{\text{valid}}} P(x): the sum of probabilities over all valid tokens, serving as the normalization constant

Why This Formula Works

The formula accomplishes exactly what we need through a technique called renormalization. Let's trace through why each component matters.

The numerator P(xt)P(x_t) preserves the model's original preference for token xtx_t. If the model strongly preferred this token before masking, it will still be favored afterward.

The denominator xVvalidP(x)\sum_{x \in V_{\text{valid}}} P(x) is the total probability mass that was originally assigned to all valid tokens. By dividing by this sum, we scale up the valid probabilities so they sum to 1.

The conditional structure ensures invalid tokens get exactly zero probability. They cannot be sampled, period.

Consider a worked example. Suppose the model produces these probabilities:

Token probabilities before constraint masking. Valid tokens account for 40% of probability mass.
TokenOriginal P(x)P(x)Valid?
,0.30
}0.10
a0.25
{0.15
other0.20

The valid tokens (, and }) together have probability mass 0.30+0.10=0.400.30 + 0.10 = 0.40. The invalid tokens have mass 0.600.60.

After applying the constrained formula:

  • Pconstrained(comma)=0.300.40=0.75P_{\text{constrained}}(\text{comma}) = \frac{0.30}{0.40} = 0.75
  • Pconstrained(brace)=0.100.40=0.25P_{\text{constrained}}(\text{brace}) = \frac{0.10}{0.40} = 0.25
  • Pconstrained(a)=0P_{\text{constrained}}(\text{a}) = 0, and so on for all invalid tokens

Notice that the 3:1 ratio between , and } is preserved (0.30:0.10 becomes 0.75:0.25). The model originally preferred comma over closing brace by this ratio, and constrained decoding respects that preference. We've filtered out the impossible options without overriding the model's learned intuitions about likely continuations.

Out[6]:
Visualization
Original model distribution. Invalid tokens receive significant probability mass.
Original model distribution. Invalid tokens receive significant probability mass.
Constrained distribution. Only valid tokens remain, with preserved relative ratios.
Constrained distribution. Only valid tokens remain, with preserved relative ratios.

Preserving Model Intelligence

This is the key insight that makes constrained decoding powerful. We're not overriding the model's preferences. We're filtering them. The model brings its understanding of language, context, and likely continuations. The constraint brings the structural requirements. Together, they produce output that is both fluent (because the model guides token selection among valid options) and valid (because invalid options are impossible to select).

If the model assigns probability 0.3 to token A and 0.1 to token B, and both are valid, the constrained distribution preserves this 3:1 ratio. Token A remains three times more likely than token B. We only intervene when the model would choose something structurally invalid.

Grammar-Guided Generation

The most powerful form of constrained decoding uses formal grammars to specify valid outputs. A grammar defines the structure of valid strings, and we use it to compute valid tokens at each step.

Context-Free Grammars

Context-free grammars (CFGs) can express the syntax of most structured formats: JSON, XML, SQL, programming languages, and more. A grammar consists of production rules that define how to expand non-terminal symbols into terminals (actual tokens) and other non-terminals.

In[7]:
Code
# A simple grammar for arithmetic expressions
# E -> E + T | E - T | T
# T -> T * F | T / F | F
# F -> ( E ) | number

# For constrained decoding, we need to track parser state
class SimpleArithmeticParser:
    """
    Track valid next tokens for simple arithmetic expressions.

    This is a simplified example - real implementations use
    more sophisticated parsing techniques.
    """

    def __init__(self):
        self.stack = ["E"]  # Start with expression
        self.depth = 0  # Parenthesis depth

    def valid_next_tokens(self, current_text: str) -> set[str]:
        """Determine valid next tokens based on current parse state."""
        # Simplified logic for demonstration
        if not current_text:
            return {"(", "0", "1", "2", "3", "4", "5", "6", "7", "8", "9"}

        last_char = current_text[-1]

        if last_char.isdigit():
            # After a number: operators, closing paren, or more digits
            valid = {
                "+",
                "-",
                "*",
                "/",
                "0",
                "1",
                "2",
                "3",
                "4",
                "5",
                "6",
                "7",
                "8",
                "9",
            }
            if self.depth > 0:
                valid.add(")")
            return valid

        elif last_char in "+-*/":
            # After operator: number or opening paren
            return {"(", "0", "1", "2", "3", "4", "5", "6", "7", "8", "9"}

        elif last_char == "(":
            self.depth += 1
            return {"(", "0", "1", "2", "3", "4", "5", "6", "7", "8", "9"}

        elif last_char == ")":
            self.depth -= 1
            valid = {"+", "-", "*", "/"}
            if self.depth > 0:
                valid.add(")")
            return valid

        return set()
Out[8]:
Console
Valid next tokens at each step of expression building:

After (empty)      → valid: ['(', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
After "3"          → valid: ['*', '+', '-', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
After "3+"         → valid: ['(', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
After "3+4"        → valid: ['*', '+', '-', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
After "("          → valid: ['(', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
After "(2"         → valid: [')', '*', '+', '-', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
After "(2+3"       → valid: [')', '*', '+', '-', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9']

The parser maintains state as tokens are generated, computing valid continuations at each step. Real grammar-guided generation uses more sophisticated parsing algorithms, but the core principle remains: track the grammar state and filter tokens accordingly.

Out[9]:
Visualization
Valid token count at each position during arithmetic expression parsing. The number of valid choices varies based on grammar state and context.
Valid token count at each position during arithmetic expression parsing. The number of valid choices varies based on grammar state and context.

Implementing Grammar Constraints

For practical grammar-guided generation, we need to map between the language model's tokens (often subwords) and the grammar's terminals. This mapping is non-trivial because a single grammar terminal might span multiple model tokens, and vice versa.

In[10]:
Code
import numpy as np


class GrammarConstrainedSampler:
    """
    Apply grammar constraints during text generation.

    This simplified implementation assumes character-level tokenization.
    Real implementations must handle subword tokenization carefully.
    """

    def __init__(self, grammar_valid_fn):
        """
        Initialize with a function that returns valid next tokens.

        Args:
            grammar_valid_fn: Function(current_text) -> set of valid tokens
        """
        self.grammar_valid_fn = grammar_valid_fn

    def constrained_sample(
        self,
        logits: np.ndarray,
        current_text: str,
        token_to_char: dict[int, str],
    ) -> int:
        """
        Sample from logits with grammar constraints.

        Args:
            logits: Raw model logits over vocabulary
            current_text: Text generated so far
            token_to_char: Mapping from token IDs to characters

        Returns:
            Selected token ID
        """
        # Get valid tokens from grammar
        valid_chars = self.grammar_valid_fn(current_text)

        # Create mask for valid token IDs
        valid_token_ids = {
            tid for tid, char in token_to_char.items() if char in valid_chars
        }

        # Apply mask: set invalid logits to -infinity
        masked_logits = logits.copy()
        for tid in range(len(logits)):
            if tid not in valid_token_ids:
                masked_logits[tid] = float("-inf")

        # Convert to probabilities and sample
        probs = self._softmax(masked_logits)
        return np.random.choice(len(probs), p=probs)

    def _softmax(self, x: np.ndarray) -> np.ndarray:
        """Compute softmax probabilities."""
        exp_x = np.exp(x - np.max(x))
        return exp_x / exp_x.sum()
Out[11]:
Console
Constrained sampling in action:

Scenario: Model logits favor ')' but grammar requires number/operator

Current text: '1+'
Valid next tokens: {'9', '1', '4', '5', '6', '2', '8', '3', '(', '7', '0'}
Model logits: {'0': np.float64(1.0), '1': np.float64(1.5), '2': np.float64(1.2), '+': np.float64(0.5), '-': np.float64(0.3), '*': np.float64(0.2), '(': np.float64(0.8), ')': np.float64(3.0)}

Sampled distribution over 100 trials:
  '1': 31%
  '0': 29%
  '2': 25%
  '(': 15%

Notice how the sampler never outputs ')' even though the model assigned it the highest logit. The constraint prevents invalid tokens from being selected, while the relative preferences among valid tokens are preserved.

JSON Schema Constraints

JSON is the most common structured format for LLM outputs. Schema-constrained generation ensures the model produces valid JSON that conforms to a specified schema.

From Schema to Grammar

A JSON schema can be converted into a grammar that describes all valid JSON documents matching that schema. This grammar then guides generation token by token.

In[12]:
Code
from dataclasses import dataclass
from typing import Any


@dataclass
class JSONSchema:
    """Simplified JSON schema representation."""

    type: str
    properties: dict[str, "JSONSchema"] = None
    required: list[str] = None
    items: "JSONSchema" = None
    enum: list[Any] = None


def schema_to_valid_prefixes(schema: JSONSchema, current: str) -> set[str]:
    """
    Determine valid next characters given schema and current output.

    This is a simplified demonstration. Production implementations
    use finite state machines compiled from the schema.
    """
    current = current.strip()

    if schema.type == "object":
        if not current:
            return {"{"}
        if current == "{":
            if schema.required:
                return {'"'}  # Must start a key
            return {'"', "}"}  # Key or empty object
        # ... more cases for object parsing

    elif schema.type == "string":
        if not current:
            return {'"'}
        if current.startswith('"') and not current.endswith('"'):
            # Inside string - allow any character (simplified)
            return set(
                'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 "'
            )
        if (
            current.startswith('"')
            and current.endswith('"')
            and len(current) > 1
        ):
            return set()  # Complete

    elif schema.type == "number":
        if not current:
            return set("-0123456789")
        if all(c in "-0123456789." for c in current):
            return set("0123456789.")

    elif schema.type == "boolean":
        for val in ["true", "false"]:
            if val.startswith(current):
                if len(current) < len(val):
                    return {val[len(current)]}
                return set()

    return set()
Out[13]:
Console
Valid continuations for a User schema (name: string, age: number):

(empty)              → next chars: ['{']
"{"                  → next chars: ['"']
"{""                 → next chars: ['a', 'b', 'c', 'd', 'e']...
"{"name"             → next chars: ['"', 'a', 'b', 'c']
"{"name""            → next chars: [':']
"{"name":"           → next chars: [' ', '"']

The Outlines Library

For production use, the Outlines library provides efficient JSON schema constraints. It compiles schemas into finite state machines that run in parallel with generation.

In[14]:
Code
# Example of how Outlines-style generation works conceptually
# (Actual Outlines usage requires installing the library)


def compile_schema_to_fsm(schema: dict) -> dict:
    """
    Compile a JSON schema to a finite state machine.

    Returns a mapping from (state, token) -> next_state.
    This is a simplified conceptual demonstration.
    """
    # In practice, this would build a complete FSM
    # tracking position in the JSON structure
    states = {
        "start": {"valid_tokens": ["{"], "next": "object_open"},
        "object_open": {"valid_tokens": ['"'], "next": "key_start"},
        "key_start": {
            "valid_tokens": list("abcdefghijklmnopqrstuvwxyz"),
            "next": "in_key",
        },
        "in_key": {
            "valid_tokens": list('abcdefghijklmnopqrstuvwxyz"'),
            "next": "key_end_or_continue",
        },
        # ... many more states
    }
    return states


# The FSM approach is much faster than re-parsing at each step
# because state transitions are precomputed
Out[15]:
Console
FSM-based schema validation (conceptual):

State transitions for JSON object generation:
  start         --{-->  object_open
  object_open   --"-->  key_start
  key_start     --[a-z]-->  in_key
  in_key        --"-->  after_key
  after_key     --:-->  before_value
  before_value  --{-->  nested_object (recursive)
  before_value  --"-->  string_value
  before_value  --[0-9]-->  number_value

Each state knows exactly which tokens are valid next.

The FSM approach is crucial for performance. Instead of parsing the entire output at each step, we simply look up the current state and get the valid tokens in constant time.

Regex Constraints

Regular expressions provide a flexible way to constrain generation when you need pattern matching without full grammar complexity. Common use cases include:

  • Phone numbers: \d{3}-\d{3}-\d{4}
  • Email addresses: [a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}
  • Dates: \d{4}-\d{2}-\d{2}
  • Custom identifiers: [A-Z]{2}\d{6}

Regex to DFA Conversion

Regular expressions can be converted to deterministic finite automata (DFAs), which then guide generation exactly like grammar-based FSMs.

In[16]:
Code
import re


class RegexConstraint:
    """
    Constrain generation to match a regular expression.

    Uses a simplified simulation approach for demonstration.
    Real implementations compile to DFA for efficiency.
    """

    def __init__(self, pattern: str):
        self.pattern = pattern
        self.regex = re.compile(f"^{pattern}$")
        self.partial_regex = re.compile(f"^{pattern[:]}")

    def get_valid_tokens(
        self, current: str, vocabulary: list[str]
    ) -> list[str]:
        """
        Find tokens that could extend current string toward a match.

        Args:
            current: String generated so far
            vocabulary: List of possible next tokens

        Returns:
            List of valid next tokens
        """
        valid = []
        for token in vocabulary:
            candidate = current + token
            # Check if candidate could be prefix of a valid match
            if self._could_match(candidate):
                valid.append(token)
        return valid

    def _could_match(self, text: str) -> bool:
        """Check if text could be extended to match the pattern."""
        # Simplified check: see if any extension could work
        # Real implementations use DFA state tracking
        for extension in ["", "a", "0", "@", ".", "-"]:
            test = text + extension * 20  # Pad with potential matches
            if self.regex.match(test):
                return True
        # Also check if text itself is a valid prefix
        # by trying the pattern with .* appended conceptually
        try:
            partial = re.compile(f"^{re.escape(text)}")
            # If we can find any completion that matches, text is valid
            return True  # Simplified - assume prefixes are valid
        except re.error:
            return False
Out[17]:
Console
Regex constraint: \d{3}-\d{3}-\d{4}

Valid next characters at each step:

(empty)              → valid: ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '-']
"5"                  → valid: ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '-']
"55"                 → valid: ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '-']
"555"                → valid: ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '-']
"555-"               → valid: ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '-']
"555-12"             → valid: ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '-']
"555-123-"           → valid: ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '-']
"555-123-456"        → valid: ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '-']

The regex constraint ensures that partial outputs always remain on a path toward completing the pattern. Once you've generated "555-", only digits are valid for the next three positions, followed by a mandatory hyphen, then four more digits.

Out[18]:
Visualization
Regex constraint for phone number generation. At each position, the constraint specifies exactly which characters are valid based on the pattern.
Regex constraint for phone number generation. At each position, the constraint specifies exactly which characters are valid based on the pattern.

Combining Regex with Semantic Guidance

A powerful technique combines regex constraints with the model's semantic knowledge. The regex ensures format correctness while the model contributes meaningful content.

In[19]:
Code
import numpy as np


def generate_with_regex_constraint(
    model_probs: dict[str, float],
    current: str,
    regex_pattern: str,
    vocabulary: list[str],
) -> str:
    """
    Sample next token using both model preferences and regex constraints.

    Args:
        model_probs: Model's probability distribution over tokens
        current: Generated string so far
        regex_pattern: Pattern the final output must match
        vocabulary: List of possible tokens

    Returns:
        Selected token
    """
    constraint = RegexConstraint(regex_pattern)
    valid_tokens = constraint.get_valid_tokens(current, vocabulary)

    if not valid_tokens:
        raise ValueError(f"No valid continuation from '{current}'")

    # Filter model probs to valid tokens only
    filtered_probs = {t: model_probs.get(t, 0) for t in valid_tokens}
    total = sum(filtered_probs.values())

    if total == 0:
        # Uniform over valid tokens if model gives no probability
        return np.random.choice(valid_tokens)

    # Normalize and sample
    normalized = {t: p / total for t, p in filtered_probs.items()}
    tokens, probs = zip(*normalized.items())
    return np.random.choice(tokens, p=probs)
Out[20]:
Console
Model digit preferences (from training data):
  '5': 15%
  '1': 12%
  '7': 11%
  '2': 10%
  '6': 10%

Generating phone number with constraint + model semantics:

Generated: 397511085609

Note: Digits reflect model preferences, hyphens are structurally required.

Beam search maintains multiple candidate sequences to find higher-quality outputs. Adding constraints to beam search requires filtering candidates at each step, keeping only beams that remain on valid paths.

In standard beam search, we expand each beam with the top-k tokens by probability. In constrained beam search, we first filter to valid tokens, then take the top-k among those.

Beam search tracks cumulative log-probabilities rather than raw probabilities. For a sequence of tokens (x1,x2,,xn)(x_1, x_2, \ldots, x_n), the log-probability is:

logP(x1,,xn)=i=1nlogP(xix1,,xi1)\log P(x_1, \ldots, x_n) = \sum_{i=1}^{n} \log P(x_i | x_1, \ldots, x_{i-1})

where:

  • logP(x1,,xn)\log P(x_1, \ldots, x_n): the log-probability of the entire sequence
  • P(xix1,,xi1)P(x_i | x_1, \ldots, x_{i-1}): the conditional probability of token xix_i given all preceding tokens
  • i=1n\sum_{i=1}^{n}: summation over all nn tokens in the sequence

We use log-probabilities because multiplying many small probabilities quickly leads to numerical underflow. Adding log-probabilities is numerically stable and mathematically equivalent to multiplying the original probabilities.

In[21]:
Code
import numpy as np


@dataclass
class Beam:
    """A single beam in constrained beam search."""

    tokens: list[str]
    log_prob: float
    state: Any  # Grammar/FSM state for constraint tracking

    @property
    def text(self) -> str:
        return "".join(self.tokens)


def constrained_beam_search(
    get_token_probs,  # Function returning token probabilities
    constraint_fn,  # Function(text) -> set of valid tokens
    beam_width: int = 3,
    max_length: int = 20,
    vocabulary: list[str] = None,
) -> list[Beam]:
    """
    Beam search with constraint filtering at each step.

    Args:
        get_token_probs: Returns probability distribution for next token
        constraint_fn: Returns valid tokens given current text
        beam_width: Number of beams to maintain
        max_length: Maximum generation length
        vocabulary: Token vocabulary

    Returns:
        Top beams after generation
    """
    if vocabulary is None:
        vocabulary = list("abcdefghijklmnopqrstuvwxyz0123456789-_. ")

    # Initialize with empty beam
    beams = [Beam(tokens=[], log_prob=0.0, state=None)]

    for step in range(max_length):
        all_candidates = []

        for beam in beams:
            # Get valid next tokens
            valid_tokens = constraint_fn(beam.text)
            if not valid_tokens:
                continue  # Dead end

            # Get model probabilities
            probs = get_token_probs(beam.text)

            # Expand beam with valid tokens only
            for token in valid_tokens:
                if token in probs:
                    new_log_prob = beam.log_prob + np.log(probs[token] + 1e-10)
                    new_beam = Beam(
                        tokens=beam.tokens + [token],
                        log_prob=new_log_prob,
                        state=None,
                    )
                    all_candidates.append(new_beam)

        if not all_candidates:
            break

        # Keep top beams by log probability
        all_candidates.sort(key=lambda b: b.log_prob, reverse=True)
        beams = all_candidates[:beam_width]

    return beams
Out[22]:
Console
Constrained beam search for email prefix generation:

Constraint: Only [a-z0-9.] allowed, max 8 characters

Top beams found:
  1. 'oooooooo' (log-prob: -21.73)
  2. 'oooooooi' (log-prob: -21.73)
  3. 'oooooooe' (log-prob: -21.73)

All three beams contain only characters from the valid set [a-z0-9.], demonstrating that the constraint successfully filtered out any invalid tokens. The top beam "oooooooo" has the highest log-probability because our mock model assigns high weight to vowels. The log-probabilities are negative because we're summing log-probabilities of tokens with probability less than 1.

Out[23]:
Visualization
Beam search log-probability progression. Each beam accumulates log-probability at each step, with lower (more negative) scores being worse. Beams are pruned based on cumulative score.
Beam search log-probability progression. Each beam accumulates log-probability at each step, with lower (more negative) scores being worse. Beams are pruned based on cumulative score.

The constrained beam search explores multiple valid paths simultaneously. Even if the model's top choice at some step is invalid, the beam search can recover by following the next-best valid option. This makes it more robust than greedy constrained decoding.

Handling Constraint Conflicts

Sometimes constraints conflict with what the model has learned, leading to low-probability valid tokens. The model might strongly prefer tokens that violate the constraint, leaving only tokens it considers unlikely. This can produce disfluent or unnatural text.

In[24]:
Code
def analyze_constraint_difficulty(
    model_probs: dict[str, float], valid_tokens: set[str]
) -> dict:
    """
    Analyze how much the constraint fights the model's preferences.

    Returns metrics about the constraint's impact on generation.
    """
    valid_prob_mass = sum(model_probs.get(t, 0) for t in valid_tokens)
    invalid_prob_mass = 1.0 - valid_prob_mass

    # Top token analysis
    sorted_tokens = sorted(model_probs.items(), key=lambda x: -x[1])
    top_token, top_prob = sorted_tokens[0]
    top_is_valid = top_token in valid_tokens

    # Find best valid token
    valid_probs = [(t, model_probs.get(t, 0)) for t in valid_tokens]
    valid_probs.sort(key=lambda x: -x[1])
    best_valid = valid_probs[0] if valid_probs else (None, 0)

    return {
        "valid_mass": valid_prob_mass,
        "invalid_mass": invalid_prob_mass,
        "top_token": top_token,
        "top_prob": top_prob,
        "top_is_valid": top_is_valid,
        "best_valid_token": best_valid[0],
        "best_valid_prob": best_valid[1],
        "constraint_difficulty": invalid_prob_mass / (valid_prob_mass + 1e-10),
    }
Out[25]:
Console
Constraint difficulty analysis:

Easy constraint (model agrees):
  Valid tokens: {'c', 'a', 'b'}
  Valid probability mass: 75%
  Top choice 'a' is valid: True
  Difficulty score: 0.33

Hard constraint (model disagrees):
  Valid tokens: {'c', 'a', 'b'}
  Valid probability mass: 15%
  Top choice 'x' is valid: False
  Best valid token: 'a' (8%)
  Difficulty score: 5.67

High difficulty scores indicate the constraint forces low-probability tokens.

The easy constraint has a difficulty score near 0.33, indicating the model naturally prefers valid tokens. Only 25% of the probability mass goes to invalid options. In contrast, the hard constraint has a difficulty score of 5.67, meaning the model strongly prefers invalid tokens. When 85% of probability mass goes to tokens we must reject, the constrained output may feel forced or unnatural.

Out[26]:
Visualization
Constraint difficulty comparison. Easy constraints allow the model to use its preferred tokens, while hard constraints force selection from low-probability options.
Constraint difficulty comparison. Easy constraints allow the model to use its preferred tokens, while hard constraints force selection from low-probability options.

When constraint difficulty is high, consider whether the constraint is too restrictive or whether the task requires a model fine-tuned on the target format.

Constrained Sampling Strategies

Beyond masking invalid tokens, we can combine constraints with sampling strategies like temperature, top-k, and nucleus sampling.

Temperature and Constraints

Temperature scaling adjusts the "sharpness" of the distribution. With constraints, we apply temperature before masking to maintain proper probability ratios among valid tokens.

In[27]:
Code
import numpy as np


def constrained_sample_with_temperature(
    logits: np.ndarray, valid_token_ids: set[int], temperature: float = 1.0
) -> int:
    """
    Sample with both temperature and constraints.

    Args:
        logits: Raw model logits
        valid_token_ids: Set of valid token indices
        temperature: Sampling temperature (>1 = more random, <1 = more deterministic)

    Returns:
        Sampled token ID
    """
    # Apply temperature first
    scaled_logits = logits / temperature

    # Mask invalid tokens
    masked_logits = np.full_like(scaled_logits, float("-inf"))
    for tid in valid_token_ids:
        masked_logits[tid] = scaled_logits[tid]

    # Convert to probabilities
    max_logit = np.max(masked_logits[masked_logits > float("-inf")])
    exp_logits = np.exp(masked_logits - max_logit)
    probs = exp_logits / exp_logits.sum()

    return np.random.choice(len(probs), p=probs)
Out[28]:
Console
Effect of temperature on constrained sampling:

Valid tokens: a, b, c, d, e (X is excluded)
Original logits: a=2.0, b=1.5, c=1.0, d=0.5, X=3.5(invalid), e=0.2

Temperature 0.5:
  a: ████████████████████████████████ 65.0%
  b: ██████████ 21.0%
  c: ████ 9.9%
  d: █ 2.5%
  e:  1.6%

Temperature 1.0:
  a: ████████████████████ 40.5%
  b: ████████████ 25.8%
  c: ████████ 17.0%
  d: ████ 8.7%
  e: ████ 8.0%

Temperature 2.0:
  a: ███████████████ 30.8%
  b: ████████████ 24.2%
  c: ████████ 17.6%
  d: ███████ 14.8%
  e: ██████ 12.6%

Lower temperatures make the sampler more deterministic within the valid set, while higher temperatures spread probability more evenly across valid options. The constraint always prevents invalid tokens regardless of temperature.

Out[29]:
Visualization
Effect of temperature on constrained sampling. Lower temperatures concentrate probability on the highest-probability valid token, while higher temperatures spread probability more evenly.
Effect of temperature on constrained sampling. Lower temperatures concentrate probability on the highest-probability valid token, while higher temperatures spread probability more evenly.

Combining Top-k with Constraints

Top-k sampling and constraints interact in an important way: we should apply constraints first, then take top-k among valid tokens.

In[30]:
Code
import numpy as np


def constrained_top_k_sample(
    logits: np.ndarray,
    valid_token_ids: set[int],
    k: int = 5,
    temperature: float = 1.0,
) -> int:
    """
    Top-k sampling within constraint-valid tokens.

    Args:
        logits: Raw model logits
        valid_token_ids: Set of valid token indices
        k: Number of top tokens to consider
        temperature: Sampling temperature

    Returns:
        Sampled token ID
    """
    # First, mask to valid tokens only
    valid_logits = []
    valid_ids_list = []
    for tid in valid_token_ids:
        valid_logits.append(logits[tid])
        valid_ids_list.append(tid)

    valid_logits = np.array(valid_logits)

    # Apply temperature
    valid_logits = valid_logits / temperature

    # Get top-k among valid tokens
    if len(valid_logits) > k:
        top_k_indices = np.argsort(valid_logits)[-k:]
        top_k_logits = valid_logits[top_k_indices]
        top_k_ids = [valid_ids_list[i] for i in top_k_indices]
    else:
        top_k_logits = valid_logits
        top_k_ids = valid_ids_list

    # Convert to probabilities and sample
    exp_logits = np.exp(top_k_logits - np.max(top_k_logits))
    probs = exp_logits / exp_logits.sum()

    selected_idx = np.random.choice(len(top_k_ids), p=probs)
    return top_k_ids[selected_idx]
Out[31]:
Console
Top-k within constrained tokens:

Logits: run=2.5, walk=2.0, jump=1.8, skip=1.5, hop=1.0, crawl=0.5, roll=0.2
Invalid: FLY=5.0 (highest logit but excluded)

With k=3, sampling only from top-3 valid tokens:

  run: 46.6%
  walk: 28.4%
  jump: 25.0%

Only 'run', 'walk', 'jump' are sampled (top-3 among valid).

A Worked Example: Generating Valid SQL

Let's work through a complete example of constrained decoding for SQL generation. SQL has strict syntax rules that make it an ideal candidate for grammar-guided generation.

Defining the Constraint

We'll generate simple SELECT statements with a constraint that ensures valid syntax.

In[32]:
Code
from typing import Optional


class SimpleSQLConstraint:
    """
    Constraint for simple SQL SELECT statements.

    Supports: SELECT <columns> FROM <table> WHERE <condition>
    """

    KEYWORDS = {"SELECT", "FROM", "WHERE", "AND", "OR"}
    OPERATORS = {"=", ">", "<", ">=", "<=", "!="}

    def __init__(
        self, valid_tables: list[str], valid_columns: dict[str, list[str]]
    ):
        """
        Initialize with schema information.

        Args:
            valid_tables: List of valid table names
            valid_columns: Mapping from table name to list of column names
        """
        self.valid_tables = set(valid_tables)
        self.valid_columns = valid_columns

    def get_valid_tokens(self, current_sql: str) -> set[str]:
        """
        Determine valid next tokens based on SQL parse state.
        """
        tokens = current_sql.upper().split()

        if not tokens:
            return {"SELECT"}

        last_token = tokens[-1]

        if last_token == "SELECT":
            # After SELECT: column names or *
            all_columns = set()
            for cols in self.valid_columns.values():
                all_columns.update(cols)
            return all_columns | {"*"}

        elif (
            last_token in self.valid_columns.get(self._find_table(tokens), [])
            or last_token == "*"
        ):
            # After column: comma for more columns, or FROM
            return {",", "FROM"}

        elif last_token == ",":
            # After comma: another column
            table = self._find_table(tokens)
            if table:
                return set(self.valid_columns.get(table, []))
            return set()

        elif last_token == "FROM":
            # After FROM: table name
            return self.valid_tables

        elif last_token in self.valid_tables:
            # After table: WHERE or end
            return {"WHERE", "<END>"}

        elif last_token == "WHERE":
            # After WHERE: column for condition
            table = self._find_table(tokens)
            return set(self.valid_columns.get(table, []))

        elif last_token in self.OPERATORS:
            # After operator: value (simplified to numbers/quoted strings)
            return {"<VALUE>"}

        return set()

    def _find_table(self, tokens: list[str]) -> Optional[str]:
        """Find the table name in the token list."""
        try:
            from_idx = tokens.index("FROM")
            if from_idx + 1 < len(tokens):
                return tokens[from_idx + 1]
        except ValueError:
            pass
        return None
Out[33]:
Console
SQL constraint walkthrough:

Schema: users(id, name, email, age), orders(id, user_id, product, amount)

(empty)
  → valid next: ['SELECT']

"SELECT"
  → valid next: ['*', 'age', 'amount', 'id', 'user_id', '...']

"SELECT name"
  → valid next: []

"SELECT name,"
  → valid next: []

"SELECT name, email"
  → valid next: []

"SELECT name, email FROM"
  → valid next: ['orders', 'users']

"SELECT name, email FROM users"
  → valid next: []

"SELECT name, email FROM users WHERE"
  → valid next: []

The SQL constraint leverages schema knowledge to ensure only valid table and column names appear, and it enforces proper keyword ordering.

Full Generation Pipeline

Putting it all together, here's how constrained generation produces a valid SQL query:

In[34]:
Code
import numpy as np


def generate_sql_with_constraints(
    constraint: SimpleSQLConstraint,
    model_preferences: dict[str, float],
    max_tokens: int = 20,
) -> str:
    """
    Generate a SQL query using constrained sampling.

    Args:
        constraint: SQL syntax constraint
        model_preferences: Model's token preferences (simulated)
        max_tokens: Maximum query length

    Returns:
        Generated SQL query
    """
    tokens = []

    for _ in range(max_tokens):
        current_sql = " ".join(tokens)
        valid = constraint.get_valid_tokens(current_sql)

        if not valid or "<END>" in valid:
            break

        # Remove <END> and <VALUE> placeholders for actual sampling
        valid = {t for t in valid if not t.startswith("<")}
        if not valid:
            break

        # Sample based on model preferences
        probs = {t: model_preferences.get(t.lower(), 0.1) for t in valid}
        total = sum(probs.values())
        probs = {t: p / total for t, p in probs.items()}

        tokens_list, prob_list = zip(*probs.items())
        selected = np.random.choice(tokens_list, p=prob_list)
        tokens.append(selected)

    return " ".join(tokens)
Out[35]:
Console
Generating SQL with constraints:

  1. SELECT name
  2. SELECT email
  3. SELECT id

All queries are syntactically valid because constraints prevent
invalid tokens from being selected at each step.

Limitations and Practical Considerations

Constrained decoding is powerful but comes with tradeoffs that affect both quality and performance.

Computational Overhead

Computing valid tokens at each generation step adds overhead. For simple regex constraints, this is minimal. For complex grammars or JSON schemas, the constraint checking can become a bottleneck. Production systems address this by precompiling constraints into efficient finite state machines, caching state transitions, and leveraging parallelism. The Outlines library, for instance, compiles JSON schemas into index structures that enable O(1) valid token lookup per step. Without such optimization, naive implementations can increase generation time by 2-5x.

Quality Degradation

When constraints force the model away from its preferred tokens, output quality can suffer. If the model's top-10 tokens are all invalid, it must choose from lower-probability alternatives. This can produce text that is grammatically correct but semantically awkward, like a translation that follows syntax rules but reads unnaturally. Fine-tuning the model on constrained output formats mitigates this problem by aligning the model's preferences with the constraint requirements.

Constraint Expressiveness

Not all requirements can be expressed as grammar or regex constraints. Semantic constraints like "output must be factually accurate" or "response must be polite" cannot be enforced through token masking. Hybrid approaches combine hard syntactic constraints with soft semantic guidance through prompting or reward models. For instance, you might use constrained decoding to ensure valid JSON structure while relying on prompts to guide the content within that structure.

Tokenization Mismatches

Modern language models use subword tokenization, where a single model token might represent multiple characters or partial words. Mapping grammar terminals to model tokens requires careful handling. The token "json" might be a single token in some vocabularies but "js" + "on" in others. Robust implementations must handle these cases, potentially allowing a token if any valid extension of it could eventually satisfy the constraint.

Key Parameters

When implementing constrained decoding, several parameters control the behavior of the system:

  • temperature (float, typically 0.1-2.0): Controls randomness within valid tokens. Lower values make generation more deterministic, always choosing the highest-probability valid token. Higher values spread probability more evenly across valid options. Temperature of 1.0 preserves the model's original distribution.

  • top_k (int, typically 5-50): After filtering to valid tokens, further restricts sampling to the k highest-probability options. Reduces variance in generation by excluding low-probability valid tokens.

  • beam_width (int, typically 3-10): For constrained beam search, determines how many candidate sequences to maintain. Larger widths explore more paths but increase computation linearly.

  • max_length (int): Maximum number of tokens to generate. Prevents infinite loops when constraints allow arbitrarily long valid sequences.

  • constraint_type: The type of constraint to apply. Options include grammar-based (context-free grammars), schema-based (JSON schemas), regex-based (regular expressions), or custom constraint functions.

For production use with JSON output, libraries like Outlines handle these parameters automatically. For custom constraints, start with temperature=1.0 and no top_k filtering to preserve the model's preferences, then tune based on output quality.

Summary

Constrained decoding transforms language model generation from probabilistic text production into structured output generation. The key concepts covered in this chapter include:

  • Token masking forms the foundation: at each step, we zero out probabilities for tokens that would violate constraints, then sample from the remaining valid tokens
  • Grammar-guided generation uses formal grammars (typically context-free grammars compiled to finite state machines) to specify valid output structures and compute valid tokens efficiently
  • JSON schema constraints convert schemas into grammars that enforce both valid JSON syntax and adherence to the specified schema structure
  • Regex constraints provide a lighter-weight alternative for pattern matching, converted to DFAs for efficient constraint checking
  • Constrained beam search extends beam search to maintain multiple valid hypotheses, enabling recovery from locally suboptimal choices
  • Combining with sampling strategies allows constraints to work alongside temperature, top-k, and nucleus sampling for controlled randomness within valid outputs

The technique has become essential for building reliable AI systems that must produce machine-readable output. When you need a language model to return valid JSON, generate syntactically correct code, or produce output matching any formal specification, constrained decoding provides the guarantee that pure prompting cannot.

Quiz

Ready to test your understanding? Take this quick quiz to reinforce what you've learned about constrained decoding and structured output generation.

Loading component...
Track your reading progress

Sign in to mark chapters as read and track your learning journey

Sign in →

Comments

Reference

BIBTEXAcademic
@misc{constraineddecodinggrammarguidedgenerationforstructuredllmoutput, author = {Michael Brenndoerfer}, title = {Constrained Decoding: Grammar-Guided Generation for Structured LLM Output}, year = {2025}, url = {https://mbrenndoerfer.com/writing/constrained-decoding-structured-llm-output}, organization = {mbrenndoerfer.com}, note = {Accessed: 2025-12-19} }
APAAcademic
Michael Brenndoerfer (2025). Constrained Decoding: Grammar-Guided Generation for Structured LLM Output. Retrieved from https://mbrenndoerfer.com/writing/constrained-decoding-structured-llm-output
MLAAcademic
Michael Brenndoerfer. "Constrained Decoding: Grammar-Guided Generation for Structured LLM Output." 2025. Web. 12/19/2025. <https://mbrenndoerfer.com/writing/constrained-decoding-structured-llm-output>.
CHICAGOAcademic
Michael Brenndoerfer. "Constrained Decoding: Grammar-Guided Generation for Structured LLM Output." Accessed 12/19/2025. https://mbrenndoerfer.com/writing/constrained-decoding-structured-llm-output.
HARVARDAcademic
Michael Brenndoerfer (2025) 'Constrained Decoding: Grammar-Guided Generation for Structured LLM Output'. Available at: https://mbrenndoerfer.com/writing/constrained-decoding-structured-llm-output (Accessed: 12/19/2025).
SimpleBasic
Michael Brenndoerfer (2025). Constrained Decoding: Grammar-Guided Generation for Structured LLM Output. https://mbrenndoerfer.com/writing/constrained-decoding-structured-llm-output
Michael Brenndoerfer

About the author: Michael Brenndoerfer

All opinions expressed here are my own and do not reflect the views of my employer.

Michael currently works as an Associate Director of Data Science at EQT Partners in Singapore, leading AI and data initiatives across private capital investments.

With over a decade of experience spanning private equity, management consulting, and software engineering, he specializes in building and scaling analytics capabilities from the ground up. He has published research in leading AI conferences and holds expertise in machine learning, natural language processing, and value creation through data.

Stay updated

Get notified when I publish new articles on data and AI, private equity, technology, and more.

No spam, unsubscribe anytime.

or

Create a free account to unlock exclusive features, track your progress, and join the conversation.

No popupsUnobstructed readingCommenting100% Free