Policy Gradient Methods: REINFORCE Algorithm & Theory

Michael Brenndoerfer · December 26, 2025 · 42 min read

Learn policy gradient theory for language model alignment. Master the REINFORCE algorithm, variance reduction with baselines, and foundations for PPO.

Policy Gradient Methods

With a trained reward model that captures human preferences, we face a fundamental challenge: how do we use this reward signal to actually improve our language model? The reward model tells us whether a complete generated response is good or bad, but our language model makes decisions token by token. We need a way to propagate that final reward signal back to influence all the individual token choices that led to it.

This is where reinforcement learning enters the picture. Unlike supervised learning, where we have explicit targets for each prediction, reinforcement learning handles the credit assignment problem: determining how much each action contributed to the eventual outcome. Policy gradient methods provide a mathematically principled framework for optimizing a model's behavior based on sparse, delayed reward signals.

In this chapter, we'll develop the theory behind policy gradients from first principles, derive the key mathematical results that make training possible, and implement the REINFORCE algorithm. This foundation is essential for understanding PPO and the complete RLHF pipeline that follows in subsequent chapters.

Language Models as Policies

In reinforcement learning terminology, a policy is a function that maps states to actions. For language models, this mapping has a natural interpretation that becomes clear once we examine how text generation actually works. When you ask a language model to complete a sentence, it doesn't produce the entire response instantaneously. Instead, it generates one token at a time, each choice depending on everything that came before. This sequential decision-making process is precisely what reinforcement learning was designed to handle.

Policy

A policy $\pi_\theta(a|s)$ is a probability distribution over actions $a$ given state $s$, parameterized by $\theta$. In language models, the policy is the model itself: given a context (state), it produces a probability distribution over the next token (action).

To understand why this framing is so powerful, consider what happens during text generation. The model receives a prompt, which establishes the initial context. Based on this context, the model must decide which token to produce first. This decision changes the context, as the newly generated token becomes part of the history. The model then faces a new decision: given the original prompt plus the first generated token, what should the second token be? This process continues, with each token choice altering the state and presenting a fresh decision problem.

When a language model generates text, it's acting as a policy:

  • State $s_t$: The current context, consisting of the prompt plus all tokens generated so far
  • Action $a_t$: The next token to generate
  • Policy $\pi_\theta(a_t|s_t)$: The probability the model assigns to token $a_t$ given context $s_t$

This framing reveals why language model alignment is fundamentally a reinforcement learning problem. At each timestep, the model takes an action (selects a token), the state changes (the context grows), and eventually the complete sequence receives a reward from our reward model. The challenge is learning which tokens led to that reward. Unlike a game where individual moves might receive immediate feedback, language generation only reveals whether the response was good or bad after the entire sequence is complete. This delayed reward structure is exactly the scenario where reinforcement learning excels.

In[2]:
Code
import torch.nn as nn
import torch.nn.functional as F


# A simple language model acting as a policy
class LanguageModelPolicy(nn.Module):
    def __init__(self, vocab_size, hidden_size=256, num_layers=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.lstm = nn.LSTM(
            hidden_size, hidden_size, num_layers, batch_first=True
        )
        self.output = nn.Linear(hidden_size, vocab_size)

    def forward(self, input_ids):
        # State: all tokens seen so far
        embeds = self.embedding(input_ids)
        lstm_out, _ = self.lstm(embeds)
        # Action distribution: probability over next token
        logits = self.output(lstm_out)
        return logits

    def get_action_probs(self, state):
        """Get probability distribution over actions (next tokens)"""
        logits = self.forward(state)
        # Policy: π_θ(a|s) - probability of each action given state
        return F.softmax(logits[:, -1, :], dim=-1)
Out[3]:
Visualization
Diagram showing states expanding as tokens are added, with actions connecting consecutive states and reward at the end.
Sequential token generation modeled as a Markov Decision Process. Each state $s_t$ encompasses the full context (prompt and generated history), while actions $a_t$ correspond to next-token selections. The visualization highlights the temporal gap between individual actions and the final sparse reward $R$, illustrating the credit assignment challenge.

The model's parameters $\theta$ define the policy. Our goal is to adjust these parameters so that the model generates responses that receive high rewards. This means we need to find parameter values that make high-reward sequences more probable and low-reward sequences less probable. The question is: how do we know which direction to adjust the parameters?

The Objective Function

To optimize our policy, we need a clear objective. In RLHF, we want to maximize the expected reward over all possible responses the model might generate. This expectation is crucial because language generation is inherently stochastic. The same prompt can lead to many different responses depending on which tokens get sampled at each step. Some responses will be excellent, others will be mediocre, and still others will be poor. We want to tune the model so that, on average, it produces responses that receive high rewards.

Given a prompt $x$, the model generates a response $y = (y_1, y_2, \ldots, y_T)$ by sampling tokens according to its policy. The reward model then scores the complete response, giving us $R(x, y)$. Our objective:

$$J(\theta) = \mathbb{E}_{y \sim \pi_\theta(\cdot|x)}[R(x, y)]$$

where:

  • $J(\theta)$: the objective function we want to maximize
  • $\theta$: the parameters of the language model
  • $\mathbb{E}$: the expectation over sequences sampled from the policy
  • $\pi_\theta(\cdot|x)$: the policy (language model) distribution conditioned on prompt $x$
  • $R(x, y)$: the reward scalar for the generated response $y$

This expectation is taken over all possible responses the model could generate. Some responses will be excellent (high reward), others mediocre, others poor. The expectation weights each by its probability under the current policy. This weighting is important: if the model rarely generates a particular response, that response contributes little to the expected value, even if it happens to have high reward.

We can expand this using the probability of generating the complete sequence:

$$J(\theta) = \sum_{y} \pi_\theta(y|x) R(x, y)$$
  • $\sum_{y}$: summation over all possible output sequences $y$
  • $\pi_\theta(y|x)$: probability of generating sequence $y$ given prompt $x$
  • $R(x, y)$: reward for the specific sequence $y$

The probability $\pi_\theta(y|x)$ represents the likelihood of generating the entire response $y$. It is computed as the product of individual token probabilities:

$$\pi_\theta(y|x) = \prod_{t=1}^{T} \pi_\theta(y_t | x, y_{<t})$$

where:

  • $T$: the length of the generated sequence
  • $y_t$: the token generated at timestep $t$
  • $y_{<t}$: the history of tokens generated before step $t$
  • $\pi_\theta(y_t | x, y_{<t})$: probability of the next token given context

This autoregressive factorization, which we've seen throughout our discussion of language models, connects the policy formulation directly to how transformers and LSTMs actually generate text. Each factor in the product corresponds to one step of the generation process, one call to the model's forward function, one decision about which token to emit next. The full sequence probability is simply the product of all these individual decisions.

The Policy Gradient Theorem

Here's the core challenge: how do we compute $\nabla_\theta J(\theta)$? We need this gradient to update our parameters via gradient ascent, but the expectation involves sampling from the policy itself, which depends on $\theta$. This creates a circular dependency that seems difficult to resolve. The gradient depends on how changing $\theta$ affects which sequences get sampled, but sampling is a discrete, non-differentiable operation. We cannot simply backpropagate through the sampling process the way we would through a continuous neural network layer.

The policy gradient theorem provides an elegant solution to this problem. Rather than trying to differentiate through sampling directly, it rewrites the gradient in a form that we can estimate using samples from the current policy. The key insight is that we can express the gradient of an expectation in terms of an expectation of gradients, which is something we can actually compute.

Let's derive the policy gradient step by step. Starting with our objective:

$$J(\theta) = \sum_{y} \pi_\theta(y|x) R(x, y)$$

The gradient with respect to $\theta$:

$$\nabla_\theta J(\theta) = \sum_{y} \nabla_\theta \pi_\theta(y|x) R(x, y)$$

The reward $R(x, y)$ doesn't depend on $\theta$ (it comes from a fixed reward model), so only the policy probability is differentiated. This is an important observation: the reward model is frozen during policy optimization, so its output is just a scalar that multiplies the gradient.

Now we apply a crucial mathematical trick called the log-derivative trick. This technique appears throughout machine learning and statistics whenever we need to compute gradients of expectations. The core idea is to introduce a logarithm and then remove it in a way that reveals a useful structure. Starting from the identity:

$$\nabla_\theta \pi_\theta(y|x) = \pi_\theta(y|x) \cdot \frac{\nabla_\theta \pi_\theta(y|x)}{\pi_\theta(y|x)}$$

We recognize that:

$$\frac{\nabla_\theta \pi_\theta(y|x)}{\pi_\theta(y|x)} = \nabla_\theta \log \pi_\theta(y|x)$$

This follows from the chain rule: $\frac{d}{d\theta} \log f(\theta) = \frac{1}{f(\theta)} \frac{df}{d\theta}$. The logarithm converts a multiplicative relationship into an additive one, which will prove extremely useful when we deal with sequence probabilities that are products of many terms.

Substituting back:

$$\nabla_\theta J(\theta) = \sum_{y} \pi_\theta(y|x) \nabla_\theta \log \pi_\theta(y|x) R(x, y)$$

A sum over all possible sequences weighted by their probabilities is exactly an expectation:

$$\nabla_\theta J(\theta) = \mathbb{E}_{y \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(y|x) \cdot R(x, y)]$$
  • $\nabla_\theta J(\theta)$: gradient of the objective function w.r.t. parameters
  • $\nabla_\theta \log \pi_\theta(y|x)$: gradient of the log-probability of the sequence (the "score function")
  • $R(x, y)$: the reward acting as a scalar weight

This is the policy gradient theorem for our setting. It tells us something remarkable: to compute the gradient of expected reward, we can sample sequences from our policy, compute the gradient of their log-probabilities, and multiply by their rewards. The score function $\nabla_\theta \log \pi_\theta(y|x)$ points in the direction that would increase the probability of sequence $y$. Multiplying by the reward scales this direction: high rewards give large positive contributions, low rewards give small or negative contributions.

Policy Gradient Theorem

The gradient of expected reward:

$$\nabla_\theta J(\theta) = \mathbb{E}_{y \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(y|x) \cdot R(x, y)]$$

where:

  • $\nabla_\theta J(\theta)$: gradient of the objective function
  • $\mathbb{E}_{y \sim \pi_\theta}$: expectation over sampled sequences
  • $\nabla_\theta \log \pi_\theta(y|x)$: gradient of the log-probability (score function)
  • $R(x, y)$: reward of the sampled sequence

This transforms an intractable sum over all possible sequences into a tractable Monte Carlo estimate using samples from the policy.

The transformation from an intractable sum to a Monte Carlo estimate is the key practical insight. We cannot enumerate all possible sequences; for even modest vocabulary sizes and sequence lengths, this would involve more sequences than atoms in the universe. But we can sample sequences from our policy, and the policy gradient theorem tells us that averaging over these samples gives us an unbiased estimate of the true gradient.
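
To make the Monte Carlo estimate concrete, here is a minimal sketch of the corresponding surrogate loss, assuming we already have the summed log-probability $\log \pi_\theta(y^{(i)}|x)$ of each sampled sequence (computed with gradients enabled) and its scalar reward. Minimizing this surrogate performs gradient ascent on the expected reward; the sampling helpers appear later in this chapter.

import torch


def policy_gradient_surrogate(seq_log_probs, rewards):
    """Monte Carlo surrogate loss for the policy gradient.

    Args:
        seq_log_probs: log-probability of each sampled sequence, shape (N,),
            computed with gradients enabled
        rewards: scalar reward for each sequence, shape (N,)

    Returns:
        Loss whose gradient is the negative of the policy gradient estimate.
    """
    # Average log π(y|x) · R over the batch; rewards are treated as constants
    return -(seq_log_probs * rewards.detach()).mean()


# Toy usage with made-up values, for illustration only
seq_log_probs = torch.tensor([-3.5, -2.1, -4.0], requires_grad=True)
rewards = torch.tensor([1.0, -0.5, 2.0])
loss = policy_gradient_surrogate(seq_log_probs, rewards)
loss.backward()  # seq_log_probs.grad now holds -R_i / N for each sample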

Out[4]:
Visualization
Scatter plot showing sampled sequences with arrows indicating gradient direction based on reward sign.
Policy gradient update directions in the probability-reward plane. Positive rewards (green) generate gradients that increase the log-probability of the sampled sequence, while negative rewards (orange) generate gradients that decrease it. The magnitude of each update, represented by arrow length, scales proportionally with the absolute value of the reward.

From Sequences to Tokens

For language models, we need to express the log-probability of an entire sequence in terms of individual tokens. This connection is essential because our models compute token-level probabilities, not sequence-level probabilities directly. Fortunately, the autoregressive structure of language models makes this straightforward. Using the autoregressive factorization:

$$\log \pi_\theta(y|x) = \sum_{t=1}^{T} \log \pi_\theta(y_t | x, y_{<t})$$

where:

  • $\sum_{t=1}^{T}$: sum over all timesteps in the sequence
  • $\log \pi_\theta(y_t | x, y_{<t})$: log-probability of the specific token $y_t$ given context

The logarithm converts the product of probabilities into a sum of log-probabilities. This is where the log-derivative trick pays off: instead of dealing with products of many small numbers (which can underflow numerically), we work with sums of log-probabilities, which are much more stable.

Each term $\log \pi_\theta(y_t | s_t)$ is simply the log-softmax output of our model at position $t$, evaluated at the token that was actually selected. This is exactly what we compute during the forward pass. When the model processes a sequence, it produces logits for every position, and applying log-softmax gives us log-probabilities. To compute the log-probability of a specific token choice, we simply index into this log-probability vector at the chosen token's index.
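
As a minimal sketch of that indexing step, assuming logits of shape (batch, seq_len, vocab_size) from a model like LanguageModelPolicy above and target_ids holding the tokens that were actually generated:

import torch
import torch.nn.functional as F


def selected_token_log_probs(logits, target_ids):
    """Log-probabilities of the chosen token at each position.

    Args:
        logits: model outputs, shape (batch, seq_len, vocab_size)
        target_ids: tokens actually selected, shape (batch, seq_len)

    Returns:
        log π_θ(y_t | s_t) for each position, shape (batch, seq_len)
    """
    log_probs = F.log_softmax(logits, dim=-1)
    # Index each position's log-probability vector at the chosen token
    return log_probs.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)


# Summing over positions then gives log π_θ(y|x) for each sequence:
# sequence_log_prob = selected_token_log_probs(logits, target_ids).sum(dim=-1)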

The gradient of the full sequence log-probability is then:

$$\nabla_\theta \log \pi_\theta(y|x) = \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(y_t | s_t)$$

where:

  • $s_t$: the state (context) at timestep $t$, equivalent to $(x, y_{<t})$
  • $\nabla_\theta \log \pi_\theta(y_t | s_t)$: gradient of the log-probability for a single token

This sum of gradients follows from the linearity of differentiation. The gradient of a sum is the sum of gradients. Each term in the sum corresponds to one token's contribution to the overall sequence probability.

Putting this together, the policy gradient becomes:

$$\nabla_\theta J(\theta) = \mathbb{E}_{y \sim \pi_\theta}\left[\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(y_t | s_t) \cdot R(x, y)\right]$$

where:

  • $\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(y_t | s_t)$: the accumulated gradient from all tokens in the sequence
  • $R(x, y)$: the final reward, which scales the gradient of every token choice

The same reward $R(x, y)$ multiplies every token's gradient. This is the key insight: even though we only receive reward at the end, it propagates back to influence all token choices. Every token that contributed to the response receives the same credit or blame for the final outcome. This is both a strength and a weakness of the basic approach. It allows learning from sparse rewards, but it doesn't distinguish between tokens that were crucial for the reward and tokens that were irrelevant.

Out[5]:
Visualization
Line plot showing cumulative log probability decreasing as more tokens are added to a sequence.
Per-token and cumulative log-probabilities for a single generated sequence. The left panel displays individual token log-probabilities log π(y_t|s_t), while the right panel tracks the accumulating sum log π(y|x). The final cumulative value determines the magnitude of the gradient update for the sequence in the REINFORCE algorithm.

The REINFORCE Algorithm

The policy gradient theorem gives us a theoretical result. REINFORCE (also called Monte Carlo policy gradient) is the algorithm that turns this theory into practice through Monte Carlo estimation. The name comes from the idea that we "reinforce" behaviors that lead to good outcomes: when a sequence receives high reward, we reinforce the probability of generating that exact sequence.

The algorithm is elegantly simple:

  1. Sample a complete sequence from the current policy
  2. Compute the reward for that sequence
  3. For each token, compute $\nabla_\theta \log \pi_\theta(y_t|s_t) \cdot R$
  4. Sum these gradients and update parameters
In[6]:
Code
def reinforce_loss(log_probs, reward):
    """
    Compute REINFORCE loss for a single sequence.

    Args:
        log_probs: Log probabilities of each selected token, shape (seq_len,)
        reward: Scalar reward for the complete sequence

    Returns:
        Loss to minimize (negative of policy gradient objective)
    """
    # Sum log probs to get log π(y|x)
    log_prob_sequence = log_probs.sum()

    # REINFORCE: maximize E[log π(y|x) · R]
    # For gradient descent, we minimize the negative
    loss = -log_prob_sequence * reward

    return loss
In[7]:
Code
import torch
import torch.nn.functional as F


def generate_and_get_log_probs(
    model, prompt_ids, max_length=20, vocab_size=1000
):
    """Generate a sequence and collect log probabilities."""
    model.eval()

    current_ids = prompt_ids.clone()
    log_probs = []

    with torch.no_grad():
        for _ in range(max_length):
            # Get action probabilities
            logits = model(current_ids)
            probs = F.softmax(logits[:, -1, :], dim=-1)

            # Sample action (next token)
            action = torch.multinomial(probs, 1)

            # Store log probability of selected action
            log_prob = torch.log(probs.gather(1, action))
            log_probs.append(log_prob)

            # Update state
            current_ids = torch.cat([current_ids, action], dim=1)

    return current_ids, torch.cat(log_probs, dim=1).squeeze()

Let's trace through a concrete example to build intuition. Suppose our model generates a three-token response with probabilities 0.3, 0.5, and 0.2, receiving a reward of +1. The log-probability of the sequence is:

$$\log \pi_\theta(y|x) = \log(0.3) + \log(0.5) + \log(0.2) \approx -3.51$$

where:

  • $\log(0.3)$, $\log(0.5)$, etc.: log-probabilities of individual tokens
  • $-3.51$: the cumulative log-probability of the complete sequence

Notice that the log-probability is negative, which is always the case since probabilities are between 0 and 1. A log-probability of $-3.51$ corresponds to a probability of about 0.03, which makes sense: it's the product $0.3 \times 0.5 \times 0.2 = 0.03$.

The REINFORCE gradient update uses the term:

$$g = \nabla_\theta \log \pi_\theta(y|x) \cdot R$$

where:

  • $g$: the gradient contribution from this single sample
  • $R$: the reward ($+1$)
  • $\nabla_\theta \log \pi_\theta(y|x)$: the direction in parameter space that increases the log-probability of this exact sequence

This gradient pushes the model to increase the probability of generating this exact sequence. If the reward had been 1-1 instead, the gradient would push in the opposite direction, making this sequence less likely. The reward acts as a scaling factor that determines both the magnitude and direction of the update.

In[8]:
Code
import torch

# Demonstration: how the sign of the reward flips the REINFORCE loss
log_probs = torch.log(torch.tensor([0.3, 0.5, 0.2]))
log_prob_total = log_probs.sum()
print(f"Log probability of sequence: {log_prob_total.item():.3f}")

# Positive reward: minimizing the loss increases sequence probability
reward_positive = 1.0
loss_positive = -log_prob_total * reward_positive
print(f"With positive reward (+1): loss = {loss_positive.item():.3f}")
print("  → Minimizing this loss increases sequence probability")

# Negative reward: minimizing the loss decreases sequence probability
reward_negative = -1.0
loss_negative = -log_prob_total * reward_negative
print(f"With negative reward (-1): loss = {loss_negative.item():.3f}")
print("  → Minimizing this loss decreases sequence probability")
Out[9]:
Console
Log probability of sequence: -3.507
With positive reward (+1): loss = 3.507
  → Minimizing this loss increases sequence probability
With negative reward (-1): loss = -3.507
  → Minimizing this loss decreases sequence probability

The intuition is clear: REINFORCE increases the probability of sequences that receive high rewards and decreases the probability of sequences that receive low rewards. Over many updates, the policy gradually shifts to favor the kinds of responses that the reward model prefers. This is the essence of learning from rewards: try things, see what works, do more of what works.

The Variance Problem

REINFORCE has a critical weakness: extremely high variance in its gradient estimates. This variance comes from several sources:

  • Sampling variance: We estimate an expectation using a single sample (or small batch). Different samples from the same policy can have wildly different rewards. Consider a prompt where some responses are brilliant (reward +10) and others are terrible (reward -10), but most are mediocre (reward near 0). If we happen to sample one of the rare brilliant responses, we get a huge positive gradient. If we sample a terrible one, we get a huge negative gradient. The expected gradient might be small and stable, but individual estimates can be orders of magnitude larger.
  • Reward magnitude: Large rewards amplify gradient noise. A reward of 100 versus 0.1 changes gradient magnitudes by a factor of 1000. If the reward scale varies across prompts or across training, the gradient scale varies too. This makes it hard to choose a stable learning rate.
  • Long sequences: With more tokens, there's more opportunity for randomness in the sampling process. Each token choice is stochastic, and these random choices compound. The same prompt can lead to vastly different sequences purely due to sampling randomness early in generation.

To see why this matters, consider the policy gradient:

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta \log \pi_\theta(y^{(i)}|x) \cdot R(x, y^{(i)})$$
  • $N$: the batch size (number of sampled sequences)
  • $y^{(i)}$: the $i$-th sampled sequence in the batch
  • $R(x, y^{(i)})$: the reward for the $i$-th sequence

If rewards vary from -10 to +10, these gradient estimates will swing wildly between updates. The model may get pushed in completely different directions on consecutive batches, even with the same prompt, simply due to sampling randomness. This creates a noisy optimization landscape where progress is slow and inconsistent.

In[10]:
Code
import numpy as np

# Simulate variance in REINFORCE estimates
np.random.seed(42)


def simulate_reinforce_variance(
    n_samples_per_estimate, true_gradient=1.0, reward_std=5.0
):
    """Simulate variance in gradient estimates."""
    # Simulate rewards with high variance
    rewards = np.random.normal(
        loc=true_gradient, scale=reward_std, size=n_samples_per_estimate
    )
    # Each estimate is the mean of samples
    estimate = np.mean(rewards)
    return estimate


# Compare variance with different batch sizes
batch_sizes = [1, 4, 16, 64, 256]
n_estimates = 200

variance_by_batch = {}
for batch_size in batch_sizes:
    estimates = [
        simulate_reinforce_variance(batch_size) for _ in range(n_estimates)
    ]
    variance_by_batch[batch_size] = np.var(estimates)
Out[11]:
Visualization
Bar chart showing gradient estimate variance decreasing from about 25 to near 0.1 as batch size increases.
REINFORCE gradient estimate variance as a function of batch size. The variance decreases following a $1/N$ scaling law, so reducing the variance by a factor of $k$ requires $k$ times more samples. Even with large batch sizes (e.g., 256), the variance remains significant compared to supervised learning, highlighting the algorithm's sample inefficiency.

The variance decreases at rate $O(1/N)$ with batch size $N$, so gains in stability come only linearly with sample count. To reduce variance by a factor of 10, we need 10 times more samples. To reduce it by a factor of 100, we need 100 times more samples. For large language models where each sample is expensive to generate, this sample inefficiency becomes a serious practical barrier.

Variance Reduction with Baselines

The key insight for reducing variance comes from a useful property of the policy gradient: subtracting any constant from the reward doesn't change the expected gradient, but it can dramatically reduce variance. This property allows us to center the rewards around zero without introducing any bias into our gradient estimates.

Consider adding a baseline $b$ to our gradient estimator:

$$\nabla_\theta J(\theta) = \mathbb{E}_{y \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(y|x) \cdot (R(x, y) - b)]$$
  • $b$: a baseline value (constant w.r.t. the action $y$)
  • $(R(x, y) - b)$: the "centered" reward, reducing variance

Why does centering help? Consider the extreme case where all rewards are positive, say ranging from +5 to +15. Without a baseline, every gradient update pushes to increase the probability of all sampled sequences. The model learns slowly because it's receiving mixed signals: "increase this sequence, increase that sequence too, increase everything." With a baseline of 10, rewards become -5 to +5. Now the gradient clearly distinguishes: "increase sequences above the baseline, decrease sequences below it." This clearer signal leads to faster learning.

To verify this doesn't change the expected gradient, we check that the baseline term contributes zero in expectation:

$$\mathbb{E}_{y \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(y|x) \cdot b] = b \cdot \mathbb{E}_{y \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(y|x)]$$

The expectation of the score function $\nabla_\theta \log \pi_\theta(y|x)$ is zero:

$$\begin{aligned} \mathbb{E}_{y \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(y|x)] &= \sum_y \pi_\theta(y|x) \frac{\nabla_\theta \pi_\theta(y|x)}{\pi_\theta(y|x)} && \text{(expand expectation and log derivative)} \\ &= \sum_y \nabla_\theta \pi_\theta(y|x) && \text{(simplify)} \\ &= \nabla_\theta \sum_y \pi_\theta(y|x) && \text{(linearity of gradient)} \\ &= \nabla_\theta 1 && \text{(probabilities sum to 1)} \\ &= 0 && \text{(derivative of constant)} \end{aligned}$$

This result shows that the score function always has zero mean. This is because the score function measures how changing parameters affects the log-probability of different outcomes, and these effects must balance out since probabilities always sum to one. Increasing the probability of some outcomes necessarily decreases the probability of others.
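
We can verify the zero-mean property numerically for a tiny categorical policy. This is a sketch under the assumption of a single softmax over five actions; the same argument applies to full sequence distributions.

import torch

torch.manual_seed(0)
logits = torch.randn(5, requires_grad=True)  # tiny policy over 5 actions
probs = torch.softmax(logits, dim=-1).detach()

# E[∇ log π(a)] = Σ_a π(a) · ∇ log π(a)
expected_score = torch.zeros(5)
for a in range(5):
    log_prob_a = torch.log_softmax(logits, dim=-1)[a]
    (grad_a,) = torch.autograd.grad(log_prob_a, logits)
    expected_score += probs[a] * grad_a

print(expected_score)  # every entry is ~0 up to floating-point error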

This result implies that we can subtract any baseline without introducing bias. The question becomes: what baseline minimizes variance?

The optimal baseline turns out to be the expected reward:

$$b^* = \mathbb{E}_{y \sim \pi_\theta}[R(x, y)]$$

where:

  • $b^*$: the variance-minimizing baseline value
  • $\mathbb{E}[R(x, y)]$: the expected reward under the current policy (often approximated by a value function $V(x)$)

This is called a value baseline, and in practice we often estimate it with a learned value function. With this baseline, the quantity $(R(x, y) - b^*)$ becomes the advantage: how much better (or worse) this specific sequence is compared to what we expect on average. The advantage is positive when a response exceeds expectations and negative when it falls short.

Advantage

The advantage measures how much better a response $y$ is compared to the expected reward from the current policy:

$$A(x, y) = R(x, y) - V(x)$$

where:

  • $A(x, y)$: the advantage score
  • $R(x, y)$: the reward for sequence $y$
  • $V(x)$: the value function estimating expected reward

Positive advantage means better than average; negative means worse.

The advantage has a natural interpretation: it measures how "surprising" a reward is relative to what we expected. A response with advantage +5 did much better than typical; one with advantage -3 did worse. By training on advantages rather than raw rewards, we give the model clearer information about which responses are actually good.

In[12]:
Code
def reinforce_with_baseline(log_probs, reward, baseline):
    """
    REINFORCE with baseline for variance reduction.

    Args:
        log_probs: Log probabilities of selected tokens
        reward: Scalar reward for sequence
        baseline: Expected reward estimate (e.g., from value function)

    Returns:
        Policy loss and value loss
    """
    log_prob_sequence = log_probs.sum()

    # Advantage: how much better/worse than expected
    advantage = reward - baseline

    # Policy gradient uses advantage instead of raw reward
    policy_loss = (
        -log_prob_sequence * advantage.detach()
    )  # Don't backprop through advantage

    # Value loss: train baseline to predict expected reward
    value_loss = (baseline - reward) ** 2

    return policy_loss, value_loss
In[13]:
Code
# Demonstrate variance reduction
import numpy as np

np.random.seed(42)


def simulate_with_baseline(n_samples, rewards, baseline):
    """Simulate gradient estimates with baseline."""
    advantages = rewards - baseline
    # Gradient estimate variance depends on (R - b)^2
    return np.var(advantages)


# Simulate batch of rewards
n_trials = 1000
rewards = np.random.normal(loc=5.0, scale=3.0, size=n_trials)

# Compare different baselines
baselines = [0, 3, 5, 7, 10]
variances = [simulate_with_baseline(n_trials, rewards, b) for b in baselines]
optimal_baseline = np.mean(rewards)
Out[14]:
Visualization
Line plot showing variance minimized at baseline value of 5, with U-shaped curve around it.
Gradient estimate variance as a function of the baseline value. The variance is minimized when the baseline equals the expected reward (approx. 5.0, marked by the dashed line). Deviating from this optimal value increases the variance of the gradient estimates, demonstrating the importance of accurate value estimation.
Out[15]:
Visualization
Two histograms comparing raw reward distribution (all positive) versus advantage distribution (centered at zero).
Distributions of raw rewards versus centered advantages. The left panel shows raw positive rewards, which cause all policy gradients to push in the same direction. The right panel shows advantages centered at zero; positive values indicate better-than-average sequences (reinforce), while negative values indicate worse-than-average sequences (suppress).

Reward-to-Go

Another variance reduction technique acknowledges that actions only affect future rewards, not past ones. This is a causal observation: a decision made at time $t$ cannot retroactively change what happened at time $t-1$. Yet in basic REINFORCE, we multiply every token's gradient by the full sequence reward, even though later tokens cannot have influenced rewards that were already determined by the time they were generated.

Instead of multiplying each token's gradient by the total reward, we can use only the rewards that could have been influenced by that action. This is called the reward-to-go formulation. If we only receive reward at the end of the sequence, the reward-to-go at each timestep is the same as the final reward. However, if we were to introduce intermediate rewards (for example, per-token penalties for certain words), the reward-to-go at step $t$ would be:

$$R_t = \sum_{t'=t}^{T} r_{t'}$$

where:

  • $R_t$: reward-to-go from timestep $t$
  • $r_{t'}$: immediate reward received at timestep $t'$

The sum starts at the current timestep $t$ and runs to the end of the sequence $T$. This captures only the future rewards, ignoring any rewards that occurred before the action was taken.

The policy gradient becomes:

$$\nabla_\theta J(\theta) = \mathbb{E}_{y \sim \pi_\theta}\left[\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(y_t|s_t) \cdot R_t\right]$$

where:

  • $\mathbb{E}_{y \sim \pi_\theta}$: expectation over sequences sampled from the policy
  • $y_t$: the token generated at timestep $t$ (previously denoted as action $a_t$)
  • $R_t$: the cumulative reward from time $t$ onwards (reward-to-go), ignoring past rewards

This doesn't change the expected gradient (the proof is similar to the baseline case) but reduces variance by removing terms that contribute only noise. The intuition is that past rewards add randomness without providing useful signal about the current action. By excluding them, we get cleaner gradient estimates.

In[16]:
Code
import torch


def compute_rewards_to_go(rewards, gamma=1.0):
    """
    Compute reward-to-go for each timestep.

    Args:
        rewards: List of rewards at each timestep
        gamma: Discount factor (1.0 = no discounting)

    Returns:
        Tensor of rewards-to-go
    """
    T = len(rewards)
    rewards_to_go = torch.zeros(T)

    # Work backwards from end
    running_sum = 0
    for t in reversed(range(T)):
        running_sum = rewards[t] + gamma * running_sum
        rewards_to_go[t] = running_sum

    return rewards_to_go
In[17]:
Code
# Example: sequence with intermediate rewards
intermediate_rewards = [0, 0, 0, 0, 1.0]  # Only final reward
rewards_to_go = compute_rewards_to_go(intermediate_rewards)
Out[18]:
Console
Timestep | Immediate Reward | Reward-to-Go
---------------------------------------------
    0    |       0.0        |     1.0
    1    |       0.0        |     1.0
    2    |       0.0        |     1.0
    3    |       0.0        |     1.0
    4    |       1.0        |     1.0
Out[19]:
Visualization
Bar chart comparing immediate rewards at each timestep with the corresponding reward-to-go values.
Comparison of immediate rewards and cumulative reward-to-go with intermediate penalties. The left panel displays per-token rewards (e.g., penalties plus a final outcome), while the right panel shows the reward-to-go $R_t$ from each timestep. Earlier tokens are associated with larger cumulative future rewards, providing a richer signal for gradient estimation.

In the typical RLHF setup where reward is only given at the end, reward-to-go equals the final reward at all timesteps. However, this structure becomes important when we add auxiliary rewards or penalties, such as the KL divergence penalty we'll introduce in subsequent chapters. That penalty applies at each token, creating a richer reward structure where reward-to-go provides meaningful variance reduction.
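
As a preview of that richer structure, here is a sketch that combines hypothetical per-token penalties (standing in for the KL terms introduced later) with a terminal reward from the reward model, then applies compute_rewards_to_go from above. The numbers are made up for illustration.

import torch

# Hypothetical per-token penalties plus a terminal reward (illustrative values)
per_token_penalty = torch.tensor([-0.10, -0.05, -0.20, -0.10, -0.05])
final_reward = 1.0

rewards = per_token_penalty.clone()
rewards[-1] += final_reward  # the sequence-level reward arrives at the last token

rewards_to_go = compute_rewards_to_go(rewards)  # defined earlier in this chapter
print(rewards_to_go)  # each entry counts only the penalties and reward still to come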

Complete Implementation

Let's put everything together into a complete REINFORCE implementation with baseline and proper training loop.

In[20]:
Code
import torch.nn as nn


class ValueNetwork(nn.Module):
    """Baseline network that estimates expected reward."""

    def __init__(self, vocab_size, hidden_size=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.lstm = nn.LSTM(hidden_size, hidden_size, batch_first=True)
        self.output = nn.Linear(hidden_size, 1)

    def forward(self, input_ids):
        embeds = self.embedding(input_ids)
        lstm_out, (h_n, _) = self.lstm(embeds)
        # Use final hidden state to predict value
        value = self.output(h_n[-1])
        return value.squeeze(-1)
In[21]:
Code
import torch
import torch.nn.functional as F


def reinforce_training_step(
    policy,
    value_net,
    prompt_ids,
    reward_fn,
    policy_optimizer,
    value_optimizer,
    max_length=20,
    vocab_size=1000,
):
    """
    Single REINFORCE training step with baseline.

    Args:
        policy: Language model policy
        value_net: Baseline value network
        prompt_ids: Input prompt token IDs
        reward_fn: Function that scores complete sequences
        policy_optimizer: Optimizer for policy
        value_optimizer: Optimizer for value network
        max_length: Maximum generation length
        vocab_size: Vocabulary size

    Returns:
        Dictionary with loss values and reward
    """
    policy.train()
    value_net.train()

    # Generate sequence and collect log probs
    current_ids = prompt_ids.clone()
    log_probs = []

    for _ in range(max_length):
        logits = policy(current_ids)
        probs = F.softmax(logits[:, -1, :], dim=-1)

        # Sample action
        action = torch.multinomial(probs, 1)

        # Log probability of selected action
        log_prob = torch.log(probs.gather(1, action) + 1e-10)
        log_probs.append(log_prob)

        current_ids = torch.cat([current_ids, action], dim=1)

    log_probs = torch.cat(log_probs, dim=1).sum(dim=1)  # Sum over sequence

    # Get reward for complete sequence
    reward = reward_fn(current_ids)

    # Baseline prediction
    baseline = value_net(prompt_ids)

    # Advantage
    advantage = reward - baseline.detach()

    # Policy loss (REINFORCE with baseline)
    policy_loss = -(log_probs * advantage).mean()

    # Value loss (MSE)
    value_loss = F.mse_loss(baseline, reward)

    # Update policy
    policy_optimizer.zero_grad()
    policy_loss.backward()
    policy_optimizer.step()

    # Update value network
    value_optimizer.zero_grad()
    value_loss.backward()
    value_optimizer.step()

    return {
        "policy_loss": policy_loss.item(),
        "value_loss": value_loss.item(),
        "reward": reward.mean().item(),
        "advantage": advantage.mean().item(),
    }

Now let's test this with a simple reward function that encourages shorter sequences:

In[22]:
Code
# Setup
import torch

torch.manual_seed(42)
vocab_size = 100
hidden_size = 64

policy = LanguageModelPolicy(vocab_size, hidden_size, num_layers=1)
value_net = ValueNetwork(vocab_size, hidden_size)

policy_optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
value_optimizer = torch.optim.Adam(value_net.parameters(), lr=1e-3)


# Simple reward: penalize sequence length
def length_penalty_reward(sequence_ids):
    """Reward function that penalizes sequence length."""
    # Negative reward proportional to the total sequence length
    batch_size = sequence_ids.shape[0]
    seq_length = sequence_ids.shape[1]
    return torch.tensor([-seq_length / 50.0] * batch_size, dtype=torch.float32)


# Training loop
prompt = torch.randint(0, vocab_size, (1, 5))  # Batch of 1, prompt length 5
training_history = []

for step in range(100):
    metrics = reinforce_training_step(
        policy,
        value_net,
        prompt,
        length_penalty_reward,
        policy_optimizer,
        value_optimizer,
        max_length=15,
        vocab_size=vocab_size,
    )
    training_history.append(metrics)
In[23]:
Code
import matplotlib.pyplot as plt

plt.rcParams.update(
    {
        "figure.figsize": (2.0, 1.8),  # Adjust for layout
        "figure.dpi": 300,
        "figure.constrained_layout.use": True,
        "font.family": "sans-serif",
        "font.sans-serif": [
            "Noto Sans CJK SC",
            "Apple SD Gothic Neo",
            "DejaVu Sans",
            "Arial",
        ],
        "font.size": 10,
        "axes.titlesize": 11,
        "axes.titleweight": "bold",
        "axes.titlepad": 8,
        "axes.labelsize": 10,
        "axes.labelpad": 4,
        "xtick.labelsize": 9,
        "ytick.labelsize": 9,
        "legend.fontsize": 9,
        "legend.title_fontsize": 10,
        "legend.frameon": True,
        "legend.loc": "best",
        "lines.linewidth": 1.5,
        "lines.markersize": 5,
        "axes.grid": True,
        "grid.alpha": 0.3,
        "grid.linestyle": "--",
        "axes.spines.top": False,
        "axes.spines.right": False,
        "axes.prop_cycle": plt.cycler(
            color=["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#7f7f7f"]
        ),
    }
)

steps = range(len(training_history))
policy_losses = [h["policy_loss"] for h in training_history]
value_losses = [h["value_loss"] for h in training_history]
rewards = [h["reward"] for h in training_history]

# First plot
plt.figure()
plt.plot(steps, policy_losses, alpha=0.7, color="steelblue")
plt.xlabel("Training Step")
plt.ylabel("Policy Loss")
plt.title("Policy Loss (High Variance)")
plt.show()

# Second plot
plt.figure()
plt.plot(steps, value_losses, alpha=0.7, color="coral")
plt.xlabel("Training Step")
plt.ylabel("Value Loss")
plt.title("Value Loss (Baseline Learning)")
plt.show()

# Third plot
plt.figure()
plt.plot(steps, rewards, alpha=0.7, color="forestgreen")
plt.xlabel("Training Step")
plt.ylabel("Reward")
plt.title("Average Reward")
plt.show()
Out[23]:
Visualization
Three-panel plot showing noisy policy loss, decreasing value loss, and relatively stable reward over training.
Training metrics over 100 steps of REINFORCE optimization. The left panel illustrates the high variance in policy loss characteristic of Monte Carlo estimates. The center panel tracks the decreasing value loss as the baseline improves, while the right panel shows the average reward trajectory.

The plots reveal the characteristic noisiness of REINFORCE training. While the value network learns to predict expected rewards (value loss decreases), the policy loss remains highly variable. This variance is the primary motivation for more sophisticated algorithms.

Additional Variance Reduction Techniques

Beyond baselines and reward-to-go, several additional techniques help control variance in practice.

Reward normalization standardizes rewards across a batch to have zero mean and unit variance:

In[24]:
Code
def normalize_rewards(rewards, eps=1e-8):
    """Normalize rewards to zero mean, unit variance."""
    mean = rewards.mean()
    std = rewards.std()
    return (rewards - mean) / (std + eps)

This prevents large reward magnitudes from causing gradient explosions while preserving the relative ordering of rewards.
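
A quick usage sketch with made-up rewards shows that normalization rescales the values while keeping their relative ordering intact:

import torch

batch_rewards = torch.tensor([120.0, 80.0, 95.0, 150.0])  # illustrative values
normalized = normalize_rewards(batch_rewards)

print(normalized)                    # zero mean, unit variance
print(torch.argsort(batch_rewards))  # same ordering before...
print(torch.argsort(normalized))     # ...and after normalization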

Entropy regularization adds a bonus for maintaining policy entropy, preventing premature convergence to deterministic behavior:

In[25]:
Code
import torch


def policy_entropy(probs):
    """Compute entropy of action distribution."""
    return -(probs * torch.log(probs + 1e-10)).sum(dim=-1)


def entropy_regularized_loss(policy_loss, probs, entropy_coef=0.01):
    """Add entropy bonus to encourage exploration."""
    entropy = policy_entropy(probs).mean()
    return policy_loss - entropy_coef * entropy

High entropy means the policy is exploring diverse actions; low entropy means it's becoming deterministic. The entropy bonus encourages continued exploration, which can help escape local optima.
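
A small sketch contrasting a near-uniform and a near-deterministic action distribution with the policy_entropy helper above:

import torch

vocab = 10
uniform_probs = torch.full((1, vocab), 1.0 / vocab)        # exploratory policy
peaked_probs = torch.full((1, vocab), 0.03 / (vocab - 1))
peaked_probs[0, 3] = 0.97                                  # nearly deterministic

print(policy_entropy(uniform_probs))  # ≈ log(10) ≈ 2.30 nats
print(policy_entropy(peaked_probs))   # much lower entropy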

Out[26]:
Visualization
Two bar charts comparing action probability distributions with and without entropy regularization.
Impact of entropy regularization on action probability distributions. The left panel shows a collapsed, near-deterministic policy (low entropy) typical of unregularized training. The right panel demonstrates how an entropy bonus preserves a more diverse distribution (high entropy), ensuring continued exploration of the action space.

Gradient clipping limits the magnitude of gradient updates:

In[27]:
Code
import torch


def clip_gradients(model, max_norm=1.0):
    """Clip gradients to prevent large updates."""
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)

As we discussed in Part VII, gradient clipping prevents individual updates from being too large, which is especially important given REINFORCE's high variance.
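
A minimal sketch of where the call sits in an update, using a stand-in model; in reinforce_training_step above it would go between policy_loss.backward() and policy_optimizer.step():

import torch
import torch.nn as nn

# Stand-in model and optimizer for illustration
model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

loss = model(torch.randn(8, 4)).pow(2).mean()
optimizer.zero_grad()
loss.backward()
clip_gradients(model, max_norm=1.0)  # defined above: caps the gradient norm
optimizer.step()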

Limitations and Practical Challenges

REINFORCE, while theoretically elegant, has significant practical limitations that make it challenging for training large language models:

  • Sample efficiency: REINFORCE requires many samples to get reliable gradient estimates. For large language models generating hundreds of tokens, each sample is expensive: it requires a full forward pass and potentially a reward model evaluation. Training can require millions of samples, which becomes prohibitively expensive with models that have billions of parameters.
  • Credit assignment: The credit assignment problem is only partially solved. While the policy gradient attributes reward to all tokens equally, this isn't always appropriate. A response might be mostly good with one problematic sentence, yet every token receives the same reward signal. More sophisticated algorithms like actor-critic methods (which we'll see with PPO) attempt to provide finer-grained credit assignment.
  • Update stability: REINFORCE provides no guarantees about the size of policy updates. A lucky high-reward sample can push the policy dramatically in one direction, only for the next batch to push it back. This oscillation wastes computation and can destabilize training. The PPO algorithm, covered in the next chapter, directly addresses this by constraining how much the policy can change per update.

Despite these limitations, REINFORCE remains important as the foundation for understanding more sophisticated algorithms. PPO, TRPO, and other policy gradient methods all build on the same core gradient estimator. Understanding REINFORCE's mechanics and failure modes helps explain why those algorithms make the design choices they do.

Summary

Policy gradient methods provide the mathematical foundation for optimizing language models using reward signals. The key insights from this chapter are:

  • Language models are policies. The autoregressive generation process naturally maps to the RL framework: states are contexts, actions are tokens, and the policy $\pi_\theta(a_t|s_t)$ is the model's token probability distribution.
  • The policy gradient theorem enables optimization. By rewriting the gradient as $\mathbb{E}[\nabla_\theta \log \pi_\theta(y|x) \cdot R(y)]$, we transform an intractable sum into a Monte Carlo estimate that we can compute from samples.
  • REINFORCE is simple but high-variance. The basic algorithm just samples sequences, computes their rewards, and updates parameters in proportion to $\log \pi(y) \cdot R$. This works but requires many samples for stable estimates.
  • Baselines reduce variance without introducing bias. Subtracting a baseline $b$ from rewards yields the same expected gradient but lower variance. The optimal baseline is the expected reward, leading to the advantage formulation $A = R - V$.
  • Additional techniques help in practice. Reward normalization, entropy regularization, and gradient clipping all contribute to more stable training.

The variance problem with REINFORCE motivates the development of more sophisticated algorithms. In the next chapter, we'll see how Proximal Policy Optimization (PPO) addresses these issues by constraining policy updates and using more careful gradient estimation, making RL-based alignment practical for production language models.
