RLHF Pipeline: Complete Three-Stage Training Guide

Michael Brenndoerfer · December 29, 2025 · 37 min read

Master the complete RLHF pipeline with three stages: Supervised Fine-Tuning, Reward Model training, and PPO optimization. Learn debugging techniques.


RLHF Pipeline

Reinforcement Learning from Human Feedback transforms a base language model into an assistant that follows instructions helpfully, honestly, and safely. The previous chapters introduced the individual components: the Bradley-Terry model for preference modeling, reward model architecture and training, and the PPO algorithm adapted for language models. Now we assemble these pieces into a complete training pipeline.

The RLHF pipeline consists of three sequential stages: Supervised Fine-Tuning (SFT), Reward Model (RM) training, and PPO optimization. Each stage builds on the previous one, progressively shaping the model's behavior. This chapter walks through the complete pipeline, examining the design decisions at each stage, the hyperparameters that govern training stability, and the debugging techniques essential for successful alignment.

The Three-Stage Pipeline

The RLHF pipeline follows a specific sequence that transforms a pretrained language model into an aligned assistant. Understanding why each stage exists and how they connect is crucial for successful implementation.

Out[2]:
Visualization
Flow diagram showing three sequential RLHF stages with data inputs and model outputs.
The three-stage RLHF pipeline illustrating the transformation of a pretrained model into an aligned assistant. This process sequentially utilizes supervised demonstrations, reward modeling for preference learning, and PPO optimization to maximize helpfulness while maintaining linguistic coherence.

Stage 1: Supervised Fine-Tuning (SFT) takes a pretrained language model and trains it on high-quality demonstrations of desired behavior. This stage teaches the model the format and style of helpful responses, creating a starting point that can already follow instructions reasonably well.

Stage 2: Reward Model Training creates a model that predicts human preferences. Using comparison data where humans ranked alternative responses, the reward model learns to assign scalar scores reflecting response quality. As we covered in the Reward Modeling chapter, this model provides the optimization signal for the final stage.

Stage 3: PPO Fine-Tuning optimizes the SFT model to maximize rewards from the reward model while staying close to its original behavior. Building on the PPO for Language Models chapter, this stage performs the actual alignment through reinforcement learning.

Stage 1: Supervised Fine-Tuning

Supervised Fine-Tuning transforms a pretrained language model into one capable of following instructions and engaging in dialogue. While the pretrained model has acquired extensive knowledge and language understanding, it lacks the ability to respond helpfully to your queries in a conversational format.

Why SFT Comes First

You might wonder why we don't skip directly to reinforcement learning. The reason is practical: PPO optimization requires a model that already produces reasonable responses. Trying to optimize a raw pretrained model with RL is like trying to teach someone chess strategy before they know how the pieces move.

The pretrained model generates plausible continuations of text, but it doesn't understand the assistant paradigm. Given a question, it might generate more questions, continue with a different topic, or produce text that reads like training data rather than a helpful response. SFT provides the foundation by:

  • Teaching the instruction-following format (recognizing prompts, generating responses)
  • Establishing a baseline quality level that RL can refine
  • Reducing the search space for PPO (responses are already in a useful format)

SFT Data Requirements

SFT data consists of (prompt, response) pairs demonstrating ideal assistant behavior. Quality matters far more than quantity. A few thousand high-quality demonstrations often outperform millions of lower-quality examples.

In[3]:
Code
sft_examples = [
    {
        "prompt": "What causes rainbows to form?",
        "response": """Rainbows form through a combination of refraction, reflection, and dispersion of sunlight in water droplets. Here's how it works:

1. **Refraction**: Sunlight enters a water droplet and bends (refracts) because light travels slower in water than in air.

2. **Dispersion**: Different wavelengths of light bend by different amounts, separating white light into its component colors (red bends least, violet bends most).

3. **Internal Reflection**: The light reflects off the back inner surface of the droplet.

4. **Second Refraction**: The light bends again as it exits the droplet, further separating the colors.

You see a rainbow when the sun is behind you and rain is in front of you, with the light reflecting back at approximately 42 degrees from the original light direction.""",
    },
    {
        "prompt": "Write a short poem about coding",
        "response": """Lines of Logic

In the glow of screens at night,
We craft our worlds in black and white,
Variables dance, functions call,
Building dreams that never fall.

Bugs may lurk in shadowed code,
Each one a puzzle to decode,
But when at last the program runs,
We've made magic, byte by byte, we've won.""",
    },
]
Out[4]:
Console
SFT Training Example:
Prompt: What causes rainbows to form?

Response preview: Rainbows form through a combination of refraction, reflection, and dispersion of sunlight in water droplets. Here's how it works:

1. **Refraction**: Sunlight enters a water droplet and bends (refract...

The demonstrations should exhibit properties you want the final model to have: helpfulness, appropriate tone, factual accuracy, and safety awareness. As discussed in the Instruction Data Creation chapter, these examples can come from human writers, filtered model outputs, or synthetic generation with quality controls.

SFT Training Process

SFT uses standard causal language modeling loss but only on the response tokens. This distinction is crucial for understanding how the model learns during this stage. The prompt provides context, allowing the model to understand what kind of response is expected, but we don't penalize the model for not predicting prompt tokens. The reasoning is straightforward: we only want the model to learn how to respond given a prompt, not to memorize and reproduce the prompts themselves.

The training objective focuses exclusively on maximizing the likelihood of generating the correct response tokens, conditioned on the full context of the prompt. This targeted learning ensures the model develops the skill of producing appropriate responses rather than simply learning to continue arbitrary text.
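
Written as a formula, one standard way to express this per-example objective (using notation consistent with the reward-model equations later in this chapter) is to sum the negative log-likelihood over the response tokens only:

$$\mathcal{L}_{SFT} = -\sum_{t=1}^{|y|} \log \pi_\theta(y_t | x, y_{<t})$$

where $x$ is the prompt, $y$ is the demonstration response, and $\pi_\theta$ is the model being fine-tuned. Prompt tokens appear only as conditioning context, never as prediction targets.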

Out[5]:
Visualization
Visualization showing token sequence with prompt tokens masked and response tokens contributing to loss.
Response-only loss masking in SFT. Prompt tokens (gray) provide context but are masked in the loss calculation (-100). Only response tokens (green) contribute to gradients, ensuring the model learns to generate the completion rather than reproducing the prompt.
In[6]:
Code
import torch
from torch.utils.data import Dataset


class SFTDataset(Dataset):
    """Dataset for supervised fine-tuning with response-only loss."""

    def __init__(self, examples, tokenizer, max_length=512):
        self.examples = examples
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        example = self.examples[idx]

        prompt_tokens = self.tokenizer.encode(example["prompt"])
        response_tokens = self.tokenizer.encode(example["response"])

        full_tokens = prompt_tokens + response_tokens

        if len(full_tokens) > self.max_length:
            full_tokens = full_tokens[: self.max_length]

        # The response starts right after the prompt; cap the index so that
        # truncation which removes part (or all) of the response is handled.
        response_start = min(len(prompt_tokens), len(full_tokens))

        labels = [-100] * response_start + full_tokens[response_start:]

        padding_length = self.max_length - len(full_tokens)
        input_ids = full_tokens + [self.tokenizer.pad_token_id] * padding_length
        labels = labels + [-100] * padding_length

        return {
            "input_ids": torch.tensor(input_ids),
            "labels": torch.tensor(labels),
            "attention_mask": torch.tensor(
                [1] * len(full_tokens) + [0] * padding_length
            ),
        }

The key detail is setting labels to -100 for prompt tokens. This value serves as a special sentinel in PyTorch's CrossEntropyLoss function: any token position with a label of -100 is completely ignored during loss computation. By masking the prompt tokens this way, we ensure we only compute loss on the response portion, teaching the model to generate good responses without wasting gradient updates on predicting prompt content it doesn't need to reproduce.
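
A tiny standalone check (not part of the chapter's pipeline code) makes this behavior concrete: PyTorch's CrossEntropyLoss ignores label positions set to -100 by default.

import torch
import torch.nn as nn

# Illustrative example: two "prompt" positions are masked with -100,
# so only the last two "response" positions contribute to the loss.
logits = torch.randn(1, 4, 10)                 # (batch, seq_len, vocab_size)
labels = torch.tensor([[-100, -100, 3, 7]])    # prompt tokens masked

loss_fn = nn.CrossEntropyLoss()                # ignore_index defaults to -100
loss = loss_fn(logits.view(-1, 10), labels.view(-1))

# Same result as computing the loss over only the unmasked positions:
manual = loss_fn(logits[0, 2:], labels[0, 2:])
print(torch.isclose(loss, manual))             # tensor(True)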

In[7]:
Code
def sft_training_step(model, batch, optimizer):
    """Single SFT training step with response-only loss."""
    model.train()

    input_ids = batch["input_ids"]
    labels = batch["labels"]
    attention_mask = batch["attention_mask"]

    outputs = model(
        input_ids=input_ids,
        attention_mask=attention_mask,
        labels=labels,  # Model computes loss internally, ignoring -100 labels
    )

    loss = outputs.loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    return loss.item()

Key Parameters

SFT training is relatively straightforward compared to the later stages. The key parameters are:

  • Learning rate: Typically 1e-5 to 5e-5 for full fine-tuning, or 1e-4 to 3e-4 for LoRA (as covered in the LoRA chapters)
  • Batch size: 32 to 128 examples, depending on available memory
  • Epochs: 1-3 passes over the data; more risks overfitting
  • Warmup: 3-10% of total steps with linear warmup
In[8]:
Code
sft_config = {
    "learning_rate": 2e-5,
    "batch_size": 64,
    "num_epochs": 2,
    "warmup_ratio": 0.03,
    "weight_decay": 0.01,
    "max_grad_norm": 1.0,
    "lr_scheduler": "cosine",
}
Out[9]:
Console
Typical SFT Configuration:
  learning_rate: 2e-05
  batch_size: 64
  num_epochs: 2
  warmup_ratio: 0.03
  weight_decay: 0.01
  max_grad_norm: 1.0
  lr_scheduler: cosine

Overfitting is a real concern with small SFT datasets. Monitor validation loss and stop training when it begins to increase. You might prefer training for slightly fewer steps than optimal to preserve model generalization.
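
A minimal sketch of that kind of validation-based stopping rule is shown below; the `evaluate` helper, data loaders, and patience value are illustrative assumptions rather than part of any specific library API.

def train_sft_with_early_stopping(
    model, train_loader, val_loader, optimizer, evaluate, max_epochs=3, patience=1
):
    """Stop SFT when validation loss stops improving (hypothetical helper)."""
    best_val_loss = float("inf")
    epochs_without_improvement = 0

    for epoch in range(max_epochs):
        for batch in train_loader:
            sft_training_step(model, batch, optimizer)  # defined above

        val_loss = evaluate(model, val_loader)  # assumed: returns mean validation loss
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # validation loss is rising: stop to preserve generalization

    return best_val_loss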

Stage 2: Reward Model Training

The reward model learns to predict which responses humans prefer. Building on the Bradley-Terry model from earlier chapters, it converts pairwise comparisons into a scalar reward signal that guides PPO optimization.

Reward Model Architecture

As discussed in the Reward Modeling chapter, the reward model typically shares architecture with the language model but replaces the language modeling head with a scalar output head. This architectural choice is deliberate: by starting from a language model that understands text, the reward model inherits the ability to comprehend nuanced language, context, and meaning. The only modification is the final layer, which now outputs a single number representing quality rather than a distribution over vocabulary tokens.

In[10]:
Code
import torch
import torch.nn as nn


class RewardModel(nn.Module):
    """Reward model that outputs a scalar score for text quality."""

    def __init__(self, base_model, hidden_size):
        super().__init__()
        self.base_model = base_model

        self.reward_head = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 1),
        )

    def forward(self, input_ids, attention_mask):
        outputs = self.base_model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            output_hidden_states=True,
        )

        last_hidden_state = outputs.hidden_states[-1]

        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_indices = torch.arange(input_ids.size(0), device=input_ids.device)
        final_hidden = last_hidden_state[batch_indices, sequence_lengths]

        reward = self.reward_head(final_hidden).squeeze(-1)
        return reward

The reward model extracts the hidden state at the final token position and passes it through a small neural network to produce a scalar reward. This design leverages an important property of causal transformers: the final token's hidden state has attended to all previous tokens in the sequence, meaning it encapsulates information about the entire prompt-response pair. By using this aggregated representation, the model can make a holistic quality judgment that considers both what was asked and how well the response addresses it.
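
As a rough usage sketch (assuming GPT-2 as a stand-in backbone for illustration; in practice you would initialize from the SFT model's backbone), scoring a prompt-response pair looks like this:

from transformers import AutoModel, AutoTokenizer

# Illustrative only: GPT-2 stands in for the backbone you would actually use.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token          # GPT-2 has no pad token by default
base = AutoModel.from_pretrained("gpt2")

rm = RewardModel(base, hidden_size=base.config.hidden_size)

texts = ["How do I improve my writing skills? Read widely and write every day."]
tokens = tokenizer(texts, padding=True, return_tensors="pt")
scores = rm(tokens["input_ids"], tokens["attention_mask"])   # shape: (batch,)
print(scores)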

Preference Data Format

Training data consists of prompts with two or more responses ranked by human annotators. Each training example captures a comparison: given the same prompt, which response did humans consider better? This pairwise structure is fundamental to the Bradley-Terry model, which we'll examine shortly.

In[11]:
Code
preference_data = [
    {
        "prompt": "How do I improve my writing skills?",
        "chosen": """Here are effective strategies to improve your writing:

1. **Read widely and actively** - Study how skilled authors structure sentences and arguments
2. **Write daily** - Even 15 minutes of practice builds consistency
3. **Seek feedback** - Join a writing group or find a trusted reader
4. **Edit ruthlessly** - Good writing is rewriting; cut unnecessary words
5. **Study grammar** - Understanding rules helps you break them purposefully

Start with one habit and build from there.""",
        "rejected": """Writing is hard but you can get better. Just practice a lot and read books. 
Maybe take a class or something. Good luck!""",
    },
    {
        "prompt": "Explain quantum entanglement simply",
        "chosen": """Quantum entanglement is like having two magic coins. When you flip them while they're together, then separate them, even across the universe, they always land the same way.

In physics terms: two particles become correlated so that measuring one instantly affects what you'll measure on the other, regardless of distance. Einstein called this "spooky action at a distance."

This isn't about sending information faster than light, but about correlations that can't be explained by classical physics.""",
        "rejected": """Quantum entanglement is a quantum mechanical phenomenon where particles become interconnected through their quantum states such that the quantum state of each particle cannot be described independently.""",
    },
]
Out[12]:
Console
Number of preference pairs: 2

Example prompt: How do I improve my writing skills?
Chosen response length: 481 chars
Rejected response length: 120 chars

The training examples typically show that chosen responses are longer, more detailed, and better structured than rejected ones. The reward model learns to associate these features, along with factual accuracy and tone, with higher scalar scores.
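
A quick, admittedly crude check on the toy data above illustrates how you might quantify that tendency; if the chosen response is longer in nearly every pair, the reward model may learn length as a proxy for quality.

# Count how often the preferred response is simply the longer one.
longer_chosen = sum(
    len(ex["chosen"]) > len(ex["rejected"]) for ex in preference_data
)
rate = longer_chosen / len(preference_data)
print(f"Chosen response is longer in {rate:.0%} of pairs")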

Reward Model Training Loss

The training objective for the reward model emerges from a probabilistic framework for modeling human preferences. We begin with a natural question: given two responses to the same prompt, how likely is it that a human prefers one over the other? The Bradley-Terry model provides an elegant answer by relating this preference probability to the difference in quality scores assigned to each response.

The core insight is that we can model preferences as arising from latent quality scores. If response $y_w$ has a higher quality score than response $y_l$, then $y_w$ should be preferred more often. The Bradley-Terry model formalizes this intuition by expressing the preference probability as a function of the score difference:

$$P(y_w \succ y_l | x) = \sigma(r_\theta(x, y_w) - r_\theta(x, y_l))$$

where:

  • $P(y_w \succ y_l | x)$: the probability that response $y_w$ is preferred over $y_l$ given prompt $x$
  • $\sigma$: the logistic sigmoid function, $\sigma(z) = \frac{1}{1+e^{-z}}$
  • $r_\theta$: the reward model with parameters $\theta$
  • $x$: the input prompt
  • $y_w$: the preferred ("winning") response
  • $y_l$: the rejected ("losing") response

The sigmoid function $\sigma$ plays a crucial role in this formulation. It converts the unbounded difference in reward scores into a probability between 0 and 1, providing a smooth and differentiable mapping. When the reward for the winning response $r_\theta(x, y_w)$ is significantly larger than for the losing response, the difference becomes a large positive number, and the sigmoid approaches 1, indicating near-certainty that $y_w$ would be preferred. Conversely, if the scores are equal, the sigmoid returns 0.5, reflecting maximum uncertainty. This elegant mathematical structure captures our intuition that larger quality differences should correspond to more decisive preferences.
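
A few plugged-in numbers make this mapping concrete (the reward gaps below are arbitrary illustrative values):

import torch

# Map illustrative reward differences to Bradley-Terry preference probabilities.
for gap in [-2.0, 0.0, 0.5, 2.0, 5.0]:
    p = torch.sigmoid(torch.tensor(gap)).item()
    print(f"reward difference {gap:+.1f} -> P(chosen preferred) = {p:.3f}")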

Out[13]:
Visualization
Sigmoid curve showing how reward difference maps to preference probability in the Bradley-Terry model.
Bradley-Terry preference probability. The sigmoid function maps the reward difference between two responses to a probability of preference. A large positive difference implies near-certainty that the chosen response is better, while zero difference indicates equal preference likelihood.

To train the model, we minimize the negative log-likelihood of observing the preferences in our dataset. For a single preference pair, the loss is:

$$\mathcal{L}_{RM} = -\log \sigma(r_\theta(x, y_w) - r_\theta(x, y_l))$$

where:

  • $\mathcal{L}_{RM}$: the scalar loss value to be minimized
  • $\sigma$: the logistic sigmoid function, $\sigma(z) = \frac{1}{1+e^{-z}}$
  • $r_\theta$: the reward model with parameters $\theta$ that assigns a scalar score to a prompt-response pair
  • $x$: the input prompt or instruction
  • $y_w$: the "winning" or preferred response
  • $y_l$: the "losing" or rejected response
  • $r_\theta(x, y_w) - r_\theta(x, y_l)$: the difference in reward scores (which we want to be positive)

Understanding why this loss works requires examining what happens during optimization. When the model correctly assigns a higher score to the preferred response (making the difference positive and large), the sigmoid outputs a value close to 1, and the negative log becomes small. When the model incorrectly ranks the responses (making the difference negative), the sigmoid outputs a value close to 0, and the negative log becomes very large, creating a strong gradient signal to correct this error. This objective therefore directly maximizes the likelihood that the model assigns a higher score to the preferred response $y_w$ than the rejected response $y_l$, which is precisely what we want from a reward model.

In[14]:
Code
import torch


def compute_reward_model_loss(reward_model, batch, tokenizer, device):
    """Compute Bradley-Terry loss for preference learning."""

    prompts = batch["prompts"]
    chosen_responses = batch["chosen"]
    rejected_responses = batch["rejected"]

    chosen_texts = [p + c for p, c in zip(prompts, chosen_responses)]
    rejected_texts = [p + r for p, r in zip(prompts, rejected_responses)]

    chosen_tokens = tokenizer(
        chosen_texts, padding=True, return_tensors="pt"
    ).to(device)
    rejected_tokens = tokenizer(
        rejected_texts, padding=True, return_tensors="pt"
    ).to(device)

    chosen_rewards = reward_model(
        chosen_tokens["input_ids"], chosen_tokens["attention_mask"]
    )
    rejected_rewards = reward_model(
        rejected_tokens["input_ids"], rejected_tokens["attention_mask"]
    )

    # Bradley-Terry loss; logsigmoid is numerically more stable than log(sigmoid(x))
    loss = -torch.nn.functional.logsigmoid(
        chosen_rewards - rejected_rewards
    ).mean()

    accuracy = (chosen_rewards > rejected_rewards).float().mean()

    return loss, accuracy

Reward Model Training Metrics:

  • Loss: Bradley-Terry negative log-likelihood
  • Accuracy: Fraction of pairs where $r(\text{chosen}) > r(\text{rejected})$
  • Target accuracy: 70-80% (higher may indicate overfitting)
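
A minimal training loop around `compute_reward_model_loss` might look like the sketch below; the preference data loader and device handling are assumptions for illustration, not a prescribed API.

import torch


def train_reward_model(
    reward_model, preference_loader, tokenizer, device, lr=1e-5, num_epochs=1
):
    """Sketch of a reward-model training loop using the Bradley-Terry loss above."""
    optimizer = torch.optim.AdamW(reward_model.parameters(), lr=lr)
    reward_model.train()

    for epoch in range(num_epochs):
        for batch in preference_loader:  # yields {"prompts", "chosen", "rejected"}
            loss, accuracy = compute_reward_model_loss(
                reward_model, batch, tokenizer, device
            )

            optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(reward_model.parameters(), 1.0)
            optimizer.step()

    return reward_model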

Reward Model Calibration

A critical but often overlooked aspect is reward model calibration. The absolute reward values don't matter for ranking, since the Bradley-Terry model only uses differences between scores. However, the scale of these values significantly affects PPO training stability. If rewards are too large in magnitude, gradient updates can become unstable; if they vary too widely, the optimization landscape becomes difficult to navigate.

In[15]:
Code
import torch


def calibrate_reward_model(reward_model, calibration_data, tokenizer, device):
    """Calibrate reward model to have zero mean and unit variance."""
    reward_model.eval()

    all_rewards = []

    with torch.no_grad():
        for batch in calibration_data:
            texts = batch["texts"]
            tokens = tokenizer(texts, padding=True, return_tensors="pt").to(
                device
            )
            rewards = reward_model(
                tokens["input_ids"], tokens["attention_mask"]
            )
            all_rewards.append(rewards.cpu())

    all_rewards = torch.cat(all_rewards)
    mean_reward = all_rewards.mean().item()
    std_reward = all_rewards.std().item()

    return mean_reward, std_reward


def normalized_reward(raw_reward, mean, std):
    """Normalize reward to zero mean and unit variance."""
    return (raw_reward - mean) / (std + 1e-8)

Stage 3: PPO Fine-Tuning

With the SFT model and reward model ready, we can now run PPO optimization. This stage adjusts the policy to maximize expected reward while staying close to the SFT model's behavior. The challenge here is delicate: we want the model to improve according to the reward signal without losing the coherent language abilities it acquired during pretraining and SFT.

The PPO Training Loop

As we detailed in the PPO for Language Models chapter, each training iteration involves a carefully orchestrated sequence of steps that together enable stable policy improvement:

  1. Sampling: Generate responses from the current policy
  2. Reward computation: Score responses using the reward model
  3. Advantage estimation: Compute advantages using GAE
  4. Policy update: Optimize the clipped surrogate objective

This iterative process gradually shifts the policy's behavior toward responses that score higher according to the reward model, while the various stability mechanisms in PPO prevent the optimization from taking steps that are too large or in harmful directions.
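
Step 3 uses Generalized Advantage Estimation (GAE), covered in the earlier PPO chapters. As a reminder, a minimal single-sequence sketch looks like the following; per-token rewards and value estimates are assumed to be given, and the terminal value is taken to be zero for simplicity.

import torch


def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Minimal single-sequence GAE sketch: rewards and values are 1-D tensors."""
    advantages = torch.zeros_like(rewards)
    last_gae = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]   # TD residual
        last_gae = delta + gamma * lam * last_gae             # discounted sum of residuals
        advantages[t] = last_gae
    returns = advantages + values                             # targets for the value function
    return advantages, returns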
In[16]:
Code
import torch
import torch.optim


class RLHFTrainer:
    """Complete RLHF trainer implementing the PPO training loop."""

    def __init__(
        self, policy_model, ref_model, reward_model, tokenizer, config
    ):
        self.policy = policy_model
        self.ref_model = ref_model  # Frozen copy of SFT model
        self.reward_model = reward_model
        self.tokenizer = tokenizer
        self.config = config

        self.optimizer = torch.optim.AdamW(
            self.policy.parameters(),
            lr=config["learning_rate"],
            weight_decay=config["weight_decay"],
        )

    def generate_responses(self, prompts, max_length=256):
        """Generate responses from current policy."""
        self.policy.eval()

        responses = []
        log_probs_list = []

        with torch.no_grad():
            for prompt in prompts:
                input_ids = self.tokenizer.encode(prompt, return_tensors="pt")

                output_ids = []
                log_probs = []

                for _ in range(max_length):
                    outputs = self.policy(input_ids)
                    next_token_logits = outputs.logits[:, -1, :]
                    probs = torch.softmax(next_token_logits, dim=-1)

                    next_token = torch.multinomial(probs, num_samples=1)

                    token_log_prob = torch.log(probs[0, next_token.item()])

                    output_ids.append(next_token.item())
                    log_probs.append(token_log_prob.item())

                    input_ids = torch.cat([input_ids, next_token], dim=1)

                    if next_token.item() == self.tokenizer.eos_token_id:
                        break

                responses.append(self.tokenizer.decode(output_ids))
                log_probs_list.append(log_probs)

        return responses, log_probs_list

Computing the Complete Reward

The total reward used to update the policy combines two competing objectives. We want to maximize the reward model's score, which represents human preferences, but we also want to prevent the policy from straying too far from the reference model, which represents stable, coherent language generation. The KL divergence penalty provides this regularization, creating a tug-of-war that encourages improvement without catastrophic drift.

We'll explore the KL divergence penalty in detail in the next chapter, but the basic formulation captures this balance mathematically:

$$R_{total}(x, y) = R_{RM}(x, y) - \beta \cdot D_{KL}(\pi_\theta \| \pi_{ref})$$

where:

  • $R_{total}(x, y)$: the combined reward used to update the policy
  • $x$: the input prompt
  • $y$: the generated response
  • $R_{RM}(x, y)$: the preference score from the reward model
  • $\beta$: the KL penalty coefficient controlling regularization strength
  • $D_{KL}$: the Kullback-Leibler divergence between the two distributions
  • $\pi_\theta$: the current policy model
  • $\pi_{ref}$: the reference model (frozen SFT model)

The KL coefficient $\beta$ acts as a dial controlling the trade-off between reward maximization and behavioral stability. A larger $\beta$ keeps the policy closer to the reference model, preserving more of the original capabilities but potentially limiting how much the model can improve. A smaller $\beta$ allows more aggressive optimization toward higher rewards, but risks the policy finding reward model exploits or losing coherence. This formulation ensures that while we maximize the preference score, we maintain the linguistic coherence and knowledge of the original model.

Out[17]:
Visualization
Stacked bar chart showing how total reward is composed of reward model score minus KL penalty across training steps.
Reward composition in RLHF. The total reward (blue line) balances the raw preference score (green) against a KL divergence penalty (red). As the policy diverges from the reference model, the growing penalty prevents catastrophic drift.
In[18]:
Code
import torch


def compute_rewards_with_kl_penalty(
    policy_model,
    ref_model,
    reward_model,
    prompts,
    responses,
    tokenizer,
    kl_coef,
    device,
):
    """Compute total reward including KL penalty."""

    full_texts = [p + r for p, r in zip(prompts, responses)]
    tokens = tokenizer(full_texts, padding=True, return_tensors="pt").to(device)

    with torch.no_grad():
        rm_rewards = reward_model(tokens["input_ids"], tokens["attention_mask"])

    with torch.no_grad():
        policy_outputs = policy_model(
            tokens["input_ids"], attention_mask=tokens["attention_mask"]
        )
        ref_outputs = ref_model(
            tokens["input_ids"], attention_mask=tokens["attention_mask"]
        )

        policy_logprobs = torch.log_softmax(policy_outputs.logits, dim=-1)
        ref_logprobs = torch.log_softmax(ref_outputs.logits, dim=-1)

        # Per-token KL divergence between the policy and reference distributions
        token_kl = (
            policy_logprobs.exp() * (policy_logprobs - ref_logprobs)
        ).sum(dim=-1)
        # Exclude padding positions when summing the KL over the sequence
        kl_penalty = (token_kl * tokens["attention_mask"]).sum(dim=-1)

    total_rewards = rm_rewards - kl_coef * kl_penalty

    return total_rewards, rm_rewards, kl_penalty

PPO Update Step

The policy update uses the clipped surrogate objective from the PPO Algorithm chapter. This objective function represents the core mechanism that enables stable policy improvement: rather than directly maximizing expected reward, which could lead to catastrophically large updates, PPO constrains how much the policy can change in a single step. The clipping mechanism ensures that even if the advantage estimates suggest a large improvement, the actual policy update remains bounded.
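
A small numeric illustration (with arbitrary values) shows the effect: for epsilon = 0.2 and a positive advantage, any ratio above 1.2 contributes no more to the objective than 1.2 itself.

import torch

clip_epsilon = 0.2
ratios = torch.tensor([0.5, 0.9, 1.0, 1.5, 3.0])   # pi_new / pi_old, illustrative
advantage = torch.tensor(1.0)                       # a positive advantage

surr1 = ratios * advantage
surr2 = torch.clamp(ratios, 1 - clip_epsilon, 1 + clip_epsilon) * advantage
objective = torch.min(surr1, surr2)   # clipping caps the benefit of large ratios
print(objective)                      # tensor([0.5000, 0.9000, 1.0000, 1.2000, 1.2000])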

In[19]:
Code
import torch


def ppo_update(
    policy_model, optimizer, batch, clip_epsilon=0.2, entropy_coef=0.01
):
    """Perform PPO policy update with clipped objective."""
    policy_model.train()

    states = batch["states"]
    actions = batch["actions"]
    old_log_probs = batch["old_log_probs"]
    advantages = batch["advantages"]
    returns = batch["returns"]

    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

    outputs = policy_model(states)
    logits = outputs.logits

    log_probs = torch.log_softmax(logits, dim=-1)
    action_log_probs = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)

    ratios = torch.exp(action_log_probs - old_log_probs)

    surr1 = ratios * advantages
    surr2 = torch.clamp(ratios, 1 - clip_epsilon, 1 + clip_epsilon) * advantages
    policy_loss = -torch.min(surr1, surr2).mean()

    probs = torch.softmax(logits, dim=-1)
    entropy = -(probs * log_probs).sum(dim=-1).mean()
    entropy_loss = -entropy_coef * entropy

    total_loss = policy_loss + entropy_loss

    optimizer.zero_grad()
    total_loss.backward()
    torch.nn.utils.clip_grad_norm_(policy_model.parameters(), max_norm=1.0)
    optimizer.step()

    return {
        "policy_loss": policy_loss.item(),
        "entropy": entropy.item(),
        "mean_ratio": ratios.mean().item(),
        "clip_fraction": (
            (ratios < 1 - clip_epsilon) | (ratios > 1 + clip_epsilon)
        )
        .float()
        .mean()
        .item(),
    }

RLHF Debugging

RLHF training is notoriously difficult to debug. The interplay between the policy, reward model, and KL constraint creates many potential failure modes.

Key Metrics to Monitor

Effective RLHF debugging requires tracking multiple metrics throughout training:

In[20]:
Code
import numpy as np


class RLHFMetricsTracker:
    """Track key metrics for RLHF debugging."""

    def __init__(self):
        self.metrics_history = {
            "reward_mean": [],
            "reward_std": [],
            "kl_divergence": [],
            "policy_loss": [],
            "entropy": [],
            "clip_fraction": [],
            "response_length": [],
            "unique_tokens": [],
        }

    def log_step(self, metrics_dict):
        for key, value in metrics_dict.items():
            if key in self.metrics_history:
                self.metrics_history[key].append(value)

    def check_health(self):
        """Check for common RLHF failure modes."""
        warnings = []

        if len(self.metrics_history["kl_divergence"]) > 100:
            recent_kl = np.mean(self.metrics_history["kl_divergence"][-100:])
            if recent_kl > 10:
                warnings.append(
                    f"HIGH KL DIVERGENCE ({recent_kl:.2f}): Policy diverging from reference"
                )
            elif recent_kl < 0.01:
                warnings.append(
                    f"LOW KL DIVERGENCE ({recent_kl:.4f}): Policy not learning"
                )

        if len(self.metrics_history["reward_mean"]) > 100:
            recent_reward = np.mean(self.metrics_history["reward_mean"][-100:])
            if recent_reward > 5:
                warnings.append(
                    f"VERY HIGH REWARD ({recent_reward:.2f}): Possible reward hacking"
                )

        if len(self.metrics_history["entropy"]) > 100:
            recent_entropy = np.mean(self.metrics_history["entropy"][-100:])
            if recent_entropy < 0.1:
                warnings.append(
                    f"LOW ENTROPY ({recent_entropy:.4f}): Policy becoming deterministic"
                )

        if len(self.metrics_history["response_length"]) > 100:
            recent_len = np.mean(self.metrics_history["response_length"][-100:])
            early_len = np.mean(self.metrics_history["response_length"][:100])
            if recent_len < early_len * 0.5:
                warnings.append(
                    "RESPONSE LENGTH COLLAPSED: Model producing very short responses"
                )
            elif recent_len > early_len * 2:
                warnings.append(
                    "RESPONSE LENGTH EXPLOSION: Model producing very long responses"
                )

        return warnings
In[21]:
Code
import numpy as np

tracker = RLHFMetricsTracker()

np.random.seed(42)
for i in range(200):
    kl = 0.5 + 0.1 * np.random.randn() + i * 0.08  # Gradually increasing KL
    tracker.log_step(
        {
            "kl_divergence": kl,
            "reward_mean": 2.0 + 0.5 * np.random.randn() + i * 0.04,
            "entropy": max(0.05, 1.0 - i * 0.004 + 0.1 * np.random.randn()),
            "response_length": 100 + 10 * np.random.randn() - i * 0.2,
        }
    )

warnings = tracker.check_health()
Out[22]:
Console
Health Check Results:
  ⚠️  HIGH KL DIVERGENCE (12.45): Policy diverging from reference
  ⚠️  VERY HIGH REWARD (7.96): Possible reward hacking

The metrics tracker successfully identifies the simulated anomalies. By monitoring KL divergence and reward statistics, we can catch issues such as the high-KL spike and the suspiciously high rewards (here simulating reward hacking) before they destabilize the entire training run.

Common Failure Modes

Understanding these failure modes helps you diagnose and fix training issues.

Reward Hacking occurs when the policy finds exploits in the reward model that don't correspond to genuine quality improvements. Signs include rapidly increasing reward with degrading response quality, or unusual patterns like excessive repetition or specific phrases.
In[23]:
Code
import numpy as np


def detect_reward_hacking(responses, rewards, threshold_percentile=95):
    """Detect potential reward hacking by examining high-reward responses."""

    threshold = np.percentile(rewards, threshold_percentile)
    high_reward_indices = np.where(np.array(rewards) > threshold)[0]

    flags = []

    for idx in high_reward_indices:
        response = responses[idx]

        words = response.split()
        if len(words) > 0:
            unique_ratio = len(set(words)) / len(words)
            if unique_ratio < 0.3:
                flags.append(("repetition", idx, response[:100]))

        if len(response) > 2000:
            flags.append(("excessive_length", idx, f"Length: {len(response)}"))
        elif len(response) < 10:
            flags.append(("too_short", idx, response))

    return flags
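
Applied to a small batch of sampled responses and their rewards (the examples below are fabricated for illustration), the check flags the repetitive high-reward output:

# Illustrative inputs: the second response scores highly despite being degenerate.
sample_responses = [
    "Paris is the capital of France.",
    "great great great great great great great great great great",
    "That depends on several factors, including your goals and timeline.",
]
sample_rewards = [1.2, 4.8, 2.1]

flags = detect_reward_hacking(
    sample_responses, sample_rewards, threshold_percentile=50
)
for kind, idx, info in flags:
    print(f"{kind} at index {idx}: {info}")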

Mode Collapse happens when the policy converges to producing nearly identical responses regardless of the prompt. Monitor response diversity and entropy throughout training.

In[24]:
Code
import numpy as np


def compute_response_diversity(responses, n_samples=100):
    """Measure diversity of generated responses."""

    unique_responses = len(set(responses[:n_samples]))
    uniqueness_ratio = unique_responses / min(n_samples, len(responses))

    all_words = []
    for response in responses[:n_samples]:
        all_words.extend(response.lower().split())

    if len(all_words) > 0:
        vocab_size = len(set(all_words))
        type_token_ratio = vocab_size / len(all_words)
    else:
        type_token_ratio = 0

    return {
        "uniqueness_ratio": uniqueness_ratio,
        "type_token_ratio": type_token_ratio,
        "avg_response_length": np.mean([len(r) for r in responses[:n_samples]]),
    }

KL Explosion indicates the policy is moving too fast away from the reference model. This often precedes training instability:

Out[25]:
Visualization
Two line plots comparing healthy KL divergence vs KL explosion during RLHF training.
Comparison of KL divergence trajectories. Left: Healthy training maintains KL near the target (0.5), indicating controlled optimization. Right: KL explosion occurs when the policy diverges rapidly from the reference, often leading to mode collapse.
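
One simple mitigation, sketched below under the assumption that you already track a per-step KL estimate, is to compare it against the target and a hard limit and intervene before the run destabilizes; the thresholds and action names are illustrative.

def kl_guard(kl_value, target_kl=0.5, hard_limit=10.0):
    """Map the current KL estimate to a suggested action (illustrative thresholds)."""
    if kl_value > hard_limit:
        return "stop"                    # training has likely destabilized: halt and inspect samples
    if kl_value > 4 * target_kl:
        return "strengthen_kl_penalty"   # e.g. raise the KL coefficient beta
    return "continue"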

Debugging Workflow

When RLHF training goes wrong, follow this systematic debugging approach:

  1. Check the reward model first: Generate samples and manually verify that reward model scores align with your quality intuitions. A miscalibrated or overfitted reward model dooms PPO from the start.

  2. Examine generated samples: Look at actual model outputs throughout training. Metrics can hide problems that become obvious when reading responses.

  3. Verify the KL penalty is working: The policy should stay reasonably close to the reference. If responses look completely different from SFT outputs, the KL constraint may be too weak.

  4. Monitor multiple metrics together: Single metrics can be misleading. High reward with low diversity suggests reward hacking. Low KL with no reward improvement suggests the policy isn't learning.

In[26]:
Code
import torch


def debug_rlhf_step(
    policy_model, ref_model, reward_model, sample_prompts, tokenizer, device
):
    """Comprehensive debugging for a single RLHF step."""

    debug_info = {}

    policy_model.eval()
    ref_model.eval()

    with torch.no_grad():
        responses = []
        for prompt in sample_prompts[:5]:
            input_ids = tokenizer.encode(prompt, return_tensors="pt").to(device)
            output_ids = policy_model.generate(
                input_ids, max_new_tokens=100, do_sample=True, temperature=0.7
            )
            response = tokenizer.decode(output_ids[0][len(input_ids[0]) :])
            responses.append(response)

        debug_info["sample_responses"] = list(
            zip(sample_prompts[:5], responses)
        )

        full_texts = [p + r for p, r in zip(sample_prompts[:5], responses)]
        tokens = tokenizer(full_texts, padding=True, return_tensors="pt").to(
            device
        )

        rewards = reward_model(tokens["input_ids"], tokens["attention_mask"])
        debug_info["rewards"] = rewards.cpu().tolist()

        debug_info["response_lengths"] = [len(r) for r in responses]
        debug_info["unique_words"] = [len(set(r.split())) for r in responses]

    return debug_info
Out[27]:
Visualization
Four subplot grid showing different RLHF failure modes compared to healthy training.
Visual comparison of healthy RLHF training against common failure modes like reward hacking, mode collapse, and KL explosion. While healthy training balances reward and stability, failure modes manifest as anomalous trends such as skyrocketing rewards paired with plummeting response diversity.

Putting It All Together

Let's trace through a complete RLHF training run, showing how all pieces connect:

In[28]:
Code
import numpy as np


def run_rlhf_pipeline(config):
    """Complete RLHF training pipeline."""

    print("=" * 50)
    print("STAGE 1: Supervised Fine-Tuning")
    print("=" * 50)

    # SFT training loop (simplified)
    sft_metrics = {
        "initial_loss": 3.5,
        "final_loss": 1.8,
        "epochs": config["sft_epochs"],
    }

    print(f"  Initial loss: {sft_metrics['initial_loss']:.3f}")
    print(f"  Final loss: {sft_metrics['final_loss']:.3f}")
    print(f"  Trained for {sft_metrics['epochs']} epochs")
    print()

    print("=" * 50)
    print("STAGE 2: Reward Model Training")
    print("=" * 50)

    rm_metrics = {
        "initial_accuracy": 0.52,
        "final_accuracy": 0.74,
        "epochs": config["rm_epochs"],
    }

    print(f"  Initial accuracy: {rm_metrics['initial_accuracy']:.1%}")
    print(f"  Final accuracy: {rm_metrics['final_accuracy']:.1%}")
    print(f"  Trained for {rm_metrics['epochs']} epochs")
    print()

    print("=" * 50)
    print("STAGE 3: PPO Fine-Tuning")
    print("=" * 50)

    ppo_metrics = []
    np.random.seed(42)

    for step in range(0, config["ppo_steps"], config["ppo_steps"] // 10):
        progress = step / config["ppo_steps"]

        metrics = {
            "step": step,
            "reward": 0.5 + 1.5 * progress + 0.2 * np.random.randn(),
            "kl": 0.1 + 0.3 * progress + 0.05 * np.random.randn(),
            "policy_loss": -0.5 - 0.3 * progress + 0.1 * np.random.randn(),
        }
        ppo_metrics.append(metrics)

        if step % (config["ppo_steps"] // 5) == 0:
            print(
                f"  Step {step:5d}: reward={metrics['reward']:.3f}, "
                f"KL={metrics['kl']:.3f}, loss={metrics['policy_loss']:.3f}"
            )

    return sft_metrics, rm_metrics, ppo_metrics
In[29]:
Code
pipeline_config = {"sft_epochs": 2, "rm_epochs": 1, "ppo_steps": 10000}
In[30]:
Code
sft_results, rm_results, ppo_results = run_rlhf_pipeline(pipeline_config)
Out[30]:
Console
==================================================
STAGE 1: Supervised Fine-Tuning
==================================================
  Initial loss: 3.500
  Final loss: 1.800
  Trained for 2 epochs

==================================================
STAGE 2: Reward Model Training
==================================================
  Initial accuracy: 52.0%
  Final accuracy: 74.0%
  Trained for 1 epochs

==================================================
STAGE 3: PPO Fine-Tuning
==================================================
  Step     0: reward=0.599, KL=0.093, loss=-0.435
  Step  2000: reward=1.116, KL=0.198, loss=-0.607
  Step  4000: reward=1.148, KL=0.124, loss=-0.792
  Step  6000: reward=1.218, KL=0.209, loss=-0.533
  Step  8000: reward=1.591, KL=0.346, loss=-0.855

The text output confirms that each stage completed successfully. The SFT loss decreased significantly, and the Reward Model achieved a validation accuracy of 74%, which is within the typical 70-80% range for effective preference modeling. These healthy prerequisites set the stage for the PPO phase, which we can now visualize.

In[31]:
Code
steps = [m["step"] for m in ppo_results]
rewards = [m["reward"] for m in ppo_results]
kls = [m["kl"] for m in ppo_results]
Out[32]:
Visualization
Two line plots showing reward and KL divergence trends during PPO training.
Mean reward and KL divergence trajectories during PPO training. The reward increases as the policy aligns with human preferences, while the KL divergence remains stable near the target of 0.5, ensuring the model maintains its original capabilities without drifting into incoherence or reward hacking.

The training curves demonstrate healthy alignment progress. The reward (left) steadily increases, indicating the model is learning to satisfy the reward model's preferences. Meanwhile, the KL divergence (right) remains controlled near the target of 0.5, ensuring the model maintains the coherent capabilities of the original SFT model without drifting into incoherence or reward hacking.

Limitations and Practical Considerations

The RLHF pipeline, while powerful, comes with significant challenges that you must navigate.

Computational cost is substantial. The pipeline requires training three separate models (SFT, reward model, and PPO policy), with the PPO stage being particularly expensive because it requires running both the policy and reference model for every batch. A single RLHF training run can cost hundreds of thousands of dollars in compute for large models, making iteration and experimentation prohibitively expensive for most organizations. This has driven interest in more efficient alternatives like Direct Preference Optimization (DPO), which we'll explore in upcoming chapters.

Human annotation quality fundamentally limits what RLHF can achieve. The reward model can only capture patterns present in the preference data, and human annotators bring their own biases, inconsistencies, and limitations. Disagreement between annotators is common, yet the Bradley-Terry model assumes a consistent underlying preference ordering. When annotators disagree about what makes a response "better," the reward model learns a noisy compromise that may not align with your preferences.

Reward hacking remains an unsolved problem despite various mitigation strategies. As we discussed in the Reward Hacking chapter, the policy will exploit any systematic weakness in the reward model. The KL penalty helps by anchoring behavior to the reference model, but sufficiently capable policies can still find exploits within the allowed KL budget. This creates an ongoing cat-and-mouse dynamic where you must continually patch reward model vulnerabilities.

Reproducibility is challenging due to the many interacting hyperparameters and the sensitivity of PPO training. Small changes in learning rate, KL coefficient, or even random seed can lead to qualitatively different outcomes. This makes it difficult to compare results across papers or replicate published findings.

Despite these limitations, RLHF remains the most widely deployed alignment technique for production language models. Understanding the complete pipeline, including its failure modes, is essential for anyone working on language model alignment. The next chapter examines the KL divergence penalty in detail, which plays a crucial role in balancing reward maximization with behavioral stability.

Summary

The RLHF pipeline transforms a pretrained language model into an aligned assistant through three sequential stages. Supervised Fine-Tuning creates a model that understands the instruction-following format and produces reasonable responses. Reward Model training captures human preferences in a learnable function that provides optimization signal. PPO Fine-Tuning then optimizes the policy to maximize rewards while staying close to the reference model.

Key takeaways from this chapter:

  • SFT provides the foundation: PPO requires a model that already produces usable responses; skipping SFT leads to unstable training
  • Reward model quality is paramount: A flawed reward model will lead to flawed policies; validate carefully before PPO
  • KL penalty prevents catastrophic drift: Without anchoring to the reference model, the policy will exploit reward model weaknesses
  • Monitor multiple metrics: Single metrics can be misleading; track reward, KL, entropy, and response characteristics together
  • Debugging requires examining actual outputs: Metrics summarize behavior, but reading generated responses reveals problems that numbers hide

The RLHF pipeline established the template for aligning large language models, but its complexity and cost have motivated simpler alternatives. The next chapter examines the KL divergence penalty in mathematical detail, followed by chapters on Direct Preference Optimization, which eliminates the reward model and PPO stages entirely while achieving comparable alignment results.

Quiz

Ready to test your understanding? Take this quick quiz to reinforce what you've learned about the RLHF pipeline.
