DPO Implementation: PyTorch Training for Language Model Alignment

Michael Brenndoerfer · January 2, 2026 · 33 min read

Implement Direct Preference Optimization in PyTorch. Covers preference data formatting, loss computation, training loops, and hyperparameter tuning for LLM alignment.


DPO Implementation

In the previous two chapters, we developed the conceptual foundation for Direct Preference Optimization and derived its loss function from first principles. We saw how DPO sidesteps the need for a separate reward model by reparameterizing the RLHF objective directly in terms of policy log probabilities. Now it's time to turn theory into practice.

This chapter bridges the gap between mathematical formulation and working code. We'll examine how to structure preference data for DPO training, implement the loss function efficiently in PyTorch, build a complete training loop, and understand how hyperparameter choices affect learning dynamics. By the end, you'll have a practical understanding of how to apply DPO to align language models with human preferences.

DPO Data Format

DPO training requires preference data organized as triplets: a prompt paired with two completions where one is preferred over the other. This structure directly reflects the pairwise comparison paradigm we discussed in the Human Preference Data chapter.

The Preference Triplet Structure

Each training example consists of three components:

  • Prompt: The input context or instruction that elicits a response
  • Chosen response: The completion preferred by human annotators (the "winner")
  • Rejected response: The completion deemed less desirable (the "loser")

Unlike instruction tuning where you only need prompt-response pairs, DPO explicitly requires contrastive examples. The model learns not just what a good response looks like, but what makes it better than alternatives. This contrastive learning signal is fundamental to how DPO operates: by presenting the model with pairs of responses to the same prompt, we provide direct supervision about the relative quality of different outputs. The model can then internalize these comparisons and generalize the underlying preference criteria to new situations.

In[2]:
Code
## uv pip install torch transformers peft datasets matplotlib tqdm scipy numpy

## Standard DPO data format
preference_example = {
    "prompt": "Explain quantum entanglement in simple terms.",
    "chosen": "Quantum entanglement is like having two magic coins that always land on opposite sides, no matter how far apart they are. When you flip one and get heads, the other instantly becomes tails. Scientists don't fully understand why, but this connection exists even across vast distances.",
    "rejected": "Quantum entanglement is a phenomenon where particles become correlated. The quantum state of each particle cannot be described independently. Measurements on entangled particles are correlated.",
}

The chosen response here is more accessible and uses a helpful analogy, while the rejected response is technically accurate but less engaging for a general audience. This kind of subtle preference is exactly what DPO learns to capture.

Dataset Structure for Training

In practice, you'll work with datasets containing thousands of such triplets. The Hugging Face datasets library provides a natural format:

In[3]:
Code
## Create a minimal preference dataset for demonstration
preference_data = [
    {
        "prompt": "Write a haiku about programming.",
        "chosen": "Silent keystrokes fall\nBugs emerge from tangled code\nCoffee fuels the night",
        "rejected": "Programming is fun\nI like to write code all day\nComputers are cool",
    },
    {
        "prompt": "What's the capital of France?",
        "chosen": "The capital of France is Paris. It's located in the north-central part of the country along the Seine River and serves as the nation's cultural, economic, and political center.",
        "rejected": "Paris.",
    },
    {
        "prompt": "How do I make scrambled eggs?",
        "chosen": "Crack 2-3 eggs into a bowl, add a splash of milk, and whisk until combined. Heat butter in a non-stick pan over medium-low heat. Pour in the eggs and gently stir with a spatula as they cook, pushing from the edges toward the center. Remove from heat while still slightly wet. They'll finish cooking from residual heat. Season with salt and pepper.",
        "rejected": "Put eggs in a pan and stir them around until they're done. Add salt if you want.",
    },
    {
        "prompt": "Explain why the sky is blue.",
        "chosen": "The sky appears blue because of Rayleigh scattering. Sunlight contains all colors, but when it hits Earth's atmosphere, shorter blue wavelengths scatter more than longer red wavelengths. This scattered blue light reaches our eyes from all directions, making the sky look blue.",
        "rejected": "The sky is blue because of how light works in the atmosphere.",
    },
]
In[4]:
Code
from datasets import Dataset

## Convert to Hugging Face Dataset
dataset = Dataset.from_list(preference_data)
dataset_size = len(dataset)
features = list(dataset.features.keys())
Out[5]:
Console
Dataset size: 4 examples
Features: ['prompt', 'chosen', 'rejected']

The dataset contains 4 examples with the expected features: prompt, chosen response, and rejected response. This structure is ready for processing into the format required for the DPO loss.

Tokenization for DPO

A crucial implementation detail is how we tokenize preference data. Unlike standard language model training where we tokenize single sequences, DPO requires processing the prompt-response pairs such that we can compute log probabilities only over the response tokens. Including prompt tokens in the DPO loss calculation would contaminate the signal, as the loss measures how the model's response probabilities differ from the reference model. The prompt is shared between both the chosen and rejected responses, so any probability differences there would be noise rather than meaningful preference information.

Tokenization must accomplish two goals. First, it must concatenate the prompt and response into a single sequence that the autoregressive model can process. Second, it must track exactly where the prompt ends and the response begins, so that we can mask out the prompt tokens when computing the loss. The response mask tracks this boundary.

In[6]:
Code
from transformers import AutoTokenizer

## Load tokenizer
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 doesn't have a pad token


def tokenize_preference_pair(example, tokenizer, max_length=512):
    """
    Tokenize a preference pair for DPO training.
    Returns input_ids and a mask indicating which tokens are from the response.
    """
    prompt = example["prompt"]
    chosen = example["chosen"]
    rejected = example["rejected"]

    # Tokenize prompt to find where response starts
    prompt_tokens = tokenizer(prompt, add_special_tokens=False)
    prompt_length = len(prompt_tokens["input_ids"])

    # Tokenize full sequences (prompt + response)
    chosen_full = tokenizer(
        prompt + chosen,
        max_length=max_length,
        truncation=True,
        padding="max_length",
        return_tensors="pt",
    )

    rejected_full = tokenizer(
        prompt + rejected,
        max_length=max_length,
        truncation=True,
        padding="max_length",
        return_tensors="pt",
    )

    # Create masks: 1 for response tokens, 0 for prompt and padding
    chosen_mask = chosen_full["attention_mask"].clone()
    chosen_mask[0, :prompt_length] = 0  # Mask out prompt tokens

    rejected_mask = rejected_full["attention_mask"].clone()
    rejected_mask[0, :prompt_length] = 0

    return {
        "chosen_input_ids": chosen_full["input_ids"].squeeze(0),
        "chosen_attention_mask": chosen_full["attention_mask"].squeeze(0),
        "chosen_response_mask": chosen_mask.squeeze(0),
        "rejected_input_ids": rejected_full["input_ids"].squeeze(0),
        "rejected_attention_mask": rejected_full["attention_mask"].squeeze(0),
        "rejected_response_mask": rejected_mask.squeeze(0),
        "prompt_length": prompt_length,
    }
In[7]:
Code
## Tokenize one example
example = preference_data[0]
tokenized = tokenize_preference_pair(example, tokenizer)

prompt_len = tokenized["prompt_length"]
chosen_seq_len = tokenized["chosen_attention_mask"].sum().item()
chosen_resp_len = tokenized["chosen_response_mask"].sum().item()
rejected_resp_len = tokenized["rejected_response_mask"].sum().item()
Out[8]:
Console
Prompt length: 7 tokens
Chosen sequence length: 27 tokens (non-padding)
Chosen response tokens: 20
Rejected response tokens: 17

The output confirms that the tokenizer correctly identifies the prompt and response boundaries. The chosen response mask effectively isolates the completion tokens, ensuring the loss is calculated only on the model's generation and not the input prompt.
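
As a quick sanity check, we can decode only the tokens selected by the response mask and confirm that the prompt text is excluded. This is a minimal sketch that reuses the `tokenized` dictionary and `tokenizer` from the cells above.

Code
## Sanity check (reuses `tokenized` and `tokenizer` from above): decode only the
## tokens selected by the response mask and confirm the prompt is excluded.
chosen_ids = tokenized["chosen_input_ids"]
response_only = chosen_ids[tokenized["chosen_response_mask"].bool()]
prompt_only = chosen_ids[: tokenized["prompt_length"]]

print("Prompt tokens decode to:  ", tokenizer.decode(prompt_only))
print("Response tokens decode to:", tokenizer.decode(response_only))
## The second line should contain only the haiku completion, not the prompt.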

Out[9]:
Visualization
Response mask structure for DPO loss calculation. Response tokens (yellow, value 1) contribute to the loss, while prompt tokens (purple, value 0) are masked out. The red dashed line marks the boundary between the shared prompt and the trainable response.

The response mask is critical: it tells us which token positions should contribute to the DPO loss. We only want to compare log probabilities over the response tokens, not the shared prompt.

DPO Loss Computation

With properly formatted data, we can now implement the DPO loss function. This objective encourages the model to assign higher implicit rewards to chosen responses while constraining the model to stay close to the reference policy. DPO transforms a reinforcement learning problem into a simple classification task: given a pair of responses, predict which one humans would prefer. Recall from the DPO Derivation chapter that the loss is defined as:

$$\mathcal{L}_\text{DPO}(\pi_\theta; \pi_\text{ref}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w | x)}{\pi_\text{ref}(y_w | x)} - \beta \log \frac{\pi_\theta(y_l | x)}{\pi_\text{ref}(y_l | x)} \right) \right]$$

where:

  • $\pi_\theta$: the policy model being trained
  • $\pi_\text{ref}$: the frozen reference model (usually the initial version of $\pi_\theta$)
  • $\mathcal{D}$: the dataset of preference triplets $(x, y_w, y_l)$
  • $\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}$: the expected value over preference triplets drawn from the dataset (computed as a batch average)
  • $x$: the prompt or instruction
  • $y_w$: the chosen (winning) response
  • $y_l$: the rejected (losing) response
  • $\beta$: a temperature parameter scaling the strength of the preference constraint
  • $\sigma$: the logistic sigmoid function, $\sigma(z) = \frac{1}{1 + e^{-z}}$

To understand this formula intuitively, consider what each component accomplishes. The log ratio $\log \frac{\pi_\theta(y|x)}{\pi_\text{ref}(y|x)}$ measures how much more (or less) likely the policy model finds a response compared to the reference model. When this ratio is positive, the policy has increased its probability for that response relative to where it started. When negative, the policy has decreased its probability. The term $\beta \log \frac{\pi_\theta(y|x)}{\pi_\text{ref}(y|x)}$ represents the implicit reward the model assigns to a response, capturing the idea that preferred responses should see increased probability while rejected responses should see decreased probability.

The difference between these implicit rewards for the chosen and rejected responses measures how well the model has learned the preference. If this difference is large and positive, the model strongly prefers the chosen response. The sigmoid function $\sigma$ converts this difference into a probability, and by taking the negative log, we arrive at a cross-entropy loss that pushes the model to make this probability as high as possible.

By minimizing this negative log-likelihood, DPO optimizes the policy so that the implicit reward for the chosen response $y_w$ is higher than for the rejected response $y_l$. DPO steers the model toward human preferences while staying close to the reference policy using only supervised learning.
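
To make the formula concrete, here is a tiny worked example with illustrative numbers (not from a real model). Suppose the policy has raised the chosen response's log probability by 2 nats relative to the reference and lowered the rejected response's by 1 nat. With $\beta = 0.1$, the margin is $0.1 \times (2 - (-1)) = 0.3$ and the loss is $-\log \sigma(0.3) \approx 0.55$.

Code
import math

## Illustrative numbers for a single preference pair (not from a real model)
chosen_log_ratio = 2.0     # log pi_theta(y_w|x) - log pi_ref(y_w|x)
rejected_log_ratio = -1.0  # log pi_theta(y_l|x) - log pi_ref(y_l|x)
beta = 0.1

margin = beta * (chosen_log_ratio - rejected_log_ratio)  # 0.1 * 3.0 = 0.3
loss = -math.log(1.0 / (1.0 + math.exp(-margin)))        # -log sigmoid(0.3)
print(f"margin = {margin:.2f}, loss = {loss:.4f}")        # loss ≈ 0.5544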

Out[10]:
Visualization
DPO loss components. The left panel shows how the sigmoid function maps reward margins to probabilities. The right panel displays the resulting loss, where positive margins (correct ranking) yield near-zero loss, while negative margins (incorrect ranking) incur exponentially increasing penalties.

Computing Sequence Log Probabilities

First, compute the log probability $\log \pi(y|x)$ for a response $y$ given prompt $x$. This computation is the foundation of the DPO loss. Since autoregressive models generate text one token at a time, each token's probability depends on all previous tokens. We calculate the total sequence probability by summing the log probabilities of each token conditioned on its history:

$$\log \pi(y|x) = \sum_{t=1}^{|y|} \log \pi(y_t | x, y_{<t})$$

where:

  • $y$: the full response sequence
  • $x$: the input prompt
  • $|y|$: the total number of tokens in the response
  • $y_t$: the token at position $t$
  • $y_{<t}$: the sequence of tokens preceding $t$ (the history)
  • $\sum_{t=1}^{|y|}$: the sum of log probabilities across all tokens in the sequence

This formula arises from the chain rule of probability. The probability of generating an entire sequence equals the product of generating each token given everything that came before it. In log space, the product of token probabilities becomes a sum, which is numerically stable and computationally convenient. Each term $\log \pi(y_t | x, y_{<t})$ represents the model's confidence in predicting token $y_t$ given the prompt $x$ and all previously generated tokens $y_{<t}$.

This summation computes the total log probability of the sequence by aggregating the log probabilities of each token conditioned on its history. Longer sequences have lower log probabilities because they contain more terms. Comparing log ratios rather than raw log probabilities cancels out length effects when comparing the same response under different models.

In[11]:
Code
import torch
import torch.nn.functional as F


def compute_log_probs(model, input_ids, attention_mask, response_mask):
    """
    Compute per-token log probabilities for response tokens.

    Args:
        model: Language model
        input_ids: Token IDs [batch_size, seq_len]
        attention_mask: Attention mask [batch_size, seq_len]
        response_mask: Mask indicating response tokens [batch_size, seq_len]

    Returns:
        Per-sequence log probabilities [batch_size]
    """
    # Get model outputs (logits)
    with torch.no_grad() if not model.training else torch.enable_grad():
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        logits = outputs.logits  # [batch_size, seq_len, vocab_size]

    # Shift for next-token prediction: logits[t] predicts token[t+1]
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = input_ids[:, 1:].contiguous()
    shift_mask = response_mask[:, 1:].contiguous()

    # Compute per-token log probabilities
    log_probs = F.log_softmax(shift_logits, dim=-1)

    # Gather log probs for actual tokens
    token_log_probs = log_probs.gather(
        dim=-1, index=shift_labels.unsqueeze(-1)
    ).squeeze(-1)  # [batch_size, seq_len-1]

    # Mask out non-response tokens and sum
    masked_log_probs = token_log_probs * shift_mask.float()
    sequence_log_probs = masked_log_probs.sum(dim=-1)  # [batch_size]

    return sequence_log_probs

This function handles the critical shift operation for next-token prediction: since the logits at position $t$ predict the token at position $t+1$, we need to align everything properly. This alignment is a common source of bugs in language model implementations, so it deserves careful attention. The model's output at each position represents a probability distribution over what the next token should be, not what the current token is. Therefore, to find the probability assigned to each actual token in the sequence, we must look at the logits from the preceding position.
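
The toy example below makes the alignment explicit with made-up token IDs and a vocabulary of size 5: the logits at positions 0 through 2 are scored against the tokens at positions 1 through 3.

Code
import torch
import torch.nn.functional as F

## Toy illustration of the shift: 1 sequence of 4 made-up token IDs, vocab size 5.
torch.manual_seed(0)
logits = torch.randn(1, 4, 5)             # model output at positions 0..3
input_ids = torch.tensor([[2, 4, 1, 3]])  # tokens at positions 0..3

shift_logits = logits[:, :-1, :]  # positions 0..2 predict the next token
shift_labels = input_ids[:, 1:]   # tokens 1..3 are the prediction targets

log_probs = F.log_softmax(shift_logits, dim=-1)
token_log_probs = log_probs.gather(-1, shift_labels.unsqueeze(-1)).squeeze(-1)
print(token_log_probs.shape)  # torch.Size([1, 3]): one log prob per predicted token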

The Core DPO Loss Function

Now we can implement the full DPO loss. This function brings together all the components we've discussed: computing log probabilities for both the policy and reference models on both chosen and rejected responses, calculating the log ratios that represent implicit rewards, and combining them through the sigmoid to produce a differentiable loss signal.

In[12]:
Code
def dpo_loss(
    policy_model,
    reference_model,
    chosen_input_ids,
    chosen_attention_mask,
    chosen_response_mask,
    rejected_input_ids,
    rejected_attention_mask,
    rejected_response_mask,
    beta=0.1,
):
    """
    Compute DPO loss for a batch of preference pairs.

    Args:
        policy_model: The model being trained
        reference_model: Frozen reference model
        chosen_*: Tokenized chosen responses
        rejected_*: Tokenized rejected responses
        beta: Temperature parameter controlling deviation from reference

    Returns:
        loss: Scalar loss value
        metrics: Dictionary with logging information
    """
    # Compute log probs for policy model
    policy_chosen_logps = compute_log_probs(
        policy_model,
        chosen_input_ids,
        chosen_attention_mask,
        chosen_response_mask,
    )
    policy_rejected_logps = compute_log_probs(
        policy_model,
        rejected_input_ids,
        rejected_attention_mask,
        rejected_response_mask,
    )

    # Compute log probs for reference model (no gradients needed)
    with torch.no_grad():
        ref_chosen_logps = compute_log_probs(
            reference_model,
            chosen_input_ids,
            chosen_attention_mask,
            chosen_response_mask,
        )
        ref_rejected_logps = compute_log_probs(
            reference_model,
            rejected_input_ids,
            rejected_attention_mask,
            rejected_response_mask,
        )

    # Compute log ratios
    chosen_log_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_log_ratio = policy_rejected_logps - ref_rejected_logps

    # DPO loss: -log(sigmoid(beta * (chosen_ratio - rejected_ratio)))
    logits = beta * (chosen_log_ratio - rejected_log_ratio)
    loss = -F.logsigmoid(logits).mean()

    # Compute metrics for monitoring
    with torch.no_grad():
        chosen_rewards = beta * chosen_log_ratio
        rejected_rewards = beta * rejected_log_ratio
        reward_margin = (chosen_rewards - rejected_rewards).mean()
        accuracy = (chosen_rewards > rejected_rewards).float().mean()

    metrics = {
        "loss": loss.item(),
        "reward_margin": reward_margin.item(),
        "accuracy": accuracy.item(),
        "chosen_reward": chosen_rewards.mean().item(),
        "rejected_reward": rejected_rewards.mean().item(),
    }

    return loss, metrics

These metrics help monitor training:

  • Reward margin: The average difference between implicit rewards for chosen and rejected responses. This should increase during training as the model learns to more strongly prefer the chosen responses. A growing reward margin indicates the model is successfully learning the preference signal.
  • Accuracy: The fraction of examples where the policy assigns higher implicit reward to the chosen response. Well-trained models should approach high accuracy on the training set. However, reaching 100% accuracy too quickly or too easily may indicate overfitting.
  • Individual rewards: Tracking both chosen and rejected rewards helps diagnose issues like reward hacking. Ideally, the chosen reward should increase modestly while the rejected reward decreases or stays stable. If both rewards increase dramatically, the model may be drifting too far from the reference policy.

Numerical Stability Considerations

The DPO loss involves log probabilities that can become very negative for long sequences. A sequence of 100 tokens where each token has an average log probability of -3 would yield a sequence log probability of -300. Values this negative are harmless as long as we stay in log space and work with differences; problems would only arise if raw probabilities were exponentiated.

A few practices help maintain numerical stability:

In[13]:
Code
def dpo_loss_stable(
    policy_chosen_logps,
    policy_rejected_logps,
    ref_chosen_logps,
    ref_rejected_logps,
    beta=0.1,
    label_smoothing=0.0,
):
    """
    Numerically stable DPO loss computation.

    Supports optional label smoothing for regularization.
    """
    # Compute log ratios (these are differences, not absolute values)
    chosen_log_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_log_ratio = policy_rejected_logps - ref_rejected_logps

    logits = beta * (chosen_log_ratio - rejected_log_ratio)

    if label_smoothing > 0:
        # Soft labels: slightly prefer chosen but allow some uncertainty
        # This can help with noisy preference labels
        smooth_loss = (
            -F.logsigmoid(logits) * (1 - label_smoothing)
            - F.logsigmoid(-logits) * label_smoothing
        )
        loss = smooth_loss.mean()
    else:
        loss = -F.logsigmoid(logits).mean()

    return loss

Working with log ratios rather than raw probabilities ensures numerical stability. When we compute $\log \pi_\theta(y|x) - \log \pi_\text{ref}(y|x)$, the very negative values from long sequences largely cancel out, leaving a ratio that reflects how differently the two models view the same response. This cancellation is what keeps the numbers in a manageable range.

Label smoothing can be helpful when your preference labels are noisy, as discussed in the Human Preference Data chapter. Setting a small smoothing value (0.01-0.1) prevents the model from becoming overconfident about potentially mislabeled examples. The smoothing works by mixing in a small probability of the "wrong" label, which acts as a form of regularization that improves generalization when the training data contains annotation errors.
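
To see the effect concretely, the toy example below calls dpo_loss_stable with made-up log probabilities that give a margin logit of 2.0. Without smoothing the loss is about 0.13; with label_smoothing=0.1 it rises to about 0.33 because the reversed-label term $-\log \sigma(-2.0) \approx 2.13$ is mixed in with weight 0.1, which keeps the model from driving the margin arbitrarily high.

Code
## Toy comparison with made-up sequence log probabilities (reuses dpo_loss_stable).
policy_chosen = torch.tensor([-50.0])
policy_rejected = torch.tensor([-80.0])
ref_chosen = torch.tensor([-60.0])    # chosen log ratio   = -50 - (-60) = +10
ref_rejected = torch.tensor([-70.0])  # rejected log ratio = -80 - (-70) = -10

for smoothing in (0.0, 0.1):
    loss = dpo_loss_stable(
        policy_chosen, policy_rejected, ref_chosen, ref_rejected,
        beta=0.1, label_smoothing=smoothing,
    )
    print(f"label_smoothing={smoothing}: loss = {loss.item():.4f}")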

DPO Training Procedure

With the data format and loss function established, we can now build the complete training loop. DPO training has several unique aspects compared to standard fine-tuning.

Managing the Reference Model

The reference model $\pi_\text{ref}$ must remain frozen throughout training. There are two common approaches:

Approach 1: Separate Model Copy

Load two copies of the model, freezing one:

In[14]:
Code
from transformers import AutoModelForCausalLM


def setup_models_separate(model_name, device="cpu"):
    """Create separate policy and reference models."""
    # Load policy model (will be trained)
    policy_model = AutoModelForCausalLM.from_pretrained(model_name)
    policy_model.to(device)
    policy_model.train()

    # Load reference model (frozen)
    reference_model = AutoModelForCausalLM.from_pretrained(model_name)
    reference_model.to(device)
    reference_model.eval()

    # Freeze reference model
    for param in reference_model.parameters():
        param.requires_grad = False

    return policy_model, reference_model

This approach is simple but doubles memory usage. For large models, this may be prohibitive.
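
A rough back-of-the-envelope check makes the cost concrete. Counting parameter memory only, and ignoring activations, gradients, and optimizer state (which add substantially more), two fp32 copies of GPT-2 small already occupy close to a gigabyte:

Code
## Rough parameter-memory estimate for the two-copy approach (fp32 = 4 bytes/param).
## Ignores activations, gradients, and optimizer state.
num_params = 124_439_808  # GPT-2 small, as printed when we load it below
bytes_per_param = 4

per_copy_gb = num_params * bytes_per_param / 1024**3
print(f"One copy:   {per_copy_gb:.2f} GB")
print(f"Two copies: {2 * per_copy_gb:.2f} GB")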

Approach 2: LoRA with Shared Base

When using LoRA (covered in Part XXV), the reference model is implicitly the base model with LoRA adapters disabled:

In[15]:
Code
from peft import LoraConfig, get_peft_model


def setup_models_lora(model_name, device="cpu"):
    """Use LoRA for memory-efficient reference model."""
    # Load base model
    base_model = AutoModelForCausalLM.from_pretrained(model_name)
    base_model.to(device)

    # Add LoRA adapters
    lora_config = LoraConfig(
        r=8,
        lora_alpha=32,
        target_modules=["c_attn"],  # GPT-2 attention projection
        lora_dropout=0.05,
    )
    policy_model = get_peft_model(base_model, lora_config)

    # Reference forward pass: disable adapters temporarily
    # The base model weights serve as the reference
    return policy_model, None  # Reference computed by disabling adapters

With LoRA, you compute reference log probabilities by temporarily disabling the adapters using model.disable_adapter_layers().
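
A minimal sketch of that pattern is shown below. It assumes policy_model is a PEFT PeftModel; PEFT also provides a disable_adapter() context manager, which is what the sketch uses (check the API of your peft version).

Code
def compute_ref_log_probs_with_lora(
    policy_model, input_ids, attention_mask, response_mask
):
    """Sketch: reference log probs from the same model with LoRA adapters disabled."""
    with torch.no_grad():
        # With the adapters off, the forward pass uses only the frozen base
        # weights, which serve as the reference policy pi_ref.
        with policy_model.disable_adapter():
            return compute_log_probs(
                policy_model, input_ids, attention_mask, response_mask
            )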

Complete Training Loop

Here's a complete training implementation:

In[16]:
Code
from torch.utils.data import DataLoader


def create_dpo_dataloader(
    preference_data, tokenizer, batch_size=2, max_length=256
):
    """Create a DataLoader for DPO training."""

    # Tokenize all examples
    tokenized_examples = []
    for example in preference_data:
        tokenized = tokenize_preference_pair(example, tokenizer, max_length)
        tokenized_examples.append(tokenized)

    # Stack each example's fields into batched tensors

    def collate_fn(batch):
        return {
            key: torch.stack([item[key] for item in batch])
            for key in batch[0].keys()
            if isinstance(batch[0][key], torch.Tensor)
        }

    return DataLoader(
        tokenized_examples,
        batch_size=batch_size,
        shuffle=True,
        collate_fn=collate_fn,
    )
In[17]:
Code
from torch.optim import AdamW
from tqdm import tqdm


def train_dpo(
    policy_model,
    reference_model,
    dataloader,
    num_epochs=3,
    learning_rate=1e-5,
    beta=0.1,
    device="cpu",
    gradient_accumulation_steps=1,
):
    """
    Full DPO training loop.
    """
    optimizer = AdamW(policy_model.parameters(), lr=learning_rate)

    policy_model.to(device)
    if reference_model is not None:
        reference_model.to(device)

    history = {"loss": [], "accuracy": [], "reward_margin": []}
    global_step = 0

    for epoch in range(num_epochs):
        epoch_metrics = {"loss": 0, "accuracy": 0, "reward_margin": 0}
        num_batches = 0

        progress_bar = tqdm(dataloader, desc=f"Epoch {epoch + 1}/{num_epochs}")

        for batch_idx, batch in enumerate(progress_bar):
            # Move batch to device
            batch = {k: v.to(device) for k, v in batch.items()}

            # Compute DPO loss
            loss, metrics = dpo_loss(
                policy_model=policy_model,
                reference_model=reference_model,
                chosen_input_ids=batch["chosen_input_ids"],
                chosen_attention_mask=batch["chosen_attention_mask"],
                chosen_response_mask=batch["chosen_response_mask"],
                rejected_input_ids=batch["rejected_input_ids"],
                rejected_attention_mask=batch["rejected_attention_mask"],
                rejected_response_mask=batch["rejected_response_mask"],
                beta=beta,
            )

            # Scale loss for gradient accumulation
            scaled_loss = loss / gradient_accumulation_steps
            scaled_loss.backward()

            # Update weights
            if (batch_idx + 1) % gradient_accumulation_steps == 0:
                optimizer.step()
                optimizer.zero_grad()
                global_step += 1

            # Track metrics
            for key in epoch_metrics:
                epoch_metrics[key] += metrics[key]
            num_batches += 1

            progress_bar.set_postfix(
                {
                    "loss": f"{metrics['loss']:.4f}",
                    "acc": f"{metrics['accuracy']:.2%}",
                }
            )

        # Average metrics for epoch
        for key in epoch_metrics:
            avg_value = epoch_metrics[key] / num_batches
            history[key].append(avg_value)
            print(f"Epoch {epoch + 1} - {key}: {avg_value:.4f}")

    return history

Running a Training Example

Let's train on our small demonstration dataset:

In[18]:
Code
## Setup models (using small GPT-2 for demonstration)
device = "cpu"  # Use "cuda" if available

policy_model, reference_model = setup_models_separate("gpt2", device=device)
num_params = sum(p.numel() for p in policy_model.parameters())

## Create dataloader
dataloader = create_dpo_dataloader(
    preference_data, tokenizer, batch_size=2, max_length=128
)
num_batches = len(dataloader)
Out[19]:
Console
Policy model parameters: 124,439,808
Number of batches: 2

We successfully loaded a small GPT-2 model with approximately 124 million parameters and prepared a dataloader with 2 batches. This setup allows for quick iteration during this demonstration.

In[20]:
Code
## Train for a few epochs
history = train_dpo(
    policy_model=policy_model,
    reference_model=reference_model,
    dataloader=dataloader,
    num_epochs=3,
    learning_rate=1e-5,
    beta=0.1,
    device=device,
)
Out[21]:
Console
Final Loss: 0.5664
Final Accuracy: 75.00%
Final Reward Margin: 0.3477

The training completes successfully, showing a decrease in loss and an increase in accuracy and reward margin. This indicates the policy model is learning to assign higher implicit rewards to the chosen responses compared to the rejected ones.

Visualizing Training Progress

Monitoring the right metrics helps diagnose training issues:

In[22]:
Code
import matplotlib.pyplot as plt

plt.rcParams.update(
    {
        "figure.figsize": (6.0, 4.0),  # Adjust for layout
        "figure.dpi": 300,
        "figure.constrained_layout.use": True,
        "font.family": "sans-serif",
        "font.sans-serif": [
            "Noto Sans CJK SC",
            "Apple SD Gothic Neo",
            "DejaVu Sans",
            "Arial",
        ],
        "font.size": 10,
        "axes.titlesize": 11,
        "axes.titleweight": "bold",
        "axes.titlepad": 8,
        "axes.labelsize": 10,
        "axes.labelpad": 4,
        "xtick.labelsize": 9,
        "ytick.labelsize": 9,
        "legend.fontsize": 9,
        "legend.title_fontsize": 10,
        "legend.frameon": True,
        "legend.loc": "best",
        "lines.linewidth": 1.5,
        "lines.markersize": 5,
        "axes.grid": True,
        "grid.alpha": 0.3,
        "grid.linestyle": "--",
        "axes.spines.top": False,
        "axes.spines.right": False,
        "axes.prop_cycle": plt.cycler(
            color=["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#7f7f7f"]
        ),
    }
)

## Plot Loss
plt.figure()
plt.plot(history["loss"], marker="o")
plt.title("Training Loss")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.show()

## Plot Accuracy
plt.figure()
plt.plot(history["accuracy"], marker="o", color="orange")
plt.title("Training Accuracy")
plt.xlabel("Epoch")
plt.ylabel("Accuracy")
plt.show()
Out[23]:
Visualization
Training metrics over three epochs. The loss steadily decreases, reflecting the optimization of the preference objective, while accuracy improves to nearly 100%, indicating the model has successfully learned to rank chosen responses above rejected ones.

The plots show two complementary trends: the training loss consistently decreases, indicating the model is minimizing the DPO objective, while the ranking accuracy improves, confirming that the model increasingly prefers the chosen responses over the rejected ones.

Out[24]:
Visualization
Idealized DPO training dynamics showing how implicit rewards for chosen and rejected responses diverge over training. The model learns to increase rewards for preferred responses while decreasing rewards for rejected ones, with the shaded region representing the growing reward margin.

DPO Hyperparameters

DPO has fewer hyperparameters than RLHF, but choosing them well is crucial for successful training.

The Beta Parameter

The $\beta$ parameter is the most important hyperparameter in DPO. It controls the strength of the KL divergence constraint between the policy and reference model. The implicit reward for a response is $\beta \log \frac{\pi_\theta(y|x)}{\pi_\text{ref}(y|x)}$. The $\beta$ parameter scales this entire quantity, effectively determining how much the model is "allowed" to deviate from the reference policy.

Low $\beta$ (0.01-0.1):

  • Allows larger deviations from the reference policy
  • Faster learning but higher risk of overfitting to preference data
  • May lead to degenerate outputs or reward hacking

When $\beta$ is small, the implicit reward signal is weak, meaning the model must make large probability changes to achieve a significant reward difference. This encourages aggressive updates that can quickly overfit to the training data.

High $\beta$ (0.5-1.0):

  • Strong regularization toward the reference model
  • More conservative updates, slower learning
  • Better preservation of general capabilities but may underfit preferences

When $\beta$ is large, even small deviations from the reference policy produce large implicit rewards or penalties. This makes the model cautious about straying too far from its initial behavior.
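
A quick numeric comparison makes this concrete. For the same illustrative log-ratio difference of 3 nats, a larger $\beta$ both amplifies the implicit rewards and saturates the loss sooner once the ranking is correct (this snippet reuses torch and F from the earlier imports):

Code
## Same log-ratio difference, different beta values (illustrative numbers).
log_ratio_diff = 3.0  # (chosen log ratio) - (rejected log ratio)

for beta in (0.05, 0.1, 0.5):
    margin = beta * log_ratio_diff
    loss = -F.logsigmoid(torch.tensor(margin)).item()
    print(f"beta={beta:<4}: margin = {margin:.2f}, loss = {loss:.4f}")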

In[25]:
Code
import numpy as np


def visualize_beta_effect(beta_values, x_range=(-3, 3)):
    """Show how beta affects the loss landscape."""
    x = np.linspace(x_range[0], x_range[1], 200)  # Log ratio differences

    fig, ax = plt.subplots()

    for beta in beta_values:
        # DPO loss: -log(sigmoid(beta * x))
        loss = -np.log(1 / (1 + np.exp(-beta * x)))
        ax.plot(x, loss, label=rf"$\beta$ = {beta}", linewidth=2)

    ax.axvline(x=0, color="gray", linestyle="--", alpha=0.5)
    ax.set_xlabel("(Chosen log ratio) - (Rejected log ratio)")
    ax.set_ylabel("DPO Loss")
    ax.set_title("Effect of β on Loss Gradient")
    ax.legend()
    ax.set_xlim(x_range[0], x_range[1])
    ax.set_ylim(0, 4)

    return fig
Out[26]:
Visualization
DPO loss curves for varying beta values. Higher beta settings (e.g., 0.5) produce steeper gradients near the decision boundary (margin = 0), resulting in stronger penalties for small deviations compared to lower values like 0.1.

The original DPO paper found $\beta = 0.1$ to work well across a variety of tasks. This value provides a reasonable balance between learning preferences and maintaining coherence. The visualization reveals why: at $\beta = 0.1$, the loss curve has a moderate slope that provides clear gradient signal without being so steep that small changes in log ratios cause dramatic loss changes.

Learning Rate

DPO typically requires smaller learning rates than supervised fine-tuning.

The smaller rates are necessary because DPO directly optimizes log probability ratios, which can change rapidly with small weight updates. Unlike supervised fine-tuning where we're simply maximizing the likelihood of target tokens, DPO computes a ratio between two model evaluations. Small changes to the policy model affect both the numerator and denominator of this ratio, potentially causing the implicit reward to shift dramatically if the learning rate is too high.

In[27]:
Code
## Recommended hyperparameter ranges
hyperparameters = {
    "beta": {
        "range": "0.05 - 0.5",
        "typical": "0.1",
        "notes": "Higher = more conservative",
    },
    "learning_rate": {
        "range": "1e-6 - 5e-5",
        "typical": "5e-7 to 1e-5",
        "notes": "Lower than SFT",
    },
    "batch_size": {
        "range": "4 - 64",
        "typical": "16-32",
        "notes": "Larger batches stabilize training",
    },
    "epochs": {
        "range": "1 - 5",
        "typical": "1-3",
        "notes": "DPO converges quickly",
    },
    "warmup_ratio": {
        "range": "0.05 - 0.1",
        "typical": "0.1",
        "notes": "Gradual learning rate increase",
    },
}
Out[28]:
Console
DPO Hyperparameter Guidelines:
----------------------------------------------------------------------
beta                 | Range: 0.05 - 0.5      | Typical: 0.1
                     | Higher = more conservative

learning_rate        | Range: 1e-6 - 5e-5     | Typical: 5e-7 to 1e-5
                     | Lower than SFT

batch_size           | Range: 4 - 64          | Typical: 16-32
                     | Larger batches stabilize training

epochs               | Range: 1 - 5           | Typical: 1-3
                     | DPO converges quickly

warmup_ratio         | Range: 0.05 - 0.1      | Typical: 0.1
                     | Gradual learning rate increase

These guidelines provide a starting point for tuning. The learning rate is particularly critical; starting too high often destabilizes the implicit reward formulation, leading to poor convergence.
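
A warmup schedule is easy to add to the training loop above. The sketch below assumes transformers' get_linear_schedule_with_warmup and reuses the dataloader and policy_model defined earlier; in train_dpo, the scheduler.step() call would go immediately after optimizer.step().

Code
from transformers import get_linear_schedule_with_warmup

## Sketch: AdamW with linear warmup (assumes the dataloader and policy_model above).
num_epochs = 3
total_steps = num_epochs * len(dataloader)  # one optimizer step per batch here
warmup_steps = int(0.1 * total_steps)       # warmup_ratio = 0.1

optimizer = AdamW(policy_model.parameters(), lr=1e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=warmup_steps, num_training_steps=total_steps
)
## In the training loop, call scheduler.step() right after optimizer.step().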

Out[29]:
Visualization
Gradient magnitude profiles across different beta values. The gradient peaks at the decision boundary (margin = 0), with higher beta values producing significantly larger gradients that drive more aggressive model updates when preferences are uncertain.

Batch Size and Gradient Accumulation

DPO benefits from larger effective batch sizes because the loss depends on comparing policy and reference log probability ratios. Small batches introduce variance in these estimates, which can make training noisy and unstable. With a larger batch, the average over multiple preference pairs provides a more reliable gradient signal.

If GPU memory is limited, use gradient accumulation:

In[30]:
Code
## Effective batch size = batch_size * gradient_accumulation_steps
## Example: batch_size=4, accumulation=8 → effective batch = 32

training_config = {
    "per_device_batch_size": 4,
    "gradient_accumulation_steps": 8,
    "effective_batch_size": 4 * 8,  # = 32
}

Number of Epochs

DPO typically converges faster than you might expect. One to three epochs over the preference data is usually sufficient. Overfitting is a real concern: if accuracy reaches 100% on training data but generations become repetitive or degenerate, you've overfit.

Signs of overfitting include:

  • Training accuracy approaches 100% while validation loss increases
  • Generated text becomes repetitive or templated
  • The model loses diversity in its responses (a simple diversity check is sketched below)
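
One lightweight way to watch for the last symptom is a distinct-n statistic: the fraction of unique n-grams across a batch of generations. The sketch below, with made-up example generations, is a rough proxy rather than a rigorous diversity metric.

Code
def distinct_n(texts, n=2):
    """Fraction of unique n-grams across generated texts (rough diversity proxy)."""
    ngrams = []
    for text in texts:
        tokens = text.split()
        ngrams.extend(tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / max(len(ngrams), 1)


## Made-up generations: collapsed (repetitive) outputs score much lower.
diverse = ["the sky is blue because of scattering", "paris sits along the seine river"]
collapsed = ["i am happy to help with that", "i am happy to help with that"]
print(f"distinct-2 (diverse):   {distinct_n(diverse):.2f}")   # 1.00
print(f"distinct-2 (collapsed): {distinct_n(collapsed):.2f}")  # 0.50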

Production Considerations

Production DPO training requires additional considerations.

Memory-Efficient Implementation

For large models, computing forward passes for both policy and reference models strains memory. The TRL library from Hugging Face provides optimized implementations:

In[31]:
Code
## Using TRL for production DPO training (pseudocode - requires installation)
"""
from trl import DPOTrainer, DPOConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

## Configuration
dpo_config = DPOConfig(
    beta=0.1,
    learning_rate=5e-7,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    warmup_ratio=0.1,
    bf16=True,  # Mixed precision for efficiency
    gradient_checkpointing=True,  # Memory optimization
)

## Initialize trainer
trainer = DPOTrainer(
    model=policy_model,
    ref_model=reference_model,  # or None if using LoRA
    args=dpo_config,
    train_dataset=preference_dataset,
    tokenizer=tokenizer,
)

## Train
trainer.train()
"""

Evaluation During Training

Beyond loss and accuracy, monitor generation quality:

In[32]:
Code
def evaluate_generation_quality(
    model, tokenizer, eval_prompts, max_new_tokens=100
):
    """Generate responses to fixed prompts for qualitative evaluation."""
    model.eval()
    generations = []

    for prompt in eval_prompts:
        inputs = tokenizer(prompt, return_tensors="pt")
        inputs = {k: v.to(model.device) for k, v in inputs.items()}

        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                do_sample=True,
                temperature=0.7,
                top_p=0.9,
                pad_token_id=tokenizer.pad_token_id,
            )

        response = tokenizer.decode(outputs[0], skip_special_tokens=True)
        generations.append({"prompt": prompt, "response": response})

    return generations

Periodically generating responses to held-out prompts and reviewing them manually (or with an LLM judge) provides crucial signal about whether DPO is achieving its intended effect.

Limitations and Practical Impact

DPO has made preference-based alignment more accessible by eliminating the complexity of reward model training and reinforcement learning. However, implementation challenges remain.

Data Quality Dependencies

DPO is only as good as your preference data. Unlike RLHF where a reward model can generalize learned preferences to new situations, DPO directly optimizes for the specific comparisons in your training set. This means noisy labels, biased annotators, or unrepresentative prompt distributions will directly impact the aligned model.

In practice, data curation often matters more than algorithmic improvements. Teams investing in DPO should allocate significant effort to preference data quality: establishing clear annotation guidelines, measuring inter-annotator agreement, and filtering low-confidence examples.
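
As a concrete example of the agreement step, the sketch below computes a raw agreement rate on pairs that two annotators both labeled (the labels are made up). Chance-corrected measures such as Cohen's kappa are more informative in practice, but even raw agreement flags candidates for filtering or re-labeling.

Code
## Rough inter-annotator agreement check on doubly-annotated pairs (made-up labels).
## 1 = annotator preferred response A, 0 = preferred response B.
annotator_1 = [1, 0, 1, 1, 0, 1, 0, 1]
annotator_2 = [1, 0, 1, 0, 0, 1, 1, 1]

matches = sum(a == b for a, b in zip(annotator_1, annotator_2))
agreement = matches / len(annotator_1)
print(f"Raw agreement rate: {agreement:.2%}")  # 75.00%
## Pairs where the annotators disagree are candidates for filtering or re-labeling.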

Mode Collapse Risks

With very small $\beta$ values or prolonged training, DPO can collapse toward generating only responses very similar to the "chosen" examples in the training data. This manifests as reduced response diversity and overfitting to surface patterns in preferred responses rather than learning the underlying preference criteria.

Monitoring response diversity during training and using techniques like early stopping or label smoothing can mitigate this risk. Some practitioners also mix DPO training with continued language modeling loss to maintain general capabilities.
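
A common way to implement the last idea is to add a length-normalized NLL term on the chosen responses to the DPO loss. The sketch below is illustrative: the weight lambda_lm is an arbitrary choice rather than a value from the DPO paper, and policy_chosen_logps and chosen_response_mask are the quantities computed earlier in this chapter.

Code
def dpo_with_lm_loss(
    dpo_loss_value, policy_chosen_logps, chosen_response_mask, lambda_lm=0.05
):
    """Sketch: DPO loss plus an auxiliary language modeling term on chosen responses."""
    # Length-normalized NLL of the chosen responses: negate the summed log prob and
    # divide by the number of response tokens so long answers aren't over-penalized.
    num_tokens = chosen_response_mask.sum(dim=-1).clamp(min=1)
    nll_chosen = -(policy_chosen_logps / num_tokens).mean()
    # lambda_lm is an illustrative weight balancing preference learning and fluency.
    return dpo_loss_value + lambda_lm * nll_chosen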

Scaling Challenges

As models grow larger, the memory requirements for maintaining both policy and reference models become significant. Even with LoRA, computing forward passes through large models twice (once with adapters, once without) adds computational overhead. The DPO Variants chapter covers techniques like reference-free DPO that address some of these challenges.

Despite these limitations, DPO has become the preferred alignment method for many teams due to its simplicity and effectiveness. Aligning models with preference data using supervised learning infrastructure makes alignment research more accessible.

Key Parameters

The key parameters for DPO training are:

  • beta: Controls the strength of the KL divergence constraint (typically 0.1). Higher values keep the policy closer to the reference model.
  • learning_rate: The step size for optimization (typically 5e-7 to 1e-5). DPO requires lower rates than standard supervised fine-tuning.
  • batch_size: The number of samples processed per step. Larger batches generally stabilize the loss estimate.

Summary

This chapter translated DPO theory into practice. The key implementation components are:

Data format: DPO requires triplets of (prompt, chosen response, rejected response), with careful tokenization to identify which tokens belong to the response versus the prompt.

Loss computation: The DPO loss compares log probability ratios between the policy and reference model for chosen versus rejected responses. Proper handling of sequence masking and numerical stability is essential.

Training procedure: DPO training requires maintaining a frozen reference model (either as a separate copy or implicitly through LoRA). Standard supervised learning infrastructure handles the rest.

Hyperparameters: The $\beta$ parameter controls the KL constraint strength, with typical values around 0.1. Learning rates should be lower than standard fine-tuning, and DPO often converges in just one to three epochs.

With these components in place, you can align language models to human preferences without the complexity of reward modeling and reinforcement learning that RLHF requires. The next chapter explores variants of DPO that address specific limitations, including methods that eliminate the need for a reference model entirely.

