DPO Derivation: From RLHF Objective to Direct Optimization

Michael Brenndoerfer · January 1, 2026 · 34 min read

Derive the DPO loss function from first principles. Learn how the optimal RLHF policy leads to reward reparameterization and direct preference optimization.


DPO Derivation

In the previous chapter, we introduced Direct Preference Optimization as a simpler alternative to RLHF that eliminates the need for an explicit reward model. We saw that DPO directly optimizes a language model using preference data, but we didn't examine why this works or how the DPO loss function is derived. This chapter fills that gap.

The DPO derivation is a key result in alignment research. It shows that the optimal policy for the RLHF objective has a closed-form solution, and this solution can be rearranged to express the implicit reward in terms of the policy itself. When we substitute this reparameterized reward into the Bradley-Terry preference model, we obtain the DPO loss function directly, with no reinforcement learning required.

Understanding this derivation reveals why DPO works, its underlying assumptions, and how to interpret the model's learning process. By the end of this chapter, you'll see that DPO is essentially solving a classification problem where the model learns to assign higher probability to preferred responses.

The RLHF Objective

Let's begin with the objective that RLHF aims to optimize. As we discussed in the chapters on PPO for Language Models and KL Divergence Penalty, the goal is to find a policy $\pi$ that maximizes expected reward while staying close to a reference policy $\pi_{\text{ref}}$ (typically the supervised fine-tuned model).

Before diving into the mathematics, it helps to understand the intuition behind this objective. We want our language model to generate responses that humans prefer, which is captured by the reward function. However, if we optimize the reward too aggressively, the model might find unexpected shortcuts or produce degenerate outputs that technically achieve high reward but don't represent genuinely helpful behavior. The reference policy serves as an anchor, representing the model's pre-trained knowledge and natural language capabilities. By penalizing deviations from this anchor, we encourage the model to improve its outputs while maintaining coherent, fluent generation.

The constrained optimization problem is:

$$\max_{\pi} \mathbb{E}_{x \sim \mathcal{D}, y \sim \pi(y|x)} \left[ r(x, y) \right] - \beta \cdot D_{\text{KL}}\left( \pi(y|x) \| \pi_{\text{ref}}(y|x) \right)$$

where:

  • $\pi$: the policy being optimized
  • $\mathbb{E}$: the expectation operator
  • $x$: a prompt sampled from the data distribution $\mathcal{D}$
  • $y$: a response sampled from the policy
  • $r(x, y)$: the learned reward function
  • $\beta$: a hyperparameter controlling the strength of the KL constraint
  • $D_{\text{KL}}$: the Kullback-Leibler divergence
  • $\pi_{\text{ref}}$: the reference policy

To work with this objective more directly, we need to expand the KL divergence term. Recall that the KL divergence measures how different one probability distribution is from another, expressed as the expected log ratio of the two distributions. Writing out the KL divergence explicitly, this becomes:

$$\max_{\pi} \mathbb{E}_{x \sim \mathcal{D}, y \sim \pi(y|x)} \left[ r(x, y) - \beta \log \frac{\pi(y|x)}{\pi_{\text{ref}}(y|x)} \right]$$

where:

  • $\mathbb{E}$: the expectation operator
  • $\mathcal{D}$: the data distribution
  • $r(x, y)$: the reward function
  • $\beta$: the KL penalty coefficient
  • $\log$: the natural logarithm
  • $\pi(y|x)$: probability of response $y$ given prompt $x$ under the policy
  • $\pi_{\text{ref}}(y|x)$: probability of response $y$ given prompt $x$ under the reference model

This reformulation reveals the objective's structure more clearly. The log ratio $\log \frac{\pi(y|x)}{\pi_{\text{ref}}(y|x)}$ measures how much the policy has shifted away from the reference for a particular response. When this ratio is positive (the policy assigns more probability than the reference), the KL term subtracts from the objective, penalizing the deviation. When the ratio is negative (the policy assigns less probability), the term adds to the objective, which might seem like a reward, but since we're sampling from $\pi$, we're unlikely to generate responses where $\pi$ assigns low probability.

This objective captures a fundamental tension in alignment: we want the model to produce high-reward outputs (as judged by human preferences), but we don't want it to deviate too far from its pre-trained behavior. Without the KL penalty, the model might find degenerate solutions that exploit flaws in the reward model, a phenomenon we explored in the chapter on Reward Hacking.
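Before moving on, it can help to see what this objective computes for individual samples. The following minimal sketch evaluates the per-sample quantity $r(x, y) - \beta \log \frac{\pi(y|x)}{\pi_{\text{ref}}(y|x)}$ and averages it as a Monte Carlo estimate of the expectation; the rewards and log-probabilities are made-up numbers for illustration, not outputs of a real model.

import torch

# Hypothetical rewards and log-probabilities for three sampled responses (made-up numbers).
rewards = torch.tensor([1.2, 0.4, -0.3])          # r(x, y)
logp_policy = torch.tensor([-12.0, -9.5, -11.0])  # log pi(y|x)
logp_ref = torch.tensor([-12.5, -9.0, -10.8])     # log pi_ref(y|x)
beta = 0.1

# Per-sample RLHF objective: reward minus the scaled log-ratio penalty.
per_sample = rewards - beta * (logp_policy - logp_ref)

print(per_sample)         # higher is better for the policy
print(per_sample.mean())  # Monte Carlo estimate of the expectation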

Deriving the Optimal Policy

The key insight behind DPO is that this optimization problem has a closed-form solution. Unlike many optimization problems in machine learning that require iterative gradient descent, this particular formulation admits an analytical answer. For a fixed prompt $x$, we're optimizing over the distribution $\pi(y|x)$ for all possible responses $y$. This is a constrained optimization problem over probability distributions: we must ensure our solution is a valid probability distribution that sums to one and assigns non-negative probability to every possible response.

The special structure of the RLHF objective allows for this closed-form solution. The expectation under $\pi$ combined with the KL divergence creates what's known as a variational problem over distributions. Such problems often have elegant solutions when the constraints are simple probability simplex constraints.

Let's work through the derivation step by step. For a single prompt $x$, the objective is:

$$\max_{\pi(\cdot|x)} \sum_{y} \pi(y|x) \left[ r(x, y) - \beta \log \frac{\pi(y|x)}{\pi_{\text{ref}}(y|x)} \right]$$

where:

  • $\sum_{y}$: sum over all possible responses in the vocabulary
  • $\pi(\cdot|x)$: the entire probability distribution over responses for prompt $x$
  • $\pi(y|x)$: probability of response $y$ given prompt $x$
  • $r(x, y)$: the reward function
  • $\beta$: the KL penalty coefficient
  • $\log$: the natural logarithm
  • $\pi_{\text{ref}}(y|x)$: the reference policy probability

We need to maximize this subject to the constraint that $\pi(\cdot|x)$ is a valid probability distribution: $\sum_y \pi(y|x) = 1$ and $\pi(y|x) \geq 0$ for all $y$.

To solve this constrained optimization problem, we use the method of Lagrange multipliers, a classical technique from calculus. The idea is to incorporate the constraint directly into the objective by introducing a new variable (the Lagrange multiplier) that penalizes violations of the constraint. At the optimal solution, the gradient of the objective with respect to the decision variables must be proportional to the gradient of the constraint.

The Lagrangian is:

$$\mathcal{L} = \sum_{y} \pi(y|x) \left[ r(x, y) - \beta \log \pi(y|x) + \beta \log \pi_{\text{ref}}(y|x) \right] + \lambda \left( 1 - \sum_y \pi(y|x) \right)$$

where:

  • $\mathcal{L}$: the Lagrangian function
  • $\sum_{y}$: sum over all responses
  • $\pi(y|x)$: probability of response $y$
  • $r(x, y)$: the reward function
  • $\beta$: the KL penalty coefficient
  • $\log$: the natural logarithm
  • $\pi_{\text{ref}}(y|x)$: the reference policy probability
  • $\lambda$: the Lagrange multiplier for the constraint that probabilities sum to 1

Notice that we've expanded the log ratio term from the original objective, separating it into $-\beta \log \pi(y|x)$ and $+\beta \log \pi_{\text{ref}}(y|x)$. This separation makes taking derivatives more straightforward. The term $\lambda(1 - \sum_y \pi(y|x))$ enforces our normalization constraint: if the probabilities don't sum to one, this term will be nonzero, and the Lagrange multiplier $\lambda$ will adjust to push us toward a valid distribution.

Taking the derivative with respect to $\pi(y|x)$ for a specific $y$:

$$\frac{\partial \mathcal{L}}{\partial \pi(y|x)} = r(x, y) - \beta \log \pi(y|x) - \beta + \beta \log \pi_{\text{ref}}(y|x) - \lambda = 0$$

where:

  • $\frac{\partial \mathcal{L}}{\partial \pi(y|x)}$: the partial derivative of the Lagrangian with respect to the probability of response $y$
  • $r(x, y)$: the reward function
  • $\beta$: the KL penalty coefficient
  • $\log$: the natural logarithm
  • $\pi(y|x)$: the policy probability
  • $\pi_{\text{ref}}(y|x)$: the reference policy probability
  • $\lambda$: the Lagrange multiplier

The $-\beta$ term comes from differentiating $\pi \log \pi$, which gives $\log \pi + 1$. This is a standard result from calculus: when we differentiate $\pi \log \pi$ with respect to $\pi$, we apply the product rule, obtaining $\log \pi + \pi \cdot (1/\pi) = \log \pi + 1$. Setting this derivative equal to zero gives us the first-order optimality conditions.
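As a quick sanity check on that product-rule step, a finite-difference comparison at an arbitrary test point (a small sketch in plain Python) confirms that the derivative of $\pi \log \pi$ is $\log \pi + 1$.

import math

pi = 0.3      # arbitrary test point in (0, 1)
eps = 1e-6    # finite-difference step size

f = lambda p: p * math.log(p)
numeric = (f(pi + eps) - f(pi - eps)) / (2 * eps)  # central-difference approximation
analytic = math.log(pi) + 1

print(numeric, analytic)  # the two values agree to several decimal places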

Solving for $\log \pi(y|x)$:

$$\begin{aligned} \beta \log \pi(y|x) &= r(x, y) + \beta \log \pi_{\text{ref}}(y|x) - \beta - \lambda \\ \log \pi(y|x) &= \frac{1}{\beta} r(x, y) + \log \pi_{\text{ref}}(y|x) - 1 - \frac{\lambda}{\beta} \end{aligned}$$

where:

  • $\log \pi(y|x)$: the log-probability of response $y$
  • $r(x, y)$: the reward function
  • $\beta$: the KL penalty coefficient
  • $\pi_{\text{ref}}(y|x)$: the reference policy probability
  • $\lambda/\beta$: constant terms related to the normalization constraint

This expression for the log-probability has an illuminating structure. The log-probability under our optimal policy equals the log-probability under the reference, adjusted by the scaled reward $r(x,y)/\beta$, plus some constants that don't depend on $y$. Higher-reward responses get higher log-probabilities, with $\beta$ controlling how much the reward matters relative to staying close to the reference.

Exponentiating both sides:

$$\pi(y|x) = \pi_{\text{ref}}(y|x) \cdot \exp\left( \frac{r(x, y)}{\beta} \right) \cdot \exp\left( -1 - \frac{\lambda}{\beta} \right)$$

where:

  • $\pi(y|x)$: the policy probability
  • $\pi_{\text{ref}}(y|x)$: the reference policy probability
  • $\exp(\cdot)$: the exponential function
  • $\exp(r(x, y)/\beta)$: the reward-scaling term
  • $r(x, y)$: the reward function
  • $\beta$: the KL penalty coefficient
  • $\exp(-1 - \lambda/\beta)$: the normalization term, constant across all $y$

The term $\exp(-1 - \lambda/\beta)$ is just a normalizing constant that ensures the distribution sums to 1. Crucially, this term doesn't depend on $y$; it only depends on $x$ through the constraint that all probabilities must sum to one. We can absorb it into a partition function $Z(x)$:

$$\pi^*(y|x) = \frac{1}{Z(x)} \pi_{\text{ref}}(y|x) \exp\left( \frac{r(x, y)}{\beta} \right)$$

where:

  • $\pi^*(y|x)$: the optimal policy distribution
  • $Z(x) = \sum_{y'} \pi_{\text{ref}}(y'|x) \exp\left( \frac{r(x, y')}{\beta} \right)$: the partition function (normalizing constant) for prompt $x$
  • $\pi_{\text{ref}}(y|x)$: the reference policy distribution
  • $r(x, y)$: the reward function
  • $\beta$: the temperature parameter

This is the optimal policy for the RLHF objective. It has a clear interpretation: the optimal policy takes the reference distribution, reweights each response by the exponentiated reward, then normalizes. The exponential function ensures all probabilities are positive, naturally satisfying the $\pi(y|x) \ge 0$ constraint. Responses with higher reward get exponentially more probability mass, with $\beta$ controlling how aggressively we reweight.

Consider what happens at extreme values of $\beta$ to see how this reweighting works. When $\beta$ is very large, the reward term $r(x,y)/\beta$ becomes small, and $\exp(r(x,y)/\beta) \approx 1$ for all responses. In this limit, the optimal policy stays very close to the reference distribution, barely adjusting for reward at all. Conversely, when $\beta$ is very small, the exponential amplifies small reward differences into massive probability shifts, concentrating all probability mass on the highest-reward response. The choice of $\beta$ therefore controls the exploration-exploitation trade-off: larger values favor staying close to the reference (exploration), while smaller values favor chasing high reward (exploitation).
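The sketch below makes this reweighting concrete for a toy vocabulary of four responses. The rewards and the reference distribution are hand-picked assumptions for illustration; the code simply applies the closed-form solution $\pi^*(y|x) \propto \pi_{\text{ref}}(y|x)\exp(r(x,y)/\beta)$ for several values of $\beta$.

import torch

# Toy setup: four candidate responses (A-D) with hypothetical rewards and reference probabilities.
rewards = torch.tensor([0.1, 0.4, 0.7, 1.0])     # r(x, y)
pi_ref = torch.tensor([0.40, 0.30, 0.20, 0.10])  # reference distribution over A-D

for beta in [0.2, 0.5, 1.0, 2.0]:
    # Optimal policy: reference reweighted by exp(r / beta), then normalized by Z(x).
    unnormalized = pi_ref * torch.exp(rewards / beta)
    pi_star = unnormalized / unnormalized.sum()
    print(f"beta={beta}: {[round(p, 3) for p in pi_star.tolist()]}")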

Out[2]:
Figure: Distribution of response probabilities under the optimal policy for different temperature values ($\beta$). Lower $\beta$ values (0.2) concentrate probability on the highest-reward response (D), while higher values (2.0) preserve the shape of the reference distribution, illustrating how the temperature parameter manages the trade-off between reward optimization and the divergence penalty.
Boltzmann Distribution

The optimal policy has the form of a Boltzmann (or Gibbs) distribution from statistical mechanics, where $r(x,y)$ plays the role of negative energy and $\beta$ plays the role of temperature. Lower temperature (smaller $\beta$) concentrates probability on the highest-reward responses.

Reparameterizing the Reward

Here's where the DPO derivation becomes clever. We've derived that the optimal policy has a specific relationship to the reward function. But we can also go backwards: given an optimal policy, we can solve for what reward function it implies.

This reverse direction might seem like a mathematical curiosity, but it turns out to be the key insight that makes DPO possible. If we can express the reward purely in terms of policies (without needing to train a separate reward model), then we can substitute this expression into the preference model and optimize directly. The reward becomes implicit in the policy rather than explicit in a separate neural network.

Starting from:

$$\pi^*(y|x) = \frac{1}{Z(x)} \pi_{\text{ref}}(y|x) \exp\left( \frac{r(x, y)}{\beta} \right)$$

where:

  • $\pi^*(y|x)$: the optimal policy distribution
  • $Z(x)$: the partition function
  • $\pi_{\text{ref}}(y|x)$: the reference policy distribution
  • $r(x, y)$: the reward function
  • $\beta$: the temperature parameter

We can rearrange to isolate the reward. First, take the log of both sides:

$$\log \pi^*(y|x) = \log \pi_{\text{ref}}(y|x) + \frac{r(x, y)}{\beta} - \log Z(x)$$

where:

  • $\log \pi^*$: log-probability under the optimal policy
  • $\log \pi_{\text{ref}}(y|x)$: log-probability under the reference policy
  • $r(x, y)$: the reward function
  • $\beta$: the KL penalty coefficient
  • $\log Z(x)$: log of the partition function (depends only on $x$)

Taking the logarithm transforms our multiplicative relationship into an additive one, making algebraic manipulation simpler. The logarithm of a product becomes a sum of logarithms, and the logarithm of the exponential simply returns its argument.

Now, we rearrange terms to solve for $r(x, y)$:

$$\begin{aligned} \frac{r(x, y)}{\beta} &= \log \pi^*(y|x) - \log \pi_{\text{ref}}(y|x) + \log Z(x) && \text{(isolate reward term)} \\ \frac{r(x, y)}{\beta} &= \log \frac{\pi^*(y|x)}{\pi_{\text{ref}}(y|x)} + \log Z(x) && \text{(combine logs)} \\ r(x, y) &= \beta \log \frac{\pi^*(y|x)}{\pi_{\text{ref}}(y|x)} + \beta \log Z(x) && \text{(multiply by } \beta \text{)} \end{aligned}$$

where:

  • $r(x, y)$: the implicit reward
  • $\beta$: the KL penalty coefficient
  • $\pi^*(y|x)$: the optimal policy
  • $\pi_{\text{ref}}(y|x)$: the reference policy
  • $Z(x)$: the partition function
  • $\beta \log (\cdot)$: the scaled log-ratio term

This is the reward reparameterization. It tells us that for any optimal policy $\pi^*$, we can express the implicit reward purely in terms of log probability ratios, plus a prompt-dependent constant $\beta \log Z(x)$.

The log ratio $\log \frac{\pi^*(y|x)}{\pi_{\text{ref}}(y|x)}$ has a natural interpretation: it measures how much the optimal policy has increased or decreased the probability of response $y$ compared to the reference. Responses that the optimal policy strongly prefers will have large positive log ratios, while responses it disfavors will have large negative log ratios. This log ratio, scaled by $\beta$, gives us the implicit reward up to an additive constant.

The partition function $Z(x)$ doesn't depend on $y$; it only depends on the prompt $x$. This will turn out to be crucial because when we compute preference probabilities, this term cancels out.
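The following sketch illustrates both points with made-up log-probabilities and an arbitrary placeholder value for $\log Z(x)$ (which is intractable to compute in practice): the implicit reward is only determined up to the additive term $\beta \log Z(x)$, but because that term is shared by every response to the same prompt, it vanishes from any reward difference.

import torch

beta = 0.1
log_Z = 2.3  # placeholder value for log Z(x); the real sum over all responses is intractable

# Made-up log-probabilities for two responses under the optimal policy and the reference.
logp_star = torch.tensor([-42.1, -39.5])
logp_ref = torch.tensor([-45.2, -38.7])

# Reparameterized reward: beta * log-ratio, plus the prompt-dependent constant.
implicit = beta * (logp_star - logp_ref)
full_reward = implicit + beta * log_Z

# The constant shifts both rewards equally, so their difference is unchanged.
print(full_reward[0] - full_reward[1])  # 0.39
print(implicit[0] - implicit[1])        # 0.39 as well: the Z(x) term has cancelled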

The DPO Loss Function

Now we can derive the DPO loss by substituting our reward reparameterization into the Bradley-Terry preference model. Recall from the chapter on the Bradley-Terry Model that the probability of preferring response $y_w$ over response $y_l$ is:

$$p(y_w \succ y_l | x) = \sigma\left( r(x, y_w) - r(x, y_l) \right)$$

where:

  • $p(y_w \succ y_l | x)$: probability that response $y_w$ is preferred over $y_l$
  • $\sigma(z) = \frac{1}{1 + e^{-z}}$: the sigmoid function mapping values to $(0, 1)$
  • $r(x, y_w)$: reward for the winning response
  • $r(x, y_l)$: reward for the losing response

The Bradley-Terry model is elegant in its simplicity: the probability of preferring one response over another depends only on the difference in their rewards, passed through a sigmoid function. This means responses with much higher reward are strongly preferred, while responses with similar rewards have preference probabilities close to 0.5.
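As a quick numerical illustration of this behavior (with arbitrary reward values, not learned ones), the Bradley-Terry probability is just a sigmoid of the reward difference:

import torch

def bradley_terry(r_w: float, r_l: float) -> float:
    """Probability that the response with reward r_w is preferred over the one with reward r_l."""
    return torch.sigmoid(torch.tensor(r_w - r_l)).item()

print(bradley_terry(2.0, 0.5))  # large margin -> strong preference (about 0.82)
print(bradley_terry(1.0, 0.9))  # similar rewards -> close to 0.5 (about 0.52)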

Substituting our reparameterized reward:

$$p(y_w \succ y_l | x) = \sigma\left( \left[ \beta \log \frac{\pi^*(y_w|x)}{\pi_{\text{ref}}(y_w|x)} + \beta \log Z(x) \right] - \left[ \beta \log \frac{\pi^*(y_l|x)}{\pi_{\text{ref}}(y_l|x)} + \beta \log Z(x) \right] \right)$$

where:

  • $\sigma$: the sigmoid function
  • $\beta$: the temperature parameter
  • $\pi^*$: the optimal policy
  • $\pi_{\text{ref}}$: the reference policy
  • $Z(x)$: the partition function terms, which appear with opposite signs

Notice that the $\beta \log Z(x)$ terms cancel! This cancellation is not a coincidence; it's a direct consequence of using reward differences in the Bradley-Terry model. The partition function contributes equally to both rewards, so when we subtract them, it disappears. This cancellation is essential for the practical success of DPO because computing $Z(x)$ would require summing over all possible responses, which is computationally intractable for language models with exponentially large output spaces.

This leaves:

$$p(y_w \succ y_l | x) = \sigma\left( \beta \log \frac{\pi^*(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi^*(y_l|x)}{\pi_{\text{ref}}(y_l|x)} \right)$$

where:

  • $\sigma$: the sigmoid function
  • $\beta$: the temperature parameter
  • $\pi^*$: the optimal policy
  • $\pi_{\text{ref}}$: the reference policy

We can factor $\beta$ out of the difference:

$$p(y_w \succ y_l | x) = \sigma\left( \beta \left[ \log \frac{\pi^*(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \log \frac{\pi^*(y_l|x)}{\pi_{\text{ref}}(y_l|x)} \right] \right)$$

where:

  • $\sigma$: the sigmoid function
  • $\beta$: the KL penalty coefficient
  • $\log \frac{\pi^*(y|x)}{\pi_{\text{ref}}(y|x)}$: the log-likelihood ratio of the optimal policy vs. the reference

This expression is remarkably clean. The probability of the correct preference depends entirely on the difference between two log ratios: how much the optimal policy prefers the winning response (relative to the reference) versus how much it prefers the losing response. When this difference is large and positive, the sigmoid outputs a value close to 1, indicating strong confidence in the correct preference.

This is the probability that the optimal policy assigns to the correct preference. To train a policy $\pi_\theta$ to match this optimal policy, we maximize the log-likelihood of the observed preferences:

$$\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\left( \beta \left[ \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)} \right] \right) \right]$$

where:

  • $\mathcal{L}_{\text{DPO}}$: the DPO loss function
  • $\pi_\theta$: the policy network being trained (parameterized by $\theta$)
  • $\pi_{\text{ref}}$: the reference policy
  • $\mathcal{D}$: the dataset of preference pairs $(x, y_w, y_l)$
  • $\mathbb{E}$: expectation over the dataset
  • $\beta$: the KL penalty coefficient
  • $\sigma$: the sigmoid function

The negative sign appears because we're minimizing a loss rather than maximizing likelihood.

To make the notation more compact, define the scaled log-ratio for a response $y$:

$$\hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)}$$

where:

  • $\hat{r}_\theta(x, y)$: the implicit reward assigned by the current model $\pi_\theta$
  • $\beta$: the KL penalty coefficient scaling the reward
  • $\pi_\theta(y|x)$: the policy probability
  • $\pi_{\text{ref}}(y|x)$: the reference policy probability
  • $\log \frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)}$: the log-ratio of the policy probability to the reference probability

This is sometimes called the "implicit reward" because it's what the reward would be if $\pi_\theta$ were optimal. The terminology is fitting: we never explicitly compute or learn a reward function, yet the policy implicitly defines one through its log-probability ratios with the reference model.

The DPO loss becomes:

$$\mathcal{L}_{\text{DPO}} = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\left( \hat{r}_\theta(x, y_w) - \hat{r}_\theta(x, y_l) \right) \right]$$

where:

  • $\mathcal{L}_{\text{DPO}}$: the DPO loss function
  • $\hat{r}_\theta(x, y)$: the implicit reward function
  • $\hat{r}_\theta(x, y_w) - \hat{r}_\theta(x, y_l)$: the margin between implicit rewards for winning and losing responses
  • $\sigma$: the sigmoid function
  • $\mathbb{E}$: expectation over the dataset

This is remarkably simple. The loss encourages the model to increase its log probability (relative to the reference) on preferred responses and decrease it on dispreferred responses.

DPO as Classification

The DPO loss has a revealing interpretation: it's essentially binary classification with a specific parameterization. Consider a standard binary cross-entropy loss for classifying which of two responses is preferred:

$$\mathcal{L}_{\text{BCE}} = -\mathbb{E} \left[ \log \sigma\big(f(x, y_w, y_l)\big) \right]$$

where:

  • $\mathcal{L}_{\text{BCE}}$: the binary cross-entropy loss
  • $f(x, y_w, y_l)$: a scoring function indicating preference strength
  • $\sigma$: the sigmoid function
  • $\mathbb{E}$: expectation over the dataset

In this context, $f$ is some function that should output a positive value when $y_w$ is truly preferred. In DPO, this function is:

$$f(x, y_w, y_l) = \beta \left[ \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)} \right]$$

where:

  • $f(x, y_w, y_l)$: the logit fed into the sigmoid classifier
  • $\beta$: the KL penalty coefficient
  • $\pi_\theta$: the policy network
  • $\pi_{\text{ref}}$: the reference policy

The model is learning to classify preference pairs by adjusting its output probabilities. When the loss is low, the model assigns higher relative probability to the preferred response.

This classification view explains why DPO is so stable compared to RLHF. Instead of learning a separate reward model and then using policy gradients with their high variance, DPO directly optimizes the likelihood of preference labels. The gradients flow directly through the model's log probabilities, just like in standard language model training.

The gradient of the DPO loss with respect to $\theta$ has a particularly intuitive form. For a single example:

$$\nabla_\theta \mathcal{L}_{\text{DPO}} = -\beta \, \sigma\big( -\hat{r}_\theta(x, y_w) + \hat{r}_\theta(x, y_l) \big) \left[ \nabla_\theta \log \pi_\theta(y_w|x) - \nabla_\theta \log \pi_\theta(y_l|x) \right]$$

where:

  • $\nabla_\theta$: gradient with respect to model parameters $\theta$
  • $\mathcal{L}_{\text{DPO}}$: the DPO loss
  • $\sigma$: the sigmoid function
  • $\hat{r}_\theta$: the implicit reward function
  • $\sigma(\dots)$: the weighting term derived from the sigmoid derivative (the probability of the wrong preference)
  • $\nabla_\theta \log \pi_\theta(\cdot|x)$: the gradient direction in parameter space that increases a response's log-probability
  • $\beta$: the KL penalty coefficient

This gradient expression reveals the geometry of DPO optimization. The term $\nabla_\theta \log \pi_\theta(y_w|x) - \nabla_\theta \log \pi_\theta(y_l|x)$ points in the direction that increases the probability of the winning response while decreasing the probability of the losing response. This is the natural direction to push given our objective.

The term $\sigma(-\hat{r}_\theta(x, y_w) + \hat{r}_\theta(x, y_l))$ is the probability that the model currently assigns to the wrong preference. This acts as an implicit weighting:

  • When the model is confident and correct (low wrong-preference probability), the gradient is small
  • When the model is confident but wrong, the gradient is large
  • When the model is uncertain, the gradient is moderate

This automatic weighting helps the model focus on examples it's getting wrong, similar to how hard negative mining works in contrastive learning.
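To see this weighting fall out of the loss itself, the sketch below treats the two implicit rewards as trainable scalars (an illustration only; in DPO they are functions of the model's log-probabilities, and the chain rule through the $\beta$-scaled log-ratio contributes the extra factor of $\beta$) and compares PyTorch autograd's gradients with the analytic weight $\sigma(\hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w))$.

import torch
import torch.nn.functional as F

# Implicit rewards from the worked example, treated as trainable scalars for illustration.
r_w = torch.tensor(0.31, requires_grad=True)   # chosen response
r_l = torch.tensor(-0.08, requires_grad=True)  # rejected response

loss = -F.logsigmoid(r_w - r_l)
loss.backward()

# Analytic weighting: the probability currently assigned to the wrong preference.
wrong_pref_prob = torch.sigmoid(r_l - r_w)

print(r_w.grad)         # -sigma(r_l - r_w): push the chosen reward up
print(r_l.grad)         # +sigma(r_l - r_w): push the rejected reward down
print(wrong_pref_prob)  # about 0.404, matching the gradient magnitude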

Out[3]:
Figure: The DPO gradient weighting function (the probability of the wrong preference) across implicit reward differences. Large weights occur when the model incorrectly favors the rejected response (negative difference), triggering strong updates; small weights occur when the model correctly favors the chosen response (positive difference), stabilizing training.

Worked Example

Let's trace through the derivation with concrete numbers to build intuition. Suppose we have a prompt $x$ = "Write a haiku about autumn" and two responses:

  • $y_w$ (preferred): "Crimson leaves descend / Dancing on October wind / Earth prepares to sleep"
  • $y_l$ (dispreferred): "Leaves fall down in fall / The weather gets cold outside / I like pumpkin spice"

Assume our reference model $\pi_{\text{ref}}$ assigns:

  • $\log \pi_{\text{ref}}(y_w|x) = -45.2$ (log probability of generating the preferred response)
  • $\log \pi_{\text{ref}}(y_l|x) = -38.7$ (log probability of generating the dispreferred response)

The reference model actually assigns higher probability to the dispreferred response because it's simpler and uses more common words.

Now suppose our current policy $\pi_\theta$ assigns:

  • $\log \pi_\theta(y_w|x) = -42.1$
  • $\log \pi_\theta(y_l|x) = -39.5$

With $\beta = 0.1$, the implicit rewards are:

$$\begin{aligned} \hat{r}_\theta(x, y_w) &= 0.1 \times (-42.1 - (-45.2)) \\ &= 0.1 \times 3.1 \\ &= 0.31 \end{aligned}$$

$$\begin{aligned} \hat{r}_\theta(x, y_l) &= 0.1 \times (-39.5 - (-38.7)) \\ &= 0.1 \times (-0.8) \\ &= -0.08 \end{aligned}$$

The difference is:

$$\begin{aligned} \hat{r}_\theta(x, y_w) - \hat{r}_\theta(x, y_l) &= 0.31 - (-0.08) \\ &= 0.39 \end{aligned}$$

The loss contribution from this example:

$$\begin{aligned} -\log \sigma(0.39) &= -\log(0.596) \\ &= 0.517 \end{aligned}$$

The model is doing okay on this example (the probability of the correct preference is about 60%), but there's room to improve. The gradient will push the model to further increase $\pi_\theta(y_w|x)$ relative to the reference and decrease $\pi_\theta(y_l|x)$ relative to the reference.

Out[4]:
Figure: Step-by-step components of the DPO calculation. The policy model shifts log probabilities toward the preferred response compared to the reference, yielding a positive log-ratio for the preferred response and a negative one for the rejected. The scaled implicit rewards show a positive margin of 0.39, indicating the model correctly predicts the preference.

Code Implementation

Let's implement the DPO loss function and verify our derivation with code.

In[5]:
Code
import torch
import torch.nn.functional as F


def dpo_loss(
    policy_chosen_logps: torch.Tensor,
    policy_rejected_logps: torch.Tensor,
    ref_chosen_logps: torch.Tensor,
    ref_rejected_logps: torch.Tensor,
    beta: float = 0.1,
) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
    """
    Compute the DPO loss for a batch of preference pairs.

    Args:
        policy_chosen_logps: Log probs of chosen responses under policy [batch_size]
        policy_rejected_logps: Log probs of rejected responses under policy [batch_size]
        ref_chosen_logps: Log probs of chosen responses under reference [batch_size]
        ref_rejected_logps: Log probs of rejected responses under reference [batch_size]
        beta: Temperature parameter controlling deviation from reference

    Returns:
        loss: Scalar DPO loss
        chosen_rewards: Implicit rewards for chosen responses [batch_size]
        rejected_rewards: Implicit rewards for rejected responses [batch_size]
    """
    # Compute implicit rewards (log-ratios scaled by beta)
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # DPO loss: negative log-sigmoid of reward difference
    logits = chosen_rewards - rejected_rewards
    loss = -F.logsigmoid(logits).mean()

    return loss, chosen_rewards, rejected_rewards

Now let's verify with our worked example:

In[6]:
Code
# Values from our worked example
policy_chosen_logps = torch.tensor([-42.1])
policy_rejected_logps = torch.tensor([-39.5])
ref_chosen_logps = torch.tensor([-45.2])
ref_rejected_logps = torch.tensor([-38.7])

loss, chosen_rewards, rejected_rewards = dpo_loss(
    policy_chosen_logps,
    policy_rejected_logps,
    ref_chosen_logps,
    ref_rejected_logps,
    beta=0.1,
)

print(f"Chosen implicit reward: {chosen_rewards.item():.3f}")
print(f"Rejected implicit reward: {rejected_rewards.item():.3f}")
print(f"Reward difference: {(chosen_rewards - rejected_rewards).item():.3f}")
print(f"Preference probability: {torch.sigmoid(chosen_rewards - rejected_rewards).item():.3f}")
print(f"DPO loss: {loss.item():.3f}")
Out[7]:
Console
Chosen implicit reward: 0.310
Rejected implicit reward: -0.080
Reward difference: 0.390
Preference probability: 0.596
DPO loss: 0.517

The implicit rewards align with our derivation: the chosen response has a positive reward (0.310), while the rejected response has a negative reward (-0.080). The positive difference (0.390) results in a preference probability of 0.596. This indicates the model correctly prefers the chosen response, though the probability is only moderately high.

Let's also visualize how the loss and preference probability change as the model learns:

Out[8]:
Figure: DPO loss and preference probability as functions of the implicit reward difference. The sigmoid preference probability increases as the reward margin grows, and the DPO loss shrinks when the margin is large and positive, encouraging the model to assign higher implicit rewards to chosen responses.

The loss approaches zero as the reward difference increases, meaning the model strongly prefers chosen responses. Conversely, when the model prefers rejected responses (negative reward difference), the loss grows large.

Let's also implement a function to compute the gradients and see how the implicit weighting works:

In[9]:
Code
def dpo_loss_with_grad_weights(
    policy_chosen_logps: torch.Tensor,
    policy_rejected_logps: torch.Tensor,
    ref_chosen_logps: torch.Tensor,
    ref_rejected_logps: torch.Tensor,
    beta: float = 0.1,
) -> tuple[torch.Tensor, torch.Tensor]:
    """
    Compute DPO loss and the implicit gradient weights.

    The gradient weight is sigma(-reward_diff), which is the probability
    the model assigns to the WRONG preference.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    logits = chosen_rewards - rejected_rewards
    loss = -F.logsigmoid(logits).mean()

    # Gradient weight: probability of wrong preference
    grad_weight = torch.sigmoid(-logits)

    return loss, grad_weight
In[10]:
Code
# Test with different scenarios
scenarios = [
    ("Model strongly correct", -35.0, -45.0, -45.2, -38.7),
    ("Model weakly correct", -42.1, -39.5, -45.2, -38.7),
    ("Model uncertain", -41.0, -41.5, -45.2, -38.7),
    ("Model weakly wrong", -44.0, -38.0, -45.2, -38.7),
    ("Model strongly wrong", -50.0, -35.0, -45.2, -38.7),
]

results = []
for name, pc, pr, rc, rr in scenarios:
    loss, grad_weight = dpo_loss_with_grad_weights(
        torch.tensor([pc]),
        torch.tensor([pr]),
        torch.tensor([rc]),
        torch.tensor([rr]),
        beta=0.1,
    )
    results.append((name, loss.item(), grad_weight.item()))

print(f"{'Scenario':<26}{'Loss':>8}{'Grad Weight':>15}")
print("-" * 50)
for name, loss_val, weight in results:
    print(f"{name:<26}{loss_val:>8.3f}{weight:>15.3f}")
Out[11]:
Console
Scenario                    Loss    Grad Weight
--------------------------------------------------
Model strongly correct      0.176      0.161
Model weakly correct        0.517      0.404
Model uncertain             0.403      0.332
Model weakly wrong          0.668      0.488
Model strongly wrong        1.206      0.701

Notice how the gradient weight automatically adjusts based on how wrong the model is. In the "Model strongly correct" case, the gradient weight is small (0.161), so there is little left to learn from this example. Conversely, when the model is "Strongly wrong", the gradient weight is much larger (0.701), pushing hard to correct the mistake.

Out[12]:
Figure: Comparison of DPO loss and gradient weights across five model-performance scenarios. Both are minimal when the model is strongly correct and increase as its predictions worsen, with strongly wrong predictions triggering the largest gradient updates.

Key Parameters

The key parameter for DPO is:

  • beta: The temperature parameter (often denoted $\beta$) that scales the log-ratio of the policy and reference probabilities. It controls the strength of the KL divergence penalty, with larger values keeping the policy closer to the reference model; the sketch below shows this scaling for a fixed pair of log-ratios.
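A minimal sketch of that scaling, assuming a fixed pair of hypothetical log-ratios:

import torch

# Fixed hypothetical log-ratios for a chosen and a rejected response.
log_ratio_chosen = 3.1     # log pi_theta(y_w|x) - log pi_ref(y_w|x)
log_ratio_rejected = -0.8  # log pi_theta(y_l|x) - log pi_ref(y_l|x)

for beta in [0.01, 0.1, 0.5, 1.0]:
    margin = beta * (log_ratio_chosen - log_ratio_rejected)  # implicit reward margin
    pref_prob = torch.sigmoid(torch.tensor(margin)).item()   # Bradley-Terry confidence
    print(f"beta={beta}: margin={margin:.2f}, preference probability={pref_prob:.3f}")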
Out[13]:
Figure: Impact of the temperature parameter $\beta$ on implicit rewards and preference confidence. For fixed log-probability ratios, increasing $\beta$ linearly scales the reward margin, which drives the preference probability toward 1.0 or 0.0, effectively controlling the strength of the preference signal.

Assumptions and Validity

The DPO derivation makes several assumptions worth examining:

  • Perfect reward model: The Bradley-Terry model with the true reward function correctly captures human preferences. In practice, human preferences are noisy, inconsistent, and context-dependent. The DPO loss treats all preference labels as ground truth, which can be problematic when labels are unreliable.

  • Boltzmann-form optimal policy: The optimal policy $\pi^*$ exists and has the Boltzmann form derived earlier. This is guaranteed for the specific RLHF objective we started with, but different formulations (such as using a different divergence measure) would yield different optimal policies and potentially different direct alignment algorithms.

  • Model capacity: DPO optimizes toward the optimal policy but doesn't guarantee reaching it. The policy $\pi_\theta$ is constrained by the model's architecture and capacity. A small model may not be able to represent the true optimal policy, in which case DPO finds the best approximation within the model class.

  • Reference model availability: DPO requires that we can compute $\pi_{\text{ref}}(y|x)$ exactly for the same responses we're evaluating under $\pi_\theta$. This is straightforward when both are the same model architecture, but becomes complex if the reference model is unavailable or uses different tokenization. The next chapter on DPO Implementation will address these practical considerations in detail.

Summary

This chapter derived the DPO loss function from first principles, showing how it emerges naturally from the RLHF objective.

The key steps in the derivation were:

  1. RLHF objective: Maximize expected reward with a KL penalty to stay close to the reference policy
  2. Optimal policy: Using Lagrange multipliers, we found that the optimal policy is a Boltzmann distribution that reweights the reference by exponentiated rewards
  3. Reward reparameterization: Inverting this relationship, we expressed the implicit reward as a log-ratio between the optimal and reference policies, plus a partition function
  4. Bradley-Terry substitution: Plugging the reparameterized rewards into the preference model, the partition functions cancel, leaving a loss in terms of log-ratios only
  5. Classification view: The resulting DPO loss is binary cross-entropy, learning to classify which response is preferred based on relative log-probabilities

The gradient of the DPO loss has an automatic weighting scheme: examples where the model is confidently wrong receive larger gradients, while examples where it's already correct receive smaller gradients. This makes DPO training stable and efficient.

In the next chapter, we'll implement DPO end-to-end, covering practical details like computing sequence log-probabilities, handling padding, and integrating with Hugging Face's training infrastructure.

