Derive the DPO loss function from first principles. Learn how the optimal RLHF policy leads to reward reparameterization and direct preference optimization.

DPO Derivation
In the previous chapter, we introduced Direct Preference Optimization as a simpler alternative to RLHF that eliminates the need for an explicit reward model. We saw that DPO directly optimizes a language model using preference data, but we didn't examine why this works or how the DPO loss function is derived. This chapter fills that gap.
The DPO derivation is a key result in alignment research. It shows that the optimal policy for the RLHF objective has a closed-form solution, and this solution can be rearranged to express the implicit reward in terms of the policy itself. When we substitute this reparameterized reward into the Bradley-Terry preference model, we obtain the DPO loss function directly, with no reinforcement learning required.
Understanding this derivation reveals why DPO works, its underlying assumptions, and how to interpret the model's learning process. By the end of this chapter, you'll see that DPO is essentially solving a classification problem where the model learns to assign higher probability to preferred responses.
The RLHF Objective
Let's begin with the objective that RLHF aims to optimize. As we discussed in the chapters on PPO for Language Models and KL Divergence Penalty, the goal is to find a policy that maximizes expected reward while staying close to a reference policy (typically the supervised fine-tuned model).
Before diving into the mathematics, it helps to understand the intuition behind this objective. We want our language model to generate responses that humans prefer, which is captured by the reward function. However, if we optimize the reward too aggressively, the model might find unexpected shortcuts or produce degenerate outputs that technically achieve high reward but don't represent genuinely helpful behavior. The reference policy serves as an anchor, representing the model's pre-trained knowledge and natural language capabilities. By penalizing deviations from this anchor, we encourage the model to improve its outputs while maintaining coherent, fluent generation.
The constrained optimization problem is:
$$\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(y \mid x)}\big[r_\phi(x, y)\big] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\big[\pi_\theta(y \mid x)\,\|\,\pi_{\mathrm{ref}}(y \mid x)\big]$$
where:
- $\pi_\theta$: the policy being optimized
- $\mathbb{E}$: the expectation operator
- $x \sim \mathcal{D}$: a prompt sampled from the data distribution
- $y \sim \pi_\theta(y \mid x)$: a response sampled from the policy
- $r_\phi(x, y)$: the learned reward function
- $\beta$: a hyperparameter controlling the strength of the KL constraint
- $\mathbb{D}_{\mathrm{KL}}$: the Kullback-Leibler divergence
- $\pi_{\mathrm{ref}}$: the reference policy
To work with this objective more directly, we need to expand the KL divergence term. Recall that the KL divergence measures how different one probability distribution is from another, expressed as the expected log ratio of the two distributions. Writing out the KL divergence explicitly, this becomes:
$$\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(y \mid x)}\left[r_\phi(x, y) - \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}\right]$$
where:
- $\mathbb{E}$: the expectation operator
- $\mathcal{D}$: the data distribution
- $r_\phi(x, y)$: the reward function
- $\beta$: the KL penalty coefficient
- $\log$: the natural logarithm
- $\pi_\theta(y \mid x)$: probability of response $y$ given prompt $x$ under the policy
- $\pi_{\mathrm{ref}}(y \mid x)$: probability of response $y$ given prompt $x$ under the reference model
This reformulation reveals the objective's structure more clearly. The log ratio $\log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}$ measures how much the policy has shifted away from the reference for a particular response. When this ratio is positive (the policy assigns more probability than the reference), the KL term subtracts from the objective, penalizing the deviation. When the ratio is negative (the policy assigns less probability), the term adds to the objective, which might seem like a reward, but since we're sampling from $\pi_\theta$, we're unlikely to generate responses where $\pi_\theta$ assigns low probability.
This objective captures a fundamental tension in alignment: we want the model to produce high-reward outputs (as judged by human preferences), but we don't want it to deviate too far from its pre-trained behavior. Without the KL penalty, the model might find degenerate solutions that exploit flaws in the reward model, a phenomenon we explored in the chapter on Reward Hacking.
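To make the expanded objective concrete, here is a tiny numerical sketch of the per-sample quantity $r(x, y) - \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}$, using made-up reward and log-probability values:

```python
# Per-sample value of the KL-regularized objective: r(x, y) - beta * log(pi/pi_ref).
# All numbers here are made up purely for illustration.
reward = 1.5            # hypothetical reward model score r(x, y)
logp_policy = -42.0     # hypothetical log pi_theta(y|x)
logp_ref = -45.0        # hypothetical log pi_ref(y|x)
beta = 0.1

log_ratio = logp_policy - logp_ref       # how far the policy has drifted from the reference
objective = reward - beta * log_ratio    # reward minus the KL penalty term
print(objective)                         # 1.5 - 0.1 * 3.0 = 1.2
```

A policy that drifts further from the reference (a larger log ratio) pays a larger penalty, so the same reward yields a smaller objective value.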
Deriving the Optimal Policy
The key insight behind DPO is that this optimization problem has a closed-form solution. Unlike many optimization problems in machine learning that require iterative gradient descent, this particular formulation admits an analytical answer. For a fixed prompt $x$, we're optimizing over the distribution $\pi(y \mid x)$ for all possible responses $y$. This is a constrained optimization problem over probability distributions: we must ensure our solution is a valid probability distribution that sums to one and assigns non-negative probability to every possible response.
The special structure of the RLHF objective allows for this closed-form solution. The expectation under $\pi$ combined with the KL divergence creates what's known as a variational problem over distributions. Such problems often have elegant solutions when the constraints are simple probability simplex constraints.
Let's work through the derivation step by step. For a single prompt $x$, the objective is:
$$\max_{\pi} \; \sum_y \pi(y \mid x)\, r(x, y) \;-\; \beta \sum_y \pi(y \mid x) \log \frac{\pi(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}$$
where:
- $\sum_y$: sum over all possible responses in the vocabulary
- $\pi(\cdot \mid x)$: the entire probability distribution over responses for prompt $x$
- $\pi(y \mid x)$: probability of response $y$ given prompt $x$
- $r(x, y)$: the reward function
- $\beta$: the KL penalty coefficient
- $\log$: the natural logarithm
- $\pi_{\mathrm{ref}}(y \mid x)$: the reference policy probability
We need to maximize this subject to the constraint that $\pi(\cdot \mid x)$ is a valid probability distribution: $\sum_y \pi(y \mid x) = 1$ and $\pi(y \mid x) \geq 0$ for all $y$.
To solve this constrained optimization problem, we use the method of Lagrange multipliers, a classical technique from calculus. The idea is to incorporate the constraint directly into the objective by introducing a new variable (the Lagrange multiplier) that penalizes violations of the constraint. At the optimal solution, the gradient of the objective with respect to the decision variables must be proportional to the gradient of the constraint.
The Lagrangian is:
$$\mathcal{L} = \sum_y \pi(y \mid x)\, r(x, y) \;-\; \beta \sum_y \pi(y \mid x)\big[\log \pi(y \mid x) - \log \pi_{\mathrm{ref}}(y \mid x)\big] \;-\; \lambda\left(\sum_y \pi(y \mid x) - 1\right)$$
where:
- $\mathcal{L}$: the Lagrangian function
- $\sum_y$: sum over all responses
- $\pi(y \mid x)$: probability of response $y$
- $r(x, y)$: the reward function
- $\beta$: the KL penalty coefficient
- $\log$: the natural logarithm
- $\pi_{\mathrm{ref}}(y \mid x)$: the reference policy probability
- $\lambda$: the Lagrange multiplier for the constraint that probabilities sum to 1
Notice that we've expanded the log ratio term from the original objective, separating it into $\log \pi(y \mid x)$ and $\log \pi_{\mathrm{ref}}(y \mid x)$. This separation makes taking derivatives more straightforward. The $\lambda$ term enforces our normalization constraint: if the probabilities don't sum to one, this term will be nonzero, and the Lagrange multiplier will adjust to push us toward a valid distribution.
Taking the derivative with respect to $\pi(y \mid x)$ for a specific $y$:
$$\frac{\partial \mathcal{L}}{\partial \pi(y \mid x)} = r(x, y) - \beta\big[\log \pi(y \mid x) + 1 - \log \pi_{\mathrm{ref}}(y \mid x)\big] - \lambda$$
where:
- $\frac{\partial \mathcal{L}}{\partial \pi(y \mid x)}$: the partial derivative of the Lagrangian with respect to the probability of response $y$
- $r(x, y)$: the reward function
- $\beta$: the KL penalty coefficient
- $\log$: the natural logarithm
- $\pi(y \mid x)$: the policy probability
- $\pi_{\mathrm{ref}}(y \mid x)$: the reference policy probability
- $\lambda$: the Lagrange multiplier
The $+1$ term comes from differentiating $\pi(y \mid x) \log \pi(y \mid x)$, which gives $\log \pi(y \mid x) + 1$. This is a standard result from calculus: when we differentiate $\pi \log \pi$ with respect to $\pi$, we apply the product rule, obtaining $\log \pi + \pi \cdot \frac{1}{\pi} = \log \pi + 1$. Setting this derivative equal to zero gives us the first-order optimality conditions.
Solving for $\log \pi(y \mid x)$:
$$\log \pi(y \mid x) = \log \pi_{\mathrm{ref}}(y \mid x) + \frac{1}{\beta}\, r(x, y) - 1 - \frac{\lambda}{\beta}$$
where:
- $\log \pi(y \mid x)$: the log-probability of response $y$
- $r(x, y)$: the reward function
- $\beta$: the KL penalty coefficient
- $\pi_{\mathrm{ref}}(y \mid x)$: the reference policy probability
- $-1 - \frac{\lambda}{\beta}$: constant terms related to the normalization constraint
This expression for the log-probability has an illuminating structure. The log-probability under our optimal policy equals the log-probability under the reference, adjusted by the scaled reward $\frac{1}{\beta} r(x, y)$, plus some constants that don't depend on $y$. Higher-reward responses get higher log-probabilities, with $\beta$ controlling how much the reward matters relative to staying close to the reference.
Exponentiating both sides:
$$\pi(y \mid x) = \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\left(\frac{1}{\beta}\, r(x, y)\right) \exp\!\left(-1 - \frac{\lambda}{\beta}\right)$$
where:
- $\pi(y \mid x)$: the policy probability
- $\pi_{\mathrm{ref}}(y \mid x)$: the reference policy probability
- $\exp$: the exponential function
- $\exp\!\left(\frac{1}{\beta}\, r(x, y)\right)$: the reward-scaling term
- $r(x, y)$: the reward function
- $\beta$: the KL penalty coefficient
- $\exp\!\left(-1 - \frac{\lambda}{\beta}\right)$: the normalization term, constant across all $y$
The term $\exp\!\left(-1 - \frac{\lambda}{\beta}\right)$ is just a normalizing constant that ensures the distribution sums to 1. Crucially, this term doesn't depend on $y$; it only depends on $x$ through the constraint that all probabilities must sum to one. We can absorb it into a partition function $Z(x)$:
$$\pi^*(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\left(\frac{1}{\beta}\, r(x, y)\right), \qquad Z(x) = \sum_y \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\left(\frac{1}{\beta}\, r(x, y)\right)$$
where:
- $\pi^*(y \mid x)$: the optimal policy distribution
- $Z(x)$: the partition function (normalizing constant) for prompt $x$
- $\pi_{\mathrm{ref}}(y \mid x)$: the reference policy distribution
- $r(x, y)$: the reward function
- $\beta$: the temperature parameter
This is the optimal policy for the RLHF objective. It has a clear interpretation: the optimal policy takes the reference distribution and reweights each response by the exponentiated reward, then normalizes. The exponential function ensures all probabilities are positive, naturally satisfying the non-negativity constraint. Responses with higher reward get exponentially more probability mass, with $\beta$ controlling how aggressively we reweight.
Consider what happens at extreme values of $\beta$ to see how this reweighting works. When $\beta$ is very large, the reward term $\frac{1}{\beta} r(x, y)$ becomes small, and $\exp\!\left(\frac{1}{\beta} r(x, y)\right) \approx 1$ for all responses. In this limit, the optimal policy stays very close to the reference distribution, barely adjusting for reward at all. Conversely, when $\beta$ is very small, the exponential amplifies small reward differences into massive probability shifts, concentrating all probability mass on the highest-reward response. The choice of $\beta$ therefore controls the exploration-exploitation trade-off: larger values favor staying close to the reference (exploration), while smaller values favor chasing high reward (exploitation).
The optimal policy has the form of a Boltzmann (or Gibbs) distribution from statistical mechanics, where $r(x, y)$ plays the role of negative energy and $\beta$ plays the role of temperature. Lower temperature (smaller $\beta$) concentrates probability on the highest-reward responses.
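To see this reweighting numerically, here is a toy sketch over a three-response "vocabulary" with made-up reference probabilities and rewards (the values are illustrative, not from any real model):

```python
# Toy illustration of the Boltzmann reweighting: pi*(y|x) is proportional to
# pi_ref(y|x) * exp(r(x, y) / beta). Reference probabilities and rewards are made up.
import numpy as np

ref_probs = np.array([0.5, 0.3, 0.2])   # reference distribution over 3 responses
rewards = np.array([0.0, 1.0, 2.0])     # hypothetical rewards for those responses

for beta in [10.0, 1.0, 0.1]:
    unnormalized = ref_probs * np.exp(rewards / beta)
    optimal = unnormalized / unnormalized.sum()   # dividing by the partition function Z(x)
    print(f"beta={beta:>5}: {np.round(optimal, 3)}")
```

With a large $\beta$ the optimal policy stays near the reference; with a small $\beta$ it collapses onto the highest-reward response, exactly as described above.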
Reparameterizing the Reward
Here's where the DPO derivation becomes clever. We've derived that the optimal policy has a specific relationship to the reward function. But we can also go backwards: given an optimal policy, we can solve for what reward function it implies.
This reverse direction might seem like a mathematical curiosity, but it turns out to be the key insight that makes DPO possible. If we can express the reward purely in terms of policies (without needing to train a separate reward model), then we can substitute this expression into the preference model and optimize directly. The reward becomes implicit in the policy rather than explicit in a separate neural network.
Starting from:
$$\pi^*(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\left(\frac{1}{\beta}\, r(x, y)\right)$$
where:
- $\pi^*(y \mid x)$: the optimal policy distribution
- $Z(x)$: the partition function
- $\pi_{\mathrm{ref}}(y \mid x)$: the reference policy distribution
- $r(x, y)$: the reward function
- $\beta$: the temperature parameter
We can rearrange to isolate the reward. First, take the log of both sides:
$$\log \pi^*(y \mid x) = \log \pi_{\mathrm{ref}}(y \mid x) + \frac{1}{\beta}\, r(x, y) - \log Z(x)$$
where:
- $\log \pi^*(y \mid x)$: log-probability under the optimal policy
- $\log \pi_{\mathrm{ref}}(y \mid x)$: log-probability under the reference policy
- $r(x, y)$: the reward function
- $\beta$: the KL penalty coefficient
- $\log Z(x)$: log of the partition function (depends only on $x$)
Taking the logarithm transforms our multiplicative relationship into an additive one, making algebraic manipulation simpler. The logarithm of a product becomes a sum of logarithms, and the logarithm of the exponential simply returns its argument.
Now, we rearrange terms to solve for $r(x, y)$:
$$r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)$$
where:
- $r(x, y)$: the implicit reward
- $\beta$: the KL penalty coefficient
- $\pi^*(y \mid x)$: the optimal policy
- $\pi_{\mathrm{ref}}(y \mid x)$: the reference policy
- $Z(x)$: the partition function
- $\beta \log \frac{\pi^*(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}$: the scaled log-ratio term
This is the reward reparameterization. It tells us that for any optimal policy $\pi^*$, we can express the implicit reward purely in terms of log probability ratios, plus a prompt-dependent constant $\beta \log Z(x)$.
The log ratio $\log \frac{\pi^*(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}$ has a natural interpretation: it measures how much the optimal policy has increased or decreased the probability of response $y$ compared to the reference. Responses that the optimal policy strongly prefers will have large positive log ratios, while responses it disfavors will have large negative log ratios. This log ratio, scaled by $\beta$, gives us the implicit reward up to an additive constant.
The partition function $Z(x)$ doesn't depend on $y$; it only depends on the prompt $x$. This will turn out to be crucial because when we compute preference probabilities, this term cancels out.
The DPO Loss Function
Now we can derive the DPO loss by substituting our reward reparameterization into the Bradley-Terry preference model. Recall from the chapter on the Bradley-Terry Model that the probability of preferring response $y_w$ over response $y_l$ is:
$$P(y_w \succ y_l \mid x) = \sigma\big(r(x, y_w) - r(x, y_l)\big)$$
where:
- $P(y_w \succ y_l \mid x)$: probability that response $y_w$ is preferred over $y_l$
- $\sigma$: the sigmoid function mapping values to $(0, 1)$
- $r(x, y_w)$: reward for the winning response
- $r(x, y_l)$: reward for the losing response
The Bradley-Terry model is elegant in its simplicity: the probability of preferring one response over another depends only on the difference in their rewards, passed through a sigmoid function. This means responses with much higher reward are strongly preferred, while responses with similar rewards have preference probabilities close to 0.5.
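A quick numeric sketch of this behavior (the reward values below are arbitrary):

```python
# Bradley-Terry preference probability sigma(r_w - r_l) for a few hypothetical reward gaps.
import math

def bradley_terry(r_w: float, r_l: float) -> float:
    """Probability of preferring the response with reward r_w over the one with r_l."""
    return 1.0 / (1.0 + math.exp(-(r_w - r_l)))

print(bradley_terry(2.0, -2.0))  # ~0.982: a much higher reward gives a strong preference
print(bradley_terry(0.1, 0.0))   # ~0.525: similar rewards give a near-50/50 preference
```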
Substituting our reparameterized reward:
$$P(y_w \succ y_l \mid x) = \sigma\!\left(\beta \log \frac{\pi^*(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} + \beta \log Z(x) - \beta \log \frac{\pi^*(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} - \beta \log Z(x)\right)$$
where:
- $\sigma$: the sigmoid function
- $\beta$: the temperature parameter
- $\pi^*$: the optimal policy
- $\pi_{\mathrm{ref}}$: the reference policy
- $\beta \log Z(x)$: the partition function terms, which appear with opposite signs
Notice that the $\beta \log Z(x)$ terms cancel! This cancellation is not a coincidence; it's a direct consequence of using reward differences in the Bradley-Terry model. The partition function contributes equally to both rewards, so when we subtract them, it disappears. This cancellation is essential for the practical success of DPO because computing $Z(x)$ would require summing over all possible responses, which is computationally intractable for language models with exponentially large output spaces.
This leaves:
$$P(y_w \succ y_l \mid x) = \sigma\!\left(\beta \log \frac{\pi^*(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi^*(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)$$
where:
- $\sigma$: the sigmoid function
- $\beta$: the temperature parameter
- $\pi^*$: the optimal policy
- $\pi_{\mathrm{ref}}$: the reference policy
We can simplify this using logarithm properties:
$$P(y_w \succ y_l \mid x) = \sigma\!\left(\beta\left[\log \frac{\pi^*(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \log \frac{\pi^*(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right]\right)$$
where:
- $\sigma$: the sigmoid function
- $\beta$: the KL penalty coefficient
- $\log \frac{\pi^*(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}$: the log-likelihood ratio of the optimal policy vs. the reference
This expression is remarkably clean. The probability of the correct preference depends entirely on the difference between two log ratios: how much the optimal policy prefers the winning response (relative to the reference) versus how much it prefers the losing response. When this difference is large and positive, the sigmoid outputs a value close to 1, indicating strong confidence in the correct preference.
This is the probability that the optimal policy assigns to the correct preference. To train a policy $\pi_\theta$ to match this optimal policy, we maximize the log-likelihood of the observed preferences:
$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]$$
where:
- $\mathcal{L}_{\mathrm{DPO}}$: the DPO loss function
- $\pi_\theta$: the policy network being trained (parameterized by $\theta$)
- $\pi_{\mathrm{ref}}$: the reference policy
- $\mathcal{D}$: the dataset of preference pairs $(x, y_w, y_l)$
- $\mathbb{E}$: expectation over the dataset
- $\beta$: the KL penalty coefficient
- $\sigma$: the sigmoid function
The negative sign appears because we're minimizing a loss rather than maximizing likelihood.
To make the notation more compact, define the scaled log-ratio for a response $y$:
$$\hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}$$
where:
- $\hat{r}_\theta(x, y)$: the implicit reward assigned by the current model
- $\beta$: the KL penalty coefficient scaling the reward
- $\pi_\theta(y \mid x)$: the policy probability
- $\pi_{\mathrm{ref}}(y \mid x)$: the reference policy probability
- $\log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}$: the log-ratio of the policy probability to the reference probability
This is sometimes called the "implicit reward" because it's what the reward would be if $\pi_\theta$ were optimal. The terminology is fitting: we never explicitly compute or learn a reward function, yet the policy implicitly defines one through its log-probability ratios with the reference model.
The DPO loss becomes:
$$\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\left[\log \sigma\big(\hat{r}_\theta(x, y_w) - \hat{r}_\theta(x, y_l)\big)\right]$$
where:
- $\mathcal{L}_{\mathrm{DPO}}$: the DPO loss function
- $\hat{r}_\theta$: the implicit reward function
- $\hat{r}_\theta(x, y_w) - \hat{r}_\theta(x, y_l)$: the margin between implicit rewards for winning and losing responses
- $\sigma$: the sigmoid function
- $\mathbb{E}$: expectation over the dataset
This is remarkably simple. The loss encourages the model to increase its log probability (relative to the reference) on preferred responses and decrease it on dispreferred responses.
DPO as Classification
The DPO loss has a revealing interpretation: it's essentially binary classification with a specific parameterization. Consider a standard binary cross-entropy loss for classifying which of two responses is preferred:
$$\mathcal{L}_{\mathrm{BCE}} = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\left[\log \sigma\big(f(x, y_w, y_l)\big)\right]$$
where:
- $\mathcal{L}_{\mathrm{BCE}}$: the binary cross-entropy loss
- $f(x, y_w, y_l)$: a scoring function indicating preference strength
- $\sigma$: the sigmoid function
- $\mathbb{E}$: expectation over the dataset
In this context, $f(x, y_w, y_l)$ is some function that should output a positive value when $y_w$ is truly preferred. In DPO, this function is:
$$f(x, y_w, y_l) = \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}$$
where:
- $f(x, y_w, y_l)$: the logit fed into the sigmoid classifier
- $\beta$: the KL penalty coefficient
- $\pi_\theta$: the policy network
- $\pi_{\mathrm{ref}}$: the reference policy
The model is learning to classify preference pairs by adjusting its output probabilities. When the loss is low, the model assigns higher relative probability to the preferred response.
This classification view explains why DPO is so stable compared to RLHF. Instead of learning a separate reward model and then using policy gradients with their high variance, DPO directly optimizes the likelihood of preference labels. The gradients flow directly through the model's log probabilities, just like in standard language model training.
The gradient of the DPO loss with respect to the model parameters $\theta$ has a particularly intuitive form. For a single example:
$$\nabla_\theta \mathcal{L}_{\mathrm{DPO}} = -\beta\, \sigma\big(\hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w)\big)\Big[\nabla_\theta \log \pi_\theta(y_w \mid x) - \nabla_\theta \log \pi_\theta(y_l \mid x)\Big]$$
where:
- $\nabla_\theta$: gradient with respect to model parameters
- $\mathcal{L}_{\mathrm{DPO}}$: the DPO loss
- $\sigma$: the sigmoid function
- $\hat{r}_\theta$: the implicit reward function
- $\sigma\big(\hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w)\big)$: the weighting term derived from the sigmoid derivative (probability of the wrong preference)
- $\nabla_\theta \log \pi_\theta(y_w \mid x) - \nabla_\theta \log \pi_\theta(y_l \mid x)$: the geometric gradient direction
- $\beta$: the KL penalty coefficient
This gradient expression reveals the geometry of DPO optimization. The term $\nabla_\theta \log \pi_\theta(y_w \mid x) - \nabla_\theta \log \pi_\theta(y_l \mid x)$ points in the direction that increases the probability of the winning response while decreasing the probability of the losing response. This is the natural direction to push given our objective.
The term $\sigma\big(\hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w)\big)$ is the probability that the model currently assigns to the wrong preference. This acts as an implicit weighting:
- When the model is confident and correct (low wrong-preference probability), the gradient is small
- When the model is confident but wrong, the gradient is large
- When the model is uncertain, the gradient is moderate
This automatic weighting helps the model focus on examples it's getting wrong, similar to how hard negative mining works in contrastive learning.
Worked Example
Let's trace through the derivation with concrete numbers to build intuition. Suppose we have a prompt $x$ = "Write a haiku about autumn" and two responses:
- $y_w$ (preferred): "Crimson leaves descend / Dancing on October wind / Earth prepares to sleep"
- $y_l$ (dispreferred): "Leaves fall down in fall / The weather gets cold outside / I like pumpkin spice"
Assume our reference model assigns the following sequence-level log probabilities (illustrative values):
- $\log \pi_{\mathrm{ref}}(y_w \mid x) = -25.0$ (log probability of generating the preferred response)
- $\log \pi_{\mathrm{ref}}(y_l \mid x) = -22.0$ (log probability of generating the dispreferred response)
The reference model actually assigns higher probability to the dispreferred response because it's simpler and uses more common words.
Now suppose our current policy assigns:
- $\log \pi_\theta(y_w \mid x) = -21.9$
- $\log \pi_\theta(y_l \mid x) = -22.8$
With $\beta = 0.1$, the implicit rewards are:
$$\hat{r}_\theta(x, y_w) = 0.1 \times \big({-21.9} - ({-25.0})\big) = 0.1 \times 3.1 = 0.310$$
$$\hat{r}_\theta(x, y_l) = 0.1 \times \big({-22.8} - ({-22.0})\big) = 0.1 \times ({-0.8}) = -0.080$$
The difference is:
$$\hat{r}_\theta(x, y_w) - \hat{r}_\theta(x, y_l) = 0.310 - ({-0.080}) = 0.390$$
The loss contribution from this example:
$$-\log \sigma(0.390) = -\log(0.596) \approx 0.517$$
The model is doing okay on this example (probability of correct preference is about 60%), but there's room to improve. The gradient will push the model to further increase $\pi_\theta(y_w \mid x)$ relative to the reference and decrease $\pi_\theta(y_l \mid x)$ relative to the reference.
Code Implementation
Let's implement the DPO loss function and verify our derivation with code.
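Below is a minimal PyTorch sketch. It assumes the sequence-level log-probabilities $\log \pi(y \mid x)$ (summed over tokens) have already been computed for each response under both the policy and the reference model; the function name `dpo_loss` and its signature are illustrative rather than a fixed library API.

```python
# Minimal sketch of the DPO loss from precomputed sequence log-probabilities.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1):
    """Each tensor has shape (batch,) and holds log pi(y|x) summed over tokens.

    Returns the mean DPO loss plus the implicit rewards for inspection.
    """
    # Implicit rewards: beta * log(pi_theta / pi_ref) for each response
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # The DPO logit is the margin between implicit rewards
    logits = chosen_rewards - rejected_rewards

    # -log sigmoid(margin): binary cross-entropy on the preference label
    loss = -F.logsigmoid(logits).mean()
    return loss, chosen_rewards, rejected_rewards
```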
Now let's verify with our worked example:
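Using the sketch above with the illustrative log-probabilities from the worked example:

```python
# Verify the worked example (illustrative log-probabilities, beta = 0.1).
import torch

policy_chosen = torch.tensor([-21.9])
policy_rejected = torch.tensor([-22.8])
ref_chosen = torch.tensor([-25.0])
ref_rejected = torch.tensor([-22.0])

loss, r_chosen, r_rejected = dpo_loss(policy_chosen, policy_rejected,
                                      ref_chosen, ref_rejected, beta=0.1)

margin = r_chosen - r_rejected
pref_prob = torch.sigmoid(margin)

print(f"Implicit reward (chosen):   {r_chosen.item():.3f}")    # 0.310
print(f"Implicit reward (rejected): {r_rejected.item():.3f}")  # -0.080
print(f"Reward margin:              {margin.item():.3f}")      # 0.390
print(f"P(chosen preferred):        {pref_prob.item():.3f}")   # 0.596
print(f"DPO loss:                   {loss.item():.3f}")        # 0.517
```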
The implicit rewards align with our derivation: the chosen response has a positive reward (0.310), while the rejected response has a negative reward (-0.080). The positive difference (0.390) results in a preference probability of 0.596. This indicates the model correctly prefers the chosen response, though the probability is only moderately high.
Let's also visualize how the loss and preference probability change as the model learns:
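One way to sketch this (assuming matplotlib is available) is to sweep the implicit reward margin $\hat{r}_\theta(x, y_w) - \hat{r}_\theta(x, y_l)$ and plot the loss $-\log \sigma(\text{margin})$ alongside the preference probability $\sigma(\text{margin})$:

```python
# Sweep the implicit reward margin and plot the DPO loss and preference probability.
import numpy as np
import matplotlib.pyplot as plt

margins = np.linspace(-5, 5, 200)
pref_probs = 1.0 / (1.0 + np.exp(-margins))   # sigmoid(margin)
losses = -np.log(pref_probs)                  # -log sigmoid(margin)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(margins, losses)
ax1.set_xlabel("Implicit reward margin")
ax1.set_ylabel("DPO loss")
ax1.set_title("Loss vs. reward margin")

ax2.plot(margins, pref_probs)
ax2.set_xlabel("Implicit reward margin")
ax2.set_ylabel("P(chosen preferred)")
ax2.set_title("Preference probability vs. reward margin")

plt.tight_layout()
plt.show()
```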
The loss approaches zero as the reward difference increases, meaning the model strongly prefers chosen responses. Conversely, when the model prefers rejected responses (negative reward difference), the loss grows large.
Let's also implement a function to compute the gradients and see how the implicit weighting works:
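A small sketch of that weighting, $\sigma\big(\hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w)\big)$, evaluated on a few illustrative scenarios (the reward values are made up to span the confident-correct, uncertain, and confident-wrong regimes):

```python
# The DPO gradient weight is sigma(r_rejected - r_chosen): the probability the
# model currently assigns to the *wrong* preference. Scenario values are illustrative.
import torch

def dpo_gradient_weight(chosen_reward: float, rejected_reward: float) -> float:
    """Return sigma(r_l - r_w), the scaling factor on the DPO gradient."""
    return torch.sigmoid(torch.tensor(rejected_reward - chosen_reward)).item()

scenarios = [
    ("Model strongly correct", 5.0, -5.0),
    ("Model uncertain",        0.2, -0.2),
    ("Strongly wrong",        -2.5,  2.5),
]

for name, r_chosen, r_rejected in scenarios:
    weight = dpo_gradient_weight(r_chosen, r_rejected)
    print(f"{name:<23} margin = {r_chosen - r_rejected:+.1f}  gradient weight = {weight:.3f}")
```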
Notice how the gradient weight automatically adjusts based on how wrong the model is. In the "Model strongly correct" case, the gradient weight is near zero (0.000), meaning there's little to learn. Conversely, when the model is "Strongly wrong", the gradient weight approaches one (0.993), pushing hard to correct the mistake.
Key Parameters
The key parameter for DPO is:
- beta: The temperature parameter (often denoted as $\beta$) that scales the log-ratio of the policy and reference probabilities. It controls the strength of the KL divergence penalty, with larger values keeping the policy closer to the reference model.
Assumptions and Validity
The DPO derivation makes several assumptions worth examining:
- Perfect reward model: The Bradley-Terry model with the true reward function correctly captures human preferences. In practice, human preferences are noisy, inconsistent, and context-dependent. The DPO loss treats all preference labels as ground truth, which can be problematic when labels are unreliable.
- Boltzmann-form optimal policy: The optimal policy exists and has the Boltzmann form derived earlier. This is guaranteed for the specific RLHF objective we started with, but different formulations (such as using a different divergence measure) would yield different optimal policies and potentially different direct alignment algorithms.
- Model capacity: DPO optimizes toward the optimal policy but doesn't guarantee reaching it. The policy is constrained by the model's architecture and capacity. A small model may not be able to represent the true optimal policy, in which case DPO finds the best approximation within the model class.
- Reference model availability: DPO requires that we can compute $\pi_{\mathrm{ref}}(y \mid x)$ exactly for the same responses we're evaluating under $\pi_\theta$. This is straightforward when both are the same model architecture, but becomes complex if the reference model is unavailable or uses different tokenization. The next chapter on DPO Implementation will address these practical considerations in detail.
Summary
This chapter derived the DPO loss function from first principles, showing how it emerges naturally from the RLHF objective.
The key steps in the derivation were:
- RLHF objective: Maximize expected reward with a KL penalty to stay close to the reference policy
- Optimal policy: Using Lagrange multipliers, we found that the optimal policy is a Boltzmann distribution that reweights the reference by exponentiated rewards
- Reward reparameterization: Inverting this relationship, we expressed the implicit reward as a log-ratio between the optimal and reference policies, plus a partition function
- Bradley-Terry substitution: Plugging the reparameterized rewards into the preference model, the partition functions cancel, leaving a loss in terms of log-ratios only
- Classification view: The resulting DPO loss is binary cross-entropy, learning to classify which response is preferred based on relative log-probabilities
The gradient of the DPO loss has an automatic weighting scheme: examples where the model is confidently wrong receive larger gradients, while examples where it's already correct receive smaller gradients. This makes DPO training stable and efficient.
In the next chapter, we'll implement DPO end-to-end, covering practical details like computing sequence log-probabilities, handling padding, and integrating with Hugging Face's training infrastructure.
Quiz
Ready to test your understanding? Take this quick quiz to reinforce what you've learned about the DPO derivation.
















