
DPO Variants
In the previous chapters, we derived Direct Preference Optimization from first principles and implemented it as a simpler alternative to RLHF. While DPO represented a significant advance in alignment methodology, researchers quickly identified several limitations: sensitivity to noisy preference labels, the requirement for paired preference data, potential overfitting to extreme probability ratios, and the computational overhead of maintaining a frozen reference model. These challenges motivated a wave of DPO variants, each addressing specific weaknesses while preserving the core insight that preference learning can bypass explicit reward modeling.
This chapter explores the most influential DPO variants: IPO addresses overfitting through a regularized objective, KTO enables learning from unpaired binary feedback, ORPO eliminates the reference model entirely, and cDPO handles label noise through conservative smoothing. Understanding these variants equips you to select the right alignment method for your data characteristics and computational constraints.
Identity Preference Optimization (IPO)
IPO emerged from a careful analysis of DPO's theoretical properties. Azar et al. (2024) identified a subtle but important issue: as DPO training progresses, the implicit reward gap between chosen and rejected responses can grow unboundedly, causing the policy to assign extreme probabilities that don't reflect the actual strength of human preferences.
The Overfitting Problem in DPO
To understand why IPO was developed, we must first examine a fundamental tension within DPO's optimization dynamics. Recall from our DPO derivation that the loss function is:

$$\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$

where:
- $\mathcal{L}_{\text{DPO}}$: the DPO loss function
- $\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}$: the expectation over the dataset of preference pairs
- $\pi_\theta(y \mid x)$: the probability of response $y$ given prompt $x$ under the policy being trained
- $\pi_{\text{ref}}(y \mid x)$: the probability under the frozen reference model
- $y_w, y_l$: the chosen (winner) and rejected (loser) responses
- $\beta$: the temperature parameter controlling the strength of the KL divergence penalty
- $\sigma$: the sigmoid function
To appreciate the overfitting problem, consider what this loss function is asking the model to do. The expression inside the sigmoid represents the difference between two log-ratios: how much more likely the policy makes the chosen response compared to the reference, minus the same quantity for the rejected response. The negative log-sigmoid of this quantity becomes the loss. When training minimizes this loss, it maximizes what's inside the log-sigmoid.
The sigmoid function approaches 1 as its argument grows large. To minimize this loss (maximize the log), the model is incentivized to make the log-ratio difference as large as possible. In the limit, this pushes the model toward assigning probability approaching 1 to chosen responses and probability approaching 0 to rejected ones. There is no natural stopping point in this formulation: the gradient always points toward making the gap larger, even when the gap is already enormous.
This behavior is problematic for several reasons:
- Overconfidence: Human preferences aren't absolute. A response labeled "chosen" isn't infinitely better than the "rejected" alternative; it's merely preferred in that comparison. Two responses might differ only slightly in quality, yet the labeling treats one as definitively better. When the model learns to assign extreme probabilities based on these labels, it develops a false sense of certainty that doesn't match the underlying reality of human judgment.
- Training instability: As log probabilities approach $-\infty$ for rejected responses, gradients can become unstable. The model must push probability mass away from rejected responses, and when those probabilities become vanishingly small, the numerical representations become problematic. Small perturbations can cause large swings in the loss.
- Poor generalization: Overfitting to training preferences may not transfer to held-out prompts. When the model learns extreme preferences on training data, it essentially memorizes that specific responses are "infinitely good" or "infinitely bad" rather than learning generalizable patterns about what makes responses better or worse.
IPO's Squared Error Objective
IPO addresses this by reformulating preference learning as a regression problem with a specific target. The key insight is that instead of allowing the reward gap to grow without bound, we should specify how large we want that gap to be and then train the model to achieve exactly that target. This transforms an unbounded optimization problem into a bounded one with clear convergence criteria.
Instead of pushing the reward gap to infinity, IPO targets a fixed margin:

$$\mathcal{L}_{\text{IPO}}(\pi_\theta; \pi_{\text{ref}}) = \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\left(\log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} - \frac{1}{2\beta}\right)^2\right]$$

where:
- $\mathcal{L}_{\text{IPO}}$: the IPO loss function
- $\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}$: the expectation over the dataset of preference pairs
- $\frac{1}{2\beta}$: the target margin derived from the regularization strength $\beta$
- $\pi_\theta, \pi_{\text{ref}}$: the policy and reference models
The key innovation is the squared loss with target $\frac{1}{2\beta}$. This creates fundamentally different optimization dynamics. Rather than rewarding the model for making the gap as large as possible, IPO rewards the model for making the gap equal to a specific value. Any deviation from this target, whether too small or too large, incurs a penalty. This is the essence of regression: we have a target value, and we minimize the squared distance to that target.
To analyze these dynamics more precisely, let's define the log-ratio difference as:

$$h_\theta(x, y_w, y_l) = \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}$$

where:
- $h_\theta(x, y_w, y_l)$: the difference in log-probability ratios between chosen and rejected responses
- $\log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}$: the "implicit reward" for a specific response
This quantity captures the essence of what preference optimization is trying to achieve. It measures how much the policy has learned to prefer the chosen response over the rejected one, relative to what the reference model would predict. A positive value means the policy has shifted probability mass toward the chosen response; a larger positive value means a stronger learned preference.
With this notation, the IPO loss becomes elegantly simple:

$$\mathcal{L}_{\text{IPO}} = \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\left(h_\theta(x, y_w, y_l) - \frac{1}{2\beta}\right)^2\right]$$

where:
- $\mathcal{L}_{\text{IPO}}$: the IPO loss function
- $\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}$: the expectation over the dataset
- $h_\theta(x, y_w, y_l)$: the log-ratio difference computed by the model
- $\frac{1}{2\beta}$: the specific target value that the difference should converge to
This formulation makes the regression nature of IPO crystal clear. We are asking the model to make $h_\theta$ equal to $\frac{1}{2\beta}$ for every preference pair in the dataset. The squared loss penalizes deviations in either direction: if the model hasn't learned a strong enough preference ($h_\theta < \frac{1}{2\beta}$), the loss is positive and gradients push toward a larger gap; if the model has learned too strong a preference ($h_\theta > \frac{1}{2\beta}$), the loss is also positive and gradients push toward a smaller gap.
The gradient with respect to $h_\theta$ reveals this self-correcting behavior:

$$\frac{\partial \mathcal{L}_{\text{IPO}}}{\partial h_\theta} = 2\left(h_\theta - \frac{1}{2\beta}\right)$$

where:
- $\frac{\partial \mathcal{L}_{\text{IPO}}}{\partial h_\theta}$: the gradient of the loss with respect to the log-ratio difference
- $h_\theta - \frac{1}{2\beta}$: the error term (distance from target) that drives the update
This gradient is zero when $h_\theta = \frac{1}{2\beta}$, creating a stable equilibrium. Once the log-ratio difference reaches the target margin, there's no further pressure to increase it. This bounded optimization prevents the extreme probability assignments that plague standard DPO. The equilibrium is not just stable but attractive: regardless of where training starts, the gradients always point toward the target, and the strength of the gradient is proportional to the distance from the target.
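To make this self-correction concrete, here is a minimal numeric sketch (plain Python, no training loop) that evaluates the IPO gradient on both sides of the target; the choice β = 0.1 is purely illustrative:

```python
beta = 0.1
target = 1 / (2 * beta)  # target margin = 5.0

def ipo_grad(h: float) -> float:
    """Gradient of the IPO squared loss with respect to the log-ratio difference h."""
    return 2 * (h - target)

for h in [0.0, 2.5, 5.0, 7.5, 10.0]:
    # Negative gradient below the target (push h up), zero at the target,
    # positive gradient above it (push h back down).
    print(f"h = {h:4.1f}  ->  dL/dh = {ipo_grad(h):+.1f}")
```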
Interpreting the Target Margin
The target $\frac{1}{2\beta}$ has a principled interpretation that connects IPO back to the theoretical foundations of preference learning. The parameter $\beta$ controls the strength of the KL constraint (as discussed in Part XXVII, Chapter 10): a larger $\beta$ means stronger regularization, keeping the policy closer to the reference, while a smaller $\beta$ permits larger deviations. With IPO's target $\frac{1}{2\beta}$:
- When $\beta$ is large (strong KL penalty), the target margin $\frac{1}{2\beta}$ is small
- When $\beta$ is small (weak KL penalty), the target margin $\frac{1}{2\beta}$ is large
This matches intuition once you recall what the margin represents: how far the policy is allowed to diverge from the reference on each preference pair. A tight KL budget forces the per-example preference gap to stay small, while a loose budget lets the policy express stronger per-example preferences before the regularization bites.
The target margin also has an interpretation in terms of the Bradley-Terry model that underlies preference learning. In this model, the probability that response $y_w$ is preferred to $y_l$ depends on the difference in their rewards. The target $\frac{1}{2\beta}$ corresponds to a specific preference probability that represents a moderate preference rather than absolute certainty. This aligns with the reality that human annotations express preferences, not certainties.
Gradient Comparison
Different loss functions result in different gradient behaviors, explaining why methods perform differently in practice. The gradient determines how the model updates its parameters at each step, so differences in gradient behavior translate directly into differences in training dynamics.
DPO gradient magnitude:

$$\left|\frac{\partial \mathcal{L}_{\text{DPO}}}{\partial h_\theta}\right| = \beta \, \sigma(-\beta h_\theta)$$

where:
- $\beta$: the temperature parameter
- $\sigma(-\beta h_\theta)$: the sigmoid of the negative scaled log-ratio difference, acting as a weighting factor
This sigmoid term means DPO's gradient is largest when $h_\theta$ is near zero or negative (uncertain or wrongly ordered predictions) and vanishes as $h_\theta \to \infty$ (confident predictions). The intuition here is that DPO pushes hardest when the model is uncertain about which response is preferred, and pushes more gently when the model is already confident. While vanishing gradients prevent infinite growth in principle, they also mean learning slows dramatically once the model becomes confident. This creates a problematic dynamic: the model can still drift toward extreme probabilities because the gradients, though small, never change sign and always push the gap larger.
IPO gradient magnitude:

$$\left|\frac{\partial \mathcal{L}_{\text{IPO}}}{\partial h_\theta}\right| = 2\left|h_\theta - \frac{1}{2\beta}\right|$$

where:
- $\left|h_\theta - \frac{1}{2\beta}\right|$: the absolute distance from the target margin
IPO's gradient magnitude is proportional to the distance from the target. If the model overshoots the target ($h_\theta > \frac{1}{2\beta}$), the gradient pushes it back. This self-correcting behavior is absent in DPO. The gradient grows stronger as the model strays further from the target, which means that even large deviations are corrected rather than allowed to persist. This property makes IPO training more robust to initialization and hyperparameter choices, since the optimization will eventually find the target regardless of where it starts.
Kahneman-Tversky Optimization (KTO)
While DPO and IPO improve upon RLHF's computational complexity, they share a fundamental data requirement: paired preferences. Each training example must contain a prompt with both a chosen and rejected response. This pairing constraint creates practical challenges.
The Unpaired Feedback Problem
Real-world human feedback often comes in unpaired form, and this mismatch between how feedback is collected and how preference optimization algorithms expect data creates significant friction in practical applications:
- Binary ratings: Users click thumbs up or thumbs down on individual responses
- Flagging systems: Users report problematic outputs without providing alternatives
- Implicit signals: Engagement metrics indicate whether a response was helpful
Converting this abundant unpaired feedback into DPO's paired format requires either discarding data or artificially constructing pairs. Discarding data wastes valuable signal; constructing pairs introduces artifacts that may not reflect genuine preferences. Consider a scenario where you have 10,000 thumbs-up ratings and 5,000 thumbs-down ratings, but these ratings come from different conversations with different prompts. To use DPO, you would need to either match these into pairs somehow, losing most of your data, or generate new responses to create artificial comparisons.
KTO, introduced by Ethayarajh et al. (2024), eliminates this requirement by designing an objective that works directly with unpaired binary feedback. This enables learning from the full breadth of available feedback without forcing it into an unnatural format.
Inspiration from Prospect Theory
KTO draws inspiration from Kahneman and Tversky's prospect theory, which describes how humans actually make decisions under uncertainty. This connection to behavioral economics is more than a naming convention: it provides principled guidance for how to weight different types of feedback. Two key insights from behavioral economics inform KTO's design:
Reference dependence: People evaluate outcomes relative to a reference point, not in absolute terms. A $100 gain feels different depending on whether you expected $0 or $200. This insight suggests that the "goodness" of a model response should be measured relative to some baseline expectation, not in absolute terms. A response that seems mediocre after an excellent previous response might seem impressive after a poor one.
Loss aversion: Losses loom larger than equivalent gains. Losing $100 feels worse than gaining $100 feels good. Empirically, losses are weighted approximately 2x more heavily than gains. For alignment, this suggests that avoiding bad outputs might be more important than producing good ones. You might forgive a bland response, but you remember harmful or incorrect ones.
KTO incorporates both principles into its loss function, treating "desirable" and "undesirable" responses asymmetrically.
The KTO Loss Function
The construction of KTO's loss function proceeds in several stages, each motivated by the behavioral economics principles described above. We begin by defining the implicit reward, which serves as the raw signal that KTO will transform into a learning objective.
For a response $y$ to prompt $x$ with binary label $\ell(x, y) \in \{\text{desirable}, \text{undesirable}\}$, KTO defines:

$$r_\theta(x, y) = \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}$$

where:
- $r_\theta(x, y)$: the implicit reward assigned to response $y$ given prompt $x$
- $\pi_\theta(y \mid x)$: the probability of the response under the current policy
- $\pi_{\text{ref}}(y \mid x)$: the probability of the response under the reference model
This is the implicit reward, the same quantity used in DPO. It measures how much more likely the current policy makes this response compared to the reference. A positive implicit reward means the policy has learned to favor this response; a negative implicit reward means the policy has learned to disfavor it. The key insight is that this quantity is well-defined even for a single response, without needing a comparison partner.
KTO then defines a reference point that implements the reference dependence principle from prospect theory:

$$z_{\text{ref}} = \mathbb{E}_{x \sim \mathcal{D}}\left[\mathrm{KL}\left(\pi_\theta(\cdot \mid x) \,\|\, \pi_{\text{ref}}(\cdot \mid x)\right)\right]$$

where:
- $z_{\text{ref}}$: the reference point (baseline) for evaluation
- $\pi_\theta, \pi_{\text{ref}}$: the policy and reference models
- $\mathcal{D}$: the training dataset distribution
- $\mathbb{E}_{x \sim \mathcal{D}}$: the expectation over prompts sampled from $\mathcal{D}$
- $\mathrm{KL}(\pi_\theta \,\|\, \pi_{\text{ref}})$: the Kullback-Leibler divergence measuring the drift of the policy from the reference
The reference point represents the expected KL divergence between policy and reference across the training distribution. In practice, this is estimated from a running average during training. The reference point serves a crucial role: it defines what counts as "above average" versus "below average" performance. Rather than using an arbitrary fixed threshold, KTO adapts the reference point to the current state of training, creating a dynamic baseline that evolves as the model improves.
With the implicit reward and reference point defined, KTO constructs a value function that differs based on whether the response is desirable:

$$v(x, y) = \begin{cases} \sigma\left(\beta\,(r_\theta(x, y) - z_{\text{ref}})\right) & \text{if } y \text{ is desirable} \\ \sigma\left(\beta\,(z_{\text{ref}} - r_\theta(x, y))\right) & \text{if } y \text{ is undesirable} \end{cases}$$

where:
- $v(x, y)$: the value assigned to the response, bounded between 0 and 1
- $\sigma$: the sigmoid function
- $\ell(x, y)$: the label indicating if the response is desirable or undesirable
- $\beta\,(r_\theta(x, y) - z_{\text{ref}})$: the shifted implicit reward used for desirable examples
This asymmetric definition is the mathematical implementation of reference dependence. For desirable responses, we ask: is the implicit reward higher than the reference point? For undesirable responses, we ask: is the implicit reward lower than the reference point? The sigmoid function squashes these comparisons into the range (0, 1), creating a smooth value measure.
The loss weights desirable and undesirable examples differently, implementing the loss aversion principle:

$$\mathcal{L}_{\text{KTO}} = \mathbb{E}_{(x, y) \sim \mathcal{D}}\left[\lambda_y \left(1 - v(x, y)\right)\right]$$

where:
- $\mathcal{L}_{\text{KTO}}$: the KTO loss function
- $\mathbb{E}_{(x, y) \sim \mathcal{D}}$: the expectation over the dataset of labeled examples
- $\lambda_y$: the weighting factor specific to the label type
- $v(x, y)$: the value computed by the value function
Here $\lambda_y = \lambda_D$ for desirable examples and $\lambda_y = \lambda_U$ for undesirable examples, typically with $\lambda_U > \lambda_D$ to implement loss aversion. By setting $\lambda_U > \lambda_D$, we tell the model that failing to suppress a bad response is worse than failing to promote a good response. This asymmetry reflects the empirical finding that people are more bothered by failures than impressed by successes.
Understanding the Value Function
The asymmetric value function encodes prospect theory's core insights, and understanding its behavior illuminates why KTO works. The function transforms implicit rewards into values differently depending on the label, creating distinct learning signals for positive and negative feedback.
For desirable responses, the sigmoid argument is $\beta\,(r_\theta(x, y) - z_{\text{ref}})$. The model is rewarded (low loss) when the implicit reward exceeds the reference point. Intuitively, the model should push desirable responses to have higher implicit rewards than average. When $r_\theta(x, y) > z_{\text{ref}}$, the sigmoid output is greater than 0.5, meaning the value is high and the loss is low. The model learns to associate desirable responses with above-average implicit rewards.
For undesirable responses, the sigmoid argument is $\beta\,(z_{\text{ref}} - r_\theta(x, y))$. The model is rewarded when the implicit reward falls below the reference point. The model should push undesirable responses to have lower implicit rewards than average. When $r_\theta(x, y) < z_{\text{ref}}$, the sigmoid output is greater than 0.5, meaning the value is high and the loss is low. The model learns to associate undesirable responses with below-average implicit rewards.
The reference point serves as the dividing line between "gains" (implicit rewards above average) and "losses" (implicit rewards below average). This relative framing means KTO doesn't need paired comparisons; it learns to push good responses up and bad responses down relative to an adaptive baseline. The elegance of this approach is that it converts an inherently comparative problem (which response is better?) into a classification problem (is this response good or bad?), which is exactly the form that unpaired feedback provides.
Practical Implementation Details
KTO requires tracking the reference point during training. The running average KL is typically computed as an exponential moving average:

$$z_{\text{ref}}^{(t)} = \mu \, z_{\text{ref}}^{(t-1)} + (1 - \mu)\, \widehat{\mathrm{KL}}_t$$

where $\mu$ is a momentum coefficient (typically 0.9-0.99, as noted in the hyperparameter guidelines later in this chapter) and $\widehat{\mathrm{KL}}_t$ is the KL estimate computed on the current batch.
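A minimal PyTorch-style sketch of this bookkeeping, assuming summed per-response log-probabilities for the policy and reference are already available and that responses were sampled from the policy; the class and method names are illustrative:

```python
import torch

class KLReferenceTracker:
    """Running estimate of z_ref = E_x[KL(pi_theta || pi_ref)] via an exponential moving average."""

    def __init__(self, momentum: float = 0.99):
        self.momentum = momentum
        self.z_ref = 0.0

    def update(self, policy_logps: torch.Tensor, ref_logps: torch.Tensor) -> float:
        # policy_logps, ref_logps: (batch,) summed log-probs of sampled responses.
        # Their mean difference is a simple Monte Carlo estimate of the batch KL,
        # clamped at zero since the KL divergence is non-negative.
        batch_kl = (policy_logps - ref_logps).mean().clamp(min=0.0).item()
        self.z_ref = self.momentum * self.z_ref + (1 - self.momentum) * batch_kl
        return self.z_ref
```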
KTO's Advantages
KTO offers several practical benefits:
- Data efficiency: Uses all available binary feedback without discarding unpaired examples
- Natural data format: Matches how most user feedback is actually collected
- Principled asymmetry: The loss-aversion weighting reflects empirical findings about human judgment
- Stable training: The reference point provides a grounding mechanism similar to IPO's target
Odds Ratio Preference Optimization (ORPO)
ORPO takes a more radical departure from the DPO framework by eliminating the reference model entirely. Introduced by Hong et al. (2024), ORPO combines supervised fine-tuning with preference optimization into a single training objective.
Motivation: The Reference Model Burden
All previous methods (RLHF, DPO, IPO, KTO) require maintaining a reference policy $\pi_{\text{ref}}$. This creates practical complications:
- Memory overhead: Two copies of the model must reside in memory (or frequent loading/unloading)
- Two-stage training: Models typically undergo SFT first to create the reference, then preference training
- Computational cost: Every forward pass requires evaluating both policies
ORPO asks: can we eliminate the reference model while still preventing the policy from drifting arbitrarily far from sensible outputs?
The Odds Ratio Approach
Instead of comparing log probabilities to a reference, ORPO uses the odds ratio between chosen and rejected responses within the current policy itself. This shift is significant. Instead of measuring change from a fixed reference, ORPO measures the relative likelihood of responses within the current policy. The odds of generating response $y$ given prompt $x$ are:

$$\text{odds}_\theta(y \mid x) = \frac{\pi_\theta(y \mid x)}{1 - \pi_\theta(y \mid x)}$$

where:
- $\text{odds}_\theta(y \mid x)$: the odds of generating response $y$ under policy $\pi_\theta$
- $\pi_\theta(y \mid x)$: the probability of the response
The odds representation has a natural interpretation: it tells us how likely the response is compared to everything else. If the odds are 2:1, the response is twice as likely as all alternatives combined. The odds formulation is particularly useful because ratios of odds have clean mathematical properties, as we will see shortly.
For a language model, the probability of a specific sequence becomes numerically insignificant as length increases, making direct odds calculation unstable. Even a moderately long sequence has a probability so small that it underflows standard floating-point representations, which would make both the numerator and denominator of the odds calculation problematically small. ORPO instead defines the sequence-level odds as the geometric mean of the token-level odds:

$$\text{odds}_\theta(y \mid x) = \exp\left(\frac{1}{m} \sum_{t=1}^{m} \log \frac{\pi_\theta(y_t \mid x, y_{<t})}{1 - \pi_\theta(y_t \mid x, y_{<t})}\right)$$

where:
- $x$: the input prompt
- $m$: the length of the response in tokens
- $y_t$: the token at step $t$
- $y_{<t}$: the sequence of tokens preceding step $t$ (the context)
- $\pi_\theta(y_t \mid x, y_{<t})$: the probability of the next token given the prompt and previous tokens
- $\exp$: the exponentiation to convert average log-odds back to the odds scale
This formulation works at the token level, where probabilities are large enough to be numerically stable, and then aggregates these token-level odds into a sequence-level measure. The averaging by sequence length ensures that longer sequences aren't automatically penalized, since we're taking a geometric mean rather than a product.
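As a concrete illustration of this aggregation, here is a small PyTorch sketch that computes the sequence-level log-odds from per-token log-probabilities, following the geometric-mean formulation described above; the tensor shapes and the function name are assumptions:

```python
import torch

def sequence_log_odds(token_logps: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Length-averaged token-level log-odds (log of the geometric mean of token odds).

    token_logps: (batch, seq_len) log pi_theta(y_t | x, y_<t) for response tokens
    mask:        (batch, seq_len) 1.0 for response tokens, 0.0 for padding
    """
    # log(p / (1 - p)) = log p - log(1 - p); log(1 - p) is computed stably via log1p(-exp(log p)).
    probs = token_logps.exp().clamp(max=1.0 - 1e-6)
    token_log_odds = token_logps - torch.log1p(-probs)
    return (token_log_odds * mask).sum(dim=-1) / mask.sum(dim=-1)
```

The difference of these sequence-level log-odds for the chosen and rejected responses is exactly the log odds ratio introduced next.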
The ratio of odds between chosen ($y_w$) and rejected ($y_l$) responses becomes:

$$\text{OR}_\theta(y_w, y_l \mid x) = \frac{\text{odds}_\theta(y_w \mid x)}{\text{odds}_\theta(y_l \mid x)}$$

where:
- $\text{OR}_\theta(y_w, y_l \mid x)$: the odds ratio between the chosen and rejected responses
- $y_w, y_l$: the chosen and rejected responses
This odds ratio captures the relative preference of the current policy for the chosen response over the rejected one. An odds ratio greater than 1 means the policy favors the chosen response; we want to train the policy to increase this ratio.
The ORPO Loss Function
ORPO combines two components that work together to provide both language modeling signal and preference signal. This combination is what allows ORPO to eliminate the separate SFT stage and the reference model.
- Supervised fine-tuning loss on the chosen response:

$$\mathcal{L}_{\text{SFT}} = -\mathbb{E}_{(x, y_w) \sim \mathcal{D}}\left[\log \pi_\theta(y_w \mid x)\right]$$

where:
- $\mathcal{L}_{\text{SFT}}$: the supervised fine-tuning loss component
- $\mathbb{E}_{(x, y_w) \sim \mathcal{D}}$: the expectation over prompts and chosen responses
- $\pi_\theta(y_w \mid x)$: the likelihood of the chosen response
This component serves two purposes: it teaches the model to generate fluent, coherent text (the standard language modeling objective), and it provides an anchor that prevents the model from drifting too far from producing sensible outputs. By training on the chosen responses, the model learns what good outputs look like.
- Odds ratio loss that increases the relative odds of chosen over rejected:

$$\mathcal{L}_{\text{OR}} = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\left(\log \frac{\text{odds}_\theta(y_w \mid x)}{\text{odds}_\theta(y_l \mid x)}\right)\right]$$

where:
- $\mathcal{L}_{\text{OR}}$: the odds ratio loss component
- $\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}$: the expectation over the dataset of preference pairs
- $y_w, y_l$: the chosen and rejected responses
- $\log \frac{\text{odds}_\theta(y_w \mid x)}{\text{odds}_\theta(y_l \mid x)}$: the log odds ratio (log of the ratio of odds)
- $\sigma$: the sigmoid function
This component implements the preference learning objective. By maximizing the log-sigmoid of the log odds ratio, we push the model to make the chosen response relatively more likely than the rejected one. The structure is similar to DPO's loss, but operates on odds ratios within a single policy rather than log-probability ratios between two policies.
The combined ORPO objective is:

$$\mathcal{L}_{\text{ORPO}} = \mathcal{L}_{\text{SFT}} + \lambda_{\text{OR}} \, \mathcal{L}_{\text{OR}}$$

where:
- $\mathcal{L}_{\text{ORPO}}$: the total ORPO loss
- $\lambda_{\text{OR}}$: the coefficient weighting the odds ratio loss against the SFT loss
The hyperparameter $\lambda_{\text{OR}}$ controls the trade-off between learning to generate good responses and learning to distinguish good responses from bad ones. Too small a $\lambda_{\text{OR}}$ and the model ignores preferences; too large and the model may sacrifice fluency for preference optimization.
Why Odds Ratios Work
The odds ratio formulation provides implicit regularization without an explicit reference model. To understand why this works, consider what happens during training:
- The SFT loss pulls the model toward generating the chosen response
- The OR loss pushes chosen odds higher relative to rejected odds
- Both losses operate on the same policy, creating a coupled optimization
The key insight is that increasing odds for chosen responses while decreasing odds for rejected responses automatically constrains how much the model can deviate from generating coherent text. If the model tried to maximize the odds ratio by assigning near-zero probability to rejected responses, the SFT loss on chosen responses would suffer because probability mass must be conserved. The model cannot simply declare everything "bad"; it must maintain a coherent probability distribution over all possible responses.
This coupling between the two loss components creates an implicit regularization effect. The SFT loss ensures the model keeps generating reasonable text, while the OR loss ensures it prefers better text to worse text. Neither loss alone would achieve both objectives, but together they provide a balanced training signal.
Comparing Log-Ratio and Odds-Ratio Objectives
Comparing what DPO and ORPO optimize reveals how they prevent policy drift.
DPO optimizes:

$$\beta\left(\log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)$$

where:
- $\pi_\theta, \pi_{\text{ref}}$: the policy and reference model probabilities
- $y_w, y_l$: the chosen and rejected responses
ORPO optimizes:

$$\log \text{odds}_\theta(y_w \mid x) - \log \text{odds}_\theta(y_l \mid x)$$

where:
- $\log \text{odds}_\theta(y \mid x)$: the log-odds of a response under the current policy
- $y_w, y_l$: the chosen and rejected responses
DPO measures how much the policy's preference for $y_w$ over $y_l$ has changed relative to the reference. ORPO measures the absolute odds ratio within the current policy. The reference model in DPO serves as an anchor; ORPO's SFT component and odds formulation provide an alternative form of anchoring. In DPO, the anchor is external: a frozen copy of the model from before preference training. In ORPO, the anchor is internal: the requirement that the model maintain coherent probability distributions while also producing high-likelihood chosen responses.
Conservative DPO (cDPO)
The variants discussed so far address algorithmic limitations. cDPO, introduced by Mitchell et al. (2023), addresses a data quality issue: label noise in preference annotations.
Label Noise in Preference Data
Human preference annotations are inherently noisy. Annotators may:
- Disagree with each other on the same comparison
- Make mistakes due to fatigue or inattention
- Apply inconsistent criteria across examples
- Be influenced by surface features rather than actual quality
Studies of inter-annotator agreement on preference tasks typically show agreement rates of 70-80%, meaning 20-30% of labels may be "wrong", in the sense that a different annotator would have labeled them oppositely.
Standard DPO treats all preference labels as ground truth. If the training data says $y_w \succ y_l$, the model learns to prefer $y_w$. When this label is actually wrong, the model learns an incorrect preference.
Label Smoothing for Preferences
cDPO applies label smoothing to preference labels. The idea behind label smoothing is familiar from classification: instead of training on hard targets (0 or 1), we train on soft targets that acknowledge uncertainty. For preferences, instead of treating each comparison as a certainty, we model the possibility that the annotation might be wrong.
Instead of treating preferences as deterministic ($P(y_w \succ y_l) = 1$), cDPO models uncertainty:

$$P(y_w \succ y_l) = 1 - \varepsilon$$

where:
- $\varepsilon$: the label smoothing parameter, $0 \leq \varepsilon < 0.5$, representing uncertainty about annotation correctness
The parameter $\varepsilon$ represents our belief about how often the annotations are wrong. If $\varepsilon = 0$, we trust all annotations completely, recovering standard DPO. If $\varepsilon = 0.3$, we believe that 30% of the time, the annotators got it backwards and the rejected response was actually better.
The cDPO loss modifies the Bradley-Terry model probability accordingly:

$$\mathcal{L}_{\text{cDPO}} = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[(1 - \varepsilon) \log \sigma\left(\beta h_\theta\right) + \varepsilon \log \sigma\left(-\beta h_\theta\right)\right]$$

where:
- $\mathcal{L}_{\text{cDPO}}$: the cDPO loss function
- $\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}$: the expectation over the dataset
- $h_\theta$: the log-ratio difference between chosen and rejected responses
- $\varepsilon$: the label smoothing parameter
- $\sigma$: the sigmoid function
- $\beta$: the temperature parameter
This loss function has two terms. The first term, weighted by $1 - \varepsilon$, is the standard DPO loss: it rewards the model for preferring the labeled chosen response. The second term, weighted by $\varepsilon$, is the opposite: it rewards the model for preferring the labeled rejected response. The combination represents the expected loss under our belief about annotation accuracy.
Using the identity $\sigma(-z) = 1 - \sigma(z)$, we can rewrite this loss to explicitly show the probability mass assigned to the rejected response:

$$\mathcal{L}_{\text{cDPO}} = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[(1 - \varepsilon) \log \sigma\left(\beta h_\theta\right) + \varepsilon \log \left(1 - \sigma\left(\beta h_\theta\right)\right)\right]$$

where:
- $1 - \sigma(\beta h_\theta)$: the probability assigned to the rejected response (equivalent to $\sigma(-\beta h_\theta)$)
- $\varepsilon$: the weighting factor for the "incorrect" preference direction
Effect on Optimization
Label smoothing fundamentally changes the loss landscape. The gradient of cDPO with respect to $h_\theta$ is:

$$\frac{\partial \mathcal{L}_{\text{cDPO}}}{\partial h_\theta} = -\beta\left[(1 - \varepsilon)\,\sigma(-\beta h_\theta) - \varepsilon\,\sigma(\beta h_\theta)\right]$$

where:
- $\frac{\partial \mathcal{L}_{\text{cDPO}}}{\partial h_\theta}$: the gradient of the loss with respect to the log-ratio difference
This gradient has two competing terms. The first term pushes toward larger $h_\theta$ (stronger preference for chosen). The second term pushes toward smaller $h_\theta$ (acknowledging that the rejected response might actually be better). These competing forces create a balance point where the gradient is zero.
Setting the gradient to zero allows us to solve for the optimal log-ratio difference $h^*$:

$$h^* = \frac{1}{\beta} \log \frac{1 - \varepsilon}{\varepsilon}$$

where:
- $h^*$: the optimal value of the log-ratio difference
- $\varepsilon$: the label smoothing parameter
This means cDPO has a finite optimal gap (similar to IPO), which depends on $\varepsilon$. With standard DPO ($\varepsilon = 0$), the optimal gap is infinite. With cDPO ($\varepsilon > 0$), the model converges to a bounded preference strength. The relationship between $\varepsilon$ and the optimal gap is intuitive: higher annotation uncertainty (larger $\varepsilon$) leads to a smaller optimal gap, because the model shouldn't be confident when the labels might be wrong.
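As a quick sanity check on this formula, the sketch below (plain Python; β = 0.1 is only an example value) evaluates the optimal gap for a few noise levels:

```python
import math

beta = 0.1

def optimal_gap(eps: float) -> float:
    """Optimal log-ratio difference h* = (1/beta) * log((1 - eps) / eps) under cDPO."""
    return (1 / beta) * math.log((1 - eps) / eps)

for eps in [0.1, 0.2, 0.3]:
    # Larger label noise -> smaller optimal gap -> less confident learned preferences.
    print(f"eps = {eps:.1f}  ->  h* ≈ {optimal_gap(eps):.1f}")
```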
Choosing the Smoothing Parameter
The smoothing parameter should reflect actual annotation noise. Practical guidelines include:
- ε = 0.1: Assumes 90% annotation accuracy, appropriate for high-quality curated datasets
- ε = 0.2: Assumes 80% accuracy, appropriate for crowd-sourced annotations
- ε = 0.3: Assumes 70% accuracy, appropriate for noisy or automated annotations
When inter-annotator agreement data is available, can be estimated directly as the disagreement rate.
Implementation Comparison
Let's implement all four variants to see the differences in practice.
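A minimal PyTorch sketch of the three paired-preference losses, built directly from the formulas above. It assumes summed per-response log-probabilities are already available for both the policy and the reference model; the function names are illustrative:

```python
import torch
import torch.nn.functional as F

def log_ratio_diff(policy_chosen_logps, policy_rejected_logps,
                   ref_chosen_logps, ref_rejected_logps):
    """h_theta: difference of log-probability ratios between chosen and rejected responses."""
    return (policy_chosen_logps - ref_chosen_logps) - (policy_rejected_logps - ref_rejected_logps)

def dpo_loss(h, beta=0.1):
    """Standard DPO: -log sigmoid(beta * h)."""
    return -F.logsigmoid(beta * h).mean()

def ipo_loss(h, beta=0.1):
    """IPO: squared distance of h from the target margin 1/(2*beta)."""
    return ((h - 1.0 / (2 * beta)) ** 2).mean()

def cdpo_loss(h, beta=0.1, eps=0.1):
    """Conservative DPO: label-smoothed mixture of the two preference directions."""
    return -((1 - eps) * F.logsigmoid(beta * h) + eps * F.logsigmoid(-beta * h)).mean()
```

Each function takes the batch of log-ratio differences produced by log_ratio_diff and returns a scalar loss.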
Now let's implement KTO and ORPO, which have different data requirements:
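Again as a hedged sketch rather than a reference implementation: KTO consumes individually labeled responses plus a running KL estimate, while ORPO consumes paired responses but no reference model. Tensor shapes, default values, and names are assumptions:

```python
import torch
import torch.nn.functional as F

def kto_loss(policy_logps, ref_logps, is_desirable, z_ref,
             beta=0.1, lambda_d=1.0, lambda_u=1.5):
    """KTO on unpaired examples: push desirable responses above the reference point z_ref
    and undesirable ones below it, with loss aversion via lambda_u > lambda_d."""
    implicit_reward = policy_logps - ref_logps          # r_theta(x, y), shape (batch,)
    desirable = is_desirable.float()                    # 1.0 for thumbs-up, 0.0 for thumbs-down
    sign = 2.0 * desirable - 1.0                        # +1 for desirable, -1 flips the comparison
    value = torch.sigmoid(sign * beta * (implicit_reward - z_ref))
    weight = lambda_d * desirable + lambda_u * (1.0 - desirable)
    return (weight * (1.0 - value)).mean()

def orpo_loss(policy_chosen_logps, chosen_log_odds, rejected_log_odds, lambda_or=0.1):
    """ORPO: SFT loss on the chosen response plus a log-sigmoid odds-ratio term; no reference model.
    chosen_log_odds / rejected_log_odds can come from sequence_log_odds() defined earlier."""
    sft = -policy_chosen_logps.mean()
    odds_ratio_term = -F.logsigmoid(chosen_log_odds - rejected_log_odds).mean()
    return sft + lambda_or * odds_ratio_term
```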
Let's visualize how these losses behave differently:
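One way to compare the loss shapes is to plot each loss as a function of the log-ratio difference h, treating h as a free scalar rather than running a model. A small matplotlib sketch, with β = 0.1 and ε = 0.2 chosen purely for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

beta, eps = 0.1, 0.2
h = np.linspace(-5, 15, 500)                      # range of log-ratio differences
sigmoid = lambda z: 1 / (1 + np.exp(-z))

dpo = -np.log(sigmoid(beta * h))                  # decreases monotonically, never reaches a minimum
ipo = (h - 1 / (2 * beta)) ** 2                   # parabola with minimum at the target 1/(2*beta) = 5
cdpo = -((1 - eps) * np.log(sigmoid(beta * h)) + eps * np.log(sigmoid(-beta * h)))

plt.plot(h, dpo, label="DPO")
plt.plot(h, ipo, label="IPO")
plt.plot(h, cdpo, label="cDPO (eps=0.2)")
plt.xlabel("log-ratio difference h")
plt.ylabel("per-example loss")
plt.legend()
plt.show()
```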
Now let's examine the gradients:
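The corresponding gradient magnitudes with respect to h can be compared the same way, using the closed-form expressions derived earlier (same illustrative β and ε):

```python
import numpy as np
import matplotlib.pyplot as plt

beta, eps = 0.1, 0.2
h = np.linspace(-5, 15, 500)
sigmoid = lambda z: 1 / (1 + np.exp(-z))

grad_dpo = beta * sigmoid(-beta * h)                                   # vanishes as h grows
grad_ipo = np.abs(2 * (h - 1 / (2 * beta)))                            # zero exactly at the target
grad_cdpo = np.abs(beta * ((1 - eps) * sigmoid(-beta * h) - eps * sigmoid(beta * h)))

plt.plot(h, grad_dpo, label="|dL/dh| DPO")
plt.plot(h, grad_ipo, label="|dL/dh| IPO")
plt.plot(h, grad_cdpo, label="|dL/dh| cDPO")
plt.xlabel("log-ratio difference h")
plt.ylabel("gradient magnitude")
plt.legend()
plt.show()
```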
Comparing Alignment Methods
With multiple alignment approaches now available, selecting the right method depends on your specific constraints: data format, computational budget, annotation quality, and desired training dynamics.
Decision Framework
The choice between alignment methods depends on several factors:
Data format availability:
- Paired preferences (chosen vs rejected for same prompt): Use DPO, IPO, cDPO, or ORPO
- Unpaired binary feedback (thumbs up/down on individual responses): Use KTO
- Ranked lists with multiple responses per prompt: DPO can be adapted with pairwise comparisons
Reference model constraints:
- Can maintain frozen reference in memory: DPO, IPO, cDPO, KTO
- Memory-constrained single-model training: ORPO
Annotation quality:
- High-quality expert annotations (>90% consistency): Standard DPO
- Moderate-quality annotations (70-90%): cDPO with appropriate ε
- Noisy or automated labels (<70%): cDPO with higher ε, or KTO
Training stability requirements:
- Need bounded optimization: IPO or cDPO
- Standard training sufficient: DPO
Empirical Performance Comparison
Published results show that no single method dominates across all benchmarks. General patterns from the literature include:
DPO remains a strong baseline. Its simplicity and well-understood behavior make it the default choice for many applications. Performance degradation typically occurs only with very long training or highly noisy data.
IPO shows advantages when training for many epochs or when the preference data has high confidence (strong agreement between responses). The bounded optimization prevents the overconfident predictions that can emerge with extended DPO training.
KTO achieves comparable performance to DPO when preference data is artificially unpaired from originally paired data. When working with naturally unpaired data (binary ratings), KTO significantly outperforms naive approaches like converting to paired format.
ORPO reduces computational overhead by 20-30% by eliminating the reference model. Quality results are competitive with DPO on standard benchmarks, though some studies report slightly lower performance on challenging alignment tasks.
cDPO provides consistent improvements over standard DPO when annotation noise is known to be present. The gains are proportional to the actual noise level; cDPO shows little difference from DPO on clean data.
Computational Costs
The computational characteristics differ significantly:
| Method | Memory | Forward Passes | Training Stages |
|---|---|---|---|
| DPO | 2× model | 2× per step | SFT → DPO |
| IPO | 2× model | 2× per step | SFT → IPO |
| cDPO | 2× model | 2× per step | SFT → cDPO |
| KTO | 2× model | 2× per step | SFT → KTO |
| ORPO | 1× model | 1× per step | Single stage |
ORPO's single-stage training can reduce wall-clock time by 40-50% compared to two-stage approaches, making it attractive for rapid iteration.
Training Dynamics
Let's simulate training dynamics for each method to see how they converge differently:
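Since we only need the scalar dynamics of h, a toy gradient-descent simulation on h itself (rather than on model weights) is enough to illustrate the different convergence behaviors. Everything below, including the learning rate and step count, is illustrative:

```python
import numpy as np
import matplotlib.pyplot as plt

beta, eps, lr, steps = 0.1, 0.2, 0.5, 400
sigmoid = lambda z: 1 / (1 + np.exp(-z))

def grad(method, h):
    if method == "DPO":
        return -beta * sigmoid(-beta * h)                              # always pushes h larger
    if method == "IPO":
        return 2 * (h - 1 / (2 * beta))                                # pulls h toward the target
    if method == "cDPO":
        return -beta * ((1 - eps) * sigmoid(-beta * h) - eps * sigmoid(beta * h))

for method in ["DPO", "IPO", "cDPO"]:
    h, trajectory = 0.0, [0.0]
    for _ in range(steps):
        h -= lr * grad(method, h)                                      # plain gradient descent on h
        trajectory.append(h)
    plt.plot(trajectory, label=method)

plt.xlabel("training step")
plt.ylabel("log-ratio difference h")
plt.legend()
plt.show()
```

DPO's curve keeps climbing, IPO snaps to its target of 5, and cDPO levels off near its finite optimum, mirroring the analysis above.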
Practical Guidelines
Based on the analysis above, here are practical recommendations for choosing and implementing DPO variants.
Method Selection Checklist
When selecting an alignment method, consider:
- What data do you have?
  - Paired preferences → DPO, IPO, cDPO, or ORPO
  - Unpaired binary feedback → KTO
- How much memory can you allocate?
  - Can fit two models → Any method
  - Limited to one model → ORPO
- How clean are your labels?
  - High quality (>90% agreement) → DPO or IPO
  - Moderate quality (70-90%) → cDPO with ε ≈ 0.1-0.2
  - Low quality (<70%) → cDPO with ε ≈ 0.2-0.3
- How long will you train?
  - Short training (1-3 epochs) → DPO
  - Extended training → IPO or cDPO
Hyperparameter Recommendations
Each method has specific hyperparameter considerations:
DPO: The β parameter typically works well in the range 0.1-0.5. Lower values allow more deviation from the reference; higher values keep the policy closer to the reference. Start with β=0.1 and adjust based on whether outputs are too conservative (increase β) or too different from base model (decrease β).
IPO: Uses the same β parameter, but interpretation differs slightly. The target margin 1/2β means smaller β creates larger targets. Start with β=0.1 (target=5) and adjust if convergence is too slow (increase β) or produces weak preferences (decrease β).
cDPO: Choose ε based on estimated annotation noise. If unknown, start with ε=0.1 as a conservative default. The effective training signal strength is (1-2ε), so ε=0.3 reduces effective signal by 60%.
KTO: The loss aversion ratio λ_u/λ_d is typically set to 1.0-2.0. Higher ratios emphasize avoiding bad outputs over producing good ones. The KL reference running average uses momentum 0.9-0.99.
ORPO: The λ_or weight balances SFT and preference objectives. Values of 0.1-0.5 are typical. Too low ignores preferences; too high destabilizes SFT.
Common Pitfalls
Several issues frequently arise when implementing DPO variants:
Length bias: All preference methods can develop biases toward response length if chosen responses are systematically longer or shorter. Monitor average response lengths during training and consider length normalization.
Mode collapse: Aggressive preference optimization can cause the model to produce repetitive outputs. The KL penalty (explicit in RLHF, implicit in DPO) helps, but monitoring diversity metrics is still important.
Reward hacking: As we discussed in Part XXVII, Chapter 5, models can learn to exploit artifacts in preference data. Regular evaluation on held-out prompts helps detect this.
Training instability: Large learning rates combined with small β can cause instability in DPO. Start with learning rates an order of magnitude smaller than SFT and increase gradually.
Limitations and Impact
The proliferation of DPO variants reflects both the importance of alignment and the difficulty of getting it right. Each method addresses real limitations, but none fully solves the alignment problem.
A fundamental challenge persists across all variants: they optimize for human preferences as measured during data collection, not for what humans actually want in deployment. Preferences are noisy, context-dependent, and can be manipulated. A model that perfectly optimizes collected preferences may still exhibit concerning behaviors on novel inputs. This gap between measured preferences and true preferences motivates ongoing research into more robust alignment approaches.
The computational simplification that DPO variants provide has democratized alignment research. Small teams and academic labs can now experiment with preference learning without RLHF's infrastructure requirements. This has accelerated progress but also raised concerns about alignment becoming a checkbox rather than a careful consideration.
Looking ahead, the next chapter on RLAIF explores using AI systems to generate preference data, potentially addressing data scalability while introducing new questions about circular dependencies in AI feedback. The field continues evolving, with researchers exploring combinations of methods (e.g., using KTO for unpaired data augmentation before DPO) and developing new theoretical frameworks for understanding when each approach succeeds or fails.
Summary
This chapter examined four influential DPO variants, each addressing specific limitations of the original algorithm:
IPO reformulates preference learning as regression with a target margin, preventing the unbounded preference growth that can occur with standard DPO. The squared loss creates self-correcting gradients that converge to a stable equilibrium.
KTO enables learning from unpaired binary feedback by incorporating insights from prospect theory. The asymmetric value function and adaptive reference point allow training on thumbs-up/thumbs-down data without requiring direct comparisons.
ORPO eliminates the reference model entirely by combining SFT with odds-ratio preference optimization. This reduces memory requirements and training stages while maintaining competitive performance.
cDPO handles annotation noise through label smoothing, modeling preferences probabilistically rather than deterministically. The smoothing parameter should reflect actual annotation uncertainty.
Method selection depends on data format (paired vs unpaired), computational constraints (reference model overhead), annotation quality (noise levels), and training duration (bounded vs unbounded optimization). No single method dominates across all scenarios, making informed selection essential for practical alignment work.