Iterative Alignment: Online DPO & Self-Improvement Methods

Michael Brenndoerfer · January 5, 2026 · 46 min read

Master iterative alignment for LLMs with online DPO, rolling references, Constitutional AI, and SPIN. Build self-improving models beyond single-shot training.

Iterative Alignment

The alignment methods we've explored so far, such as RLHF, DPO, and their variants, typically treat alignment as a one-shot process. You collect preferences, train once, and deploy the model. But alignment is an ongoing challenge, not a single event. Models that seem well-aligned can drift as they encounter new situations, and static preference datasets quickly become outdated as the model's behavior evolves. This chapter explores iterative approaches to alignment, where models undergo multiple rounds of training with fresh preference data, sometimes generating that data themselves.

Iterative alignment is necessary because training a model on preference data changes its distribution. Its outputs no longer match the distribution that generated the original training examples. This creates a distribution mismatch that can degrade alignment quality over time. By iteratively collecting new preferences on the model's current outputs and retraining, we can maintain tighter alignment and even enable models to improve beyond what any single training run could achieve.

The Case for Iteration

Consider what happens in standard DPO training. You start with a reference model $\pi_{\text{ref}}$, collect human preferences comparing pairs of responses, and train the model to prefer the winning responses. After training, you have a new model $\pi_\theta$ that behaves differently from $\pi_{\text{ref}}$. But your training data was generated from $\pi_{\text{ref}}$, so it reflects the kinds of responses that model produced and the relative quality judgments between them.

To understand why this matters, imagine teaching someone to write by showing them examples of mediocre essays and asking them to identify which is slightly better. They might learn to avoid the worst mistakes, but they never see what excellent writing actually looks like. The training signal is fundamentally limited by the quality of the examples being compared. When your model improves beyond the capability level present in the training data, the preference comparisons become less informative because they no longer represent the kinds of decisions the improved model needs to make.

This creates several problems:

  • Distribution mismatch: The preference comparisons were between responses from $\pi_{\text{ref}}$, but the model now generates different responses. The training signal becomes less relevant.
  • Ceiling effects: If both responses in a preference pair are poor, learning to prefer the "less bad" option doesn't teach the model what a truly good response looks like.
  • Static feedback: Human values and expectations evolve. A dataset collected once becomes stale.
Out[2]:
Visualization
Comparison of response quality distributions for reference and trained policies. The trained policy (coral) shifts toward higher quality regions, decreasing the overlap (purple) with the original reference distribution (steelblue) and rendering the initial training signal less relevant.

Distribution mismatch is the primary reason why iteration matters. When you train a model to prefer response A over response B, you are implicitly teaching it something about the boundary between good and bad responses in that region of output space. But after training, the model no longer generates responses in that same region. It has moved to a different part of the output distribution, where the old preference comparisons may not provide useful guidance. The model might now be generating responses that are uniformly better than anything in the original training set, or it might be making new types of errors that the original comparisons never addressed.

Iterative alignment addresses these issues by making alignment a continuous process. After each training round, we generate new responses from the updated model, collect fresh preferences, and train again. This keeps the training distribution matched to the current policy and allows the model to learn from increasingly sophisticated comparisons. Each iteration provides feedback on the model's actual current behavior, creating a feedback loop that can continuously refine alignment quality.

Iterative DPO

As we discussed in the DPO chapters, Direct Preference Optimization implicitly defines a reward function through the log-probability ratio between the policy and reference model. This mathematical relationship makes the reference model central to what DPO learns. The reference model serves as an anchor point that defines the baseline against which improvements are measured. When we iterate DPO, we have a choice: keep the original reference model fixed, or update it at each iteration. This choice significantly affects training dynamics and the final model's behavior.

Fixed Reference Iteration

The simplest form of iterative DPO keeps the original reference model $\pi_{\text{ref}}$ fixed across all iterations. At each round $t$, we generate responses from the current policy $\pi_t$, collect preferences, and train to get $\pi_{t+1}$. This approach maintains a consistent anchor point throughout the entire iterative process, always measuring progress relative to where the model started.

The loss at iteration $t$ becomes:

$$\mathcal{L}^{(t)} = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}^{(t)}} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)} \right) \right]$$

Unpacking the formula components clarifies how they guide iterative training:

  • $\mathcal{L}^{(t)}$: the loss function minimized at iteration $t$
  • $\mathbb{E}$: the expectation operator averaging over all preference pairs in the dataset
  • $\mathcal{D}^{(t)}$: the dataset of preferences collected from responses generated by the current policy $\pi_t$
  • $\pi_\theta$: the policy model currently being trained (parameterized by $\theta$)
  • $\pi_{\text{ref}}$: the fixed reference model (usually the initial policy)
  • $x$: the input prompt
  • $y_w, y_l$: the winning and losing responses, respectively, for prompt $x$
  • $\sigma$: the logistic sigmoid function
  • $\beta$: the temperature parameter controlling the strength of the KL penalty
  • $\beta \log \frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)}$: the implicit reward term (higher values indicate the policy is more likely to generate $y$ than the reference)

The key change from standard DPO is that $\mathcal{D}^{(t)}$ contains preferences over responses generated by $\pi_t$, not the original reference model. This subtle but crucial difference ensures the model learns from comparisons that reflect its current capabilities rather than from stale comparisons between outputs it no longer produces. The model receives feedback on the kinds of responses it actually generates now, making the training signal directly relevant to improving current behavior.

However, keeping a fixed reference creates a growing gap between the reference and policy distributions. As the policy improves over iterations, it diverges further from $\pi_{\text{ref}}$. The implicit reward function $r(x, y) = \beta \log \frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)}$ now compares very different distributions. When the policy assigns high probability to a response that the reference considers extremely unlikely, the log ratio becomes very large. Conversely, when the policy considers a response unlikely but the reference assigns moderate probability, the log ratio becomes very negative. These extreme values can destabilize training by creating large gradients that push the model erratically.
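
To make the scale of the problem concrete, here is a small numeric sketch of how the implicit reward grows as the policy drifts away from a fixed reference. The per-sequence probabilities are made up for illustration, not taken from a trained model.

import math

beta = 0.1

# Hypothetical sequence-level probabilities for one response as the policy
# drifts away from a fixed reference (illustrative numbers only).
scenarios = [
    ("early training", 1e-6, 8e-7),  # policy still close to the reference
    ("mid training", 1e-4, 8e-7),    # policy now strongly favors this response
    ("late training", 1e-2, 8e-7),   # large divergence from the fixed reference
]

for name, p_policy, p_ref in scenarios:
    implicit_reward = beta * math.log(p_policy / p_ref)
    print(f"{name:15s} implicit reward = {implicit_reward:+.3f}")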

Rolling Reference Iteration

A more common approach updates the reference model at each iteration. After training round $t$, we set $\pi_{\text{ref}}^{(t+1)} = \pi_t$ before starting round $t+1$. This keeps the reference close to the current policy, maintaining stable gradients and ensuring that the log probability ratios remain in a reasonable range throughout training.

The rolling reference focuses on incremental improvements instead of measuring progress against the initial model. By using the previous iteration's policy as the reference, we are essentially asking: "Given where we are now, which responses represent improvements?" This framing naturally prevents the extreme probability ratios that arise when comparing against a very different reference distribution.

The iterative procedure becomes:

  1. Initialize $\pi_0$ with a supervised fine-tuned model
  2. Set $\pi_{\text{ref}}^{(1)} = \pi_0$
  3. For each iteration $t = 1, 2, \ldots, T$:
    • Generate responses from $\pi_{t-1}$ on a set of prompts
    • Collect preferences over response pairs
    • Train $\pi_t$ using DPO with reference $\pi_{\text{ref}}^{(t)}$
    • Update $\pi_{\text{ref}}^{(t+1)} = \pi_t$

This rolling reference approach is sometimes called online DPO because the training data comes from the current policy rather than a static offline dataset. The term "online" here refers to the fact that data generation is interleaved with training, creating a dynamic feedback loop between model behavior and the training signal.

Rolling reference training starts from a neutral position. The policy and reference are identical at the beginning of each round, so the log probability ratios start at zero for all responses. This means the training signal comes entirely from the preference comparisons in the current iteration's dataset, without any accumulated bias from previous iterations. The model learns to improve relative to its immediate predecessor, and the cumulative effect of many such improvements can be substantial.
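
In outline, a rolling-reference (online DPO) loop looks roughly like the sketch below. The `generate_pairs`, `collect_preferences`, and `dpo_update` callables are placeholders for the components implemented in full later in this chapter, not a specific library API.

import copy

def online_dpo(policy, prompts, num_iterations,
               generate_pairs, collect_preferences, dpo_update):
    """Minimal sketch of rolling-reference iterative DPO.
    The three callables are hypothetical stand-ins for response generation,
    preference collection, and a DPO optimization step."""
    reference = copy.deepcopy(policy)                  # start with reference = policy
    for t in range(num_iterations):
        pairs = generate_pairs(policy, prompts)        # sample from the *current* policy
        prefs = collect_preferences(pairs)             # human, reward model, or LLM judge
        policy = dpo_update(policy, reference, prefs)  # standard DPO loss against reference
        reference = copy.deepcopy(policy)              # roll the reference forward
    return policy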

Out[3]:
Visualization
Comparison of fixed and rolling reference strategies in iterative DPO. A fixed reference maintains a consistent anchor but results in growing log-probability ratios as the policy improves, while a rolling reference updates at each iteration to keep ratios bounded but allows for cumulative drift from the starting distribution.

IPO for Iterative Training

Recall from our discussion of DPO variants that Identity Preference Optimization (IPO) avoids the length exploitation issues that can plague DPO over multiple iterations. The problem with standard DPO becomes particularly acute in iterative settings: if the model learns that slightly longer responses tend to win preference comparisons, it will generate longer responses, which will then be compared against each other, and the longest will again tend to win. This feedback loop can drive response length to extreme values over many iterations.

IPO addresses this concern through a fundamentally different objective formulation. Rather than using the sigmoid function to bound the preference probability, IPO directly targets a specific margin between the preferred and dispreferred log probability ratios. The IPO objective is:

$$\mathcal{L}_{\text{IPO}} = \mathbb{E}_{(x, y_w, y_l)} \left[ \left( \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)} - \frac{1}{2\beta} \right)^2 \right]$$

The components of the IPO formula prevent the runaway behavior seen in DPO:

  • $\mathcal{L}_{\text{IPO}}$: the Identity Preference Optimization loss
  • $\mathbb{E}$: the expectation operator averaging over the dataset
  • $\pi_\theta$: the policy model being trained
  • $\pi_{\text{ref}}$: the reference model
  • $x$: the input prompt
  • $y_w, y_l$: the preferred and dispreferred responses for prompt $x$
  • $\beta$: the hyperparameter derived from the KL regularization strength

This objective minimizes the squared error between the log-likelihood ratio gap and a finite target margin of $1/(2\beta)$. The key insight is that IPO does not simply try to make the preferred response more likely than the dispreferred one; it tries to make the gap between them equal to a specific target value. Once the model achieves this target margin, further increasing the gap actually increases the loss. This built-in saturation prevents the model from pushing probability ratios to extreme values.

Out[4]:
Visualization
Loss landscapes for DPO and IPO objectives. The DPO loss (left) continues to decrease as the margin between winning and losing responses grows, which can incentivize length exploitation, whereas the IPO loss (right) reaches a minimum at the target margin to provide natural regularization.

Unlike DPO, which tries to maximize the gap (bounded only by the asymptotic behavior of the sigmoid), IPO targets this specific margin, preventing the model from pushing probability ratios to extreme values. The squared error formulation means that overshooting the target margin is penalized just as much as undershooting it. This symmetric penalty structure provides natural regularization that becomes increasingly important over multiple iterations where small biases otherwise compound into significant distortions.
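
For comparison with the compute_dpo_loss helper implemented later in this chapter, a minimal sketch of the IPO objective on sequence-level log probabilities (tensor inputs; names are illustrative, not from a specific library) might look like this:

def compute_ipo_loss(policy_chosen_logp, policy_rejected_logp,
                     ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Sketch of the IPO objective on sequence-level log probabilities.
    Inputs are 1-D tensors of summed log probs, following the conventions
    of the DPO helper defined in the implementation section."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp        # log pi/pi_ref for y_w
    rejected_ratio = policy_rejected_logp - ref_rejected_logp  # log pi/pi_ref for y_l
    margin = chosen_ratio - rejected_ratio                     # gap between the two ratios
    # Squared distance to the finite target margin 1/(2*beta): overshooting is
    # penalized just like undershooting, which caps the incentive to grow the gap.
    return ((margin - 1.0 / (2.0 * beta)) ** 2).mean()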

Online Preference Learning

The distinction between online and offline preference learning parallels the classic reinforcement learning distinction between on-policy and off-policy methods. This parallel is not merely superficial; the same fundamental trade-offs apply in both contexts, and understanding these trade-offs helps clarify when each approach is appropriate.

Online vs Offline Preference Learning

Offline methods use a fixed dataset of preferences collected once, typically from a different model or earlier version. Online methods generate fresh responses from the current policy and collect preferences on those responses during training.

In offline reinforcement learning, an agent learns from a fixed dataset of experiences collected by a different policy, perhaps a human expert or an earlier version of the agent. The challenge is that the agent cannot explore: it must learn entirely from the experiences provided. In offline preference learning, the situation is analogous. The model learns from preference comparisons between responses that were generated by some other model, and it cannot generate new responses to get feedback on during training.

Advantages of Online Learning

Online preference learning offers several benefits over offline approaches:

  • Distribution matching: Training data always reflects the current policy's behavior
  • Progressive difficulty: As the model improves, comparisons become more nuanced
  • Adaptive exploration: The model can explore regions of response space that earlier versions couldn't reach
  • Reduced overfitting: Fresh data prevents memorizing fixed preference patterns

Distribution matching directly addresses the motivation for iterative alignment. When training data comes from the current policy, every preference comparison provides information about choices the model is actually making. There is no disconnect between what the model generates and what the training signal evaluates. This tight coupling ensures that the training signal remains relevant throughout the optimization process.

Progressive difficulty emerges naturally from online learning. In early iterations, the model might generate responses with obvious quality differences, making preference judgments easy. As the model improves, the differences between its responses become more subtle, requiring more careful evaluation. This natural curriculum exposes the model to increasingly sophisticated distinctions, pushing it toward finer-grained improvements that offline data might not capture.

The major drawback is cost. Online learning requires generating new responses and obtaining preferences continuously, which can be expensive when using human annotators. Each iteration requires not just training time but also the time and resources needed to generate candidate responses and evaluate them. For organizations with limited annotation budgets, this ongoing cost can be prohibitive.

On-Policy Sampling Strategies

When generating responses for online preference learning, several sampling strategies exist, each with different trade-offs between sample efficiency, coverage, and computational cost:

Best-of-N sampling: Generate $N$ responses per prompt, use a reward model to select the best and worst, then train on this pair. This is sample-efficient but requires a reward model. The approach maximizes the information content of each preference pair by ensuring a substantial quality gap between the compared responses. However, it relies on the reward model's accuracy, and any systematic biases in the reward model will be amplified through this selection process.

Diverse sampling: Use different temperatures or sampling strategies to generate varied responses, ensuring the preference pairs cover diverse regions of response space. Higher temperatures produce more varied outputs, potentially exposing the model to a broader range of quality levels. This approach reduces the risk of overfitting to a narrow slice of the output distribution but may produce many comparisons between clearly bad responses, which provide limited learning signal.

Targeted sampling: Focus on prompts where the model's current responses are uncertain or low-quality, similar to active learning approaches. By directing annotation effort toward the most informative examples, this strategy can achieve better sample efficiency than random sampling. The challenge lies in identifying which prompts will yield the most useful preference comparisons without actually collecting those comparisons first.
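
Returning to the first strategy, a best-of-N pair selection step can be sketched as follows. Here `reward_model_score` is a placeholder callable mapping a sequence to a scalar, and `policy.generate` follows the SimpleLM interface used later in this chapter.

def best_of_n_pair(policy, reward_model_score, prompt, n=8, max_length=15):
    """Sketch of best-of-N pair construction for online preference data.
    `reward_model_score` is a hypothetical scoring callable; the highest- and
    lowest-scoring candidates form a (chosen, rejected) pair with a large gap."""
    candidates = [
        policy.generate(prompt.unsqueeze(0), max_length=max_length).squeeze(0)
        for _ in range(n)
    ]
    scores = [reward_model_score(c) for c in candidates]
    best = candidates[max(range(n), key=lambda i: scores[i])]   # highest-scoring response
    worst = candidates[min(range(n), key=lambda i: scores[i])]  # lowest-scoring response
    return best, worst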

Converting RLHF to Online DPO

The RLHF pipeline we covered earlier is inherently online: PPO samples from the current policy and updates based on reward model scores. We can convert this to online DPO by:

  1. Generating pairs of responses from the current policy
  2. Using the reward model to determine preferences (instead of providing scalar rewards)
  3. Training with DPO instead of PPO

This hybrid approach, sometimes called reward model distillation or RLHaF (Reinforcement Learning from Human-AI Feedback), combines the stability of DPO with the online sampling of RLHF. The key insight is that a reward model can be used to generate preference labels rather than to provide direct reward signals. By converting scalar rewards into binary preference comparisons, we can use the simpler DPO training procedure while retaining the benefits of online data generation.

This conversion is practical because DPO is more stable and easier to tune than PPO. By using the reward model only to generate preference labels rather than to guide policy gradient updates, we sidestep many of the challenges that make RLHF difficult in practice.
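
Under the same assumption of a scoring callable standing in for the reward model, the conversion from scalar rewards to DPO-style preference labels can be sketched as:

def rewards_to_preferences(prompt_response_pairs, reward_model_score):
    """Sketch: turn scalar reward scores into (chosen, rejected) pairs for DPO.
    `prompt_response_pairs` is a list of (prompt, response_a, response_b) tuples;
    `reward_model_score(prompt, response)` is a hypothetical scoring callable."""
    preference_data = []
    for prompt, resp_a, resp_b in prompt_response_pairs:
        score_a = reward_model_score(prompt, resp_a)
        score_b = reward_model_score(prompt, resp_b)
        chosen, rejected = (resp_a, resp_b) if score_a >= score_b else (resp_b, resp_a)
        preference_data.append((prompt, chosen, rejected))  # train with DPO on these triples
    return preference_data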

Self-Improvement Loops

Iterative alignment enables self-improvement, where models generate their own training data to improve without human annotation. This sounds almost paradoxical: how can a model teach itself to be better than it already is? The answer lies in understanding the asymmetry between different cognitive tasks.

The Self-Improvement Paradox

Self-improvement relies on the fact that evaluating responses is easier than generating them. A model might struggle to write a perfect response but can often tell which of two responses is better. By generating many candidates, evaluating them, and training on the best, the model can improve beyond its initial capabilities.

This asymmetry is fundamental to many human learning processes as well. Consider a chess player analyzing their games: they might not be able to find the best move during play, but with time to reflect, they can often identify which of their moves were mistakes. The evaluation task, with its reduced time pressure and ability to compare alternatives, is genuinely easier than the generation task of finding the best move under tournament conditions.

This is analogous to how humans learn. A student might not be able to write a great essay from scratch but can recognize quality when comparing examples. Through exposure to what makes one response better than another, they gradually internalize those standards. The model undergoes a similar process: by repeatedly seeing which of its outputs are preferred, it learns to generate responses with those preferred characteristics more reliably.

The self-improvement loop works because the model effectively bootstraps from its ability to recognize quality to its ability to generate quality. Each iteration tightens this connection. The model generates candidates that reflect its current generation capabilities, evaluates them using its (potentially superior) evaluation capabilities, and trains on the resulting preferences to improve its generation capabilities. The cycle then repeats with a now-improved generator.

Constitutional AI

Anthropic's Constitutional AI (CAI) pioneered practical self-improvement for alignment. The approach works in two phases:

Critique and revision phase: The model generates a response, then critiques it according to a set of principles (the "constitution"), then revises the response to address the critique. This creates pairs of (original, revised) responses.

Preference learning phase: The revised responses are treated as preferred over the originals, creating synthetic preference data for DPO or RLHF training.

The constitution might include principles like:

  • "Please choose the response that is most helpful while being harmless and honest."
  • "Please choose the response that is least likely to cause harm if misused."
  • "Please choose the response that best follows your instructions while refusing harmful requests."

Following explicit principles is easier than inferring human preferences. When a model must infer human preferences from examples, it faces an ambiguous learning problem: many different underlying value functions could explain the observed preferences. But when principles are stated explicitly, the model has clear criteria to apply. By making the criteria explicit, the model can self-critique more reliably, identifying specific ways in which a response violates stated principles and revising accordingly.

Constitutional AI also addresses the scalability challenge of human annotation. Collecting human preferences is expensive and slow. By using explicit principles that the model can apply to its own outputs, CAI dramatically reduces the need for human involvement in each training iteration. Humans still play a crucial role in defining the constitutional principles, but the actual preference generation scales with compute rather than with human annotator time.
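
A minimal sketch of the critique-and-revision phase, assuming a `generate(prompt)` text-in/text-out callable; the prompt templates are illustrative, not Anthropic's exact wording:

def constitutional_pair(generate, user_prompt, principle):
    """Sketch of Constitutional AI data generation for one prompt and one principle.
    `generate` is a hypothetical text-generation callable; returns a synthetic
    (chosen, rejected) pair where the revision is treated as preferred."""
    original = generate(user_prompt)
    critique = generate(
        f"Response: {original}\n\n"
        f"Critique this response according to the principle: {principle}"
    )
    revision = generate(
        f"Response: {original}\nCritique: {critique}\n\n"
        "Rewrite the response to address the critique."
    )
    return revision, original  # (chosen, rejected) for DPO or reward model training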

Self-Play Fine-Tuning (SPIN)

Self-Play Fine-Tuning takes a different approach to self-improvement. Instead of critique and revision, SPIN uses the model's own outputs as negative examples:

  1. Take a prompt $x$ and ground-truth response $y^*$ (from supervised data)
  2. Generate a response $y'$ from the current model
  3. Create a preference pair: $(y^*, y')$ where $y^*$ is preferred
  4. Train with DPO to prefer $y^*$ over $y'$

The ground-truth response $y^*$ should be at least as good as current model generations. By training to prefer $y^*$, the model moves toward the data distribution while exceeding its current capabilities. This approach requires no external judge or reward model; the ground-truth data itself provides the quality signal.

As training progresses, the model's outputs improve, so the contrast between $y^*$ and $y'$ decreases. This naturally creates a curriculum: early iterations provide strong gradients (when model outputs are poor), while later iterations fine-tune on subtle differences. The learning signal adapts automatically to the model's current capability level without any explicit curriculum design.

The SPIN approach has an elegant theoretical interpretation. At convergence, the model should generate responses indistinguishable from the ground-truth distribution. If the model's outputs matched the ground-truth distribution perfectly, the preference pairs would be between equally good responses, providing no learning signal. The training naturally terminates when the model achieves the quality level of the ground-truth data.

LLM-as-Judge

A powerful approach for synthetic preference generation uses a language model itself as the preference judge. Given two responses, a judge model determines which is better, often with explanations.

The judge can be:

  • The same model being trained (self-judgment)
  • A larger, more capable model (teacher-student)
  • A specialized critic model trained for evaluation

As we discussed in the RLAIF chapter, using AI feedback can scale alignment annotation significantly. In iterative settings, this enables rapid generation of preference data without human bottlenecks. A single GPU can generate thousands of preference judgments in the time it would take a human annotator to evaluate dozens.

However, self-judgment introduces risks. The model might develop blind spots that it can't recognize in its own outputs. If the model consistently makes a particular type of error, it may also consistently fail to identify that error when judging its outputs. Using an external judge model or periodically incorporating human feedback helps maintain calibration and prevents the accumulation of systematic biases over iterations.
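
A pairwise LLM-as-judge call can be sketched as follows; `generate` is again a hypothetical text-generation callable, and the judging prompt is illustrative:

def judge_pair(generate, prompt, response_a, response_b):
    """Sketch of pairwise judging with a language model.
    Returns the (chosen, rejected) ordering based on the judge's verdict."""
    verdict = generate(
        "You are evaluating two responses to the same prompt.\n"
        f"Prompt: {prompt}\n\nResponse A: {response_a}\n\nResponse B: {response_b}\n\n"
        "Which response is more helpful, honest, and harmless? Answer with 'A' or 'B'."
    )
    if verdict.strip().upper().startswith("A"):
        return response_a, response_b
    return response_b, response_a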

Alignment Stability

Iterative training introduces failure modes that don't appear in single-round alignment. Understanding these is crucial for building robust iterative systems that improve reliably rather than degrading in subtle ways over time.

Distribution Shift and Model Collapse

Over many iterations, the model's output distribution can shift dramatically from where it started. Small biases in the preference signal accumulate, potentially leading to model collapse: the model produces homogeneous outputs that score well on the preference metric but lack diversity or quality.

To understand how model collapse occurs, consider the dynamics of iterative training with a slightly biased preference signal. Suppose the judge model has a subtle preference for longer responses. In the first iteration, this bias causes the model to learn that somewhat longer responses are preferred. In the second iteration, the training data consists of these somewhat longer responses, and the judge again prefers the longest among them. The model learns to generate even longer outputs. This process repeats iteration after iteration, each round amplifying the length bias, until the model's outputs grow essentially without bound and become verbose and unhelpful.

Out[5]:
Visualization
Multiplicative growth of attributes caused by small evaluation biases over ten iterations. A minor initial preference (e.g., a 5% bias for longer responses) compounds significantly during iterative training, eventually leading to extreme behaviors and a collapse in output diversity.

Similar collapse can occur along other dimensions: formality, hedging, particular phrasings. Any small consistent bias in evaluation compounds over iterations. A slight preference for confident-sounding language can evolve into overconfident assertions. A slight preference for hedged statements can produce responses so qualified that they become uninformative. The iterative process acts as an amplifier for whatever biases exist in the evaluation signal.

Mitigations include:

  • Diversity penalties: Explicitly encourage varied outputs during generation
  • Length normalization: Prevent length-based gaming (see the sketch after this list)
  • Regular human evaluation: Catch drift before it compounds
  • KL penalties: Keep the policy close to a fixed reference
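
One simple form of the length-normalization mitigation divides the sequence-level log-probability ratios by response length before computing the preference margin, so that verbosity alone cannot inflate the training signal. A sketch, following the conventions of the DPO helper implemented later in this chapter (the function name and interface are illustrative):

def length_normalized_margin(policy_chosen_logp, policy_rejected_logp,
                             ref_chosen_logp, ref_rejected_logp,
                             chosen_len, rejected_len):
    """Sketch of a length-normalized preference margin.
    Log probs are summed over tokens; dividing by token count converts them
    to per-token averages so longer responses gain no automatic advantage."""
    chosen_ratio = (policy_chosen_logp - ref_chosen_logp) / chosen_len
    rejected_ratio = (policy_rejected_logp - ref_rejected_logp) / rejected_len
    return chosen_ratio - rejected_ratio  # feed into -logsigmoid(beta * margin) as usual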

Reward Hacking Accumulation

As we discussed in the Reward Hacking chapter, models can exploit gaps between what the reward signal measures and what we actually want. In iterative settings, this becomes more severe because each iteration builds on the potentially compromised foundation of previous iterations.

Each iteration might introduce a small amount of reward hacking. The next iteration's preference data comes from this slightly-hacked model, so the hacking becomes embedded in the training distribution. The next round of training builds on this compromised foundation, potentially amplifying the exploit. What starts as a barely noticeable tendency can grow into a dominant behavior pattern over many iterations.

For example, if the reward model has a small blind spot around sycophantic responses, early iterations might learn slight sycophancy. Later iterations generate training data that normalizes this behavior, making the model more confidently sycophantic, which the reward model continues to miss. The model learns that agreeing with you leads to positive feedback, then learns that emphatic agreement leads to even more positive feedback, eventually producing responses that are effusively agreeable regardless of whether your statements are accurate.

Breaking this cycle requires:

  • Diverse reward signals: Use multiple reward models or judges with different biases
  • Human oversight: Periodically inject human preferences to correct drift
  • Adversarial probing: Actively search for behaviors the reward might miss

Capability Degradation

A subtle failure mode in iterative alignment is capability degradation or alignment tax. As the model learns to satisfy preferences, it might lose general capabilities that weren't directly measured. The training process optimizes what is measured, and capabilities that are not included in the preference signal may deteriorate.

For instance, iterative training focused on helpfulness might degrade the model's performance on factual recall or mathematical reasoning. The model becomes very good at sounding helpful but loses substance. If the preference comparisons rarely involve questions requiring precise factual knowledge, the model has no training signal to maintain that capability. Meanwhile, the weight updates that improve helpfulness may interfere with the representations that support factual recall.

This occurs because:

  • Training updates push the model toward preference-satisfying regions of parameter space
  • These regions may not preserve all pretrained capabilities
  • Iterative training compounds these small capability losses

The compounding effect is particularly concerning. A 1% capability loss per iteration might seem acceptable, but after 50 iterations the model retains only about $0.99^{50} \approx 0.61$ of the original capability, a cumulative loss of nearly 40%. Capabilities that were strong in the initial model may become unreliable in the final model, even though no single iteration caused dramatic degradation.

Monitoring diverse capabilities (not just alignment metrics) across iterations helps detect this. Many teams incorporate capability benchmarks into their iterative training loops, halting or reverting training if key capabilities degrade.
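
A minimal sketch of such a capability gate, assuming a dictionary of benchmark callables that each return a score for the current model (all names are placeholders):

def capability_gate(model, benchmarks, baselines, max_drop=0.02):
    """Sketch: flag degradation if any tracked capability falls more than
    `max_drop` (relative) below its baseline score.
    `benchmarks` maps names to callables scoring the model; `baselines` holds
    the initial model's scores on the same benchmarks."""
    for name, evaluate in benchmarks.items():
        score = evaluate(model)
        if score < baselines[name] * (1.0 - max_drop):
            return False, name  # degradation detected; caller should halt or revert
    return True, None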

Reference Model Choice

The choice of reference model significantly impacts iterative stability. Using a rolling reference (updating $\pi_{\text{ref}}$ each iteration) keeps gradients stable but allows unbounded drift from the original model. Using a fixed reference maintains a consistent anchor but can create increasingly large probability ratios that destabilize training.

Each choice involves trade-offs. The rolling reference provides a stable optimization landscape at each iteration, since the policy and reference start close together. But there is no mechanism preventing the policy from drifting arbitrarily far from the initial model over many iterations. The fixed reference prevents such drift by always measuring progress relative to the same anchor, but as the policy improves, the growing divergence between policy and reference creates numerical challenges.

A hybrid approach uses a fixed distant reference with a soft KL penalty plus a rolling recent reference for DPO:

$$\mathcal{L} = \mathcal{L}_{\text{DPO}}(\pi_\theta, \pi_{\text{rolling}}) + \lambda \cdot D_{KL}(\pi_\theta \,\|\, \pi_{\text{fixed}})$$

The combined loss function includes these terms:

  • $\mathcal{L}$: the combined hybrid loss function
  • $\mathcal{L}_{\text{DPO}}$: the standard DPO loss calculated using the rolling reference
  • $\pi_\theta$: the current policy model
  • $\pi_{\text{rolling}}$: the reference model updated at each iteration (usually $\pi_{t-1}$)
  • $\lambda$: the weighting coefficient for the fixed reference KL term
  • $D_{KL}$: the Kullback-Leibler divergence
  • $\pi_{\text{fixed}}$: the static reference model (usually $\pi_0$)

This provides stable gradients from the rolling reference and bounded drift from the fixed anchor. The DPO loss provides the primary training signal, using the rolling reference to ensure reasonable probability ratios. The KL penalty term acts as a soft constraint, allowing the policy to improve beyond the initial model while preventing extreme divergence. The weighting coefficient $\lambda$ controls the strength of this anchoring effect, with larger values keeping the policy closer to the original model at the cost of potentially limiting improvement.
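
A sketch of this hybrid objective, reusing the compute_dpo_loss helper and the sequence-level .log_prob interface from the implementation below; the anchor term is a crude sampled stand-in for the KL divergence, not an exact estimator:

import torch

def hybrid_dpo_loss(policy, rolling_ref, fixed_ref, chosen, rejected,
                    beta=0.1, lam=0.01):
    """Sketch: DPO against a rolling reference plus a soft anchor to a fixed reference.
    Relies on the compute_dpo_loss helper and model.log_prob interface defined
    later in this chapter; the anchor approximates the KL term from batch samples."""
    dpo_term = compute_dpo_loss(policy, rolling_ref, chosen, rejected, beta)
    with torch.no_grad():
        fixed_chosen_logp = fixed_ref.log_prob(chosen)
    # Penalize large positive log-ratios between the policy and the fixed anchor
    # on the sequences in this batch (a rough, sampled proxy for KL divergence).
    anchor_term = (policy.log_prob(chosen) - fixed_chosen_logp).mean()
    return dpo_term + lam * anchor_term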

Implementation

Let's implement an iterative DPO training loop that demonstrates these concepts. We'll simulate preference data generation and show how to structure multiple training rounds.

First, we'll create a simple language model for demonstration purposes. In practice, you'd use a full transformer, but this captures the key dynamics.

In[6]:
Code
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleLM(nn.Module):
    """A simple language model for demonstrating iterative DPO."""

    def __init__(self, vocab_size=1000, hidden_dim=256, num_layers=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_dim)
        self.layers = nn.ModuleList(
            [nn.Linear(hidden_dim, hidden_dim) for _ in range(num_layers)]
        )
        self.output = nn.Linear(hidden_dim, vocab_size)
        self.vocab_size = vocab_size

    def forward(self, x):
        # x shape: (batch, seq_len)
        h = self.embedding(x)  # (batch, seq_len, hidden)
        for layer in self.layers:
            h = F.relu(layer(h))
        logits = self.output(h)  # (batch, seq_len, vocab)
        return logits

    def log_prob(self, sequences):
        """Compute log probability of sequences."""
        logits = self.forward(sequences[:, :-1])
        targets = sequences[:, 1:]
        log_probs = F.log_softmax(logits, dim=-1)
        token_log_probs = torch.gather(
            log_probs, -1, targets.unsqueeze(-1)
        ).squeeze(-1)
        return token_log_probs.sum(dim=-1)

    def generate(self, prompt, max_length=20, temperature=1.0):
        """Generate a sequence given a prompt."""
        self.eval()
        with torch.no_grad():
            current = prompt.clone()
            for _ in range(max_length):
                logits = self.forward(current)
                next_logits = logits[:, -1, :] / temperature
                probs = F.softmax(next_logits, dim=-1)
                next_token = torch.multinomial(probs, 1)
                current = torch.cat([current, next_token], dim=1)
        return current

Now let's define the DPO training components.

In[7]:
Code
def compute_dpo_loss(policy, reference, chosen, rejected, beta=0.1):
    """
    Compute the DPO loss for a batch of preference pairs.

    Args:
        policy: Current policy model
        reference: Reference model (frozen)
        chosen: Preferred sequences (batch, seq_len)
        rejected: Dispreferred sequences (batch, seq_len)
        beta: Temperature parameter

    Returns:
        Scalar loss value
    """
    # Compute log probabilities under policy
    policy_chosen_logp = policy.log_prob(chosen)
    policy_rejected_logp = policy.log_prob(rejected)

    # Compute log probabilities under reference (no gradients)
    with torch.no_grad():
        ref_chosen_logp = reference.log_prob(chosen)
        ref_rejected_logp = reference.log_prob(rejected)

    # Compute log ratios
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp

    # DPO loss: -log(sigmoid(beta * (chosen_ratio - rejected_ratio)))
    logits = beta * (chosen_ratio - rejected_ratio)
    loss = -F.logsigmoid(logits).mean()

    return loss

We need a simulated preference oracle. In practice, this would be human annotators or a reward model. Our oracle prefers sequences with certain patterns.

In[8]:
Code
import numpy as np


class PreferenceOracle:
    """
    Simulated preference oracle that prefers certain patterns.
    In real applications, this would be human feedback or a reward model.
    """

    def __init__(self, preferred_tokens, penalty_tokens, seed=42):
        self.preferred = set(preferred_tokens)
        self.penalty = set(penalty_tokens)
        self.rng = np.random.RandomState(seed)

    def score(self, sequence):
        """Score a sequence based on preferred/penalty tokens."""
        seq_list = sequence.cpu().numpy().tolist()
        score = 0
        for token in seq_list:
            if token in self.preferred:
                score += 1
            elif token in self.penalty:
                score -= 1
        return score

    def compare(self, seq1, seq2):
        """
        Compare two sequences, return (chosen, rejected).
        Adds some noise to simulate annotator disagreement.
        """
        score1 = self.score(seq1) + self.rng.normal(0, 0.5)
        score2 = self.score(seq2) + self.rng.normal(0, 0.5)

        if score1 >= score2:
            return seq1, seq2
        else:
            return seq2, seq1

Now let's implement the iterative DPO training loop with both fixed and rolling reference options.

In[9]:
Code
import copy
import numpy as np
from collections import defaultdict


class IterativeDPOTrainer:
    """Trainer for iterative DPO with configurable reference updating."""

    def __init__(
        self, model, oracle, beta=0.1, lr=1e-4, rolling_reference=True
    ):
        self.model = model
        self.oracle = oracle
        self.beta = beta
        self.lr = lr
        self.rolling_reference = rolling_reference

        # Initialize reference as copy of initial model
        self.reference = copy.deepcopy(model)
        self.reference.eval()
        for param in self.reference.parameters():
            param.requires_grad = False

        # Keep original reference for KL monitoring
        self.original_reference = copy.deepcopy(model)
        self.original_reference.eval()
        for param in self.original_reference.parameters():
            param.requires_grad = False

        self.optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        self.metrics = defaultdict(list)

    def generate_preference_data(self, prompts, num_pairs=2, temperature=1.0):
        """
        Generate preference pairs by sampling from current policy.
        """
        self.model.eval()
        chosen_list, rejected_list = [], []

        for prompt in prompts:
            # Generate multiple responses
            responses = []
            for _ in range(num_pairs * 2):
                response = self.model.generate(
                    prompt.unsqueeze(0), max_length=15, temperature=temperature
                )
                responses.append(response.squeeze(0))

            # Create preference pairs
            for i in range(0, len(responses) - 1, 2):
                chosen, rejected = self.oracle.compare(
                    responses[i], responses[i + 1]
                )
                chosen_list.append(chosen)
                rejected_list.append(rejected)

        self.model.train()
        return chosen_list, rejected_list

    def train_iteration(self, prompts, num_pairs=2, num_steps=50):
        """Run one iteration of DPO training."""
        # Generate preference data from current policy
        chosen_list, rejected_list = self.generate_preference_data(
            prompts, num_pairs
        )

        # Training steps on this data
        losses = []
        for step in range(num_steps):
            # Sample a batch
            batch_size = min(16, len(chosen_list))
            indices = np.random.choice(
                len(chosen_list), batch_size, replace=False
            )

            # Pad sequences to same length
            max_len = max(
                max(chosen_list[i].size(0) for i in indices),
                max(rejected_list[i].size(0) for i in indices),
            )

            chosen_batch = torch.zeros(batch_size, max_len, dtype=torch.long)
            rejected_batch = torch.zeros(batch_size, max_len, dtype=torch.long)

            for j, idx in enumerate(indices):
                c, r = chosen_list[idx], rejected_list[idx]
                chosen_batch[j, : c.size(0)] = c
                rejected_batch[j, : r.size(0)] = r

            # Compute loss and update
            self.optimizer.zero_grad()
            loss = compute_dpo_loss(
                self.model,
                self.reference,
                chosen_batch,
                rejected_batch,
                self.beta,
            )
            loss.backward()
            self.optimizer.step()
            losses.append(loss.item())

        return np.mean(losses)

    def compute_kl_divergence(self, prompts, num_samples=10):
        """Estimate KL divergence from original reference."""
        self.model.eval()
        total_kl = 0
        count = 0

        with torch.no_grad():
            for prompt in prompts[:5]:  # Subsample for efficiency
                for _ in range(num_samples):
                    # Generate from current policy
                    seq = self.model.generate(
                        prompt.unsqueeze(0), max_length=10
                    )

                    # Compute log prob difference
                    policy_logp = self.model.log_prob(seq)
                    ref_logp = self.original_reference.log_prob(seq)
                    total_kl += (policy_logp - ref_logp).item()
                    count += 1

        self.model.train()
        return total_kl / count if count > 0 else 0

    def update_reference(self):
        """Update reference model to current policy (for rolling reference)."""
        if self.rolling_reference:
            self.reference.load_state_dict(self.model.state_dict())
            self.reference.eval()

    def run_iterations(self, prompts, num_iterations=5, num_steps_per_iter=50):
        """Run multiple iterations of DPO training."""
        for iteration in range(num_iterations):
            # Train for one iteration
            avg_loss = self.train_iteration(
                prompts, num_pairs=2, num_steps=num_steps_per_iter
            )

            # Compute metrics
            kl_div = self.compute_kl_divergence(prompts)

            self.metrics["iteration"].append(iteration)
            self.metrics["loss"].append(avg_loss)
            self.metrics["kl_divergence"].append(kl_div)

            # Update reference for next iteration
            self.update_reference()

            print(
                f"Iteration {iteration + 1}: Loss={avg_loss:.4f}, KL={kl_div:.4f}"
            )

        return self.metrics

Let's run the iterative training and visualize the results.

In[10]:
Code
# Set up the experiment
torch.manual_seed(42)

# Create model and oracle
vocab_size = 100
model = SimpleLM(vocab_size=vocab_size, hidden_dim=64, num_layers=2)

# Oracle prefers tokens 10-20, penalizes tokens 80-90
oracle = PreferenceOracle(
    preferred_tokens=list(range(10, 20)), penalty_tokens=list(range(80, 90))
)

# Create sample prompts (random token sequences)
num_prompts = 20
prompts = [torch.randint(0, vocab_size, (5,)) for _ in range(num_prompts)]

# Run iterative DPO with rolling reference
trainer_rolling = IterativeDPOTrainer(
    model=SimpleLM(vocab_size=vocab_size, hidden_dim=64, num_layers=2),
    oracle=oracle,
    beta=0.1,
    rolling_reference=True,
)
metrics_rolling = trainer_rolling.run_iterations(
    prompts, num_iterations=5, num_steps_per_iter=30
)
In[11]:
Code
# Run with fixed reference for comparison
trainer_fixed = IterativeDPOTrainer(
    model=SimpleLM(vocab_size=vocab_size, hidden_dim=64, num_layers=2),
    oracle=oracle,
    beta=0.1,
    rolling_reference=False,
)
metrics_fixed = trainer_fixed.run_iterations(
    prompts, num_iterations=5, num_steps_per_iter=30
)
In[12]:
Code
import matplotlib.pyplot as plt

plt.rcParams.update(
    {
        "figure.figsize": (3.0, 2.5),
        "figure.dpi": 300,
        "figure.constrained_layout.use": True,
        "font.family": "sans-serif",
        "font.sans-serif": [
            "Noto Sans CJK SC",
            "Apple SD Gothic Neo",
            "DejaVu Sans",
            "Arial",
        ],
        "font.size": 10,
        "axes.titlesize": 11,
        "axes.titleweight": "bold",
        "axes.titlepad": 8,
        "axes.labelsize": 10,
        "axes.labelpad": 4,
        "xtick.labelsize": 9,
        "ytick.labelsize": 9,
        "legend.fontsize": 9,
        "legend.title_fontsize": 10,
        "legend.frameon": True,
        "legend.loc": "best",
        "lines.linewidth": 1.5,
        "lines.markersize": 5,
        "axes.grid": True,
        "grid.alpha": 0.3,
        "grid.linestyle": "--",
        "axes.spines.top": False,
        "axes.spines.right": False,
        "axes.prop_cycle": plt.cycler(
            color=["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#7f7f7f"]
        ),
    }
)

## Loss comparison
plt.figure()
plt.plot(
    metrics_rolling["iteration"],
    metrics_rolling["loss"],
    "b-o",
    label="Rolling Reference",
    linewidth=2,
)
plt.plot(
    metrics_fixed["iteration"],
    metrics_fixed["loss"],
    "r-s",
    label="Fixed Reference",
    linewidth=2,
)
plt.xlabel("Iteration")
plt.ylabel("DPO Loss")
plt.title("Training Loss Across Iterations")
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

## KL divergence comparison
plt.figure()
plt.plot(
    metrics_rolling["iteration"],
    metrics_rolling["kl_divergence"],
    "b-o",
    label="Rolling Reference",
    linewidth=2,
)
plt.plot(
    metrics_fixed["iteration"],
    metrics_fixed["kl_divergence"],
    "r-s",
    label="Fixed Reference",
    linewidth=2,
)
plt.xlabel("Iteration")
plt.ylabel("KL Divergence from Original")
plt.title("Policy Drift Over Iterations")
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
Out[12]:
Visualization
Comparison of training loss and policy drift for rolling versus fixed reference models. The rolling reference strategy maintains stable loss values across iterations by keeping the anchor close to the current policy, whereas the fixed reference results in higher loss and greater cumulative divergence from the initial model.

The visualization reveals an important dynamic: with a rolling reference, the KL divergence from the original model grows steadily as the policy improves through iterations. This represents the cumulative effect of iterative training. With a fixed reference, the KL growth pattern differs because the training signal always pulls toward the same anchor point.

Now let's implement a self-improvement loop using the SPIN concept.

In[13]:
Code
class SPINTrainer:
    """
    Self-Play Fine-Tuning trainer.
    Uses ground truth responses as positive examples and
    model generations as negative examples.
    """

    def __init__(self, model, beta=0.1, lr=1e-4):
        self.model = model
        self.beta = beta
        self.optimizer = torch.optim.Adam(model.parameters(), lr=lr)

        # Reference is always the previous iteration's model
        self.reference = copy.deepcopy(model)
        self.reference.eval()
        for param in self.reference.parameters():
            param.requires_grad = False

        self.metrics = defaultdict(list)

    def train_iteration(self, prompts, ground_truth_responses, num_steps=50):
        """
        One SPIN iteration: model outputs are rejected,
        ground truth is chosen.
        """
        self.model.train()
        losses = []

        for step in range(num_steps):
            # Sample batch
            batch_size = min(8, len(prompts))
            indices = np.random.choice(len(prompts), batch_size, replace=False)

            # Ground truth responses are "chosen"
            chosen_batch = ground_truth_responses[indices]

            # Generate responses from current model as "rejected"
            self.model.eval()
            rejected_list = []
            for idx in indices:
                prompt = prompts[idx : idx + 1]
                response = self.model.generate(prompt, max_length=15)
                rejected_list.append(response.squeeze(0))
            self.model.train()

            # Pad rejected sequences
            max_len = max(r.size(0) for r in rejected_list)
            max_len = max(max_len, chosen_batch.size(1))

            rejected_batch = torch.zeros(batch_size, max_len, dtype=torch.long)
            for j, r in enumerate(rejected_list):
                rejected_batch[j, : r.size(0)] = r

            # Pad chosen if needed
            if chosen_batch.size(1) < max_len:
                chosen_padded = torch.zeros(
                    batch_size, max_len, dtype=torch.long
                )
                chosen_padded[:, : chosen_batch.size(1)] = chosen_batch
                chosen_batch = chosen_padded

            # Compute SPIN loss (same as DPO but with different data)
            self.optimizer.zero_grad()
            loss = compute_dpo_loss(
                self.model,
                self.reference,
                chosen_batch,
                rejected_batch,
                self.beta,
            )
            loss.backward()
            self.optimizer.step()
            losses.append(loss.item())

        return np.mean(losses)

    def update_reference(self):
        """Update reference to current model for next iteration."""
        self.reference.load_state_dict(self.model.state_dict())

    def run_iterations(
        self, prompts, ground_truth, num_iterations=5, num_steps=50
    ):
        """Run multiple SPIN iterations."""
        for iteration in range(num_iterations):
            avg_loss = self.train_iteration(prompts, ground_truth, num_steps)
            self.metrics["iteration"].append(iteration)
            self.metrics["loss"].append(avg_loss)

            # Update reference for next iteration
            self.update_reference()

            print(f"SPIN Iteration {iteration + 1}: Loss={avg_loss:.4f}")

        return self.metrics
In[14]:
Code
# Create synthetic "ground truth" responses
# In practice, these would be human-written high-quality responses
torch.manual_seed(123)

prompts_tensor = torch.stack([p for p in prompts])  # (num_prompts, prompt_len)

# Ground truth: sequences with many preferred tokens
ground_truth = torch.zeros(len(prompts), 20, dtype=torch.long)
for i in range(len(prompts)):
    ground_truth[i, :5] = prompts[i]  # Copy prompt
    # Fill rest with preferred tokens (10-20)
    ground_truth[i, 5:] = torch.randint(10, 20, (15,))

# Run SPIN training
spin_model = SimpleLM(vocab_size=vocab_size, hidden_dim=64, num_layers=2)
spin_trainer = SPINTrainer(spin_model, beta=0.1)
spin_metrics = spin_trainer.run_iterations(
    prompts_tensor, ground_truth, num_iterations=5, num_steps=30
)

Let's evaluate how the model's outputs change over SPIN iterations.

In[15]:
Code
def evaluate_generation_quality(model, prompts, oracle, num_samples=5):
    """Evaluate average oracle score of model generations."""
    model.eval()
    total_score = 0
    count = 0

    with torch.no_grad():
        for prompt in prompts[:10]:
            for _ in range(num_samples):
                response = model.generate(prompt.unsqueeze(0), max_length=15)
                score = oracle.score(response.squeeze(0))
                total_score += score
                count += 1

    return total_score / count
In[16]:
Code
## Compare initial vs trained model
initial_model = SimpleLM(vocab_size=vocab_size, hidden_dim=64, num_layers=2)

initial_score = evaluate_generation_quality(initial_model, prompts, oracle)
spin_score = evaluate_generation_quality(spin_trainer.model, prompts, oracle)
Out[17]:
Console
Generation Quality Evaluation:
----------------------------------------
Initial model average score: 0.60
SPIN-trained model average score: 1.24

The SPIN-trained model achieves a significantly higher score, confirming that self-play fine-tuning effectively improved alignment with the oracle's preferences.

In[18]:
Code
import matplotlib.pyplot as plt

plt.rcParams.update(
    {
        "figure.figsize": (6.0, 4.0),
        "figure.dpi": 300,
        "figure.constrained_layout.use": True,
        "font.family": "sans-serif",
        "font.sans-serif": [
            "Noto Sans CJK SC",
            "Apple SD Gothic Neo",
            "DejaVu Sans",
            "Arial",
        ],
        "font.size": 10,
        "axes.titlesize": 11,
        "axes.titleweight": "bold",
        "axes.titlepad": 8,
        "axes.labelsize": 10,
        "axes.labelpad": 4,
        "xtick.labelsize": 9,
        "ytick.labelsize": 9,
        "legend.fontsize": 9,
        "legend.title_fontsize": 10,
        "legend.frameon": True,
        "legend.loc": "best",
        "lines.linewidth": 1.5,
        "lines.markersize": 5,
        "axes.grid": True,
        "grid.alpha": 0.3,
        "grid.linestyle": "--",
        "axes.spines.top": False,
        "axes.spines.right": False,
        "axes.prop_cycle": plt.cycler(
            color=["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#7f7f7f"]
        ),
    }
)

plt.figure()
plt.plot(
    spin_metrics["iteration"],
    spin_metrics["loss"],
    "g-o",
    linewidth=2,
    markersize=8,
)
plt.xlabel("Iteration")
plt.ylabel("SPIN Loss")
plt.title("Self-Play Fine-Tuning Loss")
plt.grid(True, alpha=0.3)
plt.show()
Out[18]:
Visualization
Self-Play Fine-Tuning (SPIN) training loss over iterations. The decreasing loss curve demonstrates the model successfully learning to prioritize ground-truth responses over its own generated candidates as its internal generation quality improves.

The rapid decrease in loss indicates the model is successfully optimizing the SPIN objective, learning to distinguish ground truth responses from its own generations. Finally, let's implement monitoring for alignment stability.

In[19]:
Code
class AlignmentMonitor:
    """Monitor alignment stability across iterations."""

    def __init__(self, original_model, oracle):
        self.original_model = copy.deepcopy(original_model)
        self.original_model.eval()
        self.oracle = oracle
        self.history = defaultdict(list)

    def compute_diversity(self, model, prompts, num_samples=10):
        """Measure output diversity using unique token ratio."""
        model.eval()
        all_tokens = []

        with torch.no_grad():
            for prompt in prompts[:5]:
                for _ in range(num_samples):
                    response = model.generate(
                        prompt.unsqueeze(0), max_length=15
                    )
                    tokens = response.squeeze(0)[5:].tolist()  # Exclude prompt
                    all_tokens.extend(tokens)

        unique_ratio = (
            len(set(all_tokens)) / len(all_tokens) if all_tokens else 0
        )
        return unique_ratio

    def compute_length_stats(self, model, prompts, num_samples=10):
        """Measure response length statistics."""
        model.eval()
        lengths = []

        with torch.no_grad():
            for prompt in prompts[:5]:
                for _ in range(num_samples):
                    response = model.generate(
                        prompt.unsqueeze(0), max_length=20
                    )
                    lengths.append(response.size(1))

        return np.mean(lengths), np.std(lengths)

    def snapshot(self, model, prompts, iteration):
        """Record metrics for current iteration."""
        diversity = self.compute_diversity(model, prompts)
        mean_len, std_len = self.compute_length_stats(model, prompts)
        oracle_score = evaluate_generation_quality(
            model, prompts, self.oracle, 5
        )

        self.history["iteration"].append(iteration)
        self.history["diversity"].append(diversity)
        self.history["mean_length"].append(mean_len)
        self.history["std_length"].append(std_len)
        self.history["oracle_score"].append(oracle_score)

        return {
            "diversity": diversity,
            "mean_length": mean_len,
            "oracle_score": oracle_score,
        }
In[20]:
Code
# Run iterative training with monitoring
monitored_model = SimpleLM(vocab_size=vocab_size, hidden_dim=64, num_layers=2)
monitor = AlignmentMonitor(monitored_model, oracle)

trainer_monitored = IterativeDPOTrainer(
    model=monitored_model, oracle=oracle, beta=0.1, rolling_reference=True
)

print("Alignment Stability Metrics:")
stability_metrics = []
for iteration in range(6):
    # Record metrics before training this round
    metrics = monitor.snapshot(trainer_monitored.model, prompts, iteration)
    stability_metrics.append(metrics)
    print(
        f"Iter {iteration}: Diversity={metrics['diversity']:.3f}, "
        f"Score={metrics['oracle_score']:.2f}"
    )

    # Train one iteration (skip training after the final measurement)
    if iteration < 5:
        trainer_monitored.train_iteration(prompts, num_pairs=2, num_steps=30)
        trainer_monitored.update_reference()
Out[21]:
Console
Alignment Stability Metrics:
Iter 0: Diversity=0.133, Score=0.52
Iter 1: Diversity=0.133, Score=0.48
Iter 2: Diversity=0.133, Score=0.92
Iter 3: Diversity=0.133, Score=0.60
Iter 4: Diversity=0.133, Score=0.88
Iter 5: Diversity=0.133, Score=1.14

In this toy run, the oracle score trends upward across iterations (noisily, from 0.52 to 1.14), indicating that the rolling-reference loop keeps improving alignment. The diversity metric, however, stays flat at 0.133: the unique-token ratio is too coarse to register narrowing at this small scale. In larger runs, a falling diversity curve alongside a rising oracle score is the classic signature of the model converging onto the specific patterns the reward signal favors.

In[22]:
Code
# The global plot style was configured in the previous plotting cell; only
# the figure size changes for these smaller per-metric plots.
plt.rcParams.update({"figure.figsize": (3.0, 2.5)})

iterations = list(range(len(stability_metrics)))
diversities = [m["diversity"] for m in stability_metrics]
scores = [m["oracle_score"] for m in stability_metrics]

# Plot 1: output diversity
plt.figure()
plt.plot(iterations, diversities, "purple", marker="o", linewidth=2)
plt.xlabel("Iteration")
plt.ylabel("Diversity (Unique Token Ratio)")
plt.title("Output Diversity Over Iterations")
plt.grid(True, alpha=0.3)
plt.ylim(0, 1)
plt.show()

# Plot 2: oracle preference score
plt.figure()
plt.plot(iterations, scores, "green", marker="s", linewidth=2)
plt.xlabel("Iteration")
plt.ylabel("Oracle Preference Score")
plt.title("Alignment Quality Over Iterations")
plt.grid(True, alpha=0.3)
plt.show()
Out[22]:
Visualization
Plots of output diversity and oracle preference score over training iterations.
Output diversity and alignment quality across iterative training rounds. The oracle score rises as the model aligns with the reward signal, while the unique-token ratio stays flat in this toy example; at larger scale, alignment gains typically narrow the response distribution, which is exactly the trade-off these two curves are meant to expose.

The stability metrics reveal the trade-offs inherent in iterative alignment. As the model learns to satisfy the oracle's preferences, it may sacrifice diversity, a phenomenon that warrants careful monitoring in production systems.

Key Parameters

The key parameters for the iterative alignment implementation are listed below; a short configuration sketch follows the list:

  • beta: The temperature parameter controlling the strength of the KL penalty (typically 0.1). Higher values keep the model closer to the reference.
  • rolling_reference: Boolean indicating whether to update the reference model at each iteration (True) or keep it fixed (False).
  • num_iterations: The number of cycles of generation and training to perform.
  • num_pairs: The number of response pairs generated per prompt for preference collection in each round.
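To make these knobs concrete, here is a minimal configuration sketch that reuses the toy SimpleLM, oracle, prompts, and IterativeDPOTrainer from earlier in this chapter. The values shown are the chapter's toy defaults, not recommendations.

Code
# Minimal sketch: wiring the key parameters into the iterative loop.
sketch_model = SimpleLM(vocab_size=vocab_size, hidden_dim=64, num_layers=2)

sketch_trainer = IterativeDPOTrainer(
    model=sketch_model,
    oracle=oracle,
    beta=0.1,                # KL-penalty strength: higher stays closer to the reference
    rolling_reference=True,  # refresh the reference model after each round
)

num_iterations = 4           # generation + training cycles
for _ in range(num_iterations):
    # num_pairs: response pairs generated per prompt for preference collection
    sketch_trainer.train_iteration(prompts, num_pairs=2, num_steps=30)
    # With rolling_reference=True, re-anchor the KL penalty to the current policy
    sketch_trainer.update_reference()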

Practical Considerations

Deploying iterative alignment in production requires attention to several practical issues that our simplified examples gloss over.

Iteration Frequency

How often should you iterate? Too frequent iteration (e.g., after every batch) prevents the model from fully learning from current data. Too infrequent iteration (e.g., once per quarter) allows significant distribution shift. Most practitioners find weekly to monthly iteration cycles work well, depending on the rate of model deployment and feedback collection.

Data Mixing

Rather than training exclusively on new preference data each iteration, mixing in data from previous iterations helps prevent catastrophic forgetting of earlier lessons. A common approach maintains a replay buffer of preferences from all iterations, sampling proportionally from recent and historical data.
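As a rough illustration only, a preference replay buffer might look like the sketch below. The class name, the data format, and the 70/30 recent-versus-historical split are assumptions for illustration, not part of the chapter's implementation.

Code
import random


class PreferenceReplayBuffer:
    """Hypothetical buffer that mixes fresh and historical preference pairs."""

    def __init__(self, recent_fraction=0.7):
        self.recent_fraction = recent_fraction  # share of each batch drawn from the newest data
        self.historical = []  # preference pairs from all earlier iterations
        self.recent = []      # pairs collected in the current iteration

    def add_iteration(self, pairs):
        """Archive the previous iteration's pairs, then store the new ones."""
        self.historical.extend(self.recent)
        self.recent = list(pairs)

    def sample(self, batch_size):
        """Draw a training batch that mixes recent and historical preferences."""
        n_recent = min(int(batch_size * self.recent_fraction), len(self.recent))
        n_hist = min(batch_size - n_recent, len(self.historical))
        batch = random.sample(self.recent, n_recent) + random.sample(self.historical, n_hist)
        random.shuffle(batch)
        return batch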

Human-in-the-Loop Checkpoints

Even when using AI feedback for most iterations, periodic human evaluation checkpoints catch drifts that AI judges miss. These checkpoints might occur every few iterations or whenever automated metrics suggest significant behavioral changes.

Rollback Mechanisms

Sometimes iterative training goes wrong: the model develops undesirable behaviors or loses capabilities. Having clear rollback criteria and maintaining checkpoints from each iteration enables quick recovery. Define specific metrics that trigger automatic rollback: capability benchmark drops below threshold, diversity falls too low, or human evaluators flag concerning patterns.
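The sketch below shows what such automatic rollback criteria could look like in code. The metric names and thresholds are placeholders chosen for illustration; real systems would tie these checks to their own benchmarks and evaluation pipelines.

Code
# Hypothetical rollback check: metric keys and thresholds are illustrative.
def should_rollback(current, baseline):
    """Return (rollback?, reason) given current and baseline evaluation metrics."""
    if current["capability_benchmark"] < 0.95 * baseline["capability_benchmark"]:
        return True, "capability benchmark dropped more than 5% below baseline"
    if current["diversity"] < 0.5 * baseline["diversity"]:
        return True, "output diversity fell below half the baseline level"
    if current.get("human_flags", 0) > 0:
        return True, "human evaluators flagged concerning patterns"
    return False, ""


# Example with made-up numbers: the capability drop triggers a rollback.
baseline = {"capability_benchmark": 0.82, "diversity": 0.40}
current = {"capability_benchmark": 0.71, "diversity": 0.35, "human_flags": 0}
rollback, reason = should_rollback(current, baseline)
if rollback:
    print(f"Rolling back to previous checkpoint: {reason}")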

Limitations and Impact

Iterative alignment represents a significant advance over single-shot training but introduces its own challenges. The computational cost multiplies with each iteration: you need to generate new data, potentially annotate it, and train again. For large models, this becomes expensive quickly.

The self-improvement approaches, while promising, have limits. A model cannot improve beyond what its evaluation criteria capture. If the judge model (whether human or AI) has blind spots, iterative training amplifies rather than corrects them. Constitutional AI partially addresses this with explicit principles, but those principles themselves may be incomplete or contradictory.

Perhaps most concerning is the possibility of subtle value drift. Small consistent biases in the preference signal, whether from the reward model, the AI judge, or sampling strategies, compound over iterations. After many rounds of training, the model might satisfy measured preferences while diverging from underlying human values in ways that are difficult to detect until something goes wrong.

These challenges don't diminish the importance of iterative alignment; they highlight the need for careful monitoring, diverse evaluation, and appropriate human oversight. As models become more capable, the alignment process must become more sophisticated. Single-shot training may suffice for simple tasks, but truly aligned AI systems likely require continuous, iterative refinement guided by ongoing human feedback and evaluation.

The upcoming chapters on inference optimization cover the practical machinery needed to deploy these iteratively aligned models efficiently. Techniques like KV caching, quantization, and efficient attention make it feasible to run aligned models at scale while maintaining the quality achieved through careful iterative training.

Summary

Iterative alignment extends single-shot methods like DPO and RLHF into ongoing processes that maintain tight alignment as models and expectations evolve. Key concepts from this chapter:

  • Distribution mismatch after training motivates iterative approaches: the model's outputs no longer match the distribution that generated preference data
  • Iterative DPO runs multiple training rounds, with either a fixed reference (stable anchor, growing ratios) or rolling reference (stable ratios, cumulative drift)
  • Online preference learning generates fresh data from the current policy, keeping training distribution matched but requiring continuous annotation
  • Self-improvement through SPIN, Constitutional AI, and LLM-as-judge enables scaling beyond human annotation bottlenecks
  • Stability concerns include model collapse, reward hacking accumulation, capability degradation, and subtle value drift
  • Monitoring diversity, capability retention, and KL divergence from the original model helps detect problems early
  • Practical deployment requires balancing iteration frequency, mixing historical data, incorporating human checkpoints, and maintaining rollback capabilities

Iterative alignment isn't a solved problem; it's an active research area with significant open questions about stability, safety, and scalability. But the core insight is clear: alignment should be a continuous process, not a one-time event.

