Reward Modeling: Building Preference Predictors for RLHF

Michael Brenndoerfer · December 24, 2025 · 37 min read

Build neural networks that learn human preferences from pairwise comparisons. Master reward model architecture, Bradley-Terry loss, and evaluation for RLHF.


Reward Modeling

In the previous chapter, we explored the Bradley-Terry model and how it provides a probabilistic framework for converting pairwise human preferences into a consistent scoring system. Now we turn to the practical question: how do we build a neural network that learns to predict these preferences?

A reward model is a neural network that takes a prompt and response as input and outputs a scalar score indicating how "good" that response is according to human preferences. This model serves as a proxy for human judgment, allowing us to provide dense feedback signals during reinforcement learning without requiring a human to evaluate every generated response. The reward model sits at the heart of RLHF: it translates sparse, noisy human preferences into a continuous signal that can guide policy optimization.

Building an effective reward model requires careful consideration of architecture choices, loss functions, and evaluation methods. The model must generalize from a limited set of human comparisons to accurately score responses it has never seen before, including responses generated by future versions of the language model being trained. This chapter covers the complete pipeline from architecture design through training and evaluation.

Reward Model Architecture

The standard approach to building a reward model starts with a pretrained language model and adds a simple regression head that maps the final hidden states to a scalar reward. This design philosophy reflects a key insight: rather than learning to understand language from scratch, we can leverage the rich representations already encoded in models that have been trained on vast text corpora. The pretrained model provides the linguistic foundation, while the new regression head learns to interpret those representations through the lens of human preferences.

Base Model Selection

Reward models typically use the same architecture family as the policy model they will train. If you're using RLHF to fine-tune a 7B parameter LLaMA model, your reward model might be initialized from the same pretrained checkpoint or a similar model in the family. This architectural alignment ensures the reward model can process the same input representations and has similar language understanding capabilities. The choice is not arbitrary: when the reward model shares the same "vocabulary" of internal representations as the policy model, it can more accurately evaluate the subtle qualities of responses that the policy might generate.

The key modification is replacing or augmenting the language modeling head. Instead of predicting the next token, we need to output a single scalar value that represents the quality of the entire response. This transformation converts a generative model into an evaluative one, shifting from the question "what comes next?" to "how good is this?"

Value Head Design

The core architectural question for reward modeling is this: how do we collapse an entire sequence of hidden states, one for each token in the prompt and response, into a single number that captures the overall quality? The solution involves selecting a representative hidden state and projecting it down to a scalar through a learned transformation.

The reward model architecture can be expressed as:

r(x, y) = f_\theta(h_{\text{final}})

where:

  • r(x, y): the scalar reward score output by the model
  • x: the prompt text input to the model
  • y: the response text to be evaluated
  • h_{\text{final}}: the hidden state vector at the final token position
  • f_\theta: the learned function (value head) mapping the hidden state to a scalar

This formulation captures the essential transformation at the heart of reward modeling. The input is a high-dimensional hidden state vector, perhaps 768 or 4096 dimensions depending on the model size, and the output is a single real number. The function f_\theta must learn to extract and weigh the relevant features from this rich representation to produce a meaningful quality score.

This projection is typically implemented as a linear layer:

f_\theta(h) = w^T h + b

where:

  • f_\theta(h): the scalar output of the value head
  • w \in \mathbb{R}^d: the learned weight vector for the value head
  • w^T: the transpose of w, enabling the dot product with h
  • h: the input hidden state vector
  • b \in \mathbb{R}: the learned bias term
  • d: the hidden dimension size of the base transformer model

The linear layer performs a weighted sum over all dimensions of the hidden state. Each component w_i learns to assign importance to the corresponding dimension h_i of the hidden representation. Positive weights mean that feature contributes positively to the reward, negative weights indicate a negative contribution, and weights near zero suggest the feature is irrelevant for quality assessment. The bias term b shifts the overall reward scale, allowing the model to center its predictions appropriately.
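To make this concrete, here is a minimal sketch of the value head in PyTorch; the hidden size of 768 and the random hidden state are purely illustrative stand-ins:

import torch
import torch.nn as nn

hidden_size = 768                         # illustrative; set by the backbone model
value_head = nn.Linear(hidden_size, 1)    # holds the weight vector w and bias b

h_final = torch.randn(1, hidden_size)     # stand-in for the final token's hidden state
reward = value_head(h_final).squeeze(-1)  # computes w^T h + b, one scalar per sequence
print(reward.shape)                       # torch.Size([1])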

Why use the final token position? In autoregressive models, information flows from left to right through causal attention. Each token can only attend to tokens that came before it, creating a natural accumulation of information as we move through the sequence. The final token's hidden state has "seen" all preceding tokens in both the prompt and response, making it a natural summary of the entire sequence. By the time we reach the end, the model has processed every word, every argument, and every nuance. This is analogous to using the [CLS] token representation in BERT-style models, which we covered in Part XVII, where a special token is positioned to aggregate information from the entire input.

Some implementations average over all response token positions instead:

h_{\text{avg}} = \frac{1}{|y|} \sum_{t \in \text{response}} h_t

where:

  • h_{\text{avg}}: the mean pooled hidden representation
  • |y|: the number of tokens in the response
  • h_t: the hidden state at token position t
  • \sum_{t \in \text{response}}: a summation over all token positions belonging to the response

This mean pooling approach treats all positions as equally important and computes their centroid in hidden space. The intuition is that every part of the response matters, and averaging captures the "typical" representation across the sequence. However, using the final token is more common in practice because it naturally captures the complete context and requires no additional computation. The causal attention mechanism has already done the work of aggregating information, so the final position provides a ready-made summary.
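For comparison, here is a minimal sketch of masked mean pooling; the mask marking which positions belong to the response is a made-up example and would come from the tokenizer in practice:

import torch

hidden_states = torch.randn(2, 10, 768)   # [batch, seq_len, hidden]
response_mask = torch.zeros(2, 10)
response_mask[:, 4:9] = 1.0               # assume tokens 4-8 are the response

# Zero out non-response positions, then divide by the response length |y|
summed = (hidden_states * response_mask.unsqueeze(-1)).sum(dim=1)
h_avg = summed / response_mask.sum(dim=1, keepdim=True)  # [batch, hidden]
print(h_avg.shape)                        # torch.Size([2, 768])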

Handling Variable-Length Inputs

The reward model must handle prompt-response pairs of varying lengths. Some prompts are brief questions, others are lengthy instructions with context. Similarly, responses range from terse answers to elaborate explanations. The architecture must gracefully accommodate this variability while maintaining consistent scoring semantics.

The input is formatted as a concatenation of the prompt and response:

[prompt tokens] [response tokens] [EOS]

The model processes this sequence through the transformer layers, and we extract the hidden state at the EOS (end-of-sequence) token position for the value head. This approach leverages the causal attention mechanism to ensure the reward is computed based on the complete context. The EOS token serves as a natural boundary marker, signaling where the response ends and providing a consistent extraction point regardless of the actual sequence length.

Reward vs. Value Functions

In reinforcement learning terminology, a reward function r(s, a) gives the immediate reward for taking action a in state s. A value function V(s) estimates the expected cumulative future reward from state s. Our reward model acts as a reward function, providing a scalar score for the completed response. The value function used during PPO training is a separate component, which we'll discuss in upcoming chapters on policy optimization.

Preference Loss Function

The reward model is trained on human preference data, where annotators indicate which of two responses they prefer for a given prompt. We need a loss function that encourages the model to assign higher rewards to preferred responses. The key insight is that we do not need absolute quality labels. Instead, we only need relative comparisons, and from these pairwise signals, the model can learn a consistent scoring function.

Deriving the Loss from Bradley-Terry

The mathematical foundation for our loss function comes from the Bradley-Terry model, which provides a principled way to convert scalar scores into preference probabilities. As we established in the Bradley-Terry chapter, the probability that response y_w (the winner) is preferred over response y_l (the loser) given their reward scores is:

P(y_w \succ y_l | x) = \sigma(r(x, y_w) - r(x, y_l))

where:

  • P(y_w \succ y_l | x): the probability that response y_w is preferred over y_l given prompt x
  • x: the prompt text input
  • \sigma: the logistic sigmoid function, \sigma(z) = \frac{1}{1+e^{-z}}
  • r(x, y): the scalar score output by the reward model for a given input pair
  • y_w, y_l: the winning and losing responses, respectively

This equation has a beautiful interpretation. The preference probability depends only on the difference between rewards, not their absolute values. If one response scores 10 points higher than another, the preference probability is the same whether the scores are (15, 5) or (105, 95). This shift-invariance is both mathematically convenient and practically useful: it means the model only needs to learn relative quality, not calibrate to some arbitrary absolute scale.

The sigmoid function \sigma transforms this difference into a valid probability between 0 and 1. When the reward difference is zero, both responses are equally preferred with probability 0.5. As the difference grows positive (the winner scores much higher), the probability approaches 1. As it grows negative (the winner actually scores lower, indicating a model error), the probability approaches 0.

Out[2]:
Visualization
The Bradley-Terry model maps reward differences to preference probabilities via the sigmoid function. When rewards are equal (difference of zero), both responses have equal 50% preference probability. Larger positive differences indicate stronger preference for the higher-scored response.

To train the model, we maximize the log-likelihood of the observed preferences. This corresponds to minimizing the negative log-likelihood, which is equivalent to the binary cross-entropy loss applied to the reward difference. The derivation proceeds through several algebraic steps, each illuminating a different aspect of the loss function. We can derive the final loss form step-by-step:

\begin{aligned}
\mathcal{L} &= -\log P(y_w \succ y_l | x) && \text{(negative log-likelihood)} \\
&= -\log \sigma(r(x, y_w) - r(x, y_l)) && \text{(substitute probability model)} \\
&= -\log \left( \frac{1}{1 + e^{-(r(x, y_w) - r(x, y_l))}} \right) && \text{(substitute sigmoid definition)} \\
&= \log(1 + e^{-(r(x, y_w) - r(x, y_l))}) && \text{(log property: } -\log(1/z) = \log z \text{)} \\
&= \log(1 + e^{r(x, y_l) - r(x, y_w)}) && \text{(simplify exponent)}
\end{aligned}

where:

  • \mathcal{L}: the preference loss to be minimized
  • P(y_w \succ y_l | x): the probability that response y_w is preferred
  • \sigma: the logistic sigmoid function
  • x: the prompt text input
  • y_w, y_l: the winning and losing responses
  • r(x, y_l) - r(x, y_w): the reward difference (negative if the preferred response scores higher)
  • e^{(\cdot)}: the exponential function
  • \log: the natural logarithm

The final form of the loss, \log(1 + e^{r(x, y_l) - r(x, y_w)}), is known as the softplus function applied to the negative reward margin. This form is numerically stable and commonly implemented directly in deep learning frameworks.
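A quick standalone check (not tied to any particular model) confirms that the negative log-sigmoid and softplus forms compute the same quantity, which is why either can be used in an implementation:

import torch
import torch.nn.functional as F

margin = torch.tensor([2.0, 0.0, -1.5])       # r(x, y_w) - r(x, y_l)
loss_logsigmoid = -F.logsigmoid(margin)       # -log sigma(margin)
loss_softplus = F.softplus(-margin)           # log(1 + exp(-margin))
print(torch.allclose(loss_logsigmoid, loss_softplus))  # True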

Intuition Behind the Loss

This equation has a clear interpretation that connects directly to how we want the model to behave. Consider what happens in different scenarios.

When the reward model correctly assigns a much higher score to the preferred response (r(x, y_w) \gg r(x, y_l)), the difference is a large positive number, and \sigma of that difference approaches 1. Taking the negative log gives a loss close to zero. The model has confidently made the right prediction, so there is little left to learn from this example.

Conversely, when the model incorrectly assigns a higher score to the rejected response, the difference is negative, \sigma outputs a value near 0, and the negative log produces a large loss. This large loss creates strong gradients that push the model away from its incorrect belief.

Out[3]:
Visualization
The preference loss as a function of reward margin. When the model correctly ranks responses (positive margin), loss is small, while incorrect rankings produce large losses and strong gradients for learning. This asymmetry ensures the model focuses learning on examples it gets wrong.

The gradient of this loss pushes the model to:

  • Increase r(x, y_w) (the preferred response's reward)
  • Decrease r(x, y_l) (the rejected response's reward)

The magnitude of these updates depends on how confident the model currently is. If the model already strongly prefers the correct response, gradients are small. If it's uncertain or wrong, gradients are larger. This is the standard behavior of log-loss functions and provides natural calibration during training. The model learns most from examples where it is wrong or uncertain, while examples it already handles well contribute minimally to parameter updates.

Batch Loss Formulation

In practice, we train on batches of preference pairs rather than individual examples. For a dataset of N preference pairs \mathcal{D} = \{(x^{(i)}, y_w^{(i)}, y_l^{(i)})\}_{i=1}^N, the full training objective is:

\mathcal{L}(\theta) = -\frac{1}{N} \sum_{i=1}^N \log \sigma\left(r_\theta(x^{(i)}, y_w^{(i)}) - r_\theta(x^{(i)}, y_l^{(i)})\right)

where:

  • \mathcal{L}(\theta): the total loss over the dataset, parameterized by model weights \theta
  • N: the total number of preference pairs in the batch
  • \sum_{i=1}^N: summation over all examples in the batch
  • x^{(i)}: the prompt for the i-th example
  • y_w^{(i)}, y_l^{(i)}: the preferred and rejected responses for the i-th example
  • r_\theta: the reward model function parameterized by \theta
  • \sigma: the sigmoid function converting the score difference into a probability
  • \log: the natural logarithm

The averaging by N normalizes the loss across different batch sizes, ensuring that learning rates behave consistently regardless of batch configuration. This is the standard reward modeling loss used in systems like InstructGPT, Anthropic's Constitutional AI, and most open-source RLHF implementations. Its simplicity and effectiveness have made it the default choice for training reward models from pairwise preferences.

Margin-Based Variants

Some implementations add a margin term to encourage larger reward differences between preferred and rejected responses:

\mathcal{L} = -\log \sigma(r(x, y_w) - r(x, y_l) - m)

where:

  • \mathcal{L}: the margin-based preference loss
  • x: the prompt text
  • y_w, y_l: the preferred and rejected responses
  • m > 0: a fixed margin hyperparameter ensuring the winner's score exceeds the loser's by at least m
  • r(x, y_w) - r(x, y_l): the raw reward difference
  • \sigma: the sigmoid function
  • \log: the natural logarithm

The margin m acts as a buffer zone. With a margin of, say, 0.5, the model is penalized unless the preferred response scores at least 0.5 points higher than the rejected one. This prevents the model from being satisfied with arbitrarily small reward differences, even when the ranking is technically correct.

This penalizes the model even when it correctly ranks preferences but with a small margin, encouraging more confident predictions. The idea is that a robust reward model should produce clearly separated scores, making it easier for downstream RL algorithms to distinguish good from bad responses. However, this can hurt calibration and is not universally used. The margin effectively changes the decision boundary from zero to m, which may not align with the true underlying preference probabilities.
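A sketch of the margin variant, reusing the log-sigmoid form; the margin value of 0.5 is purely illustrative:

import torch
import torch.nn.functional as F

def margin_preference_loss(chosen_rewards, rejected_rewards, margin=0.5):
    # Penalized unless the chosen reward exceeds the rejected reward by at least `margin`
    return -F.logsigmoid(chosen_rewards - rejected_rewards - margin).mean()

# A correct ranking with a small raw difference (0.3 < margin) still incurs noticeable loss
print(margin_preference_loss(torch.tensor([1.3]), torch.tensor([1.0])))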

Reward Model Training

Training a reward model involves several practical considerations beyond the loss function, including data preparation, optimization settings, and regularization strategies.

Data Preparation

Each training example consists of a prompt and two responses with a preference label. The standard format is:

In[4]:
Code
{
    "prompt": "Explain quantum entanglement to a 10-year-old.",
    "chosen": "Imagine you have two magic coins...",
    "rejected": "Quantum entanglement is a phenomenon...",
}
Out[4]:
Console
{'prompt': 'Explain quantum entanglement to a 10-year-old.',
 'chosen': 'Imagine you have two magic coins...',
 'rejected': 'Quantum entanglement is a phenomenon...'}

During training, we need to compute rewards for both responses. A common approach processes both responses in a single forward pass by concatenating them:

[prompt] [chosen response] [EOS] [PAD] ... [prompt] [rejected response] [EOS]

The model computes hidden states for both sequences, extracts the final token representations for each, passes them through the value head, and computes the loss on their difference.
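One way to realize the single-pass idea, sketched here with a stand-in scoring function so the snippet runs on its own, is to stack the chosen and rejected encodings along the batch dimension and split the resulting rewards; in practice the stacked batch would be fed to the reward model defined in the implementation section below:

import torch
import torch.nn.functional as F

def dummy_reward_model(input_ids, attention_mask):
    # Stand-in for the real model: returns one scalar per sequence
    return input_ids.float().mean(dim=1)

batch_size, seq_len = 4, 16
chosen_ids = torch.randint(0, 1000, (batch_size, seq_len))
rejected_ids = torch.randint(0, 1000, (batch_size, seq_len))
mask = torch.ones(batch_size, seq_len, dtype=torch.long)

# One forward pass over the stacked batch, then split the rewards back into halves
input_ids = torch.cat([chosen_ids, rejected_ids], dim=0)    # [2*batch, seq]
attention_mask = torch.cat([mask, mask], dim=0)
rewards = dummy_reward_model(input_ids, attention_mask)     # [2*batch]
chosen_rewards, rejected_rewards = rewards.chunk(2, dim=0)
loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()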

Optimization Configuration

Reward model training typically uses conservative hyperparameters:

  • Learning rate: 1 \times 10^{-5} to 5 \times 10^{-5}, lower than instruction tuning to preserve pretrained knowledge
  • Batch size: Large batches (32-128 pairs) help with gradient stability
  • Epochs: 1-3 epochs over the preference data; more risks overfitting
  • Optimizer: AdamW with weight decay 0.01-0.1

The learning rate is kept low because we're building on a pretrained model that already has strong language understanding. We want to learn the preference structure without disturbing the underlying representations too much.
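A sketch of how these settings might be wired together, assuming the linear warmup helper from the transformers library; the stand-in model and step counts are illustrative:

import torch
import torch.nn as nn
from transformers import get_linear_schedule_with_warmup

model = nn.Linear(768, 1)   # stand-in; in practice, the full reward model
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)

num_epochs, steps_per_epoch = 3, 100             # illustrative numbers
total_steps = num_epochs * steps_per_epoch
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.05 * total_steps),    # ~5% of training as warmup
    num_training_steps=total_steps,
)
# Inside the training loop, call optimizer.step() followed by scheduler.step()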

Regularization Considerations

Reward models are susceptible to overfitting, especially when trained on limited preference data. Several techniques help:

  • Early stopping based on validation accuracy is essential. The model should generalize to held-out preferences, not memorize training pairs.
  • Dropout in the value head (though not typically in the pretrained layers) adds regularization without affecting the language model's representations.
  • Label smoothing can help with noisy labels. If some human preferences are inconsistent or reflect borderline cases, treating them as probabilistic rather than hard labels improves robustness, as sketched just after this list.
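One common formulation of label smoothing for pairwise preferences, sketched below, treats the label as correct with probability 1 - ε rather than 1; the value ε = 0.1 is an illustrative choice:

import torch
import torch.nn.functional as F

def smoothed_preference_loss(chosen_rewards, rejected_rewards, eps=0.1):
    diff = chosen_rewards - rejected_rewards
    # Soft target: "chosen wins" with probability 1 - eps, "rejected wins" with probability eps
    return (-(1 - eps) * F.logsigmoid(diff) - eps * F.logsigmoid(-diff)).mean()

print(smoothed_preference_loss(torch.tensor([2.1]), torch.tensor([0.8])))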

Training Stability

The reward scale is arbitrary, unlike classification where outputs are bounded probabilities. This can cause optimization instability. Common solutions include:

  • Reward normalization after training scales rewards to have zero mean and unit variance over a reference set. This makes the reward signal easier to use in downstream RL training (a sketch follows this list).
  • Gradient clipping prevents large updates from outlier examples. A max gradient norm of 1.0 is typical.
  • Learning rate warmup over the first 5-10% of training helps stabilize early optimization.
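A minimal sketch of the reward normalization mentioned in the first bullet; the reference rewards are made-up numbers standing in for scores computed over a held-out reference set:

import torch

def normalize_rewards(rewards: torch.Tensor, reference_rewards: torch.Tensor) -> torch.Tensor:
    # Shift and scale using statistics computed once over a fixed reference set
    mean = reference_rewards.mean()
    std = reference_rewards.std().clamp_min(1e-6)  # guard against division by zero
    return (rewards - mean) / std

reference = torch.tensor([1.8, -0.4, 0.9, 2.3, -1.1])   # illustrative reference-set rewards
print(normalize_rewards(torch.tensor([0.5, 2.0]), reference))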

Reward Model Evaluation

Evaluating reward models is crucial but challenging. Unlike language models where we can measure perplexity on held-out text, reward models are evaluated on their ability to predict human preferences. The evaluation must assess whether the learned reward function captures the underlying preference structure, not just memorizes the training comparisons.

Primary Metrics

Pairwise accuracy is the most direct evaluation. On a held-out set of preference pairs, we measure how often the reward model assigns a higher score to the human-preferred response. This metric directly measures the model's ability to perform its intended function: distinguishing better responses from worse ones.

\text{Accuracy} = \frac{1}{N_{\text{test}}} \sum_{i=1}^{N_{\text{test}}} \mathbf{1}[r(x^{(i)}, y_w^{(i)}) > r(x^{(i)}, y_l^{(i)})]

where:

  • \text{Accuracy}: the fraction of correctly ranked pairs
  • N_{\text{test}}: the total number of examples in the test set
  • \sum_{i=1}^{N_{\text{test}}}: summation over all test examples
  • x^{(i)}: the prompt for the i-th test example
  • y_w^{(i)}, y_l^{(i)}: the preferred and rejected responses for the i-th test example
  • \mathbf{1}[\cdot]: the indicator function, evaluating to 1 if the condition is true and 0 otherwise
  • r(x^{(i)}, \dots): the predicted reward for the i-th pair's responses

The indicator function \mathbf{1}[\cdot] returns 1 when the condition inside is true (the chosen response receives a higher reward) and 0 otherwise. Summing these indicators and dividing by the total count gives us the proportion of correctly ranked pairs.

A random model achieves 50% accuracy, while human inter-annotator agreement typically ranges from 65-80% depending on the task difficulty. A good reward model should approach but not necessarily exceed human agreement levels, since disagreements in the training data cap achievable performance. If humans themselves only agree 75% of the time on which response is better, we cannot expect the model to exceed this ceiling. In fact, a model achieving much higher accuracy might be exploiting artifacts in the data rather than capturing genuine preferences.

Calibration measures whether the model's confidence matches its accuracy. If the model predicts a preference probability of 0.8, it should be correct about 80% of the time for similar-confidence predictions. Calibration is important because the reward differences will be used as optimization signals. A well-calibrated model produces reliable gradients: large reward differences genuinely indicate strong preferences, while small differences reflect uncertainty.
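A minimal calibration check, sketched with synthetic numbers: bin held-out pairs by predicted preference probability and compare each bin's average confidence to its empirical accuracy. Large gaps between the two columns indicate miscalibration.

import numpy as np

# Synthetic illustration: predicted probabilities that the chosen response wins,
# and whether the chosen response actually won (1) or not (0)
probs = np.array([0.55, 0.62, 0.71, 0.78, 0.83, 0.90, 0.66, 0.58, 0.95, 0.74])
correct = np.array([1, 0, 1, 1, 1, 1, 1, 0, 1, 1])

bins = np.linspace(0.5, 1.0, 6)            # five bins spanning [0.5, 1.0]
bin_ids = np.digitize(probs, bins) - 1
for b in range(len(bins) - 1):
    in_bin = bin_ids == b
    if in_bin.any():
        print(f"confidence ~{probs[in_bin].mean():.2f}  accuracy {correct[in_bin].mean():.2f}")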

Agreement with Human Evaluators

Beyond automatic metrics, direct comparison with human evaluations provides insight into model quality:

  • Correlation with human scores: If human evaluators rate responses on a Likert scale (1-5), we can measure Spearman correlation between model rewards and human ratings, as sketched after this list.
  • Head-to-head win rates: Show humans new responses ranked by the reward model and ask them to validate the rankings. This catches cases where the model has learned spurious correlations.
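A sketch of the correlation check, assuming SciPy is available (it is not part of the install cell used later in this chapter); both arrays are hypothetical:

import numpy as np
from scipy.stats import spearmanr

# Hypothetical Likert ratings (1-5) from humans and the corresponding model rewards
human_ratings = np.array([5, 2, 4, 1, 3, 4, 2, 5])
model_rewards = np.array([2.3, -0.8, 1.7, -1.5, 0.4, 1.1, -0.2, 2.9])

rho, p_value = spearmanr(human_ratings, model_rewards)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")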

Detecting Reward Model Weaknesses

Reward models can learn shortcuts that don't align with true quality:

  • Length bias: Models often prefer longer responses, even when brevity is more appropriate. Test by comparing responses of different lengths where the shorter one is genuinely better.
  • Style over substance: Models may prefer responses with confident tone or specific formatting regardless of accuracy. Test with factually incorrect but confidently-written responses.
  • Prompt sensitivity: Check if reward differences are consistent across paraphrased prompts asking the same question.

These weaknesses become critical during RL training, when the policy model can exploit them. We'll explore this issue in depth in the upcoming chapter on reward hacking.
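As a quick probe for the length bias mentioned above, one can check how strongly reward correlates with response length on a set of generated responses; the numbers here are hypothetical:

import numpy as np

# Hypothetical audit: reward scores and token lengths for a batch of responses
rewards = np.array([1.2, 0.4, 2.1, -0.3, 1.8, 0.9])
lengths = np.array([220, 80, 310, 45, 270, 150])

# A strong positive correlation suggests the model may be rewarding verbosity itself
corr = np.corrcoef(lengths, rewards)[0, 1]
print(f"Correlation between length and reward: {corr:.2f}")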

Worked Example: Computing Preference Loss

Let's trace through the loss computation for a single preference pair to solidify understanding. Walking through a concrete numerical example makes the abstract formulas tangible and reveals how the loss function behaves in practice.

Suppose we have a reward model that produces:

  • r(x, y_w) = 2.1 for the preferred response
  • r(x, y_l) = 0.8 for the rejected response

The reward difference is:

\Delta r = r(x, y_w) - r(x, y_l) = 2.1 - 0.8 = 1.3

where:

  • \Delta r: the reward difference between the chosen and rejected responses
  • r(x, y_w): the reward score for the preferred response (2.1)
  • r(x, y_l): the reward score for the rejected response (0.8)

This positive difference of 1.3 indicates the model correctly believes the preferred response is better. The question is: how confident is this prediction, and how much loss does it incur?

Under the Bradley-Terry model, the probability the model assigns to the correct preference is:

\begin{aligned}
P(y_w \succ y_l) &= \sigma(1.3) && \text{(apply sigmoid)} \\
&= \frac{1}{1 + e^{-1.3}} && \text{(definition)} \\
&= \frac{1}{1 + 0.273} && \text{(compute exponential)} \\
&\approx 0.786 && \text{(final result)}
\end{aligned}

where:

  • P(y_w \succ y_l): the calculated probability that the model prefers y_w
  • \sigma(1.3): the sigmoid function applied to the reward difference
  • 0.786: the resulting probability (approx. 78.6%)

The model assigns about 78.6% probability to the correct preference. This is a reasonably confident prediction, reflecting the moderately large reward margin of 1.3 points.

The loss for this example:

\mathcal{L} = -\log(0.786) \approx 0.241

where:

  • \mathcal{L}: the computed preference loss value
  • -\log(0.786): the negative log-likelihood of the correct preference

The loss of 0.241 is relatively small, indicating the model is performing well on this example. There is still some room for improvement: if the model were perfectly confident (probability 1.0), the loss would be zero.

Now consider if the model had incorrectly scored the responses:

  • r(x, y_w) = 0.5
  • r(x, y_l) = 1.8

The reward difference is -1.3, the predicted probability drops to \sigma(-1.3) \approx 0.214, and the loss increases dramatically to -\log(0.214) \approx 1.54.

This demonstrates how the loss heavily penalizes incorrect rankings while providing smaller gradients when the model is already correct. The loss of 1.54 is more than six times larger than the 0.241 for the correct prediction, creating strong pressure to fix the incorrect ranking. This asymmetry is exactly what we want: the model should focus its learning capacity on examples it gets wrong.
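The arithmetic above is easy to verify directly with a few lines of Python (a standalone sketch):

import math

def preference_loss(r_chosen: float, r_rejected: float) -> float:
    margin = r_chosen - r_rejected
    prob = 1.0 / (1.0 + math.exp(-margin))   # Bradley-Terry preference probability
    return -math.log(prob)

print(preference_loss(2.1, 0.8))   # ~0.241 (correct ranking, margin +1.3)
print(preference_loss(0.5, 1.8))   # ~1.541 (incorrect ranking, margin -1.3)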

Out[5]:
Visualization
Loss values for the worked example. The correct prediction (margin 1.3) yields low loss of 0.24, while an incorrect ranking (margin -1.3) produces much higher loss of 1.54. The six-fold difference in loss creates strong gradients that prioritize fixing incorrect rankings.

Code Implementation

Let's build a reward model from scratch using a small pretrained transformer. We'll implement the architecture, loss function, and training loop.

Setting Up the Environment

In[6]:
Code
!uv pip install transformers torch matplotlib numpy
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
import torch.optim
from transformers import AutoModel, AutoTokenizer
import numpy as np
import matplotlib.pyplot as plt
from dataclasses import dataclass
from typing import List, Tuple, Optional
import warnings
warnings.filterwarnings('ignore')

Reward Model Architecture

We'll build a reward model by adding a value head to a pretrained transformer. The value head is a simple linear layer that projects the final hidden state to a scalar.

In[7]:
Code
class RewardModel(nn.Module):
    """
    Reward model built on a pretrained transformer.

    Takes (prompt, response) pairs and outputs scalar rewards.
    """

    def __init__(self, model_name: str, dropout: float = 0.1):
        super().__init__()
        # Load pretrained transformer as the backbone
        self.backbone = AutoModel.from_pretrained(model_name)
        hidden_size = self.backbone.config.hidden_size

        # Value head: projects final hidden state to scalar reward
        self.value_head = nn.Sequential(
            nn.Dropout(dropout), nn.Linear(hidden_size, 1)
        )

    def forward(
        self,
        input_ids: torch.Tensor,
        attention_mask: torch.Tensor,
    ) -> torch.Tensor:
        """
        Compute reward for input sequences.

        Args:
            input_ids: Token IDs [batch_size, seq_len]
            attention_mask: Attention mask [batch_size, seq_len]

        Returns:
            Tensor: Scalar rewards [batch_size]
        """
        # Get hidden states from backbone
        outputs = self.backbone(
            input_ids=input_ids, attention_mask=attention_mask
        )
        hidden_states = outputs.last_hidden_state  # [batch, seq, hidden]

        # Find position of last non-padding token for each sequence
        # Sum attention mask to get sequence lengths
        seq_lengths = attention_mask.sum(dim=1) - 1  # -1 for 0-indexing
        batch_indices = torch.arange(
            hidden_states.size(0), device=hidden_states.device
        )

        # Extract final token hidden state for each sequence
        final_hidden = hidden_states[
            batch_indices, seq_lengths
        ]  # [batch, hidden]

        # Project to scalar reward
        rewards = self.value_head(final_hidden).squeeze(-1)  # [batch]

        return rewards

Let's verify the architecture outputs the expected shapes.

In[8]:
Code
# Initialize model and tokenizer
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
reward_model = RewardModel(model_name)

# Create a sample input
sample_text = "What is machine learning? Machine learning is a field of AI."
inputs = tokenizer(sample_text, return_tensors="pt", padding=True)

# Forward pass
with torch.no_grad():
    reward = reward_model(inputs["input_ids"], inputs["attention_mask"])
Out[9]:
Console
Input shape: torch.Size([1, 15])
Reward shape: torch.Size([1])
Reward value: 0.1240

The output confirms that the model processes the tokenized sequence and produces a single scalar score (batch size 1, output dimension 1). This scalar represents the reward for the prompt-response pair, which will be used to rank different responses against each other.

Preference Dataset

Now let's create a dataset class that handles preference pairs. Each example contains a prompt with a chosen (preferred) and rejected response.

In[10]:
Code
@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str


class PreferenceDataset(Dataset):
    """Dataset of preference pairs for reward model training."""

    def __init__(
        self, pairs: List[PreferencePair], tokenizer, max_length: int = 256
    ):
        self.pairs = pairs
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx: int):
        pair = self.pairs[idx]

        # Tokenize prompt and chosen response
        # Passing two strings automatically adds the separator token
        chosen_enc = self.tokenizer(
            pair.prompt,
            pair.chosen,
            max_length=self.max_length,
            padding="max_length",
            truncation=True,
            return_tensors="pt",
        )

        # Tokenize prompt and rejected response
        rejected_enc = self.tokenizer(
            pair.prompt,
            pair.rejected,
            max_length=self.max_length,
            padding="max_length",
            truncation=True,
            return_tensors="pt",
        )

        return {
            "chosen_ids": chosen_enc["input_ids"].squeeze(0),
            "chosen_mask": chosen_enc["attention_mask"].squeeze(0),
            "rejected_ids": rejected_enc["input_ids"].squeeze(0),
            "rejected_mask": rejected_enc["attention_mask"].squeeze(0),
        }

Let's create some synthetic preference data for demonstration.

In[11]:
Code
# Synthetic preference pairs for demonstration
preference_pairs = [
    PreferencePair(
        prompt="Explain gravity simply.",
        chosen="Gravity is the force that pulls objects toward each other. The Earth pulls you down, which is why you stay on the ground.",
        rejected="Gravity is described by Einstein's field equations relating the curvature of spacetime to energy-momentum.",
    ),
    PreferencePair(
        prompt="What is photosynthesis?",
        chosen="Photosynthesis is how plants make food from sunlight, water, and carbon dioxide, producing oxygen as a byproduct.",
        rejected="It's a plant thing.",
    ),
    PreferencePair(
        prompt="How do computers work?",
        chosen="Computers process information using tiny electronic switches called transistors that can be on or off, representing 1s and 0s.",
        rejected="Computers work by executing machine code instructions on a von Neumann architecture with fetch-decode-execute cycles.",
    ),
    PreferencePair(
        prompt="Why is the sky blue?",
        chosen="Sunlight contains all colors. Blue light scatters more than other colors when it hits air molecules, so we see blue when we look up.",
        rejected="The sky is blue.",
    ),
    PreferencePair(
        prompt="What causes rain?",
        chosen="Water evaporates from oceans and lakes, rises as vapor, cools in clouds, and falls as rain when droplets get heavy enough.",
        rejected="Rain is caused by the condensation of atmospheric water vapor into droplets when air masses cool below the dew point temperature.",
    ),
]

# Create more training data for the demo by repeating the same pairs (exact copies)
extended_pairs = preference_pairs * 10  # 50 total pairs for training demo

Preference Loss Function

The loss function implements the Bradley-Terry preference model. We compute rewards for both responses and maximize the probability of preferring the chosen response.

In[12]:
Code
def compute_preference_loss(
    reward_model: nn.Module,
    chosen_ids: torch.Tensor,
    chosen_mask: torch.Tensor,
    rejected_ids: torch.Tensor,
    rejected_mask: torch.Tensor,
) -> Tuple[torch.Tensor, dict]:
    """
    Compute the Bradley-Terry preference loss.

    Args:
        reward_model: The reward model
        chosen_ids: Token IDs for chosen responses [batch, seq]
        chosen_mask: Attention mask for chosen [batch, seq]
        rejected_ids: Token IDs for rejected responses [batch, seq]
        rejected_mask: Attention mask for rejected [batch, seq]

    Returns:
        loss: Scalar loss value
        metrics: Dictionary with additional metrics
    """
    # Compute rewards for both responses
    chosen_rewards = reward_model(chosen_ids, chosen_mask)
    rejected_rewards = reward_model(rejected_ids, rejected_mask)

    # Preference loss: -log(sigmoid(r_chosen - r_rejected))
    # Equivalent to: log(1 + exp(r_rejected - r_chosen))
    reward_diff = chosen_rewards - rejected_rewards
    loss = -F.logsigmoid(reward_diff).mean()

    # Compute accuracy (how often chosen reward > rejected reward)
    accuracy = (reward_diff > 0).float().mean()

    # Average reward margin
    margin = reward_diff.mean()

    metrics = {
        "loss": loss.item(),
        "accuracy": accuracy.item(),
        "margin": margin.item(),
        "chosen_reward_mean": chosen_rewards.mean().item(),
        "rejected_reward_mean": rejected_rewards.mean().item(),
    }

    return loss, metrics

Training Loop

Now we implement the complete training loop with logging and validation.

In[13]:
Code
def train_reward_model(
    model: nn.Module,
    train_loader: DataLoader,
    val_loader: Optional[DataLoader],
    num_epochs: int = 3,
    learning_rate: float = 2e-5,
    device: str = "cpu",
) -> dict:
    """
    Train the reward model on preference data.

    Args:
        model: Reward model to train
        train_loader: Training data loader
        val_loader: Validation data loader (optional)
        num_epochs: Number of training epochs
        learning_rate: Learning rate for optimizer
        device: Device to train on

    Returns:
        history: Dictionary with training metrics
    """
    model = model.to(device)
    optimizer = torch.optim.AdamW(
        model.parameters(), lr=learning_rate, weight_decay=0.01
    )

    history = {
        "train_loss": [],
        "train_accuracy": [],
        "val_loss": [],
        "val_accuracy": [],
    }

    for epoch in range(num_epochs):
        # Training phase
        model.train()
        train_metrics = {"loss": 0, "accuracy": 0, "count": 0}

        for batch in train_loader:
            chosen_ids = batch["chosen_ids"].to(device)
            chosen_mask = batch["chosen_mask"].to(device)
            rejected_ids = batch["rejected_ids"].to(device)
            rejected_mask = batch["rejected_mask"].to(device)

            optimizer.zero_grad()
            loss, metrics = compute_preference_loss(
                model, chosen_ids, chosen_mask, rejected_ids, rejected_mask
            )
            loss.backward()

            # Gradient clipping for stability
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()

            train_metrics["loss"] += metrics["loss"] * len(chosen_ids)
            train_metrics["accuracy"] += metrics["accuracy"] * len(chosen_ids)
            train_metrics["count"] += len(chosen_ids)

        # Compute epoch averages
        train_loss = train_metrics["loss"] / train_metrics["count"]
        train_acc = train_metrics["accuracy"] / train_metrics["count"]
        history["train_loss"].append(train_loss)
        history["train_accuracy"].append(train_acc)

        # Validation phase
        if val_loader is not None:
            model.eval()
            val_metrics = {"loss": 0, "accuracy": 0, "count": 0}

            with torch.no_grad():
                for batch in val_loader:
                    chosen_ids = batch["chosen_ids"].to(device)
                    chosen_mask = batch["chosen_mask"].to(device)
                    rejected_ids = batch["rejected_ids"].to(device)
                    rejected_mask = batch["rejected_mask"].to(device)

                    _, metrics = compute_preference_loss(
                        model,
                        chosen_ids,
                        chosen_mask,
                        rejected_ids,
                        rejected_mask,
                    )

                    val_metrics["loss"] += metrics["loss"] * len(chosen_ids)
                    val_metrics["accuracy"] += metrics["accuracy"] * len(
                        chosen_ids
                    )
                    val_metrics["count"] += len(chosen_ids)

            val_loss = val_metrics["loss"] / val_metrics["count"]
            val_acc = val_metrics["accuracy"] / val_metrics["count"]
            history["val_loss"].append(val_loss)
            history["val_accuracy"].append(val_acc)

    return history

Let's train the model on our synthetic preference data.

In[14]:
Code
# Create datasets
train_pairs = extended_pairs[:40]
val_pairs = extended_pairs[40:]

train_dataset = PreferenceDataset(train_pairs, tokenizer, max_length=128)
val_dataset = PreferenceDataset(val_pairs, tokenizer, max_length=128)

train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=8)

# Initialize fresh model for training
reward_model = RewardModel(model_name)

# Train
history = train_reward_model(
    reward_model, train_loader, val_loader, num_epochs=5, learning_rate=2e-5
)
Out[15]:
Console
Training Results:
----------------------------------------
Epoch 1:
  Train Loss: 0.6222, Accuracy: 65.00%
  Val Loss:   0.2893, Accuracy: 100.00%
Epoch 2:
  Train Loss: 0.1785, Accuracy: 100.00%
  Val Loss:   0.0440, Accuracy: 100.00%
Epoch 3:
  Train Loss: 0.0262, Accuracy: 100.00%
  Val Loss:   0.0034, Accuracy: 100.00%
Epoch 4:
  Train Loss: 0.0036, Accuracy: 100.00%
  Val Loss:   0.0007, Accuracy: 100.00%
Epoch 5:
  Train Loss: 0.0007, Accuracy: 100.00%
  Val Loss:   0.0003, Accuracy: 100.00%

The model quickly learns to distinguish between preferred and rejected responses on this small dataset.

Visualizing Training Progress

Out[16]:
Visualization
Reward model training curves showing loss decreasing and accuracy improving over epochs. The model learns to predict human preferences from pairwise comparison data, with validation metrics tracking closely with training metrics.

The training curves show loss dropping rapidly and accuracy reaching 100% within a couple of epochs. Keep in mind that the extended dataset simply repeats the same five pairs, so the validation split contains copies of training examples; the perfect validation accuracy here mainly confirms that the pipeline works rather than demonstrating generalization. On real preference data, expect validation accuracy well below 100% and watch the train-validation gap for signs of overfitting.

Evaluating on New Examples

Let's evaluate the trained model on examples it hasn't seen during training.

In[17]:
Code
def score_response(
    model, tokenizer, prompt: str, response: str, device: str = "cpu"
) -> float:
    """Score a single prompt-response pair."""
    model.eval()
    inputs = tokenizer(
        prompt,
        response,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=128,
    )

    with torch.no_grad():
        reward = model(
            inputs["input_ids"].to(device), inputs["attention_mask"].to(device)
        )
    return reward.item()


# Test on new examples
test_prompt = "What is machine learning?"
test_responses = [
    (
        "Good explanation",
        "Machine learning is when computers learn patterns from data to make predictions without being explicitly programmed.",
    ),
    (
        "Too technical",
        "ML utilizes gradient descent optimization on parameterized function approximators.",
    ),
    ("Too brief", "It's AI stuff."),
]

# Compute scores for all responses
scores = []
for label, response in test_responses:
    score = score_response(reward_model, tokenizer, test_prompt, response)
    scores.append((score, label, response))

# Sort by score (higher is better)
scores.sort(reverse=True)
Out[18]:
Console
Prompt: What is machine learning?

Response Rankings:
------------------------------------------------------------
1. [Good explanation] Score: -3.5671
   "Machine learning is when computers learn patterns from data ..."

2. [Too technical] Score: -3.6675
   "ML utilizes gradient descent optimization on parameterized f..."

3. [Too brief] Score: -3.7383
   "It's AI stuff."

The model ranks the responses in a way that aligns with our training preferences: clear explanations over technical jargon or overly brief answers. The score margins are small, which is expected given the tiny synthetic training set, but the relative ordering is what matters for preference-based ranking.

Analyzing Reward Distributions

A well-calibrated reward model should produce meaningful score separations between good and bad responses. Let's analyze the distribution of rewards on our validation set.

In[19]:
Code
# Collect rewards from validation set
chosen_rewards = []
rejected_rewards = []

# Ensure model is on the correct device
device = next(reward_model.parameters()).device

reward_model.eval()
with torch.no_grad():
    for batch in val_loader:
        # Move inputs to the same device as the model
        chosen_ids = batch["chosen_ids"].to(device)
        chosen_mask = batch["chosen_mask"].to(device)
        rejected_ids = batch["rejected_ids"].to(device)
        rejected_mask = batch["rejected_mask"].to(device)

        chosen_r = reward_model(chosen_ids, chosen_mask)
        rejected_r = reward_model(rejected_ids, rejected_mask)

        # Move to CPU for logging/plotting
        chosen_rewards.extend(chosen_r.cpu().numpy().tolist())
        rejected_rewards.extend(rejected_r.cpu().numpy().tolist())
Out[20]:
Visualization
Distribution of reward scores for chosen (preferred) vs. rejected responses. A well-trained model produces clearly separated distributions, with chosen responses consistently scoring higher than rejected ones.

The clear separation between distributions indicates the model has learned meaningful preference distinctions.

Key Parameters

The key parameters for the Reward Model implementation are:

  • model_name: The pretrained backbone (e.g., "distilbert-base-uncased"). Smaller models allow for faster iteration during experimentation.
  • dropout: Regularization applied to the value head (set to 0.1) to prevent overfitting on the small preference dataset.
  • learning_rate: A conservative rate (2e-5) is used to fine-tune the backbone without destroying pretrained features.
  • batch_size: Set to 8 for this demonstration, though larger batches are preferred for stability in full-scale training.
  • num_epochs: Training is limited to 5 epochs to avoid overfitting on the small synthetic dataset.

Limitations and Impact

Reward modeling is a powerful technique but comes with significant challenges that you must understand.

The Proxy Problem

The fundamental limitation of reward models is that they are proxies for human preferences, not perfect representations. The model learns from a finite set of comparisons made by a specific group of annotators under particular conditions. It cannot generalize perfectly to all possible responses or capture the full complexity of human values. When the policy model optimizes against this learned reward, it may find responses that score highly according to the proxy but don't actually satisfy human preferences. This phenomenon, known as reward hacking or Goodhart's Law in action, becomes more severe as optimization pressure increases. We'll explore this challenge in depth in the next chapter.

Annotation Quality and Consistency

Reward model quality is bounded by the quality of the underlying preference data. Human annotators disagree, make mistakes, and have biases. If 70% of annotators prefer response A over B, the "correct" label is somewhat arbitrary. The reward model learns from these noisy, inconsistent signals, and this uncertainty propagates into the learned reward function. Different annotator pools may have systematically different preferences based on cultural background, expertise, or task understanding. A reward model trained on one population may not generalize to another.

Distribution Shift

During RLHF training, the policy model generates responses that may differ substantially from those in the reward model's training set. The reward model must extrapolate to these out-of-distribution samples, and its predictions become less reliable. This creates a feedback loop: the policy learns to generate responses that score well according to the reward model's potentially incorrect extrapolations, which can lead to degraded actual quality even as measured rewards increase.

Computational Costs

Training reward models requires significant computational resources, particularly when using large base models to ensure the reward model has sufficient language understanding. During RL training, the reward model must evaluate every generated response, adding substantial inference costs to an already expensive training procedure.

Impact on RLHF Systems

Despite these limitations, reward models have enabled major advances in language model alignment. They provide the crucial bridge between sparse human feedback and dense training signals. The InstructGPT system that powers ChatGPT, Anthropic's Claude models, and many open-source chat models all rely on reward models as a core component.

The reward modeling approach has also influenced research directions, spurring work on direct preference optimization (DPO) methods that eliminate the need for explicit reward models, as well as techniques for reward model ensembles, uncertainty quantification, and robust optimization. Understanding reward modeling deeply is essential for grasping both current RLHF systems and the alternatives being developed to address its limitations.

Summary

This chapter covered the complete pipeline for building reward models that learn to predict human preferences.

Architecture: Reward models add a value head to pretrained transformers, projecting the final token's hidden state to a scalar reward. Using the same architecture family as the policy model ensures compatible representations.

Loss function: The Bradley-Terry preference loss \mathcal{L} = -\log \sigma(r_w - r_l) maximizes the probability of correctly ranking preference pairs. Gradients naturally emphasize uncertain or incorrect predictions.

Training: Conservative hyperparameters preserve pretrained knowledge while learning preference structure. Early stopping, gradient clipping, and appropriate regularization prevent overfitting to limited preference data.

Evaluation: Pairwise accuracy measures ranking performance, with human agreement providing an upper bound. Testing for biases like length preference or style over substance reveals potential weaknesses.

The reward model serves as the critical interface between human preferences and policy optimization. In the following chapters, we'll examine how reward hacking can undermine this proxy relationship, and then explore how policy gradient methods and PPO use reward signals to improve language models.

Quiz

Ready to test your understanding? Take this quick quiz to reinforce what you've learned about reward modeling for RLHF.

