
Bradley-Terry Model
In the previous chapter, we explored how human preference data is collected through pairwise comparisons: annotators see two model responses and indicate which one they prefer. But raw preference counts aren't directly useful for training. If response A was preferred over response B in 73% of comparisons, and response B was preferred over response C in 68% of comparisons, how much better is A than C? How do we convert these pairwise judgments into consistent quality scores that can guide model optimization?
The Bradley-Terry model, developed by Ralph Bradley and Milton Terry in 1952, provides an elegant statistical framework for exactly this problem. Originally designed for ranking chess players and sports teams, it assumes each item has a latent "strength" parameter, and the probability of one item beating another depends only on the ratio of their strengths. This simple assumption leads to a powerful model that converts noisy pairwise comparisons into consistent, transitive rankings.
For LLM alignment, the Bradley-Terry model is foundational. As we'll see in the upcoming chapter on Reward Modeling, the reward model training objective is directly derived from the Bradley-Terry likelihood. Understanding this model deeply will clarify why reward models are trained the way they are, and why later techniques like Direct Preference Optimization (DPO) can eliminate the need for explicit reward modeling altogether.
The Pairwise Comparison Framework
Consider a setting where we have $n$ items (in our case, model responses) that we want to rank based on human preferences. We don't observe quality scores directly; instead, we observe the outcomes of pairwise comparisons. When comparing items $i$ and $j$, a human annotator chooses one as better. This setup reflects how humans naturally make judgments: rather than assigning absolute quality scores on some numerical scale, people find it far easier to simply say which of two options they prefer. The challenge lies in synthesizing these individual binary decisions into a coherent global ranking.
A pairwise comparison is a judgment where an evaluator observes two items and selects which one is superior according to some criterion. The comparison produces a binary outcome: either item $i$ wins or item $j$ wins.
The key insight of the Bradley-Terry model is that we can explain these observed preferences through latent strength parameters. Each item $i$ has an associated strength $\lambda_i$ that represents its underlying quality. We never observe these strengths directly. Instead, we must infer them from the pattern of wins and losses across many comparisons. Items with higher strength values are better, and when two items are compared, the one with higher strength is more likely to win, but not guaranteed to win, which accounts for noise in human judgments. This probabilistic view is essential: it acknowledges that human preferences contain randomness, whether from inconsistency, variation in attention, or genuine differences of opinion among annotators.
The fundamental assumption is that the probability of item $i$ being preferred over item $j$ depends only on the ratio of their strengths:

$$P(i \succ j) = \frac{\lambda_i}{\lambda_i + \lambda_j}$$

where:
- $P(i \succ j)$: the probability that item $i$ is preferred over item $j$
- $\lambda_i, \lambda_j$: the latent strength parameters of items $i$ and $j$ (must be positive)
This formula has an intuitive interpretation that makes it easy to reason about. Consider first the case where $\lambda_i = \lambda_j$: substituting into the formula gives $P(i \succ j) = \lambda_i / (2\lambda_i) = 0.5$. This confirms our intuition that equally strong items have equal probability of being chosen. Now consider a more extreme case: if $\lambda_i = 3\lambda_j$, then $P(i \succ j) = 3\lambda_j / (3\lambda_j + \lambda_j) = 0.75$. An item three times as strong wins 75% of the time. Notice that the absolute magnitudes of the strengths do not matter, only their ratio. Whether we have strengths of 3 and 1, or 300 and 100, the predicted win probability remains the same. This ratio-based formulation captures the intuitive notion that what matters is relative quality, not absolute quality.
The Log-Linear Form
While the ratio formulation is intuitive, a more convenient parameterization uses log-strengths. This transformation opens up powerful connections to logistic regression and makes optimization considerably easier. Let $\beta_i = \log \lambda_i$. Since strengths must be positive, the log-strength can be any real number, which is mathematically more convenient for optimization algorithms. We can now derive the equivalent expression in terms of these log-strengths:

$$P(i \succ j) = \frac{e^{\beta_i}}{e^{\beta_i} + e^{\beta_j}} = \frac{1}{1 + e^{-(\beta_i - \beta_j)}} = \sigma(\beta_i - \beta_j)$$

where:
- $\beta_i, \beta_j$: the log-strength parameters of items $i$ and $j$
- $\lambda_i, \lambda_j$: the latent strength parameters ($\lambda_i = e^{\beta_i}$)
- $\sigma(z) = \frac{1}{1 + e^{-z}}$: the sigmoid function
The function $\sigma(z) = \frac{1}{1 + e^{-z}}$ is the standard sigmoid familiar from logistic regression and neural network activations (as covered in Part VII on Activation Functions). The sigmoid function has several properties that make it ideal for representing probabilities. It maps any real number to the interval $(0, 1)$, ensuring valid probabilities. It is symmetric around zero, meaning $\sigma(-z) = 1 - \sigma(z)$, which captures the natural symmetry of pairwise comparisons: the probability that $i$ beats $j$ equals one minus the probability that $j$ beats $i$. The sigmoid is also smooth and differentiable everywhere, which enables gradient-based optimization.
This log-linear form reveals that the Bradley-Terry model is equivalent to logistic regression where the probability of $i$ beating $j$ depends on the difference in their log-strengths. The sigmoid function ensures probabilities stay between 0 and 1, with the midpoint at 0.5 when $\beta_i = \beta_j$. This equivalence to logistic regression is more than a mathematical curiosity: it means we can leverage decades of research on logistic regression, including efficient optimization algorithms, regularization techniques, and statistical theory about confidence intervals and hypothesis tests.
The probability that item $i$ is preferred over item $j$ in the Bradley-Terry model is:

$$P(i \succ j) = \sigma(\beta_i - \beta_j) = \frac{1}{1 + e^{-(\beta_i - \beta_j)}}$$

where:
- $\beta_i, \beta_j$: the log-strength parameters of the two items
- $\sigma$: the sigmoid function mapping the difference to a probability
Preference Strength and Transitivity
A crucial property of the Bradley-Terry model is that it produces transitive preferences. Transitivity means that if A is better than B, and B is better than C, then A must be better than C. If we know how often A beats B, and how often B beats C, the model automatically determines how often A beats C. This follows directly from the parameterization, and understanding why reveals something deep about the model's structure.
The transitivity arises because each item has a single strength parameter that governs all of its comparisons. When we estimate $\beta_A$, $\beta_B$, and $\beta_C$, these values must simultaneously explain the A-versus-B comparisons, the B-versus-C comparisons, and the A-versus-C comparisons. The model cannot assign different "strengths" to A depending on which item it is facing. This constraint forces global consistency.
Consider three items with log-strengths $\beta_A = 2$, $\beta_B = 1$, and $\beta_C = 0$. The pairwise probabilities are:

$$P(A \succ B) = \sigma(2 - 1) = \sigma(1) \approx 0.731$$
$$P(B \succ C) = \sigma(1 - 0) = \sigma(1) \approx 0.731$$
$$P(A \succ C) = \sigma(2 - 0) = \sigma(2) \approx 0.881$$

where:
- $\beta_A = 2$, $\beta_B = 1$, $\beta_C = 0$: the example log-strengths
- $P(A \succ B)$: the probability A beats B
- $\sigma$: the sigmoid function applied to strength differences
Notice that even though A and B have the same preference probability over their respective "next" item (both beat their weaker neighbor about 73% of the time), A has a much higher probability of beating C than B does of beating C. The strengths compose properly: the gap between A and C is the sum of the gaps between A and B and between B and C. Specifically, $\beta_A - \beta_C = (\beta_A - \beta_B) + (\beta_B - \beta_C) = 1 + 1 = 2$, so $P(A \succ C) = \sigma(2) \approx 0.881$. This additive property of log-strengths is what makes the model so elegant. Strength differences accumulate predictably across chains of comparisons.
This transitivity is a feature, not a bug. It forces the model to find a consistent ranking that best explains all observed comparisons, even when individual comparisons contain noise or contradictions. If humans sometimes prefer B over A, the model can still conclude A is better overall if A wins more often than it loses. The model essentially averages over the noise to find the underlying signal. However, as we will discuss later, this built-in transitivity can also be a limitation when human preferences genuinely violate transitivity.
The Bradley-Terry Likelihood
To estimate the strength parameters from observed comparisons, we use maximum likelihood estimation. The goal is to find the parameter values that make the observed data most probable. This is a principled statistical approach: among all possible strength assignments, we seek the one that best explains what we actually saw happen in the comparisons.
Suppose we have a dataset of $N$ pairwise comparisons, where comparison $k$ involves items $w_k$ and $l_k$, and item $w_k$ was chosen as the winner. The key assumption is that each comparison is independent, which is reasonable if different annotators make different comparisons, or if the same annotator's judgments don't depend on what they saw before.
The likelihood of observing this data under the Bradley-Terry model is:

$$L(\beta) = \prod_{k=1}^{N} P(w_k \succ l_k) = \prod_{k=1}^{N} \sigma(\beta_{w_k} - \beta_{l_k})$$

where:
- $L(\beta)$: the likelihood of the observed dataset given parameters $\beta$
- $N$: the total number of pairwise comparisons
- $w_k, l_k$: indices of the winning and losing items in the $k$-th comparison
- $\sigma$: the sigmoid function mapping the strength difference to a probability
The product form arises from independence: the probability of seeing all the comparisons we saw is the product of the probabilities of each individual comparison. Each term in the product is the Bradley-Terry probability that the actual winner wins, given the current strength parameters. The goal of maximum likelihood estimation is to find the parameter values $\beta$ that make this product as large as possible.
Taking the log-likelihood converts the product into a sum, which is more convenient both analytically and computationally:

$$\ell(\beta) = \sum_{k=1}^{N} \log \sigma(\beta_{w_k} - \beta_{l_k})$$

where:
- $\ell(\beta)$: the log-likelihood function to be maximized
- $N$: the total number of comparisons
- $w_k, l_k$: indices of the winning and losing items in comparison $k$
- $\beta_{w_k}, \beta_{l_k}$: log-strength parameters for the winner and loser
- $\log \sigma(z) = -\log(1 + e^{-z})$: the log-sigmoid, equal to the negative softplus of $-z$
This is the negative of the binary cross-entropy loss where we're predicting that the winner should win. The connection to binary cross-entropy is important: it means that training a Bradley-Terry model is computationally identical to training a logistic regression classifier. The maximum likelihood estimates are found by maximizing this expression with respect to $\beta$, typically using gradient-based optimization methods like gradient descent or L-BFGS.
The Identifiability Issue
There's a subtlety that arises when we try to estimate Bradley-Terry parameters: the model is invariant to shifting all parameters by the same amount. If we add a constant $c$ to every log-strength, the probabilities don't change:

$$\sigma\bigl((\beta_i + c) - (\beta_j + c)\bigr) = \sigma(\beta_i - \beta_j)$$

where:
- $c$: an arbitrary constant added to all log-strength parameters
- $\beta_i, \beta_j$: the log-strength parameters
- $\sigma$: the sigmoid function
This invariance occurs because the sigmoid only depends on differences between parameters, not on their absolute values. The constant cancels out in the subtraction. Mathematically, this means the parameters are not uniquely identifiable; there are infinitely many solutions that produce the same likelihood. If $(\beta_1, \ldots, \beta_n)$ is a solution, then so is $(\beta_1 + 1, \ldots, \beta_n + 1)$, and so is $(\beta_1 - 100, \ldots, \beta_n - 100)$. All these solutions make identical predictions.
To resolve this ambiguity, we typically fix one parameter (e.g., $\beta_1 = 0$) or add a constraint like $\sum_i \beta_i = 0$. This anchoring doesn't affect the relative rankings or prediction probabilities. It simply pins down the arbitrary location of the parameter scale so that optimization algorithms converge to a unique solution. The choice of anchor is arbitrary: we could anchor any item at any value, and the relative differences would remain the same.
Connection to Elo Ratings
If the Bradley-Terry model sounds familiar, it should. The Elo rating system, used to rank chess players and now common in competitive gaming, is a Bradley-Terry model in disguise. Understanding this connection illuminates both systems and explains why Elo ratings have become standard for comparing language models.
Elo ratings are typically scaled so that a 400-point difference corresponds to a 10:1 expected score ratio. This scaling was chosen for historical and practical reasons: it produces ratings that are easy to interpret and compare. A rating difference of 400 points means the stronger player is expected to score about 10 points for every 1 point the weaker player scores over many games.
In standard Elo, the expected score of player $A$ against player $B$ is:

$$E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}$$

where:
- $E_A$: the expected score (win probability) of player $A$
- $R_A, R_B$: the Elo ratings of players $A$ and $B$
- $400$: the scaling factor, so that a difference of 400 points implies a 10:1 odds ratio
The use of base-10 exponentials rather than natural exponentials is purely conventional: Arpad Elo chose this scaling when he developed the system in the 1960s because it produced convenient numbers for chess ratings.
We can rewrite this in terms of the sigmoid function by converting the base-10 exponential to base $e$:

$$E_A = \frac{1}{1 + e^{\ln(10)\,(R_B - R_A)/400}} = \sigma\!\left(\frac{\ln 10}{400}\,(R_A - R_B)\right)$$

where:
- $E_A$: the expected score (win probability) for player $A$
- $\ln 10 \approx 2.303$: the natural logarithm of 10
- $\sigma$: the standard sigmoid function
This derivation demonstrates that the Elo rating is just a scaled version of the Bradley-Terry log-strength parameter. Specifically, if $\beta_i$ is the Bradley-Terry parameter, then the corresponding Elo rating is $R_i = \frac{400}{\ln 10}\,\beta_i \approx 173.7\,\beta_i$ (up to an arbitrary additive constant). The mathematical structure is identical; only the units differ.
The connection matters because Elo-style ratings are now used to evaluate LLMs. Platforms like LMSYS Chatbot Arena use Elo ratings computed from human preference comparisons to rank models. When you vote for which model response you prefer, these votes become pairwise comparisons that update the models' Elo ratings. The Bradley-Terry framework provides the statistical foundation for these rankings, ensuring that the ratings are consistent and interpretable.
Worked Example
Let's work through a concrete example to solidify these concepts and see how the mathematics plays out with real numbers. Suppose we have three model responses (A, B, C) and the following comparison outcomes:
| Comparison | Winner | Occurrences |
|---|---|---|
| A vs B | A | 7 |
| A vs B | B | 3 |
| B vs C | B | 8 |
| B vs C | C | 2 |
| A vs C | A | 9 |
| A vs C | C | 1 |
We'll set $\beta_C = 0$ as our anchor. This choice is arbitrary, but it resolves the identifiability issue and allows us to interpret the other parameters as "how much better than C" each item is. The log-likelihood function aggregates all our comparison data:

$$\ell(\beta_A, \beta_B) = 7\log\sigma(\beta_A - \beta_B) + 3\log\sigma(\beta_B - \beta_A) + 8\log\sigma(\beta_B - \beta_C) + 2\log\sigma(\beta_C - \beta_B) + 9\log\sigma(\beta_A - \beta_C) + 1\log\sigma(\beta_C - \beta_A)$$

where:
- $\ell(\beta_A, \beta_B)$: the log-likelihood function aggregating all comparisons
- $7, 3, 8, 2, 9, 1$: the observed counts (e.g., A beats B 7 times)
- $\sigma$: the sigmoid function mapping strength differences to probabilities
- $\beta_A, \beta_B$: the unknown log-strength parameters
- $\beta_C$: the fixed anchor value for item C ($\beta_C = 0$)
Each term in this sum corresponds to a set of comparisons. The first term, $7\log\sigma(\beta_A - \beta_B)$, contributes to the likelihood every time A beats B, which happened 7 times. The second term accounts for the 3 times B beat A. Notice how the symmetry of preferences emerges: when A beats B, we use $\sigma(\beta_A - \beta_B)$, and when B beats A, we use $\sigma(\beta_B - \beta_A) = 1 - \sigma(\beta_A - \beta_B)$.
To find the maximum, we take derivatives and set them to zero. We rely on the derivative property of the log-sigmoid function: $\frac{d}{dz}\log\sigma(z) = 1 - \sigma(z) = \sigma(-z)$. This elegant result comes from the chain rule and the fact that $\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$.
Applying the chain rule to the terms involving $\beta_A$:
- When A wins (e.g., against B): The term is $\log\sigma(\beta_A - \beta_B)$. The derivative with respect to $\beta_A$ is $1 - \sigma(\beta_A - \beta_B) = \sigma(\beta_B - \beta_A)$.
- When A loses (e.g., against B): The term is $\log\sigma(\beta_B - \beta_A)$. The derivative with respect to $\beta_A$ is $-\bigl(1 - \sigma(\beta_B - \beta_A)\bigr)$, which simplifies to $-\sigma(\beta_A - \beta_B)$.
Substituting the counts into the gradient equation for $\beta_A$:

$$\frac{\partial \ell}{\partial \beta_A} = 7\,\sigma(\beta_B - \beta_A) - 3\,\sigma(\beta_A - \beta_B) + 9\,\sigma(\beta_C - \beta_A) - 1\,\sigma(\beta_A - \beta_C) = 0$$

where:
- $\frac{\partial \ell}{\partial \beta_A}$: the gradient of the log-likelihood with respect to $\beta_A$
- $\beta_A, \beta_B, \beta_C$: the log-strength parameters
- $\sigma$: the sigmoid function
- $7\,\sigma(\beta_B - \beta_A)$, $9\,\sigma(\beta_C - \beta_A)$: the derivative contributions from the comparisons A won
- $-3\,\sigma(\beta_A - \beta_B)$, $-1\,\sigma(\beta_A - \beta_C)$: the derivative contributions from the comparisons A lost
Using the property that $\sigma(\beta_A - \beta_B) = 1 - \sigma(\beta_B - \beta_A)$ and $\sigma(\beta_A - \beta_C) = 1 - \sigma(\beta_C - \beta_A)$, this condition simplifies to:

$$10\,\sigma(\beta_B - \beta_A) + 10\,\sigma(\beta_C - \beta_A) = 4$$

where:
- $\sigma(\beta_B - \beta_A)$: the probability that B beats A
- $\sigma(\beta_C - \beta_A)$: the probability that C beats A (equivalent to $1 - \sigma(\beta_A - \beta_C)$)
- $\beta_A, \beta_B, \beta_C$: the log-strength parameters
- $\sigma$: the sigmoid function

In words, the maximum likelihood estimate chooses $\beta_A$ so that A's expected number of losses across its 20 comparisons equals its 4 observed losses.
The gradient equations don't have a closed-form solution, so we use iterative optimization. Unlike linear regression, where we can solve for parameters directly, the nonlinearity of the sigmoid means we must use numerical methods. Let's implement this in code.
Code Implementation
We'll implement Bradley-Terry estimation from scratch using gradient descent, then show how it's applied in practice.
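The sketches that follow assume a plain Python environment with NumPy, SciPy, and Matplotlib; names such as `comparisons` and `neg_log_likelihood` are illustrative rather than taken from any particular library. We start by encoding the comparison counts from the table above:

```python
import numpy as np
from scipy.special import expit      # numerically stable sigmoid
from scipy.optimize import minimize

# Items are indexed 0 = A, 1 = B, 2 = C.
# Each entry is (winner_index, loser_index, count), taken from the table above.
comparisons = [
    (0, 1, 7),  # A beat B 7 times
    (1, 0, 3),  # B beat A 3 times
    (1, 2, 8),  # B beat C 8 times
    (2, 1, 2),  # C beat B 2 times
    (0, 2, 9),  # A beat C 9 times
    (2, 0, 1),  # C beat A once
]
n_items = 3
```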
Now let's define the negative log-likelihood function that we want to minimize:
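A minimal version of the objective, with item C anchored at zero so that only $\beta_A$ and $\beta_B$ are free parameters:

```python
def neg_log_likelihood(free_params, comparisons):
    """Negative Bradley-Terry log-likelihood with beta_C fixed at 0."""
    beta = np.append(free_params, 0.0)  # [beta_A, beta_B, beta_C = 0]
    nll = 0.0
    for winner, loser, count in comparisons:
        diff = beta[winner] - beta[loser]
        # -log sigma(diff) written as log(1 + exp(-diff)) for numerical stability
        nll += count * np.logaddexp(0.0, -diff)
    return nll
```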
We also need the gradient for efficient optimization:
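The gradient mirrors the derivative worked out above: each comparison contributes the probability of the opposite outcome, pushing the winner's parameter up and the loser's down. (The analytic gradient is optional; the optimizer could also approximate it numerically.)

```python
def neg_log_likelihood_grad(free_params, comparisons):
    """Gradient of the negative log-likelihood w.r.t. (beta_A, beta_B)."""
    beta = np.append(free_params, 0.0)
    grad = np.zeros_like(beta)
    for winner, loser, count in comparisons:
        # d/d(beta_w)[-log sigma(beta_w - beta_l)] = -sigma(beta_l - beta_w)
        p_upset = expit(beta[loser] - beta[winner])  # probability the loser would win
        grad[winner] -= count * p_upset
        grad[loser] += count * p_upset
    return grad[:-1]  # drop the anchored parameter beta_C
```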
Now let's optimize to find the maximum likelihood estimates:
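Using SciPy's L-BFGS-B optimizer (any gradient-based method would do):

```python
result = minimize(
    neg_log_likelihood,
    x0=np.zeros(2),               # start with all items equally strong
    args=(comparisons,),
    jac=neg_log_likelihood_grad,
    method="L-BFGS-B",
)
beta_A, beta_B = result.x
beta_C = 0.0
print(f"beta_A = {beta_A:.3f}, beta_B = {beta_B:.3f}, beta_C = {beta_C:.3f}")
# With this data the estimates come out around beta_A ~ 2.2 and beta_B ~ 1.4.
```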
The estimated coefficients indicate that Item A is the strongest, followed by Item B, with Item C as the baseline. The optimization successfully converged to these maximum likelihood estimates.
Let's verify these estimates by computing the predicted win probabilities:
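One quick check, assuming the variables from the previous snippet:

```python
beta = {"A": beta_A, "B": beta_B, "C": beta_C}

# Compare model predictions against the observed win rates from the table.
for first, second, observed in [("A", "B", 0.7), ("B", "C", 0.8), ("A", "C", 0.9)]:
    predicted = expit(beta[first] - beta[second])
    print(f"P({first} beats {second}): predicted {predicted:.3f}, observed {observed:.2f}")
```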
The model closely matches the observed frequencies. Now let's convert these to Elo-style ratings for interpretability:
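A possible conversion, using the $400 / \ln 10$ scale derived in the Elo section; the base rating of 1000 is an arbitrary offset chosen here for readability:

```python
ELO_SCALE = 400.0 / np.log(10.0)  # ~173.7 Elo points per unit of log-strength
BASE_RATING = 1000.0              # arbitrary anchor; only rating differences matter

for name, b in beta.items():
    print(f"Item {name}: Elo rating ~ {BASE_RATING + ELO_SCALE * b:.0f}")
```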
The ratings align with our expectations: Item A has the highest rating, reflecting its dominance in the pairwise comparisons. The gap of roughly 145 points between A and B corresponds, through the Elo formula, to an expected win rate of about 70% for A, consistent with its 7-of-10 record against B.
Visualizing the Preference Surface
Let's visualize how the Bradley-Terry probability varies with the strength difference:
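A simple Matplotlib sketch of the win-probability curve:

```python
import matplotlib.pyplot as plt

diffs = np.linspace(-6, 6, 200)   # range of strength differences beta_i - beta_j
win_prob = expit(diffs)

plt.figure(figsize=(7, 4))
plt.plot(diffs, win_prob)
plt.axhline(0.5, linestyle="--", linewidth=0.8)
plt.axvline(0.0, linestyle="--", linewidth=0.8)
plt.xlabel("Strength difference (beta_i - beta_j)")
plt.ylabel("P(i beats j)")
plt.title("Bradley-Terry win probability")
plt.tight_layout()
plt.show()
```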
The curve shows the characteristic sigmoid shape. When strengths are equal ($\beta_i = \beta_j$), the win probability is exactly 0.5. Small strength differences near zero produce probability changes that are close to linear, while large differences saturate toward 0 or 1.
Handling Real Preference Data
In practice, preference data comes as individual comparison records rather than aggregated counts. Let's implement a more realistic version that handles raw comparison logs:
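One way to package this, reusing the imports above; `fit_bradley_terry` and its arguments are illustrative names, and the optional L2 penalty is a pragmatic addition for sparse data rather than part of the classical model:

```python
def fit_bradley_terry(records, n_items, anchor=0, l2=0.0):
    """Fit Bradley-Terry log-strengths from raw (winner, loser) comparison records.

    records : iterable of (winner_index, loser_index) pairs, one per comparison
    n_items : total number of items being compared
    anchor  : index of the item whose log-strength is fixed at 0
    l2      : optional L2 penalty for stability when some items have few comparisons
    """
    records = np.asarray(list(records), dtype=int)
    free = [i for i in range(n_items) if i != anchor]

    def unpack(free_params):
        beta = np.zeros(n_items)
        beta[free] = free_params
        return beta

    def nll(free_params):
        beta = unpack(free_params)
        diffs = beta[records[:, 0]] - beta[records[:, 1]]
        return np.logaddexp(0.0, -diffs).sum() + l2 * np.sum(free_params ** 2)

    def grad(free_params):
        beta = unpack(free_params)
        diffs = beta[records[:, 0]] - beta[records[:, 1]]
        p_upset = expit(-diffs)                 # probability each recorded loser would win
        g = np.zeros(n_items)
        np.add.at(g, records[:, 0], -p_upset)   # d(NLL)/d(beta_winner) terms
        np.add.at(g, records[:, 1], p_upset)    # d(NLL)/d(beta_loser) terms
        return g[free] + 2.0 * l2 * free_params

    result = minimize(nll, x0=np.zeros(n_items - 1), jac=grad, method="L-BFGS-B")
    return unpack(result.x)
```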
Let's test this implementation with simulated preference data:
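A small simulation, assuming the `fit_bradley_terry` helper above; the ground-truth parameter values here are made up purely for illustration:

```python
rng = np.random.default_rng(0)

# Ground-truth log-strengths for five hypothetical items (item 0 is the anchor at 0).
true_beta = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
n_items = len(true_beta)

# Simulate 2,000 comparisons between random pairs under the Bradley-Terry model.
records = []
for _ in range(2000):
    i, j = rng.choice(n_items, size=2, replace=False)
    if rng.random() < expit(true_beta[i] - true_beta[j]):
        records.append((i, j))   # item i won
    else:
        records.append((j, i))   # item j won

estimated_beta = fit_bradley_terry(records, n_items, anchor=0)
for idx, (t, e) in enumerate(zip(true_beta, estimated_beta)):
    print(f"Item {idx}: true beta = {t:.2f}, estimated beta = {e:.2f}")
```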
The estimated parameters closely recover the true values, demonstrating that the Bradley-Terry model can reliably learn latent strengths from noisy pairwise comparisons.
Visualizing Parameter Recovery
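A scatter of estimated against true log-strengths, continuing from the simulation above, makes the recovery easy to inspect; points should fall near the diagonal:

```python
plt.figure(figsize=(5, 5))
plt.scatter(true_beta, estimated_beta)
lims = [float(true_beta.min()) - 0.2, float(true_beta.max()) + 0.2]
plt.plot(lims, lims, linestyle="--", linewidth=0.8)  # y = x reference line
plt.xlabel("True log-strength")
plt.ylabel("Estimated log-strength")
plt.title("Bradley-Terry parameter recovery")
plt.tight_layout()
plt.show()
```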
Connection to Reward Modeling
The Bradley-Terry model is the foundation of reward model training in RLHF. Understanding this connection reveals why reward models are trained the way they are and illuminates the mathematical underpinnings of alignment techniques. Recall from the previous chapter that preference data consists of pairs $(x, y_w, y_l)$ where $y_w$ is the preferred response and $y_l$ is the dispreferred response to the same prompt $x$.
A reward model assigns a scalar score $r_\phi(x, y)$ to each prompt-response pair. This score represents the model's estimate of how good the response $y$ is for that particular prompt. The key insight is that we can interpret this reward as a Bradley-Terry strength parameter: the probability that response $y_w$ is preferred over $y_l$ should follow the Bradley-Terry formula:

$$P(y_w \succ y_l \mid x) = \sigma\bigl(r_\phi(x, y_w) - r_\phi(x, y_l)\bigr)$$

where:
- $y_w, y_l$: the preferred (winning) and dispreferred (losing) responses
- $x$: the prompt or context
- $r_\phi(x, y)$: the reward model parameterized by weights $\phi$
- $\sigma$: the sigmoid function converting the reward difference into a probability
This formulation says that the probability of preferring one response over another depends on the difference in their reward scores, passed through a sigmoid. Responses with higher rewards are more likely to be preferred, but the relationship is probabilistic, allowing for noise in human judgments. Training the reward model to maximize this probability is equivalent to Bradley-Terry maximum likelihood estimation, except now the strength parameters are neural network outputs rather than fixed scalars. Instead of learning one number per item, we learn a function that can assign rewards to any prompt-response pair.
The loss function for reward model training is:

$$\mathcal{L}(\phi) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\Bigl[\log \sigma\bigl(r_\phi(x, y_w) - r_\phi(x, y_l)\bigr)\Bigr]$$

where:
- $\mathcal{L}(\phi)$: the loss function to minimize (negative log-likelihood)
- $\mathcal{D}$: the dataset of preference pairs
- $\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}$: the expectation over samples drawn from the dataset
- $\sigma$: the sigmoid function
- $r_\phi(x, y)$: the reward model output for prompt $x$ and response $y$
- $y_w, y_l$: the preferred and dispreferred responses
This is exactly the Bradley-Terry negative log-likelihood, with the expectation taken over the preference dataset. The connection is direct: minimizing this loss trains the neural network to produce reward scores that best explain the observed human preferences according to the Bradley-Terry model. We'll explore the full details of reward model architecture and training in the next chapter, but understanding Bradley-Terry makes the training objective transparent: we're fitting a neural network to predict comparison outcomes.
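To make the correspondence concrete, here is a minimal PyTorch-style sketch of this loss; `reward_model` and the batch variables are hypothetical stand-ins rather than the API of any specific RLHF library:

```python
import torch
import torch.nn.functional as F

def bradley_terry_reward_loss(reward_model, prompts, chosen, rejected):
    """Pairwise Bradley-Terry loss for reward model training.

    reward_model : callable mapping (prompts, responses) batches to scalar rewards
    prompts      : batch of prompts x
    chosen       : batch of preferred responses y_w
    rejected     : batch of dispreferred responses y_l
    """
    r_chosen = reward_model(prompts, chosen)      # r_phi(x, y_w)
    r_rejected = reward_model(prompts, rejected)  # r_phi(x, y_l)
    # -log sigma(r_w - r_l), averaged over the preference batch
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```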
Limitations and Impact
The Bradley-Terry model makes strong assumptions that don't always hold in practice. The most significant is the independence from irrelevant alternatives: the probability that A beats B doesn't depend on what other items exist in the comparison set. This means if you're comparing responses A and B, introducing a third response C shouldn't change how humans compare A to B. In reality, context effects like anchoring and decoy effects violate this assumption.
The model also assumes preferences are governed by a single latent dimension: each item has one "strength" value that determines all its comparisons. Human preferences for text are often multidimensional: a response might be more helpful but less safe, or more creative but less accurate. A single scalar reward compresses these tradeoffs into one number, potentially losing important information. Research into multi-objective RLHF attempts to address this limitation by learning separate reward models for different attributes.
Another challenge is intransitivity in human preferences. The Bradley-Terry model enforces transitivity by construction: if A is stronger than B and B is stronger than C, then A must be stronger than C. But humans sometimes exhibit cycles: preferring A over B, B over C, yet C over A. These cycles can arise from different annotators having different preferences, or from the same annotator weighing different factors in different comparisons. The Bradley-Terry model smooths over these inconsistencies, which is usually desirable but can mask systematic disagreements.
Despite these limitations, the Bradley-Terry model has had enormous impact on LLM alignment. Its simplicity makes it computationally tractable to train reward models on millions of preference comparisons. The resulting reward scores provide the training signal for PPO and other policy optimization algorithms. More recently, the mathematical connection between reward models and the Bradley-Terry likelihood enabled Direct Preference Optimization, which we'll cover later in this part. DPO derives an implicit reward model from the policy itself, eliminating the need for separate reward model training while still optimizing the same Bradley-Terry objective.
Summary
The Bradley-Terry model provides a principled framework for converting pairwise comparison data into consistent quality scores. Its core assumptions are:
- Each item $i$ has a latent strength parameter $\lambda_i > 0$ (equivalently, a log-strength $\beta_i = \log \lambda_i$)
- The probability that item $i$ beats item $j$ is $P(i \succ j) = \lambda_i / (\lambda_i + \lambda_j) = \sigma(\beta_i - \beta_j)$
- Parameters are learned by maximum likelihood estimation
Key insights from this chapter:
- Log-linear form: The Bradley-Terry probability is a sigmoid of the strength difference, making it equivalent to logistic regression on pairwise comparisons
- Transitivity by design: The model automatically produces transitive rankings, which is both a feature (consistency) and a limitation (can't represent true intransitivity)
- Identifiability: Only differences in strength matter, so we need to anchor one parameter
- Connection to Elo: The Elo rating system is a scaled Bradley-Terry model
- Foundation for reward modeling: The reward model training objective is exactly the Bradley-Terry likelihood with neural network-parameterized strengths
In the next chapter on Reward Modeling, we'll see how this framework scales to neural networks that can generalize across prompts and responses, providing the dense training signal needed for RLHF.