
RLHF Foundations: Learning from Human Preferences in Reinforcement Learning

Michael Brenndoerfer · November 2, 2025 · 13 min read · 3,088 words

A comprehensive guide to preference-based learning, the framework developed by Christiano et al. in 2017 that enabled reinforcement learning agents to learn from human preferences. Learn how this foundational work established RLHF principles that became essential for aligning modern language models.


This article is part of the free-to-read History of Language AI book


2017: RLHF Foundations

In 2017, researchers at OpenAI and DeepMind faced a fundamental challenge in reinforcement learning: how could agents learn complex behaviors when designing reward functions was difficult or impossible? Traditional reinforcement learning required engineers to specify exact reward functions that would guide agents toward desired behaviors. However, for many tasks, especially those involving natural language, robotics, or complex human preferences, crafting precise reward functions proved impractical or misleading. Researchers Paul Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei addressed this problem by developing a method that allowed reinforcement learning agents to learn from human preferences rather than predefined reward functions.

The work, published in a paper titled "Deep Reinforcement Learning from Human Preferences," introduced a framework where human evaluators could provide pairwise comparisons between agent behaviors, and a reward model learned from these comparisons could guide the reinforcement learning process. This approach shifted the paradigm from requiring engineers to encode complex preferences into mathematical reward functions to allowing humans to express their preferences through simple comparisons. The method proved remarkably effective, enabling agents to learn behaviors that aligned closely with human intentions even when those intentions were difficult to formalize mathematically.

The significance of this development extended far beyond the specific reinforcement learning tasks explored in the original paper. The framework established core principles and techniques that would later become essential for aligning large language models with human values. When researchers began training language models that could generate coherent text, they encountered a similar problem: standard training objectives like maximum likelihood estimation could produce models that were fluent but didn't match human preferences for helpfulness, honesty, or safety. The preference learning approach developed in 2017 would provide the foundation for addressing this alignment problem.

The method demonstrated that learning from human feedback could scale effectively. Rather than requiring humans to manually craft reward functions for every possible scenario, the preference-based approach allowed human evaluators to guide learning through a relatively small number of comparisons. The learned reward model could then generalize to new situations, significantly reducing the human effort required while maintaining alignment with human preferences. This scalability would prove crucial when applying similar techniques to language models, where the space of possible outputs is vast and impossible to enumerate.

The Problem

Traditional reinforcement learning relies on a reward function that provides a scalar reward signal to the agent after each action or at the end of an episode. This reward function serves as the primary learning signal, guiding the agent to discover behaviors that maximize expected cumulative reward. For many applications, this approach works well when the reward function can be precisely specified. In game-playing scenarios, for example, the reward might be the game score, which is clearly defined and measurable. In control tasks, the reward might be based on measurable quantities like distance traveled, energy consumed, or task completion metrics.

However, many important applications involve objectives that are difficult or impossible to encode as precise reward functions. Consider a robotic agent learning to assist with household tasks. An engineer might try to define a reward based on metrics like task completion time or the number of objects moved, but these metrics might miss important aspects of what makes assistance helpful: being gentle with fragile items, cleaning up afterward, or understanding the user's unstated preferences. Similarly, for language tasks, a model trained to maximize likelihood might generate text that is grammatically correct but unhelpful, verbose, or unsafe. What makes text "good" often involves subtle qualities that resist simple mathematical formulation.

Another challenge emerges when reward functions are misspecified, meaning they don't fully capture the desired behavior. An agent trained to maximize a misspecified reward function will exploit loopholes, finding ways to achieve high reward that don't match human intentions. Classic examples include agents that pause video games indefinitely to avoid negative rewards, or cleaning robots that hide messes rather than actually cleaning them. These behaviors demonstrate that even when engineers think they've specified rewards correctly, subtle misalignments can lead to unexpected and undesirable outcomes.

Human preferences often involve trade-offs and nuances that are difficult to capture in a single reward function. People might prefer some behaviors over others for reasons that are hard to articulate or formalize. A conversation that is informative but also friendly and appropriately concise might be preferred over one that is merely informative. These multi-dimensional preferences, which might vary across contexts and individuals, create challenges for traditional reward function design. Engineers would need to specify weights for different aspects, make assumptions about relative importance, and hope that the resulting function captures what humans actually value.

The scalability problem also looms large. As tasks become more complex, the effort required to design appropriate reward functions grows. Each new scenario might require rethinking the reward structure, testing various formulations, and iterating based on observed behaviors. This process doesn't scale well to systems that need to handle diverse situations or adapt to different user preferences. For applications where rapid deployment or personalization is important, the bottleneck of reward function engineering becomes a significant limitation.

Finally, there was the challenge of evaluation. Even when engineers designed reward functions they believed captured desired behaviors, verifying alignment with human preferences required expensive human evaluation. This created a circular problem: engineers would design reward functions, train agents, evaluate with humans, discover misalignments, redesign reward functions, and repeat. This iterative process was time-consuming and didn't guarantee convergence to truly aligned behaviors. A method that could directly incorporate human preferences into the learning process would break this cycle.

The Solution

The preference-based learning framework addresses these challenges by separating reward specification from reward learning. Instead of requiring engineers to design reward functions, the method learns a reward model from human feedback, then uses this learned reward model to train reinforcement learning agents. The key insight is that humans can more easily compare behaviors than they can specify exact reward values or functions. By collecting pairwise comparisons from human evaluators, the system can learn what humans value without requiring them to formalize their preferences mathematically.

The process works in three main stages. First, a reinforcement learning agent interacts with the environment, generating trajectories of behavior. These trajectories are presented to human evaluators in pairs, and the evaluators indicate which trajectory they prefer. Second, a reward model is trained to predict human preferences by learning to assign higher rewards to trajectories that humans prefer. This reward model learns from the pairwise comparison data, developing an internal representation of what makes behaviors desirable according to human judgment. Third, the reinforcement learning agent is trained using the learned reward model as the source of rewards, optimizing its policy to maximize expected reward according to the reward model's predictions.

The reward model is typically implemented as a neural network that takes a trajectory as input and outputs a scalar reward value. During training, the model learns to assign higher rewards to preferred trajectories and lower rewards to non-preferred ones. The training objective encourages the reward model to correctly rank trajectories according to human preferences. Specifically, the predicted rewards for a pair of trajectories are converted into a preference probability, following the Bradley-Terry model of pairwise comparisons, and the reward model is trained with a cross-entropy loss that pushes the predicted reward of the preferred trajectory above that of the non-preferred one. This approach allows the reward model to generalize beyond the specific comparisons it was trained on, learning general principles about what makes behaviors desirable.

The key technical innovation involves how trajectories are compared. Rather than requiring humans to evaluate complete trajectories, which might be long or complex, the method allows comparisons of trajectory segments. Human evaluators might compare specific segments of behavior, focusing attention on the most relevant parts. This segmentation makes human evaluation more efficient and allows the reward model to learn fine-grained preferences about specific aspects of behavior. The learned reward model can then aggregate these segment-level preferences when evaluating complete trajectories.
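To make this concrete, here is a minimal sketch of the reward model and its comparison loss in PyTorch, assuming a small per-step reward network whose outputs are summed over a trajectory segment, in line with the Bradley-Terry formulation described above. The names RewardModel and preference_loss are illustrative rather than taken from the original implementation.

```python
import torch
import torch.nn as nn


class RewardModel(nn.Module):
    """Maps a single observation-action step to a scalar reward."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def segment_reward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        # obs: (T, obs_dim), act: (T, act_dim); per-step rewards are summed over the segment
        return self.net(torch.cat([obs, act], dim=-1)).sum()


def preference_loss(model: RewardModel, segment_a, segment_b, label: float) -> torch.Tensor:
    """Cross-entropy loss for one comparison; label is 1.0 if segment A was preferred, 0.0 if B."""
    r_a = model.segment_reward(*segment_a)
    r_b = model.segment_reward(*segment_b)
    # Bradley-Terry: probability that A is preferred, given the predicted segment rewards
    p_a = torch.sigmoid(r_a - r_b)
    return -(label * torch.log(p_a) + (1.0 - label) * torch.log(1.0 - p_a))
```

In practice the loss would be averaged over a minibatch of comparisons and minimized with a standard optimizer such as Adam.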

Another important aspect is how the reward model handles uncertainty. When the reward model is uncertain about which trajectory is preferred, it should express this uncertainty rather than making confident but potentially incorrect predictions. The framework addresses this by training an ensemble of reward predictors: disagreement among ensemble members serves as an estimate of how uncertain the learned reward is. This uncertainty can then be used to guide further human evaluation, focusing attention on the comparisons where the reward model is least confident and where additional human feedback would be most informative.
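The sketch below illustrates this idea under the assumption that the ensemble members are instances of the RewardModel sketched earlier: disagreement between members ranks candidate comparisons, and the most contested pairs are sent to human evaluators. The helper names are hypothetical.

```python
import torch


def ensemble_disagreement(models, segment_a, segment_b) -> float:
    """Variance of the predicted preference probability across ensemble members."""
    probs = []
    with torch.no_grad():
        for model in models:
            p = torch.sigmoid(
                model.segment_reward(*segment_a) - model.segment_reward(*segment_b)
            )
            probs.append(p)
    return torch.stack(probs).var().item()


def select_queries(models, candidate_pairs, k: int):
    """Return the k segment pairs the ensemble disagrees on most."""
    scored = sorted(
        candidate_pairs,
        key=lambda pair: ensemble_disagreement(models, *pair),
        reverse=True,
    )
    return scored[:k]
```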

The integration with reinforcement learning uses the learned reward model as a drop-in replacement for a traditional reward function. Standard reinforcement learning algorithms, such as policy gradient methods or actor-critic approaches, can be applied without modification. The agent receives rewards from the reward model rather than from a hand-coded function, but from the agent's perspective, the learning process is identical. This compatibility meant that existing reinforcement learning infrastructure and algorithms could be used with minimal changes, making the approach practical to implement and experiment with.
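As a rough illustration, the learned reward model can be wrapped around an environment with a Gym-style step interface so that the agent receives learned rewards in place of environment rewards. The wrapper below is a simplified sketch: scoring one observation-action step at a time is an assumption made for brevity, and the class name is illustrative.

```python
import torch


class LearnedRewardWrapper:
    """Feeds the agent rewards from the reward model instead of the environment."""

    def __init__(self, env, reward_model):
        self.env = env
        self.reward_model = reward_model
        self.last_obs = None

    def reset(self):
        self.last_obs = self.env.reset()
        return self.last_obs

    def step(self, action):
        obs, _env_reward, done, info = self.env.step(action)  # true env reward is discarded
        with torch.no_grad():
            reward = self.reward_model.segment_reward(
                torch.as_tensor(self.last_obs, dtype=torch.float32).unsqueeze(0),
                torch.as_tensor(action, dtype=torch.float32).unsqueeze(0),
            ).item()
        self.last_obs = obs
        return obs, reward, done, info
```

From the perspective of a standard policy-gradient or actor-critic implementation, the wrapped environment looks like any other, which is exactly the drop-in property described above.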

The method also includes mechanisms for iterative improvement. As the agent learns and improves its behavior, it generates new trajectories that may be different from those used to train the initial reward model. These new trajectories can be evaluated by humans, providing additional preference data that can be used to refine the reward model. This iterative process allows the system to improve both the reward model and the agent's policy over time, creating a feedback loop that progressively aligns the agent's behavior with human preferences.
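Putting the pieces together, the iterative process can be summarized by the high-level loop below. The helpers collect_trajectories, query_humans, update_reward_model, and update_policy are hypothetical placeholders for the rollout, labeling, reward-learning, and reinforcement learning steps sketched above; in the original work these processes also ran asynchronously rather than in strict sequence.

```python
def preference_learning_loop(policy, reward_model, env, n_iterations: int):
    """High-level outline of the feedback loop; the helpers are placeholders."""
    preference_dataset = []
    for _ in range(n_iterations):
        # 1. Roll out the current policy to generate fresh behavior
        trajectories = collect_trajectories(policy, env)
        # 2. Ask human evaluators to compare selected segment pairs
        preference_dataset += query_humans(trajectories)
        # 3. Refit the reward model on all comparisons gathered so far
        update_reward_model(reward_model, preference_dataset)
        # 4. Train the policy against the updated learned reward
        update_policy(policy, env, reward_model)
    return policy, reward_model
```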

Applications and Impact

The initial applications demonstrated the method's effectiveness on diverse reinforcement learning tasks. In simulated robotic manipulation tasks, agents learned to perform complex manipulations that aligned with human preferences for safe and efficient behavior. In Atari game playing, agents learned strategies that prioritized behaviors humans found interesting or skillful, rather than just maximizing game score. These demonstrations showed that preference-based learning could produce agents that behaved in ways that matched human judgment, even when those behaviors weren't easily captured by simple reward functions.

The impact extended to practical applications where human preferences were crucial but difficult to formalize. In content recommendation systems, the approach could learn from user preferences expressed through interactions rather than requiring engineers to specify exact metrics for recommendation quality. In autonomous systems, agents could learn behaviors that matched human expectations for safety, courtesy, and efficiency without requiring exhaustive specification of all relevant factors. The method proved particularly valuable in domains where the gap between measurable quantities and actual value was large.

However, the most significant impact emerged several years later, when the preference-based learning framework became the foundation for aligning large language models. As language models grew in size and capability, researchers recognized that standard training objectives produced models that, while fluent, didn't match human preferences for helpfulness, accuracy, safety, and appropriate behavior. The challenge of aligning language model outputs with human values was fundamentally similar to the challenge of aligning reinforcement learning agent behaviors with human preferences.

The connection became clear when researchers adapted the preference learning framework for language models. Instead of learning from comparisons of agent trajectories in environments, the method learned from comparisons of language model outputs. Human evaluators could compare different responses to the same prompt, indicating which response they preferred. A reward model learned from these comparisons could then guide fine-tuning of the language model using reinforcement learning, specifically the Proximal Policy Optimization algorithm. This approach, which came to be known as Reinforcement Learning from Human Feedback or RLHF, became essential for training aligned language models.
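In the language-model setting, the learned reward is typically combined with a penalty that keeps the fine-tuned model close to its pre-trained reference, so that optimization does not drift into degenerate text that merely exploits the reward model. The sketch below shows this commonly used per-response reward; the coefficient beta and the function name are illustrative, and exact formulations vary across implementations.

```python
import torch


def rlhf_reward(rm_score: torch.Tensor,
                logprobs_policy: torch.Tensor,
                logprobs_reference: torch.Tensor,
                beta: float = 0.1) -> torch.Tensor:
    """Reward-model score minus a KL-style penalty for drifting from the reference model.

    rm_score: scalar score the reward model assigns to the sampled response.
    logprobs_policy, logprobs_reference: per-token log-probabilities of that response
    under the fine-tuned policy and the frozen reference model, respectively.
    """
    kl_penalty = (logprobs_policy - logprobs_reference).sum()
    return rm_score - beta * kl_penalty
```

This scalar is then maximized with an algorithm such as PPO, in the same way the earlier learned reward guided the reinforcement learning agents.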

RLHF enabled language models like GPT-3.5, GPT-4, Claude, and other modern systems to produce outputs that are more helpful, accurate, and safe. Without preference-based learning, these models would likely generate text that is fluent but often unhelpful, potentially harmful, or misaligned with user intentions. The method has become standard practice for training production language models, with major AI labs investing significant resources in collecting human preference data and training reward models.

The framework's influence extends beyond language models to other generative AI systems. Image generation models can be aligned using human preferences for aesthetics, safety, and appropriateness. Code generation models can learn from programmer preferences for code quality, style, and correctness. Any system where human judgment matters more than simple metrics can potentially benefit from preference-based learning approaches.

Limitations

While preference-based learning addresses many challenges, it also introduces new limitations and considerations. One fundamental issue is that human preferences may be inconsistent, context-dependent, or difficult to express through pairwise comparisons. Different humans might have different preferences, and the same human might express different preferences at different times or in different contexts. The method assumes that preferences are reasonably stable and consistent enough to learn from, but in practice, this assumption may not always hold.

The quality of the learned reward model depends critically on the quality and quantity of human preference data. If human evaluators provide noisy, inconsistent, or biased comparisons, the reward model will learn to reflect those issues. Poor quality preference data can lead to reward models that don't accurately capture human values, which then guide agents or models toward misaligned behaviors. Collecting high-quality preference data requires careful attention to evaluation protocols, training for evaluators, and quality assurance measures.

The scalability of human evaluation remains a challenge. While preference-based learning reduces the need for humans to specify reward functions, it still requires significant human effort to provide comparisons. For large-scale applications, this can become expensive and time-consuming. The method improves efficiency by learning generalizable reward models from relatively small amounts of comparison data, but human evaluation remains a bottleneck that limits how quickly systems can be improved or adapted to new domains.

Another concern involves reward hacking, where agents or models find ways to achieve high reward according to the learned reward model that don't actually align with human preferences. The reward model is an approximation of human preferences, and like any approximation, it may have blind spots or systematic errors. Clever optimization can exploit these imperfections, leading to behaviors that score well on the reward model but that humans would actually find undesirable. This problem is similar to reward hacking with hand-coded reward functions but may be harder to detect when preferences are learned implicitly.

The method also assumes that pairwise comparisons provide sufficient information to learn good reward models. However, some preferences might be inherently difficult to express through comparisons, especially when preferences involve multiple dimensions that can't be easily reduced to a single ranking. For example, a conversation might be preferred in some ways but not others, and forcing a single preference judgment might lose important nuance. More sophisticated preference elicitation methods might be needed for complex multi-dimensional preferences.

There are also questions about whose preferences are being learned. The method learns from the preferences of the humans who provide comparisons, but these humans might not be representative of all users or stakeholders. Preferences might vary across cultures, contexts, or individuals, and learning from one group's preferences might not produce behaviors that align with other groups' values. This concern becomes especially important for language models and other systems with broad user bases, where diverse preferences must be considered.

Finally, the iterative improvement process, while valuable, can be slow. Each cycle of generating behaviors, collecting human feedback, updating the reward model, and retraining the agent takes time and resources. For applications requiring rapid adaptation or frequent updates, this iterative process might not be fast enough. The method works well when there is time for careful alignment, but may be less suitable for scenarios requiring quick deployment or frequent changes.

Legacy and Looking Forward

The preference-based learning framework established in 2017 has become one of the most influential developments in AI alignment research. The core idea—learning from human preferences rather than hand-coded reward functions—has proven broadly applicable and has been adapted to many different contexts. The method demonstrated that human feedback could be incorporated into machine learning systems in principled and scalable ways, opening new possibilities for developing AI systems that better align with human values.

The legacy is most visible in modern language models, where RLHF has become standard practice for training production systems. Major language models released since 2020, including GPT-3.5, GPT-4, Claude, and many others, use variants of the preference learning framework to align model outputs with human preferences. The method has enabled these models to be more helpful, accurate, and safe than they would be with standard training objectives alone. Without the foundation laid in 2017, it's unlikely that modern language models would have achieved their current levels of alignment with human preferences.

Research continues to improve upon the original framework. Recent work has explored alternatives to the three-stage RLHF process, such as Direct Preference Optimization, which fine-tunes language models directly from preference data without training a separate reward model. Other research has investigated methods for making human evaluation more efficient, such as using AI assistants to help with preference collection or developing better protocols for training human evaluators. These improvements build on the core insights from 2017 while addressing some of the method's limitations.

The framework has also influenced thinking about AI safety and alignment more broadly. The idea that AI systems should be optimized for human preferences rather than simple metrics has become central to safety research. This shift in perspective has led to increased emphasis on human feedback, evaluation, and oversight in AI development. The preference learning approach has become a key tool in the broader effort to ensure that AI systems remain beneficial as they become more capable.

Looking forward, preference-based learning will likely remain essential for aligning advanced AI systems. As models become more capable and are deployed in more diverse contexts, the challenge of alignment becomes even more important. The framework provides a foundation for addressing these challenges, though ongoing research is needed to improve efficiency, handle diverse preferences, and ensure that learned reward models accurately capture human values across different contexts and cultures.

The development of preference-based learning in 2017 represents a pivotal moment in AI alignment research. By shifting from hand-coded reward functions to learned preferences, the work opened new pathways for developing AI systems that better match human intentions. The method's success in reinforcement learning and its later adaptation to language models demonstrate its fundamental importance. As AI systems continue to advance, the principles and techniques established in this work will likely remain central to ensuring that these systems remain aligned with human values and preferences.

Quiz

Ready to test your understanding of RLHF foundations? Challenge yourself with these questions about preference-based learning and see how well you've grasped the key concepts that enabled aligning AI systems with human preferences. Good luck!

