Reinforcement Learning from Human Feedback: From Zero to ChatGPT
Introduction
The speakers introduce themselves and welcome the audience to the live session. They also ask the audience where they are from.
- Nathan Lambert is a reinforcement learning researcher at Hugging Face.
- Thomas Simonini is a developer advocate at Hugging Face and writer of the deep reinforcement learning course.
- The live session will be split into two parts: a presentation by Nathan about reinforcement learning from human feedback, followed by a Q&A section.
- The audience is encouraged to ask questions in the chat or join the Discord server for further discussion.
Presentation on Reinforcement Learning from Human Feedback
Nathan Lambert presents an introduction to reinforcement learning from human feedback, discussing recent breakthroughs in machine learning and how they relate to this topic.
Recent Breakthroughs in Machine Learning
- Two recent breakthroughs in machine learning are ChatGPT, a language model capable of generating high-quality text across various subjects, and Stable Diffusion, a text-to-image model that was released openly to the public.
- These breakthroughs have led to significant advancements in natural language processing and other areas of machine learning.
Reinforcement Learning from Human Feedback
- Reinforcement learning involves training an agent through trial-and-error interactions with its environment.
- In some cases, it may be more efficient for humans to provide feedback rather than relying solely on rewards provided by the environment.
- This approach is known as reinforcement learning from human feedback (RLHF).
- RLHF has many potential applications, such as improving dialogue systems or optimizing medical treatments.
Challenges with RLHF
- One challenge with RLHF is that humans may not always provide consistent or accurate feedback.
- Another challenge is determining how much weight to give human feedback versus environmental rewards when training an agent.
- Despite these challenges, RLHF shows promise as a way to improve the efficiency and effectiveness of reinforcement learning.
Discussion
The speakers answer questions from the audience and discuss various topics related to reinforcement learning from human feedback.
Q&A
- Questions are asked about topics such as how to handle conflicting feedback, the potential for bias in human feedback, and how RLHF can be applied in different domains.
- The speakers provide insights and suggestions for addressing these challenges.
- They also discuss the importance of transparency and accountability when using RLHF in real-world applications.
Additional Topics
- The speakers discuss additional topics related to reinforcement learning, such as model-based versus model-free approaches and the use of deep neural networks.
- They also provide recommendations for further reading and resources on these topics.
Introduction to Reinforcement Learning from Human Feedback
The speaker introduces the topic of reinforcement learning and its potential to solve complex problems. They discuss the challenges of machine learning models and how they can fall short, leading to failure modes. The speaker also explains the concept of human feedback and how it can be used to encode human values in machine learning systems.
Origins of Reinforcement Learning from Human Feedback
- Reinforcement learning is a mathematical framework that allows us to study different interactions in the world.
- An agent interacts with an environment by taking an action, which returns two things: the state and the reward.
- The agent uses a policy to map from the state to an action, allowing it to learn how to optimize reward signals over time.
- Reinforcement learning from human feedback is one method for integrating complex data sets into machine learning models and encoding human values learned directly from humans.
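The agent-environment loop described above can be sketched in a few lines of Python. The two-armed bandit environment and the epsilon-greedy policy below are illustrative stand-ins (not from the talk), but they show the core cycle: the policy picks an action, the environment returns a reward, and the policy updates its estimates.

```python
import random

random.seed(0)  # for reproducibility

class TwoArmedBandit:
    """Toy environment: action 1 pays off far more often than action 0."""
    def step(self, action):
        reward = 1.0 if random.random() < (0.9 if action == 1 else 0.1) else 0.0
        return None, reward  # (state, reward); the state is trivial here

class EpsilonGreedyPolicy:
    """Maps the (trivial) state to an action using running reward estimates."""
    def __init__(self, epsilon=0.1):
        self.values = [0.0, 0.0]  # estimated reward per action
        self.counts = [0, 0]
        self.epsilon = epsilon

    def act(self):
        if random.random() < self.epsilon:
            return random.randint(0, 1)  # explore
        return max((0, 1), key=lambda a: self.values[a])  # exploit

    def update(self, action, reward):
        self.counts[action] += 1
        # Incremental mean: nudge the estimate toward the observed reward.
        self.values[action] += (reward - self.values[action]) / self.counts[action]

env, policy = TwoArmedBandit(), EpsilonGreedyPolicy()
for _ in range(2000):
    action = policy.act()
    _, reward = env.step(action)
    policy.update(action, reward)
```

After a couple of thousand interactions the policy's value estimates converge toward the true payout rates, which is the trial-and-error optimization of reward signals the bullets describe.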
Creating Loss Functions with Human Feedback
- Writing loss functions for complex questions like what is funny or ethical can be difficult or inaccurate when done on paper.
- Reinforcement learning from human feedback aims to learn these values directly from humans rather than assigning them based on assumptions or mislabeling.
- This method has been successful in addressing the problem of creating a complex loss function for machine learning models.
Challenges and Failure Modes
- Machine learning models can fall short due to interesting failure modes, such as chatbots becoming hateful when trained on biased data without grounding in values.
- Bias in machine learning algorithms and datasets reflects biases of their designers and creators, making it challenging to ensure fair and safe interactions with society.
- Reinforcement learning from human feedback is one potential solution to these challenges, allowing us to encode human values directly into machine learning systems.
Reinforcement Learning from Human Feedback
This section covers the history of RLHF and its origins in decision making. It also discusses the TAMER framework, which was introduced to solve Tetris using a reward model and policy.
Origin of RLHF
- RLHF originated in decision making before deep reinforcement learning.
- Autonomous agents were created without neural networks to represent a value function or policy.
- A machine learning system was created that used a reward model and policy together by having humans label actions as correct or incorrect.
TAMER Framework
- The TAMER framework was introduced to solve Tetris using a reward model and policy.
- The reward model and policy were combined into one in this framework.
- In Atari games, human feedback on trajectories was used to train a reward predictor that provided an additional signal to the policy solving the task.
Recent History of RLHF in Language Modeling
This section covers recent history related to RLHF in language modeling. It discusses OpenAI's experiments with training models to summarize text, including their use of RLHF.
OpenAI Experiments with Summarizing Text
- OpenAI conducted experiments with RLHF to train models for summarizing text.
- They used human feedback signals to get an output from machine learning models that is more compelling than those generated by language models alone.
- These models are useful for tasks covering sensitive subjects where misinformation must be avoided.
ChatGPT Rumors
This section discusses rumors surrounding ChatGPT, OpenAI's recent natural language processing project.
Rumors Surrounding ChatGPT
- There are interesting rumors surrounding ChatGPT, but little is known about the details of the project.
- OpenAI is not as open as they once were, so information is scarce.
- ChatGPT is closely related to OpenAI's broader natural language processing work.
Overview of RLHF
This section provides an overview of the three-phase process involved in RLHF, which includes language model pre-training, reward model training, and RL fine-tuning.
Language Model Pre-Training
- NLP has standardized practices for getting a language model by scraping data from the internet and using unsupervised sequence prediction.
- There is no single best answer on what the model size should be; industry experiments have ranged from 10 billion to 280 billion parameters.
- Human augmented text is optional and can be used to improve the data set. A company can pay humans to write responses to important prompts that are identified.
Reward Model Training
- The goal is to get a model that maps from some input text sequence to a scalar reward value.
- This step involves collecting human preferences rather than writing down an explicit equation for them.
RL Fine-Tuning
- This step involves fine-tuning the language model based on the reward in order to get more interesting performance.
Reward Model Training
In this section, the speaker discusses the reward model training process and how it differs from language model pre-training. They also explain how prompts are passed through a language model to generate text, which is then ranked using a human interface.
Prompt Data Sets
- The reward model training starts with a specific data set that is more focused on the prompts that it expects people to see.
- There are data sets on the internet that resemble preference data sets, or prompts collected from chatbot usage.
- These prompt data sets will be orders of magnitude smaller than the text corpora used to pre-train a language model.
Ranking Generated Text
- The downstream goal of having generated text is to rank it.
- The same prompt goes through each language model, generating different texts.
- Humans can label those different texts and create a relative ranking of what is going on.
Training the Reward Model
- To train a reward model with supervised learning, we need input and output pairs.
- We train on a sequence of text as input and use scalar values for reward as output.
- This generates what we call the reward or preference model.
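A common concrete choice for this supervised step, used in InstructGPT-style reward models, is a pairwise ranking loss over the human rankings: the model should score the preferred completion higher than the rejected one. The function below is a minimal pure-Python sketch of that loss, not the exact training code from the talk.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise ranking loss: -log sigmoid(r_chosen - r_rejected).

    Small when the reward model scores the human-preferred completion
    above the rejected one, large when the ranking is inverted."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A correct ranking yields a low loss, an inverted ranking a high one.
low = preference_loss(2.0, -1.0)
high = preference_loss(-1.0, 2.0)
```

Training then becomes ordinary supervised learning: sum this loss over all human-labeled pairs and update the reward model's parameters by gradient descent.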
Policy and Fine-Tuning with RL
In this section, the speaker explains how states and actions in reinforcement learning are both language-based. They also discuss using a KL-divergence constraint to keep models close to the original as they are updated over time.
States and Actions in RL
- States and actions in reinforcement learning are both language-based.
- The reward model translates from these states of language to scalar rewards that can be used in an RL system.
Updating Models Over Time
- To update models over time, we need to put the system into a feedback loop.
- A KL-divergence term can be used to compare distributions of generated text and constrain model updates accordingly.
Reinforcement Learning for Text Generation
In this section, the speaker discusses how reinforcement learning is used in text generation. They explain the reward model output and KL Divergence constraint on the text, and how they are combined to create a scalar notion of reward. The speaker also talks about RL Optimizer and PPO algorithm.
Reward Model Output and KL Divergence Constraint
- The language model could output gibberish to get high reward from the reward model.
- A scaling factor, lambda, is used to balance the reward from the reward model against the KL-divergence constraint on the text.
- InstructGPT adds a reward term for text outputs of trained models that match high-quality annotations.
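The bullets above combine into the single scalar the RL optimizer sees, roughly r = r_RM - lambda * KL. A minimal sketch, assuming the KL term is approximated from the sampled tokens' log-probabilities under the tuned policy and a frozen reference model (the function name and inputs are illustrative):

```python
def kl_penalized_reward(rm_score, logprobs_policy, logprobs_ref, lam=0.1):
    """Reward-model score minus a scaled KL penalty.

    The KL term keeps the fine-tuned policy close to the original
    language model, so it cannot drift into high-reward gibberish.
    `lam` is the lambda scaling factor balancing the two terms."""
    # Sampled-token approximation of KL(policy || reference):
    # sum over tokens of log p_policy(token) - log p_ref(token).
    kl_estimate = sum(lp - lr for lp, lr in zip(logprobs_policy, logprobs_ref))
    return rm_score - lam * kl_estimate

# If the policy assigns its sampled tokens much higher probability than
# the reference model does, the penalty shrinks the effective reward.
penalized = kl_penalized_reward(1.5, [-0.1, -0.2], [-1.0, -1.2], lam=0.1)
```

When the tuned policy matches the reference model exactly, the penalty vanishes and the optimizer sees the raw reward-model score.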
RL Optimizer and PPO Algorithm
- The RL Optimizer operates as if the reward was given to it from the environment.
- PPO stands for proximal policy optimization, which is an on-policy reinforcement learning algorithm optimized for parallel approaches.
- PPO works on discrete or continuous actions, making it suitable for language models.
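The heart of PPO is its clipped surrogate objective: the ratio of new to old action probabilities is clipped so that one update cannot move the policy too far from the policy that generated the data. A per-token sketch, heavily simplified from the full algorithm:

```python
import math

def ppo_clip_objective(logprob_new, logprob_old, advantage, clip_eps=0.2):
    """Clipped surrogate objective of PPO for a single action (token).

    PPO maximizes the minimum of the clipped and unclipped terms, which
    caps how much credit a single large policy change can claim."""
    ratio = math.exp(logprob_new - logprob_old)
    clipped_ratio = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps)
    return min(ratio * advantage, clipped_ratio * advantage)
```

In an RLHF loop each "action" is a generated token and the advantage is derived from the scalar reward; the objective looks the same whether actions are discrete tokens or continuous controls, which is part of why PPO transfers to language models.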
Manipulation Based on Human Feedback
In this section, the speaker answers a question about whether models can be manipulated through human feedback. They discuss how disagreement in the data can affect feedback accuracy and mention Facebook's work on detecting trolling in feedback.
- Disagreement in the data can reflect differences in values between individuals.
- Facebook's BlenderBot detects trolling in feedback.
- A model predicting prompt difficulty could help improve accuracy.
Human Editors and RLHF
This section discusses the role of human editors in writing prompts and responses for ChatGPT. It also covers the challenges of building an open-source ChatGPT due to the need for high-quality data.
Human Editors and ChatGPT
- The prompts are sourced from a wider distribution of people, while the responses are kept from a relatively closed source of contractors.
- People write both prompts and responses for ChatGPT.
- There is an advantage to having diverse prompts, but the feedback data needs to be high quality, so it's written by a subset of people.
Challenges in Building Open-Source ChatGPT
- Building an open-source ChatGPT requires high-quality data, which is hard to obtain since it can't be written by everyone.
- Anthropic has released open-source preference data; their method includes complex additions like context distillation and preference model pre-training.
- Fine-tuning the reward model on ranking datasets, before training it on labels from generated prompts, helps initialize the reward model.
- Online iterated RLHF updates the reward model iteratively while interacting with users. However, this may not be applicable to every experiment.
RLHF Methods
This section covers some interesting parts of RLHF methods used in popular papers like Anthropic's. It also compares different algorithms used by OpenAI and DeepMind.
Anthropic's Complex Additions
- Anthropic uses context distillation to improve helpfulness, honesty, and harmlessness in their initial policy for RLHF.
- Anthropic uses preference model pre-training to fine-tune the reward model on ranking datasets before training it on labels from generated prompts.
OpenAI and DeepMind's RLHF Methods
- OpenAI pioneered training language models on human-generated text and added a reward term encouraging the RL policy's outputs to match high-quality annotations.
- DeepMind used advantage actor-critic (A2C), a different on-policy RL algorithm from the PPO that OpenAI uses more often.
- The choice of algorithm may depend more on infrastructure and expertise than on the algorithms themselves.
Introduction
The speaker introduces the topic of machine learning and feedback interfaces, acknowledging that the field is moving fast and there may be things they missed. They discuss how machine learning is becoming more user-friendly and highlight Anthropic's text interface as an example.
Reward Model Feedback Interface
The speaker discusses how reward model feedback interfaces are making machine learning more user-friendly. They compare different feedback interfaces, including thumbs up/thumbs down systems and direct editing of outputs.
- Anthropic's chatbot requires users to rank responses on a sliding scale.
- Thumbs up/thumbs down systems are used because they're easy to get data from, but direct editing of outputs provides higher quality data.
- These interfaces will continue to evolve over time.
Recent Examples
The speaker walks through recent examples of machine learning models and highlights the three-step process used by OpenAI's InstructGPT model. They also compare InstructGPT's optimization goals with those of Anthropic.
- OpenAI's InstructGPT uses a three-step process: collecting demonstration data and training a supervised policy, collecting comparison data and training a reward model, then optimizing the policy against the reward model using reinforcement learning.
- Anthropic's approach modifies pre-trained language models with preference model pre-training and context distillation.
- Anthropic optimized for alignment and having an agent that was harmless and helpful.
Reinforcement Learning Research and Open Source
In this section, the speaker talks about two areas of investigation that interest him as a reinforcement learning researcher. He also discusses the need for better documentation of reinforcement learning optimizer choices and the potential benefits of offline RL.
Investigating Reinforcement Learning
- The speaker is interested in investigating two high-level open areas: reinforcement learning and user-facing technologies.
- As a reinforcement learning researcher, he believes there are many optimizer choices that are not well-documented and can be expanded on.
- It isn't even clear whether RL is necessary for certain parts of the process, and PPO in particular is not strictly required.
- The speaker is also interested in exploring whether it's possible to train models using offline RL, which would reduce training costs but not data costs.
Challenges with Data Costs
- Data costs associated with training large models like GPT-3 are very high due to the cost of labeling and disagreement in the data.
- Human values have disagreement, making it difficult to determine one ground truth distribution that says what's right or wrong.
- Feedback-type user interface questions could help capture human values more accurately.
Q&A Session
In this section, the speaker answers questions from viewers about his presentation.
Applying Reinforcement Learning to Stable Diffusion
- The speaker thinks it's possible to apply reinforcement learning to Stable Diffusion as a fine-tuning method that could help with safety problems.
- There are no structural reasons why this couldn't work, but encoding and decoding prompts might be trickier when working with images instead of words.
Introduction
The speaker talks about the popularity of language models and how human feedback is a huge field of machine learning.
Language Models and Human Feedback
- Language models have become popular recently.
- Human feedback is an important aspect of machine learning.
- Hugging Face plans to use human feedback in reinforcement learning but hasn't come up with a specific project yet.
Evaluating Models without Human Feedback
The speaker discusses ways to evaluate models without human feedback.
Metrics for NLP
- There are metrics and datasets designed to evaluate topics like harmfulness or alignment without having humans involved.
- BLEU and ROUGE are examples of such metrics.
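Both BLEU and ROUGE score generated text by n-gram overlap with a reference. The snippet below is a simplified, pure-Python sketch of ROUGE-1 recall (unigram overlap), not the official implementation:

```python
from collections import Counter

def rouge1_recall(reference, candidate):
    """Fraction of reference unigrams that also appear in the
    candidate, counted with multiplicity (ROUGE-1 recall)."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    overlap = sum(min(count, cand_counts[word])
                  for word, count in ref_counts.items())
    return overlap / max(sum(ref_counts.values()), 1)

score = rouge1_recall("the cat sat on the mat", "the cat is on the mat")
```

Such metrics are cheap to compute at scale, but they only measure surface overlap, which is exactly why qualities like harmfulness or alignment ultimately need human judgment as well.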
Reinforcement Learning with Convergence Issues
The speaker talks about convergence issues in reinforcement learning.
RL Implementation Limitations
- PPO update steps don't converge easily on bigger models.
- CarperAI's trlX library is working on scaling RL implementations to bigger language models.
Democratizing Access to Knowledge
The speaker discusses democratizing access to knowledge through language models.
Accessing More Languages
- Once there's an open-source version, fine-tuned versions will be available for other languages.
Sustainability of GPT Systems
The speaker talks about the sustainability of GPT systems.
Cost Considerations
- Upfront costs aren't a problem for companies, but the cost per inference of the model can be expensive.
- Microsoft is working on creating low-cost APIs for complex model inference.
Specific Applications of GPT Systems
The speaker discusses specific applications of GPT systems.
Business Model Considerations
- ChatGPT could be used for search and as an effective company search tool.
- ChatGPT does well at answering specific questions.
Will Human Annotators Become Obsolete?
In this section, the speakers discuss whether human annotators will become obsolete in the future due to advancements in language models.
Future of Human Annotators
- Tesla got rid of human annotators by creating more powerful models.
- It is unlikely to happen soon, but there is concern that language models will be trained on other language models, making them the ultimate source of truth.
- Companies are already trying to mimic GPT by using it to bootstrap their own training data.
Multimodal Models and Memory Usage
This section covers multimodal models and memory usage.
Multimodal Models
- Researchers are currently exploring how to train models that use multiple types of data.
- Hugging Face has a whole multimodal project where they're trying to figure out how to make bottom-up models more flexible.
Memory Usage
- It's not confirmed if GPT has everything in its memory, but there are rumors that OpenAI has figured out some incredible scraping techniques.
- The fact that a company other than Google has figured this out is still pretty remarkable.
The Future of RLHF and Data Labeling
This section discusses the future of RLHF and data labeling.
RLHF for Other Modalities
- There will likely be an RLHF for modalities like generating art and music.
- The field of human feedback is much broader than RLHF, which is just one new subset of it.
Data Labeling
- There is discussion about whether builders should begin labeling data with existing language models like GPT or wait for the next generation.
- The data pipeline is still useful, and OpenAI will use their huge data advantage when they move on to GPT-4.
Resources for Learning More
This section provides resources for learning more about the topics discussed in the transcript.
Learning Resources
- The Alignment Community is very responsive to people engaging on their topics.
- Other forums that researchers are affiliated with include LessWrong and the Alignment Forum.
- The blog post written by the speakers is a good starting point for an introduction to these concepts.
Keeping up with GPT Models
In this section, the speakers discuss how other companies and open-source communities can keep up with the pace of GPT models.
Open-Source Community vs. OpenAI
- The open-source community has more people and engagement than OpenAI.
- The scale of access is different.
Reinforcement Learning
- Reinforcement learning from human feedback works much better than just fine-tuning the original model directly with the same reward.
- Rumor has it that RL handles the shifting optimization landscape more gracefully.
- Fine-tuning on the same dataset could work, but the optimization wasn't figured out in the same way.
- There's no research paper version of the blog post that they wrote.
Contributing to GPT Code
In this section, Nathan answers a question about contributing to GPT code.
- You cannot read the code of ChatGPT, as it's a proprietary model, and you can't contribute as an outsider since it's an internal project.
Q&A Session
In this section, Nathan talks about what people can do if they didn't have time to ask questions during his presentation.
Questions and Answers
- If there isn't enough time to answer questions during a presentation, people can join Discord or leave comments below, and the speakers will make time in the coming days to answer them.