How can we improve language models using reinforcement learning? ChatGPT case study

February 20, 2023 | Generative models | by Kinga Prusinkiewicz

ChatGPT is a cutting-edge natural language processing model released in November 2022 by OpenAI. It is built on the GPT-3.5 family of models and is designed specifically for chatbot and conversational AI applications. Riding the wave of ChatGPT’s popularity, plenty of amazing examples of the chatbot’s accomplishments have appeared, one of which is presented in Figure 1.

Figure 1: Example usage of ChatGPT to analyze worst-case time complexity of bubble sort in the specified style. Source: https://twitter.com/goodside/status/1598129631609380864

Introduction to GPT models

Let’s start with a short introduction to the GPT model family. The acronym stands for Generative Pre-trained Transformer and refers to a series of models (GPT, GPT-2 and GPT-3, with the next generations expected soon). These models have been trained to process and generate human-like language, and they have achieved impressive results in various language tasks such as translation, summarization and question answering. GPT has been trained on a massive dataset of text, and it uses this training data to learn patterns and relationships in language, which allows it to understand and generate language in a way similar to how humans do. GPT is a powerful tool for developers looking to create articles, poetry, stories, news, reports and dialogue. It can also be fine-tuned for specific tasks or domains, allowing it to become even more effective at handling specific types of language tasks.

What makes ChatGPT different from classic GPT models is its incorporation of human feedback during training using reinforcement learning. In this post we will dive into the details of RLHF (Reinforcement Learning from Human Feedback) and how we can use it to fine-tune language models. It is worth noting that the idea was previously used by the OpenAI team in InstructGPT – a sibling model which was trained to follow an instruction in a prompt and provide a detailed response.

What is reinforcement learning?

Reinforcement learning is an area of machine learning which aims to train models to make a sequence of decisions. The agent learns by interacting with a (usually complex) environment. Each action is coupled with a reward (or penalty). The goal of the agent is to learn which actions will maximize the total reward.

The typical reinforcement learning setup consists of a tuple of five elements:

  • State Space (\(S\)) – a set of possible states that an agent can visit.
  • Action Space (\(A\)) – a set of possible actions that an agent may take.
  • State Transition Probability (\(P\)) – describes the dynamics of the environment. It is also called the world model. For model-free reinforcement learning, it is not necessary to know the state transition probability.
  • Reward Function (\(R\)) – a reward (penalty) that an agent receives for a selected action made in a specific state.
  • Discount Factor (\(\gamma\)) – defines the present value of future rewards.

A reinforcement learning agent learns a policy (\(\pi\)), which defines the action that should be taken in the current state.

RL has a wide range of applications, including control systems and robotics. It is particularly useful for tasks that involve sequential decision-making or learning from experience, such as playing Go or Atari games.
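To make this setup concrete, below is a minimal sketch of the agent-environment loop in Python. The toy LineWorld environment, the random policy and the episode length are illustrative assumptions only; the sketch simply shows how states, actions, rewards, the discount factor and the policy fit together.

```python
import random

class LineWorld:
    """Toy environment: the agent walks on positions 0..4 and is rewarded at position 4."""
    def __init__(self):
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Action space A = {-1, +1}: move left or right along the line.
        self.state = max(0, min(4, self.state + action))
        reward = 1.0 if self.state == 4 else 0.0   # reward function R
        done = self.state == 4                     # episode ends at the goal
        return self.state, reward, done

def random_policy(state):
    """Placeholder policy pi(a | s); a trained agent would replace this."""
    return random.choice([-1, +1])

env = LineWorld()
state, gamma, discounted_return = env.reset(), 0.99, 0.0
for t in range(50):                                # one episode, at most 50 steps
    action = random_policy(state)
    state, reward, done = env.step(action)
    discounted_return += (gamma ** t) * reward     # present value of future rewards
    if done:
        break
print(f"Discounted return of this episode: {discounted_return:.3f}")
```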

Reinforcement learning from human feedback

The history of incorporating human feedback into reinforcement learning is a long one. There have been plenty of ideas on how to integrate human-generated samples into the agent training process, for example by adjusting the algorithm itself or with reward shaping. We would like to focus a little more on the approach presented in “Deep Reinforcement Learning from Human Preferences”, published jointly by OpenAI and DeepMind in 2017.

A typical reinforcement learning training loop involves an agent that interacts with the environment and moves between states. Each interaction is associated with a reward. The whole process is presented in Figure 2. The reward function has a huge impact on agent performance: if it is poorly designed, the agent performs poorly as well. In the paper, the authors propose learning the reward function from human feedback, while the agent is trained in the same way as in the classical reinforcement learning task.

Figure 2: Classic reinforcement learning training loop. Source: own elaboration.

The best way to understand how the agents are trained is to go through the example provided by the authors in the video.

Figure 3: Experiment video screenshot. The human coordinator selected the left agent, as its behavior is more similar to a backflip, which was the goal. Source: https://www.youtube.com/watch?v=oC7Cw3fu3gU

The task was to teach the agent how to do a backflip. Two trajectories produced by the current policy were shown to a human, who decided which one did the better backflip (or at least made the better attempt at one). Based on this preference, the reward estimator is updated so that the favored behavior receives a higher reward. The agent is then trained in the classical reinforcement learning manner. The training loop is thus enriched with one additional step; the new loop is presented in Figure 4.

Figure 4: Reinforcement learning from human feedback training loop. Source: own elaboration.

To sum up, the new process consists of three steps:

  1. Generating a set of trajectories \(\{\tau^{1}, …, \tau^{n}\}\) with the learned policy. The parameters of the policy are learned via traditional reinforcement learning to maximize the total reward; the policy can be learned with any suitable reinforcement learning algorithm.
  2. Selecting two segments \((\sigma^{1}, \sigma^{2})\) from the generated trajectories and letting a human compare them and indicate which one did better. Human judgments are stored as tuples \((\sigma^{1}, \sigma^{2}, \mu)\), where \(\mu\) is a distribution over the two segments indicating which one was preferred.
  3. Training the reward predictor using supervised learning. To estimate the reward predictor, we need a way to express the preference between trajectories, which can be achieved with the Bradley-Terry model. The simplest illustration of how this model works is ranking football teams in a competition: since the number of matches played might not be equal for all teams, we can introduce a model that compares the “strength” of teams to obtain the probability of one team beating another. We can apply the same idea to trajectories:

$$
\widehat{P}[\sigma^{1} \succ \sigma^{2}] = \frac{\exp\left(\sum_{t}\widehat{r}(\sigma^{1}_{t}, a^{1}_{t})\right)}{\exp\left(\sum_{t}\widehat{r}(\sigma^{1}_{t}, a^{1}_{t})\right) + \exp\left(\sum_{t}\widehat{r}(\sigma^{2}_{t}, a^{2}_{t})\right)}
$$

Therefore we can write the loss function as the cross-entropy between the predicted preference probabilities and the human judgments:

$$
loss(\widehat{r}) = -\sum_{(\sigma^{1}, \sigma^{2}, \mu)} \left[ \mu(1)\log\widehat{P}[\sigma^{1} \succ \sigma^{2}] + \mu(2)\log\widehat{P}[\sigma^{2} \succ \sigma^{1}] \right]
$$
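As a quick illustration of the two formulas above, the NumPy sketch below computes the Bradley-Terry preference probability from the summed predicted rewards of two segments, and the resulting cross-entropy loss for a single human judgment. The per-step reward values and the preference label are made-up numbers used only for this example.

```python
import numpy as np

def preference_probability(rewards_1, rewards_2):
    """Bradley-Terry probability that segment 1 is preferred over segment 2,
    based on the summed predicted per-step rewards of each segment."""
    scores = np.array([np.sum(rewards_1), np.sum(rewards_2)])
    scores -= scores.max()                 # subtract the max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores[0] / exp_scores.sum()

def preference_loss(rewards_1, rewards_2, mu):
    """Cross-entropy loss for a single judgment (sigma1, sigma2, mu),
    where mu = (mu(1), mu(2)) encodes which segment the human preferred."""
    p1 = preference_probability(rewards_1, rewards_2)
    p2 = 1.0 - p1
    return -(mu[0] * np.log(p1) + mu[1] * np.log(p2))

# Example: the human strictly preferred segment 1, so mu = (1, 0).
rewards_1 = [0.2, 0.5, 0.9]   # predicted rewards r_hat along segment 1 (made up)
rewards_2 = [0.1, 0.3, 0.4]   # predicted rewards r_hat along segment 2 (made up)
print(preference_probability(rewards_1, rewards_2))        # ~0.69
print(preference_loss(rewards_1, rewards_2, mu=(1, 0)))    # ~0.37
```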

Now that we are equipped with knowledge of reinforcement learning from human feedback, we can take a deep dive into the ChatGPT example.

ChatGPT/InstructGPT cases

ChatGPT and InstructGPT use reinforcement learning from human feedback in the model fine-tuning phase. We can split it into the three stages presented in Figure 5.

Figure 5: ChatGPT fine-tuning steps. Source: https://openai.com/blog/chatgpt/

Step 1

The first step involves fine-tuning GPT-3.5 using dialogue data written by human trainers playing both the assistant and the user roles. The trainers had access to model-written suggestions to help with composing responses. These dialogues were mixed with the InstructGPT dataset, which contains prompts and instructions written by users of earlier versions of InstructGPT and submitted through the Playground. For InstructGPT itself, this data collection step is limited to obtaining the InstructGPT dataset and fine-tuning the GPT-3 model on it. This step is summarized in Figure 6.

Figure 6: Language model pretraining. Source: https://huggingface.co/blog/rlhf
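The snippet below is a minimal sketch of the kind of supervised objective used in this step: a demonstration dialogue is turned into next-token prediction targets and scored with cross-entropy. The toy whitespace tokenizer, the uniform stand-in for the language model and the example dialogue are assumptions made for illustration; the real setup fine-tunes GPT-3.5 on much larger demonstration and prompt datasets.

```python
import numpy as np

# Toy whitespace "tokenizer" and vocabulary (illustrative only).
dialogue = "user: what is RLHF ? assistant: it stands for reinforcement learning from human feedback"
tokens = dialogue.split()
vocab = sorted(set(tokens))
token_ids = [vocab.index(t) for t in tokens]

def uniform_model(context_ids, vocab_size):
    """Stand-in for a language model: returns a probability distribution
    over the vocabulary for the next token (here simply uniform)."""
    return np.full(vocab_size, 1.0 / vocab_size)

# Supervised fine-tuning objective: average next-token cross-entropy
# over the human demonstration dialogue.
losses = []
for t in range(1, len(token_ids)):
    probs = uniform_model(token_ids[:t], len(vocab))
    losses.append(-np.log(probs[token_ids[t]]))   # cross-entropy for the true next token
print(f"Mean next-token loss on the demonstration: {np.mean(losses):.3f}")
```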

The next steps remain the same for both ChatGPT and InstructGPT.

Step 2

The second step focuses on training the reward model. The language model from the first step is used to generate samples of responses, which are compared and ranked by humans to express their preferences. According to the InstructGPT paper, a labeler receives between 4 and 9 responses to rank, which means there are \(\binom{K}{2}\) comparisons, where \(K\) is the number of responses to compare (see the sketch below). Each set of comparisons is fed to a neural network which learns to score generated responses according to human preferences.

Figure 7: Reward model training. Source: https://huggingface.co/blog/rlhf
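To illustrate the comparison count mentioned above, the snippet below expands a single labeler ranking of K responses into all \(\binom{K}{2}\) pairwise (preferred, rejected) training examples for the reward model. The responses and their ranking are made up for this example.

```python
from itertools import combinations
from math import comb

# A labeler ranks K responses for one prompt, best first (made-up example).
ranked_responses = ["response A", "response C", "response B", "response D"]
K = len(ranked_responses)

# Every pair becomes one (preferred, rejected) training example for the reward model.
pairs = [(winner, loser) for winner, loser in combinations(ranked_responses, 2)]

print(f"K = {K} responses -> {len(pairs)} comparisons (C(K, 2) = {comb(K, 2)})")
for winner, loser in pairs:
    print(f"preferred: {winner!r}  over  {loser!r}")
```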

Step 3

The last step utilizes the prepared elements in one reinforcement learning task to fine-tune the language model. Let’s formulate the task in reinforcement learning terms:

  • The agent is represented by the language model.
  • The state space is the set of possible input token sequences.
  • The action space is all the tokens in the vocabulary of the language model.
  • The reward from the environment is delivered by the reward predictor trained in step 2.

The algorithm used in ChatGPT is PPO, short for Proximal Policy Optimization – a state-of-the-art technique in the reinforcement learning area. A Kullback-Leibler divergence term between the current policy distribution and that of the initial model is added to the PPO loss to prevent the fine-tuned policy from moving substantially away from the initial model.

Figure 8: Fine-tuning with Reinforcement Learning. Source: https://huggingface.co/blog/rlhf
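Below is a minimal sketch of the reward signal described above, assuming that log-probabilities of the sampled tokens under both the current policy and the frozen initial model are available: the reward model score is combined with a KL penalty that keeps the fine-tuned policy close to the initial model. The variable names, the penalty coefficient beta and the toy numbers are illustrative assumptions; the PPO update itself is not shown.

```python
import numpy as np

def kl_penalized_reward(reward_model_score, logprobs_policy, logprobs_initial, beta=0.02):
    """Combine the reward model score for a generated response with a
    per-token KL penalty between the current policy and the initial model."""
    # Per-token approximation of KL(policy || initial) along the sampled response.
    per_token_kl = np.asarray(logprobs_policy) - np.asarray(logprobs_initial)
    return reward_model_score - beta * per_token_kl.sum()

# Made-up log-probabilities of the sampled tokens under both models.
logprobs_policy  = [-1.2, -0.7, -2.1, -0.4]   # current (fine-tuned) policy
logprobs_initial = [-1.5, -0.9, -1.8, -0.5]   # frozen initial model
reward_model_score = 0.8                      # output of the reward model from step 2

print(f"KL-penalized reward: "
      f"{kl_penalized_reward(reward_model_score, logprobs_policy, logprobs_initial):.3f}")
```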

Summary

Using human feedback as the reward signal has several advantages. It allows the model to learn from real-world human preferences and expectations, making it more likely to generate responses that are natural and human-like. It also allows the model to learn more quickly and efficiently, since it can use the feedback it receives to fine-tune its output and avoid making the same mistakes in the future.

However, there are also some limitations to this approach. The feedback may be subjective and prone to bias, which could affect the model’s learning process. Additionally, it can be time-consuming and resource-intensive to collect and process large amounts of human feedback, especially if the model is generating a large number of responses.

Bibliography

  • Paul Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, Dario Amodei, “Deep Reinforcement Learning from Human Preferences”, https://arxiv.org/abs/1706.03741
  • https://openai.com/blog/chatgpt/
  • https://openai.com/blog/instruction-following/
  • https://huggingface.co/blog/rlhf
  • https://openai.com/research/learning-from-human-preferences
  • https://wandb.ai/ayush-thakur/RLHF/reports/Understanding-Reinforcement-Learning-from-Human-Feedback-RLHF-Part-1--VmlldzoyODk5MTIx