Aligning language models to follow instructions
January 27, 2022
We’ve trained language models that are much better at following user intentions than GPT‑3 while also making them more truthful and less toxic, using techniques developed through our alignment research. These InstructGPT models, which are trained with humans in the loop, are now deployed as the default language models on our API.
The OpenAI API is powered by GPT‑3 language models which can be coaxed to perform natural language tasks using carefully engineered text prompts. But these models can also generate outputs that are untruthful, toxic, or reflect harmful sentiments. This is in part because GPT‑3 is trained to predict the next word on a large dataset of Internet text, rather than to safely perform the language task that the user wants. In other words, these models aren’t aligned with their users.
To make our models safer, more helpful, and more aligned, we use an existing technique called reinforcement learning from human feedback (RLHF). On prompts submitted by our customers to the API,[A] our labelers provide demonstrations of the desired model behavior, and rank several outputs from our models. We then use this data to fine-tune GPT‑3.
The resulting InstructGPT models are much better at following instructions than GPT‑3. They also make up facts less often, and show small decreases in toxic output generation. Our labelers prefer outputs from our 1.3B InstructGPT model over outputs from a 175B GPT‑3 model, even though the InstructGPT model has more than 100x fewer parameters. At the same time, we show that we don’t have to compromise on GPT‑3’s capabilities, as measured by our model’s performance on academic NLP evaluations.
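As a quick check on the "more than 100x fewer parameters" comparison, the ratio can be worked out from the two parameter counts quoted above. The variable names are illustrative only.

```python
# Parameter counts as quoted in the text: 1.3B (InstructGPT) and 175B (GPT-3).
instructgpt_params = 1.3e9
gpt3_params = 175e9

# How many times larger the GPT-3 model is than the InstructGPT model.
ratio = gpt3_params / instructgpt_params

print(f"GPT-3 has about {ratio:.0f}x as many parameters")
```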
These InstructGPT models, which have been in beta on the API for more than a year, are now the default language models accessible on our API.[B] We believe that fine-tuning language models with humans in the loop is a powerful tool for improving their safety and reliability, and we will continue to push in this direction.
This is the first time our alignment research, which we’ve been pursuing for several years [1, 2, 3], has been applied to our product. Our work is also related to recent research that fine-tunes language models to follow instructions using academic NLP datasets, notably FLAN [4] and T0 [5]. A key motivation for our work is to increase helpfulness and truthfulness while mitigating the harms and biases of language models [6, 7, 8, 9, 10]. Some of our previous research in this direction found that we can reduce harmful outputs by fine-tuning on a small curated dataset of human demonstrations [11]. Other research has focused on filtering the pre-training dataset [12], safety-specific control tokens [13, 14], or steering model generations [15, 16]. We are exploring these ideas and others in our ongoing alignment research.
Results
We first evaluate how well outputs from InstructGPT follow user instructions, by having labelers compare its outputs to those from GPT‑3. We find that InstructGPT models are significantly preferred on prompts submitted to both the InstructGPT and GPT‑3 models on the API. This holds true when we add a prefix to the GPT‑3 prompt so that it enters an “instruction-following mode.”
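A head-to-head comparison like the one described above is commonly summarized as a win rate: the fraction of comparisons in which labelers preferred one model's output. The sketch below is a minimal illustration under that assumption. The function name and the sample labels are hypothetical, not data from the evaluation.

```python
def win_rate(preferred_labels, model="InstructGPT"):
    """Fraction of comparisons in which `model` was the preferred output.

    `preferred_labels` holds one entry per comparison, naming the model
    whose output the labeler preferred.
    """
    if not preferred_labels:
        raise ValueError("need at least one comparison")
    wins = sum(1 for winner in preferred_labels if winner == model)
    return wins / len(preferred_labels)


# Made-up labels for illustration only; these are not reported results.
example_labels = ["InstructGPT", "GPT-3", "InstructGPT", "InstructGPT"]
example_rate = win_rate(example_labels)
```

A win rate above 0.5 means the model's outputs were preferred more often than the baseline's on that prompt set.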
To measure the safety of our models, we primarily use a suite of existing metrics on publicly available datasets. Compared to GPT‑3, InstructGPT produces fewer imitative falsehoods (according to TruthfulQA [17]) and is less toxic (according to RealToxicityPrompts [18]). We also conduct human evaluations on our API prompt distribution, and find that InstructGPT makes up facts (“hallucinates”) less often, and generates more appropriate outputs.[C]
Finally, we find that InstructGPT outputs are preferred to those from FLAN [4] and T0 [5] on our customer distribution. This indicates that the data used to train FLAN and T0, mostly academic NLP tasks, is not fully representative of how deployed language models are used in practice.
Methods
To train InstructGPT models, our core technique is reinforcement learning from human feedback (RLHF), a method we helped pioneer in our earlier alignment research. This technique uses human preferences as a reward signal to fine-tune our models, which is important as the safety and alignment problems we are aiming to solve are complex and subjective, and aren’t fully captured by simple automatic metrics.
We first collect a dataset of human-written demonstrations on prompts submitted to our API, and use this to train our supervised learning baselines. Next, we collect a dataset of human-labeled comparisons between two model outputs on a larger set of API prompts. We then train a reward model (RM) on this dataset to predict which output our labelers would prefer. Finally, we use this RM as a reward function and fine-tune our GPT‑3 policy to maximize this reward using the PPO algorithm.
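The reward-model step above can be sketched with a pairwise logistic loss, a common choice for learning from comparisons. The post itself does not spell out the loss, so treat the formula and the function names here as assumptions. The reward model assigns a scalar score to each output, and the loss is small when the labeler-preferred output scores higher than the other.

```python
import math


def pairwise_preference_loss(reward_preferred, reward_rejected):
    """-log(sigmoid(reward_preferred - reward_rejected)), computed stably.

    Assumes a logistic model in which the probability that the preferred
    output wins is sigmoid of the score difference.
    """
    margin = reward_preferred - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch on the sign of the
    # margin so exp() never receives a large positive argument.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def mean_preference_loss(scored_pairs):
    """Average loss over (reward_preferred, reward_rejected) score pairs."""
    losses = [pairwise_preference_loss(p, r) for p, r in scored_pairs]
    return sum(losses) / len(losses)
```

In a real training loop the scores would come from a neural reward model and the loss would be minimized by gradient descent. The trained model's score then serves as the reward that PPO maximizes in the final step.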
One way of thinking about this process is that it “unlocks” capabilities that GPT‑3 already had but that were difficult to elicit through prompt engineering alone. Our training procedure has a limited ability to teach the model new capabilities relative to what is learned during pretraining, since it uses less than 2% of the compute and data relative to model pretraining.
A limitation of this approach is that it introduces an “alignment tax”: aligning the models only on customer tasks can make their performance worse on some other academic NLP tasks. This is undesirable since, if our alignment techniques make models worse on tasks that people care about, they’re less likely to be adopted in practice. We’ve found a simple algorithmic change that minimizes this alignment tax: during RL fine-tuning we mix in a small fraction of the original data used to train GPT‑3, and train on this data using the normal log likelihood maximization.[D] This roughly maintains performance on safety and human preferences while mitigating performance decreases on academic tasks, and in several cases it even surpasses the GPT‑3 baseline.
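One way to picture the mixing described above is as a single objective with two terms: the average reward on RL samples, plus a weighted average log likelihood on pretraining text. This is a minimal sketch under that assumption. The function name, the `pretrain_coef` parameter, and the example numbers are hypothetical; the post says only that a small fraction of pretraining data is mixed in.

```python
def mixed_objective(rl_rewards, pretrain_logprobs, pretrain_coef):
    """Objective to maximize: mean RL reward + coef * mean pretraining log likelihood.

    rl_rewards:        reward-model scores for sampled policy outputs
    pretrain_logprobs: log probabilities the policy assigns to pretraining text
    pretrain_coef:     hypothetical weight controlling how much the
                       pretraining term counts relative to the reward term
    """
    rl_term = sum(rl_rewards) / len(rl_rewards)
    pretrain_term = sum(pretrain_logprobs) / len(pretrain_logprobs)
    return rl_term + pretrain_coef * pretrain_term


# Illustrative numbers only: mean reward 2.0, mean log likelihood -3.0.
example_value = mixed_objective([1.0, 3.0], [-2.0, -4.0], pretrain_coef=0.5)
```

Setting `pretrain_coef` to zero recovers plain reward maximization. A positive value penalizes the policy when it assigns low likelihood to pretraining text, which is the mechanism that would limit drift on academic tasks.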
Generalizing to broader preferences
Our procedure aligns our models’ behavior with the preferences of our labelers, who directly produce the data used to train our models, and us researchers, who provide guidance to labelers through written instructions, direct feedback on specific examples, and informal conversations. It is also…