Our approach to alignment research
Captured source
source ↗Our approach to alignment research | OpenAI
August 24, 2022
Our approach to alignment research
We are improving our AI systems’ ability to learn from human feedback and to assist humans at evaluating AI. Our goal is to build a sufficiently aligned AI system that can help us solve all other alignment problems.
Illustration: Justin Jay Wang
Loading…
Share
Our alignment research aims to make artificial general intelligence (AGI) aligned with human values and follow human intent. We take an iterative, empirical approach: by attempting to align highly capable AI systems, we can learn what works and what doesn’t, thus refining our ability to make AI systems safer and more aligned. Using scientific experiments, we study how alignment techniques scale and where they will break.
We tackle alignment problems both in our most capable AI systems as well as alignment problems that we expect to encounter on our path to AGI. Our main goal is to push current alignment ideas as far as possible, and to understand and document precisely how they can succeed or why they will fail. We believe that even without fundamentally new alignment ideas, we can likely build sufficiently aligned AI systems to substantially advance alignment research itself.
Unaligned AGI could pose substantial risks to humanity and solving the AGI alignment problem could be so difficult that it will require all of humanity to work together. Therefore we are committed to openly sharing our alignment research when it’s safe to do so: We want to be transparent about how well our alignment techniques actually work in practice and we want every AGI developer to use the world’s best alignment techniques.
At a high-level, our approach to alignment research focuses on engineering a scalable training signal for very smart AI systems that is aligned with human intent. It has three main pillars:
1. Training AI systems using human feedback 2. Training AI systems to assist human evaluation 3. Training AI systems to do alignment research
Aligning AI systems with human values also poses a range of other significant sociotechnical challenges, such as deciding to whom these systems should be aligned. Solving these problems is important to achieving our mission, but we do not discuss them in this post.
> “We want to be transparent about how well our alignment techniques actually work in practice and we want every AGI developer to use the world’s best alignment techniques.”
Training AI systems using human feedback
RL from human feedback is our main technique for aligning our deployed language models today. We train a class of models called InstructGPT derived from pretrained language models such as GPT‑3. These models are trained to follow human intent: both explicit intent given by an instruction as well as implicit intent such as truthfulness, fairness, and safety.
Our results show that there is a lot of low-hanging fruit on alignment-focused fine-tuning right now: InstructGPT is preferred by humans over a 100x larger pretrained model, while its fine-tuning costs <2% of GPT‑3’s pretraining compute and about 20,000 hours of human feedback. We hope that our work inspires others in the industry to increase their investment in alignment of large language models and that it raises the bar on users’ expectations about the safety of deployed models.
Our natural language API is a very useful environment for our alignment research: It provides us with a rich feedback loop about how well our alignment techniques actually work in the real world, grounded in a very diverse set of tasks that our customers are willing to pay money for. On average, our customers already prefer to use InstructGPT over our pretrained models.
Yet today’s versions of InstructGPT are quite far from fully aligned: they sometimes fail to follow simple instructions, aren’t always truthful, don’t reliably refuse harmful tasks, and sometimes give biased or toxic responses. Some customers find InstructGPT’s responses significantly less creative than the pretrained models’, something we hadn’t realized from running InstructGPT on publicly available benchmarks. We are also working on developing a more detailed scientific understanding of RL from human feedback and how to improve the quality of human feedback.
Aligning our API is much easier than aligning AGI since most tasks on our API aren’t very hard for humans to supervise and our deployed language models aren’t smarter than humans. We don’t expect RL from human feedback to be sufficient to align AGI, but it is a core building block for the scalable alignment proposals that we’re most excited about, and so it’s valuable to perfect this methodology.
Training models to assist human evaluation
RL from human feedback has a fundamental limitation: it assumes that humans can accurately evaluate the tasks our AI systems are doing. Today humans are pretty good at this, but as models become more capable, they will be able to do tasks that are much harder for humans to evaluate (e.g., finding all the flaws in a large codebase or a scientific paper). Our models might learn to tell our human evaluators what they want to hear instead of telling them the truth. In order to scale alignment, we want to use techniques like recursive reward modeling (RRM), debate, and iterated amplification.
Currently our main direction is based on RRM: we train models that can assist humans at evaluating our models on tasks that are too difficult for humans to evaluate directly. For example:
- We trained a model to summarize books. Evaluating book summaries takes a long time for humans if they are unfamiliar with the book, but our model can assist human evaluation by writing chapter summaries.
- We trained a model to assist humans at evaluating the factual accuracy by browsing the web and providing quotes and links. On simple questions, this model’s outputs are already preferred to responses written by humans.
- We trained a model to write critical comments on its own outputs: On a query-based summarization task, assistance with critical comments increases the flaws humans find in model outputs by 50% on average. This holds even if we ask humans to write plausible looking but incorrect summaries.
- We are creating a set of coding tasks selected to be very difficult to evaluate reliably for unassisted humans. We hope to release this data set soon.
Our alignment techniques need to work even if our AI systems are proposing…
Excerpt shown — open the source for the full document.
Notability
Scored, but no written rationale attached yet.
OpenAI has a writing signal matching evals and quality, infrastructure, safety and policy.