Lessons learned on language model safety and misuse
March 3, 2022
We describe our latest thinking in the hope of helping other AI developers address safety and misuse of deployed models.
Illustration: Justin Jay Wang
Summary
The deployment of powerful AI systems has enriched our understanding of safety and misuse far more than would have been possible through research alone. Notably: API-based language model misuse often comes in different forms than we feared most; we have identified limitations in existing language model evaluations that we are addressing with novel benchmarks and classifiers; and basic safety research offers significant benefits for the commercial utility of AI systems.
Over the past two years, we’ve learned a lot about how language models can be used and abused, insights we couldn’t have gained without the experience of real-world deployment. In June 2020, we began giving developers and researchers access to the OpenAI API, an interface for accessing and building applications on top of new AI models developed by OpenAI. Deploying GPT‑3, Codex, and other models in a way that reduces risks of harm has posed various technical and policy challenges.
Overview of our model deployment approach
Large language models are now capable of performing a very wide range of tasks, often out of the box. Their risk profiles, potential applications, and wider effects on society remain poorly understood. As a result, our deployment approach emphasizes continuous iteration, and makes use of the following strategies aimed at maximizing the benefits of deployment while reducing associated risks:
- Pre-deployment risk analysis, leveraging a growing set of safety evaluations and red teaming tools (e.g., we checked our InstructGPT models for any safety degradations using the evaluations discussed below)
- Starting with a small user base (e.g., both GPT‑3 and our InstructGPT series began as private betas)
- Studying the results of pilots of novel use cases (e.g., exploring the conditions under which we could safely enable longform content generation, working with a small number of customers)
- Implementing processes that help keep a pulse on usage (e.g., review of use cases, token quotas, and rate limits)
- Conducting detailed retrospective reviews (e.g., of safety incidents and major deployments)
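As an illustration of the usage controls mentioned in the list above (token quotas and rate limits), here is a minimal sketch in Python. It is not a description of how the OpenAI API enforces limits: the `UsageLimiter` class, its field names, and the example limit values are all hypothetical, chosen only to show how a cumulative token quota and a sliding-window rate limit can be checked together.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class UsageLimiter:
    """Hypothetical per-user limiter: a token quota plus a sliding-window rate limit."""

    token_quota: int               # total tokens the user may consume (assumed policy value)
    max_requests: int              # requests allowed per window (assumed policy value)
    window_seconds: float = 60.0   # length of the sliding window
    tokens_used: int = 0
    _recent: deque = field(default_factory=deque)  # timestamps of accepted requests

    def allow(self, tokens: int, now: float) -> bool:
        """Return True and record the usage if the request fits both limits."""
        # Drop timestamps that have aged out of the window.
        while self._recent and now - self._recent[0] >= self.window_seconds:
            self._recent.popleft()
        if len(self._recent) >= self.max_requests:
            return False  # rate limit reached for this window
        if self.tokens_used + tokens > self.token_quota:
            return False  # request would exceed the token quota
        self._recent.append(now)
        self.tokens_used += tokens
        return True
```

With `UsageLimiter(token_quota=1000, max_requests=2)`, two 400-token requests one second apart are accepted, a third request inside the same 60-second window is rejected by the rate limit, and a later 400-token request is rejected because it would push usage past the quota. Passing the timestamp in explicitly (rather than reading a clock inside the method) keeps the sketch deterministic and easy to test.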
There is no silver bullet for responsible deployment, so we try to learn about and address our models’ limitations, and potential avenues for misuse, at every stage of development and deployment. This approach allows us to learn as much as we can about safety and policy issues at small scale and incorporate those insights prior to launching larger-scale deployments.
While not exhaustive, some areas where we’ve invested so far include [A]:
- Pre-training data curation and filtering
- Fine-tuning models to better follow instructions
- Risk analysis of potential deployments
- Providing detailed user documentation
- Building tools to screen harmful model outputs
- Reviewing use cases against our policies
- Monitoring for signs of misuse
- Studying the impacts of our models
Since each stage of intervention has limitations, a holistic approach is necessary.
There are areas where we could have done more and where we still have room for improvement. For example, when we first worked on GPT‑3, we viewed it as an internal research artifact rather than a production system and were not as aggressive in filtering out toxic training data as we might have otherwise been. We have invested more in researching and removing such material for subsequent models. We have taken longer to address some instances of misuse in cases where we did not have clear policies on the subject, and have gotten better at iterating on those policies. And we continue to iterate towards a package of safety requirements that is maximally effective in addressing risks, while also being clearly communicated to developers and minimizing excessive friction.
Still, we believe that our approach has enabled us to measure and reduce various types of harms from language model use compared to a more hands-off approach, while at the same time enabling a wide range of scholarly, artistic, and commercial applications of our models. [B]
The many shapes and sizes of language model misuse
OpenAI has been active in researching the risks of AI misuse since our early work on the malicious use of AI in 2018 and on GPT‑2 in 2019, and we have paid particular attention to AI systems empowering influence operations. We have worked with external experts to develop proofs of concept and promoted careful analysis of such risks by third parties. We remain committed to addressing risks associated with language model-enabled influence operations and recently co-organized a workshop on the subject. [C]
Yet we have detected and stopped hundreds of actors attempting to misuse GPT‑3 for a much wider range of purposes than producing disinformation for influence operations, including in ways that we either didn’t anticipate or anticipated but didn’t expect to be so prevalent. [D] Our use case guidelines, content guidelines, and internal detection and response infrastructure were initially oriented towards risks that we anticipated based on internal and external research, such as generation of misleading political content with GPT‑3 or generation of malware with Codex. Our detection and response efforts have evolved over time in response to real cases of misuse encountered “in the wild” that didn’t feature as prominently as influence operations in our initial risk assessments. Examples include spam promotions for dubious medical products and roleplaying of racist fantasies.
To support the study of language model misuse and mitigation thereof, we are actively exploring opportunities to share statistics on safety incidents this year, in order to concretize discussions about language model misuse.
The difficulty of risk and impact measurement
Many aspects of language models’ risks and impacts remain hard to measure and therefore hard to monitor, minimize, and disclose in an accountable way. We have made active use of existing academic benchmarks for language model evaluation and are eager to continue building on external work, but we have also found that existing benchmark…