OpenAI, published May 3, 2018

AI safety via debate

We’re proposing an AI safety technique which trains agents to debate topics with one another, using a human to judge who wins.

We believe that this or a similar approach could eventually help us train AI systems to perform far more cognitively advanced tasks than humans are capable of, while remaining in line with human preferences. We’re going to outline this method together with preliminary proof-of-concept experiments and are also releasing a web interface so people can experiment with the technique.

One approach to aligning AI agents with human goals and preferences is to ask humans at training time which behaviors are safe and useful. While promising, this method requires humans to recognize good or bad behavior; in many situations an agent’s behavior may be too complex for a human to understand, or the task itself may be hard to judge or demonstrate. Examples include environments with very large, non-visual observation spaces: for instance, an agent that acts in a computer security-related environment, or one that coordinates a large set of industrial robots.

How can we augment humans so that they can effectively supervise advanced AI systems? One way is to take advantage of the AI itself to help with the supervision, asking the AI (or a separate AI) to point out flaws in any proposed action. To achieve this, we reframe the learning problem as a game played between two agents, where the agents have an argument with each other and the human judges the exchange. Even if the agents have a more advanced understanding of the problem than the human, the human may be able to judge which agent has the better argument (similar to expert witnesses arguing to convince a jury).

Our method proposes a specific debate format for such a game played between two dueling AI agents. The two agents can be trained by self play, similar to AlphaGo Zero or Dota 2. Our hope is that, properly trained, such agents can produce value-aligned behavior far beyond the capabilities of the human judge. If the two agents disagree on the truth but the full reasoning is too large to show the human, the debate can focus in on simpler and simpler factual disputes, eventually reaching a claim that is simple enough for direct judging.

As an example, consider the question “What’s the best place to go on vacation?” If an agent Alice purportedly does research on our behalf and says “Alaska”, it’s hard to judge whether this is really the best choice. If a second agent Bob says “no, it’s Bali”, that may sound convincing since Bali is warmer. Alice replies “you can’t go to Bali because your passport won’t arrive in time”, which surfaces a flaw with Bali that had not occurred to us. But Bob counters “expedited passport service takes only two weeks”. The debate continues until we reach a statement that the human can correctly judge, in the sense that the other agent doesn’t believe it can change the human’s mind.
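The turn-taking structure of such a debate can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `run_debate`, `scripted`, the policy signatures, and the +1/-1 win/loss reward are hypothetical names and choices made for this sketch, and the judge here is an arbitrary placeholder rather than a human (the example above does not say who wins).

```python
from typing import Callable, Dict, List, Tuple

Transcript = List[Tuple[str, str]]           # (speaker, statement)
Debater = Callable[[str, Transcript], str]   # hypothetical policy signature
Judge = Callable[[str, Transcript], str]     # returns the winner's name


def run_debate(question: str, debaters: Dict[str, Debater], judge: Judge,
               n_turns: int) -> Tuple[Transcript, str, Dict[str, int]]:
    """Alternate statements between two debaters, then ask the judge for a winner."""
    names = list(debaters)
    assert len(names) == 2, "this sketch assumes exactly two debaters"
    transcript: Transcript = []
    for turn in range(n_turns):
        speaker = names[turn % 2]
        transcript.append((speaker, debaters[speaker](question, transcript)))
    winner = judge(question, transcript)
    # Assumed win/loss reward: one agent's gain is the other's loss,
    # which is the setting where self-play training applies.
    rewards = {name: (1 if name == winner else -1) for name in names}
    return transcript, winner, rewards


def scripted(lines: List[str]) -> Debater:
    """Toy debater that replays fixed lines; a trained agent would go here."""
    remaining = iter(lines)
    return lambda question, transcript: next(remaining)


alice = scripted(["Alaska.",
                  "You can't go to Bali: your passport won't arrive in time."])
bob = scripted(["No, it's Bali.",
                "Expedited passport service takes only two weeks."])

transcript, winner, rewards = run_debate(
    "What's the best place to go on vacation?",
    {"Alice": alice, "Bob": bob},
    # Arbitrary placeholder verdict standing in for the human judge.
    judge=lambda question, transcript: "Bob",
    n_turns=4,
)
```

In a real system the scripted debaters would be replaced by learned policies and the placeholder judge by a human (or a model of one); only the alternation and the final verdict are meant to carry over.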

Convincing a sparse MNIST classifier

While we expect this approach to be most effective in the long term, when agents talk to each other in natural language, we need to test it in a simpler domain because today’s natural language modeling is insufficient for this task. The test is easier with a task that is not beyond human capabilities. We can do this by moving to the visual domain, and by replacing “debaters have capabilities the judge lacks” with “debaters have knowledge the judge lacks.” The goal of the judge is to guess the contents of an image, but the judge is blind aside from a few pixels chosen by the debaters. The debaters see the complete image, and play a game where they alternate revealing single pixels to the judge for a few turns. Here the full image is a stand-in for information beyond human scale, and each revealed pixel is a stand-in for a natural language discussion point.

We tried this on the simplest possible visual task: MNIST. The judge is not a human but a classifier trained in advance to predict MNIST digits from 6 non-black pixels, sampled at random for each image presentation when pretraining the judge. Two agents then have a debate where they alternate revealing pixels, stopping at a total of 6 revealed pixels (so the judge sees only a little bit of information in total). One debater is honest and tries to make the judge guess right; the other tries to make the judge guess wrong. If both agents must declare in advance which digit they are arguing for, debate turns a 59.4% accurate sparse classifier into an 88.9% accurate debate. That is, the honest player wins 88.9% of the time on the MNIST test set, in the process boosting the judge above its unaided accuracy.
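The mechanics described above (a judge that sees only sparse pixels, and two debaters who commit to a digit and alternate reveals) can be sketched as follows. This is a toy illustration, not the experiment's implementation: the debaters here are simple greedy players, which is an assumption of this sketch, and `toy_judge`, `TEMPLATES`, and the 3×3 image are hypothetical stand-ins for the pretrained classifier and real MNIST digits.

```python
import random
from typing import Callable, Dict, List, Tuple

Pixel = Tuple[int, int]
Image = List[List[float]]
JudgeFn = Callable[[Dict[Pixel, float]], Dict[int, float]]  # revealed -> score per digit


def nonblack_pixels(image: Image) -> List[Pixel]:
    return [(r, c) for r, row in enumerate(image) for c, v in enumerate(row) if v > 0]


def sample_sparse_view(image: Image, k: int, rng: random.Random) -> Dict[Pixel, float]:
    """Judge pretraining input: k non-black pixels sampled at random."""
    return {(r, c): image[r][c] for r, c in rng.sample(nonblack_pixels(image), k)}


def play_debate(image: Image, claims: Tuple[int, int], judge: JudgeFn,
                total_pixels: int = 6) -> Tuple[int, Dict[Pixel, float]]:
    """claims[i] is the digit debater i committed to in advance."""
    revealed: Dict[Pixel, float] = {}
    for turn in range(total_pixels):
        me = turn % 2
        mine, theirs = claims[me], claims[1 - me]

        def margin(p: Pixel) -> float:
            # Judge's preference for my digit over my opponent's if p is revealed.
            scores = judge({**revealed, p: image[p[0]][p[1]]})
            return scores[mine] - scores[theirs]

        options = [p for p in nonblack_pixels(image) if p not in revealed]
        best = max(options, key=margin)  # greedy move: an assumption of this sketch
        revealed[best] = image[best[0]][best[1]]
    final = judge(revealed)
    winner = 0 if final[claims[0]] > final[claims[1]] else 1
    return winner, revealed


# Hypothetical 3x3 "digit" templates; real MNIST images are 28x28.
TEMPLATES = {
    5: {(0, 0), (0, 1), (0, 2), (1, 0), (2, 1), (2, 2)},
    6: {(0, 0), (1, 0), (1, 1), (2, 0), (2, 1), (2, 2)},
}


def toy_judge(revealed: Dict[Pixel, float]) -> Dict[int, float]:
    # +1 for each revealed lit pixel a template explains, -1 for each it does not.
    return {d: sum(1 if p in t else -1 for p in revealed) for d, t in TEMPLATES.items()}


image = [[1.0 if (r, c) in TEMPLATES[5] else 0.0 for c in range(3)] for r in range(3)]
# Debater 0 lies (claims 6); debater 1 is honest (claims 5).
winner, revealed = play_debate(image, claims=(6, 5), judge=toy_judge, total_pixels=4)
view = sample_sparse_view(image, k=3, rng=random.Random(0))
```

In this toy game the liar can only reveal pixels that both templates share, while the honest debater reveals pixels that only the true digit has, so the honest debater (index 1) wins.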

The structure of an MNIST debate game. Debaters Alice and Bob see an image. Alice tries to trick the judge into thinking it’s a 6, while Bob tries to correctly convince the judge it’s a 5. They alternate revealing non-black pixels to the judge, who correctly identifies it as a 5 after seeing six pixels. Alice traces out the curve at the bottom to make it look like a 6 (red pixels), but Bob effectively counters this strategy by revealing pixels in the top right corner (blue pixels), which most MNIST 6s do not have.

Intuitively, an image is more likely to contain pixels that convincingly demonstrate the truth than to contain pixels that convincingly demonstrate a lie, so 6 pixels chosen half honestly and half maliciously are much better than 6 random pixels. If the judge were a human capable of reasoning about a few arguments at a time but not of sifting through the huge set of all possible arguments, optimal play in debate could (we hope) reward debating agents for doing the sifting for us, even if we lack an a priori mechanism for distinguishing good arguments from bad.

The panel below shows several example games. Liar wins are more interesting than honest wins, so we have chosen samples so that the liar wins in 50% of the games shown below (in reality the liar wins 11.1% of the games). Mouse over one of the digits to see a game, including which pixels are revealed and the judge’s possible hypothesis at each move:

[Interactive panel of example games]
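Choosing display samples so that liar wins make up half of what is shown, even though they are a small share of all games, amounts to stratified sampling. A minimal sketch, assuming a hypothetical `(game_id, liar_won)` record format and using synthetic records rather than real game data:

```python
import random
from typing import List, Tuple

Game = Tuple[int, bool]  # (game id, liar_won); hypothetical record format


def sample_for_display(games: List[Game], n: int, rng: random.Random,
                       liar_fraction: float = 0.5) -> List[Game]:
    """Draw n games so that a fixed fraction of them are liar wins."""
    liar_wins = [g for g in games if g[1]]
    honest_wins = [g for g in games if not g[1]]
    n_liar = round(n * liar_fraction)
    shown = rng.sample(liar_wins, n_liar) + rng.sample(honest_wins, n - n_liar)
    rng.shuffle(shown)
    return shown


# Synthetic records only: 111 liar wins out of 1000 mirrors the 11.1% rate quoted above.
games = [(i, i < 111) for i in range(1000)]
shown = sample_for_display(games, n=10, rng=random.Random(0))
```

The displayed set over-represents liar wins by design, so it should not be read as an estimate of how often the liar actually succeeds.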

Some digits are better lies than others. It is particularly easy to convince the judge that digits are 8 or 9:

Confusion matrices with diagonal set to zero for the 6 pixel MNIST judge (left) on random pixels and (right) for debate. Errors with 6 random pixels have no obvious pattern, while successful lies in…
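A confusion matrix with its diagonal set to zero keeps only the errors, which is what makes patterns in the mistakes visible. A minimal sketch of that computation, assuming rows index the true digit and columns the judge's guess (the figure's own axis convention is not stated here), with made-up labels used purely to exercise the function:

```python
from typing import List, Sequence


def confusion_off_diagonal(true_labels: Sequence[int], predicted: Sequence[int],
                           n_classes: int = 10) -> List[List[int]]:
    """Count (true, predicted) pairs, then zero the diagonal so only errors remain."""
    counts = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(true_labels, predicted):
        counts[t][p] += 1
    for d in range(n_classes):
        counts[d][d] = 0  # drop correct predictions
    return counts


# Made-up labels, not experimental data.
true_labels = [5, 5, 3, 3, 7]
predicted = [5, 8, 9, 3, 8]
errors = confusion_off_diagonal(true_labels, predicted)

# Column sums show which digits are most often guessed in error.
wrong_guess_counts = [sum(row[d] for row in errors) for d in range(10)]
```

Under this convention, a column with a large sum corresponds to a digit that the judge is frequently talked into guessing incorrectly.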
