OpenAI, published Mar 16, 2017

Learning to communicate

March 16, 2017

In this post we’ll outline new OpenAI research in which agents develop their own language.

Our hypothesis is that true language understanding will come from agents that learn words in combination with how they affect the world, rather than spotting patterns in a huge corpus of text. As a first step, we wanted to see if cooperative agents could develop a simple language amongst themselves.

Training agents to invent a language

We’ve just released initial results⁠ in which we teach AI agents to create language by dropping them into a set of simple worlds, giving them the ability to communicate, and then giving them goals that can be best achieved by communicating with other agents. If they achieve a goal, then they get rewarded. We train them using reinforcement learning and, due to careful experiment design, they develop a shared language to help them achieve their goals.

Our approach yields agents that invent a (simple!) language which is grounded⁠ and compositional⁠. Grounded means that words in a language are tied to something directly experienced by a speaker in their environment, for example, a speaker forming an association between the word “tree” and images or experiences of trees. Compositional means that speakers can assemble multiple words into a sentence to represent a specific idea, such as getting another agent to go to a specific location.

To train the agents, we represent the experiment as a cooperative—rather than competitive—multi-agent reinforcement learning problem. The agents exist in a two-dimensional world with simple landmarks, and each agent has a goal. Goals can vary from looking at or moving to a specific location, to encouraging a separate agent to move to a location. Each agent can broadcast messages to the group. Every agent’s reward is the sum of the rewards paid out to all agents, encouraging collaboration.
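The shared-reward rule described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation; the function name `shared_rewards` and the agent names are hypothetical.

```python
from typing import Dict


def shared_rewards(individual: Dict[str, float]) -> Dict[str, float]:
    """Return the reward each agent receives under a fully cooperative scheme.

    Every agent gets the same value: the sum of the rewards paid out to
    all agents, so helping another agent helps oneself equally.
    """
    total = sum(individual.values())
    return {name: total for name in individual}


# Hypothetical per-agent rewards for one time step.
rewards = shared_rewards({"red": 1.0, "green": -0.5, "blue": 0.25})
print(rewards)
```

Because every agent receives the same total, no agent can gain by withholding useful information from the others.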

At each time step, our RL agents can take two kinds of actions: (i) environment actions, like moving around or looking at things, and (ii) communication actions, like broadcasting a word to all other agents. (Note that although the agents come up with words that we found to correspond to objects and other agents, as well as actions like “Look at” or “Go to,” to the agents these words are abstract symbols represented by one-hot vectors; we label these one-hot vectors with English words that capture their meaning for the sake of interpretability.) Before an agent takes an action, it observes the communications from other agents from the previous time step as well as the locations of all entities and objects in the world. It stores those communications in a private recurrent neural network, giving it a memory for the words it hears.
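To make the one-hot representation concrete, here is a small sketch. The vocabulary size and the English labels in `LABELS` are assumptions chosen for illustration; in the experiments the labels are attached by the researchers after training, and the agents only ever see the vectors.

```python
from typing import List

# Hypothetical human-assigned labels for four abstract symbols.
LABELS = ["GOTO", "LOOK-AT", "RED", "BLUE"]


def one_hot(index: int, size: int) -> List[float]:
    """Encode symbol number `index` as a one-hot vector of length `size`."""
    if not 0 <= index < size:
        raise ValueError("symbol index out of range")
    return [1.0 if i == index else 0.0 for i in range(size)]


def label_of(vector: List[float]) -> str:
    """Map a one-hot vector back to its human-readable label."""
    return LABELS[vector.index(max(vector))]


word = one_hot(2, len(LABELS))
print(word, label_of(word))
```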

We use discrete communication actions (messages formed of separate, word-like symbols) sent over a differentiable communication channel. A communication channel is differentiable if it allows agents to directly inform each other about what message they should have sent at each time step, by slightly altering their messages to make a positive change in the reward both agents expect to receive. Agents accomplish this by calculating the gradient of future reward with respect to changes in the sent messages (i.e., how much the reward would change with different messages). For example, if one agent realizes that it could have performed a task better if a second agent had sent different information, the first agent can tell the second exactly how to modify its messages to make them as useful as possible. In other words, agents ask the question: “How should I modify my communication output to get the most communal reward in the future?”
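The idea of a gradient of reward with respect to the message can be illustrated with a toy example. Everything here is an assumption made for illustration: the landmark positions, the way the listener turns a continuous message into a position, and the use of finite differences in place of backpropagation through a neural network.

```python
from typing import Callable, List

# Hypothetical landmark positions and goal in a 2D world.
LANDMARKS = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
GOAL = (1.0, 0.0)


def listener_reward(message: List[float]) -> float:
    """Toy listener: moves to a message-weighted mix of landmarks.

    The reward is the negative squared distance to the goal.
    """
    x = sum(w * lx for w, (lx, _) in zip(message, LANDMARKS))
    y = sum(w * ly for w, (_, ly) in zip(message, LANDMARKS))
    return -((x - GOAL[0]) ** 2 + (y - GOAL[1]) ** 2)


def reward_gradient(reward_fn: Callable[[List[float]], float],
                    message: List[float], eps: float = 1e-5) -> List[float]:
    """Central finite-difference estimate of d(reward)/d(message)."""
    grad = []
    for i in range(len(message)):
        up = list(message)
        down = list(message)
        up[i] += eps
        down[i] -= eps
        grad.append((reward_fn(up) - reward_fn(down)) / (2 * eps))
    return grad


message = [1 / 3, 1 / 3, 1 / 3]
before = listener_reward(message)
grad = reward_gradient(listener_reward, message)
# The listener's gradient tells the speaker how to change its message.
message = [m + 0.1 * g for m, g in zip(message, grad)]
after = listener_reward(message)
print(before, after)
```

The gradient is positive for the component that points at the goal landmark and negative for the one that points away from it, so a single small step along it raises the listener's reward.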

Previous efforts achieved this sort of differentiable communication by having the agents send each other a vector of real numbers or a continuous approximation to binary values, or used non-differentiable communication and training. We use the Gumbel-Softmax trick to approximate discrete communication decisions with a continuous representation during training. This gets us the best of both worlds: during training, the differentiable channel lets agents rapidly learn how to communicate with each other using a continuous representation, which by the end of training converges on discrete outputs that are more interpretable and show traits like compositionality.
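A minimal pure-Python sketch of Gumbel-Softmax sampling follows. It shows the standard form of the trick (add Gumbel noise to the logits, then apply a temperature-scaled softmax); the function names, logits, and temperatures are illustrative assumptions, and the post does not specify the settings used in the experiments.

```python
import math
import random
from typing import List, Optional


def gumbel_softmax(logits: List[float], temperature: float = 1.0,
                   rng: Optional[random.Random] = None) -> List[float]:
    """Draw a continuous relaxation of a one-hot sample from `logits`."""
    rng = rng or random.Random()
    noisy = []
    for logit in logits:
        # Clamp the uniform sample away from 0 and 1 so both logs are defined.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / temperature)
    # Numerically stable softmax over the noisy, temperature-scaled logits.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def to_one_hot(probs: List[float]) -> List[float]:
    """Discretize a relaxed sample by taking its argmax."""
    best = probs.index(max(probs))
    return [1.0 if i == best else 0.0 for i in range(len(probs))]


logits = [2.0, 0.5, 0.1]
soft = gumbel_softmax(logits, temperature=1.0, rng=random.Random(0))
sharp = gumbel_softmax(logits, temperature=0.1, rng=random.Random(0))
print(soft, sharp, to_one_hot(sharp))
```

With the same noise, lowering the temperature keeps the same winning symbol but concentrates more probability on it, which is why the relaxed messages can approach discrete words.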

In the video that follows, we show how our agents evolve languages to fit the complexity of their situation, with solitary agents not needing to communicate, two agents inventing one-word phrases to coordinate with each other in simple tasks, and three agents composing multiple words in sentences to accomplish more challenging tasks.

How experimental setup influences how language evolves

All research projects have complications⁠; in this case, our agents frequently invented languages that didn’t display the compositional traits we wanted. And even when they succeeded, their solutions had their own idiosyncrasies.

The first problem we ran into was the agents’ tendency to create a single utterance and intersperse it with spaces to create meaning. This Morse code language was hard to decipher and non-compositional. To correct this, we imposed a slight cost on every utterance and added a preference for achieving the task quickly. This encouraged the agents to use their communication channel concisely, which led to the development of a larger vocabulary.
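One way to express the two penalties described above is as terms subtracted from the task reward. This is a sketch under assumptions: the post does not give the functional form or the coefficients, so the linear per-word and per-step costs and their default values are hypothetical.

```python
def shaped_reward(task_reward: float, words_spoken: int, steps_taken: int,
                  utterance_cost: float = 0.01, step_cost: float = 0.01) -> float:
    """Task reward minus a small cost per utterance and per time step.

    Both coefficients are illustrative assumptions, not values from the post.
    """
    return task_reward - utterance_cost * words_spoken - step_cost * steps_taken


# A long, padded exchange earns less than a short, direct one.
chatty = shaped_reward(1.0, words_spoken=10, steps_taken=10)
concise = shaped_reward(1.0, words_spoken=2, steps_taken=5)
print(chatty, concise)
```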

Another issue we faced was agents trying to use single words to encode the meaning of entire sentences. This happened when we gave them the ability to use large vocabularies; they’d eventually create a single utterance that encoded the meaning of an entire sentence such as “red agent, go to blue landmark.” While useful for the agents, this approach requires vocabulary size to grow exponentially with the sentence length and doesn’t fit with our broader goal of creating AI that is interpretable to humans. To deter agents from creating this sort of language, we incorporated a preference for compact vocabulary sizes through a preference for using already-popular words, inspired by ideas outlined in The evolution of syntactic communication. We incorporate this by adding a reward for speaking a particular word that is proportional to how frequently that word has been spoken previously.
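The popularity reward can be sketched as a running word count. The class name `PopularityBonus`, the `scale` parameter, and the normalization by the total number of words spoken are assumptions for illustration; the post only states that the reward is proportional to how often the word has been spoken before.

```python
from collections import Counter


class PopularityBonus:
    """Reward a word in proportion to how often it was spoken previously."""

    def __init__(self, scale: float = 0.1) -> None:
        self.scale = scale      # hypothetical proportionality constant
        self.counts = Counter()  # how often each word has been spoken
        self.total = 0           # total number of words spoken so far

    def reward(self, word: str) -> float:
        """Return the bonus for speaking `word`, then record the utterance."""
        if self.total:
            bonus = self.scale * self.counts[word] / self.total
        else:
            bonus = 0.0
        self.counts[word] += 1
        self.total += 1
        return bonus


bonus = PopularityBonus(scale=1.0)
for spoken in ["a", "a", "b"]:
    bonus.reward(spoken)
print(bonus.reward("a"), bonus.reward("c"))
```

A brand-new word earns no bonus, so agents are nudged toward reusing a small set of established words rather than minting one symbol per sentence.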

Lastly, we encountered agents inventing landmark references not based on color, but…
