Learning Montezuma’s Revenge from a single demonstration
Captured source
source ↗Learning Montezuma’s Revenge from a single demonstration | OpenAI
July 4, 2018
Learning Montezuma’s Revenge from a single demonstration
Loading…
Share
We’ve trained an agent to achieve a high score of 74,500 on Montezuma’s Revenge from a single human demonstration, better than any previously published result. Our algorithm is simple: the agent plays a sequence of games starting from carefully chosen states from the demonstration, and learns from them by optimizing the game score using PPO, the same reinforcement learning algorithm that underpins OpenAI Five.
Loading...
Exploration and learning
In order to succeed at a reinforcement learning problem, an AI needs to do two things:
- Find a sequence of actions that leads to positive reward. This is the exploration problem.
- Remember the sequence of actions to take, and generalize to related but slightly different situations. This is the learning problem.
The exploration problem can largely be bypassed in Montezuma’s Revenge by starting each RL episode by resetting from a state in a demonstration. By starting from demonstration states, the agent needs to perform much less exploration to learn to play the game compared to when it starts from the beginning of the game at every episode. Doing so enables us to disentangle exploration and learning. Our results suggest that exploration is the hardest of the two problems for games like Montezuma’s Revenge and certain similar Atari games like PrivateEye.
Why exploration is difficult
Model-free RL methods like policy gradients and Q-learning explore by taking actions randomly. If, by chance, the random actions lead to a reward, they are reinforced, and the agent becomes more likely to take these beneficial actions in the future. This works well if rewards are dense enough for random actions to lead to a reward with reasonable probability. However, many of the more complicated games require long sequences of very specific actions to experience any reward, and such sequences are extremely unlikely to occur randomly.
Loading...
Consider a game which takes a precise sequence of N actions to experience the first reward. If each of those actions is taken with a fixed probability, a random agent will need to play the game for a duration that scales as exp(N) before it can expect to experience the first reward.
In the case of Montezuma’s Revenge, for example, the probability of getting the first key can be decomposed as
p(get key) = p(get down ladder 1) * p(get down rope) * p(get down ladder 2) * p(jump over skull) * p(get up ladder 3).
By multiplying N of these probabilities together, we end up with the resulting probability p(get key) that is exponentially smaller than any of the individual input probabilities. Algorithms with exponential scaling break down very quickly as your problem becomes more challenging, which limits the tasks current reinforcement learning techniques can solve.
Simplifying exploration with demonstrations
Although model-free RL methods have difficulty finding long sequences of actions, they work well for shorter sequences. Our main insight is that we can make our task easier to solve by decomposing it into a curriculum of subtasks requiring short action sequences; we construct this curriculum by starting each RL episode from a demonstration state. A variant of the same idea was used recently for reverse curriculum generation for robotics, where a curriculum was constructed by iteratively perturbing a set of starting states using random actions, and selecting the resulting states with the right level of difficulty.
Our approach works by letting each RL episode start from a state in a previously recorded demonstration. Early on in training, the agent begins every episode near the end of the demonstration. Once the agent is able to beat or at least tie the score of the demonstrator on the remaining part of the game in at least 20% of the rollouts, we slowly move the starting point back in time. We keep doing this until the agent is playing from the start of the game, without using the demo at all, at which point we have an RL-trained agent beating or tying the human expert on the entire game.
By slowly moving our starting state from the end of the demonstration to the beginning, we ensure that at every point the agent faces an easy exploration problem where it is likely to succeed, since it has already learned to solve most of the remaining game. We can interpret solving the RL problem in this way as a form of dynamic programming. If a specific sequence of N actions is required to reach a reward, this sequence may now be learned in a time that is linear in N, rather than exponential.
Starting episodes by resetting from demonstration states was previously proposed, but without constructing a curriculum that gradually moves the starting state back from the end of the demonstration to the beginning. When combined with imitation learning, several researchers report benefit from this approach. For our use case we found such a curriculum to be vitally important for deriving benefit from the demonstration.
Impression of our agent learning to reach the first key in Montezuma’s Revenge using RL and starting each episode from a demonstration state. When our agent starts playing the game, we place it right in front of the key, requiring it to only take a single jump to find success. After our agent has learned to do this consistently, we slowly move the starting point back in time. Our agent might then find itself halfway up the ladder that leads to the key. Once it learns to climb the ladder from there, we can have it start at the point where it needs to jump over the skull. After it learns to do that, we can have it start on the rope leading to the floor of the room, etc. Eventually, the agent starts in the original starting state of the game and is able to reach the key completely by itself.
Comparison to imitation-based approaches
Recently, DeepMind has shown an agent learning Montezuma’s Revenge by imitation learning from a demonstration; one approach trains an agent to achieve the same states seen in a YouTube video of Montezuma’s Revenge, and another technique combines a sophisticated version of Q-learning with maximizing the likelihood of actions taken in a demonstration. The advantage of these approaches is that they do not require as much control over the environment our technique does: they do not reset the environment to…
Excerpt shown — open the source for the full document.