OpenAI Five
Captured source
source ↗OpenAI Five | OpenAI
June 25, 2018
Milestone
OpenAI Five
Our team of five neural networks, OpenAI Five, has started to defeat amateur human teams at Dota 2.
Loading…
Share
Our team of five neural networks, OpenAI Five, has started to defeat amateur human teams at Dota 2. While today we play with restrictions, we aim to beat a team of top professionals at The International in August subject only to a limited set of heroes. We may not succeed: Dota 2 is one of the most popular and complex esports games in the world, with creative and motivated professionals who train year-round to earn part of Dota’s annual $40M prize pool(the largest of any esports game).
OpenAI Five plays 180 years worth of games against itself every day, learning via self-play. It trains using a scaled-up version of Proximal Policy Optimization running on 256 GPUs and 128,000 CPU cores—a larger-scale version of the system we built to play the much-simpler solo variant of the game last year. Using a separate LSTM for each hero and no human data, it learns recognizable strategies. This indicates that reinforcement learning can yield long-term planning with large but achievable scale—without fundamental advances, contrary to our own expectations upon starting the project.
To benchmark our progress, we’ll host a match versus top players on August 5th. Follow us on Twitch to view the live broadcast, or request an invite to attend in person!
The problem
One AI milestone is to exceed human capabilities in a complex video game like StarCraft or Dota. Relative to previous AI milestones like Chess or Go, complex video games start to capture the messiness and continuous nature of the real world. The hope is that systems which solve complex video games will be highly general, with applications outside of games.
Dota 2 is a real-time strategy game played between two teams of five players, with each player controlling a character called a “hero”. A Dota-playing AI must master the following:
- Long time horizons. Dota games run at 30 frames per second for an average of 45 minutes, resulting in 80,000 ticks per game. Most actions (like ordering a hero to move to a location) have minor impact individually, but some individual actions like town portal usage can affect the game strategically; some strategies can play out over an entire game. OpenAI Five observes every fourth frame, yielding 20,000 moves. Chess usually ends before 40 moves, Go before 150 moves, with almost every move being strategic.
- Partially-observed state. Units and buildings can only see the area around them. The rest of the map is covered in a fog hiding enemies and their strategies. Strong play requires making inferences based on incomplete data, as well as modeling what one’s opponent might be up to. Both chess and Go are full-information games.
- High-dimensional, continuous action space. In Dota, each hero can take dozens of actions, and many actions target either another unit or a position on the ground. We discretize the space into 170,000 possible actions per hero (not all valid each tick, such as using a spell on cooldown); not counting the continuous parts, there are an average of ~1,000 valid actions each tick. The average number of actions in chess is 35; in Go, 250.
- High-dimensional, continuous observation space. Dota is played on a large continuous map containing ten heroes, dozens of buildings, dozens of NPC units, and a long tail of game features such as runes, trees, and wards. Our model observes the state of a Dota game via Valve’s Bot API as 20,000 (mostly floating-point) numbers representing all information a human is allowed to access. A chess board is naturally represented as about 70 enumeration values (a 8x8 board of 6 piece types and minor historical info); a Go board as about 400 enumeration values (a 19x19 board of 2 piece types plus Ko).
The Dota rules are also very complex — the game has been actively developed for over a decade, with game logic implemented in hundreds of thousands of lines of code. This logic takes milliseconds per tick to execute, versus nanoseconds for Chess or Go engines. The game also gets an update about once every two weeks, constantly changing the environment semantics.
Our approach
Our system learns using a massively-scaled version of Proximal Policy Optimization. Both OpenAI Five and our earlier 1v1 bot learn entirely from self-play. They start with random parameters and do not use search or bootstrap from human replays.
Loading...
RL researchers (including ourselves) have generally believed that long time horizons would require fundamentally new advances, such as hierarchical reinforcement learning. Our results suggest that we haven’t been giving today’s algorithms enough credit — at least when they’re run at sufficient scale and with a reasonable way of exploring.
Our agent is trained to maximize the exponentially decayed sum of future rewards, weighted by an exponential decay factor calledγ. During the latest training run of OpenAI Five, we annealedγ from0.998(valuing future rewards with a half-life of 46 seconds) to0.9997(valuing future rewards with a half-life of five minutes). For comparison, the longest horizon in the PPO paper was a half-life of 0.5 seconds, the longest in the Rainbow paper was a half-life of 4.4 seconds, and the Observe and Look Further paper used a half-life of 46 seconds.
While the current version of OpenAI Five is weak at last-hitting(observing our test matches, the professional Dota commentator Blitz estimated it around median for Dota players), its objective prioritization matches a common professional strategy. Gaining long-term rewards such as strategic map control often requires sacrificing short-term rewards such as gold gained from farming, since grouping up to attack towers takes time. This observation reinforces our belief that the system is truly optimizing over a long horizon.
Model structure
Each of OpenAI Five’s networks contain a single-layer, 1024-unit LSTM that sees the current game state (extracted from Valve’s Bot API) and emits actions through several possible action heads. Each head has semantic meaning, for example, the number of ticks to delay this action, which action to select, the X or Y coordinate of this action in a grid around the unit, etc. Action heads are computed independently.
Interactive demonstration of the observation space and action space used…
Excerpt shown — open the source for the full document.