Retro Contest: Results
Captured source
source ↗Retro Contest: Results | OpenAI
June 22, 2018
Conclusion
Retro Contest: Results
The first run of our Retro Contest—exploring the development of algorithms that can generalize from previous experience—is now complete.
Loading…
Share
Though many approaches were tried, top results all came from tuning or extending existing algorithms such as PPO and Rainbow. There’s a long way to go: top performance was 4,692 after training while the theoretical max is 10,000. These results provide validation that our Sonic benchmark is a good problem for the community to double down on: the winning solutions are general machine learning approaches rather than competition-specific hacks, suggesting that one can’t cheat through this problem.
Loading...
Over the two-month contest, 923 teams registered and 229 submitted solutions to the leaderboard. Our automated evaluation systems conducted a total of 4,448 evaluations of submitted algorithms, corresponding to about 20 submissions per team. The contestants got to see their scores rise on the leaderboard, which was based on a test set of five low-quality levels that we created using a level editor. You can watch the agents play one of these levels by clicking on the leaderboard entries.
Because contestants got feedback about their submission in the form of a score and a video of the agent being tested on a level, they could easily overfit to the leaderboard test set. Therefore, we used a completely different test set for the final evaluation. Once submissions closed, we took the latest submission from the top 10 entrants and tested their agents against 11 custom Sonic levels made by skilled level designers. To reduce noise, we evaluated each contestant on each level three times, using different random seeds for the environment. The ranking changed in this final evaluation, but not to a large extent.
Loading...
Top scores
The top 5 scoring teams are:
Rank
Team
Score
#1
Dharmaraja
4692
#2
mistake
4446
#3
aborg
4430
#4
whatever
4274
#5
Students of Plato
4269
Joint PPO baseline
4070
Joint Rainbow baseline
3843
Rainbow baseline
3498
Dharmaraja topped the scoreboard during the contest, and the lead remained on the final evaluation; mistake narrowly won out over aborg for second place. The top three teams will receive trophies.
Learning curves of the top three teams for all 11 levels are as follows (showing the standard error computed from three runs).
Averaging over all levels, we can see the following learning curves.
Note that Dharmaraja and aborg start at similar scores, whereas mistake starts much lower. As we will describe in more detail below, these two teams fine-tuned (using PPO) from a pre-trained network, whereas mistake trained from scratch (using Rainbow DQN). mistake’s learning curves end early because they timed out at 12 hours.
Meet the winners
Loading...
Dharmaraja
Dharmaraja is a six-member team including Qing Da, Jing-Cheng Shi, Anxiang Zeng, Guangda Huzhang, Run-Ze Li, and Yang Yu. Qing Da and Anxiang Zeng are from the AI team within the search department of Alibaba in Hangzhou, China. In recent years, they have studied how to apply reinforcement learning to real world problems, especially in an e-commerce setting, together with Yang Yu, who is an Associate Professor of the Department of Computer Science at Nanjing University, Nanjing, China.
Dharmaraja’s solution is a variant of joint PPO (described in our tech report) with a few improvements. First, it uses RGB images rather than grayscale; second, it uses a slightly augmented action space, with more common button combinations; third, it uses an augmented reward function, which rewards the agent for visiting new states (as judged by a perceptual hash of the screen). In addition to these modifications, the team also tried a number of things that didn’t pan out: DeepMimic, object detection through YOLO, and some Sonic-specific ideas.
- Get the source code
Loading...
Mistake
Team mistake consists of Peng Xu and Qiaoling Zhong. Both are second-year graduate students in Beijing, China, studying at the CAS Key Laboratory of Network Data Science and the Technology Institute of Computing Technology, Chinese Academy of Sciences. In their spare time, Peng Xu enjoys playing basketball, and Qiaoling Zhong likes to play badminton. Their favorite video games are Contra and Mario.
Mistake’s solution is based on the Rainbow baseline. They made several modifications that helped boost performance: a better value of n for n-step Q learning; an extra CNN layer added to the model, which made training slower but better; and a lower DQN target update interval. Additionally, the team tried joint training with Rainbow, but found that it actually hurt performance in their case.
- Get the source code
Loading...
Aborg
Team Aborg is a solo effort from Alexandre Borghi. After completing a PhD in computer science in 2011, Alexandre worked for different companies in France before moving to the United Kingdom where he is a research engineer in deep learning. As both a video game and machine learning enthusiast, he spends most of his free time studying deep reinforcement learning, which led him to take part in the OpenAI Retro Contest.
Aborg’s solution, like Dharmaraja’s, is a variant of joint PPO with many improvements: more training levels from the Game Boy Advance and Master System Sonic games; a different network architecture; and fine-tuning hyper-parameters that were designed specifically for fast learning. Elaborating on the last point, Alexandre noticed that the first 150K timesteps of fine-tuning were unstable (i.e. the performance sometimes got worse), so he tuned the learning rate to fix this problem. In addition to the above changes, Alexandre tried several solutions that did not work: different optimizers, MobileNetV2, using color images, etc.
- Get the source code
Best write-ups
The Best Write-up Prize is awarded to contestants that produced high-quality essays describing the approaches they tried.
Loading...
Now, let’s meet the winners of this prize category.
Loading...
Dylan Djian
Dylan currently lives in Paris, France. He is a student in software development at school 42 in Paris. He got into machine learning after watching a video of a genetic algorithm learning how to play Mario a year and a half ago. This video sparked his interest and made him want to learn more about the field. His favorite video…
Excerpt shown — open the source for the full document.