OpenAI Writing: Retro Contest: Results

openai.com/openai.com/index/retro-contest-results

June 22, 2018

Retro Contest: Results

The first run of our Retro Contest—exploring the development of algorithms that can generalize from previous experience—is now complete.

Though many approaches were tried, top results all came from tuning or extending existing algorithms such as PPO and Rainbow. There’s a long way to go: top performance was 4,692 after training while the theoretical max is 10,000. These results provide validation that our Sonic benchmark is a good problem for the community to double down on: the winning solutions are general machine learning approaches rather than competition-specific hacks, suggesting that one can’t cheat through this problem.

Over the two-month contest, 923 teams registered and 229 submitted solutions to the leaderboard. Our automated evaluation systems conducted a total of 4,448 evaluations of submitted algorithms, corresponding to about 20 submissions per team. The contestants got to see their scores rise on the leaderboard, which was based on a test set of five low-quality levels that we created using a level editor. You can watch the agents play one of these levels by clicking on the leaderboard entries⁠.
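The participation figures above can be checked with a few lines of arithmetic. This is only a sketch using the numbers quoted in the paragraph; the variable names are illustrative.

```python
# Figures quoted in the paragraph above.
registered_teams = 923
submitting_teams = 229
total_evaluations = 4448

# Average number of evaluations per team that actually submitted.
evaluations_per_team = total_evaluations / submitting_teams

# Share of registered teams that made at least one submission.
submission_rate = submitting_teams / registered_teams

print(f"{evaluations_per_team:.1f} evaluations per submitting team")
print(f"{submission_rate:.1%} of registered teams submitted")
```

The average works out to roughly 19.4 evaluations per submitting team, consistent with the "about 20" quoted above, and roughly a quarter of registered teams submitted at least once.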

Because contestants got feedback about their submission in the form of a score and a video of the agent being tested on a level, they could easily overfit to the leaderboard test set. Therefore, we used a completely different test set for the final evaluation. Once submissions closed, we took the latest submission from the top 10 entrants and tested their agents against 11 custom Sonic levels made by skilled level designers. To reduce noise, we evaluated each contestant on each level three times, using different random seeds for the environment. The ranking changed in this final evaluation, but not to a large extent.
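The final evaluation described above can be sketched roughly as follows. The aggregation rule (mean over seeds, then mean over levels) is an assumption made for illustration, and the team names, level names, and scores are made up; only the structure of several levels with three seeded runs each comes from the text.

```python
from statistics import mean

def final_score(scores_by_level):
    """Average the seeded runs within each level, then average across levels.

    scores_by_level maps a level name to the list of scores from its runs.
    The aggregation rule is an assumption, not the contest's confirmed method.
    """
    per_level = [mean(seed_scores) for seed_scores in scores_by_level.values()]
    return mean(per_level)

def rank_entrants(results):
    """Return entrant names sorted from highest to lowest final score."""
    return sorted(results, key=lambda name: final_score(results[name]), reverse=True)

# Hypothetical data: two entrants, two levels, three seeds per level.
results = {
    "team_a": {"level_1": [3000.0, 3200.0, 3100.0], "level_2": [5000.0, 5100.0, 4900.0]},
    "team_b": {"level_1": [4000.0, 4100.0, 3900.0], "level_2": [3000.0, 3000.0, 3000.0]},
}

for name in rank_entrants(results):
    print(name, final_score(results[name]))
```

Averaging within a level first keeps every level equally weighted even if the number of runs per level were to differ.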

Top scores

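The top 5 scoring teams are:

| Rank | Team | Score |
|------|------|-------|
| 1 | Dharmaraja | 4692 |
| 2 | mistake | 4446 |
| 3 | aborg | 4430 |
| 4 | whatever | 4274 |
| 5 | Students of Plato | 4269 |
| - | Joint PPO baseline | 4070 |
| - | Joint Rainbow baseline | 3843 |
| - | Rainbow baseline | 3498 |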

Dharmaraja topped the scoreboard during the contest, and the lead remained on the final evaluation; mistake narrowly won out over aborg for second place. The top three teams will receive trophies.
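To put the margins in perspective, the snippet below computes a few gaps from the scores listed above. The scores and the theoretical maximum of 10,000 come from this post; the helper name `gap` is illustrative.

```python
THEORETICAL_MAX = 10_000

# Final scores as listed in the results above.
scores = {
    "Dharmaraja": 4692,
    "mistake": 4446,
    "aborg": 4430,
    "whatever": 4274,
    "Students of Plato": 4269,
    "Joint PPO baseline": 4070,
    "Joint Rainbow baseline": 3843,
    "Rainbow baseline": 3498,
}

def gap(first, second):
    """Score difference between two entries (positive if first scored higher)."""
    return scores[first] - scores[second]

fraction_of_max = scores["Dharmaraja"] / THEORETICAL_MAX

print("mistake over aborg:", gap("mistake", "aborg"))
print("Dharmaraja over Joint PPO baseline:", gap("Dharmaraja", "Joint PPO baseline"))
print(f"Top score as a fraction of the maximum: {fraction_of_max:.1%}")
```

Second and third place are separated by 16 points, while the winner leads the strongest baseline by 622 points and still reaches under half of the theoretical maximum.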

Learning curves of the top three teams for all 11 levels are as follows (showing the standard error computed from three runs).
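For reference, the standard error of the mean over a small number of runs can be computed as below. This is a generic sketch rather than the plotting code used for the figures, and the three scores are made-up values.

```python
from math import sqrt
from statistics import stdev

def standard_error(samples):
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to estimate the standard error")
    return stdev(samples) / sqrt(len(samples))

# Hypothetical scores from three runs of one agent on one level.
runs = [4100.0, 4300.0, 4200.0]
print(f"mean = {sum(runs) / len(runs):.1f}, standard error = {standard_error(runs):.1f}")
```

With only three runs per level, the error bars are a coarse indication of noise rather than a precise estimate.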

Averaging over all levels, we can see the following learning curves.

Note that Dharmaraja and aborg start at similar scores, whereas mistake starts much lower. As we will describe in more detail below, Dharmaraja and aborg fine-tuned (using PPO) from a pre-trained network, whereas mistake trained from scratch (using Rainbow DQN). mistake’s learning curves end early because their runs timed out at 12 hours.

Meet the winners

Dharmaraja

Dharmaraja is a six-member team including Qing Da, Jing-Cheng Shi, Anxiang Zeng, Guangda Huzhang, Run-Ze Li, and Yang Yu. Qing Da and Anxiang Zeng are from the AI team within the search department of Alibaba in Hangzhou, China. In recent years, they have studied how to apply reinforcement learning to real-world problems, especially in an e-commerce setting, together with Yang Yu, who is an Associate Professor in the Department of Computer Science at Nanjing University, Nanjing, China.

Dharmaraja’s solution is a variant of joint PPO (described in our tech report⁠) with a few improvements. First, it uses RGB images rather than grayscale; second, it uses a slightly augmented action space, with more common button combinations; third, it uses an augmented reward function, which rewards the agent for visiting new states (as judged by a perceptual hash of the screen). In addition to these modifications, the team also tried a number of things that didn’t pan out: DeepMimic⁠, object detection through YOLO⁠, and some Sonic-specific ideas.
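The state-novelty reward can be sketched as follows. This is not the team's code: the post does not say which perceptual hash or bonus size they used, so the average-hash function, the `NoveltyBonus` class, and the bonus value of 1.0 are all assumptions made for illustration. Frames are plain 2D lists of grayscale values to keep the example dependency-free.

```python
def average_hash(frame, grid=8):
    """Perceptual hash: block-average the frame to grid x grid cells,
    then set one bit per cell that is brighter than the overall mean."""
    height, width = len(frame), len(frame[0])
    if height < grid or width < grid:
        raise ValueError("frame must be at least grid x grid pixels")
    cells = []
    for gy in range(grid):
        y0, y1 = gy * height // grid, (gy + 1) * height // grid
        for gx in range(grid):
            x0, x1 = gx * width // grid, (gx + 1) * width // grid
            block = [frame[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            cells.append(sum(block) / len(block))
    threshold = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        bits = (bits << 1) | int(value > threshold)
    return bits

class NoveltyBonus:
    """Adds a fixed bonus the first time a screen hash is seen (hypothetical helper)."""

    def __init__(self, bonus=1.0):
        self.bonus = bonus
        self.seen = set()

    def __call__(self, frame, env_reward):
        key = average_hash(frame)
        if key in self.seen:
            return env_reward
        self.seen.add(key)
        return env_reward + self.bonus

# Two toy 8x8 frames: one split left/right, one split top/bottom.
frame_a = [[0] * 4 + [255] * 4 for _ in range(8)]
frame_b = [[0] * 8 for _ in range(4)] + [[255] * 8 for _ in range(4)]

shaper = NoveltyBonus(bonus=1.0)
print(shaper(frame_a, 0.0))  # first visit to this screen
print(shaper(frame_a, 0.0))  # repeat visit, no bonus
print(shaper(frame_b, 0.0))  # a different screen
```

Hashing a coarse, thresholded version of the screen means that near-identical frames map to the same key, so the bonus is paid for visually new situations rather than for every pixel-level change.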

Mistake

Team mistake consists of Peng Xu and Qiaoling Zhong. Both are second-year graduate students in Beijing, China, studying at the CAS Key Laboratory of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences. In their spare time, Peng Xu enjoys playing basketball, and Qiaoling Zhong likes to play badminton. Their favorite video games are Contra and Mario.

Mistake’s solution is based on the Rainbow baseline. They made several modifications that helped boost performance: a better value of n for n-step Q learning; an extra CNN layer added to the model, which made training slower but better; and a lower DQN target update interval. Additionally, the team tried joint training with Rainbow, but found that it actually hurt performance in their case.
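For readers unfamiliar with the n-step parameter mentioned above, the sketch below shows a generic n-step return as used in n-step Q-learning targets. It is not taken from the team's code; the function name and the example values are illustrative, and the post does not state which value of n the team chose.

```python
def n_step_return(rewards, bootstrap_value, gamma, n):
    """Discounted sum of up to n rewards plus a discounted bootstrap value.

    rewards: rewards observed after the current state, in order.
    bootstrap_value: value estimate (e.g. max Q from the target network)
        for the state reached after the last reward that is used.
    """
    steps = rewards[:n]
    discounted = sum((gamma ** k) * r for k, r in enumerate(steps))
    return discounted + (gamma ** len(steps)) * bootstrap_value

# Toy example: three rewards of 1, a bootstrap value of 10, and gamma = 0.5.
print(n_step_return([1.0, 1.0, 1.0], bootstrap_value=10.0, gamma=0.5, n=3))
```

A larger n propagates real rewards further back per update but relies on more off-policy data, which is one reason the best value of n tends to be found empirically.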

Aborg

Team Aborg is a solo effort from Alexandre Borghi. After completing a PhD in computer science in 2011, Alexandre worked for different companies in France before moving to the United Kingdom where he is a research engineer in deep learning. As both a video game and machine learning enthusiast, he spends most of his free time studying deep reinforcement learning, which led him to take part in the OpenAI Retro Contest.

Aborg’s solution, like Dharmaraja’s, is a variant of joint PPO with many improvements: more training levels from the Game Boy Advance and Master System Sonic games; a different network architecture; and fine-tuning hyper-parameters that were designed specifically for fast learning. Elaborating on the last point, Alexandre noticed that the first 150K timesteps of fine-tuning were unstable (i.e. the performance sometimes got worse), so he tuned the learning rate to fix this problem. In addition to the above changes, Alexandre tried several solutions that did not work: different optimizers, MobileNetV2⁠, using color images, etc.

Best write-ups

The Best Write-up Prize is awarded to contestants who produced high-quality essays describing the approaches they tried.

Now, let’s meet the winners of this prize category.

Dylan Djian

Dylan currently lives in Paris, France. He is a student in software development at school 42 in Paris⁠. He got into machine learning after watching a video⁠ of a genetic algorithm learning how to play Mario a year and a half ago. This video sparked his interest and made him want to learn more about the field. His favorite video…