What does this writing signal mean?

OpenAI Writing: Reinforcement learning with prediction-based rewards

Captured source

source ↗

openai.com/openai.com/index/reinforcement-learning-with-prediction-based-rewards

Reinforcement learning with prediction-based rewards

Source ↗

published Oct 31, 2018seen 6dcaptured 2dhttp 200method exa

Reinforcement learning with prediction-based rewards | OpenAI

October 31, 2018

Reinforcement learning with prediction-based rewards

Read paper View code

Loading…

We’ve developed Random Network Distillation (RND)⁠, a prediction-based method for encouraging reinforcement learning agents to explore their environments through curiosity, which for the first timeA exceeds average human performance on Montezuma’s Revenge⁠. RND achieves state-of-the-art performance, periodically finds all 24 rooms and solves the first level without using demonstrations or having access to the underlying state of the game.

RND incentivizes visiting unfamiliar states⁠ by measuring how hard it is to predict the output of a fixed random neural network on visited states. In unfamiliar states it’s hard to guess the output, and hence the reward is high. It can be applied to any reinforcement learning algorithm, is simple to implement and efficient to scale. Below we release a reference implementation of RND that can reproduce the results from our paper.

Progress in Montezuma’s Revenge

For an agent to achieve a desired goal it must first explore what is possible in its environment and what constitutes progress towards the goal. Many games’ reward signals provide a curriculum such that even simple exploration strategies are sufficient for achieving the game’s goal. In the seminal work introducing DQN⁠, Montezuma’s Revenge was the only game where DQN got 0% of the average human score (4.7K). Simple exploration strategies are highly unlikely to gather any rewards, or see more than a few of the 24 rooms in the level. Since then advances in Montezuma’s Revenge have been seen by many as synonymous with advances in exploration.

Significant progress was made in 2016⁠ by combining DQN with a count-based exploration bonus, resulting in an agent that explored 15 rooms, achieved a high score of 6.6K and an average reward of around 3.7K. Since then, significant⁠ improvement⁠ in the score achieved by an RL agent has come only from exploiting access to demonstrations⁠ from human experts⁠, or access to the underlying state of the emulator⁠.

We ran a large scale RND experiment with 1024 rollout workers resulting in a mean return of 10K over 9 runs and a best mean return of 14.5K. Each run discovered between 20 and 22 rooms. In addition one of our smaller scale but longer running experiments yielded one run (out of 10) that achieved a best return of 17.5K corresponding to passing the first level and finding all 24 rooms. The graph below compares these two experiments showing the mean return as a function of parameter updates.

The visualization below shows the progress of the smaller scale experiment in discovering the rooms. Curiosity drives the agent to discover new rooms and find ways of increasing the in-game score, and this extrinsic reward drives it to revisit those rooms later in the training.

Large-scale study of curiosity-driven learning

Prior to developing RND, we, together with collaborators from UC Berkeley, investigated learning without any environment-specific rewards. Curiosity gives us an easier way to teach agents to interact with any environment, rather than via an extensively engineered task-specific reward function that we hope corresponds to solving a task. Projects like ALE⁠, Universe⁠, Malmo⁠, Gym⁠, Gym Retro⁠, Unity⁠, DeepMind Lab⁠, CommAI⁠ make a large number of simulated environments available for an agent to interact with through a standardized interface. An agent using a generic reward function not specific to the particulars of an environment can acquire a basic level of competency in a wide range of environments, resulting in the agent’s ability to determine what useful behaviors are even in the absence of carefully engineered rewards.

Read paper
View code

In standard reinforcement learning set-ups, at every discrete time-step the agent sends an action to the environment, and the environment responds by emitting the next observation, transition reward and an indicator of episode end. In our previous paper⁠ we require the environment to output⁠ only⁠ the next observation. There, the agent learns a next-state predictor model from its experience, and uses the error of the prediction as an intrinsic reward. As a result it is attracted to the unpredictable. For example, it will find a change in a game score to be rewarding only if the score is displayed on the screen and the change is hard to predict. The agent will typically find interactions with new objects rewarding, as the outcomes of such interactions are usually harder to predict than other aspects of the environment.

Similar to prior⁠ work⁠, we tried to avoid modeling all aspects of the environment, whether they are relevant or not, by choosing to model features of the observation. Surprisingly, we found that even random features worked well.

What do curious agents do?

We tested our agent across 50+ different environments and observed a range of competence levels from seemingly random actions to deliberately interacting with the environment. To our surprise, in some environments the agent achieved the game’s objective even though the game’s objective was not communicated to it through an extrinsic reward.

Breakout The agent experiences spikes of intrinsic reward when it sees a new configuration of bricks early on in training and when it passes the level for the first time after training for several hours.

Pong We trained the agent to control both paddles at the same time and it learned to keep the ball in play resulting in long rallies. Even when trained against the in-game AI, the agent tried to prolong the game rather than win.

Bowling The agent learned to play the game better than agents trained to maximize the (clipped) extrinsic reward directly. We think this is because the agent gets attracted to the difficult-to-predict flashing of the scoreboard occurring after the strikes.

Mario The intrinsic reward is particularly well-aligned with the game’s objective of advancing through the levels. The agent is rewarded for finding new areas because the details of a newly found area are…

Excerpt shown — open the source for the full document.