OpenAI Writing: Learning dexterity

openai.com/index/learning-dexterity

July 30, 2018

Milestone

Learning dexterity

We’ve trained a human-like robot hand to manipulate physical objects with unprecedented dexterity.

Illustration: Ben Barry & Eric Haines

Our system, called Dactyl, is trained entirely in simulation and transfers its knowledge to reality, adapting to real-world physics using techniques we’ve been working on for the past year. Dactyl learns from scratch using the same general-purpose reinforcement learning algorithm and code as OpenAI Five. Our results show that it’s possible to train agents in simulation and have them solve real-world tasks, without physically accurate modeling of the world.

The task

Dactyl is a system for manipulating objects using a Shadow Dexterous Hand. We place an object such as a block or a prism in the palm of the hand and ask Dactyl to reposition it into a different orientation; for example, rotating the block to put a new face on top. The network observes only the coordinates of the fingertips and the images from three regular RGB cameras.
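
One way to make "reposition it into a different orientation" concrete is to measure the angle between the object's current and target rotations. The sketch below is only an illustration: the unit-quaternion representation, the function names, and the tolerance value are assumptions made here, not details taken from this post.

```python
import math


def rotation_angle(q_current, q_goal):
    """Angle in radians between two orientations given as unit quaternions (w, x, y, z)."""
    # The absolute value treats q and -q as the same physical rotation.
    dot = abs(sum(a * b for a, b in zip(q_current, q_goal)))
    # Clamp to guard against rounding slightly above 1.0.
    return 2.0 * math.acos(min(1.0, dot))


def goal_reached(q_current, q_goal, tolerance_rad=0.4):
    """Hypothetical success test; the tolerance is chosen for illustration only."""
    return rotation_angle(q_current, q_goal) <= tolerance_rad


identity = (1.0, 0.0, 0.0, 0.0)
# A quarter turn about the z axis.
quarter_turn_z = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))
angle = rotation_angle(identity, quarter_turn_z)
```

Here `angle` comes out at a quarter turn, so `goal_reached(identity, quarter_turn_z)` is false under the illustrative tolerance, while an orientation compared with itself passes.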

Although the first humanoid hands were developed decades ago, using them to manipulate objects effectively has been a long-standing challenge in robotic control. Unlike other problems such as locomotion, progress on dexterous manipulation using traditional robotics approaches has been slow, and current techniques remain limited in their ability to manipulate objects in the real world.

Reorienting an object in the hand requires the following problems to be solved:

Working in the real world. Reinforcement learning has shown many successes in simulations and video games, but has had comparatively limited results in the real world. We test Dactyl on a physical robot.
High-dimensional control. The Shadow Dexterous Hand has 24 degrees of freedom, compared to 7 for a typical robot arm.
Noisy and partial observations. Dactyl works in the physical world and therefore must handle noisy and delayed sensor readings. When a fingertip sensor is occluded by other fingers or by the object, Dactyl has to work with partial information. Many aspects of the physical system like friction and slippage are not directly observable and must be inferred.
Manipulating more than one object. Dactyl is designed to be flexible enough to reorient multiple kinds of objects. This means that our approach cannot use strategies that are only applicable to a specific object geometry.

Our approach

Dactyl learns to solve the object reorientation task entirely in simulation without any human input. After this training phase, the learned policy works on the real robot without any fine-tuning.

Learning methods for robotic manipulation face a dilemma. Simulated robots can easily provide enough data to train complex policies, but most manipulation problems can’t be modeled accurately enough for those policies to transfer to real robots. Even modeling what happens when two objects touch, the most basic problem in manipulation, is an active area of research with no widely accepted solution. Training directly on physical robots allows the policy to learn from real-world physics, but today’s algorithms would require years of experience to solve a problem like object reorientation.

Our approach, domain randomization, trains the policy in a simulation that is designed to provide a variety of experiences rather than to maximize realism. This gives us the best of both approaches: by learning in simulation, we can gather more experience quickly by scaling up, and by de-emphasizing realism, we can tackle problems that simulators can only model approximately.

It’s been shown (by OpenAI and others) that domain randomization can work on increasingly complex problems; domain randomizations were even used to train OpenAI Five. Here, we wanted to see if scaling up domain randomization could solve a task well beyond the reach of current methods in robotics.

We built a simulated version of our robotics setup using the MuJoCo physics engine. This simulation is only a coarse approximation of the real robot:

Measuring physical attributes like friction, damping, and rolling resistance is cumbersome and difficult. They also change over time as the robot experiences wear and tear.
MuJoCo is a rigid body simulator, which means that it cannot simulate the deformable rubber found at the fingertips of the hand or the stretching of tendons.
Our robot can only manipulate the object by repeatedly making contact with it. However, contact forces are notoriously difficult to reproduce accurately in simulation.

The simulation can be made more realistic by calibrating its parameters to match robot behavior, but many of these effects simply cannot be modeled accurately in current simulators.

Instead, we train the policy on a distribution of simulated environments where the physical and visual attributes are chosen randomly. Randomized values are a natural way to represent the uncertainties that we have about the physical system, and they also prevent overfitting to a single simulated environment. If a policy can accomplish the task across all of the simulated environments, it is more likely to accomplish it in the real world.
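
As a rough illustration of training on a distribution of environments, the sketch below draws a fresh set of physical parameters for every episode. It is not Dactyl's actual code: the parameter names, nominal values, ranges, and the uniform sampling scheme are all hypothetical, and a real setup would also randomize visual attributes.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class PhysicsParams:
    """Hypothetical subset of simulator parameters; names and values are illustrative."""
    friction: float
    damping: float
    object_mass: float


# Nominal values and relative ranges are made up for this sketch.
NOMINAL = PhysicsParams(friction=1.0, damping=0.1, object_mass=0.08)
RELATIVE_RANGE = {"friction": 0.3, "damping": 0.5, "object_mass": 0.2}


def sample_params(rng: random.Random) -> PhysicsParams:
    """Draw each parameter uniformly within +/- its relative range of the nominal value."""
    def jitter(name: str) -> float:
        nominal = getattr(NOMINAL, name)
        spread = RELATIVE_RANGE[name]
        return nominal * rng.uniform(1.0 - spread, 1.0 + spread)

    return PhysicsParams(
        friction=jitter("friction"),
        damping=jitter("damping"),
        object_mass=jitter("object_mass"),
    )


def sample_environments(n: int, seed: int = 0) -> List[PhysicsParams]:
    """One independent parameter draw per training episode."""
    rng = random.Random(seed)
    return [sample_params(rng) for _ in range(n)]


episodes = sample_environments(100, seed=1)
```

Each entry in `episodes` would configure the simulator for one episode, so the policy never sees the same physics twice and cannot rely on any single setting being correct.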

Learning to control

By building simulations that support transfer, we have reduced the problem of controlling a robot in the real world to accomplishing a task in simulation, which is a problem well-suited for reinforcement learning. While the task of manipulating an object in a simulated hand is already somewhat difficult, learning to do so across all combinations of randomized physical parameters is substantially more difficult.

To generalize across environments, it is helpful for the policy to be able to take different actions in environments with different dynamics. Because most dynamics parameters cannot be inferred from a single observation, we used an LSTM (a type of neural network with memory) so that the network can learn about the dynamics of the environment. The LSTM achieved about twice as many rotations in simulation as a policy without memory.
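
A toy example may help show why a single observation is not enough. The sketch below is not an LSTM and is not taken from Dactyl: it uses made-up one-dimensional dynamics in which a hidden damping factor (a hypothetical stand-in for an unobserved physical parameter) can only be recovered by comparing consecutive observations, which is the kind of information a policy with memory can exploit.

```python
def step(velocity, damping):
    """Toy dynamics: velocity decays by a hidden damping factor each step."""
    return velocity * (1.0 - damping)


def rollout(v0, damping, steps):
    """Return the observed velocities, starting with the initial one."""
    velocities = [v0]
    for _ in range(steps):
        velocities.append(step(velocities[-1], damping))
    return velocities


def estimate_damping(history):
    """Recover the damping factor from the last two observations.

    Returns None when fewer than two observations are available, because
    one velocity reading alone is consistent with any damping value.
    """
    if len(history) < 2 or history[-2] == 0.0:
        return None
    return 1.0 - history[-1] / history[-2]


slippery = rollout(2.0, 0.25, 3)
sticky = rollout(2.0, 0.6, 3)
```

Both rollouts start from the same first observation, so a memoryless policy cannot tell them apart at that point, while `estimate_damping` separates them as soon as a second observation arrives.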

Dactyl learns using Rapid, the massively scaled implementation of Proximal Policy Optimization developed to allow OpenAI Five to solve Dota 2. We use a different model architecture, environment, and hyperparameters than OpenAI Five does, but we use the same algorithms and training code. Rapid used 6144 CPU cores and 8 GPUs to train our policy,…
