Deep Learning for Robots: Learning from Large-Scale Interaction

While we’ve recently seen great strides in robotic capability, the gap between human and robot motor skills remains vast. Machines still have a very long way to go to match human proficiency even at basic sensorimotor skills like grasping. However, by linking learning with continuous feedback and control, we might begin to bridge that gap, and in so doing make it possible for robots to intelligently and reliably handle the complexities of the real world.

Consider, for example, this robot from KAIST, which won last year’s DARPA Robotics Challenge. The remarkably precise and deliberate motions are deeply impressive. But they are also quite… robotic. Why is that? What makes robot behavior so distinctly robotic compared to human behavior? At a high level, current robots typically follow a sense-plan-act paradigm, where the robot observes the world around it, formulates an internal model, constructs a plan of action, and then executes this plan. This approach is modular and often effective, but tends to break down in the kinds of cluttered natural environments that are typical of the real world. Here, perception is imprecise, all models are wrong in some way, and no plan survives first contact with reality.

In contrast, humans and animals move quickly, reflexively, and often with remarkably little advance planning, by relying on highly developed and intelligent feedback mechanisms that use sensory cues to correct mistakes and compensate for perturbations. For example, when serving a tennis ball, the player continually observes the ball and the racket, adjusting the motion of their hand so that the two meet in the air. This kind of feedback is fast and efficient. Can we train robots to reliably handle complex real-world situations by using similar feedback mechanisms?

While servoing and feedback control have been studied extensively in robotics, the question of how to define the right sensory cues remains exceptionally challenging, especially for rich modalities such as vision. So instead of choosing the cues by hand, we can program a robot to acquire them on its own from scratch, by learning from extensive experience in the real world. In our first experiments with real physical robots, we decided to tackle robotic grasping in clutter.

A human child is able to reliably grasp objects after one year, and takes around four years to acquire more sophisticated precision grasps. However, networked robots can instantaneously share their experience with one another, so if we dedicate 14 separate robots to the job of learning grasping in parallel, we can acquire the necessary experience much faster. Below is a video of our robots practicing grasping a range of common office and household objects:

While initially the grasps are executed at random and succeed only rarely, each day the latest experiences are used to train a deep convolutional neural network (CNN) to learn to predict the outcome of a grasp, given a camera image and a potential motor command. This CNN is then deployed on the robots the following day, in the inner loop of a servoing mechanism that continually adjusts the robot’s motion to maximize the predicted chance of a successful grasp. In essence, the robot is constantly predicting, by observing the motion of its own hand, which kind of subsequent motion will maximize its chances of success. The result is continuous feedback: what we might call hand-eye coordination.

Observing the behavior of the robot after over 800,000 grasp attempts, which is equivalent to about 3,000 robot-hours of practice, we can see the beginnings of intelligent reactive behaviors. The robot observes its own gripper and corrects its motions in real time. It also exhibits interesting pre-grasp behaviors, like isolating a single object from a group. All of these behaviors emerged naturally from learning, rather than being programmed into the system.
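
To make the servoing idea concrete, here is a minimal sketch of a loop that repeatedly asks a success predictor which motion looks most promising and then takes a small step. Everything in it is an assumption made for illustration: `predict_success` is a toy stand-in for the trained CNN (it sees a 3-D gripper-to-object offset rather than a camera image), `choose_command` and `servo` are hypothetical names, and the post does not say how the maximization is carried out, so the sketch uses the cross-entropy method as one plausible choice.

```python
import numpy as np


def predict_success(observation, commands):
    """Toy stand-in for the trained CNN (hypothetical, not the real model).

    The real network scores a camera image plus a candidate motor command.
    Here the "observation" is just the offset from gripper to object, and a
    command scores higher the closer it would bring the gripper.
    """
    remaining = np.linalg.norm(observation - commands, axis=1)
    return np.exp(-remaining ** 2)


def choose_command(observation, rng, n_samples=128, n_elite=10, n_iters=4):
    """Pick the command with the highest predicted success.

    Uses the cross-entropy method (an assumed optimizer): sample candidates,
    keep the best-scoring ones, refit the sampling distribution, repeat.
    """
    mean, std = np.zeros(3), np.full(3, 0.5)
    for _ in range(n_iters):
        candidates = rng.normal(mean, std, size=(n_samples, 3))
        scores = predict_success(observation, candidates)
        elite = candidates[np.argsort(scores)[-n_elite:]]
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean


def servo(object_pos, rng, n_steps=12, max_step=0.1, bump_at=5):
    """Closed-loop servoing: re-observe and re-plan before every small step."""
    gripper = np.zeros(3)
    object_pos = np.asarray(object_pos, dtype=float)
    for step in range(n_steps):
        if step == bump_at:
            # Simulated perturbation: the object gets knocked sideways.
            object_pos = object_pos + np.array([0.1, -0.1, 0.0])
        observation = object_pos - gripper  # stand-in for a fresh camera image
        command = choose_command(observation, rng)
        length = np.linalg.norm(command)
        if length > max_step:
            command *= max_step / length  # only ever take a small step
        gripper = gripper + command
    return gripper, object_pos


rng = np.random.default_rng(0)
gripper, target = servo([0.3, 0.2, 0.1], rng)
print("final distance to object:", np.linalg.norm(gripper - target))
```

Because the loop takes a new observation before every step, the simulated bump halfway through is absorbed without any special handling, which is the property the feedback approach is after.
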

To evaluate whether the system gains a measurable benefit from continuous feedback, we can compare its performance to an open-loop baseline that more closely resembles the sense-plan-act loop described previously. Both approaches rely on a learned CNN trained on the same data, which determines the open-loop grasps in one case and drives the closed-loop servoing in the other. With open-loop grasp selection, the robot chooses a single grasp pose from a single image, and then blindly executes this grasp. This method has a 34% average failure rate on the first 30 picking attempts for this set of office objects:

Incorporating continuous feedback into the system cuts the failure rate nearly in half, from 34% to 18%, and produces interesting corrections and adjustments:
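
The figures quoted in this post can be checked with a little arithmetic. The short sketch below uses only numbers stated above (34%, 18%, 800,000 attempts, about 3,000 robot-hours); the function and variable names are made up for illustration, and since the robot-hours figure is approximate, the per-attempt time is only a rough estimate.

```python
def relative_reduction(before, after):
    """Fraction by which a rate drops when going from `before` to `after`."""
    return (before - after) / before


open_loop_failure = 0.34    # open-loop baseline, from the post
closed_loop_failure = 0.18  # with continuous feedback, from the post

reduction = relative_reduction(open_loop_failure, closed_loop_failure)
print(f"relative reduction in failures: {reduction:.0%}")  # 47%

grasp_attempts = 800_000
robot_hours = 3_000  # "about 3,000" in the post, so treat as approximate
seconds_per_attempt = robot_hours * 3600 / grasp_attempts
print(f"approx. seconds per attempt: {seconds_per_attempt}")  # 13.5
```
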

Neural networks have made great strides in allowing us to build computer programs that can process images, speech, and text, and even draw pictures. However, introducing actions and control adds considerable new challenges, since every decision the network makes will affect what it sees next. Overcoming these challenges will bring us closer to building systems that understand the effects of their actions in the world. If we can bring the power of large-scale machine learning to robotic control, perhaps we will come one step closer to solving fundamental problems in robotics and automation.

The research on robotic hand-eye coordination and grasping was conducted by Sergey Levine, Peter Pastor, Alex Krizhevsky, and Deirdre Quillen, with special thanks to colleagues at Google Research and X who've contributed their expertise and time to this research. An early preprint is available on arXiv.