Tag Archives: Robotics

Off-Policy Classification – A New Reinforcement Learning Model Selection Method

Posted by Alex Irpan, Software Engineer, Robotics at Google

Reinforcement learning (RL) is a framework that lets agents learn decision making from experience. One of the many variants of RL is off-policy RL, where an agent is trained using a combination of data collected by other agents (off-policy data) and data it collects itself to learn generalizable skills like robotic walking and grasping. In contrast, fully off-policy RL is a variant in which an agent learns entirely from older data, which is appealing because it enables model iteration without requiring a physical robot. With fully off-policy RL, one can train several models on the same fixed dataset collected by previous agents, then select the best one. However, fully off-policy RL comes with a catch: while training can occur without a real robot, evaluation of the models cannot. Furthermore, ground-truth evaluation with a physical robot is too inefficient to test promising approaches that require evaluating a large number of models, such as automated architecture search with AutoML.

This challenge motivates off-policy evaluation (OPE), techniques for studying the quality of new agents using data from other agents. With rankings from OPE, we can selectively test only the most promising models on real-world robots, significantly scaling experimentation with the same fixed real robot budget.
A diagram for real-world model development. Assuming we can evaluate 10 models per day, without off-policy evaluation, we would need 100x as many days to evaluate our models.
Though the OPE framework shows promise, it assumes one has an off-policy evaluation method that accurately ranks performance from old data. However, agents that collected past experience may act very differently from newer learned agents, which makes it hard to get good estimates of performance.

In “Off-Policy Evaluation via Off-Policy Classification”, we propose a new off-policy evaluation method, called off-policy classification (OPC), that evaluates the performance of agents from past data by treating evaluation as a classification problem, in which actions are labeled as either potentially leading to success or guaranteed to result in failure. Our method works for image (camera) inputs, and doesn’t require reweighting data with importance sampling or using accurate models of the target environment, two approaches commonly used in prior work. We show that OPC scales to larger tasks, including a vision-based robotic grasping task in the real world.

How OPC Works
OPC relies on two assumptions: 1) that the final task has deterministic dynamics, i.e. no randomness is involved in how states change, and 2) that the agent either succeeds or fails at the end of each trial. This second “success or failure” assumption is natural for many tasks, such as picking up an object, solving a maze, winning a game, and so on. Because each trial will either succeed or fail in a deterministic way, we can assign binary classification labels to each action. We say an action is effective if it could lead to success, and is catastrophic if it is guaranteed to lead to failure.

OPC utilizes a Q-function, learned with a Q-learning algorithm, that estimates the future total reward if the agent chooses to take some action from its current state. The agent will then choose the action with the largest total reward estimate. In our paper, we prove that the performance of an agent is measured by how often its chosen action is an effective action, which depends on how well the Q-function correctly classifies actions as effective vs. catastrophic. This classification accuracy acts as an off-policy evaluation score.

However, the labeling of data from previous trials is only partial. For example, if a previous trial was a failure, we do not get negative labels because we do not know which action was the catastrophic one. To overcome this, we leverage techniques from semi-supervised learning, positive-unlabeled learning in particular, to get an estimate of classification accuracy from partially labeled data. This accuracy is the OPC score.
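As a concrete illustration, the sketch below (Python/NumPy) computes a simplified OPC-style score from logged data: Q-values of actions taken in successful episodes serve as positive examples, Q-values of all logged actions serve as unlabeled examples, and a positive-unlabeled correction with an assumed class prior turns the two into a balanced classification accuracy. The estimator in the paper (and its SoftOPC variant) differs in details; the function name, threshold, and positive_prior below are illustrative assumptions, not the paper's exact formula.

import numpy as np

def opc_style_score(q_values_all, q_values_positive, threshold=0.5, positive_prior=0.5):
    """Simplified, illustrative OPC-style score from partially labeled data.

    q_values_all: Q(s, a) for every logged (state, action) pair (unlabeled data).
    q_values_positive: Q(s, a) for actions taken in successful episodes (positive data).
    threshold: Q-value above which an action is classified as 'effective'.
    positive_prior: assumed fraction of effective actions in the unlabeled data.
    """
    # True-positive rate on the known-effective actions.
    tpr = np.mean(q_values_positive > threshold)
    # Fraction of all logged actions classified as effective.
    pred_pos = np.mean(q_values_all > threshold)
    # PU-style correction: back out the false-positive rate on the (unseen) negatives.
    fpr = (pred_pos - positive_prior * tpr) / max(1.0 - positive_prior, 1e-8)
    tnr = np.clip(1.0 - fpr, 0.0, 1.0)
    # Balanced classification accuracy acts as the off-policy evaluation score.
    return 0.5 * (tpr + tnr)

# Example: score a hypothetical Q-function on synthetic logged data.
rng = np.random.default_rng(0)
logged_q = rng.uniform(0.0, 1.0, size=10_000)
success_q = rng.uniform(0.3, 1.0, size=2_000)
print(opc_style_score(logged_q, success_q))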

Off-Policy Evaluation for Sim-to-Real Learning
In robotics, it’s common to use simulated data and transfer learning techniques to reduce the sample complexity of learning robotic skills. This can be very useful, but tuning these sim-to-real techniques for real-world robotics is challenging. Much like fully off-policy RL, training doesn’t use the real robot, since the policy is learned in simulation, but evaluating that policy still requires a real robot. Here, off-policy evaluation can come to the rescue again—we can take a policy trained only in simulation, then evaluate it using previous real-world data to measure its transfer to the real robot. We examine OPC across both fully off-policy RL and sim-to-real RL.
An example of how simulated experience can differ from real-world experience. Here, simulated images (left) have much less visual complexity than real-world images (right).
Results
First, we set up a simulated version of our robot grasping task, where we could easily train and evaluate several models to benchmark off-policy evaluation. These models were trained with fully off-policy RL, then evaluated with off-policy evaluation. We found that in our robotics tasks, a variant of the OPC called the SoftOPC performed best at predicting final success rate.
An experiment in the simulated grasping task. The red curve is the dimensionless SoftOPC score over the course of training, evaluated from old data. The blue curve is the grasp success rate in simulation. We see the SoftOPC on old data correlates well with grasp success of the model within our simulator.
After success in sim, we then tried SoftOPC in the real-world task. We took 15 models, trained to have varying degrees of robustness to the gap between simulation and reality. Of these models, 7 of them were trained purely in simulation, and the rest were trained on mixes of simulated and real-world data. For each model, we evaluated the SoftOPC on off-policy real-world data, then the real-world grasp success, to see how well SoftOPC predicted performance of that model. We found that on real data, the SoftOPC does produce scores that correlate with true grasp success, letting us rank sim-to-real techniques using past real experience.
SoftOPC score and true performance for 3 different sim-to-real methods: a baseline simulation, a simulation with random textures and lighting, and a model trained with RCAN. All three models are trained with no real data, then evaluated with off-policy evaluation on a validation set of real data. The ordering of the SoftOPC score matches the order of real grasp success.
Below is a scatterplot of the full results from all 15 models. Each point represents the off-policy evaluation score and real-world grasp success of each model. We compare different scoring functions by their correlation to final grasp success. The SoftOPC does not correlate perfectly with true grasp success, but its scores are significantly more reliable than baseline approaches like the temporal-difference error (the standard Q-learning loss).
Results from our sim-to-real evaluation experiment. On the left is a baseline, the temporal difference error of the model. On the right is one of our proposed methods, the SoftOPC. The shaded region is a 95% confidence interval. The correlation is significantly better with SoftOPC.
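For reference, the correlation behind plots like the one above can be computed in a few lines with SciPy; the arrays below are placeholders standing in for per-model SoftOPC scores and measured grasp success rates, not our actual results.

from scipy import stats

softopc_scores = [0.42, 0.55, 0.61, 0.48, 0.70]   # placeholder per-model OPE scores
grasp_success = [0.35, 0.50, 0.66, 0.41, 0.72]    # placeholder real-world success rates

print("Pearson r:", stats.pearsonr(softopc_scores, grasp_success)[0])
print("Spearman rho:", stats.spearmanr(softopc_scores, grasp_success)[0])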
Future Work
One promising direction for future work is to see if we can relax our assumptions about the task, to support tasks where dynamics are noisier, or where we get partial credit for almost succeeding. However, even with these assumptions, we think the results are promising enough to be applied to many real-world RL problems.

Acknowledgements
This research was conducted by Alex Irpan, Kanishka Rao, Konstantinos Bousmalis, Chris Harris, Julian Ibarz and Sergey Levine. We’d like to thank Razvan Pascanu, Dale Schuurmans, George Tucker and Paul Wohlhart for valuable discussions. A preprint is available on arXiv.

Source: Google AI Blog


Unifying Physics and Deep Learning with TossingBot



Though considerable progress has been made in enabling robots to grasp objects efficiently, visually self-adapt or even learn from real-world experiences, robotic operations still require careful consideration of how they pick up, handle, and place various objects -- especially in unstructured settings. Consider, for example, this picking robot which took 1st place in the stowing task of the Amazon Robotics Challenge:
It's an impressive system, built with many design features that kinematically prevent it from dropping objects due to unforeseen dynamics: from its steady and deliberate movements, to its gripper fingers that mechanically constrain the momentum of the object so that it doesn't slip.

This robot, like many others, is designed to tolerate the dynamics of the unstructured world. But instead of just tolerating dynamics, can robots learn to use them advantageously, developing an "intuition" of physics that would allow them to complete tasks more efficiently? Perhaps in doing so, robots can improve their capabilities and acquire complex athletic skills like tossing, sliding, spinning, swinging, or catching, potentially leading to many useful applications, such as more efficient debris clearing robots in disaster response scenarios -- where time is of the essence.

To explore this concept, we worked with researchers at Princeton, Columbia, and MIT to develop TossingBot: a picking robot for our real, random world that learns to grasp and throw objects into selected boxes outside its natural range. We find that by learning to throw, TossingBot is capable of achieving picking speeds that are twice as fast as previous systems, with twice the effective placing range. TossingBot jointly learns grasping and throwing policies using an end-to-end neural network that maps from visual observations (RGB-D images) to control parameters for motion primitives. Using overhead cameras to track where objects land, TossingBot improves itself over time through self-supervision. More technical details are available in an early preprint on arXiv.
The Challenges
Throwing is a particularly difficult task as it depends on many factors: from how the object is picked up (i.e., "pre-throw conditions"), to the object's physical properties like mass, friction, aerodynamics, etc. For example, if you grasp a screwdriver by the handle near the center of mass and throw it, it would land much closer than if you had grasped it from the metal tip, which would swing forward and land much farther away. Regardless of how you grasped it though, tossing a screwdriver is incredibly different from tossing a ping pong ball, which would land closer due to air resistance. Manually designing a solution that explicitly handles these factors for every random object is nearly impossible.
Throwing depends on many factors: from how you picked it up, to object properties and dynamics.
Through deep learning, however, our robots can learn from experience rather than rely on manual case-by-case engineering. Previously we've shown that our robots can learn to push and grasp a large variety of objects, but accurately throwing objects requires a larger understanding of projectile physics. Acquiring this knowledge from scratch with only trial-and-error is not only time consuming and expensive, but also generally doesn't work outside of very specific, and carefully set up training scenarios.

Unifying Physics and Deep Learning
A fundamental component of TossingBot is that it learns to throw by integrating simple physics and deep learning, which enables it to train quickly and generalize to new scenarios. Physics provides prior models of how the world works, and we can leverage these models to develop initial controllers for our robots. In the case of throwing, for example, we can use projectile ballistics to provide an estimate for the throwing velocity that is needed to get an object to land at a target location. We can then use neural networks to predict adjustments on top of that estimate from physics, in order to compensate for unknown dynamics as well as the noise and variability of the real world. We call this hybrid formulation Residual Physics, and it enables TossingBot to achieve throwing accuracies of 85%.
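To make the formulation concrete, here is a minimal sketch of Residual Physics: a ballistic estimate of the release speed needed to reach a target distance, plus a learned residual correction predicted from visual features. The same-height, no-drag ballistics, the feature inputs, and the stand-in network below are illustrative simplifications, not TossingBot's actual controller.

import numpy as np

G = 9.81  # gravity, m/s^2

def ballistic_release_speed(target_distance, release_angle_rad=np.pi / 4):
    """Release speed from projectile ballistics, assuming release and landing
    at the same height and no air resistance (an illustrative simplification)."""
    return np.sqrt(target_distance * G / np.sin(2 * release_angle_rad))

def predict_release_speed(target_distance, visual_features, residual_net):
    """Residual Physics: physics-based estimate plus a learned correction."""
    v_physics = ballistic_release_speed(target_distance)
    # The network sees the visual features together with the physics estimate and
    # predicts a small correction that accounts for unmodeled, object-specific dynamics.
    delta_v = residual_net(np.concatenate([visual_features, [v_physics]]))
    return v_physics + delta_v

# Example with a stand-in "network" that always predicts a zero correction.
print(predict_release_speed(1.5, np.zeros(8), residual_net=lambda x: 0.0))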
At the start of training with randomly initialized weights, TossingBot repeatedly attempts bad grasps. Over time, however, TossingBot learns better ways to grasp objects and simultaneously improves its ability to throw. Occasionally the robot randomly explores what happens if it throws an object at a velocity that it hasn't tried before. When the bin is emptied, TossingBot lifts the boxes to allow objects to slide back into the bin. This way, human intervention is kept at a minimum during training. By 10,000 grasp and throw attempts (or 14 hours of training time), it is capable of achieving throwing accuracies of 85%, with a grasping reliability of 87% in clutter.
TossingBot starts out performing poorly (left), but progressively learns to grasp and toss overnight (right).
Generalizing to New Scenarios
By integrating physics and deep learning, TossingBot is capable of rapidly adapting to never-before-seen throwing locations and objects. For example, after training on objects with simple shapes like wooden blocks, balls, and markers, it can perform reasonably well on new objects such as fake fruit, decorative items, and office objects. On new objects, TossingBot starts out with lower performance, but quickly adapts within a few hundred training steps (i.e., an hour or two) to achieve similar performance as with training objects. We've found that combining physics and deep learning with Residual Physics yields better performance than baseline alternatives (e.g. deep learning without physics). We even tried this task ourselves, and we were pleasantly surprised to learn that TossingBot is more accurate than any of us engineers! Though take that with a grain of salt, as we've yet to test TossingBot against anyone with any actual athletic talent.
TossingBot can generalize to new objects, and is more accurate at throwing than the average Googler.
We also test our policies on their ability to generalize to new target locations previously unseen in training. To this end, we train on a set of boxes, then later test on a different set of boxes with entirely different landing areas. In this setting, we find that Residual Physics for throwing helps significantly, since the initial estimates of throwing velocities from projectile ballistics easily generalize to new target locations, while the residuals help make adjustments on top of those estimates to compensate for varying object properties in the real world. This is in contrast to the baseline alternative of using deep learning without physics, which can only handle target locations seen during training.
TossingBot uses Residual Physics to throw objects to unforeseen locations.
Emerging Semantics from Interaction
To explore what TossingBot learns, we place several objects in the bin, capture images, and feed them into TossingBot's trained neural network to extract intermediate pixel-wise deep features. By clustering these features based on similarity and visualizing nearest neighbors as a heatmap (hotter regions indicate more similarity in feature space), we can localize all ping pong balls in the scene. Even though the orange block shares a similar color with the ping pong balls, its features are different enough for TossingBot to make a distinction. Likewise, we can also use the extracted features to localize all marker pens, which share similar shape and mass, but do not share color. These observations suggest that TossingBot likely learns to rely more on geometric cues (e.g. shape) to learn grasping and throwing. It is also possible that the learned features reflect second-order attributes such as physical properties, which can influence how the objects should be thrown.
TossingBot learns deep features that distinguish object categories without explicit supervision.
These emerging features were learned implicitly from scratch without any explicit supervision beyond task-level grasping and throwing. Yet, they seem to be sufficient for enabling the system to distinguish between object categories (i.e., ping pong balls and marker pens). As such, this experiment speaks to a broader question in machine vision: how should robots learn the semantics of the visual world? From the perspective of classic computer vision, semantics are often pre-defined using human-fabricated image datasets and manually constructed class categories. However, our experiment suggests that it is possible to implicitly learn such object-level semantics from physical interactions alone, as long as they matter for the task at hand. The more complex these interactions, the higher the resolution of the semantics. As we move toward more generally intelligent robots, perhaps it is sufficient for them to develop their own notion of semantics through interaction, without requiring any human intervention.

Limitations and Future Work
Although TossingBot's results are promising, it does have its limitations. For example, it assumes that objects are robust enough to withstand landing collisions after being thrown -- further work is required to learn throws that account for fragile objects, or possibly train other robots to catch objects in ways that cushion the landing. Furthermore, TossingBot infers control parameters only from visual data -- exploring additional senses (e.g. force-torque or tactile) may enable the system to better react to new objects.

The combination of physics and deep learning that made TossingBot possible naturally leads to an interesting question: what else could benefit from Residual Physics? Investigating how the idea generalizes to other types of tasks and interactions is a promising direction for future research.

You can learn more about this work in the summary video below.
Acknowledgements
This research was done by Andy Zeng, Shuran Song (faculty at Columbia University), Johnny Lee, Alberto Rodriguez (faculty at MIT), and Thomas Funkhouser (faculty at Princeton University), with special thanks to Ryan Hickman for valuable managerial support, Ivan Krasin and Stefan Welker for fruitful technical discussions, Brandon Hurd and Julian Salazar and Sean Snyder for hardware support, Chad Richards and Jason Freidenfelds for helpful feedback on writing, Erwin Coumans for advice on PyBullet, Laura Graesser for video narration, and Regina Hickman for photography. An early preprint is available on arXiv.

Source: Google AI Blog


Long-Range Robotic Navigation via Automated Reinforcement Learning



In the United States alone, there are 3 million people with a mobility impairment that prevents them from ever leaving their homes. Service robots that can autonomously navigate long distances can improve the independence of people with limited mobility, for example, by bringing them groceries, medicine, and packages. Research has demonstrated that deep reinforcement learning (RL) is good at mapping raw sensory input to actions, e.g., learning to grasp objects or to control robot locomotion, but RL agents usually lack the understanding of large physical spaces needed to safely navigate long distances without human help and to easily adapt to new spaces.

In three recent papers, “Learning Navigation Behaviors End-to-End with AutoRL,” “PRM-RL: Long-Range Robotic Navigation Tasks by Combining Reinforcement Learning and Sampling-based Planning”, and “Long-Range Indoor Navigation with PRM-RL”, we investigate easy-to-adapt robotic autonomy by combining deep RL with long-range planning. We train local planner agents to perform basic navigation behaviors, traversing short distances safely without collisions with moving obstacles. The local planners take noisy sensor observations, such as a 1D lidar that provides distances to obstacles, and output linear and angular velocities for robot control. We train the local planner in simulation with AutoRL, a method that automates the search for RL reward and neural network architecture. Despite their limited range of 10 - 15 meters, the local planners transfer well to both real robots and to new, previously unseen environments. This enables us to use them as building blocks for navigation in large spaces. We then build a roadmap, a graph where nodes are locations and edges connect the nodes only if local planners, which mimic real robots well with their noisy sensors and control, can traverse between them reliably.

Automating Reinforcement Learning (AutoRL)
In our first paper, we train the local planners in small, static environments. However, training with standard deep RL algorithms, such as Deep Deterministic Policy Gradient (DDPG), poses several challenges. For example, the true objective of the local planners is to reach the goal, which represents a sparse reward. In practice, this requires researchers to spend significant time iterating and hand-tuning the rewards. Researchers must also make decisions about the neural network architecture, without clear accepted best practices. And finally, algorithms like DDPG are unstable learners and often exhibit catastrophic forgetting.

To overcome those challenges, we automate the deep RL training. AutoRL is an evolutionary automation layer around deep RL that searches for a reward and neural network architecture using large-scale hyperparameter optimization. It works in two phases, reward search and neural network architecture search. During the reward search, AutoRL trains a population of DDPG agents concurrently over several generations, each with a slightly different reward function optimizing for the local planner’s true objective: reaching the destination. At the end of the reward search phase, we select the reward that leads agents to their destination most often. In the neural network architecture search phase, we repeat the process, this time using the selected reward and tuning the network layers, optimizing for the cumulative reward.
Automating reinforcement learning with reward and neural network architecture search.
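The two-phase search can be sketched as a generic evolutionary loop, shown below. The elite fraction, the mutation rules, and the train_and_eval_* stand-ins are illustrative placeholders rather than AutoRL's actual implementation, but the structure (reward search scored by goal-reaching, then architecture search scored by cumulative reward) follows the description above.

import random

def evolve(make_candidate, mutate, score, generations=10, population=100, elite_frac=0.1):
    """Generic evolutionary search: score a population, keep the elites, mutate, repeat."""
    candidates = [make_candidate() for _ in range(population)]
    best, best_score = None, float("-inf")
    for _ in range(generations):
        scored = sorted(((score(c), c) for c in candidates),
                        key=lambda sc: sc[0], reverse=True)
        if scored[0][0] > best_score:
            best_score, best = scored[0]
        elites = [c for _, c in scored[:max(1, int(elite_frac * population))]]
        candidates = [mutate(random.choice(elites)) for _ in range(population)]
    return best

# Stand-ins for "train a DDPG agent and evaluate it"; in AutoRL each call is a
# full training run, which is why the search needs billions of samples.
def train_and_eval_success(reward_weights):
    return -sum((w - 0.5) ** 2 for w in reward_weights)

def train_and_eval_return(reward_weights, layers):
    return sum(layers) * 1e-3

# Phase 1: search over reward-shaping weights, scored by how often the goal is reached.
best_reward = evolve(
    make_candidate=lambda: [random.uniform(0, 1) for _ in range(5)],
    mutate=lambda w: [wi + random.gauss(0, 0.1) for wi in w],  # perturb the weights
    score=train_and_eval_success)

# Phase 2: keep the selected reward and search over layer sizes, scored by cumulative reward.
best_arch = evolve(
    make_candidate=lambda: [random.choice([32, 64, 128]) for _ in range(3)],
    mutate=lambda layers: [random.choice([32, 64, 128]) for _ in layers],  # resample sizes
    score=lambda layers: train_and_eval_return(best_reward, layers))

print(best_reward, best_arch)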
However, this iterative process means AutoRL is not sample efficient. Training one agent takes 5 million samples; AutoRL training over 10 generations of 100 agents requires 5 billion samples - equivalent to 32 years of training! The benefit is that after AutoRL the manual training process is automated, and DDPG does not experience catastrophic forgetting. Most importantly, the resulting policies are higher quality — AutoRL policies are robust to sensor, actuator and localization noise, and generalize well to new environments. Our best policy is 26% more successful than other navigation methods across our test environments.
AutoRL (red) success over short distances (up to 10 meters) in several unseen buildings. Compared to hand-tuned DDPG (dark-red), artificial potential fields (light blue), dynamic window approach (blue), and behavior cloning (green).
AutoRL local planner policy transfer to robots in real, unstructured environments
While these policies only perform local navigation, they are robust to moving obstacles and transfer well to real robots, even in unstructured environments. Though they were trained in simulation with only static obstacles, they can also handle moving objects effectively. The next step is to combine the AutoRL policies with sampling-based planning to extend their reach and enable long-range navigation.

Achieving Long Range Navigation with PRM-RL
Sampling-based planners tackle long-range navigation by approximating robot motions. For example, probabilistic roadmaps (PRMs) sample robot poses and connect them with feasible transitions, creating roadmaps that capture valid movements of a robot across large spaces. In our second paper, which won Best Paper in Service Robotics at ICRA 2018, we combine PRMs with hand-tuned RL-based local planners (without AutoRL) to train robots once locally and then adapt them to different environments.

First, for each robot we train a local planner policy in a generic simulated training environment. Next, we build a PRM with respect to that policy, called a PRM-RL, over a floor plan for the deployment environment. The same floor plan can be used for any robot we wish to deploy in the building, requiring only a one-time setup per robot and environment.

To build a PRM-RL we connect sampled nodes only if the RL-based local planner, which represents robot noise well, can reliably and consistently navigate between them. This is done via Monte Carlo simulation. The resulting roadmap is tuned to both the abilities and geometry of the particular robot. Roadmaps for robots with the same geometry but different sensors and actuators will have different connectivity. Since the agent can navigate around corners, nodes without a clear line of sight can be included, whereas nodes near walls and obstacles are less likely to be connected into the roadmap because of sensor noise. At execution time, the RL agent navigates from roadmap waypoint to waypoint.
Roadmap being built with 3 Monte Carlo simulations per randomly selected node pair.
The largest map was 288 meters by 163 meters and contained almost 700,000 edges, collected over 4 days using 300 workers in a cluster and requiring 1.1 billion collision checks.
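A minimal sketch of this roadmap-building rule is below: sample node pairs within the local planner's range and keep an edge only if repeated Monte Carlo rollouts of the noisy planner succeed often enough (the 20 trials and 90% threshold match the numbers quoted later in this post; everything else, including the toy stand-in planner, is an illustrative assumption rather than the production pipeline).

import itertools
import random

def build_prm_rl_roadmap(nodes, rollout_local_planner, max_range_m=10.0,
                         trials=20, min_success_rate=0.9):
    """Connect sampled nodes only if Monte Carlo rollouts of the noisy RL local
    planner reliably navigate between them (illustrative sketch)."""
    edges = {}
    for a, b in itertools.combinations(nodes, 2):
        if distance(a, b) > max_range_m:
            continue  # beyond the local planner's reliable range
        successes = sum(rollout_local_planner(a, b) for _ in range(trials))
        if successes / trials >= min_success_rate:
            edges[(a, b)] = successes / trials
    return edges

def distance(a, b):
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

# Toy example: 2D nodes and a stand-in planner whose success probability decays with
# distance (a real rollout would simulate the robot with sensor and actuator noise).
nodes = [(random.uniform(0, 30), random.uniform(0, 30)) for _ in range(50)]
fake_planner = lambda a, b: random.random() < max(0.0, 1.0 - distance(a, b) / 12.0)
roadmap = build_prm_rl_roadmap(nodes, fake_planner)
print(f"{len(roadmap)} edges accepted")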
The third paper makes several improvements over the original PRM-RL. First, we replace the hand-tuned DDPG with AutoRL-trained local planners, which results in improved long-range navigation. Second, we add Simultaneous Localization and Mapping (SLAM) maps, which robots use at execution time, as a source for building the roadmaps. Because SLAM maps are noisy, this change closes the “sim2real gap”, a phenomenon in robotics where simulation-trained agents significantly underperform when transferred to real robots. Our simulated success rates are the same as in on-robot experiments. Lastly, we added distributed roadmap building, resulting in very large scale roadmaps containing up to 700,000 nodes.

We evaluated the method using our AutoRL agent, building roadmaps using the floor maps of offices up to 200x larger than the training environments, accepting edges with at least 90% success over 20 trials. We compared PRM-RL to a variety of different methods over distances up to 100m, well beyond the local planner range. PRM-RL had 2 to 3 times the success rate of the baselines because the nodes were connected appropriately for the robot’s capabilities.
Success rates for navigation over 100 meters in several buildings. First paper - AutoRL local planner only (blue); original PRMs (red); path-guided artificial potential fields (yellow); second paper (green); third paper - PRMs with AutoRL (orange).
We tested PRM-RL on multiple real robots and real building sites. One set of tests is shown below; the robot is very robust except near cluttered areas and off the edge of the SLAM map.
On-robot experiments
Conclusion
Autonomous robot navigation can significantly improve the independence of people with limited mobility. We can achieve this by developing easy-to-adapt robotic autonomy, including methods that can be deployed in new environments using information that is already available. This is done by automating the learning of basic, short-range navigation behaviors with AutoRL and using these learned policies in conjunction with SLAM maps to build roadmaps. These roadmaps consist of nodes connected by edges that robots can traverse consistently. The result is a policy that, once trained, can be used across different environments and can produce a roadmap custom-tailored to the particular robot.

Acknowledgements
The research was done by, in alphabetical order, Hao-Tien Lewis Chiang, James Davidson, Aleksandra Faust, Marek Fiser, Anthony Francis, Jasmine Hsu, J. Chase Kew, Tsang-Wei Edward Lee, Ken Oslund, Oscar Ramirez from Robotics at Google and Lydia Tapia from University of New Mexico. We thank Alexander Toshev, Brian Ichter, Chris Harris, and Vincent Vanhoucke for helpful discussions.

Source: Google AI Blog


Soft Actor-Critic: Deep Reinforcement Learning for Robotics



Deep reinforcement learning (RL) provides the promise of fully automated learning of robotic behaviors directly from experience and interaction in the real world, due to its ability to process complex sensory input using general-purpose neural network representations. However, many existing RL algorithms require days or weeks (or more) worth of real-world data in order to converge to the desired behavior. Furthermore, such systems can be tough to deploy on complex robotic systems (such as legged robots) which can easily get damaged during the exploration phase, hyperparameter settings can be challenging to tune, and various safety considerations can introduce further limitations.

In collaboration with UC Berkeley, we recently released Soft Actor-Critic (SAC), a stable and efficient deep RL algorithm suitable for real-world robotic skill learning that is well-aligned with the requirements of robotic experimentation. Importantly, SAC is efficient enough to solve real-world robot tasks in only a handful of hours, and works on a variety of environments with a single set of hyperparameters. Below, we discuss some of the research behind SAC, and also describe some of our recent experiments.

Requirements for Real-World Robotic Learning
Real-world robotic experimentation brings significant challenges, such as constant interruptions in the data stream due to hardware failures and manual resets, and the need for smooth exploration to avoid mechanical wear and tear on the robot. These constraints place additional restrictions on both the algorithm and its implementation, including (but not limited to):
  • Good sample efficiency to lower the learning time
  • Minimal number of hyperparameters that require tuning
  • Reusing already collected data on different scenarios (known as off-policy learning)
  • Ensuring that learning and exploration does not damage the hardware
Soft Actor-Critic
Soft actor-critic is based on maximum entropy reinforcement learning, a framework that aims to both maximize the expected reward (which is the standard RL objective) and to maximize the policy's entropy. Policies with higher entropy are more random, which intuitively means that maximum entropy reinforcement learning prefers the most random policy that still achieves a high reward.
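Concretely, in the notation of the SAC papers, the maximum entropy objective augments the usual sum of expected rewards with the policy's entropy at every visited state, weighted by a temperature α that trades reward against randomness:

J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi} \big[\, r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \,\big]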

Why might this be desirable for robotic learning? The most obvious reason is that policies optimized for maximum entropy will be more robust: if the policy can tolerate highly random behavior during training, it is more likely to respond successfully to unexpected perturbations at test time. However, a more subtle reason is that training for maximum entropy can improve both the algorithm's robustness to hyperparameters and its sample efficiency (to learn more, see this BAIR blog post, and this tutorial).

Soft actor-critic maximizes the entropy augmented reward by learning a stochastic policy that maps states to actions and a Q-function that estimates the objective value of the current policy, optimizing them using approximate dynamic programming. In doing so, SAC views the objective as a grounded way to derive better reinforcement learning algorithms that perform consistently and are sample efficient enough to be applicable to real-world robotic applications. For technical details please see our technical report.
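As a small illustration of how the entropy term enters the dynamic programming, the snippet below computes an entropy-augmented (soft) bootstrapped target for the Q-function; alpha is the entropy temperature, and q_target_next and logp_next would come from a target Q-network and the log-probability of a next action sampled from the current policy. The names and the single-Q form are assumptions made for clarity, not the reference implementation's API.

def soft_q_target(reward, done, q_target_next, logp_next, gamma=0.99, alpha=0.2):
    # The bootstrapped next-state value includes an entropy bonus: -alpha * log pi(a'|s').
    soft_value_next = q_target_next - alpha * logp_next
    return reward + gamma * (1.0 - done) * soft_value_next

print(soft_q_target(reward=1.0, done=0.0, q_target_next=5.0, logp_next=-1.3))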

Performance of SAC
We evaluated SAC on two tasks: 1) quadrupedal walking with the Minitaur robot from Ghost Robotics, and 2) rotating a valve with a three-fingered Dynamixel Claw. Learning to walk presents a substantial challenge, as the robot is underactuated and must therefore delicately balance contact forces on the legs to make forward progress. An untrained policy can lose balance and fall, and too many falls will eventually damage the robot, making sample-efficient learning essential.

Although we trained our policy only on flat terrain, we subsequently tested it on varied terrains and obstacles. In principle, policies learned with soft actor-critic should be robust to test-time perturbations, because they are trained to maximize entropy (i.e., inject maximal noise) at training-time. Indeed, we observe that the policies learned with our method are robust to these perturbations without any additional learning.
Illustration of learned walking, using SAC implemented on the Minitaur robot. A full video of the learning process can be found at our project website.
The manipulation task requires the hand to rotate a valve-like object so that the colored peg faces to the right, as shown below. This task is exceptionally challenging due to both the perception challenges and the need to control a hand with 9 degrees of freedom. In order to perceive the valve, the robot must use raw RGB images shown in the inset at the bottom right. The initial position of the valve is reset uniformly at random for each episode, forcing the policy to learn to use the raw RGB images to perceive the current valve orientation.
Soft actor-critic solves both of these tasks quickly: the Minitaur locomotion takes 2 hours, and the valve-turning task from image observations takes 20 hours. We also learned a policy for the valve-turning task without images by providing the actual valve position as an observation to the policy. Soft actor-critic can learn this easier version of the valve task in 3 hours. For comparison, prior work has used natural policy gradients to learn the same task without images in 7.4 hours.

Conclusion
Our work demonstrates that deep reinforcement learning based on the maximum entropy framework can be applied to learn robot skills in challenging real-world settings. Since the policies are learned directly in the real world, they exhibit robustness to variations in the environment, which can be difficult to obtain otherwise. We also showed that we can learn directly from high-dimensional image observations, which represents a significant challenge in classical robotics. We hope that the release of SAC helps other research teams in their effort to adopt deep RL for more complex real-world tasks in the future.

For more technical details, please visit the BAIR blog post, or read an early preprint of the locomotion experiment and a more complete description of the algorithm. You can find the implementation on GitHub.

Acknowledgements
This research was done in collaboration between Google and UC Berkeley. We would like to thank all the people who were involved, including Sehoon Ha, Kristian Hartikainen, Jie Tan, George Tucker, Vincent Vanhoucke and Aurick Zhou.

Source: Google AI Blog


Grasp2Vec: Learning Object Representations from Self-Supervised Grasping



From a remarkably young age, people are capable of recognizing their favorite objects and picking them up, despite never being explicitly taught how to do so. According to cognitive developmental research, the ability to interact with objects in the world plays a crucial role in the emergence of object perception and manipulation capabilities, such as targeted grasping. By interacting with the world around them, people are able to learn with self-supervision: we know what actions we took, and we learn from the outcome. In robotics, this type of self-supervised learning is actively researched because it enables robotic systems to learn without the need for large amounts of training data or manual supervision.

Inspired by the concept of object permanence, we propose Grasp2Vec, a simple yet highly effective algorithm for acquiring object representations. Grasp2Vec is based on the intuition that an attempt to pick up anything provides several pieces of information — if a robot grasps an object and holds it up, the object had to be in the scene before the grasp. Furthermore, the robot knows that the object it grasped is currently in its gripper, and therefore has been removed from the scene. By using this form of self supervision, the robot can learn to recognize the object by the visual change in the scene after the grasp.
Building on our prior collaboration with X Robotics, where a series of robots learn in parallel to grasp household objects using only monocular camera inputs, we use a robotic arm to grasp objects “unintentionally”, and that experience enables the learning of a rich representation of objects. These representations can then be used to acquire “intentional grasping” capabilities, where the robot arm can then pick up user-commanded objects.
Constructing a Perceptual Reward Function
In the framework of reinforcement learning (RL), task success is measured via a “reward function”. By maximizing that reward, robots can teach themselves diverse grasping skills from scratch. Engineering a reward function is easy when success can be measured by simple sensor measurements. A simple example of this is a button that supplies rewards directly to a robot when it is pushed.

However, engineering a reward function is much more difficult when our success criteria depends on perceptual understanding of the task at hand. Consider the task of instance grasping, where a robot is presented a picture of a desired object being held in the gripper. After the robot attempts to grasp that object, it inspects the contents of the gripper. The reward function for this task comes down to answering the question of object recognition: Do these objects match?
On the left, the gripper is holding the brush and there are some objects (yellow cup, blue plastic block) in the background. On the right, the gripper is holding the yellow cup and the brush is in the background. If the left image was the desired outcome, a good reward function should “understand” that the two images above correspond to different objects.
In order to solve this recognition problem, we need a perception system that extracts meaningful object concepts from unstructured image data (without any human annotations), learning the visual perception of objects in an unsupervised fashion. At their core, unsupervised learning algorithms work because they make structural assumptions about data. It is common to assume that images can be compressed into a low-dimensional space, and that frames in a video can be predicted from previous frames. However, without further assumptions on the content of the data, these are usually insufficient for learning disentangled object representations.

What if we used a robot to physically disentangle objects from each other during data collection? The field of robotics presents an exciting opportunity for representation learning because robots can manipulate objects, thus providing the factors of variation needed in data. Our method relies on the insight that grasping an object removes it from the scene. This yields 1) an image of the scene before grasping, 2) an image of the scene after grasping and 3) an isolated view of the grasped object itself.
Left: Objects before the grasp. Center: Objects after the grasp. Right: The Grasped object.
If we then consider an embedding function that extracts “the set of objects” from images, it should preserve the following subtractive relation:
objects_before_grasp - objects_after_grasp = grasped_object
We implement this equality relation using a fully convolutional architecture and a simple metric learning algorithm. At training time, the architecture shown below embeds the pre-grasp images and post-grasp images into a dense spatial feature map. The maps are mean-pooled into vectors and the difference between the “before grasp” and “after grasp” vectors represents a set of objects. This vector and the corresponding vector representation of the grasped object are pushed to equivalence via the N-Pairs objective.
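The sketch below (NumPy) spells out the arithmetic this architecture encodes: mean-pool the pre- and post-grasp spatial feature maps into vectors, take their difference as the embedding of “what was removed from the scene”, and pull it toward the grasped-object embedding with an N-pairs-style softmax loss over a batch. The shapes, pooling, and exact loss form are illustrative assumptions rather than the trained Grasp2Vec model.

import numpy as np

def scene_difference_embedding(pre_grasp_features, post_grasp_features):
    """Mean-pool spatial feature maps (H, W, D) into vectors and subtract:
    objects_before_grasp - objects_after_grasp ~= grasped_object."""
    return pre_grasp_features.mean(axis=(0, 1)) - post_grasp_features.mean(axis=(0, 1))

def n_pairs_loss(anchors, positives):
    """N-pairs-style objective: each anchor should match its own positive more
    than any other positive in the batch (anchors, positives: (N, D))."""
    logits = anchors @ positives.T                       # pairwise similarities
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                  # cross-entropy on the diagonal

# Toy batch: 4 grasps, 16x16 feature maps with 8 channels, random stand-in features.
rng = np.random.default_rng(0)
pre = rng.normal(size=(4, 16, 16, 8))
post = rng.normal(size=(4, 16, 16, 8))
grasped = rng.normal(size=(4, 8))
anchors = np.stack([scene_difference_embedding(p, q) for p, q in zip(pre, post)])
print(n_pairs_loss(anchors, grasped))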
Once trained, two useful properties emerge naturally from our model.

1. Object Similarity
The first property is that a cosine distance between vector embeddings allows us to compare objects and determine whether they are identical. This can be used to implement reward functions for reinforcement learning, and allow robots to learn instance grasping without human-provided labels.
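For instance grasping, such a reward can be as simple as thresholding the cosine similarity between the embedding of the requested (goal) object and the embedding of whatever ends up in the gripper; the threshold below is an arbitrary placeholder, not a tuned value.

import numpy as np

def instance_grasp_reward(goal_embedding, outcome_embedding, threshold=0.8):
    """Reward 1 if the grasped object 'matches' the requested object, else 0."""
    cosine = np.dot(goal_embedding, outcome_embedding) / (
        np.linalg.norm(goal_embedding) * np.linalg.norm(outcome_embedding) + 1e-8)
    return float(cosine > threshold)

print(instance_grasp_reward(np.ones(4), np.ones(4)))  # identical embeddings -> 1.0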
2. Localizing Target Objects
The second property is that we can combine scene spatial maps and object embeddings to localize a “query object” in image space. By taking the element-wise product of spatial feature maps and the vector corresponding to the query object, we can find all the pixels in the spatial map that “match” the query object.
Using Grasp2Vec embeddings to localize objects in a scene. The image on the top left shows the objects in the bin. On the bottom left is the query object we wish to grasp. By taking the dot product of the query object vector with the spatial features of the scene image, we get a per-pixel “activation map” (top right image) of how similar that region of the image is to the query. This response map can be used to approach the object for grasping.
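In code, this localization step is a per-pixel dot product between the scene's spatial feature map and the query object's embedding vector; the sketch below uses random arrays as stand-ins for features produced by the trained network.

import numpy as np

def query_heatmap(scene_features, query_embedding):
    """scene_features: (H, W, D) spatial map; query_embedding: (D,) object vector.
    Returns an (H, W) activation map, hottest where the query object likely is."""
    heatmap = np.einsum('hwd,d->hw', scene_features, query_embedding)
    return (heatmap - heatmap.min()) / (np.ptp(heatmap) + 1e-8)  # normalize to [0, 1]

# The argmax of the heatmap gives a pixel location to approach for grasping.
rng = np.random.default_rng(0)
heat = query_heatmap(rng.normal(size=(32, 32, 8)), rng.normal(size=8))
print(np.unravel_index(np.argmax(heat), heat.shape))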
Our method also works when there are multiple objects that match the query object, or even if the query consists of multiple objects (the average of two vectors). For example, here is a scenario where it detects multiple orange blocks in a scene.
The resulting “heatmap” can be used to plan the robot approach to the target object(s). We combine Grasp2Vec’s localization and instance recognition capabilities with our “grasp anything” policies to obtain a success rate of 80% on objects seen during data collection and 59% on novel objects the robot hasn’t encountered before.

Conclusion
In our paper, we show how robotic grasping skills can generate the data used for learning object-centric representations. We can then use representation learning to “bootstrap” more complex skills like instance grasping, all while retaining the self-supervised learning properties of our autonomous grasping system.

Besides our own work, a number of recent papers have also studied how self-supervised interaction can be used to acquire representations, by grasping, pushing, and otherwise manipulating objects in the environment. Going forward, we are excited not only for what machine learning can bring to robotics by way of better perception and control, but also what robotics can bring to machine learning in new paradigms of self-supervision.

Acknowledgements
This research was conducted by Eric Jang, Coline Devin, Vincent Vanhoucke, and Sergey Levine. We’d like to thank Adrian Li, Alex Irpan, Anthony Brohan, Chelsea Finn, Christian Howard, Corey Lynch, Dmitry Kalashnikov, Ian Wilkes, Ivonne Fajardo, Julian Ibarz, Ming Zhao, Peter Pastor, Pierre Sermanet, Stephen James, Tsung-Yi Lin, Yunfei Bai, and many others at Google, X, and the broader robotics community who contributed to improving this work.

Source: Google AI Blog


A Structured Approach to Unsupervised Depth Learning from Monocular Videos



Perceiving the depth of a scene is an important task for an autonomous robot — the ability to accurately estimate how far objects are from the robot is crucial for obstacle avoidance, safe planning and navigation. While depth can be obtained (and learned) from sensor data, such as LIDAR, it is also possible to learn it in an unsupervised manner from a monocular camera only, relying on the motion of the robot and the resulting different views of the scene. In doing so, the “ego-motion” (the motion of the robot/camera between two frames) is also learned, which provides localization of the robot itself. While this approach has a long history — coming from the structure-from-motion and multi-view geometry paradigms — new learning-based techniques, more specifically unsupervised learning of depth and ego-motion using deep neural networks, have advanced the state of the art, including work by Zhou et al. and our own prior research, which aligns 3D point clouds of the scene during training.

Despite these efforts, learning to predict scene depth and ego-motion remains an ongoing challenge, specifically when handling highly dynamic scenes and estimating proper depth of moving objects. Because previous research efforts for unsupervised monocular learning do not model moving objects, they can consistently misestimate objects’ depth, often mapping it to infinity.

In “Depth Prediction Without the Sensors: Leveraging Structure for Unsupervised Learning from Monocular Videos”, to appear in AAAI 2019, we propose a novel approach which is able to model moving objects and produces high quality depth estimation results. Compared to previous methods for unsupervised learning from monocular videos, our approach recovers the correct depth for moving objects. In our paper, we also propose a seamless online refinement technique that can further improve quality and be applied for transfer across datasets. Furthermore, to encourage even more advanced approaches to onboard robotic learning, we have open sourced the code in TensorFlow.
Previous work (middle row) has not been able to correctly estimate the depth of moving objects, mapping them to infinity (dark blue regions in the heatmap). Our approach (right) provides much better depth estimates.
Structure
A key idea in our approach is to introduce structure into the learning framework. That is, instead of relying on a neural network to learn depth directly, we treat the monocular scene as 3D, composed of moving objects, including the robot itself. The respective motions are modeled as independent transformations — rotations and translations — in the scene, which are then used to model the 3D geometry and estimate all the objects’ motions. Additionally, knowing which objects may potentially move (e.g., cars, people, bicycles, etc.) helps us learn separate motion vectors for them even if they may be static. By decomposing the scene into 3D geometry and individual objects, the model learns better depth and ego-motion, especially in very dynamic scenes.
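In the notation common to this line of work (e.g., Zhou et al.), the geometry enters through differentiable view synthesis: a pixel p_t in the target frame is reprojected into a source view using the predicted depth D_t, the camera intrinsics K, and a rigid transform T. In our setting, T is the ego-motion for static pixels and, for pixels inside a potentially moving object's mask, the composition of the ego-motion with that object's own predicted transform. In homogeneous coordinates,

p_s \sim K \, T_{t \to s} \, D_t(p_t) \, K^{-1} \, p_t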

We tested this method on both the KITTI and Cityscapes urban driving datasets, and found that it outperforms state-of-the-art approaches and comes close in quality to methods that use stereo video pairs as training supervision. Importantly, we are able to correctly recover the depth of a car moving at the same speed as the ego-motion vehicle. This has been challenging previously — in this case, the moving vehicle appears (in a monocular input) as static, exhibiting the same behavior as the static horizon, resulting in an inferred infinite depth. While stereo inputs can solve that ambiguity, our approach is the first one that is able to correctly infer this from a monocular input.
Previous work with monocular inputs was not able to extract moving objects and incorrectly mapped them to infinity.
Furthermore, since objects are treated individually in our method, the algorithm is able to provide motion vectors for each individual object, i.e. an estimate of where it is heading:
Example depth results for a dynamic scene together with estimates of the motion vectors of the individual objects (rotation angles are estimated too, but for simplicity are not shown).
In addition to these results, this research provides motivation for further exploring what an unsupervised learning approach can achieve, as monocular inputs are cheaper and easier to deploy than stereo or LIDAR sensors. As can be seen in the figures below, in both the KITTI and Cityscapes datasets, the supervision sensor (be it stereo or LIDAR) has missing values and may occasionally be misaligned with the camera input due to time delay.
Depth prediction from monocular video input on the KITTI dataset (middle row), compared to ground truth depth from a LIDAR sensor; the latter does not cover the full scene and has missing and noisy values. Ground truth depth is not used during training.
Depth prediction on the Cityscapes dataset. Left to right: image, baseline, our method and ground truth provided by stereo. Note the missing values in the stereo ground truth. Also note that our algorithm is able to achieve these results without any ground truth depth supervision.
Ego-motion
Our method also provides ego-motion estimates that are the best among the state of the art, which is crucial for autonomous robots, as ego-motion provides localization of the robot while it moves through the environment. The video below visualizes the speed and turning angle obtained from the inferred ego-motion. While the outputs of both depth and ego-motion are only valid up to a scale factor, we can see that the model is able to estimate the vehicle’s relative speed as it slows down and stops.
Depth and ego-motion prediction. Follow the speed and the turning angle indicator to see the estimates when the car is taking a turn or stopping for a red light.
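As a small illustration of how such a visualization can be derived (our own sketch, not the code used for the video), a relative pose predicted between two consecutive frames can be converted into a speed and a turning angle. The frame rate and the y-up camera axis convention below are assumptions, and the speed is only meaningful up to the unknown monocular scale factor.

```python
import numpy as np

def speed_and_turn(R, t, frame_rate_hz=10.0):
    """Convert a relative camera pose (R, t) between consecutive frames into a
    speed estimate and a turning angle. Assumes a y-up camera convention and
    an illustrative frame rate."""
    # Translation magnitude per frame times frame rate gives speed, known only
    # up to the global monocular scale.
    speed = np.linalg.norm(t) * frame_rate_hz
    # Heading change (yaw) about the vertical axis, in degrees per frame.
    turn = np.degrees(np.arctan2(R[0, 2], R[2, 2]))
    return speed, turn
```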
Transfer Across Domains
An important characteristic of a learning algorithm is its adaptability when moved to an unknown environment. In this work we further introduce an online refinement approach that continues to learn online while collecting new data. Below are examples of the improvement in estimated depth quality after training on Cityscapes and applying online refinement on KITTI.
Online refinement when training on the Cityscapes Data and testing on KITTI. The images show depth prediction of the trained model, and of the trained model with online refinement. Depth prediction with online refinement better outlines the objects in the scene.
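A minimal sketch of what such an online refinement loop can look like, assuming a hypothetical `model` object that exposes the same self-supervised view-synthesis loss used during offline training via `unsupervised_loss` and a `predict_depth` method. The names, the window size, and the hyperparameters are illustrative, not the released code.

```python
import tensorflow as tf

def online_refinement(model, frame_windows, steps_per_window=20, lr=1e-4):
    """Keep minimizing the self-supervised loss on the frames the robot is
    currently seeing, so the model adapts to the new environment on the fly."""
    optimizer = tf.keras.optimizers.Adam(learning_rate=lr)
    for window in frame_windows:              # small sliding window of recent frames
        for _ in range(steps_per_window):     # a few gradient steps per window
            with tf.GradientTape() as tape:
                loss = model.unsupervised_loss(window)   # hypothetical method
            grads = tape.gradient(loss, model.trainable_variables)
            optimizer.apply_gradients(zip(grads, model.trainable_variables))
        yield model.predict_depth(window)     # refined depth for this window
```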
We further tested on a notably different dataset and setting: an indoor dataset collected by the Fetch robot, while training was done on the outdoor urban driving Cityscapes dataset. As is to be expected, there is a large discrepancy between these datasets. Despite this, we observe that the online refinement technique obtains better depth estimates than the baseline.
Results of online adaptation when transferring the learning model from Cityscapes (an outdoors dataset collected from a moving car) to a dataset collected indoors by the Fetch robot. The bottom row shows improved depth after applying online refinement.
In summary, this work addresses unsupervised learning of depth and ego-motion from a monocular camera and tackles the problem in highly dynamic scenes. It achieves high-quality depth and ego-motion results, with quality comparable to stereo, and puts forward the idea of incorporating structure into the learning process. More notably, our proposed combination of unsupervised learning of depth and ego-motion from monocular video with online adaptation demonstrates a powerful concept: not only can the model learn in an unsupervised manner from simple video, it can also be transferred easily to other datasets.

Acknowledgements
This research was conducted by Vincent Casser, Soeren Pirk, Reza Mahjourian and Anelia Angelova. We would like to thank Ayzaan Wahid for his help with data collection and Martin Wicke and Vincent Vanhoucke for their support and encouragement.

Source: Google AI Blog


Scalable Deep Reinforcement Learning for Robotic Manipulation



How can robots acquire skills that generalize effectively to diverse, real-world objects and situations? While designing robotic systems that effectively perform repetitive tasks in controlled environments, like building products on an assembly line, is fairly routine, designing robots that can observe their surroundings and decide the best course of action while reacting to unexpected outcomes is exceptionally difficult. However, there are two tools that can help robots acquire such skills from experience: deep learning, which is excellent at handling unstructured real-world scenarios, and reinforcement learning, which enables longer-term reasoning while exhibiting more complex and robust sequential decision making. Combining these two techniques has the potential to enable robots to learn continuously from their experience, allowing them to master basic sensorimotor skills using data rather than manual engineering.

Designing reinforcement learning algorithms for robot learning introduces its own set of challenges: real-world objects span a wide variety of visual and physical properties, subtle differences in contact forces can make predicting object motion difficult, and objects of interest can be obstructed from view. Furthermore, robotic sensors are inherently noisy, adding to the complexity. All of these factors make it incredibly difficult to learn a general solution unless there is enough variety in the training data, which takes time to collect. This motivates exploring learning algorithms that can effectively reuse past experience, similar to our previous work on grasping, which benefited from large datasets. However, that previous work could not reason about the long-term consequences of its actions, which is important for learning how to grasp. For example, if multiple objects are clumped together, pushing one of them apart (called “singulation”) will make the grasp easier, even if doing so does not directly result in a successful grasp.
Examples of singulation.

To be more efficient, we need to use off-policy reinforcement learning, which can learn from data that was collected hours, days, or weeks ago. To design such an off-policy reinforcement learning algorithm that can benefit from large amounts of diverse experience from past interactions, we combined large-scale distributed optimization with a new fitted deep Q-learning algorithm that we call QT-Opt. A preprint is available on arXiv.

QT-Opt is a distributed Q-learning algorithm that supports continuous action spaces, making it well-suited to robotics problems. To use QT-Opt, we first train a model entirely offline, using whatever data we’ve already collected. This doesn’t require running the real robot, making it easier to scale. We then deploy and finetune that model on the real robot, further training it on newly collected data. As we run QT-Opt, we accumulate more offline data, letting us train better models, which lets us collect better data, and so on.
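The piece that lets Q-learning handle continuous arm and gripper commands is replacing the discrete argmax over actions with a stochastic optimizer; QT-Opt uses the cross-entropy method (CEM) for this. Below is a minimal sketch of that step, with illustrative hyperparameters and a placeholder `q_func(state, action)` callable standing in for the learned Q-network.

```python
import numpy as np

def cem_maximize_q(q_func, state, action_dim, iterations=2, samples=64, elites=6):
    """Approximate argmax_a Q(state, a) over a continuous action space with the
    cross-entropy method: sample actions from a Gaussian, keep the best-scoring
    ones, refit the Gaussian, and repeat."""
    mean, std = np.zeros(action_dim), np.ones(action_dim)
    for _ in range(iterations):
        actions = np.random.normal(mean, std, size=(samples, action_dim))
        scores = np.array([q_func(state, a) for a in actions])
        best = actions[np.argsort(scores)[-elites:]]       # top-scoring actions
        mean, std = best.mean(axis=0), best.std(axis=0) + 1e-6
    return mean  # used both to act and to form the Bellman target max_a' Q(s', a')
```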

To apply this approach to robotic grasping, we used 7 real-world robots, which ran for 800 total robot hours over the course of 4 months. To bootstrap collection, we started with a hand-designed policy that succeeded 15-30% of the time. Data collection switched to the learned model when it started performing better. The policy takes a camera image and returns how the arm and gripper should move. The offline data contained grasps on over 1000 different objects.
Some of the training objects used.
In the past, we’ve seen that sharing experience across robots can accelerate learning. We scaled this training and data gathering process to ten GPUs, seven robots, and many CPUs, allowing us to collect and process a large dataset of over 580,000 grasp attempts. At the end of this process, we successfully trained a grasping policy that runs on a real world robot and generalizes to a diverse set of challenging objects that were not seen at training time.
Seven robots collecting grasp data.
Quantitatively, the QT-Opt approach succeeded in 96% of the grasp attempts across 700 trial grasps on previously unseen objects. Compared to our previous supervised-learning based grasping approach, which had a 78% success rate, our method reduced the error rate by more than a factor of five.
The objects used at evaluation time. To make the task challenging, we aimed for a large variety of object sizes, textures, and shapes.

Notably, the policy exhibits a variety of closed-loop, reactive behaviors that are often not found in standard robotic grasping systems:
  • When presented with a set of interlocking blocks that cannot be picked up together, the policy separates one of the blocks from the rest before picking it up.
  • When presented with a difficult-to-grasp object, the policy figures out it should reposition the gripper and regrasp it until it has a firm hold.
  • When grasping in clutter, the policy probes different objects until the fingers hold one of them firmly, before lifting.
  • When we perturbed the robot by intentionally swatting the object out of the gripper (something it had not seen during training), it automatically repositioned the gripper for another attempt.
Crucially, none of these behaviors were engineered manually. They emerged automatically from self-supervised training with QT-Opt, because they improve the model’s long-term grasp success.
Examples of the learned behaviors. In the left GIF, the policy corrects for the moved ball. In the right GIF, the policy tries several grasps until it succeeds at picking up the tricky object.

Additionally, we’ve found that QT-Opt reaches this higher success rate using less training data, albeit taking longer to converge. This is especially exciting for robotics, where the bottleneck is usually collecting real robot data rather than training time. Combining this with other data efficiency techniques (such as our prior work on domain adaptation for grasping) could open several interesting avenues in robotics. We’re also interested in combining QT-Opt with recent work on learning how to self-calibrate, which could further improve its generality.

Overall, the QT-Opt algorithm is a general reinforcement learning approach that’s giving us good results on real world robots. Besides the reward definition, nothing about QT-Opt is specific to robot grasping. We see this as a strong step towards more general robot learning algorithms, and are excited to see what other robotics tasks we can apply it to. You can learn more about this work in the short video below.
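As an illustration of that point, the task-specific piece is roughly a reward function of the following shape. The exact values and shaping we used are described in the paper; the numbers below are hypothetical.

```python
def grasp_reward(episode_done, grasp_succeeded, step_penalty=0.01):
    """A hypothetical sparse grasping reward: success is rewarded once at the
    end of the episode, and a small per-step penalty (value chosen here purely
    for illustration) encourages the policy to grasp quickly. Swapping out this
    function is, in principle, what it takes to point QT-Opt at another task."""
    if episode_done:
        return 1.0 if grasp_succeeded else 0.0
    return -step_penalty
```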
Acknowledgements
This research was conducted by Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang, Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, and Sergey Levine. We’d also like to give special thanks to Iñaki Gonzalo and John-Michael Burke for overseeing the robot operations, Chelsea Finn, Timothy Lillicrap, and Arun Nair for valuable discussions, and other people at Google and X who’ve contributed their expertise and time towards this research. A preprint is available on arXiv.

Source: Google AI Blog


Teaching Uncalibrated Robots to Visually Self-Adapt



People are remarkably proficient at manipulating objects without needing to adjust their viewpoint to a fixed or specific pose. This capability (referred to as visual motor integration) is learned during childhood from manipulating objects in various situations, and is governed by a self-adaptation and mistake-correction mechanism that uses rich sensory cues and vision as feedback. However, this capability is quite difficult to achieve for vision-based controllers in robotics, which until now have been built on rigid setups that read visual input data from a camera mounted in a fixed position that should not be moved or repositioned at train and test time. The ability to quickly acquire visual motor control skills under large viewpoint variation would have substantial implications for autonomous robotic systems; for example, this capability would be particularly desirable for robots that assist rescue efforts in emergency or disaster zones.

In “Sim2Real Viewpoint Invariant Visual Servoing by Recurrent Control”, presented at CVPR 2018 this week, we study a novel deep network architecture (consisting of two fully convolutional networks and a long short-term memory unit) that learns from a history of past actions and observations to self-calibrate. Using diverse simulated data consisting of demonstrated trajectories and reinforcement learning objectives, our visually adaptive network is able to control a robotic arm to reach a diverse set of visually indicated goals, from various viewpoints and independent of camera calibration.
Viewpoint invariant manipulation for visually indicated goal reaching with a physical robotic arm. We learn a single policy that can reach diverse goals from sensory input captured from drastically different camera viewpoints. First row shows the visually indicated goals.
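For a more concrete picture of the architecture, here is a rough Keras skeleton of the kind of network described above: two convolutional encoders (one for the current observation, one for the visually indicated goal) feed an LSTM over the history of observations and actions, which predicts the next arm command. The layer sizes, the global pooling, and the function name are illustrative assumptions rather than the implementation used in the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_recurrent_servoing_net(image_shape=(128, 128, 3), action_dim=7,
                                 history_len=10):
    """Rough, illustrative skeleton of a recurrent visual servoing network."""
    def conv_encoder():
        return tf.keras.Sequential([
            layers.Conv2D(32, 5, 2, activation="relu"),
            layers.Conv2D(64, 3, 2, activation="relu"),
            layers.Conv2D(64, 3, 2, activation="relu"),
            layers.GlobalAveragePooling2D(),
        ])

    obs_seq = layers.Input((history_len,) + image_shape)   # past observations
    goal_img = layers.Input(image_shape)                   # visually indicated goal
    act_seq = layers.Input((history_len, action_dim))      # past actions

    obs_feat = layers.TimeDistributed(conv_encoder())(obs_seq)
    goal_feat = conv_encoder()(goal_img)
    goal_feat = layers.RepeatVector(history_len)(goal_feat)

    fused = layers.Concatenate()([obs_feat, goal_feat, act_seq])
    hidden = layers.LSTM(128)(fused)                       # memory of past steps
    action = layers.Dense(action_dim)(hidden)              # next arm command
    return tf.keras.Model([obs_seq, goal_img, act_seq], action)
```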

The Challenge
Discovering how the controllable degrees of freedom (DoF) affect visual motion can be ambiguous and underspecified from a single image captured from an unknown viewpoint. Identifying the effect of actions on image-space motion and successfully performing the desired task requires a robust perception system augmented with the ability to maintain a memory of past actions. To be able to tackle this challenging problem, we had to address the following essential questions:
  • How can we provide the right amount of experience for the robot to learn the self-adaptation behavior from purely visual observations, in a way that simulates a lifelong learning paradigm?
  • How can we design a model that integrates robust perception and self-adaptive control such that it can quickly transfer to unseen environments?
To do so, we devised a new manipulation task where a seven-DoF robot arm is provided with an image of an object and is directed to reach that particular goal amongst a set of distractor objects, while viewpoints change drastically from one trial to another. In doing so, we were able to simulate both the learning of complex behaviors and the transfer to unseen environments.
Visually indicated goal reaching task with a physical robotic arm and diverse camera viewpoints.
Harnessing Simulation to Learn Complex Behaviors
Collecting robot experience data is difficult and time-consuming. In a previous post, we showed how to scale up learning skills by distributing data collection and trials across multiple robots. Although this approach expedited learning, it is not feasible to extend it to learning complex behaviors such as visual self-calibration, where we need to expose robots to a huge space of different viewpoints. Instead, we opt to learn such complex behavior in simulation, where we can collect unlimited robot trials and easily move the camera to various random viewpoints. In addition to fast data collection, simulation also lets us surpass the hardware limitations that would otherwise require installing multiple cameras around a robot.
We use the domain randomization technique to learn generalizable policies in simulation.
To learn visually robust features that transfer to unseen environments, we used a technique known as domain randomization (a.k.a. simulation randomization), introduced by Sadeghi & Levine (2017), which enables robots to learn vision-based policies entirely in simulation such that they generalize to the real world. This technique has been shown to work well for various robotic tasks, such as indoor navigation, object localization, and pick-and-place. In addition, to learn complex behaviors like self-calibration, we harnessed the simulation capabilities to generate synthetic demonstrations and combined them with reinforcement learning objectives to learn a robust controller for the robotic arm.
Viewpoint invariant manipulation for visually indicated goal reaching with a simulated seven-DoF robotic arm. We learn a single policy that can reach diverse goals from sensory input captured from dramatically different camera viewpoints.
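To give a flavor of what viewpoint randomization looks like in practice, below is a small pybullet sketch that renders training images from random camera poses with random object colors. The toy scene (a single small cube on a plane) and the sampling ranges are placeholders chosen for illustration, not our actual training setup.

```python
import numpy as np
import pybullet as p
import pybullet_data

def randomized_camera_image(width=128, height=128):
    """Render one training image from a randomized viewpoint with a randomized
    object color: the essence of domain (simulation) randomization."""
    cube = p.loadURDF("cube_small.urdf", basePosition=[0, 0, 0.05])
    # Randomize appearance so the policy cannot latch onto specific colors.
    p.changeVisualShape(cube, -1, rgbaColor=list(np.random.uniform(0, 1, 3)) + [1])
    # Randomize the camera pose so the policy never sees a fixed calibration.
    view = p.computeViewMatrixFromYawPitchRoll(
        cameraTargetPosition=[0, 0, 0],
        distance=np.random.uniform(0.6, 1.5),
        yaw=np.random.uniform(0, 360),
        pitch=np.random.uniform(-60, -20),
        roll=0, upAxisIndex=2)
    proj = p.computeProjectionMatrixFOV(fov=60, aspect=1.0, nearVal=0.01, farVal=10)
    _, _, rgb, _, _ = p.getCameraImage(width, height, viewMatrix=view,
                                       projectionMatrix=proj)
    p.removeBody(cube)
    return np.reshape(rgb, (height, width, 4))[..., :3]

if __name__ == "__main__":
    p.connect(p.DIRECT)
    p.setAdditionalSearchPath(pybullet_data.getDataPath())
    p.loadURDF("plane.urdf")
    images = [randomized_camera_image() for _ in range(8)]  # one batch of views
```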

Disentangling Perception from Control
To enable fast transfer to unseen environments, we devised a deep neural network that combines perception and control, trained end-to-end simultaneously while also allowing each to be learned independently if needed. This disentanglement between perception and control eases transfer to unseen environments and makes the model both flexible and efficient, in that each of its parts (i.e., 'perception' or 'control') can be independently adapted to new environments with small amounts of data. Additionally, while the control portion of the network was trained entirely on simulated data, the perception part of our network was complemented by collecting a small number of static images with object bounding boxes, without needing to collect whole action-sequence trajectories with a physical robot. In practice, we fine-tuned the perception part of our network with only 76 object bounding boxes coming from 22 images.
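A minimal sketch of what such a perception-only adaptation step can look like: the control module stays frozen, and only the perception module (plus an auxiliary box-regression head) is fine-tuned on a handful of real images with bounding boxes. The `perception_net` interface, the Dense box head, and the L1 loss are illustrative assumptions, not the training code used in the paper.

```python
import tensorflow as tf

def finetune_perception(perception_net, control_net, images, boxes,
                        epochs=50, lr=1e-4):
    """Adapt only the perception module to the real world using a few static
    images with object bounding boxes, keeping the simulation-trained control
    module untouched."""
    control_net.trainable = False              # keep the simulated controller intact
    head = tf.keras.layers.Dense(4)            # predicts (x, y, w, h), illustrative
    optimizer = tf.keras.optimizers.Adam(lr)
    for _ in range(epochs):
        with tf.GradientTape() as tape:
            features = perception_net(images, training=True)
            pred_boxes = head(features)
            loss = tf.reduce_mean(tf.abs(pred_boxes - boxes))   # L1 box loss
        variables = perception_net.trainable_variables + head.trainable_variables
        grads = tape.gradient(loss, variables)
        optimizer.apply_gradients(zip(grads, variables))
    return perception_net
```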
Real-world robot and moving camera setup. First row shows the scene arrangements and the second row shows the visual sensory input to the robot.
Early Results
We tested the visually adapted version of our network on a physical robot and on real objects with drastically different appearances from the ones used in simulation. Experiments were performed with either one or two objects on a table: “seen objects” (as labeled in the figure below) were used for visual adaptation using a small collection of real static images, while “unseen objects” had not been seen during visual adaptation. During the test, the robot arm was directed to reach a visually indicated object from various viewpoints. For the two-object experiments, the second object was there to "fool" the robotic arm. While the simulation-only network generalizes well (due to being trained with the domain randomization technique), using even a very small amount of static visual data to visually adapt the controller boosted performance further, thanks to the flexible architecture of our network.
After adapting the visual features with the small amount of real images, performance was boosted by more than 10%. All used real objects are drastically different from the objects seen in simulation.
We believe that learning online visual self-adaptation is an important yet challenging problem, with the goal of learning generalizable policies for robots that can act in diverse and unstructured real-world settings. Our approach can be extended to any sort of automatic self-calibration. See the video below for more information on this work.
Acknowledgements
This research was conducted by Fereshteh Sadeghi, Alexander Toshev, Eric Jang and Sergey Levine. We would also like to thank Erwin Coumans and Yunfei Bai for providing pybullet, and Vincent Vanhoucke for insightful discussions.




Source: Google AI Blog


IoT Developer Story: Deeplocal

Posted by Dave Smith, Developer Advocate for IoT

Deeplocal is a Pittsburgh-based innovation studio that makes inventions as marketing to help the world's most loved brands tell their stories. The team at Deeplocal built several fun and engaging robotics projects using Android Things. Leveraging the developer ecosystem surrounding the Android platform and the compute power of Android Things hardware, they were able to quickly and easily create robots powered by computer vision and machine learning.

DrawBot

DrawBot is a DIY drawing robot that transforms your selfies into physical works of art.

"The Android Things platform helped us move quickly from an idea, to prototype, to final product. Switching from phone apps to embedded code was easy in Android Studio, and we were able to pull in OpenCV modules, motor drivers, and other libraries as needed. The final version of our prototype was created two weeks after unboxing our first Android Things developer kit."

- Brian Bourgeois, Producer, Deeplocal

Want to build your own DrawBot? See the Hackster.io project for all the source code, schematics, and 3D models.

HandBot

HandBot is a robotic hand that learns and reacts to hand gestures: it visually recognizes gestures and applies machine learning to respond to them.

"The Android Things platform made integration work for Handbot a breeze. Using TensorFlow, we were able to train a neural network to recognize hand gestures. Once this was created, we were able to use Android Things drivers to implement games in easy-to-read Android code. In a matter of weeks, we went from a fresh developer kit to competing against a robot hand in Rock, Paper, Scissors."

- Mike Derrick, Software Engineer, Deeplocal

Want to build your own HandBot? See the Hackster.io project for all the source code, schematics, and 3D models.

Visit the Google Hackster community to explore more inspiring ideas just like these, and join Google's IoT Developers Community on Google+ to get the latest platform updates, ask questions, and discuss ideas.