Agile and Intelligent Locomotion via Deep Reinforcement Learning



Recent advancements in deep reinforcement learning (deep RL) have enabled legged robots to learn many agile skills through automated environment interactions. However, the lack of sample efficiency is still a major bottleneck for many algorithms, and researchers have to rely on using off-policy data, imitating animal behaviors, or performing meta learning to reduce the amount of real-world experience required. Moreover, most existing works focus only on simple, low-level skills, such as walking forward, walking backward, and turning. In order to operate autonomously in the real world, robots still need to combine these skills to generate more advanced behaviors.

Today we present two projects that aim to address the above problems and help close the perception-actuation loop for legged robots. In “Data Efficient Reinforcement Learning for Legged Robots”, we present an efficient way to learn low-level motion control policies. By fitting a dynamics model to the robot and planning for actions in real time, the robot learns multiple locomotion skills using less than 5 minutes of data. Going beyond simple behaviors, we explore automatic path navigation in “Hierarchical Reinforcement Learning for Quadruped Locomotion”. With a policy architecture designed for end-to-end training, the robot learns to combine a high-level planning policy with a low-level motion controller in order to navigate autonomously through a curved path.

Data Efficient Reinforcement Learning for Legged Robots
A major roadblock in RL is the lack of sample efficiency. Even with a state-of-the-art sample-efficient learning algorithm like Soft Actor-Critic (SAC), it would still require more than an hour of data to learn a reasonable walking policy, which is difficult to collect in the real world.

In a continued effort to learn walking skills using minimal interaction with the real-world environment, we present a model-based method to learn basic walking skills. Instead of directly learning a policy that maps from environment state to robot action, we learn a dynamics model of the robot that estimates future states given its current state and action. Since the entire learning process requires less than 5 minutes of data, it could be performed directly on the real robot.

We start by executing random actions on the robot, and fit the model to the data collected. With the model fitted, we control the robot using a model predictive control (MPC) planner. We iterate between collecting more data with MPC and re-training the model to better fit the dynamics of the environment.
Overview of the model-based learning pipeline. The system alternates between fitting the dynamics model and collecting trajectories using model predictive control (MPC).
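To make the loop concrete, here is a minimal Python sketch under a few assumptions: a hypothetical `robot` interface (reset, step, random actions), a learned dynamics model exposing `fit` and `predict`, and a generic random-shooting planner standing in for the MPC optimizer used in the paper.

```python
import numpy as np

ACTION_DIM = 8  # assumed number of actuated joints

def reward_fn(state, action):
    # Assumed reward: first state entry is the forward velocity.
    return state[0]

def mpc_plan(model, state, horizon=20, num_candidates=400):
    """Random-shooting MPC: sample action sequences, roll them out through the
    learned dynamics model, and return the first action of the best sequence."""
    candidates = np.random.uniform(-1.0, 1.0, (num_candidates, horizon, ACTION_DIM))
    returns = np.zeros(num_candidates)
    for i, actions in enumerate(candidates):
        s = state
        for a in actions:
            s = model.predict(s, a)
            returns[i] += reward_fn(s, a)
    return candidates[int(np.argmax(returns))][0]

def collect_episode(robot, act_fn, num_steps=500):
    """Run one episode and return (state, action, next_state) transitions."""
    transitions, state = [], robot.reset()
    for _ in range(num_steps):
        action = act_fn(state)
        next_state = robot.step(action)
        transitions.append((state, action, next_state))
        state = next_state
    return transitions

def train(robot, model, num_iterations=10):
    # Seed the dataset with random actions, then alternate between fitting the
    # dynamics model and collecting more trajectories with MPC.
    data = collect_episode(robot, lambda s: robot.sample_random_action())
    for _ in range(num_iterations):
        model.fit(data)
        data += collect_episode(robot, lambda s: mpc_plan(model, s))
    return model
```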
In standard MPC, the controller plans for a sequence of actions at each timestep, and only executes the first of the planned actions. While online replanning with regular feedback from the robot to the controller makes the controller robust to model inaccuracies, it also poses a challenge for the action planner, as planning must finish before the next step of the control loop (usually less than 10 ms for legged robots). To satisfy such a tight time constraint, we introduce a multi-threaded, asynchronous version of MPC, with action planning and execution happening on different threads. While the execution thread applies actions at a high frequency, the planning thread optimizes for actions in the background without interruption. Furthermore, since action planning can take multiple timesteps, the robot state will have changed by the time planning has finished. To compensate for this planning latency, we devise a technique that first predicts the state the robot will be in when the planner is expected to finish its computation, and then uses this predicted state to seed the planning algorithm.
We separate action planning and execution on different threads.
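The sketch below illustrates one way such an asynchronous scheme can be structured. The thread layout, timing constants, and the `plan_actions`, `model`, and `robot` interfaces are illustrative assumptions rather than the actual implementation.

```python
import threading
import time
import collections

CONTROL_DT = 0.002       # execution-loop period in seconds (assumed)
PLAN_LATENCY_STEPS = 5   # planner runtime expressed in control steps (assumed)

plan = collections.deque()   # shared buffer of upcoming actions
lock = threading.Lock()

def execution_loop(robot):
    """Applies actions at a high, fixed frequency."""
    while True:
        with lock:
            action = plan.popleft() if plan else robot.default_action()
        robot.apply(action)
        time.sleep(CONTROL_DT)

def planning_loop(robot, model):
    """Replans in the background, compensating for its own latency."""
    while True:
        state = robot.latest_state()
        with lock:
            queued = list(plan)[:PLAN_LATENCY_STEPS]
        # Latency compensation: roll the model forward through the actions
        # that will execute while this planning step is still running.
        future_state = state
        for a in queued:
            future_state = model.predict(future_state, a)
        new_actions = plan_actions(model, future_state)  # MPC optimization (assumed)
        with lock:
            plan.clear()
            plan.extend(new_actions)
```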
Although MPC refreshes the action plan frequently, the planner still needs to work over long action horizons to keep track of the long-term goal and avoid myopic behaviors. To that end, we use a multi-step loss function, a reformulation of the model loss function that reduces error accumulation over time by computing the prediction error over a range of future steps.
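As a rough illustration, such a loss can be written as follows. This is a sketch in PyTorch; the tensor shapes, the squared error, and the recursive rollout are assumptions rather than the exact formulation in the paper.

```python
import torch

def multi_step_loss(model, states, actions, horizon=10):
    """Roll the learned model forward from states[:, 0] using the logged
    actions and penalize deviation from the observed future states,
    which discourages compounding one-step errors.

    states:  [batch, horizon + 1, state_dim]
    actions: [batch, horizon, action_dim]
    model(state, action) is assumed to predict the next state.
    """
    loss = 0.0
    pred = states[:, 0]
    for h in range(horizon):
        pred = model(pred, actions[:, h])                    # recursive rollout
        loss = loss + torch.mean((pred - states[:, h + 1]) ** 2)
    return loss / horizon
```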

Safety is another concern for learning on the real robot. For legged robots, a small mistake, such as missing a footstep, could lead to catastrophic failures, from the robot falling to the motor overheating. To ensure safe exploration, we embed a stable, in-place stepping gait prior that is modulated by a trajectory generator. With the stable walking prior, MPC can then safely explore the action space.

Combining an accurate dynamics model with an online, asynchronous MPC controller, the robot successfully learned to walk using only 4.5 minutes of data (36 episodes). The learned dynamics model is also generalizable: by simply changing the reward function of MPC, the controller is able to optimize for different behaviors, such as walking backwards, or turning, without re-training. As an extension, we use a similar framework to enable even more agile behaviors. For example, in simulation the robot learns to backflip and walk on its rear legs, though these behaviors are yet to be learned by the real robot.
The robot learns to walk using only 4.5 minutes of data.
The robot learns to backflip and walk with rear legs using the same framework.
Combining a low-level controller with high-level planning
Although model-based RL has allowed the robot to learn simple locomotion skills efficiently, such skills are insufficient for handling complex, real-world tasks. For example, in order to navigate through an office space, the robot may have to adjust its speed, direction and height multiple times, instead of following a pre-defined speed profile. Traditionally, people solve such complex tasks by breaking them down into multiple hierarchical sub-problems, such as a high-level trajectory planner and a low-level trajectory-following controller. However, manually defining a suitable hierarchy is typically a tedious task, as it requires careful engineering for each sub-problem.

In our second paper, we introduce a hierarchical reinforcement learning (HRL) framework that can be trained to automatically decompose complex reinforcement learning tasks. We break down our policy structure into a high-level and a low-level policy. Instead of designing each policy manually, we only define a simple communication protocol between the policy levels. In this framework, the high-level policy (e.g., a trajectory planner) commands the low-level policy (such as the motion control policy) through a latent command, and decides for how long to hold that command constant before issuing a new one. The low-level policy then interprets the latent command from the high-level policy, and gives motor commands to the robot.

To facilitate learning, we also split the observation space into high-level (e.g., robot position and orientation) and low-level (IMU, motor positions) observations, which are fed to their corresponding policies. This architecture naturally allows the high-level policy to operate at a slower timescale than the low-level policy, which saves computation resources and reduces training complexity.
Framework of the hierarchical policy: The policy gets observations from the robot and sends motor commands to execute desired actions. It is split into two levels (high and low). The high-level policy gives a latent command to the low-level policy and also decides the duration for which the low-level policy will run.
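The communication protocol between the two levels can be summarized with the short sketch below. The observation accessors, policy callables, and latent command format are assumed placeholders, not the paper's implementation.

```python
import numpy as np

def run_hierarchical_policy(robot, high_policy, low_policy, total_steps=1000):
    """Run the two-level policy: the high level issues a latent command and a
    duration; the low level converts observations plus the latent command
    into motor commands at every control step."""
    step = 0
    while step < total_steps:
        # High-level policy: high-level observations -> (latent command, duration).
        latent, duration = high_policy(robot.high_level_obs())
        for _ in range(duration):
            # Low-level policy: (IMU, motor angles, latent command) -> motor commands.
            obs = np.concatenate([robot.low_level_obs(), latent])
            robot.apply(low_policy(obs))
            step += 1
```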
Since the high-level and low-level policies operate at discrete timescales, the entire policy structure is not end-to-end differentiable, and standard gradient-based RL algorithms like PPO and SAC cannot be used. Instead, we choose to train the hierarchical policy through augmented random search (ARS), a simple evolutionary optimization method that has demonstrated good performance in reinforcement learning tasks. The weights of both policy levels are trained together, with the objective of maximizing the total reward from the robot’s trajectory.
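For reference, a single ARS update step looks roughly like the sketch below, where `rollout(params)` is an assumed function that runs one episode with the given parameter vector (spanning both policy levels) and returns its total reward; the hyperparameters are illustrative.

```python
import numpy as np

def ars_step(params, rollout, num_dirs=16, top_k=8, noise=0.03, lr=0.02):
    """One step of augmented random search (after Mania et al., 2018)."""
    dirs = np.random.randn(num_dirs, params.size)
    r_plus = np.array([rollout(params + noise * d) for d in dirs])
    r_minus = np.array([rollout(params - noise * d) for d in dirs])
    # Keep only the best-performing perturbation directions.
    top = np.argsort(-np.maximum(r_plus, r_minus))[:top_k]
    sigma = np.concatenate([r_plus[top], r_minus[top]]).std() + 1e-8
    grad = ((r_plus[top] - r_minus[top])[:, None] * dirs[top]).mean(axis=0)
    return params + lr / sigma * grad
```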

We test our framework on a path-following task using the same quadruped robot. In addition to straight walking, the robot needs to steer in different directions to complete the task. Note that as the low-level policy does not know the robot’s position in the path, it does not have sufficient information to complete the entire task on its own. However, with the coordination between the high-level and low-level policies, steering behavior emerges automatically in the latent command space, which allows the robot to efficiently complete the path. After successful training in a simulated environment, we validate our results on hardware by transferring an HRL policy to a real robot and recording the resulting trajectories.
Successful trajectory of a robot on a curved path. Left: A plot of the trajectory traversed by the robot with dots along the trajectory marking the positions where the high-level policy sent a new latent command to the low-level policy. Middle: The robot walking along the path in the simulated environment. Right: The robot walking around the path in the real world.
To further demonstrate the learned hierarchical policy, we visualize the behavior of the learned low-level policy under different latent commands. As shown in the plot below, different latent commands can cause the robot to walk straight, or turn left or right at different rates. We also test the generalizability of low-level policies by transferring them to new tasks from a similar domain, which, in our case, means following paths of different shapes. By fixing the low-level policy weights and only training the high-level policy, the robot can successfully traverse different paths.
Left: Visualization of a learned 2D latent command space. Vector directions correspond to the movement direction of the robot. Vector length is proportional to the distance covered. Right: Transfer of low level policy: An HRL policy was trained on a single path (right, top). The learned low-level policy was then reused when training the high-level policy on other paths (e.g., right, bottom).
Conclusion
Reinforcement learning holds great promise for robotics by automating the controller design process. With model-based RL, we enabled efficient learning of generalizable locomotion behaviors directly on the real robot. With hierarchical RL, the robot learned to coordinate policies at different levels to achieve more complex tasks. In the future, we plan to bring perception into the loop, so that robots can operate truly autonomously in the real world.

Acknowledgements
Both Deepali Jain and Yuxiang Yang are residents in the AI Residency program, mentored by Ken Caluwaerts and Atil Iscen. We would also like to thank Jie Tan and Vikas Sindhwani for support of the research, and Noah Broestl for managing the New York AI Residency Program.

Source: Google AI Blog


Chip Design with Deep Reinforcement Learning



The revolution of modern computing has been largely enabled by remarkable advances in computer systems and hardware. With the slowing of Moore’s Law and Dennard scaling, the world is moving toward specialized hardware to meet the exponentially growing demand for compute. However, today’s chips take years to design, resulting in the need to speculate about how to optimize the next generation of chips for the machine learning (ML) models of 2-5 years from now. Dramatically shortening the chip design cycle would allow hardware to adapt to the rapidly advancing field of ML. What if ML itself could provide the means to shorten the chip design cycle, creating a more integrated relationship between hardware and ML, with each fueling advances in the other?

In “Chip Placement with Deep Reinforcement Learning”, we pose chip placement as a reinforcement learning (RL) problem, where we train an agent (i.e., an RL policy) to optimize the quality of chip placements. Unlike prior methods, our approach has the ability to learn from past experience and improve over time. In particular, as we train over a greater number of chip blocks, our method becomes better at rapidly generating optimized placements for previously unseen chip blocks. Whereas existing baselines require human experts in the loop and take several weeks to generate placements, our method can produce placements in under six hours that outperform or match their manually designed counterparts. While we show that we can generate optimized placements for Google accelerator chips (TPUs), our method is applicable to any kind of chip (ASIC).

The Chip Floorplanning Problem
A computer chip is divided into dozens of blocks, each of which is an individual module, such as a memory subsystem, compute unit, or control logic system. These blocks can be described by a netlist, a graph of circuit components, such as macros (memory components) and standard cells (logic gates like NAND, NOR, and XOR), all of which are connected by wires. Determining the layout of a chip block, a process called chip floorplanning, is one of the most complex and time-consuming stages of the chip design process and involves placing the netlist onto a chip canvas (a 2D grid), such that power, performance, and area (PPA) are minimized, while adhering to constraints on density and routing congestion. Despite decades of research on this topic, it is still necessary for human experts to iterate for weeks to produce solutions that meet multi-faceted design criteria. This problem’s complexity arises from the size of the netlist graph (millions to billions of nodes), the granularity of the grid onto which that graph must be placed, and the exorbitant cost of computing the true target metrics, which can take many hours (sometimes over a day) using industry-standard electronic design automation tools.

The Deep Reinforcement Learning Model
The input to our model is the chip netlist (node types and graph adjacency information), the ID of the current node to be placed, and some netlist metadata, such as the total number of wires, macros, and standard cell clusters. The netlist graph and the current node are passed through an edge-based graph neural network that we developed to encode the input state. This generates embeddings of the partially placed graph and the candidate node.
A graph neural network generates embeddings that are concatenated with the metadata embeddings to form the input to the policy and value networks.
The edge, macro and netlist metadata embeddings are then concatenated to form a single state embedding, which is passed to a feedforward neural network. The output of the feedforward network is a learned representation that captures the useful features and serves as input to the policy and value networks. The policy network generates a probability distribution over all possible grid cells onto which the current node could be placed.
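The sketch below gives a rough sense of this data flow. It is not the published architecture; the layer sizes, pooling scheme, and grid resolution are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PlacementPolicy(nn.Module):
    """Illustrative policy/value network: edge-based graph encoding of the
    netlist, concatenated with the current-node and metadata embeddings."""

    def __init__(self, node_dim=16, meta_dim=8, hidden=64, grid_cells=128 * 128):
        super().__init__()
        self.edge_mlp = nn.Sequential(nn.Linear(2 * node_dim, hidden), nn.ReLU())
        self.node_mlp = nn.Sequential(nn.Linear(hidden, node_dim), nn.ReLU())
        self.meta_mlp = nn.Sequential(nn.Linear(meta_dim, hidden), nn.ReLU())
        self.trunk = nn.Sequential(nn.Linear(2 * node_dim + hidden, hidden), nn.ReLU())
        self.policy_head = nn.Linear(hidden, grid_cells)  # distribution over grid cells
        self.value_head = nn.Linear(hidden, 1)

    def forward(self, node_feats, edges, current_node, metadata):
        # node_feats: [N, node_dim]; edges: [E, 2] endpoint indices;
        # current_node: index of the macro to place; metadata: [meta_dim].
        src, dst = node_feats[edges[:, 0]], node_feats[edges[:, 1]]
        edge_emb = self.edge_mlp(torch.cat([src, dst], dim=-1))   # [E, hidden]
        graph_emb = self.node_mlp(edge_emb.mean(dim=0))           # pooled graph embedding
        state = torch.cat([graph_emb, node_feats[current_node], self.meta_mlp(metadata)])
        h = self.trunk(state)
        return torch.softmax(self.policy_head(h), dim=-1), self.value_head(h)
```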

In each iteration of training, the macros are sequentially placed by the RL agent, after which the standard cell clusters are placed by a force-directed method, which models the circuit as a system of springs to minimize wirelength. RL training is guided by a fast-but-approximate reward signal calculated for each of the agent’s chip placements using the weighted average of approximate wirelength (i.e., the half-perimeter wirelength, HPWL) and approximate congestion (the fraction of routing resources consumed by the placed netlist).
During each training iteration, the macros are placed by the policy one at a time and the standard cell clusters are placed by a force-directed method. The reward is calculated from the weighted combination of approximate wirelength and congestion.
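A simplified version of this proxy reward might look like the sketch below; the congestion term and the weighting coefficients are stand-ins for the more detailed estimates used in the paper.

```python
import numpy as np

def hpwl(nets, positions):
    """Half-perimeter wirelength: for each net, the half-perimeter of the
    bounding box around its pins. nets is a list of lists of node indices;
    positions is an [N, 2] array of (x, y) coordinates."""
    total = 0.0
    for net in nets:
        pts = positions[net]
        total += (pts[:, 0].max() - pts[:, 0].min()) + (pts[:, 1].max() - pts[:, 1].min())
    return total

def proxy_reward(nets, positions, congestion, wl_weight=1.0, congestion_weight=0.5):
    """Negative weighted sum of wirelength and congestion (weights illustrative)."""
    return -(wl_weight * hpwl(nets, positions) + congestion_weight * congestion)
```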
Results
To our knowledge, this method is the first chip placement approach that has the ability to generalize, meaning that it can leverage what it has learned while placing previous netlists to generate better placements for new unseen netlists. We show that as we increase the number of chip netlists on which we perform pre-training (i.e., as our method becomes more experienced in placement optimization), our policy better generalizes to new netlists.

For example, the pre-trained policy organically identifies an arrangement that places the macros near the edges of the chip with a convex space in the center in which to place the standard cells. This results in lower wirelength between the macros and standard cells without introducing excessive routing congestion. In contrast, the policy trained from scratch starts with random placements and takes much longer to converge to a high-quality solution, rediscovering the need to leave an opening in the center of the chip canvas. This is demonstrated in the animation below.
Macro placements of Ariane, an open-source RISC-V processor, as training progresses. On the left, the policy is being trained from scratch, and on the right, a pre-trained policy is being fine-tuned for this chip. Each rectangle represents an individual macro placement. Notice how the cavity discovered by the from-scratch policy is already present from the outset in the pre-trained policy’s placement.
We observe that pre-training improves sample efficiency and placement quality. We compare the quality of placements generated using pre-trained policies to those generated by training the policy from scratch. To generate placements for previously unseen chip blocks, we use a zero-shot method, meaning that we simply use a pre-trained policy (with no fine-tuning) to place a new block, yielding a placement in less than a second. The results can be further improved by fine-tuning the policy on the new block. The policy trained from scratch takes much longer to converge, and even after 24 hours, its chip placements are worse than what the fine-tuned policy achieves after 12 hours.
Convergence plots for two policies on Ariane blocks. One is trained from scratch and the other fine-tunes a pre-trained policy.
The performance of our approach improves as we train on a larger dataset. We observed that as we increase the training set from two blocks to five blocks, and then to 20 blocks, the policy generates better placements, both at zero-shot and after being fine-tuned for the same training wall-clock time.
Training data size vs. fine-tuning performance.
The ability of our approach to learn from experience and improve over time unlocks new possibilities for chip designers. As the agent is exposed to a greater volume and variety of chips, it becomes both faster and better at generating optimized placements for new chip blocks. A fast, high-quality, automatic chip placement method could greatly accelerate chip design and enable co-optimization with earlier stages of the chip design process. Although we evaluate primarily on accelerator chips, our proposed method is broadly applicable to any chip placement problem. After all that hardware has done for machine learning, we believe that it is time for machine learning to return the favor.

Acknowledgements
This project was a collaboration between Google Research and Google Hardware and Architecture teams. We would like to thank our coauthors: Mustafa Yazgan, Joe Jiang, Ebrahim Songhori, Shen Wang, Young-Joon Lee, Eric Johnson, Omkar Pathak, Sungmin Bae, Azade Nazi, Jiwoo Pak, Andy Tong, Kavya Srinivasa, William Hang, Emre Tuncer, Anand Babu, Quoc Le, James Laudon, Roger Carpenter, Richard Ho, and Jeff Dean for their support and contributions to this work.

Source: Google AI Blog


Exploring Evolutionary Meta-Learning in Robotics



Rapid development of more accurate simulator engines has given robotics researchers a unique opportunity to generate sufficient amounts of data that can be used to train robotic policies for real-world deployment. However, moving trained policies from “sim-to-real” remains one of the greatest challenges of modern robotics, due to the subtle differences encountered between the simulation and real domains, termed the “reality gap”. While some recent approaches leverage existing data, such as imitation learning and offline reinforcement learning, to prepare a policy for the reality gap, a more common approach is to simply provide more data by varying properties of the simulated environment, a process called domain randomization.

However, domain randomization can sacrifice performance for stability, as it seeks to optimize for a decent, stable policy across all tasks, but offers little room for improving the policy on a specific task. This lack of a common optimal policy between simulation and reality is frequently a problem in robotic locomotion applications, where there are varying physical factors at play, such as leg friction, body mass, and terrain differences. For example, given the same initial conditions for the robot’s position and balance, the surface type will determine the optimal policy — for an incoming flat surface encountered in simulation, the robot could accelerate to a higher speed, while for an incoming rugged and bumpy surface encountered in the real world, it should walk slowly and carefully to prevent falling.

In “Rapidly Adaptable Legged Robots via Evolutionary Meta-Learning”, we show that a particular type of meta-learning based on evolutionary strategies (ES), an approach generally believed to work well only in simulation, can effectively and efficiently adapt a policy to a real-world robot in a completely model-free manner. Compared to previous approaches for adapting meta-policies, such as standard policy gradients, which do not allow sim-to-real adaptation, ES enables a robot to quickly overcome the reality gap and adapt to dynamic changes in the real world, some of which may not be encountered in simulation. This represents the first instance of successfully using ES for on-robot adaptation.
Our algorithm quickly adapts a legged robot’s policy to dynamics changes. In this example, the battery voltage dropped from 16.8V to 10V which reduced motor power, and a 500g mass was also placed on the robot's side, causing it to turn rather than walk straight. The policy is able to adapt in only 50 episodes (or 150s of real-world data).
Meta-Learning
This research falls under the general class of meta-learning techniques, and is demonstrated on a legged robot. At a high level, meta-learning learns to solve an incoming task quickly without completely retraining from scratch, by combining past experiences with small amounts of experience from the incoming task. This is especially beneficial in the sim-to-real case, where most of the past experiences come cheaply from simulation, while a minimal, yet necessary amount of experience is generated from the real world task. The simulation experiences allow the policy to possess a general level of behavior for solving a distribution of tasks, while the real-world experiences allow the policy to fine-tune specifically to the real-world task at hand.

In order to train a policy to meta-learn, it is necessary to encourage a policy to adapt during simulation. Normally, this can be achieved by applying model-agnostic meta-learning (MAML), which searches for a meta-policy that can adapt to a specific task quickly using small amounts of task-specific data. The standard approach to computing such meta-policies is to use policy gradient methods, which seek to improve the likelihood of selecting the same action given the same state. In order to determine the likelihood of a given action, the policy must be stochastic, allowing for the action selected by the policy to have a randomized component. The real-world environment for deploying such robotic policies is also highly stochastic, as there can be slight differences in motion arising naturally, even if starting from the exact same state and action sequence. The combination of using a stochastic policy inside a stochastic environment creates two conflicting objectives:
  1. Decreasing the policy’s stochasticity may be crucial, as otherwise the high-noise problem might be exacerbated by the additional randomness from the policy’s actions.

  2. However, increasing the policy’s stochasticity may also benefit exploration, as the policy needs to use random actions to probe the type of environment to which it adapts.
These two competing objectives, which have been noted before, seek to both decrease and increase the policy’s stochasticity and may cause complications.

Evolutionary Strategies in Robotics
Instead, we resolve these challenges by applying ES-MAML, an algorithm that leverages a drastically different paradigm for high-dimensional optimization — evolutionary strategies. The ES-MAML approach updates the policy based solely on the sum of rewards collected by the agent in the environment. The function used for optimizing the policy is a black box, mapping the policy parameters directly to this reward. Unlike policy gradient methods, this approach does not need to collect state/action/reward tuples and does not need to estimate action likelihoods. This allows the use of deterministic policies and exploration based on parameter changes, avoiding the conflict between stochasticity in the policy and in the environment.
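As a rough sketch, the basic ES update and an ES-MAML-style adaptation step can be written as follows, where `total_reward(params)` is an assumed black box that runs one episode with the given (deterministic) policy parameters and returns its total reward; the hyperparameters are illustrative.

```python
import numpy as np

def es_step(params, total_reward, num_perturbations=32, sigma=0.1, lr=0.01):
    """One ES update: explore in parameter space with antithetic Gaussian
    perturbations and move along the reward-weighted directions."""
    eps = np.random.randn(num_perturbations, params.size)
    scores = np.array([total_reward(params + sigma * e) -
                       total_reward(params - sigma * e) for e in eps])
    grad = (scores[:, None] * eps).mean(axis=0) / (2.0 * sigma)
    return params + lr * grad

def adapt(meta_params, task_reward, adaptation_steps=3, **es_kwargs):
    """ES-MAML-style adaptation: starting from the meta-policy parameters,
    take a few ES steps on reward collected from the new task."""
    params = meta_params.copy()
    for _ in range(adaptation_steps):
        params = es_step(params, task_reward, **es_kwargs)
    return params
```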

In this paradigm, querying the objective usually involves running episodes in the simulator, but we show that ES can also be applied to episodes collected on real hardware. ES optimization can be easily distributed and also works well for training efficient, compact policies, a property with profound implications for robotics, since policies with fewer parameters can be deployed more easily on real hardware and often lead to more efficient inference and power usage. We confirm the effectiveness of ES in training compact policies by learning adaptable meta-policies with fewer than 130 parameters.

The ES optimization paradigm is very flexible. It can be used to optimize non-differentiable objectives, such as the total reward objective in our robotics case. It also works in the presence of substantial (potentially adversarial) noise. In addition, the most recent forms of ES methods (e.g., guided ES) are much more sample-efficient than previous versions.

This flexibility is critical for efficient adaptation of locomotion meta-policies. Our results show that adaptation with ES can be conducted with a small number of additional on-robot episodes. Thus, ES is no longer just an attractive alternative to the state-of-the-art algorithms, but defines a new state of the art for several challenging RL tasks.

Adaptation in Simulation
We first examine the types of adaptation that emerge when training with ES-MAML in simulation. When testing the policy in simulation, we found that the meta-policy forces the robot to fall down when the dynamics become too unstable, whereas the adapted policy allows the robot to re-stabilize and walk again. Furthermore, when the robot’s leg settings change, the meta-policy de-synchronizes the robot’s legs causing the robot to turn sharply, while the adapted policy corrects the robot so it can walk straight again.
The meta-policy’s gait, which experiences issues when facing a difficult dynamics task. Left: The meta-policy lets the robot fall down. Center: The adapted policy ensures the robot continues to walk correctly. Right: Comparative measurement of the robot’s height.
The meta-policy’s gait, under changes to the robot’s leg settings. Left: The meta-policy allows the robot to veer to the right. Center: The adapted policy ensures the robot continues to walk in a straight line. Right: Comparative measurement of the robot’s walking direction.
Adaptation in the Real World
Despite the good performance of ES-MAML in simulation, applying it to a real robot is still a challenge. To effectively adapt in the noisy environment of the real world while requiring as little real-world data as possible, we introduce batch hill-climbing, an add-on to ES-MAML based on previous work for zeroth-order blackbox optimization. Rather than performing hill-climbing which iteratively updates the input one-by-one according to a deterministic objective, batch hill-climbing samples a parallel batch of queries to determine the next input, making it robust to large amounts of noise in the objective.
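A simplified sketch of batch hill-climbing is shown below; `objective` stands for the noisy total episode reward measured on the robot, and the batch size and perturbation scale are illustrative.

```python
import numpy as np

def batch_hill_climb(params, objective, num_steps=10, batch_size=8, sigma=0.05):
    """At each step, evaluate a batch of perturbed candidates (sequentially
    here, for simplicity) and move to the best one, which is far more robust
    to a noisy objective than accepting or rejecting a single perturbation
    at a time."""
    best_params, best_value = params, objective(params)
    for _ in range(num_steps):
        candidates = [best_params + sigma * np.random.randn(params.size)
                      for _ in range(batch_size)]
        values = [objective(c) for c in candidates]
        i = int(np.argmax(values))
        if values[i] > best_value:
            best_params, best_value = candidates[i], values[i]
    return best_params
```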

We then test our method on the following two tasks, which are designed to significantly change the dynamics from the normal setting of the robot:
In the mass-voltage task (left), a 500g weight is placed on the robot’s side and the voltage is dropped to 10.0V from 16.8V. In the friction task (right), we replaced the rubber feet with tennis balls, to significantly reduce friction and hinder walking.
For the mass-voltage task, the initial meta-policy steered the robot significantly to the right due to the extra mass and voltage change, which caused an imbalance in the robot’s body and leg motors. However, after 30 episodes of adaptation using our method, the robot straightens its walking pose, and after 50 episodes it is able to balance its body completely and walk longer distances. In comparison, training from scratch on an easier, noiseless task in simulation alone required approximately 90,000 episodes, showing that our method significantly reduces the sample complexity on expensive real-world data.
Qualitative changes during the adaptation phase under the mass-voltage task.
We compared our method against domain randomization and the standard policy gradient approach to MAML (PG-MAML), presenting the final policies qualitatively as well as metrics from the real robot to show how our method adapts. We found that both the domain randomization and PG-MAML baselines do not adapt as well as our method.
Comparisons between Domain Randomization and PG-MAML, and metric differences between our method’s meta-policy and adapted policy. Top: Comparison for the mass-voltage task. Our method stabilizes the robot’s roll angle. Bottom: Comparison for the friction task. Our method results in longer trajectories.
Future Work
This work exposes several avenues for future development. One option is to make algorithmic improvements to reduce the number of real-world rollouts required for adaptation. Another area for advancement is the use of model-based reinforcement learning techniques for a lifelong learning system, in which the robot can continuously collect data and quickly adjust its policy to learn new skills and to operate optimally in new environments.

Acknowledgements
This research was conducted by the core ES-MAML team: Xingyou Song, Yuxiang Yang, Krzysztof Choromanski, Ken Caluwaerts, Wenbo Gao, Chelsea Finn, and Jie Tan. We would like to give special thanks to Vikas Sindhwani for his support on ES methods, and Daniel Seita for feedback on our paper.

Source: Google AI Blog


Off-Policy Estimation for Infinite-Horizon Reinforcement Learning



In conventional reinforcement learning (RL) settings, an agent interacts with an environment in an online fashion, meaning that it collects data from its interaction with the environment that is then used to inform changes to the policy governing its behavior. In contrast, offline RL refers to the setting where historical data are used to either learn good policies for acting in an environment, or to evaluate the performance of new policies. As RL is increasingly applied to crucial real-life problems like robotics and recommendation systems, evaluating new policies in the offline setting, i.e., estimating the expected reward of a target policy given historical data generated by a behavior policy, becomes more critical. However, despite its importance, evaluating the overall effectiveness of a target policy based on historical behavior policies is difficult, due to the challenge of building high-fidelity simulators and the mismatch in data distributions.
Agent-environment interaction in reinforcement learning. At each step, an agent takes an action based on a policy, receives a reward and makes a transition to a new state.
As a simple example, consider the game Pong: one might like to predict if a new strategy (the target policy) increases the chance of winning when considering only historical data collected from previous strategies (behavior policies) and without actually playing the game. If one were interested only in the performance of the behavior policy, a good metric might be to average the rewards of all the time steps from the historical data. However, since historical data is based on actions determined by the behavior policy and not the target policy, this simple average of rewards in the off-policy data would not yield a good estimate of the target policy’s long-term reward. Instead, proper correction must be made to remove the bias resulting from having two different policies (i.e., the difference in data distribution).
In off-policy evaluation, unlike the behavior policy, we do not have any data from the target policy. Therefore, we cannot compute the expected reward of the target policy without using information from the behavior policy.
In “Black-Box Off-Policy Estimation for Infinite-Horizon Reinforcement Learning”, accepted at ICLR 2020, we propose a new approach to evaluating a given policy from offline data, based on estimating the expected reward of the target policy as a weighted average of rewards in the off-policy data. Since meaningful weights for the off-policy data are not known a priori, we propose a novel way of learning them. Unlike most previous works, our method is particularly suitable when the historical trajectories are very long or have infinite horizons. We empirically demonstrate the effectiveness of this approach on a number of classical control benchmarks.

Background
In general, one approach to solve the off-policy evaluation problem is to build a simulator that mimics the interaction of the agent with the environment, and then evaluate the target policy against the simulation. While the idea is natural, building a high-fidelity simulator for many domains can be extremely challenging, particularly those that involve human interactions.

An alternative approach is to use the weighted average of rewards from the off-policy data as an estimate of the average reward of the target policy. This approach can be more robust than using a simulator as it does not require modeling assumptions about real world dynamics. Indeed, most previous efforts using this approach have found success on short-horizon problems where the number of time steps (i.e., the length of data trajectory) is limited. However, as the horizon is extended, the variance in predictions made by most of the previous estimators often grows exponentially, necessitating novel solutions for long-horizon problems, and even more so in the extreme case of the infinite-horizon problem.

Our Approach for Infinite-Horizon RL
Our approach to off-policy evaluation leverages a well-known statistical technique called importance sampling, through which one can estimate the properties of a particular distribution (e.g., its mean) from samples generated by another distribution. In particular, we estimate the long-term average reward of the target policy using a weighted average of rewards from the behavior policy data. The difficulty in this approach is choosing the weights so as to remove the bias between the off-policy data distribution and that of the target policy, while achieving the best estimate of the target policy’s average reward.

One important point is that if the weights are normalized to be positive and sum up to one, then they define a probability distribution over the set of possible states and actions of the agent. On the other hand, an individual policy defines a distribution on how often an agent visits a particular state or performs a particular action. In other words, it defines a unique distribution on states and actions. Under reasonable assumptions, this distribution does not change over time, and is called a stationary distribution. Since we are using importance sampling, we naturally want to optimize weights of the estimator such that the stationary distribution of the target policy matches the distribution induced by the weights of our estimator. However, the problem remains that we do not know the stationary distribution of the target policy, since we do not have any data generated by that policy.

One way to overcome this problem is to make sure that the distribution of weights satisfies properties that the target policy distribution has, without actually knowing what this distribution is. Luckily, we can take advantage of some mathematical “trickery” to solve this. While the full details are in our paper, the upshot is that even though we do not know the stationary distribution of the target policy (since we have no data collected from it), we can determine that distribution by solving an optimization problem involving a backward operator, which describes how an agent transitions from other states and actions to a particular state and action, using probability distributions as both input and output. Once this is done, the weighted average of rewards from the historical data gives us an estimate of the expected reward of the target policy.
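For concreteness, once such weights have been learned, the final estimate is straightforward. The sketch below covers only this last averaging step, not the optimization that produces the weights.

```python
import numpy as np

def weighted_average_reward(rewards, weights):
    """Off-policy estimate of the target policy's average reward:
    a normalized, weighted average of the logged rewards, where
    rewards and weights are arrays over the logged (state, action) samples."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()   # normalize so the weights form a distribution
    return float(np.dot(weights, np.asarray(rewards, dtype=float)))
```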

Experimental Results
Using a toy environment called ModelWin that has three states and two actions, we compare our work with a previous state-of-the-art approach (labeled “IPS”), along with a naive method in which we simply average rewards from the behavior policy data. The figure below shows the log of the root-mean-square error (RMSE) with respect to the target policy reward as we change the number of steps collected by the behavior policy. The naive method suffers from a large bias and its error does not change even with more data collected by increasing the length of the episode. The estimation error of the IPS method decreases with increasing horizon length. On the other hand, the error exhibited by our method is small, even for short horizon length.
Left: The ModelWin environment with three states (s1-s3) and two actions (a1, a2). Right: RMSE of different approaches on the ModelWin problem in logarithmic scale. Our approach has converged very quickly in this simple problem, compared to the previous state-of-the-art method (IPS) that needs longer horizon length to converge.
We also compare the performance of our approach with other approaches (including the naive estimator, IPS, and a model-based estimator) on several classic control problems. As shown in the figures below, the performance of naive averaging is almost independent of the number of trajectories. Our method outperforms the other approaches in three sample environments: CartPole, Pendulum, and MountainCar.

Comparison of different methods on three environments: CartPole, Pendulum, and MountainCar. The left column shows the environments. The right column shows the log of the RMSE with respect to the target policy reward as the number of trajectories collected by the behavior policy changes. The results are based on 50 runs.
To summarize, in this post we described how one can use historical data gathered according to a behavior policy to assess the quality of a new target policy. An interesting future direction for this work is to use structural domain knowledge to improve the algorithm. We invite interested readers to read our paper to learn more about this work.

Acknowledgements
Special thanks to Qiang Liu and Denny Zhou for contributing to this project.

Source: Google AI Blog


An Optimistic Perspective on Offline Reinforcement Learning



“The potential for off-policy learning remains tantalizing, the best way to achieve it still a mystery.” — Sutton & Barto

Most reinforcement learning (RL) algorithms assume that an agent actively interacts with an online environment to learn from its own collected experience. These algorithms are challenging to apply to complex real-world problems (such as robotics and autonomous driving), since extensive data collection from the real world is extremely sample inefficient and can lead to unintended behavior, while algorithms that operate in simulation require high-fidelity simulators that are challenging to build. However, for many real-world RL applications, there already exists a large amount of previously collected interaction data that can be utilized to make RL feasible for those problems and to enable better generalization by incorporating diverse prior experiences.

Existing interaction data can be used effectively using offline RL, which is the fully off-policy RL setting in which an agent is trained from a fixed dataset of logged experiences, without any further interactions with the environment. Offline RL can help (1) pretrain an RL agent using existing data, (2) empirically evaluate RL algorithms based on their ability to utilize a fixed dataset of interactions, and (3) deliver real-world impact. However, offline RL is considered challenging due to the distribution mismatch between online interactions and any fixed dataset of logged interactions, i.e., when the learned agent takes an action different from the data collection agent, we don’t know the reward that should be provided.
RL with online interactions vs. Offline RL.
In “An Optimistic Perspective on Offline RL”, we propose a simple experimental setup for offline RL on Atari 2600 games, based on logged experiences of a DQN agent. We demonstrate that it is possible to train agents with high returns that outperform the data collection agents using standard off-policy RL algorithms, without explicitly correcting for any distribution mismatch. We also develop a robust RL algorithm, called random ensemble mixture (REM), which shows promising results on offline RL. Overall, we present an optimistic perspective that robust RL algorithms trained on sufficiently large and diverse offline datasets can lead to high-quality behavior, strengthening the emerging data-driven RL paradigm. To facilitate the development and evaluation of offline RL methods, we are also publicly releasing the DQN Replay Dataset and have open-sourced our code. More details can be found at offline-rl.github.io.

A Primer on Off-policy and Offline RL
We summarize various approaches to RL below:
Online, off-policy RL agents, such as DQN, achieve human-level performance on Atari 2600 games by just observing the game screen, without any explicit knowledge about the game. DQN estimates the effectiveness of an action at a given state of the environment in terms of maximum achievable future rewards (i.e., Q-values). Furthermore, recent distributional RL agents, such as QR-DQN, model the entire distribution of probable future rewards, rather than a single expected value for each state-action pair. Agents such as DQN and QR-DQN are considered “online” because they alternate between optimizing a policy (how an agent acts at a given state) and using that policy to collect more data.

In principle, off-policy RL agents can learn from data collected by any policy, not just the policy being optimized. However, in the offline RL setting, recent work presents a discouraging view that standard off-policy agents diverge or otherwise yield poor performance. To fix this, previous work proposes remedies by regularizing the learned policy to stay close to the dataset of offline interactions.

The DQN Replay Dataset for Offline RL
In this work, we revisit offline RL by first creating the DQN Replay Dataset. This dataset is generated using DQN agents trained on 60 Atari 2600 games for 200 million frames each, while using sticky actions (with 25% probability that the agent’s previous action is executed instead of the current action) to make the problem more challenging. For each of the 60 games, we train 5 DQN agents with different random initializations, and store all of the (state, action, reward, next state) tuples encountered during training into 5 replay datasets per game, resulting in a total of 300 datasets.
Offline RL on Atari games using the DQN Replay Dataset.
The DQN Replay Dataset can then be used for training offline RL agents, without any interaction with the environment during training. Each game replay dataset is approximately 3.5 times larger than ImageNet and includes samples from all of the intermediate policies seen during the optimization of online DQN.
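As an aside, the sticky-actions mechanism mentioned above is simple to express in code. The sketch below assumes a minimal `env` interface with `reset` and `step` methods, rather than the actual Arcade Learning Environment implementation.

```python
import random

class StickyActions:
    """With probability `repeat_prob`, execute the previously executed action
    instead of the one the agent just selected."""

    def __init__(self, env, repeat_prob=0.25):
        self.env = env
        self.repeat_prob = repeat_prob
        self.prev_action = 0

    def reset(self):
        self.prev_action = 0
        return self.env.reset()

    def step(self, action):
        if random.random() < self.repeat_prob:
            action = self.prev_action
        self.prev_action = action
        return self.env.step(action)
```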

Training Offline Agents on the DQN Replay Dataset
We trained offline variants of DQN and distributional QR-DQN on the DQN Replay Dataset. Although the offline datasets contain data experienced by a DQN agent improving over time as training progresses, we compared the performance of offline agents against the best performing online DQN agent obtained after training (i.e., a fully-trained DQN). For each game, we evaluated the 5 offline agents trained (one per dataset), using online returns, reporting the best averaged performance.

Offline DQN underperforms fully-trained online DQN on all except a few games, where it achieves higher scores with the same amount of data. Offline QR-DQN, on the other hand, outperforms offline DQN and fully-trained DQN on most of the games. These results demonstrate that it is possible to optimize strong agents offline using standard deep RL algorithms. Furthermore, the disparity between the performance of offline QR-DQN and DQN indicates the difference in their ability to exploit offline data.
Offline DQN. Normalized improvement over a fully-trained DQN, per game, of offline DQN trained using DQN replay. On the normalized scale, fully-trained DQN corresponds to 100% performance while random agent corresponds to 0%.
Offline QR-DQN. Normalized performance improvement (in %) over a fully-trained DQN agent, per game, of offline QR-DQN trained offline using DQN replay.
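The captions do not spell out the normalization; consistent with fully-trained DQN corresponding to 100% and a random agent to 0%, a natural reading is

$$
\text{normalized score (\%)} \;=\; \frac{\text{score}_{\text{agent}} - \text{score}_{\text{random}}}{\text{score}_{\text{DQN}} - \text{score}_{\text{random}}} \times 100,
$$

so the per-game improvement over fully-trained DQN is this quantity minus 100%.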
Introducing Two Robust Offline RL Agents
In online RL, an agent chooses actions that it thinks will lead to high rewards, and then receives corrective feedback. Since it is not possible to collect additional data in offline RL, it is essential to reason about generalization using a fixed dataset. Leveraging methods from supervised learning that use an ensemble of models to improve generalization, we present two new offline RL agents:
  • Ensemble-DQN is a simple extension of DQN that trains multiple Q-value estimates and averages them for evaluation.
  • Random Ensemble Mixture (REM) is an easy to implement extension of DQN inspired by Dropout. The key intuition behind REM is that if one has access to multiple estimates of Q-values, then a weighted combination of the Q-value estimates is also an estimate for Q-values. Accordingly, in each training step, REM randomly combines multiple Q-value estimates and uses this random combination for robust training.
Neural Network architectures for DQN, distributional QR-DQN and the expected RL variants with the same multi-head QR-DQN architecture, i.e., Ensemble-DQN and REM. In QR-DQN, each head (red rectangles) corresponds to a specific fraction of the return distribution, while in the proposed variants, each head approximates the Q-function.
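The key REM step is easy to write down. Below is a minimal NumPy sketch, with illustrative shapes and helper names (not the released code), of drawing a random convex combination of K Q-heads and using the same mixture for both the prediction and the bootstrapped target.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_convex_weights(num_heads):
    """Sample mixture weights alpha on the simplex (alpha_k >= 0, sum = 1)."""
    alpha = rng.uniform(size=num_heads)
    return alpha / alpha.sum()

def rem_q_values(q_heads, alpha):
    """Combine K Q-heads into a single Q-estimate.

    q_heads: array [K, batch, num_actions] of per-head Q-values.
    alpha:   array [K] of convex combination weights.
    """
    return np.tensordot(alpha, q_heads, axes=1)   # -> [batch, num_actions]

# Illustrative shapes: 4 heads, a batch of 32 logged transitions, 6 actions.
K, batch, num_actions = 4, 32, 6
q_heads = rng.normal(size=(K, batch, num_actions))         # online network heads
q_heads_target = rng.normal(size=(K, batch, num_actions))  # target network heads
rewards = rng.normal(size=batch)
terminals = np.zeros(batch)
gamma = 0.99

alpha = random_convex_weights(K)                  # resampled at every training step
q_mix = rem_q_values(q_heads, alpha)              # predictions
q_mix_target = rem_q_values(q_heads_target, alpha)
td_target = rewards + gamma * (1.0 - terminals) * q_mix_target.max(axis=1)
# The TD error between q_mix[:, logged_action] and td_target is then minimized
# (the paper uses a Huber loss), training all heads jointly and robustly.
```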
To utilize the DQN Replay Dataset more efficiently, we trained offline agents for five times as many training iterations as online DQN and report their performance below. Offline REM outperforms both offline DQN and offline QR-DQN. The comparison with fully-trained online C51, a strong distributional agent, shows that the gains from offline REM exceed those from C51.
Offline REM vs. baselines. Median normalized scores averaged over 5 runs across 60 Atari games of offline agents trained using DQN replay for 5x iterations, compared to online DQN.
Using standard training protocols on Atari, online REM performs on par with QR-DQN in the online RL setting. This suggests that we can use the insights gained from the DQN Replay Dataset and the offline RL setting to build effective online RL methods.
Online REM vs. baselines. Median normalized evaluation scores averaged over 5 runs (shown as traces) across the 60 stochastic Atari 2600 games for online agents trained for 200 million frames. Online REM with 4 Q-networks performs comparably to online QR-DQN.
Comparison of Results: Important Factors in Offline RL
The discrepancy between these results and prior work that reports failure of standard RL agents in the offline setting could be attributed to the following factors:
  • Offline Dataset Size. We trained offline QR-DQN and REM with reduced data obtained via randomly subsampling the entire DQN Replay Dataset, maintaining the same data distribution. Analogous to supervised learning, performance tends to increase as the size of data increases. With only 10% of the entire dataset, REM and QR-DQN approximately recover the performance of fully-trained DQN.
  • Offline Dataset Composition. We trained offline RL agents on the first 20 million frames per game in the DQN Replay Dataset. Offline REM and QR-DQN outperform the best policy in this lower quality dataset, indicating that standard RL agents work well in the offline setting with sufficiently diverse datasets.
    Offline RL with Lower Quality Dataset. REM and QR-DQN trained on offline data collected from DQN trained for 20 iterations (by using the first 20M frames from each game replay dataset). The horizontal line shows the performance of best policy in this dataset, which is significantly worse than fully-trained DQN.
  • Offline Algorithm Choice. There are claims that standard off-policy agents are ineffective on continuous control tasks when trained offline. However, we found that recent continuous control agents, such as TD3, perform comparably to a sophisticated offline agent when trained on large and diverse offline datasets.
Future Work. Our results emphasize the need for a rigorous characterization of the role of generalization due to neural networks when learning from offline data collected from a large mixture of diverse policies. Benchmarking offline RL with various data collection strategies by subsampling DQN replay (e.g., first / last k million frames) is another important direction. We currently employ online policy evaluation, however, “true” offline RL requires offline policy evaluation for hyperparameter tuning and early stopping. Finally, model-based RL and self-supervised learning approaches are also promising for offline RL.

Acknowledgements
This research was conducted in collaboration with Dale Schuurmans. We’d like to thank members of the Google Research, Brain Team for valuable discussions. A prior version of this work was presented as a contributed talk at NeurIPS 2019 DRL Workshop.

Source: Google AI Blog


Exploring Nature-Inspired Robot Agility



Whether it’s a dog chasing after a ball or a horse jumping over obstacles, animals can effortlessly perform an incredibly rich repertoire of agile skills. Developing robots that are able to replicate these agile behaviors can open opportunities to deploy robots for sophisticated tasks in the real world. But designing controllers that enable legged robots to perform these agile behaviors can be a very challenging task. While reinforcement learning (RL) is an approach often used for automating development of robotic skills, a number of technical hurdles remain and, in practice, there is still substantial manual overhead. Designing reward functions that lead to effective skills can itself require a great deal of expert insight, and often involves a lengthy reward tuning process for each desired skill. Furthermore, applying RL to legged robots requires not only efficient algorithms, but also mechanisms to enable the robots to remain safe and recover after falling, without frequent human assistance.

In this post, we will discuss two of our recent projects aimed at addressing these challenges. First, we describe how robots can learn agile behaviors by imitating motions from real animals, producing fast and fluent movements like trotting and hopping. Then, we discuss a system for automating the training of locomotion skills in the real world, which allows robots to learn to walk on their own, with minimal human assistance.

Learning Agile Robotic Locomotion Skills by Imitating Animals
In “Learning Agile Robotic Locomotion Skills by Imitating Animals”, we present a framework that takes a reference motion clip recorded from an animal (a dog, in this case) and uses RL to train a control policy that enables a robot to imitate the motion in the real world. By providing the system with different reference motions, we are able to train a quadruped robot to perform a diverse set of agile behaviors, ranging from fast walking gaits to dynamic hops and turns. The policies are trained primarily in simulation, and then transferred to the real world using a latent space adaptation technique that can efficiently adapt a policy using only a few minutes of data from the real robot.

Motion Imitation
We start by collecting motion capture clips of a real dog performing various locomotion skills. Then, we use RL to train a control policy to imitate the dog’s motions. The policies are trained in a physics simulation to track the pose of the reference motion at each timestep. By using different reference motions in the reward function, we can train a simulated robot to imitate a variety of different skills.
Reinforcement learning is used to train a simulated robot to imitate the reference motions from a dog. All simulations are performed using PyBullet.
However, since simulators generally provide only a coarse approximation of the real world, policies trained in simulation often perform poorly when deployed on a real robot. Therefore, we use a sample-efficient latent space adaptation technique to transfer a policy trained in simulation to the real world.

First, to encourage the policy to learn behaviors that are robust to variations in the dynamics, we randomize the dynamics of the simulation by varying physical quantities, such as the robot’s mass and friction. Since we have access to the values of these parameters during training in simulation, we can also map them to a low-dimensional representation using a learned encoder. This encoding is then passed as an additional input to the policy during training. Since the physical parameters of the real robot are not known a priori, when deploying the policy to a real robot, we remove the encoder and directly search for a set of parameters in the latent space that enables the robot to successfully execute the desired skills in the real world. This technique is often able to adapt a policy to the real world using less than 8 minutes of real-world data.
Comparison of policies before and after adaptation on the real robot. Before adaptation, the robot is prone to falling. But after adaptation, the policies are able to more consistently execute the desired skills.
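The adaptation step can be pictured as a small search loop over the latent embedding, evaluated directly on hardware. The sketch below uses plain random search and a placeholder rollout function as stand-ins for the paper's latent-space optimization and real-robot evaluation; all names and numbers are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

def rollout_return(policy, z, num_rollouts=2):
    """Placeholder: run the policy conditioned on latent z on the real robot
    and return the average episode return. Here it is a toy score for demo only."""
    target = np.array([0.3, -0.1, 0.05, 0.2])   # pretend optimum, for demo only
    return float(-np.linalg.norm(z - target))

def adapt_latent(policy, latent_dim=4, num_iters=20, pop_size=8, sigma=0.3):
    """Search the latent space directly on the real robot.

    A simple random-search stand-in that only illustrates the interface:
    propose candidate latents, evaluate each with a few real rollouts,
    and keep the best-performing one.
    """
    best_z = np.zeros(latent_dim)
    best_score = rollout_return(policy, best_z)
    for _ in range(num_iters):
        candidates = best_z + sigma * rng.normal(size=(pop_size, latent_dim))
        scores = [rollout_return(policy, z) for z in candidates]
        i = int(np.argmax(scores))
        if scores[i] > best_score:
            best_z, best_score = candidates[i], scores[i]
    return best_z

best_latent = adapt_latent(policy=None)
```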
Results
Using this approach, the robot learns to imitate various locomotion skills from a dog, including different walking gaits, such as pacing and trotting, as well as an agile spinning motion.
Robot imitating various skills from a dog.
In addition to imitating motions from real dogs, it is also possible to imitate artist-animated keyframe motions, including a dynamic hop-turn:
Skills learned by imitating artist-animated keyframe motions: side-steps, turn, and hop-turn.
More details are available in the following video:
Learning to Walk in the Real World with Minimal Human Effort
The above approach is able to train policies in simulation and then adapt them to the real world. However, when the task involves complex and diverse physical phenomena, it is also necessary to directly learn from real-world experience. Although learning on real robots has achieved state-of-the-art performance for manipulation tasks (e.g., QT-Opt), applying the same methods to legged robots is difficult since the robot may fall and damage itself, or leave the training area, which can then require human intervention.
An automated learning system for legged robots must resolve safety and automation challenges.
In “Learning to Walk in the Real World with Minimal Human Effort”, we developed an automated learning system built around a multi-task learning procedure, a safety-constrained learner, and several carefully designed hardware and software components. Multi-task learning prevents the robot from leaving the training area by generating a learning schedule that drives the robot towards the center of the workspace. We also reduce the number of falls by designing a safety constraint, which we solve with dual gradient descent.

For each roll-out, the scheduler selects a task in which the desired walking direction is pointing towards the center. For instance, assuming we have two tasks, forward and backward walking, the scheduler will select the forward task if the robot is at the back of the workspace, and vice-versa for the backward task. In the middle of the episode, the learner takes dual gradient descent steps to iteratively optimize both the task objective and safety constraints, rather than treating them as a single goal. If the robot has fallen, we invoke an automated get-up controller and proceed to the next episode.
We solve automation and safety challenges with multi-task learning, a safety-constrained SAC algorithm, and an automatic reset controller.
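A minimal sketch of these two automation ingredients follows, with made-up workspace coordinates, task set, and hyperparameters: the scheduler picks the task whose desired walking direction points back toward the workspace center, and the Lagrange multiplier of the safety constraint is updated by dual gradient ascent.

```python
import numpy as np

WORKSPACE_CENTER = np.array([0.0, 0.0])

# Each task is a desired walking direction (unit vector in the ground plane).
TASKS = {
    "forward":  np.array([1.0, 0.0]),
    "backward": np.array([-1.0, 0.0]),
    "left":     np.array([0.0, 1.0]),
    "right":    np.array([0.0, -1.0]),
}

def select_task(robot_xy):
    """Pick the task whose walking direction points most toward the center."""
    to_center = WORKSPACE_CENTER - robot_xy
    to_center = to_center / (np.linalg.norm(to_center) + 1e-8)
    return max(TASKS, key=lambda name: float(TASKS[name] @ to_center))

def dual_update(lagrange_multiplier, avg_safety_cost, cost_limit, lr=0.01):
    """Dual gradient ascent on the safety constraint: increase the penalty
    weight when the observed cost (e.g., a fall indicator) exceeds its limit."""
    return max(0.0, lagrange_multiplier + lr * (avg_safety_cost - cost_limit))

print(select_task(np.array([1.5, 0.2])))   # robot is ahead of center -> "backward"
```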
Results
This framework successfully trains policies from scratch to walk in different directions without any human intervention.
Snapshots of the training process on the flat surface with zero human resets.
Once trained, it is possible to steer the robot with a remote controller. Notice how it's possible to command the robot to turn in place using the controller. This action would be difficult to manually design due to the planar leg structure of the robot, but is discovered automatically using our automated multi-task learner.
We train locomotion policies to walk in four directions, which allow us to interactively control the robot with a game controller.
The system also enables the robot to navigate more challenging surfaces, such as a memory foam mattress and a doormat with crevices.
Learned locomotion gaits on challenging terrains.
More details can be found in the following video:
Conclusion
In these two papers, we present methods to reproduce a diverse corpus of behaviors with quadruped robots. Extending this line of work to learn skills from videos would also be an exciting direction, which can substantially increase the volume of data from which robots can learn. We are also interested in applying the automated training system to more complex real-world environments and tasks.

Acknowledgments
We would like to thank our coauthors, Erwin Coumans, Tingnan Zhang, Tsang-Wei Lee, Jie Tan, Sergey Levine, Peng Xu and Zhenyu Tan. We would also like to thank Julian Ibarz, Byron David, Thinh Nguyen, Gus Kouretas, Krista Reymann, and Bonny Ho for their support and contributions to this work.

Source: Google AI Blog


Massively Scaling Reinforcement Learning with SEED RL



Reinforcement learning (RL) has seen impressive advances over the last few years as demonstrated by the recent success in solving games such as Go and Dota 2. Models, or agents, learn by exploring an environment, such as a game, while optimizing for specified goals. However, current RL techniques require increasingly large amounts of training to successfully learn even simple games, which makes iterating on research and product ideas computationally expensive and time-consuming.

In “SEED RL: Scalable and Efficient Deep-RL with Accelerated Central Inference”, we present an RL agent that scales to thousands of machines, which enables training at millions of frames per second, and significantly improves computational efficiency. This is achieved with a novel architecture that takes advantage of accelerators (GPUs or TPUs) at scale by centralizing model inference and introducing a fast communication layer. We demonstrate the performance of SEED RL on popular RL benchmarks, such as Google Research Football, Arcade Learning Environment and DeepMind Lab, and show that by using larger models, data efficiency can be increased. The code has been open sourced on Github together with examples for running on Google Cloud with GPUs.

Current Distributed Architectures
The previous generation of distributed reinforcement learning agents, such as IMPALA, made use of accelerators specialized for numerical calculations, taking advantage of the speed and efficiency from which (un)supervised learning has benefited for years. The architecture of an RL agent is usually separated into actors and learners. The actors typically run on CPUs and iterate between taking steps in the environment and running inference on the model to predict the next action. Frequently the actor will update the parameters of the inference model, and after collecting a sufficient amount of observations, will send a trajectory of observations and actions to the learner, which then optimizes the model. In this architecture, the learner trains the model on GPUs using input from distributed inference on hundreds of machines.

Example architecture for an earlier generation RL agent, IMPALA. Inference is done on the actors, usually using inefficient CPUs. Updated model parameters are frequently sent from the learner to the actors, increasing bandwidth requirements.
The architecture of RL agents such as IMPALA has a number of drawbacks:
  1. Using CPUs for neural network inference is much less efficient and slower than using accelerators and becomes problematic as models become larger and more computationally expensive.
  2. The bandwidth required for sending parameters and intermediate model states between the actors and learner can be a bottleneck.
  3. Handling two completely different tasks on one machine (i.e., environment rendering and inference) is unlikely to utilize machine resources optimally.
SEED RL Architecture
The SEED RL architecture is designed to solve these drawbacks. With this approach, neural network inference is done centrally by the learner on specialized hardware (GPUs or TPUs), enabling accelerated inference and avoiding the data transfer bottleneck by ensuring that the model parameters and state are kept local. While observations are sent to the learner at every environment step, latency is kept low due to a very efficient network library based on the gRPC framework with asynchronous streaming RPCs. This makes it possible to achieve up to a million queries per second on a single machine. The learner can be scaled to thousands of cores (e.g., up to 2048 on Cloud TPUs) and the number of actors can be scaled to thousands of machines to fully utilize the learner, making it possible to train at millions of frames per second. SEED RL is based on the TensorFlow 2 API and, in our experiments, was accelerated by TPUs.
Overview of the architecture of SEED RL. In contrast to the IMPALA architecture, the actors only take actions in environments. Inference is executed centrally by the learner on accelerators using batches of data from multiple actors.
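The central idea, lightweight actors streaming observations to a learner that performs batched inference, can be sketched with ordinary Python queues and threads. This is a schematic illustration only; it stands in for SEED RL's gRPC streaming layer and accelerator-side model, and all names and shapes below are made up.

```python
import queue
import threading

import numpy as np

BATCH_SIZE = 4                     # also the number of actors in this toy example
obs_queue = queue.Queue()          # actors -> learner: (actor_id, observation)
action_queues = {i: queue.Queue() for i in range(BATCH_SIZE)}  # learner -> actors

def policy(observations):
    """Stand-in for accelerator-side inference over a batch of observations."""
    logits = observations @ np.ones((8, 3))      # fake 3-action policy
    return logits.argmax(axis=1)

def learner():
    """Central inference loop: batch observations from many actors,
    run the model once, and send each action back to its actor."""
    while True:
        batch = [obs_queue.get() for _ in range(BATCH_SIZE)]
        actor_ids, observations = zip(*batch)
        actions = policy(np.stack(observations))
        for actor_id, action in zip(actor_ids, actions):
            action_queues[actor_id].put(int(action))

def actor(actor_id, num_steps=5):
    """Actors only step the environment; they hold no model parameters."""
    obs = np.zeros(8)
    for _ in range(num_steps):
        obs_queue.put((actor_id, obs))
        action = action_queues[actor_id].get()   # wait for central inference
        obs = np.random.randn(8)                 # pretend env.step(action)

threading.Thread(target=learner, daemon=True).start()
actors = [threading.Thread(target=actor, args=(i,)) for i in range(BATCH_SIZE)]
for t in actors:
    t.start()
for t in actors:
    t.join()
```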
In order for this architecture to be successful, two state-of-the-art algorithms are integrated into SEED RL. The first is V-trace, a policy gradient-based method, first introduced with IMPALA. In general, policy gradient-based methods predict an action distribution from which an action can be sampled. However, because the actors and the learner execute asynchronously in SEED RL, the policy of actors is slightly behind the policy of the learner, i.e., they become off-policy. The usual policy gradient-based methods are on-policy, meaning that they have the same policy for actors and learner, and suffer from convergence and numerical issues in off-policy settings. V-trace is an off-policy method and thus works well in the asynchronous SEED RL architecture.
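For reference, the V-trace value target (as defined in the IMPALA paper) corrects for this lag with truncated importance weights, where π is the learner's policy and μ is the actor's behavior policy:

$$
v_s = V(x_s) + \sum_{t=s}^{s+n-1} \gamma^{\,t-s} \Big( \prod_{i=s}^{t-1} c_i \Big) \delta_t V,
\qquad
\delta_t V = \rho_t \big( r_t + \gamma V(x_{t+1}) - V(x_t) \big),
$$

$$
\rho_t = \min\!\Big( \bar{\rho},\ \frac{\pi(a_t \mid x_t)}{\mu(a_t \mid x_t)} \Big),
\qquad
c_i = \min\!\Big( \bar{c},\ \frac{\pi(a_i \mid x_i)}{\mu(a_i \mid x_i)} \Big).
$$

The truncation levels control the bias-variance trade-off of the off-policy correction.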

The second algorithm is R2D2, a Q-learning method that selects an action based on the predicted future value of that action using recurrent distributed replay. This approach allows the Q-learning algorithm to be run at scale, while still allowing the use of recurrent neural networks that can predict future values based on the information of all past frames in an episode.

Experiments
SEED RL is benchmarked on the commonly used Arcade Learning Environment, DeepMind Lab environments, and on the recently released Google Research Football environment.
Frames per second comparing IMPALA and various configurations of SEED RL on DeepMind Lab. SEED RL achieves 2.4M frames per second using 4,160 CPUs. Assuming the same speed, IMPALA would need 14,000 CPUs.
On DeepMind Lab, we achieve 2.4 million frames per second with 64 Cloud TPU cores, which represents an improvement of 80x over the previous state-of-the-art distributed agent, IMPALA. This results in a significant speed-up in wall-clock time and computational efficiency. IMPALA requires 3-4x as many CPUs as SEED RL for the same speed.
Episode return (i.e., the sum of rewards) over time on the DeepMind Lab game “explore_goal_locations_small” using IMPALA and SEED RL. With SEED RL, the time to train is significantly reduced.
With an architecture optimized for use on modern accelerators, it’s natural to increase the model size in an attempt to increase data efficiency. We show that by increasing the size of the model and the input resolution, we are able to solve a previously unsolved Google Research Football task, “Hard”.
The score of different architectures on the Google Research Football “Hard” task. We show that by increasing the input resolution and using a larger model, the score is improved, and with more training, the model can significantly outperform the built-in AI.
Additional details are provided in the paper, including our results on the Arcade Learning Environment. We believe SEED RL and the results presented demonstrate that reinforcement learning has once again caught up with the rest of the deep learning field in terms of taking advantage of accelerators.

Acknowledgements
This project was done in collaboration with Raphaël Marinier, Piotr Stanczyk, Ke Wang, Marcin Andrychowicz and Marcin Michalski. We would also like to thank Tom Small for the visualizations.

Source: Google AI Blog


Introducing Dreamer: Scalable Reinforcement Learning Using World Models



Research into how artificial agents can choose actions to achieve goals is making rapid progress in large part due to the use of reinforcement learning (RL). Model-free approaches to RL, which learn to predict successful actions through trial and error, have enabled DeepMind's DQN to play Atari games and AlphaStar to beat world champions at StarCraft II, but require large amounts of environment interaction, limiting their usefulness for real-world scenarios.

In contrast, model-based RL approaches additionally learn a simplified model of the environment. This world model lets the agent predict the outcomes of potential action sequences, allowing it to play through hypothetical scenarios to make informed decisions in new situations, thus reducing the trial and error necessary to achieve goals. In the past, it has been challenging to learn accurate world models and leverage them to learn successful behaviors. While recent research, such as our Deep Planning Network (PlaNet), has pushed these boundaries by learning accurate world models from images, model-based approaches have still been held back by ineffective or computationally expensive planning mechanisms, limiting their ability to solve difficult tasks.

Today, in collaboration with DeepMind, we present Dreamer, an RL agent that learns a world model from images and uses it to learn long-sighted behaviors. Dreamer leverages its world model to efficiently learn behaviors via backpropagation through model predictions. By learning to compute compact model states from raw images, the agent is able to efficiently learn from thousands of predicted sequences in parallel using just one GPU. Dreamer achieves a new state-of-the-art in performance, data efficiency and computation time on a benchmark of 20 continuous control tasks given raw image inputs. To stimulate further advancement of RL, we are releasing the source code to the research community.

How Does Dreamer Work?
Dreamer consists of three processes that are typical for model-based methods: learning the world model, learning behaviors from predictions made by the world model, and executing its learned behaviors in the environment to collect new experience. To learn behaviors, Dreamer uses a value network to take into account rewards beyond the planning horizon and an actor network to efficiently compute actions. The three processes, which can be executed in parallel, are repeated until the agent has achieved its goals:
The three processes of the Dreamer agent. The world model is learned from past experience. From predictions of this model, the agent then learns a value network to predict future rewards and an actor network to select actions. The actor network is used to interact with the environment.
Learning the World Model
Dreamer leverages the PlaNet world model, which predicts outcomes based on a sequence of compact model states that are computed from the input images, instead of directly predicting from one image to the next. It automatically learns to produce model states that represent concepts helpful for predicting future outcomes, such as object types, positions of objects, and the interaction of the objects with their surroundings. Given a sequence of images, actions, and rewards from the agent's dataset of past experience, Dreamer learns the world model as shown:
Dreamer learns a world model from experience. Using past images (o1–o3) and actions (a1–a2), it computes a sequence of compact model states (green circles) from which it reconstructs the images (ô1–ô3) and predicts the rewards (r̂1–r̂3).
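At a high level, the world model consists of three jointly trained components (this is a summary in the paper's notation, not a full specification):

$$
\text{representation: } p_\theta(s_t \mid s_{t-1}, a_{t-1}, o_t), \qquad
\text{transition: } q_\theta(s_t \mid s_{t-1}, a_{t-1}), \qquad
\text{reward: } q_\theta(r_t \mid s_t).
$$

The representation model infers the current model state from the previous state, the previous action, and the current image; the transition model predicts the next state without seeing an image, which is what allows the agent to predict ahead purely in latent space; and the reward model scores each predicted state. An image decoder is also trained to reconstruct the observations, providing the learning signal for the model states.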
An advantage of using the PlaNet world model is that predicting ahead using compact model states instead of images greatly improves the computational efficiency. This enables the model to predict thousands of sequences in parallel on a single GPU. The approach can also facilitate generalization, leading to accurate long-term video predictions. To gain insights into how the model works, we can visualize the predicted sequences by decoding the compact model states back into images, as shown below for a task of the DeepMind Control Suite and for a task of the DeepMind Lab environment:
Predicting ahead using compact model states enables long-term predictions in complex environments. Shown here are two sequences that the agent has not encountered before. Given five input images, the model reconstructs them and predicts the future images up to time step 50.
Efficient Behavior Learning
Previously developed model-based agents typically select actions either by planning through many model predictions or by using the world model in place of a simulator to reuse existing model-free techniques. Both designs are computationally demanding and do not fully leverage the learned world model. Moreover, even powerful world models are limited in how far ahead they can accurately predict, rendering many previous model-based agents shortsighted. Dreamer overcomes these limitations by learning a value network and an actor network via backpropagation through predictions of its world model.

Dreamer efficiently learns the actor network to predict successful actions by propagating gradients of rewards backwards through predicted state sequences, which is not possible for model-free approaches. This tells Dreamer how small changes to its actions affect what rewards are predicted in the future, allowing it to refine the actor network in the direction that increases the rewards the most. To consider rewards beyond the prediction horizon, the value network estimates the sum of future rewards for each model state. The rewards and values are then backpropagated to refine the actor network to select improved actions:
Dreamer learns long-sighted behaviors from predicted sequences of model states. It first learns the long-term value (v̂2–v̂3) of each state, and then predicts actions (â1–â2) that lead to high rewards and values by backpropagating them through the state sequence to the actor network.
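Summarizing the paper's objectives at a high level, both networks are trained on imagined trajectories of model states $s_\tau$ over a horizon $H$: the actor is updated to maximize the value estimates $V_\lambda(s_\tau)$ (an exponentially weighted mixture of multi-step returns), while the value network $v_\psi$ regresses toward them:

$$
\max_{\phi}\ \mathbb{E}\Big[ \sum_{\tau=t}^{t+H} V_\lambda(s_\tau) \Big],
\qquad
\min_{\psi}\ \mathbb{E}\Big[ \sum_{\tau=t}^{t+H} \tfrac{1}{2} \big\| v_\psi(s_\tau) - V_\lambda(s_\tau) \big\|^2 \Big].
$$

Because the world model is differentiable, the gradient of the actor objective can flow back through the predicted state sequence, which is the backpropagation-through-predictions step described above.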
Dreamer differs from PlaNet in several ways. For a given situation in the environment, PlaNet searches for the best action among many predictions for different action sequences. In contrast, Dreamer side-steps this expensive search by decoupling planning and acting. Once its actor network has been trained on predicted sequences, it computes the actions for interacting with the environment without additional search. In addition, Dreamer considers rewards beyond the planning horizon using a value function and leverages backpropagation for efficient planning.

Performance on Control Tasks
We evaluated Dreamer on a standard benchmark of 20 diverse tasks with continuous actions and image inputs. The tasks include balancing and catching objects, as well as locomotion of various simulated robots. The tasks are designed to pose a variety of challenges to the RL agent, including difficult to predict collisions, sparse rewards, chaotic dynamics, small but relevant objects, high degrees of freedom, and 3D perspectives:
Dreamer learns to solve 20 challenging continuous control tasks with image inputs, 5 of which are displayed here. The visualizations show the same 64x64 images that the agent receives from the environment.
We compare the performance of Dreamer to that of PlaNet, the previous best model-based agent, the popular model-free agent, A3C, as well as the current best model-free agent on this benchmark, D4PG, which combines several advances of model-free RL. The model-based agents learn efficiently in under 5 million frames, corresponding to 28 hours inside the simulation. The model-free agents learn more slowly and require 100 million frames, corresponding to 23 days inside the simulation.

On the benchmark of 20 tasks, Dreamer outperforms the best model-free agent (D4PG) with an average score of 823 compared to 786, while learning from 20 times fewer environment interactions. Moreover, it exceeds the final performance of the previously best model-based agent (PlaNet) across almost all of the tasks. The computation time of 16 hours for training Dreamer is less than the 24 hours required for the other methods. The final performance of the four agents is shown below:
Dreamer outperforms the previous best model-free (D4PG) and model-based (PlaNet) methods on the benchmark of 20 tasks in terms of final performance, data efficiency, and computation time.
In addition to our main experiments on continuous control tasks, we demonstrate the generality of Dreamer by applying it to tasks with discrete actions. For this, we select Atari games and DeepMind Lab levels that require both reactive and long-sighted behavior, spatial awareness, and understanding of visually more diverse scenes. The resulting behaviors are visualized below, showing that Dreamer also efficiently learns to solve these more challenging tasks:
Dreamer learns successful behaviors on Atari games and DeepMind Lab levels, which feature discrete actions and visually more diverse scenes, including 3D environments with multiple objects.
Conclusion
Our work demonstrates that learning behaviors from sequences predicted by world models alone can solve challenging visual control tasks from image inputs, surpassing the performance of previous model-free approaches. Moreover, Dreamer demonstrates that learning behaviors by backpropagating value gradients through predicted sequences of compact model states is successful and robust, solving a diverse collection of continuous and discrete control tasks. We believe that Dreamer offers a strong foundation for further pushing the limits of reinforcement learning, including better representation learning, directed exploration with uncertainty estimates, temporal abstraction, and multi-task learning.

Acknowledgements
This project is a collaboration with Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. We further thank everybody in the Brain Team and beyond who commented on our paper draft and provided feedback at any point throughout the project.

Source: Google AI Blog


RecSim: A Configurable Simulation Platform for Recommender Systems

Originally posted on the Google AI Blog

Significant advances in machine learning, speech recognition, and language technologies are rapidly transforming the way in which recommender systems engage with users. As a result, collaborative interactive recommenders (CIRs)—recommender systems that engage in a deliberate sequence of interactions with a user to best meet that user's needs—have emerged as a tangible goal for online services.

Despite this, the deployment of CIRs has been limited by challenges in developing algorithms and models that reflect the qualitative characteristics of sequential user interaction. Reinforcement learning (RL) is the de facto standard ML approach for addressing sequential decision problems, and as such is a natural paradigm for modeling and optimizing sequential interaction in recommender systems. However, it remains under-investigated and under-utilized for use in CIRs in both research and practice. One major impediment is the lack of general-purpose simulation platforms for sequential recommender settings, whereas simulation has been one of the primary means for developing and evaluating RL algorithms in real-world applications like robotics.

To address this, we have developed RᴇᴄSɪᴍ (available here), a configurable platform for authoring simulation environments to facilitate the study of RL algorithms in recommender systems (and CIRs in particular). RᴇᴄSɪᴍ allows both researchers and practitioners to test the limits of existing RL methods in synthetic recommender settings. RecSim’s aim is to support simulations that mirror specific aspects of user behavior found in real recommender systems and serve as a controlled environment for developing, evaluating and comparing recommender models and algorithms, especially RL systems designed for sequential user-system interaction.

As an open-source platform, RᴇᴄSɪᴍ: (i) facilitates research at the intersection of RL and recommender systems; (ii) encourages reproducibility and model-sharing; (iii) aids the recommender-systems practitioner, interested in applying RL to rapidly test and refine models and algorithms in simulation, before incurring the potential cost (e.g., time, user impact) of live experiments; and (iv) serves as a resource for academic-industry collaboration through the release of “realistic” stylized models of user behavior without revealing user data or sensitive industry strategies.

Reinforcement Learning and Recommendation Systems

One challenge in applying RL to recommenders is that most recommender research is developed and evaluated using static datasets that do not reflect the sequential, repeated interaction a recommender has with its users. Even those with temporal extent, such as MovieLens 1M, do not (easily) support predictions about the long-term performance of novel recommender policies that differ significantly from those used to collect the data, as many of the factors that impact user choice are not recorded within the data. This makes the evaluation of even basic RL algorithms very difficult, especially when it comes to reasoning about the long-term consequences of some new recommendation policy—research shows changes in policy can have long-term, cumulative impact on user behavior. The ability to model such user behaviors in a simulated environment, and devise and test new recommendation algorithms, including those using RL, can greatly accelerate the research and development cycle for such problems.

Overview of RᴇᴄSɪᴍ

RᴇᴄSɪᴍ simulates a recommender agent’s interaction with an environment consisting of a user model, a document model and a user choice model. The agent interacts with the environment by recommending sets or lists of documents (known as slates) to users, and has access to observable features of simulated individual users and documents to make recommendations. The user model samples users from a distribution over (configurable) user features (e.g., latent features, like interests or satisfaction; observable features, like user demographic; and behavioral features, such as visit frequency or time budget). The document model samples items from a prior distribution over document features, both latent (e.g., quality) and observable (e.g., length, popularity). This prior, as all other components of RᴇᴄSɪᴍ, can be specified by the simulation developer, possibly informed (or learned) from application data.

The level of observability for both user and document features is customizable. When the agent recommends documents to a user, the response is determined by a user-choice model, which can access observable document features and all user features. Other aspects of a user’s response (e.g., time spent engaging with the recommendation) can depend on latent document features, such as document topic or quality. Once a document is consumed, the user state undergoes a transition through a configurable user transition model, since user satisfaction or interests might change.
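To make this data flow concrete, here is a schematic of a single step of the simulated interaction, written as plain Python rather than the actual RᴇᴄSɪᴍ API; all distributions, features, and update rules below are toy choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

class User:
    """Toy user with a latent interest vector and an observable time budget."""
    def __init__(self, num_topics=5):
        self.interests = rng.normal(size=num_topics)   # latent
        self.time_budget = 10.0                        # behavioral feature

def sample_documents(num_docs=20, num_topics=5):
    """Toy document model: observable topic features plus a latent quality score."""
    return {
        "topics": rng.normal(size=(num_docs, num_topics)),   # observable
        "quality": rng.uniform(size=num_docs),               # latent
    }

def choice_model(user, docs, slate):
    """Multinomial-logit-style choice over the recommended slate."""
    scores = docs["topics"][slate] @ user.interests
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return int(rng.choice(slate, p=probs))

def transition_model(user, docs, chosen):
    """User state update: consuming a document costs time; high-quality
    documents cost a bit less (a stand-in for satisfaction dynamics)."""
    user.time_budget -= 1.0 - 0.5 * docs["quality"][chosen]

# One step of the recommender/environment loop.
user, docs = User(), sample_documents()
slate = rng.choice(len(docs["topics"]), size=3, replace=False)   # agent's slate
chosen = choice_model(user, docs, slate)
transition_model(user, docs, chosen)
```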

We note that RᴇᴄSɪᴍ provides the ability to easily author specific aspects of user behavior of interest to the researcher or practitioner, while ignoring others. This can provide the critical ability to focus on modeling and algorithmic techniques designed for novel phenomena of interest (as we illustrate in two applications below). This type of abstraction is often critical to scientific modeling. Consequently, high-fidelity simulation of all elements of user behavior is not an explicit goal of RᴇᴄSɪᴍ. That said, we expect that it may also serve as a platform that supports “sim-to-real” transfer in certain cases (see below).
Data Flow through components of RᴇᴄSɪᴍ. Colors represent different model components — user and user-choice models (green), document model (blue), and the recommender agent (red)

Applications

We have used RᴇᴄSɪᴍ to investigate several key research problems that arise in the use of RL in recommender systems. For example, slate recommendation gives rise to RL problems in which the space of possible actions (slates) grows combinatorially with slate size, posing challenges for exploration, generalization and action optimization. We used RᴇᴄSɪᴍ to develop a novel decomposition technique that exploits simple, widely applicable assumptions about user choice behavior to tractably compute Q-values of entire recommendation slates. In particular, RᴇᴄSɪᴍ was used to test a number of experimental hypotheses, such as algorithm performance and robustness to different assumptions about user behavior.

Future Work

While RᴇᴄSɪᴍ provides ample opportunity for researchers and practitioners to probe and question assumptions made by RL/recommender algorithms in stylized environments, we are developing several important extensions. These include: (i) methodologies to fit stylized user models to usage logs to partially address the “sim-to-real” gap; (ii) the development of natural APIs using TensorFlow’s probabilistic APIs to facilitate model specification and learning, as well as scaling up simulation and inference algorithms using accelerators and distributed execution; and (iii) the extension to full-factor, mixed-mode interaction models that will be the hallmark of modern CIRs—e.g., language-based dialogue, preference elicitation, explanations, etc.

Our hope is that RᴇᴄSɪᴍ will serve as a valuable resource that bridges the gap between recommender systems and RL research — the use cases above are examples of how it can be used in this fashion. We also plan to pursue it as a platform to support academic-industry collaborations, through the sharing of stylized models of user behavior that, at suitable levels of abstraction, reflect a degree of realism that can drive useful model and algorithm development.

Further details of the RᴇᴄSɪᴍ framework can be found in the white paper, while code and colabs/tutorials are available here.

Acknowledgements
We thank our collaborators and early adopters of RᴇᴄSɪᴍ, including the other members of the RᴇᴄSɪᴍ team: Eugene Ie, Vihan Jain, Sanmit Narvekar, Jing Wang, Rui Wu and Craig Boutilier.

By Martin Mladenov, Research Scientist and Chih-wei Hsu, Software Engineer, Google Research