Tag Archives: reinforcement learning

Recursive Classification: Replacing Rewards with Examples in RL

A general goal of robotics research is to design systems that can assist in a variety of tasks that can potentially improve daily life. Most reinforcement learning algorithms for teaching agents to perform new tasks require a reward function, which provides positive feedback to the agent for taking actions that lead to good outcomes. However, specifying these reward functions can be quite tedious, and they can be very difficult to define for situations without a clear objective, such as whether a room is clean or a door is sufficiently shut. Even for tasks that are easy to describe, actually measuring whether the task has been solved can be difficult and may require adding many sensors to a robot's environment.

Alternatively, training a model using examples, called example-based control, has the potential to overcome the limitations of approaches that rely on traditional reward functions. This new problem statement is most similar to prior methods based on "success detectors", and efficient algorithms for example-based control could enable non-expert users to teach robots to perform new tasks, without the need for coding expertise, knowledge of reward function design, or the installation of environmental sensors.

In "Replacing Rewards with Examples: Example-Based Policy Search via Recursive Classification," we propose a machine learning algorithm for teaching agents how to solve new tasks by providing examples of success (e.g., if “success” examples show a nail embedded into a wall, the agent will learn to pick up a hammer and knock nails into the wall). This algorithm, recursive classification of examples (RCE), does not rely on hand-crafted reward functions, distance functions, or features, but rather learns to solve tasks directly from data, requiring the agent to learn how to solve the entire task by itself, without requiring examples of any intermediate states. Using a version of temporal difference learning — similar to Q-learning, but replacing the typical reward function term using only examples of success — RCE outperforms prior approaches based on imitation learning on simulated robotics tasks. Coupled with theoretical guarantees similar to those for reward-based learning, the proposed method offers a user-friendly alternative for teaching robots new tasks.

Top: To teach a robot to hammer a nail into a wall, most reinforcement learning algorithms require that the user define a reward function. Bottom: The example-based control method uses examples of what the world looks like when a task is completed to teach the robot to solve the task, e.g., examples where the nail is already hammered into the wall.

Example-Based Control vs Imitation Learning
While the example-based control method is similar to imitation learning, there is an important distinction — it does not require expert demonstrations. In fact, the user can actually be quite bad at performing the task themselves, as long as they can look back and pick out the small fraction of states where they did happen to solve the task.

Additionally, whereas previous research used a stage-wise approach in which the model first uses success examples to learn a reward function and then applies that reward function with an off-the-shelf reinforcement learning algorithm, RCE learns directly from the examples and skips the intermediate step of defining a reward function. Doing so avoids potential bugs, bypasses the hyperparameters associated with learning a reward function (such as how often to update it or how to regularize it), and, when debugging, removes the need to examine code related to learning the reward function.

Recursive Classification of Examples
The intuition behind the RCE approach is simple: the model should predict whether the agent will solve the task in the future, given the current state of the world and the action that the agent is taking. If there were data that specified which state-action pairs lead to future success and which state-action pairs lead to future failure, then one could solve this problem using standard supervised learning. However, when the only data available consists of success examples, the system doesn’t know which states and actions led to success, and while the system also has experience interacting with the environment, this experience isn't labeled as leading to success or not.

Left: The key idea is to learn a future success classifier that predicts for every state (circle) in a trajectory whether the task will be solved in the future (thumbs up/down). Right: In the example-based control approach, the model is provided only with unlabeled experience (grey circles) and success examples (green circles), so one cannot apply standard supervised learning. Instead, the model uses the success examples to automatically label the unlabeled experience.

Nonetheless, one can piece together what these data would look like, if they were available. First, by definition, a successful example must be one that solves the given task. Second, even though it is unknown whether an arbitrary state-action pair will lead to success in solving a task, it is possible to estimate how likely it is that the task will be solved if the agent started at the next state. If the next state is likely to lead to future success, it can be assumed that the current state is also likely to lead to future success. In effect, this is recursive classification, where the labels are inferred based on predictions at the next time step.

The underlying algorithmic idea of using a model's predictions at a future time step as a label for the current time step closely resembles existing temporal-difference methods, such as Q-learning and successor features. The key difference is that the approach described here does not require a reward function. Nonetheless, we show that this method inherits many of the same theoretical convergence guarantees as temporal difference methods. In practice, implementing RCE requires changing only a few lines of code in an existing Q-learning implementation.
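To make the "only a few lines of code" point concrete, below is a minimal sketch of a recursive-classification update in TensorFlow. It is an illustration of the idea only: success examples are labeled 1, while unlabeled experience is labeled with the discounted classifier prediction at the next state-action pair. The function and network here are our own simplified stand-ins, not the exact RCE objective or code from the paper.

```python
import tensorflow as tf

gamma = 0.99
# Classifier C(s, a): probability that the task will be solved in the future.
classifier = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
optimizer = tf.keras.optimizers.Adam(1e-3)
bce = tf.keras.losses.BinaryCrossentropy()

def recursive_classification_update(success_sa, experience_sa, experience_next_sa):
    """success_sa: (s, a) pairs drawn from the user's success examples.
    experience_sa / experience_next_sa: consecutive (s, a) pairs from unlabeled experience."""
    # The missing success/failure label for unlabeled experience is replaced by
    # the (discounted) prediction at the next time step.
    soft_labels = gamma * tf.stop_gradient(classifier(experience_next_sa))
    with tf.GradientTape() as tape:
        success_pred = classifier(success_sa)        # success examples should be classified as 1
        experience_pred = classifier(experience_sa)  # experience gets the recursive label
        loss = (bce(tf.ones_like(success_pred), success_pred)
                + bce(soft_labels, experience_pred))
    grads = tape.gradient(loss, classifier.trainable_variables)
    optimizer.apply_gradients(zip(grads, classifier.trainable_variables))
    return loss
```

In an existing Q-learning implementation, roughly these lines would take the place of the Bellman target that uses the reward.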

Evaluation
We evaluated the RCE method on a range of challenging robotic manipulation tasks. For example, in one task we required a robotic hand to pick up a hammer and hit a nail into a board. Previous research into this task [1, 2] has used a complex reward function (with terms corresponding to the distance between the hand and the hammer, the distance between the hammer and the nail, and whether the nail has been knocked into the board). In contrast, the RCE method requires only a few observations of what the world would look like if the nail were hammered into the board.

We compared the performance of RCE to a number of prior methods, including those that learn an explicit reward function and those based on imitation learning, all of which struggle to solve this task. This experiment highlights how example-based control makes it easy for users to specify even complex tasks, and demonstrates that recursive classification can successfully solve these sorts of tasks.

The RCE approach solves the task of hammering a nail into a board more reliably than prior approaches based on imitation learning [SQIL, DAC] and those that learn an explicit reward function [VICE, ORIL, PURL].

Conclusion
We have presented a method to teach autonomous agents to perform tasks by providing them with examples of success, rather than meticulously designing reward functions or collecting first-person demonstrations. An important aspect of example-based control, which we discuss in the paper, is what assumptions the system makes about the capabilities of different users. Designing variants of RCE that are robust to differences in users' capabilities may be important for applications in real-world robotics. The code is available, and the project website contains additional videos of the learned behaviors.

Acknowledgements
We thank our co-authors, Ruslan Salakhutdinov and Sergey Levine. We also thank Surya Bhupatiraju, Kamyar Ghasemipour, Max Igl, and Harini Kannan for feedback on this post, and Tom Small for helping to design figures for this post.

Source: Google AI Blog


Leveraging Machine Learning for Game Development

Over the years, online multiplayer games have exploded in popularity, captivating millions of players across the world. This popularity has also exponentially increased demands on game designers, as players expect games to be well-crafted and balanced — after all, it's no fun to play a game where a single strategy beats all the rest.

In order to create a positive gameplay experience, game designers typically tune the balance of a game iteratively:

  1. Stress-test through thousands of play-testing sessions from test users
  2. Incorporate feedback and re-design the game
  3. Repeat 1 & 2 until both the play-testers and game designers are satisfied

This process is not only time-consuming but also imperfect — the more complex the game, the easier it is for subtle flaws to slip through the cracks. Because games often have many different playable roles, each with dozens of interconnecting skills, hitting the right balance becomes all the more difficult.

Today, we present an approach that leverages machine learning (ML) to adjust game balance by training models to serve as play-testers, and demonstrate this approach on the digital card game prototype Chimera, which we’ve previously shown as a testbed for ML-generated art. By running millions of simulations using trained agents to collect data, this ML-based game testing approach enables game designers to more efficiently make a game more fun, balanced, and aligned with their original vision.

Chimera
We developed Chimera as a game prototype that would heavily lean on machine learning during its development process. For the game itself, we purposefully designed the rules to expand the possibility space, making it difficult to build a traditional hand-crafted AI to play the game.

The gameplay of Chimera revolves around the titular chimeras, creature mash-ups that players aim to strengthen and evolve. The objective of the game is to defeat the opponent's chimera. These are the key points in the game design:

  • Players may play:
    • creatures, which can attack (through their attack stat) or be attacked (against their health stat), or
    • spells, which produce special effects.
  • Creatures are summoned into limited-capacity biomes, which are placed physically on the board space. Each creature has a preferred biome and will take repeated damage if placed on an incorrect biome or a biome that is over capacity.
  • A player controls a single chimera, which starts off in a basic "egg" state and can be evolved and strengthened by absorbing creatures. To do this, the player must also acquire a certain amount of link energy, which is generated from various gameplay mechanics.
  • The game ends when a player has successfully brought the health of the opponent's chimera to 0.

Learning to Play Chimera
As an imperfect information card game with a large state space, we expected Chimera to be a difficult game for an ML model to learn, especially as we were aiming for a relatively simple model. We used an approach inspired by earlier game-playing agents like AlphaGo, in which a convolutional neural network (CNN) is trained to predict the probability of a win when given an arbitrary game state. After training an initial model on games where random moves were chosen, we set the agent to play against itself, iteratively collecting game data that was then used to train a new agent. With each iteration, the quality of the training data improved, as did the agent’s ability to play the game.
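The loop described above can be summarized with the following schematic sketch. The environment and training interfaces (reset, step, legal_moves, and the train_model callable) are illustrative placeholders, not the project's actual code.

```python
import random

def play_game(env, policy):
    """Plays one self-play game and returns (game_state, winner) training pairs."""
    states, state = [], env.reset()
    while not env.done():
        state = env.step(policy(state))   # both sides are played by the same policy
        states.append(state)
    return [(s, env.winner()) for s in states]

def self_play_iteration(env, model, train_model, num_games=10_000):
    """Collects games with the current model, then trains the next-generation model."""
    def policy(state):
        # Greedy with respect to the predicted win probability; version 0 of the
        # agent simply picks random moves instead.
        moves = env.legal_moves(state)
        if model is None:
            return random.choice(moves)
        return max(moves, key=lambda move: model.win_probability(state, move))

    data = []
    for _ in range(num_games):
        data.extend(play_game(env, policy))
    return train_model(data)
```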

The ML agent's performance against our best hand-crafted AI as training progressed. The initial ML agent (version 0) picked moves randomly.

For the actual game state representation that the model would receive as input, we found that passing an "image" encoding to the CNN resulted in the best performance, beating all benchmark procedural agents and other types of networks (e.g. fully connected). The chosen model architecture is small enough to run on a CPU in reasonable time, which allowed us to download the model weights and run the agent live in a Chimera game client using Unity Barracuda.
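As a point of reference for scale, a win-probability CNN of this sort can be quite small. The sketch below is illustrative only: the input shape, layer sizes, and feature planes are assumptions on our part, not the architecture used for Chimera.

```python
import tensorflow as tf

def build_win_probability_model(state_shape=(16, 16, 8)):
    # state_shape is a made-up example: a 16x16 grid with 8 feature planes
    # encoding creatures, biomes, chimera state, and so on.
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu",
                               input_shape=state_shape),
        tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu"),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),  # P(current player wins)
    ])

model = build_win_probability_model()
model.compile(optimizer="adam", loss="binary_crossentropy")
```

A network in this size range is small enough for CPU inference, which is what makes running the agent live in the game client practical.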

An example game state representation used to train the neural network.
In addition to making decisions for the game AI, we also used the model to display the estimated win probability for a player over the course of the game.

Balancing Chimera
This approach enabled us to simulate millions more games than real players would be capable of playing in the same time span. After collecting data from the games played by the best-performing agents, we analyzed the results to find imbalances between the two player decks we had designed.

First, the Evasion Link Gen deck was composed of spells and creatures with abilities that generated extra link energy used to evolve a player’s chimera. It also contained spells that enabled creatures to evade attacks. In contrast, the Damage-Heal deck contained creatures of variable strength with spells that focused on healing and inflicting minor damage. Although we had designed these decks to be of equal strength, the Evasion Link Gen deck was winning 60% of the time when played against the Damage-Heal deck.

When we collected various stats related to biomes, creatures, spells, and chimera evolutions, two things immediately jumped out at us:

  1. There was a clear advantage in evolving a chimera — the agent won a majority of the games where it evolved its chimera more than the opponent did. Yet, the average number of evolves per game did not meet our expectations. To make it more of a core game mechanic, we wanted to increase the overall average number of evolves while keeping its usage strategic.
  2. The T-Rex creature was overpowered. Its appearances correlated strongly with wins, and the model would always play the T-Rex regardless of penalties for summoning into an incorrect or overcrowded biome.

From these insights, we made some adjustments to the game. To emphasize chimera evolution as a core mechanism in the game, we decreased the amount of link energy required to evolve a chimera from 3 to 1. We also added a “cool-off” period to the T-Rex creature, doubling the time it took to recover from any of its actions.

Repeating our ‘self-play’ training procedure with the updated rules, we observed that these changes pushed the game in the desired direction — the average number of evolves per game increased, and the T-Rex's dominance faded.

One example comparison of the T-Rex’s influence before and after balancing. The charts present the number of games won (or lost) when a deck initiates a particular spell interaction (e.g., using the “Dodge” spell to benefit a T-Rex). Left: Before the changes, the T-Rex had a strong influence in every metric examined — highest survival rate, most likely to be summoned ignoring penalties, most absorbed creature during wins. Right: After the changes, the T-Rex was much less overpowered.

By weakening the T-Rex, we successfully reduced the Evasion Link Gen deck's reliance on an overpowered creature. Even so, the win ratio between the decks remained at 60/40 rather than 50/50. A closer look at the individual game logs revealed that the gameplay was often less strategic than we would have liked. Searching through our gathered data again, we found several more areas to introduce changes in.

To start, we increased the starting health of both players as well as the amount of health that healing spells could replenish. This was to encourage longer games that would allow a more diverse set of strategies to flourish. In particular, this enabled the Damage-Heal deck to survive long enough to take advantage of its healing strategy. To encourage proper summoning and strategic biome placement, we increased the existing penalties on playing creatures into incorrect or overcrowded biomes. And finally, we decreased the gap between the strongest and weakest creatures through minor attribute adjustments.

New adjustments in place, we arrived at the final game balance stats for these two decks:

Deck                 Avg # evolves per game (before → after)     Win % (1M games) (before → after)
Evasion Link Gen     1.54 → 2.16                                 59.1% → 49.8%
Damage Heal          0.86 → 1.76                                 40.9% → 50.2%

Conclusion
Normally, identifying imbalances in a newly prototyped game can take months of playtesting. With this approach, we were able to not only discover potential imbalances but also introduce tweaks to mitigate them in a span of days. We found that a relatively simple neural network was sufficient to reach high level performance against humans and traditional game AI. These agents could be leveraged in further ways, such as for coaching new players or discovering unexpected strategies. We hope this work will inspire more exploration in the possibilities of machine learning for game development.

Acknowledgements
This project was conducted in collaboration with many people. Thanks to Ryan Poplin, Maxwell Hannaman, Taylor Steil, Adam Prins, Michal Todorovic, Xuefan Zhou, Aaron Cammarata, Andeep Toor, Trung Le, Erin Hoffman-John, and Colin Boswell. Thanks to everyone who contributed through playtesting, advising on game design, and giving valuable feedback.

Source: Google AI Blog


PAIRED: A New Multi-agent Approach for Adversarial Environment Generation

The effectiveness of any machine learning method is critically dependent on its training data. In the case of reinforcement learning (RL), one can rely either on limited data collected by an agent interacting with the real world, or on a simulated training environment that can be used to collect as much data as needed. This latter method of training in simulation is increasingly popular, but it has a problem — the RL agent can learn what is built into the simulator, but tends to be bad at generalizing to tasks that are even slightly different from the ones simulated. And building a simulator that covers all the complexity of the real world is obviously extremely challenging.

An approach to address this is to automatically create more diverse training environments by randomizing all the parameters of the simulator, a process called domain randomization (DR). However, DR can fail even in very simple environments. For example, in the animation below, the blue agent is trying to navigate to the green goal. The left panel shows an environment created with DR where the positions of the obstacles and goal have been randomized. Many of these DR environments were used to train the agent, which was then transferred to the simple Four Rooms environment in the middle panel. Notice that the agent can’t find the goal. This is because it has not learned to walk around walls. Even though the wall configuration from the Four Rooms example could have been generated randomly in the DR training phase, it’s unlikely. As a result, the agent has not spent enough time training on walls similar to the Four Rooms structure, and is unable to reach the goal.

Domain randomization (left) does not effectively prepare an agent to transfer to previously unseen environments, such as the Four Rooms scenario (middle). To address this, a minimax adversary is used to construct previously unseen environments (right), but can result in creating situations that are impossible to solve.

Instead of just randomizing the environment parameters, one could train a second RL agent to learn how to set the environment parameters. This minimax adversary can be trained to minimize the performance of the first RL agent by finding and exploiting weaknesses in its policy, e.g., by building wall configurations the agent has not encountered before. But again there is a problem. The right panel shows an environment built by a minimax adversary in which it is actually impossible for the agent to reach the goal. While the minimax adversary has succeeded in its task — it has minimized the performance of the original agent — it provides no opportunity for the agent to learn. Using a purely adversarial objective is not well suited to generating training environments, either.

In collaboration with UC Berkeley, we propose a new multi-agent approach for training the adversary in “Emergent Complexity and Zero-shot Transfer via Unsupervised Environment Design”, a publication recently presented at NeurIPS 2020. In this work we present an algorithm, Protagonist Antagonist Induced Regret Environment Design (PAIRED), that is based on minimax regret and prevents the adversary from creating impossible environments, while still enabling it to correct weaknesses in the agent’s policy. PAIRED incentivizes the adversary to tune the difficulty of the generated environments to be just outside the agent’s current abilities, leading to an automatic curriculum of increasingly challenging training tasks. We show that agents trained with PAIRED learn more complex behavior and generalize better to unknown test tasks. We have released open-source code for PAIRED on our GitHub repo.

PAIRED
To flexibly constrain the adversary, PAIRED introduces a third RL agent, which we call the antagonist agent, because it is allied with the adversarial agent, i.e., the one designing the environment. We rename our initial agent, the one navigating the environment, the protagonist. Once the adversary generates an environment, both the protagonist and antagonist play through that environment.

The adversary’s job is to maximize the antagonist’s reward while minimizing the protagonist's reward. This means it must create environments that are feasible (because the antagonist can solve them and get a high score), but challenging to the protagonist (exploit weaknesses in its current policy). The gap between the two rewards is the regret — the adversary tries to maximize the regret, while the protagonist competes to minimize it.
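As a rough sketch of how that signal can be computed, assume per-episode returns have already been collected for both agents on the same adversary-generated environment. The estimator below follows our reading of the paper, comparing the antagonist's best episode with the protagonist's average episode; it is an illustration, not the released implementation.

```python
import numpy as np

def adversary_regret_reward(protagonist_returns, antagonist_returns):
    """Regret used as the reward for the environment-designing adversary.

    Both arguments are arrays of per-episode returns on the same generated
    environment. The protagonist and antagonist themselves are simply trained
    on their own environment returns with a standard RL algorithm.
    """
    regret = np.max(antagonist_returns) - np.mean(protagonist_returns)
    return regret
```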

The methods discussed above (domain randomization, minimax regret and PAIRED) can be analyzed using the same theoretical framework, unsupervised environment design (UED), which we describe in detail in the paper. UED draws a connection between environment design and decision theory, enabling us to show that domain randomization is equivalent to the Principle of Insufficient Reason, the minimax adversary follows the Maximin Principle, and PAIRED is optimizing minimax regret. Below, we show how each of these ideas works for environment design:

Domain randomization (a) generates unstructured environments that aren’t tailored to the agent’s learning progress. The minimax adversary (b) may create impossible environments. PAIRED (c) can generate challenging, structured environments, which are still possible for the agent to complete.

Curriculum Generation
What’s interesting about minimax regret is that it incentivizes the adversary to generate a curriculum of initially easy, then increasingly challenging environments. In most RL environments, the reward function will give a higher score for completing the task more efficiently, or in fewer timesteps. When this is true, we can show that regret incentivizes the adversary to create the easiest possible environment the protagonist can’t solve yet. To see this, let’s assume the antagonist is perfect, and always gets the highest score that it possibly can. Meanwhile, the protagonist is terrible, and gets a score of zero on everything. In that case, the regret just depends on the difficulty of the environment. Since easier environments can be completed in fewer timesteps, they allow the antagonist to get a higher score. Therefore, the regret of failing at an easy environment is greater than the regret of failing on a hard environment:
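Written out under these idealized assumptions (this rendering is ours, with R denoting an agent's return on environment E):

```latex
% Under the idealized assumptions above (perfect antagonist, protagonist scores zero),
% and since easier environments can be completed in fewer timesteps:
\mathrm{Regret}(E) = R_{\mathrm{antagonist}}(E) - R_{\mathrm{protagonist}}(E) = R_{\mathrm{antagonist}}(E)
\quad\Longrightarrow\quad
\mathrm{Regret}(E_{\mathrm{easy}}) = R_{\mathrm{antagonist}}(E_{\mathrm{easy}})
  > R_{\mathrm{antagonist}}(E_{\mathrm{hard}}) = \mathrm{Regret}(E_{\mathrm{hard}}).
```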

So, by maximizing regret, the adversary is searching for easy environments that the protagonist fails to solve. Once the protagonist learns to solve each environment, the adversary must move on to finding a slightly harder environment that the protagonist can’t solve. Thus, the adversary generates a curriculum of increasingly difficult tasks.

Results
We can see the curriculum emerging in the learning curves below, which plot the shortest path length of a maze the agents have successfully solved. Unlike minimax or domain randomization, the PAIRED adversary creates a curriculum of increasingly longer, but possible, mazes, enabling PAIRED agents to learn more complex behavior.

But can these different training schemes help an agent generalize better to unknown test tasks? Below, we see the zero-shot transfer performance of each algorithm on a series of challenging test tasks. As the complexity of the transfer environment increases, the performance gap between PAIRED and the baselines widens. For extremely difficult tasks like Labyrinth and Maze, PAIRED is the only method that can occasionally solve the task. These results provide promising evidence that PAIRED can be used to improve generalization for deep RL.

Admittedly, these simple gridworlds do not reflect the complexities of the real world tasks that many RL methods are attempting to solve. We address this in “Adversarial Environment Generation for Learning to Navigate the Web”, which examines the performance of PAIRED when applied to more complex problems, such as teaching RL agents to navigate web pages. We propose an improved version of PAIRED, and show how it can be used to train an adversary to generate a curriculum of increasingly challenging websites:

Above, you can see websites built by the adversary in the early, middle, and late training stages, which progress from using very few elements per page to many simultaneous elements, making the tasks progressively harder. We test whether agents trained on this curriculum can generalize to standardized web navigation tasks, and achieve a 75% success rate, with a 4x improvement over the strongest curriculum learning baseline.

Conclusions
Deep RL is very good at fitting a simulated training environment, but how can we build simulations that cover the complexity of the real world? One solution is to automate this process. We propose Unsupervised Environment Design (UED) as a framework that describes different methods for automatically creating a distribution of training environments, and show that UED subsumes prior work like domain randomization and minimax adversarial training. We think PAIRED is a good approach for UED, because regret maximization leads to a curriculum of increasingly challenging tasks, and prepares agents to transfer successfully to unknown test tasks.

Acknowledgements
We would like to recognize the co-authors of “Emergent Complexity and Zero-shot Transfer via Unsupervised Environment Design”: Michael Dennis, Natasha Jaques, Eugene Vinitsky, Alexandre Bayen, Stuart Russell, Andrew Critch, and Sergey Levine, as well as the co-authors of Adversarial Environment Generation for Learning to Navigate the Web: Izzeddin Gur, Natasha Jaques, Yingjie Miao, Jongwook Choi, Kevin Malta, Manoj Tiwari, Honglak Lee, Aleksandra Faust. In addition, we thank Michael Chang, Marvin Zhang, Dale Schuurmans, Aleksandra Faust, Chase Kew, Jie Tan, Dennis Lee, Kelvin Xu, Abhishek Gupta, Adam Gleave, Rohin Shah, Daniel Filan, Lawrence Chan, Sam Toyer, Tyler Westenbroek, Igor Mordatch, Shane Gu, DJ Strouse, and Max Kleiman-Weiner for discussions that contributed to this work.

Source: Google AI Blog


Mastering Atari with Discrete World Models

Deep reinforcement learning (RL) enables artificial agents to improve their decisions over time. Traditional model-free approaches learn which of the actions are successful in different situations by interacting with the environment through a large amount of trial and error. In contrast, recent advances in deep RL have enabled model-based approaches to learn accurate world models from image inputs and use them for planning. World models can learn from fewer interactions, facilitate generalization from offline data, enable forward-looking exploration, and allow reusing knowledge across multiple tasks.

Despite their intriguing benefits, existing world models (such as SimPLe) have not been accurate enough to compete with the top model-free approaches on the most competitive reinforcement learning benchmarks — to date, reaching human-level performance on the well-established Atari benchmark has required model-free algorithms, such as DQN, IQN, and Rainbow. As a result, many researchers have focused instead on developing task-specific planning methods, such as VPN and MuZero, which learn by predicting sums of expected task rewards. However, these methods are specific to individual tasks, and it is unclear how well they would generalize to new tasks or learn from unsupervised datasets. Similar to the recent breakthrough of unsupervised representation learning in computer vision [1, 2], world models aim to learn patterns in the environment that are more general than any particular task, so that tasks can later be solved more efficiently.

Today, in collaboration with DeepMind and the University of Toronto, we introduce DreamerV2, the first RL agent based on a world model to achieve human-level performance on the Atari benchmark. It constitutes the second generation of the Dreamer agent that learns behaviors purely within the latent space of a world model trained from pixels. DreamerV2 relies exclusively on general information from the images and accurately predicts future task rewards even when its representations were not influenced by those rewards. Using a single GPU, DreamerV2 outperforms top model-free algorithms with the same compute and sample budget.

Gamer normalized median score across the 55 Atari games after 200 million steps. DreamerV2 substantially outperforms previous world models. Moreover, it exceeds top model-free agents within the same compute and sample budget.
Behaviors learned by DreamerV2 for some of the 55 Atari games. These videos show images from the environment. Video predictions are shown below in the blog post.

An Abstract Model of the World
Just like its predecessor, DreamerV2 learns a world model and uses it to train actor-critic behaviors purely from predicted trajectories. The world model automatically learns to compute compact representations of its images that discover useful concepts, such as object positions, and learns how these concepts change in response to different actions. This lets the agent generate abstractions of its images that ignore irrelevant details and enables massively parallel predictions on a single GPU. During 200 million environment steps, DreamerV2 predicts 468 billion compact states for learning its behavior.

DreamerV2 builds upon the Recurrent State-Space Model (RSSM) that we introduced for PlaNet and was also used for DreamerV1. During training, an encoder turns each image into a stochastic representation that is incorporated into the recurrent state of the world model. Because the representations are stochastic, they do not have access to perfect information about the images and instead extract only what is necessary to make predictions, making the agent robust to unseen images. From each state, a decoder reconstructs the corresponding image to learn general representations. Moreover, a small reward network is trained to rank outcomes during planning. To enable planning without generating images, a predictor learns to guess the stochastic representations without access to the images from which they were computed.

Learning process of the world model used by DreamerV2. The world model maintains recurrent states (h1–h3) that receive actions (a1–a2) and incorporate information about the images (x1–x3) via stochastic representations (z1–z3). A predictor guesses the representations as (ẑ1–ẑ3) without access to the images from which they were generated.

Importantly, DreamerV2 introduces two new techniques to RSSM that lead to a substantially more accurate world model for learning successful policies. The first technique is to represent each image with multiple categorical variables instead of the Gaussian variables used by PlaNet, DreamerV1, and many more world models in the literature [1, 2, 3, 4, 5]. This leads the world model to reason about the world in terms of discrete concepts and enables more accurate predictions of future representations.

The encoder turns each image into 32 distributions over 32 classes each, the meanings of which are determined automatically as the world model learns. The one-hot vectors sampled from these distributions are concatenated to a sparse representation that is passed on to the recurrent state. To backpropagate through the samples, we use straight-through gradients that are easy to implement using automatic differentiation. Representing images with categorical variables allows the predictor to accurately learn the distribution over the one-hot vectors of the possible next images. In contrast, earlier world models that use Gaussian predictors cannot accurately match the distribution over multiple Gaussian representations for the possible next images.
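A minimal TensorFlow sketch of this categorical latent with straight-through gradients follows. The 32×32 latent matches the description above; everything else (the encoder layers and sizes) is an illustrative stand-in rather than the DreamerV2 implementation.

```python
import tensorflow as tf

NUM_VARIABLES, NUM_CLASSES = 32, 32   # 32 distributions over 32 classes each

encoder = tf.keras.Sequential([
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(512, activation="elu"),
    tf.keras.layers.Dense(NUM_VARIABLES * NUM_CLASSES),   # logits
])

def sample_latent(image_batch):
    logits = tf.reshape(encoder(image_batch), [-1, NUM_VARIABLES, NUM_CLASSES])
    # Sample one class per categorical variable.
    indices = tf.random.categorical(tf.reshape(logits, [-1, NUM_CLASSES]), num_samples=1)
    one_hot = tf.one_hot(tf.reshape(indices, [-1, NUM_VARIABLES]), NUM_CLASSES)
    # Straight-through estimator: the forward pass uses the hard one-hot sample,
    # while gradients flow through the softmax probabilities to the logits.
    probs = tf.nn.softmax(logits)
    sample = probs + tf.stop_gradient(one_hot - probs)
    # Concatenate the 32 one-hot vectors into one sparse feature vector.
    return tf.reshape(sample, [-1, NUM_VARIABLES * NUM_CLASSES])
```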

Multiple categoricals that represent possible next images can be accurately predicted by a categorical predictor, whereas a Gaussian predictor is not flexible enough to accurately predict multiple possible Gaussian representations.

The second new technique of DreamerV2 is KL balancing. Many previous world models use the ELBO objective that encourages accurate reconstructions while keeping the stochastic representations (posteriors) close to their predictions (priors) to regularize the amount of information extracted from each image and facilitate generalization. Because the objective is optimized end-to-end, the stochastic representations and their predictions can be made more similar by bringing either of the two towards the other. However, bringing the representations towards their predictions can be problematic when the predictor is not yet accurate. KL balancing lets the predictions move faster toward the representations than vice versa. This results in more accurate predictions, a key to successful planning.
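A sketch of the KL-balancing term for categorical latents is shown below. The mixing coefficient and the exact placement of the stop-gradients reflect our simplified reading of the technique, not the paper's code.

```python
import tensorflow as tf

def kl_balancing_loss(posterior_logits, prior_logits, alpha=0.8):
    """KL balancing for categorical latents (simplified sketch).

    posterior_logits / prior_logits: [batch, num_variables, num_classes] logits of
    the stochastic representation and of the predictor. With alpha > 0.5, the
    predictor (prior) is pulled toward the representation (posterior) faster than
    the representation is regularized toward the predictor.
    """
    def categorical_kl(p_logits, q_logits):
        p = tf.nn.softmax(p_logits)
        kl = tf.reduce_sum(
            p * (tf.nn.log_softmax(p_logits) - tf.nn.log_softmax(q_logits)), axis=-1)
        return tf.reduce_sum(kl, axis=-1)   # sum over the independent categoricals

    # Train the predictor against a frozen representation, and (more gently)
    # regularize the representation toward a frozen predictor.
    prior_term = categorical_kl(tf.stop_gradient(posterior_logits), prior_logits)
    posterior_term = categorical_kl(posterior_logits, tf.stop_gradient(prior_logits))
    return alpha * prior_term + (1.0 - alpha) * posterior_term
```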

Long-term video predictions of the world model for holdout sequences. Each model receives 5 frames as input (not shown) and then predicts 45 steps forward given only actions. The video predictions are only used to gain insights into the quality of the world model. During planning, only compact representations are predicted, not images.

Measuring Atari Performance
DreamerV2 is the first world model that enables learning successful behaviors with human-level performance on the well-established and competitive Atari benchmark. We select the 55 games that many previous studies have in common and recommend this set of games for future work. Following the standard evaluation protocol, the agents are allowed 200M environment interactions using an action repeat of 4 and sticky actions (25% chance that an action is ignored and the previous action is repeated instead). We compare to the top model-free agents IQN and Rainbow, as well as to the well-known C51 and DQN agents implemented in the Dopamine framework.

Different standards exist for aggregating the scores across the 55 games. Ideally, a new algorithm would perform better under all conditions. For all four aggregation methods, DreamerV2 indeed outperforms all compared model-free algorithms while using the same computational budget.

DreamerV2 outperforms the top model-free agents according to four methods for aggregating scores across the 55 Atari games. We introduce and recommend the Clipped Record Mean (right-most plot) as an informative and robust performance metric.

The first three aggregation methods were previously proposed in the literature. We identify important drawbacks in each and recommend a new aggregation method, the clipped record mean, to overcome them.

  • Gamer Median. Most commonly, scores for each game are normalized by the performance of a human gamer that was assessed for the DQN paper and the median of the normalized scores of all games is reported. Unfortunately, the median ignores the scores of many simpler and harder games.
  • Gamer Mean. The mean takes the scores for all games into account but is mainly influenced by a small number of games where the human gamer performed poorly. This makes it easy for an algorithm to achieve large normalized scores on some games (e.g., James Bond, Video Pinball) that then dominate the mean.
  • Record Mean. Prior work recommends normalization based on the human world record instead, but such a metric is still overly influenced by a small number of games where it is easy for the artificial agents to outscore the human record.
  • Clipped Record Mean. We introduce a new metric that normalizes scores by the world record and clips them to not exceed the record. This yields an informative and robust metric that takes the performance on all games into account to an approximately equal amount.

While many current algorithms exceed the human gamer baseline, they are still quite far behind the human world record. As shown in the right-most plot above, DreamerV2 leads by achieving 25% of the human record on average across games. Clipping the scores at the record line lets us focus our efforts on developing methods that come closer to the human world record on all of the games rather than exceeding it on just a few games.
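For concreteness, the clipped record mean can be computed roughly as follows. We assume the usual random-agent baseline in the normalization; the score arrays are hypothetical inputs.

```python
import numpy as np

def clipped_record_mean(agent_scores, random_scores, record_scores):
    """Each argument: per-game scores over the same list of Atari games."""
    agent = np.asarray(agent_scores, dtype=float)
    random_baseline = np.asarray(random_scores, dtype=float)
    record = np.asarray(record_scores, dtype=float)
    # Normalize so that 0 is the random agent and 1 is the human world record,
    # then clip so that exceeding the record on a few games cannot dominate.
    normalized = (agent - random_baseline) / (record - random_baseline)
    return np.mean(np.clip(normalized, None, 1.0))
```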

What matters and what doesn't
To gain insights into the important components of DreamerV2, we conduct an extensive ablation study. Importantly, we find that categorical representations offer a clear advantage over Gaussian representations despite the fact that Gaussians have been used extensively in prior works. KL balancing provides an even more substantial advantage over the KL regularizer used by most generative models.

By preventing the image reconstruction or reward prediction gradients from shaping the model states, we study their importance for learning successful representations. We find that DreamerV2 relies completely on universal information from the high-dimensional input images and its representations enable accurate reward predictions even when they were not trained using information about the reward. This mirrors the success of unsupervised representation learning in the computer vision community.

Atari performance for various ablations of DreamerV2 (Clipped Record Mean). Categorical representations, KL balancing, and learning about the images are crucial for the success of DreamerV2. Using reward information, which is specific to narrow tasks, offers no additional benefit for learning the world model.

Conclusion
We show how to learn a powerful world model to achieve human-level performance on the competitive Atari benchmark and outperform the top model-free agents. This result demonstrates that world models are a powerful approach for achieving high performance on reinforcement learning problems and are ready to use for practitioners and researchers. We see this as an indication that the success of unsupervised representation learning in computer vision [1, 2] is now starting to be realized in reinforcement learning in the form of world models. An unofficial implementation of DreamerV2 is available on GitHub and provides a productive starting point for future research projects. We see world models that leverage large offline datasets, long-term memory, hierarchical planning, and directed exploration as exciting avenues for future research.

Acknowledgements
This project is a collaboration with Timothy Lillicrap, Mohammad Norouzi, and Jimmy Ba. We further thank everybody on the Brain Team and beyond who commented on our paper draft and provided feedback at any point throughout the project.

Source: Google AI Blog


Evaluating Design Trade-offs in Visual Model-Based Reinforcement Learning

Model-free reinforcement learning has been successfully demonstrated across a range of domains, including robotics, control, playing games and autonomous vehicles. These systems learn by simple trial and error and thus require a vast number of attempts at a given task before solving it. In contrast, model-based reinforcement learning (MBRL) learns a model of the environment (often referred to as a world model or a dynamics model) that enables the agent to predict the outcomes of potential actions, which reduces the amount of environment interaction needed to solve a task.

In principle, all that is strictly necessary for planning is to predict future rewards, which could then be used to select near-optimal future actions. Nevertheless, many recent methods, such as Dreamer, PlaNet, and SimPLe, additionally leverage the training signal of predicting future images. But is predicting future images actually necessary, or helpful? What benefit do visual MBRL algorithms actually derive from also predicting future images? The computational and representational cost of predicting entire images is considerable, so understanding whether this is actually useful is of profound importance for MBRL research.

In “Models, Pixels, and Rewards: Evaluating Design Trade-offs in Visual Model-Based Reinforcement Learning”, we demonstrate that predicting future images provides a substantial benefit, and is in fact a key ingredient in training successful visual MBRL agents. We developed a new open-source library, called the World Models Library, which enabled us to rigorously evaluate various world model designs to determine the relative impact of image prediction on returned rewards for each.

World Models Library
The World Models Library, designed specifically for visual MBRL training and evaluation, enables the empirical study of the effects of each design decision on the final performance of an agent across multiple tasks on a large scale. The library introduces a platform-agnostic visual MBRL simulation loop and APIs to seamlessly define new world models, planners, and tasks, or to pick and choose from the existing catalog, which includes agents (e.g., PlaNet), video models (e.g., SV2P), and a variety of DeepMind Control tasks and planners, such as CEM and MPPI.

Using the library, developers can study the effect of a varying factor in MBRL, such as the model design or representation space, on the performance of the agent on a suite of tasks. The library supports the training of the agents from scratch, or on a pre-collected set of trajectories, as well as evaluation of a pre-trained agent on a given task. The models, planning algorithms and the tasks can be easily mixed and matched to any desired combination.

To provide the greatest flexibility for users, the library is built using the NumPy interface, which enables different components to be implemented in either TensorFlow, PyTorch, or JAX. Please look at this Colab for a quick introduction.

Impact of Image Prediction
Using the World Models Library, we trained multiple world models with different levels of image prediction. All of these models use the same input (previously observed images) to predict an image and a reward, but they differ on what percentage of the image they predict. As the number of image pixels predicted by the agent increases, the agent performance as measured by the true reward generally improves.

The input to the model is fixed (previously observed images), but the fraction of the image predicted varies. As can be seen in the graph on the right, increasing the number of predicted pixels significantly improves the performance of the model.
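One simple way to realize the "fraction of the image predicted" knob is to mask the image reconstruction loss, as in the hedged sketch below. This is an illustration of the idea, not the World Models Library API.

```python
import tensorflow as tf

def masked_image_loss(true_images, predicted_images, pixel_fraction, seed=0):
    """Mean squared error over a fixed random subset of pixels.

    true_images / predicted_images: float tensors of shape [batch, height, width, channels].
    pixel_fraction: value in [0, 1]; 0 means no image prediction signal at all.
    """
    shape = tf.shape(true_images)
    # Fixed random mask over spatial positions, shared across the batch and channels.
    mask = tf.cast(
        tf.random.stateless_uniform(shape[1:3], seed=[seed, 0]) < pixel_fraction,
        true_images.dtype)[None, :, :, None]
    squared_error = tf.square(true_images - predicted_images) * mask
    # Normalize by the kept fraction so the loss is a mean over the masked pixels.
    return tf.reduce_mean(squared_error) / tf.maximum(tf.reduce_mean(mask), 1e-8)
```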

Interestingly, the correlation between reward prediction accuracy and agent performance is not as strong, and in some cases a more accurate reward prediction can even result in lower agent performance. At the same time, there is a strong correlation between image reconstruction error and the performance of the agent.

Correlation between accuracy of image/reward prediction (x-axis) and task performance (y-axis). This graph clearly demonstrates a stronger correlation between image prediction accuracy and task performance.

This phenomenon is directly related to exploration, i.e., when the agent attempts more risky and potentially less rewarding actions in order to collect more information about the unknown options in the environment. This can be shown by testing and comparing models in an offline setup (i.e., learning policies from pre-collected datasets, as opposed to online RL, which learns policies by interacting with an environment). An offline setup ensures that there is no exploration and all of the models are trained on the same data. We observed that models that fit the data better usually perform better in the offline setup, and surprisingly, these may not be the same models that perform the best when learning and exploring from scratch.

Scores achieved by different visual MBRL models across different tasks. The top and bottom half of the graph visualizes the achieved score when trained in the online and offline settings for each task, respectively. Each color is a different model. It is common for a poorly-performing model in the online setting to achieve high scores when trained on pre-collected data (the offline setting) and vice versa.

Conclusion
We have empirically demonstrated that predicting images can substantially improve task performance over models that only predict the expected reward. We have also shown that the accuracy of image prediction strongly correlates with the final task performance of these models. These findings can be used for better model design and can be particularly useful for any future setting where the input space is high-dimensional and collecting data is expensive.

If you'd like to develop your own models and experiments, head to our repository and colab where you'll find instructions on how to reproduce this work and use or extend the World Models Library.

Acknowledgements
We would like to give special recognition to multiple researchers in the Google Brain team and co-authors of the paper: Mohammad Taghi Saffar, Danijar Hafner, Harini Kannan, Chelsea Finn and Sergey Levine.

Source: Google AI Blog


Estimating the Impact of Training Data with Reinforcement Learning

Recent work suggests that not all data samples are equally useful for training, particularly for deep neural networks (DNNs). Indeed, if a dataset contains low-quality or incorrectly labeled data, one can often improve performance by removing a significant portion of training samples. Moreover, in cases where there is a mismatch between the train and test datasets (e.g., due to difference in train and test location or time), one can also achieve higher performance by carefully restricting samples in the training set to those most relevant for the test scenario. Because of the ubiquity of these scenarios, accurately quantifying the values of training samples has great potential for improving model performance on real-world datasets.


Top: Examples of low-quality samples (noisy/crowd-sourced); Bottom: Examples of a train and test mismatch.

In addition to improving model performance, assigning a quality value to individual data can also enable new use cases. It can be used to suggest better practices for data collection, e.g., what kinds of additional data would benefit the most, and can be used to construct large-scale training datasets more efficiently, e.g., by web searching using the labels as keywords and filtering out less valuable data.

In “Data Valuation Using Deep Reinforcement Learning”, accepted at ICML 2020, we address the challenge of quantifying the value of training data using a novel approach based on meta-learning. Our method integrates data valuation into the training procedure of a predictor model that learns to recognize samples that are more valuable for the given task, improving both predictor and data valuation performance. We have also launched four AI Hub Notebooks that exemplify the use cases of DVRL and are designed to be conveniently adapted to other tasks and datasets, such as domain adaptation, corrupted sample discovery and robust learning, transfer learning on image data, and data valuation.

Quantifying the Value of Data
Not all data are equal for a given ML model — some have greater relevance for the task at hand or are richer in informative content than others. So how does one evaluate the value of a single datum? At the granularity of a full dataset, it is straightforward; one can simply train a model on the entire dataset and use its performance on a test set as its value. However, estimating the value of a single datum is far more difficult, especially for complex models that rely on large-scale datasets, because it is computationally infeasible to re-train and re-evaluate a model on all possible subsets.

To tackle this, researchers have explored permutation-based methods (e.g., influence functions), and game theory-based methods (e.g., data Shapley). However, even the best current methods are far from being computationally feasible for large datasets and complex models, and their data valuation performance is limited. Concurrently, meta learning-based adaptive weight assignment approaches have been developed to estimate the weight values using a meta-objective. But rather than prioritizing learning from high value data samples, their data value mapping is typically based on gradient descent learning or other heuristic approaches that alter the conventional predictor model training dynamics, which can result in performance changes that are unrelated to the value of individual data points.

Data Valuation Using Reinforcement Learning (DVRL)
To infer the data values, we propose a data value estimator (DVE) that estimates data values and selects the most valuable samples to train the predictor model. This selection operation is fundamentally non-differentiable and thus conventional gradient descent-based methods cannot be used. Instead, we propose to use reinforcement learning (RL) such that the supervision of the DVE is based on a reward that quantifies the predictor performance on a small (but clean) validation set. The reward guides the optimization of the policy towards the action of optimal data valuation, given the state and input samples. Here, we treat the predictor model learning and evaluation framework as the environment, a novel application scenario of RL-assisted machine learning.
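The following is a simplified sketch of that training signal: the DVE outputs a selection probability for each sample in a batch, a subset is sampled to train the predictor, and the DVE receives a REINFORCE-style update whose reward is the validation performance relative to a moving-average baseline. Network sizes, the baseline, and all names here are illustrative, not the paper's implementation.

```python
import tensorflow as tf

# Data value estimator (DVE): one selection probability per training sample.
dve = tf.keras.Sequential([
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
dve_optimizer = tf.keras.optimizers.Adam(1e-3)

def dve_update(batch_features, train_and_evaluate_predictor, baseline):
    """batch_features: [batch, feature_dim] inputs to the DVE.
    train_and_evaluate_predictor: callable that trains the predictor on the
    selected subset and returns its accuracy on the small, clean validation set.
    baseline: running average of past validation accuracies."""
    with tf.GradientTape() as tape:
        probs = tf.squeeze(dve(batch_features), axis=-1)
        # Sample which training examples to keep (a non-differentiable selection).
        selection = tf.cast(tf.random.uniform(tf.shape(probs)) < probs, tf.float32)
        reward = train_and_evaluate_predictor(selection) - baseline
        log_prob = tf.reduce_sum(
            selection * tf.math.log(probs + 1e-8)
            + (1.0 - selection) * tf.math.log(1.0 - probs + 1e-8))
        loss = -reward * log_prob   # REINFORCE: reinforce selections that helped
    grads = tape.gradient(loss, dve.trainable_variables)
    dve_optimizer.apply_gradients(zip(grads, dve.trainable_variables))
    return reward
```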

Training with Data Value Estimation using Reinforcement Learning (DVRL). When training the data value estimator with an accuracy reward, the most valuable samples (denoted with green dots) are used more and more, whereas the least valuable samples (red dots) are used less frequently.

Results
We evaluate the data value estimation quality of DVRL on multiple types of datasets and use cases.

  • Model performance after removing high/low value samples
    Removing low value samples from the training dataset can improve the predictor model performance, especially in the cases where the training dataset contains corrupted samples. On the other hand, removing high value samples, especially if the dataset is small, decreases the performance significantly. Overall, the performance after removing high/low value samples is a strong indicator for the quality of data valuation.
    Accuracy with the removal of most and least valuable samples, where 20% of the labels are noisy by design. By removing such noisy labels as the least valuable samples, a high-quality data valuation method achieves better accuracy. We demonstrate that DVRL outperforms other methods significantly from this perspective.
    DVRL shows the fastest performance degradation after removing the most important samples and the slowest performance degradation after removing the least important samples in most cases, underlining the superiority of DVRL in identifying noisy labels compared to competing methods (Leave-One-Out and Data Shapley).

  • Robust learning with noisy labels
    We consider how reliably DVRL can learn with noisy data in an end-to-end way, without removing the low-value samples. Ideally, noisy samples should get low data values as DVRL converges and a high performance model would be returned.
    Robust learning with noisy labels. Test accuracy for ResNet-32 and WideResNet-28-10 on CIFAR-10 and CIFAR-100 datasets with 40% of uniform random noise on labels. DVRL outperforms other popular methods that are based on meta-learning.
    We show state-of-the-art results with DVRL in minimizing the impact of noisy labels. These also demonstrate that DVRL can scale to complex models and large-scale datasets.

  • Domain adaptation
We consider the scenario where the training dataset comes from a substantially different distribution from the validation and testing datasets. Data valuation is expected to be beneficial for this task, as it can select the samples from the training dataset that best match the distribution of the validation dataset. We focus on three cases: (1) a training set based on image search results (low-quality, web-scraped) applied to the task of skin lesion classification using HAM10000 data (high-quality, medical); (2) an MNIST training set for a digit recognition task on USPS data (different visual domain); (3) e-mail spam data applied to detecting spam in an SMS dataset (different task). DVRL yields significant improvements for domain adaptation by jointly optimizing the data valuator and the corresponding predictor model.

Conclusions
We propose a novel meta learning framework for data valuation which determines how likely each training sample will be used in training of the predictor model. Unlike previous works, our method integrates data valuation into the training procedure of the predictor model, allowing the predictor and DVE to improve each other's performance. We model this data value estimation task using a DNN trained through RL with a reward obtained from a small validation set that represents the target task performance. In a computationally-efficient way, DVRL can provide high quality ranking of training data that is useful for domain adaptation, corrupted sample discovery and robust learning. We show that DVRL significantly outperforms alternative methods on diverse types of tasks and datasets.

Acknowledgements
We gratefully acknowledge the contributions of Tomas Pfister.

Source: Google AI Blog


Massively Large-Scale Distributed Reinforcement Learning with Menger

In the last decade, reinforcement learning (RL) has become one of the most promising research areas in machine learning and has demonstrated great potential for solving sophisticated real-world problems, such as chip placement and resource management, and solving challenging games (e.g., Go, Dota 2, and hide-and-seek). In simplest terms, an RL infrastructure is a loop of data collection and training, where actors explore the environment and collect samples, which are then sent to the learners to train and update the model. Most current RL techniques require many iterations over batches of millions of samples from the environment to learn a target task (e.g., Dota 2 learns from batches of 2 million frames every 2 seconds). As such, an RL infrastructure should not only scale efficiently (e.g., increase the number of actors) and collect an immense number of samples, but also be able to swiftly iterate over these extensive amounts of samples during training.

Overview of an RL system in which an actor sends trajectories (e.g., multiple samples) to a learner. The learner trains a model using the sampled data and pushes the updated model back to the actor (e.g. TF-Agents, IMPALA).

Today we introduce Menger, a massively large-scale distributed RL infrastructure with localized inference that scales up to several thousand actors across multiple processing clusters (e.g., Borg cells), reducing the overall training time for the task of chip placement. In this post we describe how we implement Menger using Google TPU accelerators for fast training iterations, and present its performance and scalability on the challenging task of chip placement. Menger reduces the training time by up to 8.6x compared to a baseline implementation.

Menger System Design
There are various distributed RL systems, such as Acme and SEED RL, each of which focuses on optimizing a particular design point in the space of distributed reinforcement learning systems. For example, while Acme uses local inference on each actor with frequent model retrieval from the learner, SEED RL benefits from a centralized inference design by allocating a portion of TPU cores for performing batched calls. The tradeoffs between these design points are (1) paying the communication cost of sending/receiving observations and actions to/from a centralized inference server, or paying the communication cost of retrieving the model from a learner, and (2) the cost of inference on actors (e.g., CPUs) compared to accelerators (e.g., TPUs/GPUs). Because of the requirements of our target application (e.g., size of observations, actions, and model size), Menger uses local inference in a manner similar to Acme, but pushes the scalability of actors to a virtually unbounded limit. The main challenges to achieving massive scalability and fast training on accelerators include:

  1. Servicing a large number of read requests from actors to a learner for model retrieval can easily throttle the learner and quickly become a major bottleneck (e.g., significantly increasing the convergence time) as the number of actors increases.
  2. The TPU performance is often limited by the efficiency of the input pipeline in feeding the training data to the TPU compute cores. As the number of TPU compute cores increases (e.g., TPU Pod), the performance of the input pipeline becomes even more critical for the overall training runtime.

Efficient Model Retrieval
To address the first challenge, we introduce transparent and distributed caching components between the learner and the actors, optimized in TensorFlow and backed by Reverb (a similar approach was used in Dota 2). The main responsibility of the caching components is to strike a balance between the large number of requests from actors and the learner job. Adding these caching components not only significantly reduces the pressure on the learner to service the read requests, but also further distributes the actors across multiple Borg cells with a marginal communication overhead. In our study, we show that for a 16 MB model with 512 actors, the introduced caching components reduce the average read latency by a factor of ~4.0x, leading to faster training iterations, especially for on-policy algorithms such as PPO.
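The essence of this design can be sketched in a few lines of plain Python; the class names, the single-process setup, and the time-based refresh policy below are illustrative assumptions rather than a description of Menger's actual implementation.

    import threading
    import time

    class Learner:
        """Holds the latest model weights; serving every actor directly would not scale."""
        def __init__(self):
            self._lock = threading.Lock()
            self._version = 0
            self._weights = b"initial-weights"

        def publish(self, weights):
            with self._lock:
                self._version += 1
                self._weights = weights

        def latest(self):
            with self._lock:
                return self._version, self._weights

    class ModelCache:
        """Sits between one group of actors (e.g., one Borg cell) and the learner.

        Actors read from the cache; only the cache talks to the learner, and only
        when its local copy is older than `refresh_secs`, so one upstream read is
        amortized over many actor requests."""
        def __init__(self, learner, refresh_secs=1.0):
            self._learner = learner
            self._refresh_secs = refresh_secs
            self._last_fetch = float("-inf")
            self._cached = None
            self._lock = threading.Lock()

        def get_model(self):
            with self._lock:
                now = time.monotonic()
                if now - self._last_fetch > self._refresh_secs:
                    self._cached = self._learner.latest()
                    self._last_fetch = now
                return self._cached

    # Hundreds of actors per cell would share one cache instead of hitting the learner.
    learner = Learner()
    cache = ModelCache(learner)
    version, weights = cache.get_model()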

Overview of a distributed RL system with multiple actors placed in different Borg cells. Servicing the frequent model update requests from a massive number of actors across different Borg cells throttles the learner and the communication network between learner and actors, which leads to a significant increase in the overall convergence time. The dashed lines represent gRPC communication between different machines.

Overview of a distributed RL system with multiple actors placed in different Borg cells, with the introduced transparent and distributed caching service. The learner only sends the updated model to the distributed caching services. Each caching service handles the model update requests from the nearby actors (i.e., actors placed in the same Borg cell). The caching service not only reduces the load on the learner for servicing the model update requests, but also reduces the average read latency for the actors.

High Throughput Input Pipeline
To deliver a high throughput input data pipeline, Menger uses Reverb, a recently open-sourced data storage system designed for machine learning applications that provides an efficient and flexible platform to implement experience replay in a variety of on-policy/off-policy algorithms. However, using a single Reverb replay buffer service does not currently scale well in a distributed RL setting with thousands of actors, and simply becomes inefficient in terms of write throughput from actors.

A distributed RL system with a single replay buffer. Servicing a massive number of write requests from actors throttles the replay buffer and reduces its overall throughput. In addition, as we scale the learner to a setting with multiple compute engines (e.g., TPU Pod), feeding the data to these engines from a single replay buffer service becomes inefficient, which negatively impacts the overall convergence time.

To better understand the efficiency of the replay buffer in a distributed setting, we evaluate the average write latency for various payload sizes from 16 MB to 512 MB and for a number of actors ranging from 16 to 2048. We repeat the experiment with the replay buffer and actors placed on the same Borg cell. As the number of actors grows, the average write latency also increases significantly. Expanding the number of actors from 16 to 2048 increases the average write latency by a factor of ~6.2x and ~18.9x for payload sizes of 16 MB and 512 MB, respectively. This increase in the write latency negatively impacts the data collection time and leads to inefficiency in the overall training time.

The average write latency to a single Reverb replay buffer for various payload sizes (16 MB - 512 MB) and various number of actors (16 to 2048) when the actors and replay buffer are placed on the same Borg cells.

To mitigate this, we use the sharding capability provided by Reverb to increase the throughput between actors, learner, and replay buffer services. Sharding balances the write load from the large number of actors across multiple replay buffer servers, instead of throttling a single replay buffer server, and also minimizes the average write latency for each replay buffer server (as fewer actors share the same server). This enables Menger to scale efficiently to thousands of actors across multiple Borg cells.
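A rough sketch of this sharding scheme using the open-source Reverb API is shown below; the static assignment of actors to shards by hashing an actor ID, the table configuration, and the port layout are our own illustrative choices, not Menger's actual code.

    import numpy as np
    import reverb

    NUM_SHARDS = 4
    TABLE = "experience"

    # Each shard is an independent Reverb server with its own replay table.
    servers = [
        reverb.Server(
            tables=[
                reverb.Table(
                    name=TABLE,
                    sampler=reverb.selectors.Uniform(),
                    remover=reverb.selectors.Fifo(),
                    max_size=100_000,
                    rate_limiter=reverb.rate_limiters.MinSize(1),
                )
            ],
            port=8000 + i,
        )
        for i in range(NUM_SHARDS)
    ]

    def client_for_actor(actor_id):
        """Statically assign each actor to one shard so the write load is spread evenly."""
        shard = actor_id % NUM_SHARDS
        return reverb.Client(f"localhost:{8000 + shard}")

    # An actor writes its transitions only to its own shard.
    client = client_for_actor(actor_id=42)
    transition = [np.zeros(4, np.float32), np.int64(1), np.float32(0.5)]
    client.insert(transition, priorities={TABLE: 1.0})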

A distributed RL system with sharded replay buffers. Each replay buffer service is a dedicated data storage for a collection of actors, generally located on the same Borg cells. In addition, the sharded replay buffer configuration provides a higher throughput input pipeline to the accelerator cores.

Case Study: Chip Placement
We studied the benefits of Menger in the complex task of chip placement for a large netlist. Using 512 TPU cores, Menger achieves significant improvements in the training time (up to ~8.6x, reducing the training time from ~8.6 hours down to merely one hour in the fastest configuration) compared to a strong baseline. While Menger was optimized for TPUs, we believe that the key factor for this performance gain is the architecture, and we would expect to see similar gains when tailored for use on GPUs.

The improvement in training time using Menger with variable number of TPU cores compared to a baseline in the task of chip placement.

We believe that the Menger infrastructure and its promising results on the intricate task of chip placement demonstrate an innovative path forward for further shortening the chip design cycle, and that this infrastructure has the potential to enable further innovations not only in the chip design process, but in other challenging real-world tasks as well.

Acknowledgments
Most of the work was done by Amir Yazdanbakhsh, Junchaeo Chen, and Yu Zheng. We would like to also thank Robert Ormandi, Ebrahim Songhori, Shen Wang, TF-Agents team, Albin Cassirer, Aviral Kumar, James Laudon, John Wilkes, Joe Jiang, Milad Hashemi, Sat Chatterjee, Piotr Stanczyk, Sabela Ramos, Lasse Espeholt, Marcin Michalski, Sam Fishman, Ruoxin Sang, Azalia Mirhosseini, Anna Goldie, and Eric Johnson for their help and support.


1 A Menger cube is a three-dimensional fractal curve, and the inspiration for the name of this system, given that the proposed infrastructure can scale virtually ad infinitum.

Source: Google AI Blog


Imitation Learning in the Low-Data Regime

Reinforcement Learning (RL) is a paradigm for using trial-and-error to train agents to make sequential decisions in complex environments, and it has had great success in a number of domains, including games, robotic manipulation, and chip design. Agents typically aim to maximize the sum of the rewards they collect in an environment, which can be based on a variety of parameters, including speed, curiosity, aesthetics, and more. However, designing a specific RL reward function is a challenge, since it can be hard to specify or may be too sparse. In such cases, imitation learning (IL) methods offer an alternative, as they learn how to solve a task from expert demonstrations rather than from a carefully designed reward function. However, state-of-the-art IL methods rely on adversarial training, which uses min/max optimization procedures, making them algorithmically unstable and difficult to deploy.

In “Primal Wasserstein Imitation Learning” (PWIL), we introduce a new IL method, based on the primal form of the Wasserstein distance, also known as the earth mover’s distance, which does not rely on adversarial training. Using the MuJoCo suite of tasks, we demonstrate the efficacy of the PWIL method by imitating a simulated expert with a limited number of demonstrations (even a single example) and limited interactions with the environment.

Left: Demonstration of the algorithmic Humanoid “expert”, trained on the true reward of the task (which relates to speed). Right: Agent trained using PWIL on the expert demonstration.

Adversarial Imitation Learning
State-of-the-art adversarial IL methods operate similarly to generative adversarial networks (GANs), in which a generator (the policy) is trained to maximize the confusion of a discriminator (the reward) that is itself trained to differentiate between the agent's state-action pairs and the expert's. Adversarial IL methods boil down to a distribution matching problem, i.e., the problem of minimizing a distance between probability distributions in a metric space. However, just like GANs, adversarial IL methods rely on a min/max optimization problem and hence come with a number of training stability challenges.

Imitation Learning as Distribution Matching
The PWIL method is based on the formulation of IL as a distribution matching problem, in this case using the Wasserstein distance. The first step consists of inferring from the demonstrations a state-action distribution of the expert, i.e., the collection of relationships between the actions taken by the expert and the corresponding states of the environment. The goal is then to minimize the distance between the agent's and the expert's state-action distributions through interactions with the environment. Unlike adversarial IL, PWIL is a non-adversarial method, enabling it to bypass the min/max optimization problem and directly minimize the Wasserstein distance between the agent's and the expert's state-action distributions.

Primal Wasserstein Imitation Learning
Computing the exact Wasserstein distance can be restrictive since one must wait until the end of a trajectory of the agent to calculate it, meaning that the rewards can be computed only when the agent is done interacting with the environment. To avoid this restriction, we use an upper bound on the distance instead, from which we can define a reward that we optimize using RL. We show that by doing so, we indeed recover expert behaviour and minimize the Wasserstein distance between the agent and the expert on a number of locomotion tasks of the MuJoCo simulator. While adversarial IL methods use a reward function from a neural network that must be optimized and re-estimated continuously as the agent interacts with the environment, PWIL defines a reward function offline from demonstrations, which does not change and is based on substantially fewer hyperparameters than adversarial IL approaches.
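At its core, this offline reward comes from a greedy coupling between the agent's state-action pairs and the expert's. The numpy sketch below illustrates a simplified version of that idea; the exact weighting scheme and reward scaling used in the paper differ, and the constants alpha and beta here are placeholders.

    import numpy as np

    def greedy_imitation_rewards(agent_sa, expert_sa, alpha=5.0, beta=5.0):
        """Simplified greedy coupling: each agent step is matched to the nearest
        not-yet-used expert state-action pair, and the reward decays with the
        matching cost. agent_sa has shape [T, d] and expert_sa has shape [N, d],
        where each row is a concatenated (state, action) vector."""
        available = np.ones(len(expert_sa), dtype=bool)
        rewards = []
        for sa in agent_sa:
            dists = np.linalg.norm(expert_sa - sa, axis=1)
            dists[~available] = np.inf            # each expert sample is consumed once
            j = int(np.argmin(dists))
            available[j] = False
            rewards.append(alpha * np.exp(-beta * dists[j]))  # small cost -> large reward
        return np.array(rewards)

    # Toy usage: random vectors stand in for (state, action) pairs; an agent close
    # to the expert distribution receives high rewards.
    expert = np.random.randn(100, 8).astype(np.float32)
    agent = expert[:20] + 0.1 * np.random.randn(20, 8).astype(np.float32)
    print(greedy_imitation_rewards(agent, expert)[:5])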

Training curves for PWIL on Humanoid. In green, the Wasserstein distance to the state-action distribution of the expert. In blue, the return (the sum of rewards collected) by the agent.

A Measure of Similarity for the True Imitation Learning Setting
As with numerous challenges in ML, many IL methods are evaluated on synthetic tasks, where one usually has access to the underlying reward function of the task and can measure the similarity between the expert's and the agent's behavior in terms of performance, i.e., the expected sum of rewards. A byproduct of PWIL is the creation of a metric that can compare expert behavior to an agent's behavior for any IL method, without access to the true reward of the task. In this sense, we can use the Wasserstein distance in the true IL setting, not only on synthetic tasks.

Conclusion
In environments where interacting is costly (e.g., a real robot or a complex simulator), PWIL is a prime candidate not only because it can recover expert behavior, but also because the reward function it defines is easy to tune and requires no interaction with the environment to define. This opens multiple opportunities for future exploration, including deployment to real systems, extending PWIL to the setting where we only have access to demonstration states (rather than states and actions), and applying PWIL to vision-based observations.

Acknowledgements
We thank our co-authors, Matthieu Geist and Olivier Pietquin; as well as Zafarali Ahmed, Adrien Ali Taïga, Gabriel Dulac-Arnold, Johan Ferret, Alexis Jacq and Saurabh Kumar for their feedback on the manuscript.

Source: Google AI Blog


Tackling Open Challenges in Offline Reinforcement Learning

Over the past several years, there has been a surge of interest in reinforcement learning (RL) driven by its high-profile successes in game playing and robotic control. However, unlike supervised learning methods, which learn from massive datasets that are collected once and then reused, RL algorithms use a trial-and-error feedback loop that requires active interaction during learning, collecting data every time a new policy is learned. This approach is prohibitive in many real-world settings, such as healthcare, autonomous driving, and dialogue systems, where trial-and-error data collection can be costly, time consuming, or even irresponsible. Even for problems where some active data collection can be used, the requirement for interactive collection limits dataset size and diversity.

Offline RL (also called batch RL or fully off-policy RL) relies solely on a previously collected dataset, without further interaction. It provides a way to utilize previously collected datasets — from previous RL experiments, from human demonstrations, and from hand-engineered exploration strategies — in order to automatically learn decision-making strategies. While off-policy RL algorithms can, in principle, be used in the offline setting (fully off-policy), they are generally only successful when used with active environment interaction; without receiving this direct feedback, they often exhibit undesirable performance in practice. Consequently, while offline RL has enormous potential, that potential cannot be reached without resolving significant algorithmic challenges.

In “Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems”, we provide a comprehensive tutorial on approaches for tackling the challenges of offline RL and discuss the many issues that remain. To address these issues, we have designed and released an open-source benchmarking framework, Datasets for Deep Data-Driven Reinforcement Learning (D4RL), as well as a new, simple, and highly effective offline RL algorithm, called conservative Q-learning (CQL).

Benchmarks for Offline RL
In order to understand the capabilities of current approaches and to guide future progress, it is first necessary to have effective benchmarks. A common choice in prior work was to simply use data generated by a successful online RL run. However, while simple, this data collection approach is artificial, because it involves training an online RL agent, which, as discussed above, is prohibitive in many real-world settings. Instead, one wishes to learn a policy that is better than the current best from diverse data sources that provide good coverage of the task. For example, one might have data collected from a hand-designed controller of a robot arm, and use offline RL to train an improved controller. To enable progress in this field under realistic settings, one needs a benchmark suite that accurately reflects these settings, while being simple and accessible enough to enable rapid experimentation.

D4RL provides standardized environments, datasets and evaluation protocols, as well as reference scores for recent algorithms to help accomplish this. This is a “batteries-included” resource, making it ideal for anyone to jump in and get started with minimal fuss.
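As a concrete example, loading one of the datasets takes only a few lines; the snippet below mirrors the usage pattern documented in the D4RL repository, though the exact environment names and versions may have changed since this post.

    import gym
    import d4rl  # importing d4rl registers the offline environments with gym

    # Create a D4RL environment; the offline dataset is bundled with it.
    env = gym.make('maze2d-umaze-v1')

    # The raw dataset contains observations, actions, rewards, terminals, etc.
    dataset = env.get_dataset()
    print(dataset['observations'].shape)

    # A convenience view with (s, a, r, s') tuples, ready for Q-learning-style methods.
    qlearning_data = d4rl.qlearning_dataset(env)
    print(qlearning_data.keys())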

Environments in D4RL

The key design goal for D4RL was to develop tasks that reflect both real-world dataset challenges as well as real-world applications. Previous datasets used data collected either from random agents or agents trained with RL. Instead, by thinking through potential applications in autonomous driving, robotics, and other domains, we considered how real-world applications of offline RL might require handling of data generated from human demonstrations or hard-coded controllers, data collected from heterogeneous sources, and data collected by agents with a variety of different goals.

Aside from the widely used MuJoCo locomotion tasks, D4RL includes datasets for more complex tasks. The Adroit domain, for example, which requires manipulating a realistic robotic hand to use a hammer, illustrates the challenges of working with a limited number of human demonstrations, without which these tasks are extremely challenging. Previous work found that existing datasets could not distinguish between competing methods, whereas the Adroit domain reveals clear differences between them.

Another common scenario for real-world tasks is one in which the dataset used for training is collected from agents performing a wide range of other activities that are related to, but not specifically targeted towards, the task of interest. For example, data from human drivers may illustrate how to drive a car well, but do not necessarily show how to reach a specific desired destination. In this case, one might like offline RL methods to “stitch” together parts of routes in the driving dataset to accomplish a task that was not actually seen in the data (i.e., navigation). As an illustrative example, given paths labeled “A” and “B” in the picture below, offline RL should be able to “remix” them to produce path C.

Having observed only paths A and B, an offline RL method should be able to combine them to form the shortest path (C).

We constructed a series of increasingly difficult tasks to exercise this “stitching” ability. The maze environments, shown below, require two robots (a simple ball or an “Ant” robot) to navigate to locations in a series of mazes.

Maze navigation environments in D4RL, which require “stitching” parts of paths to accomplish new navigational goals that were not seen in the dataset.

A more complex “stitching” scenario is provided by the Franka kitchen domain (based on the Adept environment), where demonstrations from humans using a VR interface comprise a multi-task dataset, and offline RL methods must again “remix” this data.

The “Franka kitchen” domain requires using data from human demonstrators performing a variety of different tasks in a simulated kitchen.

Finally, D4RL includes two tasks that are meant to more accurately reflect potential realistic applications of offline RL, both based on existing driving simulators. One is a first-person driving dataset that utilizes the widely used CARLA simulator developed at Intel, which provides photo-realistic images in realistic driving domains, and the other is a dataset from the Flow traffic control simulator (from UC Berkeley), which requires controlling autonomous vehicles to facilitate effective traffic flow.

D4RL includes datasets based on existing realistic simulators for driving with CARLA (left) and traffic management with Flow (right).

We have packaged these tasks and standardized datasets into an easy-to-use Python package to accelerate research. Furthermore, we provide benchmark numbers for all tasks using relevant prior methods (BC, SAC, BEAR, BRAC, AWR, BCQ) in order to provide baselines for new approaches. We are not the first to propose a benchmark for offline RL: a number of prior works have proposed simple datasets based on running RL algorithms, and several more recent works have proposed datasets with image observations and other features. However, we believe that the more realistic dataset composition in D4RL makes it an effective way to drive progress in the field.

An Improved Algorithm for Offline RL
As we developed the benchmark tasks, we found that existing methods could not solve the more challenging ones. The central challenge arises from distributional shift: in order to improve over the historical data, offline RL algorithms must learn to make decisions that differ from the decisions taken in the dataset. However, this can lead to problems when the consequences of a seemingly good decision cannot be deduced from the data — if no agent has taken this particular turn in the maze, how does one know whether it leads to the goal or not? Without handling this distributional shift problem, offline RL methods can extrapolate erroneously, drawing over-optimistic conclusions about the outcomes of rarely seen actions. Contrast this with the online setting, where reward bonuses modeled after curiosity and surprise optimistically bias the agent to explore all potentially rewarding paths. Because the agent receives interactive feedback, if an action turns out to be unrewarding, it can simply avoid that path in the future.

To address this, we developed conservative Q-learning (CQL), an offline RL algorithm designed to guard against overestimation while avoiding explicit construction of a separate behavior model and without using importance weights. While standard Q-learning (and actor-critic) methods bootstrap from previous estimates, CQL is unique in that it is fundamentally a pessimistic algorithm: it assumes that if a good outcome was not seen for a given action, that action is likely to not be a good one. The central idea of CQL is to learn a lower bound on the policy’s expected return (called the Q-function), instead of learning to approximate the expected return. If we then optimize our policy under this conservative Q-function, we can be confident that its value is no lower than this estimate, preventing errors from overestimation.
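The conservative term can be written down compactly. The sketch below follows the commonly used CQL(H) form for discrete actions, where the penalty is the log-sum-exp of the Q-values over all actions minus the Q-value of the action actually taken in the dataset; the network sizes, the penalty weight alpha, and the discrete-action simplification are our own choices for illustration, not the paper's exact configuration.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def cql_loss(q_net, target_q_net, batch, gamma=0.99, alpha=1.0):
        """Standard TD error plus a conservative penalty that pushes Q-values down
        on all actions relative to the action actually taken in the dataset."""
        s, a, r, s_next, done = batch  # shapes: [B, obs_dim], [B], [B], [B, obs_dim], [B]

        q_all = q_net(s)                                     # [B, num_actions]
        q_taken = q_all.gather(1, a.unsqueeze(1)).squeeze(1)

        with torch.no_grad():
            next_q = target_q_net(s_next).max(dim=1).values
            target = r + gamma * (1.0 - done) * next_q
        td_error = F.mse_loss(q_taken, target)

        # CQL(H) penalty: logsumexp over all actions minus the dataset action's Q-value.
        conservative = (torch.logsumexp(q_all, dim=1) - q_taken).mean()

        return td_error + alpha * conservative

    # Toy usage with a small Q-network: 4-dimensional observations, 3 discrete actions.
    obs_dim, num_actions, batch_size = 4, 3, 32
    q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, num_actions))
    target_q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, num_actions))
    batch = (torch.randn(batch_size, obs_dim),
             torch.randint(0, num_actions, (batch_size,)),
             torch.randn(batch_size),
             torch.randn(batch_size, obs_dim),
             torch.zeros(batch_size))
    loss = cql_loss(q_net, target_q_net, batch)
    loss.backward()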

We found that CQL attains state-of-the-art results on many of the harder D4RL tasks: CQL outperformed other approaches on the AntMaze, Kitchen tasks, and 6 out of 8 Adroit tasks. In particular, on the AntMaze tasks, which require navigating through a maze with an “Ant” robot, CQL is often the only algorithm that is able to learn non-trivial policies. CQL also performs well on other tasks, including Atari games. On the Atari tasks from Agarwal et al., CQL outperforms prior methods when data is limited (“1%” dataset). Moreover, CQL is simple to implement on top of existing algorithms (e.g., QR-DQN and SAC), without training additional neural networks.

Performance of CQL on Atari games with the 1% dataset from Agarwal et al.

Future Thoughts
We are excited about the fast-moving field of offline RL. While we took a first step towards a standard benchmark, there is clearly still room for improvement. We expect that as algorithms improve, we will need to reevaluate the tasks in the benchmark and develop more challenging tasks. We look forward to working with the community to evolve the benchmark and evaluation protocols. Together, we can bring the rich promises of offline RL to real-world applications.

Acknowledgements
This work was carried out in collaboration with UC Berkeley PhD students Aviral Kumar, Justin Fu, and Aurick Zhou, with contributions from Ofir Nachum from Google Research.

Source: Google AI Blog


A Simulation Suite for Tackling Applied Reinforcement Learning Challenges

Reinforcement Learning (RL) has proven to be effective in solving numerous complex problems ranging from Go, StarCraft and Minecraft to robot locomotion and chip design. In each of these cases, a simulator is available or the real environment is quick and inexpensive to access. Yet, there are still considerable challenges to deploying RL to real-world products and systems. For example, in physical control systems, such as robotics and autonomous driving, RL controllers are trained to solve tasks like grasping objects or driving on a highway. These controllers are susceptible to effects such as sensor noise, system delays, or normal wear-and-tear that can reduce the quality of input to the controller, leading to incorrect decision-making and potentially catastrophic failures.

A physical control system: Robots learning how to grasp and sort objects using RL at the Everyday Robot Project at X. These types of systems are subject to many of the real-world challenges detailed here.

In “Challenges of Real-World Reinforcement Learning”, we identify and discuss nine different challenges that hinder the application of current RL algorithms to applied systems. We then follow up this work with an empirical investigation in which we simulated versions of these challenges on state-of-the-art RL algorithms, and benchmark the effects of each. We have open-sourced these simulated challenges in the Real-World RL (RWRL) task suite to help draw attention to these important issues, as well as accelerate research toward solving them.

The RWRL Suite
The RWRL suite is a set of simulated tasks inspired by applied reinforcement learning challenges, the goal of which is to enable fast algorithmic iteration for both researchers and practitioners, without having to run slow, expensive experiments on real systems. While there will be additional challenges in transitioning from RL algorithms trained in simulation to real-world applications, this suite intends to close some of the more fundamental algorithmic gaps. At present, RWRL supports a subset of the DeepMind Control Suite domains, but the goal is to broaden the suite to support an even more diverse set of domains.

Easy-to-Use & Flexible
We designed the suite with two main goals in mind. (1) It should be easy to use — a user should be able to start running experiments within minutes of downloading the suite, simply by changing a few lines of code. (2) It should be flexible — a user should be able to incorporate any combination of challenges into the environment with very little effort.

A Delayed Action Example
To illustrate the ease of use of the RWRL suite, imagine a researcher or practitioner wants to implement action delays (i.e., temporal delays on actions being sent to the environment). To use the RWRL suite, simply import the rwrl module. Next, load an environment (e.g., cartpole) with the delay_spec argument. This optional argument is a dictionary that configures which elements (actions, observations, or rewards) are delayed and by how many timesteps (e.g., 20 timesteps). Once the environment is loaded, the effects of actions are automatically delayed without any other changes to the experiment, as sketched below. This makes it easy to test an RL algorithm with action delays in a range of different environments supported by the RWRL suite.
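A minimal sketch of this workflow, based on the open-source realworldrl_suite package, is shown below; the specific values (the cartpole swing-up task and a 20-timestep action delay) are illustrative, and the argument names should be checked against the current codebase.

    import realworldrl_suite.environments as rwrl

    # Load cartpole with a 20-timestep delay applied to actions only.
    env = rwrl.load(
        domain_name='cartpole',
        task_name='realworld_swingup',
        delay_spec=dict(enable=True, actions=20),
        environment_kwargs=dict(flat_observation=True))

    # The environment follows the dm_env interface: reset, then step with actions.
    timestep = env.reset()
    action = env.action_spec().generate_value()
    timestep = env.step(action)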

A high-level overview of the RWRL suite. Add a challenge (e.g., action delays) to the environment with a few lines of code, run a hyperparameter sweep, and produce the graph shown on the right.

A user can combine different challenges or choose from a set of predefined benchmark challenges by simply adding additional arguments to the load function, all of which are specified in the open-source RWRL suite codebase.
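For instance, one of the predefined combined benchmark challenges can be selected with a single additional argument; the combined_challenge argument name below reflects our reading of the open-source codebase and should be verified against the repository.

    import realworldrl_suite.environments as rwrl

    # Load the 'easy' combined benchmark challenge for cartpole (assumed argument name).
    env = rwrl.load(
        domain_name='cartpole',
        task_name='realworld_swingup',
        combined_challenge='easy',
        environment_kwargs=dict(flat_observation=True))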

Supported Challenges
The RWRL suite provides functionality to support experiments related to eight of the nine different challenges that make applying current RL algorithms on applied systems difficult: sample efficiency; system delays; high-dimensional state and action spaces; constraints; partial observability, stochasticity and non-stationarity; multiple objectives; real-time inference; and training from offline logs. RWRL excludes the explainability challenge, which is abstract and non-trivial to define. The supported experiments are non-exhaustive and provide researchers and practitioners with the ability to analyze the capabilities of their agent with respect to each challenge dimension. Examples of the supported challenges include:

  • System Delays
    Most real systems have delays in either sensing, actuation, or reward feedback, all of which can be configured and applied to any task within the RWRL suite. The graphs below show the performance of a D4PG agent as actions (left), observations (middle), and rewards (right) are increasingly delayed.

    The effect of increasing the action (left), observation (middle), and reward (right) delays, respectively, on a state-of-the-art RL agent in four MuJoCo domains.

    As can be seen in the graphs, a researcher or practitioner can quickly gain insights as to which type of delay affects their agent’s performance. These delays can also be combined together to observe their combined effect.

  • Constraints
    Almost all applied systems have some form of constraints embedded into the overall objective, which is not common in most RL environments. The RWRL suite implements a series of constraints for each task, with varying difficulties, to facilitate research in constrained RL. An example of a complex local angular velocity constraint being violated is visualized in the video below.
    An example of constraint violations for cartpole. The red screen indicates that a violation has occurred on localized angular velocity.
  • Non-Stationarity
    The user can introduce non-stationarity by perturbing environment parameters. These perturbations are in contrast to the pixel level adversarial perturbations that have recently gained popularity in research on supervised deep learning. For example, in the human walker domain, the size of the head and friction of the ground can be modified throughout training to simulate changing conditions. A variety of schedulers are available in the RWRL suite (see our codebase for more details), along with multiple default parameter perturbations, which were carefully defined to handicap the learning capabilities of state-of-the-art learning algorithms.
    Non-stationary perturbations. The suite supports perturbing environment parameters across episodes such as changing head size (center) and contact friction (right).
  • Training from Offline Log Data
    In most applied systems, it is both slow and expensive to run experiments. There are often logs of data available from previous experiments that can be utilized to train a policy. However, it is often difficult to outperform the previous model in production due to the data being limited, of low variance, or of poor quality. To address this, we have generated offline datasets of the combined RWRL benchmark challenges, which we made available as part of a wider offline dataset release. More information can be found in this notebook.

Conclusion
Real systems rarely manifest only a single challenge, and we are excited to see how algorithms can deal with an environment in which multiple challenges are combined at increasing levels of difficulty (‘Easy’, ‘Medium’ and ‘Hard’). We highly encourage the research community to try to solve these challenges, as we believe that solving them will facilitate more widespread application of RL to products and real-world systems.

While the initial set of RWRL suite features and experiments provide a starting point for closing the gap between the current state of RL and the challenges of applied systems, there is still much work to do. The supported experiments are not exhaustive and we welcome new ideas from the wider community to better evaluate the capabilities of our RL agents. Our main goal with this suite is to highlight and encourage research on the core problems that limit the effectiveness of RL algorithms in applied products and systems and to accelerate progress towards enabling future RL applications.

Acknowledgements
We would like to thank our core contributor and co-author Nir Levine for his invaluable help. We would also like to thank our co-authors Jerry Li, Sven Gowal, Todd Hester and Cosmin Paduraru as well as Robert Dadashi, the ACME team, Dan A. Calian, Juliet Rothenberg and Timothy Mann for their contributions.

Source: Google AI Blog