Tag Archives: Robotics

RT-1: Robotics Transformer for Real-World Control at Scale

Major recent advances in multiple subfields of machine learning (ML) research, such as computer vision and natural language processing, have been enabled by a shared common approach that leverages large, diverse datasets and expressive models that can absorb all of the data effectively. Although there have been various attempts to apply this approach to robotics, robots have not yet leveraged highly-capable models as well as other subfields.

Several factors contribute to this challenge. First, there’s the lack of large-scale and diverse robotic data, which limits a model’s ability to absorb a broad set of robotic experiences. Data collection is particularly expensive and challenging for robotics because dataset curation requires engineering-heavy autonomous operation, or demonstrations collected using human teleoperations. A second factor is the lack of expressive, scalable, and fast-enough-for-real-time-inference models that can learn from such datasets and generalize effectively.

To address these challenges, we propose the Robotics Transformer 1 (RT-1), a multi-task model that tokenizes robot inputs and outputs actions (e.g., camera images, task instructions, and motor commands) to enable efficient inference at runtime, which makes real-time control feasible. This model is trained on a large-scale, real-world robotics dataset of 130k episodes that cover 700+ tasks, collected using a fleet of 13 robots from Everyday Robots (EDR) over 17 months. We demonstrate that RT-1 can exhibit significantly improved zero-shot generalization to new tasks, environments and objects compared to prior techniques. Moreover, we carefully evaluate and ablate many of the design choices in the model and training set, analyzing the effects of tokenization, action representation, and dataset composition. Finally, we’re open-sourcing the RT-1 code, and hope it will provide a valuable resource for future research on scaling up robot learning.

RT-1 absorbs large amounts of data, including robot trajectories with multiple tasks, objects and environments, resulting in better performance and generalization.


Robotics Transformer (RT-1)



RT-1 is built on a transformer architecture that takes a short history of images from a robot’s camera along with task descriptions expressed in natural language as inputs and directly outputs tokenized actions.

RT-1’s architecture is similar to that of a contemporary decoder-only sequence model trained against a standard categorical cross-entropy objective with causal masking. Its key features include: image tokenization, action tokenization, and token compression, described below.

Image tokenization: We pass images through an EfficientNet-B3 model that is pre-trained on ImageNet, and then flatten the resulting 9×9×512 spatial feature map to 81 tokens. The image tokenizer is conditioned on natural language task instructions, and uses FiLM layers initialized to identity to extract task-relevant image features early on.

Action tokenization: The robot’s action dimensions are 7 variables for arm movement (x, y, z, roll, pitch, yaw, gripper opening), 3 variables for base movement (x, y, yaw), and an extra discrete variable to switch between three modes: controlling arm, controlling base, or terminating the episode. Each action dimension is discretized into 256 bins.

Token Compression: The model adaptively selects soft combinations of image tokens that can be compressed based on their impact towards learning with the element-wise attention module TokenLearner, resulting in over 2.4x inference speed-up.

RT-1’s architecture: The model takes a text instruction and set of images as inputs, encodes them as tokens via a pre-trained FiLM EfficientNet model and compresses them via TokenLearner. These are then fed into the Transformer, which outputs action tokens.

To build a system that could generalize to new tasks and show robustness to different distractors and backgrounds, we collected a large, diverse dataset of robot trajectories. We used 13 EDR robot manipulators, each with a 7-degree-of-freedom arm, a 2-fingered gripper, and a mobile base, to collect 130k episodes over 17 months. We used demonstrations provided by humans through remote teleoperation, and annotated each episode with a textual description of the instruction that the robot just performed. The set of high-level skills represented in the dataset includes picking and placing items, opening and closing drawers, getting items in and out drawers, placing elongated items up-right, knocking objects over, pulling napkins and opening jars. The resulting dataset includes 130k+ episodes that cover 700+ tasks using many different objects.


Experiments and Results

To better understand RT-1’s generalization abilities, we study its performance against three baselines: Gato, BC-Z and BC-Z XL (i.e., BC-Z with same number of parameters as RT-1), across four categories:

  1. Seen tasks performance: performance on tasks seen during training
  2. Unseen tasks performance: performance on unseen tasks where the skill and object(s) were seen separately in the training set, but combined in novel ways
  3. Robustness (distractors and backgrounds): performance with distractors (up to 9 distractors and occlusion) and performance with background changes (new kitchen, lighting, background scenes)
  4. Long-horizon scenarios: execution of SayCan-type natural language instructions in a real kitchen

RT-1 outperforms baselines by large margins in all four categories, exhibiting impressive degrees of generalization and robustness.

Performance of RT-1 vs. baselines on evaluation scenarios.


Incorporating Heterogeneous Data Sources

To push RT-1 further, we train it on data gathered from another robot to test if (1) the model retains its performance on the original tasks when a new data source is presented and (2) if the model sees a boost in generalization with new and different data, both of which are desirable for a general robot learning model. Specifically, we use 209k episodes of indiscriminate grasping that were autonomously collected on a fixed-base Kuka arm for the QT-Opt project. We transform the data collected to match the action specs and bounds of our original dataset collected with EDR, and label every episode with the task instruction “pick anything” (the Kuka dataset doesn’t have object labels). Kuka data is then mixed with EDR data in a 1:2 ratio in every training batch to control for regression in original EDR skills.

Training methodology when data has been collected from multiple robots.

Our results indicate that RT-1 is able to acquire new skills by observing other robots’ experiences. In particular, the 22% accuracy seen when training with EDR data alone jumps by almost 2x to 39% when RT-1 is trained on both bin-picking data from Kuka and existing EDR data from robot classrooms, where we collected most of RT-1 data. When training RT-1 on bin-picking data from Kuka alone, and then evaluating it on bin-picking from the EDR robot, we see 0% accuracy. Mixing data from both robots, on the other hand, allows RT-1 to infer the actions of the EDR robot when faced with the states observed by Kuka, without explicit demonstrations of bin-picking on the EDR robot, and by taking advantage of experiences collected by Kuka. This presents an opportunity for future work to combine more multi-robot datasets to enhance robot capabilities.


Training Data Classroom Eval      Bin-picking Eval
Kuka bin-picking data + EDR data 90% 39%
EDR only data 92% 22%
Kuka bin-picking only data 0 0

RT-1 accuracy evaluation using various training data.


Long-Horizon SayCan Tasks

RT-1’s high performance and generalization abilities can enable long-horizon, mobile manipulation tasks through SayCan. SayCan works by grounding language models in robotic affordances, and leveraging few-shot prompting to break down a long-horizon task expressed in natural language into a sequence of low-level skills.

SayCan tasks present an ideal evaluation setting to test various features:

  1. Long-horizon task success falls exponentially with task length, so high manipulation success is important.
  2. Mobile manipulation tasks require multiple handoffs between navigation and manipulation, so the robustness to variations in initial policy conditions (e.g., base position) is essential.
  3. The number of possible high-level instructions increases combinatorially with skill-breadth of the manipulation primitive.

We evaluate SayCan with RT-1 and two other baselines (SayCan with Gato and SayCan with BC-Z) in two real kitchens. Below, “Kitchen2” constitutes a much more challenging generalization scene than “Kitchen1”. The mock kitchen used to gather most of the training data was modeled after Kitchen1.

SayCan with RT-1 achieves a 67% execution success rate in Kitchen1, outperforming other baselines. Due to the generalization difficulty presented by the new unseen kitchen, the performance of SayCan with Gato and SayCan with BCZ shapely falls, while RT-1 does not show a visible drop.


 SayCan tasks in Kitchen1    SayCan tasks in Kitchen2
Planning Execution Planning Execution
Original Saycan 73 47 - -
SayCan w/ Gato 87 33 87 0
SayCan w/ BC-Z 87 53 87 13
SayCan w/ RT-1 87 67 87 67

The following video shows a few example PaLM-SayCan-RT1 executions of long-horizon tasks in multiple real kitchens.




Conclusion

The RT-1 Robotics Transformer is a simple and scalable action-generation model for real-world robotics tasks. It tokenizes all inputs and outputs, and uses a pre-trained EfficientNet model with early language fusion, and a token learner for compression. RT-1 shows strong performance across hundreds of tasks, and extensive generalization abilities and robustness in real-world settings.

As we explore future directions for this work, we hope to scale the number of robot skills faster by developing methods that allow non-experts to train the robot with directed data collection and model prompting. We also look forward to improving robotics transformers’ reaction speeds and context retention with scalable attention and memory. To learn more, check out the paper, open-sourced RT-1 code, and the project website.


Acknowledgements

This work was done in collaboration with Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Tomas Jackson, Sally Jesmonth, Nikhil Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Isabel Leal, Kuang-Huei Lee, Sergey Levine, Yao Lu, Utsav Malla, Deeksha Manjunath, Igor Mordatch, Ofir Nachum, Carolina Parada, Jodilyn Peralta, Emily Perez, Karl Pertsch, Jornell Quiambao, Kanishka Rao, Michael Ryoo, Grecia Salazar, Pannag Sanketi, Kevin Sayed, Jaspiar Singh, Sumedh Sontakke, Austin Stone, Clayton Tan, Huong Tran, Vincent Vanhoucke, Steve Vega, Quan Vuong, Fei Xia, Ted Xiao, Peng Xu, Sichun Xu, Tianhe Yu, and Brianna Zitkovich.

Source: Google AI Blog


Talking to Robots in Real Time

A grand vision in robot learning, going back to the SHRDLU experiments in the late 1960s, is that of helpful robots that inhabit human spaces and follow a wide variety of natural language commands. Over the last few years, there have been significant advances in the application of machine learning (ML) for instruction following, both in simulation and in real world systems. Recent Palm-SayCan work has produced robots that leverage language models to plan long-horizon behaviors and reason about abstract goals. Code as Policies has shown that code-generating language models combined with pre-trained perception systems can produce language conditioned policies for zero shot robot manipulation. Despite this progress, an important missing property of current "language in, actions out" robot learning systems is real time interaction with humans.

Ideally, robots of the future would react in real time to any relevant task a user could describe in natural language. Particularly in open human environments, it may be important for end users to customize robot behavior as it is happening, offering quick corrections ("stop, move your arm up a bit") or specifying constraints ("nudge that slowly to the right"). Furthermore, real-time language could make it easier for people and robots to collaborate on complex, long-horizon tasks, with people iteratively and interactively guiding robot manipulation with occasional language feedback.

The challenges of open-vocabulary language following. To be successfully guided through a long horizon task like "put all the blocks in a vertical line", a robot must respond precisely to a wide variety of commands, including small corrective behaviors like "nudge the red circle right a bit".

However, getting robots to follow open vocabulary language poses a significant challenge from a ML perspective. This is a setting with an inherently large number of tasks, including many small corrective behaviors. Existing multitask learning setups make use of curated imitation learning datasets or complex reinforcement learning (RL) reward functions to drive the learning of each task, and this significant per-task effort is difficult to scale beyond a small predefined set. Thus, a critical open question in the open vocabulary setting is: how can we scale the collection of robot data to include not dozens, but hundreds of thousands of behaviors in an environment, and how can we connect all these behaviors to the natural language an end user might actually provide?

In Interactive Language, we present a large scale imitation learning framework for producing real-time, open vocabulary language-conditionable robots. After training with our approach, we find that an individual policy is capable of addressing over 87,000 unique instructions (an order of magnitude larger than prior works), with an estimated average success rate of 93.5%. We are also excited to announce the release of Language-Table, the largest available language-annotated robot dataset, which we hope will drive further research focused on real-time language-controllable robots.




Guiding robots with real time language.

Real Time Language-Controllable Robots

Key to our approach is a scalable recipe for creating large, diverse language-conditioned robot demonstration datasets. Unlike prior setups that define all the skills up front and then collect curated demonstrations for each skill, we continuously collect data across multiple robots without scene resets or any low-level skill segmentation. All data, including failure data (e.g., knocking blocks off a table), goes through a hindsight language relabeling process to be paired with text. Here, annotators watch long robot videos to identify as many behaviors as possible, marking when each began and ended, and use freeform natural language to describe each segment. Importantly, in contrast to prior instruction following setups, all skills used for training emerge bottom up from the data itself rather than being determined upfront by researchers.

Our learning approach and architecture are intentionally straightforward. Our robot policy is a cross-attention transformer, mapping 5hz video and text to 5hz robot actions, using a standard supervised learning behavioral cloning objective with no auxiliary losses. At test time, new spoken commands can be sent to the policy (via speech-to-text) at any time up to 5hz.

Interactive Language: an imitation learning system for producing real time language-controllable robots.

Open Source Release: Language-Table Dataset and Benchmark

This annotation process allowed us to collect the Language-Table dataset, which contains over 440k real and 180k simulated demonstrations of the robot performing a language command, along with the sequence of actions the robot took during the demonstration. This is the largest language-conditioned robot demonstration dataset of its kind, by an order of magnitude. Language-Table comes with a simulated imitation learning benchmark that we use to perform model selection, which can be used to evaluate new instruction following architectures or approaches.


Dataset # Trajectories (k)     # Unique (k)     Physical Actions     Real     Available
Episodic Demonstrations
BC-Z 25 0.1
SayCan 68 0.5
Playhouse 1,097 779
Hindsight Language Labeling
BLOCKS 30 n/a
LangLFP 10 n/a
LOREL 6 1.7
CALVIN 20 0.4
Language-Table (real + sim) 623 (442+181) 206 (127+79)

We compare Language-Table to existing robot datasets, highlighting proportions of simulated (red) or real (blue) robot data, the number of trajectories collected, and the number of unique language describable tasks.

Learned Real Time Language Behaviors

Examples of short horizon instructions the robot is capable of following, sampled randomly from the full set of over 87,000.

Short-Horizon Instruction Success
(87,000 more…)
push the blue triangle to the top left corner    80.0%
separate the red star and red circle 100.0%
nudge the yellow heart a bit right 80.0%
place the red star above the blue cube 90.0%
point your arm at the blue triangle 100.0%
push the group of blocks left a bit 100.0%
Average over 87k, CI 95% 93.5% +- 3.42%

95% Confidence interval (CI) on the average success of an individual Interactive Language policy over 87,000 unique natural language instructions.

We find that interesting new capabilities arise when robots are able to follow real time language. We show that users can walk robots through complex long-horizon sequences using only natural language to solve for goals that require multiple minutes of precise, coordinated control (e.g., "make a smiley face out of the blocks with green eyes" or "place all the blocks in a vertical line"). Because the robot is trained to follow open vocabulary language, we see it can react to a diverse set of verbal corrections (e.g., "nudge the red star slightly right") that might otherwise be difficult to enumerate up front.

Examples of long horizon goals reached under real time human language guidance.

Finally, we see that real time language allows for new modes of robot data collection. For example, a single human operator can control four robots simultaneously using only spoken language. This has the potential to scale up the collection of robot data in the future without requiring undivided human attention for each robot.

One operator controlling multiple robots at once with spoken language.

Conclusion

While currently limited to a tabletop with a fixed set of objects, Interactive Language shows initial evidence that large scale imitation learning can indeed produce real time interactable robots that follow freeform end user commands. We open source Language-Table, the largest language conditioned real-world robot demonstration dataset of its kind and an associated simulated benchmark, to spur progress in real time language control of physical robots. We believe the utility of this dataset may not only be limited to robot control, but may provide an interesting starting point for studying language- and action-conditioned video prediction, robot video-conditioned language modeling, or a host of other interesting active questions in the broader ML context. See our paper and GitHub page to learn more.


Acknowledgements

We would like to thank everyone who supported this research. This includes robot teleoperators: Alex Luong, Armando Reyes, Elio Prado, Eric Tran, Gavin Gonzalez, Jodexty Therlonge, Joel Magpantay, Rochelle Dela Cruz, Samuel Wan, Sarah Nguyen, Scott Lehrer, Norine Rosales, Tran Pham, Kyle Gajadhar, Reece Mungal, and Nikauleene Andrews; robot hardware support and teleoperation coordination: Sean Snyder, Spencer Goodrich, Cameron Burns, Jorge Aldaco, Jonathan Vela; data operations and infrastructure: Muqthar Mohammad, Mitta Kumar, Arnab Bose, Wayne Gramlich; and the many who helped provide language labeling of the datasets. We would also like to thank Pierre Sermanet, Debidatta Dwibedi, Michael Ryoo, Brian Ichter and Vincent Vanhoucke for their invaluable advice and support.

Source: Google AI Blog


Robots That Write Their Own Code

A common approach used to control robots is to program them with code to detect objects, sequencing commands to move actuators, and feedback loops to specify how the robot should perform a task. While these programs can be expressive, re-programming policies for each new task can be time consuming, and requires domain expertise.

What if when given instructions from people, robots could autonomously write their own code to interact with the world? It turns out that the latest generation of language models, such as PaLM, are capable of complex reasoning and have also been trained on millions of lines of code. Given natural language instructions, current language models are highly proficient at writing not only generic code but, as we’ve discovered, code that can control robot actions as well. When provided with several example instructions (formatted as comments) paired with corresponding code (via in-context learning), language models can take in new instructions and autonomously generate new code that re-composes API calls, synthesizes new functions, and expresses feedback loops to assemble new behaviors at runtime. More broadly, this suggests an alternative approach to using machine learning for robots that (i) pursues generalization through modularity and (ii) leverages the abundance of open-source code and data available on the Internet.

Given code for an example task (left), language models can re-compose API calls to assemble new robot behaviors for new tasks (right) that use the same functions but in different ways.

To explore this possibility, we developed Code as Policies (CaP), a robot-centric formulation of language model-generated programs executed on physical systems. CaP extends our prior work, PaLM-SayCan, by enabling language models to complete even more complex robotic tasks with the full expression of general-purpose Python code. With CaP, we propose using language models to directly write robot code through few-shot prompting. Our experiments demonstrate that outputting code led to improved generalization and task performance over directly learning robot tasks and outputting natural language actions. CaP allows a single system to perform a variety of complex and varied robotic tasks without task-specific training.




We demonstrate, across several robot systems, including a robot from Everyday Robots, that language models can autonomously interpret language instructions to generate and execute CaPs that represent reactive low-level policies (e.g., proportional-derivative or impedance controllers) and waypoint-based policies (e.g., vision-based pick and place, trajectory-based control).

A Different Way to Think about Robot Generalization

To generate code for a new task given natural language instructions, CaP uses a code-writing language model that, when prompted with hints (i.e., import statements that inform which APIs are available) and examples (instruction-to-code pairs that present few-shot "demonstrations" of how instructions should be converted into code), writes new code for new instructions. Central to this approach is hierarchical code generation, which prompts language models to recursively define new functions, accumulate their own libraries over time, and self-architect a dynamic codebase. Hierarchical code generation improves state-of-the-art on both robotics as well as standard code-gen benchmarks in natural language processing (NLP) subfields, with 39.8% pass@1 on HumanEval, a benchmark of hand-written coding problems used to measure the functional correctness of synthesized programs.

Code-writing language models can express a variety of arithmetic operations and feedback loops grounded in language. Pythonic language model programs can use classic logic structures, e.g., sequences, selection (if/else), and loops (for/while), to assemble new behaviors at runtime. They can also use third-party libraries to interpolate points (NumPy), analyze and generate shapes (Shapely) for spatial-geometric reasoning, etc. These models not only generalize to new instructions, but they can also translate precise values (e.g., velocities) to ambiguous descriptions ("faster" and "to the left") depending on the context to elicit behavioral commonsense.

Code as Policies uses code-writing language models to map natural language instructions to robot code to complete tasks. Generated code can call existing perception action APIs, third party libraries, or write new functions at runtime.

CaP generalizes at a specific layer in the robot: interpreting natural language instructions, processing perception outputs (e.g., from off-the-shelf object detectors), and then parameterizing control primitives. This fits into systems with factorized perception and control, and imparts a degree of generalization (acquired from pre-trained language models) without the magnitude of data collection needed for end-to-end robot learning. CaP also inherits language model capabilities that are unrelated to code writing, such as supporting instructions with non-English languages and emojis.

CaP inherits the capabilities of language models, such as multilingual and emoji support.

By characterizing the types of generalization encountered in code generation problems, we can also study how hierarchical code generation improves generalization. For example, "systematicity" evaluates the ability to recombine known parts to form new sequences, "substitutivity" evaluates robustness to synonymous code snippets, while "productivity" evaluates the ability to write policy code longer than those seen in the examples (e.g., for new long horizon tasks that may require defining and nesting new functions). Our paper presents a new open-source benchmark to evaluate language models on a set of robotics-related code generation problems. Using this benchmark, we find that, in general, bigger models perform better across most metrics, and that hierarchical code generation improves "productivity" generalization the most.

Performance on our RoboCodeGen Benchmark across different generalization types. The larger model (Davinci) performs better than the smaller model (Cushman), with hierarchical code generation improving productivity the most.

We're also excited about the potential for code-writing models to express cross-embodied plans for robots with different morphologies that perform the same task differently depending on the available APIs (perception action spaces), which is an important aspect of any robotics foundation model.

Language model code-generation exhibits cross-embodiment capabilities, completing the same task in different ways depending on the available APIs (that define perception action spaces).

Limitations

Code as policies today are restricted by the scope of (i) what the perception APIs can describe (e.g., few visual-language models to date can describe whether a trajectory is "bumpy" or "more C-shaped"), and (ii) which control primitives are available. Only a handful of named primitive parameters can be adjusted without over-saturating the prompts. Our approach also assumes all given instructions are feasible, and we cannot tell if generated code will be useful a priori. CaPs also struggle to interpret instructions that are significantly more complex or operate at a different abstraction level than the few-shot examples provided to the language model prompts. Thus, for example, in the tabletop domain, it would be difficult for our specific instantiation of CaPs to "build a house with the blocks" since there are no examples of building complex 3D structures. These limitations point to avenues for future work, including extending visual language models to describe low-level robot behaviors (e.g., trajectories) or combining CaPs with exploration algorithms that can autonomously add to the set of control primitives.


Open-Source Release

We have released the code needed to reproduce our experiments and an interactive simulated robot demo on the project website, which also contains additional real-world demos with videos and generated code.


Conclusion

Code as policies is a step towards robots that can modify their behaviors and expand their capabilities accordingly. This can be enabling, but the flexibility also raises potential risks since synthesized programs (unless manually checked per runtime) may result in unintended behaviors with physical hardware. We can mitigate these risks with built-in safety checks that bound the control primitives that the system can access, but more work is needed to ensure new combinations of known primitives are equally safe. We welcome broad discussion on how to minimize these risks while maximizing the potential positive impacts towards more general-purpose robots.


Acknowledgements

This research was done by Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, Andy Zeng. Special thanks to Vikas Sindhwani, Vincent Vanhoucke for helpful feedback on writing, Chad Boodoo for operations and hardware support. An early preprint is available on arXiv.

Source: Google AI Blog


PI-ARS: Accelerating Evolution-Learned Visual-Locomotion with Predictive Information Representations

Evolution strategy (ES) is a family of optimization techniques inspired by the ideas of natural selection: a population of candidate solutions are usually evolved over generations to better adapt to an optimization objective. ES has been applied to a variety of challenging decision making problems, such as legged locomotion, quadcopter control, and even power system control.

Compared to gradient-based reinforcement learning (RL) methods like proximal policy optimization (PPO) and soft actor-critic (SAC), ES has several advantages. First, ES directly explores in the space of controller parameters, while gradient-based methods often explore within a limited action space, which indirectly influences the controller parameters. More direct exploration has been shown to boost learning performance and enable large scale data collection with parallel computation. Second, a major challenge in RL is long-horizon credit assignment, e.g., when a robot accomplishes a task in the end, determining which actions it performed in the past were the most critical and should be assigned a greater reward. Since ES directly considers the total reward, it relieves researchers from needing to explicitly handle credit assignment. In addition, because ES does not rely on gradient information, it can naturally handle highly non-smooth objectives or controller architectures where gradient computation is non-trivial, such as meta–reinforcement learning. However, a major weakness of ES-based algorithms is their difficulty in scaling to problems that require high-dimensional sensory inputs to encode the environment dynamics, such as training robots with complex vision inputs.

In this work, we propose “PI-ARS: Accelerating Evolution-Learned Visual-Locomotion with Predictive Information Representations”, a learning algorithm that combines representation learning and ES to effectively solve high dimensional problems in a scalable way. The core idea is to leverage predictive information, a representation learning objective, to obtain a compact representation of the high-dimensional environment dynamics, and then apply Augmented Random Search (ARS), a popular ES algorithm, to transform the learned compact representation into robot actions. We tested PI-ARS on the challenging problem of visual-locomotion for legged robots. PI-ARS enables fast training of performant vision-based locomotion controllers that can traverse a variety of difficult environments. Furthermore, the controllers trained in simulated environments successfully transfer to a real quadruped robot.

PI-ARS trains reliable visual-locomotion policies that are transferable to the real world.

Predictive Information
A good representation for policy learning should be both compressive, so that ES can focus on solving a much lower dimensional problem than learning from raw observations would entail, and task-critical, so the learned controller has all the necessary information needed to learn the optimal behavior. For robotic control problems with high-dimensional input space, it is critical for the policy to understand the environment, including the dynamic information of both the robot itself and its surrounding objects.

As such, we propose an observation encoder that preserves information from the raw input observations that allows the policy to predict the future states of the environment, thus the name predictive information (PI). More specifically, we optimize the encoder such that the encoded version of what the robot has seen and planned in the past can accurately predict what the robot might see and be rewarded in the future. One mathematical tool to describe such a property is that of mutual information, which measures the amount of information we obtain about one random variable X by observing another random variable Y. In our case, X and Y would be what the robot saw and planned in the past, and what the robot sees and is rewarded in the future. Directly optimizing the mutual information objective is a challenging problem because we usually only have access to samples of the random variables, but not their underlying distributions. In this work we follow a previous approach that uses InfoNCE, a contrastive variational bound on mutual information to optimize the objective.

Left: We use representation learning to encode PI of the environment. Right: We train the representation by replaying trajectories from the replay buffer and maximize the predictability between the observation and motion plan in the past and the observation and reward in the future of the trajectory.

Predictive Information with Augmented Random Search
Next, we combine PI with Augmented Random Search (ARS), an algorithm that has shown excellent optimization performance for challenging decision-making tasks. At each iteration of ARS, it samples a population of perturbed controller parameters, evaluates their performance in the testing environment, and then computes a gradient that moves the controller towards the ones that performed better.

We use the learned compact representation from PI to connect PI and ARS, which we call PI-ARS. More specifically, ARS optimizes a controller that takes as input the learned compact representation PI and predicts appropriate robot commands to achieve the task. By optimizing a controller with smaller input space, it allows ARS to find the optimal solution more efficiently. Meanwhile, we use the data collected during ARS optimization to further improve the learned representation, which is then fed into the ARS controller in the next iteration.

An overview of the PI-ARS data flow. Our algorithm interleaves between two steps: 1) optimizing the PI objective that updates the policy, which is the weights for the neural network that extracts the learned representation; and 2) sampling new trajectories and updating the controller parameters using ARS.

Visual-Locomotion for Legged Robots
We evaluate PI-ARS on the problem of visual-locomotion for legged robots. We chose this problem for two reasons: visual-locomotion is a key bottleneck for legged robots to be applied in real-world applications, and the high-dimensional vision-input to the policy and the complex dynamics in legged robots make it an ideal test-case to demonstrate the effectiveness of the PI-ARS algorithm. A demonstration of our task setup in simulation can be seen below. Policies are first trained in simulated environments, and then transferred to hardware.

An illustration of the visual-locomotion task setup. The robot is equipped with two cameras to observe the environment (illustrated by the transparent pyramids). The observations and robot state are sent to the policy to generate a high-level motion plan, such as feet landing location and desired moving speed. The high-level motion plan is then achieved by a low-level Motion Predictive Control (MPC) controller.

Experiment Results
We first evaluate the PI-ARS algorithm on four challenging simulated tasks:

  • Uneven stepping stones: The robot needs to walk over uneven terrain while avoiding gaps.
  • Quincuncial piles: The robot needs to avoid gaps both in front and sideways.
  • Moving platforms: The robot needs to walk over stepping stones that are randomly moving horizontally or vertically. This task illustrates the flexibility of learning a vision-based policy in comparison to explicitly reconstructing the environment.
  • Indoor navigation: The robot needs to navigate to a random location while avoiding obstacles in an indoor environment.

As shown below, PI-ARS is able to significantly outperform ARS in all four tasks in terms of the total task reward it can obtain (by 30-50%).

Left: Visualization of PI-ARS policy performance in simulation. Right: Total task reward (i.e., episode return) for PI-ARS (green line) and ARS (red line). The PI-ARS algorithm significantly outperforms ARS on four challenging visual-locomotion tasks.

We further deploy the trained policies to a real Laikago robot on two tasks: random stepping stone and indoor navigation. We demonstrate that our trained policies can successfully handle real-world tasks. Notably, the success rate of the random stepping stone task improved from 40% in the prior work to 100%.

PI-ARS trained policy enables a real Laikago robot to navigate around obstacles.

Conclusion
In this work, we present a new learning algorithm, PI-ARS, that combines gradient-based representation learning with gradient-free evolutionary strategy algorithms to leverage the advantages of both. PI-ARS enjoys the effectiveness, simplicity, and parallelizability of gradient-free algorithms, while relieving a key bottleneck of ES algorithms on handling high-dimensional problems by optimizing a low-dimensional representation. We apply PI-ARS to a set of challenging visual-locomotion tasks, among which PI-ARS significantly outperforms the state of the art. Furthermore, we validate the policy learned by PI-ARS on a real quadruped robot. It enables the robot to walk over randomly-placed stepping stones and navigate in an indoor space with obstacles. Our method opens the possibility of incorporating modern large neural network models and large-scale data into the field of evolutionary strategy for robotics control.

Acknowledgements
We would like to thank our paper co-authors: Ofir Nachum, Tingnan Zhang, Sergio Guadarrama, and Jie Tan. We would also like to thank Ian Fischer and John Canny for valuable feedback.

Source: Google AI Blog


Table Tennis: A Research Platform for Agile Robotics

Robot learning has been applied to a wide range of challenging real world tasks, including dexterous manipulation, legged locomotion, and grasping. It is less common to see robot learning applied to dynamic, high-acceleration tasks requiring tight-loop human-robot interactions, such as table tennis. There are two complementary properties of the table tennis task that make it interesting for robotic learning research. First, the task requires both speed and precision, which puts significant demands on a learning algorithm. At the same time, the problem is highly-structured (with a fixed, predictable environment) and naturally multi-agent (the robot can play with humans or another robot), making it a desirable testbed to investigate questions about human-robot interaction and reinforcement learning. These properties have led to several research groups developing table tennis research platforms [1, 2, 3, 4].

The Robotics team at Google has built such a platform to study problems that arise from robotic learning in a multi-player, dynamic and interactive setting. In the rest of this post we introduce two projects, Iterative-Sim2Real (to be presented at CoRL 2022) and GoalsEye (IROS 2022), which illustrate the problems we have been investigating so far. Iterative-Sim2Real enables a robot to hold rallies of over 300 hits with a human player, while GoalsEye enables learning goal-conditioned policies that match the precision of amateur humans.

Iterative-Sim2Real policies playing cooperatively with humans (top) and a GoalsEye policy returning balls to different locations (bottom).

Iterative-Sim2Real: Leveraging a Simulator to Play Cooperatively with Humans
In this project, the goal for the robot is cooperative in nature: to carry out a rally with a human for as long as possible. Since it would be tedious and time-consuming to train directly against a human player in the real world, we adopt a simulation-based (i.e., sim-to-real) approach. However, because it is difficult to simulate human behavior accurately, applying sim-to-real learning to tasks that require tight, close-loop interaction with a human participant is difficult.

In Iterative-Sim2Real, (i.e., i-S2R), we present a method for learning human behavior models for human-robot interaction tasks, and instantiate it on our robotic table tennis platform. We have built a system that can achieve rallies of up to 340 hits with an amateur human player (shown below).

A 340-hit rally lasting over 4 minutes.

Learning Human Behavior Models: a Chicken and Egg Problem
The central problem in learning accurate human behavior models for robotics is the following: if we do not have a good-enough robot policy to begin with, then we cannot collect high-quality data on how a person might interact with the robot. But without a human behavior model, we cannot obtain robot policies in the first place. An alternative would be to train a robot policy directly in the real world, but this is often slow, cost-prohibitive, and poses safety-related challenges, which are further exacerbated when people are involved. i-S2R, visualized below, is a solution to this chicken and egg problem. It uses a simple model of human behavior as an approximate starting point and alternates between training in simulation and deploying in the real world. In each iteration, both the human behavior model and the policy are refined.

i-S2R Methodology.

Results
To evaluate i-S2R, we repeated the training process five times with five different human opponents and compared it with a baseline approach of ordinary sim-to-real plus fine-tuning (S2R+FT). When aggregated across all players, the i-S2R rally length is higher than S2R+FT by about 9% (below on the left). The histogram of rally lengths for i-S2R and S2R+FT (below on the right) shows that a large fraction of the rallies for S2R+FT are shorter (i.e., less than 5), while i-S2R achieves longer rallies more frequently.

Summary of i-S2R results. Boxplot details: The white circle is the mean, the horizontal line is the median, box bounds are the 25th and 75th percentiles.

We also break down the results based on player type: beginner (40% players), intermediate (40% of players) and advanced (20% players). We see that i-S2R significantly outperforms S2R+FT for both beginner and intermediate players (80% of players).

i-S2R Results by player type.

More details on i-S2R can be found on our preprint, website, and also in the following summary video.

GoalsEye: Learning to Return Balls Precisely on a Physical Robot
While we focused on sim-to-real learning in i-S2R, it is sometimes desirable to learn using only real-world data — closing the sim-to-real gap in this case is unnecessary. Imitation learning (IL) provides a simple and stable approach to learning in the real world, but it requires access to demonstrations and cannot exceed the performance of the teacher. Collecting expert human demonstrations of precise goal-targeting in high speed settings is challenging and sometimes impossible (due to limited precision in human movements). While reinforcement learning (RL) is well-suited to such high-speed, high-precision tasks, it faces a difficult exploration problem (especially at the start), and can be very sample inefficient. In GoalsEye, we demonstrate an approach that combines recent behavior cloning techniques [5, 6] to learn a precise goal-targeting policy, starting from a small, weakly-structured, non-targeting dataset.

Here we consider a different table tennis task with an emphasis on precision. We want the robot to return the ball to an arbitrary goal location on the table, e.g. “hit the back left corner" or ''land the ball just over the net on the right side" (see left video below). Further, we wanted to find a method that can be applied directly on our real world table tennis environment with no simulation involved. We found that the synthesis of two existing imitation learning techniques, Learning from Play (LFP) and Goal-Conditioned Supervised Learning (GCSL), scales to this setting. It is safe and sample efficient enough to train a policy on a physical robot which is as accurate as amateur humans at the task of returning balls to specific goals on the table.

 
GoalsEye policy aiming at a 20cm diameter goal (left). Human player aiming at the same goal (right).

The essential ingredients of success are:

  1. A minimal, but non-goal-directed “bootstrap” dataset of the robot hitting the ball to overcome an initial difficult exploration problem.
  2. Hindsight relabeled goal conditioned behavioral cloning (GCBC) to train a goal-directed policy to reach any goal in the dataset.
  3. Iterative self-supervised goal reaching. The agent improves continuously by setting random goals and attempting to reach them using the current policy. All attempts are relabeled and added into a continuously expanding training set. This self-practice, in which the robot expands the training data by setting and attempting to reach goals, is repeated iteratively.
GoalsEye methodology.

Demonstrations and Self-Improvement Through Practice Are Key
The synthesis of techniques is crucial. The policy’s objective is to return a variety of incoming balls to any location on the opponent’s side of the table. A policy trained on the initial 2,480 demonstrations only accurately reaches within 30 cm of the goal 9% of the time. However, after a policy has self-practiced for ~13,500 attempts, goal-reaching accuracy rises to 43% (below on the right). This improvement is clearly visible as shown in the videos below. Yet if a policy only self-practices, training fails completely in this setting. Interestingly, the number of demonstrations improves the efficiency of subsequent self-practice, albeit with diminishing returns. This indicates that demonstration data and self-practice could be substituted depending on the relative time and cost to gather demonstration data compared with self-practice.

Self-practice substantially improves accuracy. Left: simulated training. Right: real robot training. The demonstration datasets contain ~2,500 episodes, both in simulation and the real world.
 
Visualizing the benefits of self-practice. Left: policy trained on initial 2,480 demonstrations. Right: policy after an additional 13,500 self-practice attempts.

More details on GoalsEye can be found in the preprint and on our website.

Conclusion and Future Work
We have presented two complementary projects using our robotic table tennis research platform. i-S2R learns RL policies that are able to interact with humans, while GoalsEye demonstrates that learning from real-world unstructured data combined with self-supervised practice is effective for learning goal-conditioned policies in a precise, dynamic setting.

One interesting research direction to pursue on the table tennis platform would be to build a robot “coach” that could adapt its play style according to the skill level of the human participant to keep things challenging and exciting.

Acknowledgements
We thank our co-authors, Saminda Abeyruwan, Alex Bewley, Krzysztof Choromanski, David B. D’Ambrosio, Tianli Ding, Deepali Jain, Corey Lynch, Pannag R. Sanketi, Pierre Sermanet and Anish Shankar. We are also grateful for the support of many members of the Robotics Team who are listed in the acknowledgement sections of the papers.

Source: Google AI Blog


Learning to Walk in the Wild from Terrain Semantics

An important promise for quadrupedal robots is their potential to operate in complex outdoor environments that are difficult or inaccessible for humans. Whether it’s to find natural resources deep in the mountains, or to search for life signals in heavily-damaged earthquake sites, a robust and versatile quadrupedal robot could be very helpful. To achieve that, a robot needs to perceive the environment, understand its locomotion challenges, and adapt its locomotion skill accordingly. While recent advances in perceptive locomotion have greatly enhanced the capability of quadrupedal robots, most works focus on indoor or urban environments, thus they cannot effectively handle the complexity of off-road terrains. In these environments, the robot needs to understand not only the terrain shape (e.g., slope angle, smoothness), but also its contact properties (e.g., friction, restitution, deformability), which are important for a robot to decide its locomotion skills. As existing perceptive locomotion systems mostly focus on the use of depth cameras or LiDARs, it can be difficult for these systems to estimate such terrain properties accurately.

In “Learning Semantics-Aware Locomotion Skills from Human Demonstrations”, we design a hierarchical learning framework to improve a robot’s ability to traverse complex, off-road environments. Unlike previous approaches that focus on environment geometry, such as terrain shape and obstacle locations, we focus on environment semantics, such as terrain type (grass, mud, etc.) and contact properties, which provide a complementary set of information useful for off-road environments. As the robot walks, the framework decides the locomotion skill, including the speed and gait (i.e., shape and timing of the legs’ movement) of the robot based on the perceived semantics, which allows the robot to walk robustly on a variety of off-road terrains, including rocks, pebbles, deep grass, mud, and more.

Our framework selects skills (gait and speed) of the robot from the camera RGB image. We first compute the speed from terrain semantics, and then select a gait based on the speed.

Overview
The hierarchical framework consists of a high-level skill policy and a low level motor controller. The skill policy selects a locomotion skill based on camera images, and the motor controller converts the selected skill into motor commands. The high-level skill policy is further decomposed into a learned speed policy and a heuristic-based gait selector. To decide a skill, the speed policy first computes the desired forward speed, based on the semantic information from the onboard RGB camera. For energy efficiency and robustness, quadrupedal robots usually select a different gait for each speed, so we designed the gait selector to compute a desired gait based on the forward speed. Lastly, a low-level convex model-predictive controller (MPC) converts the desired locomotion skill into motor torque commands, and executes them on the real hardware. We train the speed policy directly in the real world using imitation learning because it requires fewer training data compared to standard reinforcement learning algorithms.

The framework consists of a high-level skill policy and a low-level motor controller.

Learning Speed Command from Human Demonstrations
As the central component in our pipeline, the speed policy outputs the desired forward speed of the robot based on the RGB image from the onboard camera. Although many robot learning tasks can leverage simulation as a source of lower-cost data collection, we train the speed policy in the real world because accurate simulation of complex and diverse off-road environments is not yet available. As policy learning in the real world is time-consuming and potentially unsafe, we make two key design choices to improve the data efficiency and safety of our system.

The first is learning from human demonstrations. Standard reinforcement learning algorithms typically learn by exploration, where the agent attempts different actions in an environment and builds preferences based on the rewards received. However, such explorations can be potentially unsafe, especially in off-road environments, since any robot failures can damage both the robot hardware and the surrounding environment. To ensure safety, we train the speed policy using imitation learning from human demonstrations. We first ask a human operator to teleoperate the robot on a variety of off-road terrains, where the operator controls the speed and heading of the robot using a remote joystick. Next, we collect the training data by storing (image, forward_speed) pairs. We then train the speed policy using standard supervised learning to predict the human operator’s speed command. As it turns out, the human demonstration is both safe and high-quality, and allows the robot to learn a proper speed choice for different terrains.

The second key design choice is the training method. Deep neural networks, especially those involving high-dimensional visual inputs, typically require lots of data to train. To reduce the amount of real-world training data required, we first pre-train a semantic segmentation model on RUGD (an off-road driving dataset where the images look similar to those captured by the robot’s onboard camera), where the model predicts the semantic class (grass, mud, etc.) for every pixel in the camera image. We then extract a semantic embedding from the model’s intermediate layers and use that as the feature for on-robot training. With the pre-trained semantic embedding, we can train the speed policy effectively using less than 30 minutes of real-world data, which greatly reduces the amount of effort required.

We pre-train a semantic segmentation model and extract a semantic embedding to be fine-tuned on robot data.

Gait Selection and Motor Control
The next component in the pipeline, the gait selector, computes the appropriate gait based on the speed command from the speed policy. The gait of a robot, including its stepping frequency, swing height, and base height, can greatly affect the robot’s ability to traverse different terrains.

Scientific studies have shown that animals switch between different gaits at different speeds, and this result is further validated in quadrupedal robots, so we designed the gait selector to compute a robust gait for each speed. Compared to using a fixed gait across all speeds, we find that the gait selector further enhances the robot’s navigation performance on off-road terrains (more details in the paper).

The last component of the pipeline is a motor controller, which converts the speed and gait commands into motor torques. Similar to previous work, we use separate control strategies for swing and stance legs. By separating the task of skill learning and motor control, the skill policy only needs to output the desired speed, and does not need to learn low-level locomotion controls, which greatly simplifies the learning process.

Experiment Results
We implemented our framework on an A1 quadrupedal robot and tested it on an outdoor trail with multiple terrain types, including grass, gravel, and asphalt, which pose varying degrees of difficulty for the robot. For example, while the robot needs to walk slowly with high foot swings in deep grass to prevent its foot from getting stuck, on asphalt it can walk much faster with lower foot swings for better energy efficiency. Our framework captures such differences and selects an appropriate skill for each terrain type: slow speed (0.5m/s) on deep grass, medium speed (1m/s) on gravel, and high speed (1.4m/s) on asphalt. It completes the 460m-long trail in 9.6 minutes with an average speed of 0.8m/s (i.e., that’s 1.8 miles or 2.9 kilometers per hour). In contrast, non-adaptive policies either cannot complete the trail safely or walk significantly slower (0.5m/s), illustrating the importance of adapting locomotion skills based on the perceived environments.

The framework selects different speeds based on conditions of the trail.

To test generalizability, we also deployed the robot to a number of trails that are not seen during training. The robot traverses through all of them without failure, and adjusts its locomotion skills based on terrain semantics. In general, the skill policy selects a faster skill on rigid and flat terrains and a slower speed on deformable or uneven terrain. At the time of writing, the robot has traversed over 6km of outdoor trails without failure.

With the framework, the robot walks safely on a variety of outdoor terrains not seen during training.

Conclusion
In this work, we present a hierarchical framework to learn semantic-aware locomotion skills for off-road locomotion. Using less than 30 minutes of human demonstration data, the framework learns to adjust the speed and gait of the robot based on the perceived semantics of the environment. The robot can walk safely and efficiently on a wide variety of off-road terrains. One limitation of our framework is that it only adjusts locomotion skills for standard walking and does not support more agile behaviors such as jumping, which can be essential for traversing more difficult terrains with gaps or hurdles. Another limitation is that our framework currently requires manual steering commands to follow a desired path and reach the goal. In future work, we plan to look into a deeper integration of high-level skill policy with the low-level controller for more agile behaviors, and incorporate navigation and path planning into the framework so that the robot can operate fully autonomously in challenging off-road environments.

Acknowledgements
We would like to thank our paper co-authors: Xiangyun Meng, Wenhao Yu, Tingnan Zhang, Jie Tan, and Byron Boots. We would also like to thank the team members of Robotics at Google for discussions and feedback.

Source: Google AI Blog


Towards Helpful Robots: Grounding Language in Robotic Affordances

Over the last several years, we have seen significant progress in applying machine learning to robotics. However, robotic systems today are capable of executing only very short, hard-coded commands, such as “Pick up an apple,” because they tend to perform best with clear tasks and rewards. They struggle with learning to perform long-horizon tasks and reasoning about abstract goals, such as a user prompt like “I just worked out, can you get me a healthy snack?”

Meanwhile, recent progress in training language models (LMs) has led to systems that can perform a wide range of language understanding and generation tasks with impressive results. However, these language models are inherently not grounded in the physical world due to the nature of their training process: a language model generally does not interact with its environment nor observe the outcome of its responses. This can result in it generating instructions that may be illogical, impractical or unsafe for a robot to complete in a physical context. For example, when prompted with “I spilled my drink, can you help?” the language model GPT-3 responds with “You could try using a vacuum cleaner,” a suggestion that may be unsafe or impossible for the robot to execute. When asking the FLAN language model the same question, it apologizes for the spill with "I'm sorry, I didn't mean to spill it,” which is not a very useful response. Therefore, we asked ourselves, is there an effective way to combine advanced language models with robot learning algorithms to leverage the benefits of both?

In “Do As I Can, Not As I Say: Grounding Language in Robotic Affordances”, we present a novel approach, developed in partnership with Everyday Robots, that leverages advanced language model knowledge to enable a physical agent, such as a robot, to follow high-level textual instructions for physically-grounded tasks, while grounding the language model in tasks that are feasible within a specific real-world context. We evaluate our method, which we call PaLM-SayCan, by placing robots in a real kitchen setting and giving them tasks expressed in natural language. We observe highly interpretable results for temporally-extended complex and abstract tasks, like “I just worked out, please bring me a snack and a drink to recover.” Specifically, we demonstrate that grounding the language model in the real world nearly halves errors over non-grounded baselines. We are also excited to release a robot simulation setup where the research community can test this approach.

With PaLM-SayCan, the robot acts as the language model’s “hands and eyes,” while the language model supplies high-level semantic knowledge about the task.

A Dialog Between User and Robot, Facilitated by the Language Model
Our approach uses the knowledge contained in language models (Say) to determine and score actions that are useful towards high-level instructions. It also uses an affordance function (Can) that enables real-world-grounding and determines which actions are possible to execute in a given environment. Using the the PaLM language model, we call this PaLM-SayCan.

Our approach selects skills based on what the language model scores as useful to the high level instruction and what the affordance model scores as possible.

Our system can be seen as a dialog between the user and robot, facilitated by the language model. The user starts by giving an instruction that the language model turns into a sequence of steps for the robot to execute. This sequence is filtered using the robot’s skillset to determine the most feasible plan given its current state and environment. The model determines the probability of a specific skill successfully making progress toward completing the instruction by multiplying two probabilities: (1) task-grounding (i.e., a skill language description) and (2) world-grounding (i.e., skill feasibility in the current state).

There are additional benefits of our approach in terms of its safety and interpretability. First, by allowing the LM to score different options rather than generate the most likely output, we effectively constrain the LM to only output one of the pre-selected responses. In addition, the user can easily understand the decision making process by looking at the separate language and affordance scores, rather than a single output.

PaLM-SayCan is also interpretable: at each step, we can see the top options it considers based on their language score (blue), affordance score (red), and combined score (green).

Training Policies and Value Functions
Each skill in the agent’s skillset is defined as a policy with a short language description (e.g., “pick up the can”), represented as embeddings, and an affordance function that indicates the probability of completing the skill from the robot’s current state. To learn the affordance functions, we use sparse reward functions set to 1.0 for a successful execution, and 0.0 otherwise.

We use image-based behavioral cloning (BC) to train the language-conditioned policies and temporal-difference-based (TD) reinforcement learning (RL) to train the value functions. To train the policies, we collected data from 68,000 demos performed by 10 robots over 11 months and added 12,000 successful episodes, filtered from a set of autonomous episodes of learned policies. We then learned the language conditioned value functions using MT-Opt in the Everyday Robots simulator. The simulator complements our real robot fleet with a simulated version of the skills and environment, which is transformed using RetinaGAN to reduce the simulation-to-real gap. We bootstrapped simulation policies’ performance by using demonstrations to provide initial successes, and then continuously improved RL performance with online data collection in simulation.

Given a high-level instruction, our approach combines the probabilities from the language model with the probabilities from the value function (VF) to select the next skill to perform. This process is repeated until the high-level instruction is successfully completed.

Performance on Temporally-Extended, Complex, and Abstract Instructions
To test our approach, we use robots from Everyday Robots paired with PaLM. We place the robots in a kitchen environment containing common objects and evaluate them on 101 instructions to test their performance across various robot and environment states, instruction language complexity and time horizon. Specifically, these instructions were designed to showcase the ambiguity and complexity of language rather than to provide simple, imperative queries, enabling queries such as “I just worked out, how would you bring me a snack and a drink to recover?” instead of “Can you bring me water and an apple?”

We use two metrics to evaluate the system’s performance: (1) the plan success rate, indicating whether the robot chose the right skills for the instruction, and (2) the execution success rate, indicating whether it performed the instruction successfully. We compare two language models, PaLM and FLAN (a smaller language model fine-tuned on instruction answering) with and without the affordance grounding as well as the underlying policies running directly with natural language (Behavioral Cloning in the table below). The results show that the system using PaLM with affordance grounding (PaLM-SayCan) chooses the correct sequence of skills 84% of the time and executes them successfully 74% of the time, reducing errors by 50% compared to FLAN and compared to PaLM without robotic grounding. This is particularly exciting because it represents the first time we can see how an improvement in language models translates to a similar improvement in robotics. This result indicates a potential future where robotics is able to ride the wave of progress that we have been observing in language models, bringing these subfields of research closer together.

Algorithm     Plan     Execute
PaLM-SayCan     84%     74%
PaLM     67%     -
FLAN-SayCan     70%     61%
FLAN     38%     -
Behavioral Cloning     0%     0%
PaLM-SayCan halves errors compared to PaLM without affordances and compared to FLAN over 101 tasks.
SayCan demonstrated successful planning for 84% of the 101 test instructions when combined with PaLM.

If you're interested in learning more about this project from the researchers themselves, please check out the video below:

Conclusion and Future Work
We’re excited about the progress that we’ve seen with PaLM-SayCan, an interpretable and general approach to leveraging knowledge from language models that enables a robot to follow high-level textual instructions to perform physically-grounded tasks. Our experiments on a number of real-world robotic tasks demonstrate the ability to plan and complete long-horizon, abstract, natural language instructions at a high success rate. We believe that PaLM-SayCan’s interpretability allows for safe real-world user interaction with robots. As we explore future directions for this work, we hope to better understand how information gained via the robot’s real-world experience could be leveraged to improve the language model and to what extent natural language is the right ontology for programming robots. We have open-sourced a robot simulation setup, which we hope will provide researchers with a valuable resource for future research that combines robotic learning with advanced language models. The research community can visit the project’s GitHub page and website to learn more.

Acknowledgements
We’d like to thank our coauthors Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Kelly Fu, Keerthana Gopalakrishnan, Alex Herzog, Daniel Ho, Jasmine Hsu, Julian Ibarz, Alex Irpan, Eric Jang, Rosario Jauregui Ruano, Kyle Jeffrey, Sally Jesmonth, Nikhil J Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Kuang-Huei Lee, Sergey Levine, Yao Lu, Linda Luu, Carolina Parada, Peter Pastor, Jornell Quiambao, Kanishka Rao, Jarek Rettinghouse, Diego Reyes, Pierre Sermanet, Nicolas Sievers, Clayton Tan, Alexander Toshev, Vincent Vanhoucke, Fei Xia, Ted Xiao, Peng Xu, Sichun Xu, Mengyuan Yan, and Andy Zeng. We’d also like to thank Yunfei Bai, Matt Bennice, Maarten Bosma, Justin Boyd, Bill Byrne, Kendra Byrne, Noah Constant, Pete Florence, Laura Graesser, Rico Jonschkowski, Daniel Kappler, Hugo Larochelle, Benjamin Lee, Adrian Li, Suraj Nair, Krista Reymann, Jeff Seto, Dhruv Shah, Ian Storz, Razvan Surdulescu, and Vincent Zhao for their help and support in various aspects of the project. And we’d like to thank Tom Small for creating many of the animations in this post.

Source: Google AI Blog


Scanned Objects by Google Research: A Dataset of 3D-Scanned Common Household Items

Many recent advances in computer vision and robotics rely on deep learning, but training deep learning models requires a wide variety of data to generalize to new scenarios. Historically, deep learning for computer vision has relied on datasets with millions of items that were gathered by web scraping, examples of which include ImageNet, Open Images, YouTube-8M, and COCO. However, the process of creating these datasets can be labor-intensive, and can still exhibit labeling errors that can distort the perception of progress. Furthermore, this strategy does not readily generalize to arbitrary three-dimensional shapes or real-world robotic data.

Real-world robotic data collection is very useful, but difficult to scale and challenging to label (figure from BC-Z).

Simulating robots and environments using tools such as Gazebo, MuJoCo, and Unity can mitigate many of the inherent limitations in these datasets. However, simulation is only an approximation of reality — handcrafted models built from polygons and primitives often correspond poorly to real objects. Even if a scene is built directly from a 3D scan of a real environment, the movable objects in that scan will act like fixed background scenery and will not respond the way real-world objects would. Due to these challenges, there are few large libraries with high-quality models of 3D objects that can be incorporated into physical and visual simulations to provide the variety needed for deep learning.

In “Google Scanned Objects: A High-Quality Dataset of 3D Scanned Household Items”, presented at ICRA 2022, we describe our efforts to address this need by creating the Scanned Objects dataset, a curated collection of over 1000 3D-scanned common household items. The Scanned Objects dataset is usable in tools that read Simulation Description Format (SDF) models, including the Gazebo and PyBullet robotics simulators. Scanned Objects is hosted on Open Robotics, an open-source hosting environment for models compatible with the Gazebo simulator.

History
Robotics researchers within Google began scanning objects in 2011, creating high-fidelity 3D models of common household objects to help robots recognize and grasp things in their environments. However, it became apparent that 3D models have many uses beyond object recognition and robotic grasping, including scene construction for physical simulations and 3D object visualization for end-user applications. Therefore, this Scanned Objects project was expanded to bring 3D experiences to Google at scale, collecting a large number of 3D scans of household objects through a process that is more efficient and cost effective than traditional commercial-grade product photography.

Scanned Objects was an end-to-end effort, involving innovations at nearly every stage of the process, including curation of objects at scale for 3D scanning, the development of novel 3D scanning hardware, efficient 3D scanning software, fast 3D rendering software for quality assurance, and specialized frontends for web and mobile viewers. We also executed human-computer interaction studies to create effective experiences for interacting with 3D objects.

Objects that were acquired for scanning.

These object models proved useful in 3D visualizations for Everyday Robots, which used the models to bridge the sim-to-real gap for training, work later published as RetinaGAN and RL-CycleGAN. Building on these earlier 3D scanning efforts, in 2019 we began preparing an external version of the Scanned Objects dataset and transforming the previous set of 3D images into graspable 3D models.

Object Scanning
To create high-quality models, we built a scanning rig to capture images of an object from multiple directions under controlled and carefully calibrated conditions. The system consists of two machine vision cameras for shape detection, a DSLR camera for high-quality HDR color frame extraction, and a computer-controlled projector for pattern recognition. The scanning rig uses a structured light technique that infers a 3D shape from camera images with patterns of light that are projected onto an object.

The scanning rig used to capture 3D models.
A shoe being scanned (left). Images are captured from several directions with different patterns of light and color. A shadow passing over an object (right) illustrates how a 3D shape can be captured with an off-axis view of a shadow edge.

Simulation Model Conversion
The early internal scanned models used protocol buffer metadata, high-resolution visuals, and formats that were not suitable for simulation. For some objects, physical properties, such as mass, were captured by weighing the objects at scanning time, but surface properties, such as friction or deformation, were not represented.

So, following data collection, we built an automated pipeline to solve these issues and enable the use of scanned models in simulation systems. The automated pipeline filters out invalid or duplicate objects, automatically assigns object names using text descriptions of the objects, and eliminates object mesh scans that do not meet simulation requirements. Next, the pipeline estimates simulation properties (e.g., mass and moment of inertia) from shape and volume, constructs collision volumes, and downscales the model to a usable size. Finally, the pipeline converts each model to SDF format, creates thumbnail images, and packages the model for use in simulation systems.

The pipeline filters models that are not suitable for simulation, generates collision volumes, computes physical properties, downsamples meshes, generates thumbnails, and packages them all for use in simulation systems.
A collection of Scanned Object models rendered in Blender.

The output of this pipeline is a simulation model in an appropriate format with a name, mass, friction, inertia, and collision information, along with searchable metadata in a public interface compatible with our open-source hosting on Open Robotics’ Gazebo.

The output objects are represented as SDF models that refer to Wavefront OBJ meshes averaging 1.4 Mb per model. Textures for these models are in PNG format and average 11.2 Mb. Together, these provide high resolution shape and texture.

Impact
The Scanned Objects dataset contains 1030 scanned objects and their associated metadata, totaling 13 Gb, licensed under the CC-BY 4.0 License. Because these models are scanned rather than modeled by hand, they realistically reflect real object properties, not idealized recreations, reducing the difficulty of transferring learning from simulation to the real world.

Input views (left) and reconstructed shape and texture from two novel views on the right (figure from Differentiable Stereopsis).
Visualized action scoring predictions over three real-world 3D scans from the Replica dataset and Scanned Objects (figure from Where2Act).

The Scanned Objects dataset has already been used in over 25 papers across as many projects, spanning computer vision, computer graphics, robot manipulation, robot navigation, and 3D shape processing. Most projects used the dataset to provide synthetic training data for learning algorithms. For example, the Scanned Objects dataset was used in Kubric, an open-sourced generator of scalable datasets for use in over a dozen vision tasks, and in LAX-RAY, a system for searching shelves with lateral access X-rays to automate the mechanical search for occluded objects on shelves.

Unsupervised 3D keypoints on real-world data (figure from KeypointDeformer).

We hope that the Scanned Objects dataset will be used by more robotics and simulation researchers in the future, and that the example set by this dataset will inspire other owners of 3D model repositories to make them available for researchers everywhere. If you would like to try it yourself, head to Gazebo and start browsing!

Acknowledgments
The authors thank the Scanned Objects team, including Peter Anderson-Sprecher, J.J. Blumenkranz, James Bruce, Ken Conley, Katie Dektar, Charles DuHadway, Anthony Francis, Chaitanya Gharpure, Topraj Gurung, Kristy Headley, Ryan Hickman, John Isidoro, Sumit Jain, Brandon Kinman, Greg Kline, Mach Kobayashi, Nate Koenig, Kai Kohlhoff, James Kuffner, Thor Lewis, Mike Licitra, Lexi Martin, Julian (Mac) Mason, Rus Maxham, Pascal Muetschard, Kannan Pashupathy, Barbara Petit, Arshan Poursohi, Jared Russell, Matt Seegmiller, John Sheu, Joe Taylor, Vincent Vanhoucke, Josh Weaver, and Tommy McHugh.

Special thanks go to Krista Reymann for organizing this project, helping write the paper, and editing this blogpost, James Bruce for the scanning pipeline design and Pascal Muetschard for maintaining the database of object models.

Source: Google AI Blog


Learning Locomotion Skills Safely in the Real World

The promise of deep reinforcement learning (RL) in solving complex, high-dimensional problems autonomously has attracted much interest in areas such as robotics, game playing, and self-driving cars. However, effectively training an RL policy requires exploring a large set of robot states and actions, including many that are not safe for the robot. This is a considerable risk, for example, when training a legged robot. Because such robots are inherently unstable, there is a high likelihood of the robot falling during learning, which could cause damage.

The risk of damage can be mitigated to some extent by learning the control policy in computer simulation and then deploying it in the real world. However, this approach usually requires addressing the difficult sim-to-real gap, i.e., the policy trained in simulation can not be readily deployed in the real world for various reasons, such as sensor noise in deployment or the simulator not being realistic enough during training. Another approach to solve this issue is to directly learn or fine-tune a control policy in the real world. But again, the main challenge is to assure safety during learning.

In “Safe Reinforcement Learning for Legged Locomotion”, we introduce a safe RL framework for learning legged locomotion while satisfying safety constraints during training. Our goal is to learn locomotion skills autonomously in the real world without the robot falling during the entire learning process. Our learning framework adopts a two-policy safe RL framework: a “safe recovery policy” that recovers robots from near-unsafe states, and a “learner policy” that is optimized to perform the desired control task. The safe learning framework switches between the safe recovery policy and the learner policy to enable robots to safely acquire novel and agile motor skills.

The Proposed Framework
Our goal is to ensure that during the entire learning process, the robot never falls, regardless of the learner policy being used. Similar to how a child learns to ride a bike, our approach teaches an agent a policy while using "training wheels", i.e., a safe recovery policy. We first define a set of states, which we call a “safety trigger set”, where the robot is close to violating safety constraints but can still be saved by a safe recovery policy. For example, the safety trigger set can be defined as a set of states with the height of the robots being below a certain threshold and the roll, pitch, yaw angles being too large, which is an indication of falls. When the learner policy results in the robot being within the safety trigger set (i.e., where it is likely to fall), we switch to the safe recovery policy, which drives the robot back to a safe state. We determine when to switch back to the learner policy by leveraging an approximate dynamics model of the robot to predict the future robot trajectory. For example, based on the position of the robot’s legs and the current angle of the robot based on sensors for roll, pitch, and yaw, is it likely to fall in the future? If the predicted future states are all safe, we hand the control back to the learner policy, otherwise, we keep using the safe recovery policy.

The state diagram of the proposed approach. (1) If the learner policy violates the safety constraint, we switch to the safe recovery policy. (2) If the learner policy cannot ensure safety in the near future after switching to the safe recovery policy, we keep using the safe recovery policy. This allows the robot to explore more while ensuring safety.

This approach ensures safety in complex systems without resorting to opaque neural networks that may be sensitive to distribution shifts in application. In addition, the learner policy is able to explore states that are near safety violations, which is useful for learning a robust policy.

Because we use “approximated” dynamics to predict the future trajectory, we also examine how much safer a robot would be if we used a much more accurate model for its dynamics. We provide a theoretical analysis of this problem and show that our approach can achieve minimal safety performance loss compared to one with a full knowledge about the system dynamics.

Legged Locomotion Tasks
To demonstrate the effectiveness of the algorithm, we consider learning three different legged locomotion skills:

  1. Efficient Gait: The robot learns how to walk with low energy consumption and is rewarded for consuming less energy.
  2. Catwalk: The robot learns a catwalk gait pattern, in which the left and right two feet are close to each other. This is challenging because by narrowing the support polygon, the robot becomes less stable.
  3. Two-leg Balance: The robot learns a two-leg balance policy, in which the front-right and rear-left feet are in stance, and the other two are lifted. The robot can easily fall without delicate balance control because the contact polygon degenerates into a line segment.
Locomotion tasks considered in the paper. Top: efficient gait. Middle: catwalk. Bottom: two-leg balance.

Implementation Details
We use a hierarchical policy framework that combines RL and a traditional control approach for the learner and safe recovery policies. This framework consists of a high-level RL policy, which produces gait parameters (e.g., stepping frequency) and feet placements, and pairs it with a low-level process controller called model predictive control (MPC) that takes in these parameters and computes the desired torque for each motor in the robot. Because we do not directly command the motors’ angles, this approach provides more stable operation, streamlines the policy training due to a smaller action space, and results in a more robust policy. The input of the RL policy network includes the previous gait parameters, the height of the robot, base orientation, linear, angular velocities, and feedback to indicate whether the robot is approaching the safety trigger set. We use the same setup for each task.

We train a safe recovery policy with a reward for reaching stability as soon as possible. Furthermore, we design the safety trigger set with inspiration from capturability theory. In particular, the initial safety trigger set is defined to ensure that the robot’s feet can not fall outside of the positions from which the robot can safely recover using the safe recovery policy. We then fine-tune this set on the real robot with a random policy to prevent the robot from falling.

Real-World Experiment Results
We report the real-world experimental results showing the reward learning curves and the percentage of safe recovery policy activations on the efficient gait, catwalk, and two-leg balance tasks. To ensure that the robot can learn to be safe, we add a penalty when triggering the safe recovery policy. Here, all the policies are trained from scratch, except for the two-leg balance task, which was pre-trained in simulation because it requires more training steps.

Overall, we see that on these tasks, the reward increases, and the percentage of uses of the safe recovery policy decreases over policy updates. For instance, the percentage of uses of the safe recovery policy decreases from 20% to near 0% in the efficient gait task. For the two-leg balance task, the percentage drops from near 82.5% to 67.5%, suggesting that the two-leg balance is substantially harder than the previous two tasks. Still, the policy does improve the reward. This observation implies that the learner can gradually learn the task while avoiding the need to trigger the safe recovery policy. In addition, this suggests that it is possible to design a safe trigger set and a safe recovery policy that does not impede the exploration of the policy as the performance increases.

The reward learning curve (blue) and the percentage of safe recovery policy activations (red) using our safe RL algorithm in the real world.

In addition, the following video shows the learning process for the two-leg balance task, including the interplay between the learner policy and the safe recovery policy, and the reset to the initial position when an episode ends. We can see that the robot tries to catch itself when falling by putting down the lifted legs (front left and rear right) outward, creating a support polygon. After the learning episode ends, the robot walks back to the reset position automatically. This allows us to train policy autonomously and safely without human supervision.

Early training stage.
Late training stage.
Without a safe recovery policy.

Finally, we show the clips of learned policies. First, in the catwalk task, the distance between two sides of the legs is 0.09m, which is 40.9% smaller than the nominal distance. Second, in the two-leg balance task, the robot can maintain balance by jumping up to four times via two legs, compared to one jump from the policy pre-trained from simulation.

Final learned two-leg balance.

Conclusion
We presented a safe RL framework and demonstrated how it can be used to train a robotic policy with no falls and without the need for a manual reset during the entire learning process for the efficient gait and catwalk tasks. This approach even enables training of a two-leg balance task with only four falls. The safe recovery policy is triggered only when needed, allowing the robot to more fully explore the environment. Our results suggest that learning legged locomotion skills autonomously and safely is possible in the real world, which could unlock new opportunities including offline dataset collection for robot learning.

No model is without limitation. We currently ignore the model uncertainty from the environment and non-linear dynamics in our theoretical analysis. Including these would further improve the generality of our approach. In addition, some hyper-parameters of the switching criteria are currently being heuristically tuned. It would be more efficient to automatically determine when to switch based on the learning progress. Furthermore, it would be interesting to extend this safe RL framework to other robot applications, such as robot manipulation. Finally, designing an appropriate reward when incorporating the safe recovery policy can impact learning performance. We use a penalty-based approach that obtained reasonable results in these experiments, but we plan to investigate this in future work to make further performance improvements.

Acknowledgements
We would like to thank our paper co-authors: Tingnan Zhang, Linda Luu, Sehoon Ha, Jie Tan, and Wenhao Yu. We would also like to thank the team members of Robotics at Google for discussions and feedback.

Source: Google AI Blog


Extracting Skill-Centric State Abstractions from Value Functions

Advances in reinforcement learning (RL) for robotics have enabled robotic agents to perform increasingly complex tasks in challenging environments. Recent results show that robots can learn to fold clothes, dexterously manipulate a rubik’s cube, sort objects by color, navigate complex environments and walk on difficult, uneven terrain. But "short-horizon" tasks such as these, which require very little long-term planning and provide immediate failure feedback, are relatively easy to train compared to many tasks that may confront a robot in a real-world setting. Unfortunately, scaling such short-horizon skills to the abstract, long horizons of real-world tasks is difficult. For example, how would one train a robot capable of picking up objects to rearrange a room?

Hierarchical reinforcement learning (HRL), a popular way of solving this problem, has achieved some success in a variety of long-horizon RL tasks. HRL aims to solve such problems by reasoning over a bank of low-level skills, thus providing an abstraction for actions. However, the high-level planning problem can be further simplified by abstracting both states and actions. For example, consider a tabletop rearrangement task, where a robot is tasked with interacting with objects on a desk. Using recent advances in RL, imitation learning, and unsupervised skill discovery, it is possible to obtain a set of primitive manipulation skills such as opening or closing drawers, picking or placing objects, etc. However, even for the simple task of putting a block into the drawer, chaining these skills together is not straightforward. This may be attributed to a combination of (i) challenges with planning and reasoning over long horizons, and (ii) dealing with high dimensional observations while parsing the semantics and affordances of the scene, i.e., where and when the skill can be used.

In “Value Function Spaces: Skill-Centric State Abstractions for Long-Horizon Reasoning”, presented at ICLR 2022, we address the task of learning suitable state and action abstractions for long-range problems. We posit that a minimal, but complete, representation for a higher-level policy in HRL must depend on the capabilities of the skills available to it. We present a simple mechanism to obtain such a representation using skill value functions and show that such an approach improves long-horizon performance in both model-based and model-free RL and enables better zero-shot generalization.

Our method, VFS, can compose low-level primitives (left) to learn complex long-horizon behaviors (right).

Building a Value Function Space
The key insight motivating this work is that the abstract representation of actions and states is readily available from trained policies via their value functions. The notion of “value” in RL is intrinsically linked to affordances, in that the value of a state for skill reflects the probability of receiving a reward for successfully executing the skill. For any skill, its value function captures two key properties: 1) the preconditions and affordances of the scene, i.e., where and when the skill can be used, and 2) the outcome, which indicates whether the skill executed successfully when it was used.

Given a decision process with a finite set of k skills trained with sparse outcome rewards and their corresponding value functions, we construct an embedding space by stacking these skill value functions. This gives us an abstract representation that maps a state to a k-dimensional representation that we call the Value Function Space, or VFS for short. This representation captures functional information about the exhaustive set of interactions that the agent can have with the environment, and is thus a suitable state abstraction for downstream tasks.

Consider a toy example of the tabletop rearrangement setup discussed earlier, with the task of placing the blue object in the drawer. There are eight elementary actions in this environment. The bar plot on the right shows the values of each skill at any given time, and the graph at the bottom shows the evolution of these values over the course of the task.

Value functions corresponding to each skill (top-right; aggregated in bottom) capture functional information about the scene (top-left) and aid decision-making.

At the beginning, the values corresponding to the “Place on Counter” skill are high since the objects are already on the counter; likewise, the values corresponding to “Close Drawer” are high. Through the trajectory, when the robot picks up the blue cube, the corresponding skill value peaks. Similarly, the values corresponding to placing the objects in the drawer increase when the drawer is open and peak when the blue cube is placed inside it. All the functional information required to affect each transition and predict its outcome (success or failure) is captured by the VFS representation, and in principle, allows a high-level agent to reason over all the skills and chain them together — resulting in an effective representation of the observations.

Additionally, since VFS learns a skill-centric representation of the scene, it is robust to exogenous factors of variation, such as background distractors and appearances of task-irrelevant components of the scene. All configurations shown below are functionally equivalent — an open drawer with the blue cube in it, a red cube on the countertop, and an empty gripper — and can be interacted with identically, despite apparent differences.

The learned VFS representation can ignore task-irrelevant factors such as arm pose, distractor objects (green cube) and background appearance (brown desk).

Robotic Manipulation with VFS
This approach enables VFS to plan out complex robotic manipulation tasks. Take, for example, a simple model-based reinforcement learning (MBRL) algorithm that uses a simple one-step predictive model of the transition dynamics in value function space and randomly samples candidate skill sequences to select and execute the best one in a manner similar to the model-predictive control. Given a set of primitive pushing skills of the form “move Object A near Object B” and a high-level rearrangement task, we find that VFS can use MBRL to reliably find skill sequences that solve the high-level task.

A rollout of VFS performing a tabletop rearrangement task using a robotic arm. VFS can reason over a sequence of low-level primitives to achieve the desired goal configuration.

To better understand the attributes of the environment captured by VFS, we sample the VFS-encoded observations from a large number of independent trajectories in the robotic manipulation task and project them onto a two-dimensional axis using the t-SNE technique, which is useful for visualizing clusters in high-dimensional data. These t-SNE embeddings reveal interesting patterns identified and modeled by VFS. Looking at some of these clusters closely, we find that VFS can successfully capture information about the contents (objects) in the scene and affordances (e.g., a sponge can be manipulated when held by the robot’s gripper), while ignoring distractors like the relative positions of the objects on the table and the pose of the robotic arm. While these factors are certainly important to solve the task, the low-level primitives available to the robot abstract them away and hence, make them functionally irrelevant to the high-level controller.

Visualizing the 2D t-SNE projections of VFS embeddings show emergent clustering of equivalent configurations of the environment while ignoring task-irrelevant factors like arm pose.

Conclusions and Connections to Future Work
Value function spaces are representations built on value functions of underlying skills, enabling long-horizon reasoning and planning over skills. VFS is a compact representation that captures the affordances of the scene and task-relevant information while robustly ignoring distractors. Empirical experiments reveal that such a representation improves planning for model-based and model-free methods and enables zero-shot generalization. Going forward, this representation has the promise to continue improving along with the field of multitask reinforcement learning. The interpretability of VFS further enables integration into fields such as safe planning and grounding language models.

Acknowledgements
We thank our co-authors Sergey Levine, Ted Xiao, Alex Toshev, Peng Xu and Yao Lu for their contributions to the paper and feedback on this blog post. We also thank Tom Small for creating the informative visualizations used in this blog post.

Source: Google AI Blog