Author Archives: Google AI

Learning an Accurate Physics Simulator via Adversarial Reinforcement Learning

Simulation empowers various engineering disciplines to quickly prototype with minimal human effort. In robotics, physics simulations provide a safe and inexpensive virtual playground for robots to acquire physical skills with techniques such as deep reinforcement learning (DRL). However, as the hand-derived physics in simulations does not match the real world exactly, control policies trained entirely within simulation can fail when tested on real hardware — a challenge known as the sim-to-real gap or the domain adaptation problem. The sim-to-real gap for perception-based tasks (such as grasping) has been tackled using RL-CycleGAN and RetinaGAN, but there is still a gap caused by the dynamics of robotic systems. This prompts us to ask, can we learn a more accurate physics simulator from a handful of real robot trajectories? If so, such an improved simulator could be used to refine the robot controller using standard DRL training, so that it succeeds in the real world.

In our ICRA 2021 publication “SimGAN: Hybrid Simulator Identification for Domain Adaptation via Adversarial Reinforcement Learning”, we propose to treat the physics simulator as a learnable component that is trained by DRL with a special reward function that penalizes discrepancies between the trajectories (i.e., the movement of the robots over time) generated in simulation and a small number of trajectories that are collected on real robots. We use generative adversarial networks (GANs) to provide such a reward, and formulate a hybrid simulator that combines learnable neural networks and analytical physics equations, to balance model expressiveness and physical correctness. On robotic locomotion tasks, our method outperforms multiple strong baselines, including domain randomization.

A Learnable Hybrid Simulator
A traditional physics simulator is a program that solves differential equations to simulate the movement or interactions of objects in a virtual world. For this work, it is necessary to build different physical models to represent different environments – if a robot walks on a mattress, the deformation of the mattress needs to be taken into account (e.g., with the finite element method). However, due to the diversity of the scenarios that robots could encounter in the real world, it would be tedious (or even impossible) for such environment-specific modeling techniques, which is why it is useful to instead take an approach based on machine learning. Although simulators can be learned entirely from data, if the training data does not include a wide enough variety of situations, the learned simulator might violate the laws of physics (i.e., deviate from the real-world dynamics) if it needs to simulate situations for which it was not trained. As a result, the robot that is trained in such a limited simulator is more likely to fail in the real world.

To overcome this complication, we construct a hybrid simulator that combines both learnable neural networks and physics equations. Specifically, we replace what are often manually-defined simulator parameters — contact parameters (e.g., friction and restitution coefficients) and motor parameters (e.g., motor gains) — with a learnable simulation parameter function because the unmodeled details of contact and motor dynamics are major causes of the sim-to-real gap. Unlike conventional simulators in which these parameters are treated as constants, in the hybrid simulator they are state-dependent — they can change according to the state of the robot. For example, motors can become weaker at higher speed. These typically unmodeled physical phenomena can be captured using the state-dependent simulation parameter functions. Moreover, while contact and motor parameters are usually difficult to identify and subject to change due to wear-and-tear, our hybrid simulator can learn them automatically from data. For example, rather than having to manually specify the parameters of a robot’s foot against every possible surface it might contact, the simulation learns these parameters from training data.

Comparison between a conventional simulator and our hybrid simulator.

The other part of the hybrid simulator is made up of physics equations that ensure the simulation obeys fundamental laws of physics, such as conservation of energy, making it a closer approximation to the real world and thus reducing the sim-to-real gap.

In our earlier mattress example, the learnable hybrid simulator is able to mimic the contact forces from the mattress. Because the learned contact parameters are state-dependent, the simulator can modulate contact forces based on the distance and velocity of the robot’s feet relative to the mattress, mimicking the effect of the stiffness and damping of a deformable surface. As a result, we do not need to analytically devise a model specifically for deformable surfaces.

Using GANs for Simulator Learning
Successfully learning the simulation parameter functions discussed above would result in a hybrid simulator that can generate similar trajectories to the ones collected on the real robot. The key that enables this learning is defining a metric for the similarity between trajectories. GANs, initially designed to generate synthetic images that share the same distribution, or “style,” with a small number of real images, can be used to generate synthetic trajectories that are indistinguishable from real ones. GANs have two main parts, a generator that learns to generate new instances, and a discriminator that evaluates how similar the new instances are to the training data. In this case, the learnable hybrid simulator serves as the GAN generator, while the GAN discriminator provides the similarity scores.

The GAN discriminator provides the similarity metric that compares the movements of the simulated and the real robot.

Fitting parameters of simulation models to data collected in the real world, a process called system identification (SysID), has been a common practice in many engineering fields. For example, the stiffness parameter of a deformable surface can be identified by measuring the displacements of the surface under different pressures. This process is typically manual and tedious, but using GANs can be much more efficient. For example, SysID often requires a hand-crafted metric for the discrepancy between simulated and real trajectories. With GANs, such a metric is automatically learned by the discriminator. Furthermore, to calculate the discrepancy metric, conventional SysID requires pairing each simulated trajectory to a corresponding real-world one that is generated using the same control policy. Since the GAN discriminator takes only one trajectory as the input and calculates the likelihood that it is collected in the real world, this one-to-one pairing is not needed.

Using Reinforcement Learning (RL) to Learn the Simulator and Refine the Policy
Putting everything together, we formulate simulation learning as an RL problem. A neural network learns the state-dependent contact and motor parameters from a small number of real-world trajectories. The neural network is optimized to minimize the error between the simulated and the real trajectories. Note that it is important to minimize this error over an extended period of time — a simulation that accurately predicts a more distant future will lead to a better control policy. RL is well suited to this because it optimizes the accumulated reward over time, rather than just optimizing a single-step reward.

After the hybrid simulator is learned and becomes more accurate, we use RL again to refine the robot’s control policy within the simulation (e.g., walking across a surface, shown below).

Following the arrows clockwise: (upper left) recording a small number of robot's failed attempts in the target domain (e.g., a real-world proxy in which the leg in red is modified to be much heavier than the source domain); (upper right) learning the hybrid simulator to match trajectories collected in the target domain; (lower right) refining control policies in this learned simulator; (lower left) testing the refined controller directly in the target domain.

Evaluation
Due to limited access to real robots during 2020, we created a second and different simulation (target domain) as a proxy of the real-world. The change of dynamics between the source and the target domains are large enough to approximate different sim-to-real gaps (e.g., making one leg heavier, walking on deformable surfaces instead of hard floor). We assessed whether our hybrid simulator, with no knowledge of these changes, could learn to match the dynamics in the target domain, and if the refined policy in this learned simulator could be successfully deployed in the target domain.

Qualitative results below show that simulation learning with less than 10 minutes of data collected in the target domain (where the floor is deformable) is able to generate a refined policy that performs much better for two robots with different morphologies and dynamics.

Comparison of performance between the initial and refined policy in the target domain (deformable floor) for the hopper and the quadruped robot.

Quantitative results below show that SimGAN outperforms multiple state-of-the-art baselines, including domain randomization (DR) and direct finetuning in target domains (FT).

Comparison of policy performance using different sim-to-real transfer methods in three different target domains for the Quadruped robot: locomotion on deformable surface, with weakened motors, and with heavier bodies.

Conclusion
The sim-to-real gap is one of the key bottlenecks that prevents robots from tapping into the power of reinforcement learning. We tackle this challenge by learning a simulator that can more faithfully model real-world dynamics, while using only a small amount of real-world data. The control policy that is refined in this simulator can be successfully deployed. To achieve this, we augment a classical physics simulator with learnable components, and train this hybrid simulator using adversarial reinforcement learning. To date we have tested its application to locomotion tasks, we hope to build on this general framework by applying it to other robot learning tasks, such as navigation and manipulation.

Source: Google AI Blog


A Step Toward More Inclusive People Annotations in the Open Images Extended Dataset

In 2016, we introduced Open Images, a collaborative release of ~9 million images annotated with image labels spanning thousands of object categories and bounding box annotations for 600 classes. Since then, we have made several updates, including the release of crowdsourced data to the Open Images Extended collection to improve diversity of object annotations. While the labels provided with these datasets were expansive, they did not focus on sensitive attributes for people, which are critically important for many machine learning (ML) fairness tasks, such as fairness evaluations and bias mitigation. In fact, finding datasets that include thorough labeling of such sensitive attributes is difficult, particularly in the domain of computer vision.

Today, we introduce the More Inclusive Annotations for People (MIAP) dataset in the Open Images Extended collection. The collection contains more complete bounding box annotations for the person class hierarchy in 100k images containing people. Each annotation is also labeled with fairness-related attributes, including perceived gender presentation and perceived age range. With the increasing focus on reducing unfair bias as part of responsible AI research, we hope these annotations will encourage researchers already leveraging Open Images to incorporate fairness analysis in their research.

Examples of new boxes in MIAP. In each subfigure the magenta boxes are from the original Open Images dataset, while the yellow boxes are additional boxes added by the MIAP Dataset. Original photo credits — left: Boston Public Library; middle: jen robinson; right: Garin Fons; all used with permission under the CC- BY 2.0 license.

Annotations in Open Images
Each image in the original Open Images dataset contains image-level annotations that broadly describe the image and bounding boxes drawn around specific objects. To avoid drawing multiple boxes around the same object, less specific classes were temporarily pruned from the label candidate set, a process that we refer to as hierarchical de-duplication. For example, an image with labels animal, cat, and washing machine has bounding boxes annotated for cat and washing machine, but not for the redundant class animal.

The MIAP dataset addresses the five classes that are part of the person hierarchy in the original Open Images dataset: person, man, woman, boy, girl. The existence of these labels make the Open Images dataset uniquely valuable for research advancing responsible AI, allowing one to train a general person detector with access to gender- and age-range-specific labels for fairness analysis and bias mitigation.

However, we found that the combination of hierarchical de-duplication and societally imposed distinctions between woman/girl and man/boy introduced limitations in the original annotations. For example, if annotators were asked to draw boxes for the class girl, they would not draw a box around a boy in the image. They may or may not draw a box around a woman depending on their assessment of the age of the individual and their cultural understanding of the concept of girl. These decisions could be applied inconsistently between images, depending on the cultural background of the individual annotator, the appearance of an individual, and the context of the scene. Consequently, the bounding box annotations in some images were incomplete, with some people who appeared prominently not being annotated.

Annotations in MIAP
The new MIAP annotations are designed to address these limitations and fulfill the promise of Open Images as a dataset that will enable new advances in machine learning fairness research. Rather than asking annotators to draw boxes for the most specific class from the hierarchy (e.g., girl), we invert the procedure, always requesting bounding boxes for the gender- and age-agnostic person class. All person boxes are then separately associated with labels for perceived gender presentation (predominantly feminine, predominantly masculine, or unknown) and age presentation (young, middle, older, or unknown). We recognize that gender is not binary and that an individual's gender identity may not match their perceived or intended gender presentation and, in an effort to mitigate the effects of unconscious bias on the annotations, we reminded annotators that norms around gender expression vary across cultures and have changed over time.

This procedure adds a significant number of boxes that were previously missing.

Over the 100k images that include people, the number of person bounding boxes have increased from ~358k to ~454k. The number of bounding boxes per perceived gender presentation and perceived age presentation increased consistently. These new annotations provide more complete ground truth for training a person detector as well as more accurate subgroup labels for incorporating fairness into computer vision research.

Comparison of number of person bounding boxes between the original Open Images and the new MIAP dataset.

Intended Use
We include annotations for perceived age range and gender presentation for person bounding boxes because we believe these annotations are necessary to advance the ability to better understand and work to mitigate and eliminate unfair bias or disparate performance across protected subgroups within the field of image understanding. We note that the labels capture the gender and age range presentation as assessed by a third party based on visual cues alone, rather than an individual's self-identified gender or actual age. We do not support or condone building or deploying gender and/or age presentation classifiers trained from these annotations as we believe the risks associated with the use of these technologies outside fairness research outweigh any potential benefits.

Acknowledgements
The core team behind this work included Utsav Prabhu, Vittorio Ferrari, and Caroline Pantofaru. We would also like to thank Alex Hanna, Reena Jana, Alina Kuznetsova, Matteo Malloci, Stefano Pellegrini, Jordi Pont-Tuset, and Mahima Pushkarna, for their contributions to the project.

Source: Google AI Blog


The Importance of A/B Testing in Robotics

Disciplines in the natural sciences, social sciences, and medicine all have to grapple with how to evaluate and compare results within the context of the continually changing real world. In contrast, a significant body of machine learning (ML) research uses a different method that relies on the assumption of a fixed world: measure the performance of a baseline model on fixed data sets, then build a new model aimed at improving on the baseline, and evaluate its performance (on the same fixed data) by comparing its performance to the baseline.

Research into robotics systems and their applications to the real world requires a rethinking of this experiment design. Even in controlled robotic lab environments, it is possible that real-world changes cause the baseline model to perform inconsistently over time, making it unclear whether new models’ performance is an improvement compared to the baseline, or just the result of unintentional, random changes in the experiment setup. As robotics research advances into more complex and challenging real-world scenarios, there is a growing need for both understanding the impact of the ever-changing world on baselines and developing systematic methods to generate informative and clear results.

In this post, we demonstrate how robotics research, even in the relatively controlled environment of a lab, is meaningfully affected by changes in the environment, and discuss how to address this fundamental challenge using random assignment and A/B testing. Although these are classical research methods, they are not generally employed by default in robotics research — yet, they are critical to producing meaningful and measurable scientific results for robotics in real-world scenarios. Additionally, we cover the costs, benefits, and other considerations of using these methods.

The Ever-Changing Real World in Robotics
Even in a robotics lab environment, which is designed to minimize all changes that are not experimental conditions, it is notoriously difficult to set up a perfectly reproducible experiment. Robots get bumped and are subject to wear and tear, lighting changes affect perception, battery charge influences the torque applied to motors — all things that can affect results in ways large and small.

To illustrate this on real robot data, we collected success rate data on one of our simplest setups — moving identical foam dice from one bin to another. For this task, we ran about 33k task trials on two robots over more than five months with the same software and ML model, and took the overall success rate of the last two weeks as baseline. We then measured the historic performance over time in this “very well controlled” environment.

Video of a real robot completing the task: moving identical foam dice from one bin to another.

Given that we did not purposefully change anything during data collection, one would expect the success rate to be statistically similar over time. And yet, this is not what was observed.

The y-axis represents the 95% confidence interval of % change in success rate relative to baseline. If the confidence intervals contain zero, that indicates the success rate is statistically similar to the success rate of baseline. Confidence intervals were computed using Jackknife, with Cochran-Mantel-Haenszel correction to remove operator bias.

Using the sequential data from the plot above, one might conclude that the model ran during weeks 13-14 performed best and that ran during weeks 9-10 performed the worst. One might also expect most, if not all, of the confidence intervals above to contain 0, but only one did. Because no changes were made at any time during these trials, this example effectively demonstrates the impact of unintentional, random real-world changes on even very simple setups. It’s also worth noting that having more trials per experiment wouldn’t remove these differences, instead they will more likely produce a narrower confidence interval making the impact more obvious.

However, what happens when one uses random assignment to compare results, grouping the data randomly rather than sequentially? To answer this, we randomly assigned the above data to the same number of groups for comparison with the baseline. This is equivalent to performing A/B testing where all groups receive the same treatment.

Looking at the chart, we observe that the confidence intervals include zero, indicating success similar to the baseline, as expected.

We performed similar studies with a few other robotics tasks, comparing between sequential and random assignments. They all yielded similar results.

We see that even with no intentional changes, there are statistically significant differences observed for sequential assignment, while random assignment shows the expected result of no statistically significant differences.

Considerations for A/B testing in robotics
While it’s clear based on the above that A/B testing with random assignment is an effective way to control for the unexplainable variance of the real world in robotics, there are some considerations when adopting this approach. Here are several, along with their accompanying pros, cons, and solutions:

  • Absolute vs relative performance: Each experiment needs to be measured against a baseline that is run concurrently. The relative performance metric between baseline and experiment is published with a confidence interval. The absolute performance metric (in baseline or experiment) is less informative, because it depends to an unknown degree on the state of the world when the measurement was taken. However, the statistical differences we’ve measured between the experiment and baseline are sound and robust to reproduction.
  • Data efficiency: With this approach, the baseline always needs to run in parallel with the experimental conditions so they can be compared against each other. Although this may seem wasteful, it is worth the cost when compared against the drawbacks of making an invalid inference against a stale baseline. Furthermore, as the number of random assignment experiments scale up, we can use a single baseline arm with multiple simultaneous experiment arms across independent factors leveraging Google’s overlapping experiment infrastructure. Data efficiency improves with scale.
  • Environmental biases: If there’s any external factor affecting performance overall (lighting, slicker surfaces, etc.), both the baseline and all experiment arms will encounter this factor with similar probability, so its effect will cancel if there’s no relative impact. If there is a correlation between environmental factors and experiment arms, this will show up as differences over time (each environmental factor accumulates in the episodes collected). This can substantially reduce or eliminate the need for effortful environmental resets, and lets us run lifelong experiments and still measure improvements across experimental arms.
  • Human biases: One advantage of random assignment is a reduction in biases introduced by humans. Since human operators cannot know which data sample gets routed to which arm of the experiment, it is harder to have biased experimenters influence any particular outcome.

The Path Forward
The A/B testing experiment framework has been successfully used for a long time in many scientific disciplines to measure performance against changing, unpredictable real-world environments. In this blog post, we show that robotics research can benefit from using this same methodology: it improves the quality and confidence of research results, and avoids the impossible task of perfectly controlling all elements of a fundamentally changing environment. Doing this well requires infrastructure to continuously operate robots, collect data, and tools to make the statistical framework easily accessible to researchers.

Acknowledgements
Arnab Bose, Tuna Toksoz, Yuheng Kuang, Anthony Brohan, Razvan Sudulescu developed the experiment infrastructure and conducted the research. Matthieu Devin suggested the A/A analysis to showcase the differences using existing data. Special thanks to Bill Heavlin, Chris Harris, Vincent Vanhoucke who provided invaluable feedback and support to the work.

Source: Google AI Blog


FRILL: On-Device Speech Representations using TensorFlow-Lite

Representation learning is a machine learning (ML) method that trains a model to identify salient features that can be applied to a variety of downstream tasks, ranging from natural language processing (e.g., BERT and ALBERT) to image analysis and classification (e.g., Inception layers and SimCLR). Last year, we introduced a benchmark for comparing speech representations and a new, generally-useful speech representation model (TRILL). TRILL is based on temporal proximity, and tries to map speech that occurs close together in time to a lower-dimensional embedding that captures temporal proximity in the embedding space. Since its release, the research community has used TRILL on a diverse set of tasks, such as age classification, video thumbnail selection, and language identification. However, despite achieving state-of-the-art performance, TRILL and other neural network-based approaches require more memory and take longer to compute than signal processing operations that deal with simple features, like loudness, average energy, pitch, etc.

In our recent paper "FRILL: A Non-Semantic Speech Embedding for Mobile Devices", to appear at Interspeech 2021, we create a new model that is 40% the size of TRILL and and a feature set that can be computed over 32x faster on mobile phone, with an average decrease in accuracy of less than 2%. This marks an important step towards fully on-device applications of speech ML models, which will lead to better personalization, improved user experiences and greater privacy, an important aspect of developing AI responsibly. We release the code to create FRILL on github, and a pre-trained FRILL model on TensorFlow Hub.

FRILL: Smaller, Faster TRILL
The TRILL architecture is based on a modified version of ResNet50, an architecture that is computationally taxing for constrained hardware, like mobile phones or smart home devices. On the other hand, architectures like MobileNetV3 have been designed with hardware-aware AutoML to perform well on mobile devices. To take advantage of this, we leverage knowledge distillation to combine the benefits of MobileNetV3’s performance with TRILL’s representations.

In the distillation process, the smaller model (i.e., the "student") tries to match the output of the larger model ("teacher") on the AudioSet dataset. Whereas the original TRILL model learned its weights by optimizing a self-supervised loss that clustered audio segments close in time, the student model learns its weights through a fully-supervised loss that ignores temporal matching and instead tries to match TRILL outputs on the training data. The fully-supervised learning signal is often stronger than self-supervision, and allows us to train more quickly.

Knowledge distillation for non-semantic speech embeddings. The dashed line shows the student model output. The "teacher network" is the TRILL network, where "Layer 19" was the best-performing internal representation. The "Student Hyperparameters" on the left are the options explored in this study, the result of which are 144 distinct models. These models were trained with mean-squared error (MSE) to try to match TRILL's Layer 19.

Choosing the Best Student Model
We perform distillation with a variety of student models, each trained with a specific combination of architecture choices (explained below). To measure each student model’s latency, we leverage TensorFlow Lite (TFLite), a framework that enables execution of TensorFlow models on edge devices. Each candidate model is first converted into TFLite’s flatbuffer format for 32-bit floating point inference and then sent to the target device (in this case, a Pixel 1) for benchmarking. These measurements help us to accurately assess the latency versus quality tradeoffs across all student models and to minimize the loss of quality in the conversion process.

Architecture Choices and Optimizations
We explored different neural network architectures and features that balance latency and accuracy — models with fewer parameters are usually smaller and faster, but have less representational power and therefore generate less generally-useful representations. We trained 144 different models across a number of hyperparameters, all based on the MobileNetV3 architecture:

  1. MobileNetV3 size and width: MobileNetV3 was released in different sizes for use in different environments. The size refers to which MobileNetV3 architecture we used. The width, sometimes known as alpha, proportionally decreases or increases the number of filters in each layer. A width of 1.0 corresponds to the number of filters in the original paper.
  2. Global average pooling: MobileNetV3 normally produces a set of two-dimensional feature maps. These are flattened, concatenated, and passed to the bottleneck layer. However, this bottleneck is often still too large to be computed quickly. We reduce the size of the bottleneck layer kernel by taking the global average of all ”pixels” in each output feature map. Our intuition is that the discarded temporal information is less important for learning a non-semantic speech representation due to the fact that relevant aspects of the signal are stable across time.
  3. Bottleneck compression: A significant portion of the student model’s weights are located in the bottleneck layer. To reduce the size of this layer, we apply a compression operator based on singular value decomposition (SVD) that learns a low-rank approximation of the bottleneck weight matrix.
  4. Quantization-aware training: Since the bottleneck layer has most of the model weights, we use quantization-aware training (QAT) to gradually reduce the numerical precision of the bottleneck weights during training. QAT allows the model to adjust to the lower numerical precision during training, instead of potentially causing performance degradation by introducing quantization after training finishes.

Results
We evaluated each of these models on the Non-Semantic Speech Benchmark (NOSS) and two new tasks — a challenging task to detect whether a speaker is wearing a mask and the human-noise subset of the Environment Sound Classification dataset, which includes labels like “coughing” and “sneezing”. After eliminating models that have strictly better alternatives, we are left with eight ”frontier” models on the quality vs. latency curve, which are the models that had no faster and better performance alternatives at a corresponding quality threshold or latency in our batch of 144 models. We plot the latency vs. quality curve of only these "frontier" models below, and we ignore models that are strictly worse.

Embedding quality and latency tradeoff. The x-axis represents the inference latency and the y-axis shows the difference in accuracy from TRILL’s performance, averaged across benchmark datasets.

FRILL is the best performing sub-10ms inference model, with an inference time of 8.5 ms on a Pixel 1 (about 32x faster than TRILL), and is also roughly 40% the size of TRILL. The frontier curve plateaus at about 10ms latency, which means that at low latency, one can achieve much better performance with minimal latency costs, while achieving improved performance at latencies beyond 10ms is more difficult. This supports our choice of experiment hyperparameters. FRILL's per-task performance is shown in the table below.

FRILL TRILL

Size (MB) 38.5 98.1
Latency (ms) 8.5 275.3

Voxceleb1* 45.5 46.8
Voxforge 78.8 84.5
Speech Commands 81.0 81.7
CREMA-D 71.3 65.9
SAVEE 63.3 70.0
Masked Speech 68.0 65.8
ESC-50 HS 87.9 86.4
Accuracy on each of the classification tasks (higher is better).
*Results in our study use a small subset of Voxceleb1 filtered according to internal privacy guidelines. Interested readers can run our study on the full dataset using TensorFlow Datasets and our open-source evaluation code.

Finally, we evaluate the relative contribution of each of our hyperparameters. We find that for our experiments, quantization-aware training, bottleneck compression and global average pooling most reduced the latency of the resulting models. At the same time bottleneck compression most reduced the quality of the resulting model, while pooling reduced the model performance the least. The architecture width parameter was an important factor in reducing the model size, with minimal performance degradation.

Linear regression weight magnitudes for predicting model quality, latency, and size. The weights indicate the expected impact of changing the input hyperparameter. A higher weight magnitude indicates a greater expected impact.

Our work is an important step in bringing the full benefits of speech machine learning research to mobile devices. We also provide our public model, corresponding model card, and evaluation code to help the research community responsibly develop even more applications for on-device speech representation research.

Acknowledgements
We'd like to thank our paper co-authors: Jacob Peplinksi and Shwetak Patel. We'd like to thank Aren Jansen for his technical support on this project, Françoise Beaufays, and Tulsee Doshi for help open sourcing the model, and Google Research, Tokyo for logistical support.

Source: Google AI Blog


Using Variational Transformer Networks to Automate Document Layout Design

Information in a written document is not only conveyed by the meaning of the words contained in it, but also by the overall document layout. Layouts are commonly used to direct the order in which the reader parses a document to enable a better understanding (e.g., with columns or paragraphs), to provide helpful summaries (e.g., with titles) or for aesthetic purposes (e.g., when displaying advertisements).

While these design rules are easy to follow, it is difficult to explicitly define them without quickly needing to include exceptions or encountering ambiguous cases. This makes the automation of document design difficult, as any system with a hardcoded set of production rules will either be overly simplistic and thus incapable of producing original layouts (causing a lack of diversity in the layout of synthesized data), or too complex, with a large set of rules and their accompanying exceptions. In an attempt to solve this challenge, some have proposed machine learning (ML) techniques to synthesize document layouts. However, most ML-based solutions for automatic document design do not scale to a large number of layout components, or they rely on additional information for training, such as the relationships between the different components of a document.

In “Variational Transformer Networks for Layout Generation”, to be presented at CVPR 2021, we create a document layout generation system that scales to an arbitrarily large number of elements and does not require any additional information to capture the relationships between design elements. We use self-attention layers as building blocks of a variational autoencoder (VAE), which is able to model document layout design rules as a distribution, rather than using a set of predetermined heuristics, increasing the diversity of the generated layouts. The resulting Variational Transformer Network (VTN) model is able to extract meaningful relationships between the layout elements (paragraphs, tables, images, etc.), resulting in realistic synthetic documents (e.g., better alignment and margins). We show the effectiveness of this combination across different domains, such as scientific papers, UI layouts, and even furniture arrangements.

VAEs for Layout Generation
The ultimate goal of this system is to infer the design rules for a given type of layout from a collection of examples. If one considers these design rules as the distribution underlying the data, it is possible to use probabilistic models to discover it. We propose doing this with a VAE (widely used for tasks like image generation or anomaly detection), an autoencoder architecture that consists of two distinct subparts, the encoder and decoder. The encoder learns to compress the input to fewer dimensions, retaining only the necessary information to reconstruct the input, while the decoder learns to undo this operation. The compressed representation (also called the bottleneck) can be forced to behave like a known distribution (e.g., a uniform Gaussian). Feeding samples from this a priori distribution to the decoder segment of the network results in outputs similar to the training data.

An additional advantage of the VAE formulation is that it is agnostic to the type of operations used to implement the encoder and decoder segments. As such, we use self-attention layers (typically seen in Transformer architectures) to automatically capture the influence that each layout element has over the rest.

Transformers use self-attention layers to model long, sequenced relationships, often applied to an array of natural language understanding tasks, such as translation and summarization, as well as beyond the language domain in object detection or document layout understanding tasks. The self-attention operation relates every element in a sequence to every other and determines how they influence each other. This property is ideal to model relationships across different elements in a layout without the need for explicit annotations.

In order to synthesize new samples from these relationships, some approaches for layout generation [e.g., 1] and even for other domains [e.g., 2, 3] rely on greedy search algorithms, such as beam search, nucleus sampling or top-k sampling. Since these strategies are often based on exploration rules that tend to favor the most likely outcome at every step, the diversity of the generated samples is not guaranteed. However, by combining self-attention with the VAE’s probabilistic techniques, the model is able to directly learn a distribution from which it can extract new elements.

Modeling the Variational Bottleneck
The bottleneck of a VAE is commonly modeled as a vector representing the input. Since self-attention layers are a sequence-to-sequence architecture, i.e., a sequence of n input elements is mapped onto n output elements, the standard VAE formulation is difficult to apply. Inspired by BERT, we append an auxiliary token to the beginning of the sequence and treat it as the autoencoder bottleneck vector z. During training, the vector associated with this token is the only piece of information passed to the decoder, so the encoder needs to learn how to compress the entire document information in this vector. The decoder then learns to infer the number of elements in the document as well as the locations of each element in the input sequence from this vector alone. This strategy allows us to use standard techniques to regularize the bottleneck, such as the KL divergence.

Decoding
In order to synthesize documents with varying numbers of elements, the network needs to model sequences of arbitrary length, which is not trivial. While self-attention enables the encoder to adapt automatically to any number of elements, the decoder segment does not know the number of elements in advance. We overcome this issue by decoding sequences in an autoregressive way — at every step, the decoder produces an element, which is concatenated to the previously decoded elements (starting with the bottleneck vector z as input), until a special stop element is produced.

A visualization of our proposed architecture

Turning Layouts into Input Data
A document is often composed of several design elements, such as paragraphs, tables, images, titles, footnotes, etc. In terms of design, layout elements are often represented by the coordinates of their enclosing bounding boxes. To make this information easily digestible for a neural network, we define each element with four variables (x, y, width, height), representing the element’s location on the page (x, y) and size (width, height).

Results
We evaluate the performance of the VTN following two criteria: layout quality and layout diversity. We train the model on publicly available document datasets, such as PubLayNet, a collection of scientific papers with layout annotations, and evaluate the quality of generated layouts by quantifying the amount of overlap and alignment between elements. We measure how well the synthetic layouts resemble the training distribution using the Wasserstein distance over the distributions of element classes (e.g., paragraphs, images, etc.) and bounding boxes. In order to capture the layout diversity, we find the most similar real sample for each generated document using the DocSim metric, where a higher number of unique matches to the real data indicates a more diverse outcome.

We compare the VTN approach to previous works like LayoutVAE and Gupta et al. The former is a VAE-based formulation with an LSTM backbone, whereas Gupta et al. use a self-attention mechanism similar to ours, combined with standard search strategies (beam search). The results below show that LayoutVAE struggles to comply with design rules, like strict alignments, as in the case of PubLayNet. Thanks to the self-attention operation, Gupta et al. can model these constraints much more effectively, but the usage of beam search affects the diversity of the results.

IoU Overlap Alignment Wasserstein Class ↓ Wasserstein Box ↓ # Unique Matches ↑
LayoutVAE   0.171 0.321 0.472 - 0.045 241
Gupta et al.   0.039 0.006 0.361 0.018 0.012 546
VTN 0.031 0.017 0.347 0.022 0.012 697
Real Data   0.048 0.007 0.353 - - -
Results on PubLayNet. Down arrows (↓) indicate that a lower score is better, whereas up arrows (↑) indicate higher is better.

We also explore the ability of our approach to learn design rules in other domains, such as Android UIs (RICO), natural scenes (COCO) and indoor scenes (SUN RGB-D). Our method effectively learns the design rules of these datasets and produces synthetic layouts of similar quality as the current state of the art and a higher degree of diversity.

IoU Overlap Alignment Wasserstein Class ↓ Wasserstein Box ↓ # Unique Matches ↑
LayoutVAE   0.193 0.400 0.416 - 0.045 496
Gupta et al.   0.086 0.145 0.366 0.004 0.023 604
VTN 0.115 0.165 0.373 0.007 0.018 680
Real Data   0.084 0.175 0.410 - - -
Results on RICO. Down arrows (↓) indicate that a lower score is better, whereas up arrows (↑) indicate higher is better.
IoU Overlap Alignment Wasserstein Class ↓ Wasserstein Box ↓ # Unique Matches ↑
LayoutVAE   0.325 2.819 0.246 - 0.062 700
Gupta et al.   0.194 1.709 0.334 0.001 0.016 601
VTN 0.197 2.384 0.330 0.0005 0.013 776
Real Data   0.192 1.724 0.347 - - -
Results for COCO. Down arrows (↓) indicate that a lower score is better, whereas up arrows (↑) indicate higher is better.

Below are some examples of layouts produced by our method compared to existing methods. The design rules learned by the network (location, margins, alignment) resemble those of the original data and show a high degree of variability.

LayoutVAE  
Gupta et al.  
VTN
Qualitative results of our method on PubLayNet compared to existing state-of-the-art methods.

Conclusion
In this work we show the feasibility of using self-attention as part of the VAE formulation. We validate the effectiveness of this approach for layout generation, achieving state-of-the-art performance on various datasets and across different tasks. Our research paper also explores alternative architectures for the integration of self-attention and VAEs, exploring non-autoregressive decoding strategies and different types of priors, and analyzes advantages and disadvantages. The layouts produced by our method can help to create synthetic training data for downstream tasks, such as document parsing or automating graphic design tasks. We hope that this work provides a foundation for continued research in this area, as many subproblems are still not completely solved, such as how to suggest styles for the elements in the layout (text font, which image to choose, etc.) or how to reduce the amount of training data necessary for the model to generalize.

AcknowledgementsWe thank our co-author Janis Postels, as well as Alessio Tonioni and Luca Prasso for helping with the design of several of our experiments. We also thank Tom Small for his help creating the animations for this post.

Source: Google AI Blog


Extending Contrastive Learning to the Supervised Setting

In recent years, self-supervised representation learning, which is used in a variety of image and video tasks, has significantly advanced due to the application of contrastive learning. These contrastive learning approaches typically teach a model to pull together the representations of a target image (a.k.a., the “anchor”) and a matching (“positive”) image in embedding space, while also pushing apart the anchor from many non-matching (“negative”) images. Because labels are assumed to be unavailable in self-supervised learning, the positive is often an augmentation of the anchor, and the negatives are chosen to be the other samples from the training minibatch. However, because of this random sampling, false negatives, i.e., negatives generated from samples of the same class as the anchor, can cause a degradation in the representation quality. Furthermore, determining the optimal method to generate positives is still an area of active research.

In contrast to the self-supervised approach, a fully-supervised approach could use labeled data to generate positives from existing same-class examples, providing more variability in pretraining than could typically be achieved by simply augmenting the anchor. However, very little work has been done to successfully apply contrastive learning in the fully-supervised domain.

In “Supervised Contrastive Learning”, presented at NeurIPS 2020, we propose a novel loss function, called SupCon, that bridges the gap between self-supervised learning and fully supervised learning and enables contrastive learning to be applied in the supervised setting. Leveraging labeled data, SupCon encourages normalized embeddings from the same class to be pulled closer together, while embeddings from different classes are pushed apart. This simplifies the process of positive selection, while avoiding potential false negatives. Because it accommodates multiple positives per anchor, this approach results in an improved selection of positive examples that are more varied, while still containing semantically relevant information. SupCon also allows label information to play an active role in representation learning rather than restricting it to be used only in downstream training, as is the case for conventional contrastive learning. To the best of our knowledge, this is the first contrastive loss to consistently perform better on large-scale image classification problems than the common approach of using cross-entropy loss to train the model directly. Importantly, SupCon is straightforward to implement and stable to train, provides consistent improvement to top-1 accuracy for a number of datasets and architectures (including Transformer architectures), and is robust to image corruptions and hyperparameter variations.

Self-supervised (left) vs supervised (right) contrastive losses: The self-supervised contrastive loss contrasts a single positive for each anchor (i.e., an augmented version of the same image) against a set of negatives consisting of the entire remainder of the minibatch. The supervised contrastive loss considered in this paper, however, contrasts the set of all samples from the same class as positives against the negatives from the remainder of the batch.

The Supervised Contrastive Learning Framework
SupCon can be seen as a generalization of both the SimCLR and N-pair losses — the former uses positives generated from the same sample as that of the anchor, and the latter uses positives generated from different samples by exploiting known class labels. The use of many positives and many negatives for each anchor allows SupCon to achieve state-of-the-art performance without the need for hard negative mining (i.e., searching for negatives similar to the anchor), which can be difficult to tune properly.

SupCon subsumes multiple losses from the literature and is a generalization of the SimCLR and N-Pair losses.

This method is structurally similar to those used in self-supervised contrastive learning, with modifications for supervised classification. Given an input batch of data, we first apply data augmentation twice to obtain two copies, or “views,” of each sample in the batch (though one could create and use any number of augmented views). Both copies are forward propagated through an encoder network, and the resulting embedding is then L2-normalized. Following standard practice, the representation is further propagated through an optional projection network to help identify meaningful features. The supervised contrastive loss is computed on the normalized outputs of the projection network. Positives for an anchor consist of the representations originating from the same batch instance as the anchor or from other instances with the same label as the anchor; the negatives are then all remaining instances. To measure performance on downstream tasks, we train a linear classifier on top of the frozen representations.

Cross-entropy, self-supervised contrastive loss and supervised contrastive loss Left: The cross-entropy loss uses labels and a softmax loss to train a classifier. Middle: The self-supervised contrastive loss uses a contrastive loss and data augmentations to learn representations. Right: The supervised contrastive loss also learns representations using a contrastive loss, but uses label information to sample positives in addition to augmentations of the same image.

Key Findings
SupCon consistently boosts top-1 accuracy compared to cross-entropy, margin classifiers (with use of labels), and self-supervised contrastive learning techniques on CIFAR-10 and CIFAR-100 and ImageNet datasets. With SupCon, we achieve excellent top-1 accuracy on the ImageNet dataset with the ResNet-50 and ResNet-200 architectures. On ResNet-200, we achieve a top-1 accuracy of 81.4%, which is a 0.8% improvement over the state-of-the-art cross-entropy loss using the same architecture (which represents a significant advance for ImageNet). We also compared cross-entropy and SupCon on a Transformer-based ViT-B/16 model and found a consistent improvement over cross-entropy (77.8% versus 76% for ImageNet; 92.6% versus 91.6% for CIFAR-10) under the same data augmentation regime (without any higher-resolution fine-tuning).

The SupCon loss consistently outperforms cross-entropy with standard data augmentation strategies (AutoAugment, RandAugment and CutMix). We show top-1 accuracy for ImageNet, on ResNet-50, ResNet-101 and ResNet200.

We also demonstrate analytically that the gradient of our loss function encourages learning from hard positives and hard negatives. The gradient contributions from hard positives/negatives are large while those for easy positives/negatives are small. This implicit property allows the contrastive loss to sidestep the need for explicit hard mining, which is a delicate but critical part of many losses, such as triplet loss. See the supplementary material of our paper for a full derivation.

SupCon is also more robust to natural corruptions, such as noise, blur and JPEG compression. The mean Corruption Error (mCE) measures the average degradation in performance compared to the benchmark ImageNet-C dataset. The SupCon models have lower mCE values across different corruptions compared to cross-entropy models, showing increased robustness.

We show empirically that the SupCon loss is less sensitive than cross-entropy to a range of hyperparameters. Across changes in augmentations, optimizers, and learning rates, we observe significantly lower variance in the output of the contrastive loss. Moreover, applying different batch sizes while holding all other hyperparameters constant results in consistently better top-1 accuracy of SupCon to that of cross-entropy at each batch size.

Accuracy of cross-entropy and supervised contrastive loss as a function of hyperparameters and training data size, measured on ImageNet with a ResNet-50 encoder. Left: Boxplot showing Top-1 accuracy vs changes in augmentation, optimizer and learning rates. SupCon yields more consistent results across variations in each, which is useful when the best strategies are unknown a priori. Right: Top-1 accuracy as a function of batch size shows both losses benefit from larger batch sizes while SupCon has higher Top-1 accuracy, even when trained with small batch sizes.
Accuracy of supervised contrastive loss as a function of training duration and the temperature hyperparameter, measured on ImageNet with a ResNet-50 encoder. Left: Top-1 accuracy as a function of SupCon pre-training epochs. Right: Top-1 accuracy as a function of temperature during the pre-training stage for SupCon. Temperature is an important hyperparameter in contrastive learning and reducing sensitivity to temperature is desirable.

Broader Impact and Next Steps
This work provides a technical advancement in the field of supervised classification. Supervised contrastive learning can improve both the accuracy and robustness of classifiers with minimal complexity. The classic cross-entropy loss can be seen as a special case of SupCon where the views correspond to the images and the learned embeddings in the final linear layer corresponding to the labels. We note that SupCon benefits from large batch sizes, and being able to train the models on smaller batches is an important topic for future research.

Our Github repository includes Tensorflow code to train the models in the paper. Our pre-trained models are also released on TF-Hub.

Acknowledgements
The NeurIPS paper was jointly co-authored with Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Special thanks to Jenny Huang for leading the writing process for this blogpost.

Source: Google AI Blog


Data Cascades in Machine Learning

Data is a foundational aspect of machine learning (ML) that can impact performance, fairness, robustness, and scalability of ML systems. Paradoxically, while building ML models is often highly prioritized, the work related to data itself is often the least prioritized aspect. This data work can require multiple roles (such as data collectors, annotators, and ML developers) and often involves multiple teams (such as database, legal, or licensing teams) to power a data infrastructure, which adds complexity to any data-related project. As such, the field of human-computer interaction (HCI), which is focused on making technology useful and usable for people, can help both to identify potential issues and to assess the impact on models when data-related work is not prioritized.

In “‘Everyone wants to do the model work, not the data work’: Data Cascades in High-Stakes AI”, published at the 2021 ACM CHI Conference, we study and validate downstream effects from data issues that result in technical debt over time (defined as "data cascades"). Specifically, we illustrate the phenomenon of data cascades with the data practices and challenges of ML practitioners across the globe working in important ML domains, such as cancer detection, landslide detection, loan allocation and more — domains where ML systems have enabled progress, but also where there is opportunity to improve by addressing data cascades. This work is the first that we know of to formalize, measure, and discuss data cascades in ML as applied to real-world projects. We further discuss the opportunity presented by a collective re-imagining of ML data as a high priority, including rewarding ML data work and workers, recognizing the scientific empiricism in ML data research, improving the visibility of data pipelines, and improving data equity around the world.

Origins of Data Cascades
We observe that data cascades often originate early in the lifecycle of an ML system, at the stage of data definition and collection. Cascades also tend to be complex and opaque in diagnosis and manifestation, so there are often no clear indicators, tools, or metrics to detect and measure their effects. Because of this, small data-related obstacles can grow into larger and more complex challenges that affect how a model is developed and deployed. Challenges from data cascades include the need to perform costly system-level changes much later in the development process, or the decrease in users’ trust due to model mis-predictions that result from data issues. Nevertheless and encouragingly, we also observe that such data cascades can be avoided through early interventions in ML development.

Different color arrows indicate different types of data cascades, which typically originate upstream, compound over the ML development process, and manifest downstream.

Examples of Data Cascades
One of the most common causes of data cascades is when models that are trained on noise-free datasets are deployed in the often-noisy real world. For example, a common type of data cascade originates from model drifts, which occur when target and independent variables deviate, resulting in less accurate models. Drifts are more common when models closely interact with new digital environments — including high-stakes domains, such as air quality sensing, ocean sensing, and ultrasound scanning — because there are no pre-existing and/or curated datasets. Such drifts can lead to more factors that further decrease a model’s performance (e.g., related to hardware, environmental, and human knowledge). For example, to ensure good model performance, data is often collected in controlled, in-house environments. But in the live systems of new digital environments with resource constraints, it is more common for data to be collected with physical artefacts such as fingerprints, shadows, dust, improper lighting, and pen markings, which can add noise that affects model performance. In other cases, environmental factors such as rain and wind can unexpectedly move image sensors in deployment, which also trigger cascades. As one of the model developers we interviewed reported, even a small drop of oil or water can affect data that could be used to train a cancer prediction model, therefore affecting the model’s performance. Because drifts are often caused by the noise in real-world environments, they also take the longest — up to 2-3 years — to manifest, almost always in production.

Another common type of data cascade can occur when ML practitioners are tasked with managing data in domains in which they have limited expertise. For instance, certain kinds of information, such as identifying poaching locations or data collected during underwater exploration, rely on expertise in the biological sciences, social sciences, and community context. However, some developers in our study described having to take a range of data-related actions that surpassed their domain expertise — e.g., discarding data, correcting values, merging data, or restarting data collection — leading to data cascades that limited model performance. The practice of relying on technical expertise more than domain expertise (e.g., by engaging with domain experts) is what appeared to set off these cascades.

Two other cascades observed in this paper resulted from conflicting incentives and organizational practices between data collectors, ML developers, and other partners — for example, one cascade was caused by poor dataset documentation. While work related to data requires careful coordination across multiple teams, this is especially challenging when stakeholders are not aligned on priorities or workflows.

How to Address Data Cascades
Addressing data cascades requires a multi-part, systemic approach in ML research and practice:

  1. Develop and communicate the concept of goodness of the data that an ML system starts with, similar to how we think about goodness of fit with models. This includes developing standardized metrics and frequently using those metrics to measure data aspects like phenomenological fidelity (how accurately and comprehensively does the data represent the phenomena) and validity (how well the data explains things related to the phenomena captured by the data), similar to how we have developed good metrics to measure model performance, like F1-scores.
  2. Innovate on incentives to recognize work on data, such as welcoming empiricism on data in conference tracks, rewarding dataset maintenance, or rewarding employees for their work on data (collection, labelling, cleaning, or maintenance) in organizations.
  3. Data work often requires coordination across multiple roles and multiple teams, but this is quite limited currently (partly, but not wholly, because of the previously stated factors). Our research points to the value of fostering greater collaboration, transparency, and fairer distribution of benefits between data collectors, domain experts, and ML developers, especially with ML systems that rely on collecting or labelling niche datasets.
  4. Finally, our research across multiple countries indicates that data scarcity is pronounced in lower-income countries, where ML developers face the additional problem of defining and hand-curating new datasets, which makes it difficult to even start developing ML systems. It is important to enable open dataset banks, create data policies, and foster ML literacy of policy makers and civil society to address the current data inequalities globally.

Conclusion
In this work we both provide empirical evidence and formalize the concept of data cascades in ML systems. We hope to create an awareness of the potential value that could come from incentivising data excellence. We also hope to introduce an under-explored but significant new research agenda for HCI. Our research on data cascades has led to evidence-backed, state-of-the-art guidelines for data collection and evaluation in the revised PAIR Guidebook, aimed at ML developers and designers.

Acknowledgements
This paper was written in collaboration with Shivani Kapania, Hannah Highfill, Diana Akrong, Praveen Paritosh and Lora Aroyo. We thank our study participants, and Sures Kumar Thoddu Srinivasan, Jose M. Faleiro, Kristen Olson, Biswajeet Malik, Siddhant Agarwal, Manish Gupta, Aneidi Udo-Obong, Divy Thakkar, Di Dang, and Solomon Awosupin.

Source: Google AI Blog


Toward Generalized Sim-to-Real Transfer for Robot Learning

Reinforcement and imitation learning methods in robotics research can enable autonomous environmental navigation and efficient object manipulation, which in turn opens up a breadth of useful real-life applications. Previous work has demonstrated how robots that learn end-to-end using deep neural networks can reliably and safely interact with the unstructured world around us by comprehending camera observations to take actions and solve tasks. However, while end-to-end learning methods can generalize and scale for complicated robot manipulation tasks, they require hundreds of thousands real world robot training episodes, which can be difficult to obtain. One can attempt to alleviate this constraint by using a simulation of the environment that allows virtual robots to learn more quickly and at scale, but the simulations’ inability to exactly match the real world presents a challenge c ommonly referred to as the sim-to-real gap. One important source of the gap comes from discrepancies between the images rendered in simulation and the real robot camera observations, which then causes the robot to perform poorly in the real world.

To-date, work on bridging this gap has employed a technique called pixel-level domain adaptation, which translates synthetic images to realistic ones at the pixel level. One example of this technique is GraspGAN, which employs a generative adversarial network (GAN), a framework that has been very effective at image generation, to model this transformation between simulated and real images given datasets of each domain. These pseudo-real images correct some sim-to-real gap, so policies learned with simulation execute more successfully on real robots. A limitation for their use in sim-to-real transfer, however, is that because GANs translate images at the pixel-level, multi-pixel features or structures that are necessary for robot task learning may be arbitrarily modified or even removed.

To address the above limitation, and in collaboration with the Everyday Robot Project at X, we introduce two works, RL-CycleGAN and RetinaGAN, that train GANs with robot-specific consistencies — so that they do not arbitrarily modify visual features that are specifically necessary for robot task learning — and thus bridge the visual discrepancy between sim and real. We demonstrate how these consistencies preserve features critical to policy learning, eliminating the need for hand-engineered, task-specific tuning, which in turn allows for this sim-to-real methodology to work flexibly across tasks, domains, and learning algorithms. With RL-CycleGAN, we describe our sim-to-real transfer methodology and demonstrate state-of-the-art performance on real world grasping tasks trained with RL. With RetinaGAN, we extend our approach to include imitation learning with a door opening task.

RL-CycleGAN
In “RL-CycleGAN: Reinforcement Learning Aware Simulation-To-Real”, we leverage a variation of CycleGAN for sim-to-real adaptation by ensuring consistency of task-relevant features between real and simulated images. CycleGAN encourages preservation of image contents by ensuring an adapted image transformed back to the original domain is identical to the original image, which is called cycle consistency. To further encourage the adapted images to be useful for robotics, the CycleGAN is jointly trained with a reinforcement learning (RL) robot agent that ensures the robot’s actions are the same given both the original images and those after GAN-adaptation. That is, task-specific features like robot arm or graspable object locations are unaltered, but the GAN may still alter lighting or textural differences between domains that do not affect task-level decisions.

Evaluating RL-CycleGAN
We evaluated RL-CycleGAN on a robotic indiscriminate grasping task. Trained on 580,000 real trials and simulations adapted with RL-CycleGAN, the robot grasps objects with 94% success, surpassing the 89% success rate of the prior state-of-the-art sim-to-real method GraspGAN and the 87% mark using real-only data without simulation. With only 28,000 trials, the RL-CycleGAN method reaches 86%, comparable to the previous baselines with 20x the data. Some examples of the RL-CycleGAN output alongside the simulation images are shown below.

Comparison between simulation images of robot grasping before (left) and after RL-CycleGAN translation (right).

RetinaGAN
While RL-CycleGAN reliably transfers from sim-to-real for the RL domain using task awareness, a natural question arises: can we develop a more flexible sim-to-real transfer technique that applies broadly to different tasks and robot learning techniques?

In “RetinaGAN: An Object-Aware Approach to Sim-to-Real Transfer”, presented at ICRA 2021, we develop such a task-decoupled, algorithm-decoupled GAN approach to sim-to-real transfer by instead focusing on robots’ perception of objects. RetinaGAN enforces strong object-semantic awareness through perception consistency via object detection to predict bounding box locations for all objects on all images. In an ideal sim-to-real model, we expect the object detector to predict the same box locations before and after GAN translation, as objects should not change structurally. RetinaGAN is trained toward this ideal by backpropagation, such that there is consistency in perception of objects both when a) simulated images are transformed from simulation to real and then back to simulation and b) when real images are transformed from real to simulation and then back to real. We find this object-based consistency to be more widely applicable than the task-specific consistency required by RL-CycleGAN.

Diagram of RetinaGAN stages. The simulated image (top left) is transformed by the sim-to-real generator and subsequently by the real-to-sim generator. The real image (bottom left) undergoes the transformation in reverse order. Having separate pipelines that start with the simulated and real images improves the GAN’s performance.

Evaluating RetinaGAN on a Real Robot
Given the goal of building a more flexible sim-to-real transfer technique, we evaluate RetinaGAN in multiple ways to understand for which tasks and under what conditions it accomplishes sim-to-real transfer.

We first apply RetinaGAN to a grasping task. As demonstrated visually below, RetinaGAN emphasizes the translation of realistic object textures, shadows, and lighting, while maintaining the visual quality and saliency of the graspable objects. We couple a pre-trained RetinaGAN model with the distributed reinforcement learning method Q2-Opt to train a vision-based task model for instance grasping. On real robots, this policy grasps object instances with 80% success when trained on a hundred thousand episodes — outperforming prior adaptation methods RL-CycleGAN and CycleGAN (both achieving ~68%) and training without domain adaptation (grey bars below: 19% with sim data, 22% with real data, and 54% with mixed data). This gives us confidence that perception consistency is a valuable strategy for sim-to-real transfer. Further, with just 10,000 training episodes (8% of the data), the RL policy with RetinaGAN grasps with 66% success, demonstrating performance of prior methods with significantly less data.

Evaluation performance of RL policies on instance grasping, trained with various datasets and sim-to-real methods. Low-Data RetinaGAN uses 8% of the real dataset.
The simulated grasping environment (left) is translated to a realistic image (right) using RetinaGAN.

Next, we pair RetinaGAN with a different learning method, behavioral cloning, to open conference room doors given demonstrations by human operators. Using images from both simulated and real demonstrations, we train RetinaGAN to translate the synthetic images to look realistic, bridging the sim-to-real gap. We then train a behavior cloning model to imitate the task-solving actions of the human operators within real and RetinaGAN-adapted sim demonstrations. When evaluating this model by predicting actions to take, the robot enters real conference rooms over 93% of the time, surpassing baselines of 75% and below.

Both of the above images show the same simulation, but RetinaGAN translates simulated door opening images (left) to look more like real robot sensor data (right).
Three examples of the real robot successfully opening conference room doors using the RetinaGAN-trained behavior cloning policy.

Conclusion
This work has demonstrated how additional constraints on GANs may address the visual sim-to-real gap without requiring task-specific tuning; these approaches reach higher real robot success rates with less data collection. RL-CycleGAN translates synthetic images to realistic ones with an RL-consistency loss that automatically preserves task-relevant features. RetinaGAN is an object-aware sim-to-real adaptation technique that transfers robustly across environments and tasks, agnostic to the task learning method. Since RetinaGAN is not trained with any task-specific knowledge, we show how it can be reused for a novel object pushing task. We hope that work on the sim-to-real gap further generalizes toward solving task-agnostic robotic manipulation in unstructured environments.

Acknowledgements
Research into RL-CycleGAN was conducted by Kanishka Rao, Chris Harris, Alex Irpan, Sergey Levine, Julian Ibarz, and Mohi Khansari. Research into RetinaGAN was conducted by Daniel Ho, Kanishka Rao, Zhuo Xu, Eric Jang, Mohi Khansari, and Yunfei Bai. We’d also like to give special thanks to Ivonne Fajardo, Noah Brown, Benjamin Swanson, Christopher Paguyo, Armando Fuentes, and Sphurti More for overseeing the robot operations. We thank Paul Wohlhart, Konstantinos Bousmalis, Daniel Kappler, Alexander Herzog, Anthony Brohan, Yao Lu, Chad Richards, Vincent Vanhoucke, and Mrinal Kalakrishnan, Max Braun and others in the Robotics at Google team and the Everyday Robot Project for valuable discussions and help.

Source: Google AI Blog


Cross-Modal Contrastive Learning for Text-to-Image Generation

Automatic text-to-image synthesis, in which a model is trained to generate images from text descriptions alone, is a challenging task that has recently received significant attention. Its study provides rich insights into how machine learning (ML) models capture visual attributes and relate them to text. Compared to other kinds of inputs to guide image creation, such as sketches, object masks or mouse traces (which we have highlighted in prior work), descriptive sentences are a more intuitive and flexible way to express visual concepts. Hence, a strong automatic text-to-image generation system can also be a useful tool for rapid content creation and could be applied to many other creative applications, similar to other efforts to integrate machine learning into the creation of art (e.g., Magenta).

State-of-the-art image synthesis results are typically achieved using generative adversarial networks (GANs), which train two models — a generator, which tries to create realistic images, and a discriminator, which tries to determine if an image is real or fabricated. Many text-to-image generation models are GANs that are conditioned using text inputs in order to generate semantically relevant images. This is significantly challenging, especially when long, ambiguous descriptions are provided. Moreover, GAN training can be prone to mode collapse, a common failure case for the training process in which the generator learns to produce only a limited set of outputs, so that the discriminator fails to learn robust strategies to recognize fabricated images. To mitigate mode collapse, some approaches use multi-stage refinement networks that iteratively refine an image. However, such systems require multi-stage training, which is less efficient than simpler single-stage end-to-end models. Other efforts rely on hierarchical approaches that first model object layouts before finally synthesizing a realistic image. This requires the use of labeled segmentation data, which can be difficult to obtain.

In “Cross-Modal Contrastive Learning for Text-to-Image Generation,” to appear at CVPR 2021, we present the Cross-Modal Contrastive Generative Adversarial Network (XMC-GAN), which addresses text-to-image generation by learning to maximize the mutual information between image and text using inter-modal (image-text) and intra-modal (image-image) contrastive losses. This approach helps the discriminator to learn more robust and discriminative features, so XMC-GAN is less prone to mode collapse even with one-stage training. Importantly, XMC-GAN achieves state-of-the-art performance with a simple one-stage generation, as compared to previous multi-stage or hierarchical approaches. It is end-to-end trainable, and only requires image-text pairs (as opposed to labeled segmentation or bounding box data).

Contrastive Losses for Text-to-Image Synthesis
The goal of text-to-image synthesis systems is to produce clear, photo-realistic scenes with high semantic fidelity to their conditioned text descriptions. To achieve this, we propose to maximize the mutual information between the corresponding pairs: (1) images (real or generated) with a sentence describing the scene; (2) a generated image and a real image with the same description; and (3) regions of an image (real or generated) and words or phrases associated with them.

In XMC-GAN, this is enforced using contrastive losses. Similar to other GANs, XMC-GAN contains a generator for synthesizing images, and a discriminator that is trained to act as a critic between real and generated images. Three sets of data contribute to the contrastive loss in this system — the real images, the text that describes those images, and the images generated from the text descriptions. The individual loss functions for both the generator and the discriminator are combinations of the loss calculated from whole images with the full text description, combined with the loss calculated from sub-divided images with associated words or phrases. Then, for each batch of training data, we calculate the cosine similarity score between each text description and the real images, and likewise, between each text description and the batch of generated images. The goal is for the matching pairs (both text-to-image and real image-to-generated image) to have high similarity scores and for non-matching pairs to have low scores. Enforcing such a contrastive loss allows the discriminator to learn more robust and discriminative features.

Inter-modal and intra-modal contrastive learning in our proposed XMC-GAN text-to-image synthesis model.

Results
We apply XMC-GAN to three challenging datasets — the first was a collection of MS-COCO descriptions of MS-COCO images, and the other two were datasets annotated with Localized Narratives, one of which covers MS-COCO images (which we call LN-COCO) and the other of which describes Open Images data (LN-OpenImages). We find that XMC-GAN achieves a new state of the art on each. The images generated by XMC-GAN depict scenes that are of higher quality than those generated using other techniques. On MS-COCO, XMC-GAN improves the state-of-the-art Fréchet inception distance (FID) score from 24.7 to 9.3, and is significantly preferred by human evaluators.

Selected qualitative results for generated images on MS-COCO.

Similarly, human raters prefer the image quality in XMC-GAN generated images 77.3% of the time, and 74.1% prefer its image-text alignment compared to three other state-of-the-art approaches (CP-GAN, SD-GAN, and OP-GAN) .

Human evaluation on MS-COCO for image quality and text alignment. Annotators rank (anonymized and order-randomized) generated images from best to worst.

XMC-GAN also generalizes well to the challenging Localized Narratives dataset, which contains longer and more detailed descriptions. Our prior work TReCS tackles text-to-image generation for Localized Narratives using mouse trace inputs to improve image generation quality. Despite not receiving mouse trace annotations, XMC-GAN is able to significantly outperform TReCS on image generation on LN-COCO, improving state-of-the-art FID from 48.7 to 14.1. Incorporating mouse traces and other additional inputs into an end-to-end model such as XMC-GAN would be interesting to study in future work.

In addition, we also train and evaluate on the LN-OpenImages, which is more challenging than MS-COCO because the dataset is much larger with images that cover a broader range of subject matter and that are more complex (8.4 objects on average). To the best of our knowledge, XMC-GAN is the first text-to-image synthesis model that is trained and evaluated on Open Images. XMC-GAN is able to generate high quality results, and sets a strong benchmark FID score of 26.9 on this very challenging task.

Random samples of real and generated images on Open Images.

Conclusion and Future Work
In this work, we present a cross-modal contrastive learning framework to train GAN models for text-to-image synthesis. We investigate several cross-modal contrastive losses that enforce correspondence between image and text. For both human evaluations and quantitative metrics, XMC-GAN establishes a marked improvement over previous models on multiple datasets. It generates high quality images that match their input descriptions well, including for long, detailed narratives, and does so while being a simpler, end-to-end model. We believe that this represents a significant advance towards creative applications for image generation from natural language descriptions. As we continue this research, we are continually evaluating responsible approaches, potential applications and risk mitigation, in accordance with our AI Principles.

Acknowledgements
This is a joint work with Jason Baldridge, Honglak Lee, and Yinfei Yang. We would like to thank Kevin Murphy, Zizhao Zhang, Dilip Krishnan for their helpful feedback. We also want to thank the Google Data Compute team for their work on conducting human evaluations. We are also grateful for general support from the Google Research team.

Source: Google AI Blog