Tag Archives: NeurIPS

PAIRED: A New Multi-agent Approach for Adversarial Environment Generation

The effectiveness of any machine learning method is critically dependent on its training data. In the case of reinforcement learning (RL), one can rely either on limited data collected by an agent interacting with the real world, or on a simulated training environment that can be used to collect as much data as needed. This latter method of training in simulation is increasingly popular, but it has a problem: the RL agent can learn what is built into the simulator, but tends to be bad at generalizing to tasks that are even slightly different from the ones simulated. And building a simulator that covers all the complexity of the real world is extremely challenging.

An approach to address this is to automatically create more diverse training environments by randomizing all the parameters of the simulator, a process called domain randomization (DR). However, DR can fail even in very simple environments. For example, in the animation below, the blue agent is trying to navigate to the green goal. The left panel shows an environment created with DR where the positions of the obstacles and goal have been randomized. Many of these DR environments were used to train the agent, which was then transferred to the simple Four Rooms environment in the middle panel. Notice that the agent can’t find the goal. This is because it has not learned to walk around walls. Even though the wall configuration from the Four Rooms example could have been generated randomly in the DR training phase, it’s unlikely. As a result, the agent has not spent enough time training on walls similar to the Four Rooms structure, and is unable to reach the goal.

Domain randomization (left) does not effectively prepare an agent to transfer to previously unseen environments, such as the Four Rooms scenario (middle). To address this, a minimax adversary is used to construct previously unseen environments (right), but can result in creating situations that are impossible to solve.

Instead of just randomizing the environment parameters, one could train a second RL agent to learn how to set the environment parameters. This minimax adversary can be trained to minimize the performance of the first RL agent by finding and exploiting weaknesses in its policy, e.g., by building wall configurations it has not encountered before. But again there is a problem. The right panel shows an environment built by a minimax adversary in which it is actually impossible for the agent to reach the goal. While the minimax adversary has succeeded in its task (it has minimized the performance of the original agent), it provides no opportunity for the agent to learn. A purely adversarial objective is therefore not well suited to generating training environments, either.

In collaboration with UC Berkeley, we propose a new multi-agent approach for training the adversary in “Emergent Complexity and Zero-shot Transfer via Unsupervised Environment Design”, a publication recently presented at NeurIPS 2020. In this work we present an algorithm, Protagonist Antagonist Induced Regret Environment Design (PAIRED), that is based on minimax regret and prevents the adversary from creating impossible environments, while still enabling it to correct weaknesses in the agent’s policy. PAIRED incentivizes the adversary to tune the difficulty of the generated environments to be just outside the agent’s current abilities, leading to an automatic curriculum of increasingly challenging training tasks. We show that agents trained with PAIRED learn more complex behavior and generalize better to unknown test tasks. We have released open-source code for PAIRED on our GitHub repo.

PAIRED
To flexibly constrain the adversary, PAIRED introduces a third RL agent, which we call the antagonist agent, because it is allied with the adversarial agent, i.e., the one designing the environment. We rename our initial agent, the one navigating the environment, the protagonist. Once the adversary generates an environment, both the protagonist and antagonist play through that environment.

The adversary’s job is to maximize the antagonist’s reward while minimizing the protagonist's reward. This means it must create environments that are feasible (because the antagonist can solve them and get a high score), but challenging to the protagonist (exploit weaknesses in its current policy). The gap between the two rewards is the regret — the adversary tries to maximize the regret, while the protagonist competes to minimize it.
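The regret signal described above can be sketched in a few lines. This is a minimal illustration, not the released PAIRED code: the function names are hypothetical, and averaging the returns of each agent over several episodes is an assumption made here to get a single number per environment.

```python
from statistics import mean


def estimate_regret(protagonist_returns, antagonist_returns):
    """Regret for one generated environment: antagonist score minus protagonist score.

    Each argument is a list of episode returns collected in the same environment.
    Using the mean of each list is an assumption of this sketch.
    """
    return mean(antagonist_returns) - mean(protagonist_returns)


def training_signals(protagonist_returns, antagonist_returns):
    """The adversary tries to maximize regret; the protagonist tries to minimize it."""
    regret = estimate_regret(protagonist_returns, antagonist_returns)
    return {"adversary": regret, "protagonist": -regret}


# An impossible environment: neither agent scores, so the adversary gains nothing.
impossible = training_signals([0.0, 0.0], [0.0, 0.0])

# A feasible but challenging environment: the antagonist succeeds, the protagonist struggles.
challenging = training_signals([0.2, 0.4], [0.8, 1.0])
```

In this toy setting the impossible environment yields zero regret, which is why a regret-maximizing adversary has no incentive to build it.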

The methods discussed above (domain randomization, the minimax adversary, and PAIRED) can be analyzed using the same theoretical framework, unsupervised environment design (UED), which we describe in detail in the paper. UED draws a connection between environment design and decision theory, enabling us to show that domain randomization is equivalent to the Principle of Insufficient Reason, the minimax adversary follows the Maximin Principle, and PAIRED is optimizing minimax regret. Below, we show how each of these ideas works for environment design:

Domain randomization (a) generates unstructured environments that aren’t tailored to the agent’s learning progress. The minimax adversary (b) may create impossible environments. PAIRED (c) can generate challenging, structured environments, which are still possible for the agent to complete.

Curriculum Generation
What’s interesting about minimax regret is that it incentivizes the adversary to generate a curriculum of initially easy, then increasingly challenging environments. In most RL environments, the reward function will give a higher score for completing the task more efficiently, or in fewer timesteps. When this is true, we can show that regret incentivizes the adversary to create the easiest possible environment the protagonist can’t solve yet. To see this, let’s assume the antagonist is perfect, and always gets the highest score that it possibly can. Meanwhile, the protagonist is terrible, and gets a score of zero on everything. In that case, the regret just depends on the difficulty of the environment. Since easier environments can be completed in fewer timesteps, they allow the antagonist to get a higher score. Therefore, the regret of failing at an easy environment is greater than the regret of failing at a hard environment.

So, by maximizing regret the adversary is searching for easy environments that the protagonist fails to solve. Once the protagonist learns to solve each environment, the adversary must move on to finding a slightly harder environment that the protagonist can’t solve. Thus, the adversary generates a curriculum of increasingly difficult tasks.
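The curriculum argument can be checked with a toy calculation. The sketch below makes several assumptions that are not from the paper: a hypothetical reward of one minus the fraction of a 100-step budget used, a perfect antagonist, and a protagonist whose ability is summarized by the longest environment it can solve.

```python
def episode_return(steps_needed, solved, max_steps=100):
    """Hypothetical reward: higher for faster completion, zero on failure."""
    if not solved or steps_needed > max_steps:
        return 0.0
    return 1.0 - steps_needed / max_steps


def regret(steps_needed, protagonist_solves, max_steps=100):
    """Regret against an antagonist assumed to solve every environment optimally."""
    antagonist = episode_return(steps_needed, True, max_steps)
    protagonist = episode_return(steps_needed, protagonist_solves, max_steps)
    return antagonist - protagonist


def best_environment(candidates, skill, max_steps=100):
    """Pick the candidate (identified by its step count) with the highest regret.

    `skill` is the largest step count the protagonist can currently solve.
    """
    return max(candidates, key=lambda steps: regret(steps, steps <= skill, max_steps))


candidates = [5, 10, 20, 40, 80]
# As the protagonist's skill grows, the regret-maximizing choice gets harder.
curriculum = [best_environment(candidates, skill) for skill in (0, 10, 20, 40)]
```

Under these assumptions the adversary always picks the easiest environment the protagonist still fails, so the selected difficulty rises step by step as skill increases.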

Results
We can see the curriculum emerging in the learning curves below, which plot the shortest path length of a maze the agents have successfully solved. Unlike minimax or domain randomization, the PAIRED adversary creates a curriculum of increasingly longer, but possible, mazes, enabling PAIRED agents to learn more complex behavior.

But can these different training schemes help an agent generalize better to unknown test tasks? Below, we see the zero-shot transfer performance of each algorithm on a series of challenging test tasks. As the complexity of the transfer environment increases, the performance gap between PAIRED and the baselines widens. For extremely difficult tasks like Labyrinth and Maze, PAIRED is the only method that can occasionally solve the task. These results provide promising evidence that PAIRED can be used to improve generalization for deep RL.

Admittedly, these simple gridworlds do not reflect the complexities of the real world tasks that many RL methods are attempting to solve. We address this in “Adversarial Environment Generation for Learning to Navigate the Web”, which examines the performance of PAIRED when applied to more complex problems, such as teaching RL agents to navigate web pages. We propose an improved version of PAIRED, and show how it can be used to train an adversary to generate a curriculum of increasingly challenging websites:

Above, you can see websites built by the adversary in the early, middle, and late training stages, which progress from using very few elements per page to many simultaneous elements, making the tasks progressively harder. We test whether agents trained on this curriculum can generalize to standardized web navigation tasks, and achieve a 75% success rate, with a 4x improvement over the strongest curriculum learning baseline:

Conclusions
Deep RL is very good at fitting a simulated training environment, but how can we build simulations that cover the complexity of the real world? One solution is to automate this process. We propose Unsupervised Environment Design (UED) as a framework that describes different methods for automatically creating a distribution of training environments, and show that UED subsumes prior work like domain randomization and minimax adversarial training. We think PAIRED is a good approach for UED, because regret maximization leads to a curriculum of increasingly challenging tasks, and prepares agents to transfer successfully to unknown test tasks.

Acknowledgements
We would like to recognize the co-authors of “Emergent Complexity and Zero-shot Transfer via Unsupervised Environment Design”: Michael Dennis, Natasha Jaques, Eugene Vinitsky, Alexandre Bayen, Stuart Russell, Andrew Critch, and Sergey Levine, as well as the co-authors of “Adversarial Environment Generation for Learning to Navigate the Web”: Izzeddin Gur, Natasha Jaques, Yingjie Miao, Jongwook Choi, Kevin Malta, Manoj Tiwari, Honglak Lee, and Aleksandra Faust. In addition, we thank Michael Chang, Marvin Zhang, Dale Schuurmans, Aleksandra Faust, Chase Kew, Jie Tan, Dennis Lee, Kelvin Xu, Abhishek Gupta, Adam Gleave, Rohin Shah, Daniel Filan, Lawrence Chan, Sam Toyer, Tyler Westenbroek, Igor Mordatch, Shane Gu, DJ Strouse, and Max Kleiman-Weiner for discussions that contributed to this work.

Source: Google AI Blog


Presenting a Challenge and Workshop in Efficient Open-Domain Question Answering



One of the primary goals of natural language processing is to build systems that can answer a user's questions. To do this, computers need to be able to understand questions, represent world knowledge, and reason their way to answers. Traditionally, answers have been retrieved from a collection of documents or a knowledge graph. For example, to answer the question, “When was the Declaration of Independence officially signed?” a system might first find the most relevant article from Wikipedia, and then locate a sentence containing the answer, “August 2, 1776”. However, more recent approaches, like T5, have shown that neural models trained on large amounts of web text can also answer questions directly, without retrieving documents or facts from a knowledge graph. This has led to significant debate about how knowledge should be stored for use by our question answering systems: in human-readable text and structured formats, or in the learned parameters of a neural network.

Today, we are proud to announce the EfficientQA competition and workshop at NeurIPS 2020, organized in cooperation with Princeton University and the University of Washington. The goal is to develop an end-to-end question answering system that contains all of the knowledge required to answer open-domain questions. There are no constraints on how the knowledge is stored — it could be in documents, databases, the parameters of a neural network, or any other form — but entries will be evaluated based on the number of bytes used to access this knowledge, including code, corpora, and model parameters. There will also be an unconstrained track, in which the goal is to achieve the best possible question answering performance regardless of system size. To build small, yet robust systems, participants will have to explore new methods of knowledge representation and reasoning.
An illustration of how the memory budget changes as a neural network and retrieval corpus grow and shrink. It is possible that successful systems will also use other resources such as a knowledge graph.
Competition Overview
The competition will be evaluated using the open-domain variant of the Natural Questions dataset. We will also provide further human evaluation of all the top performing entries to account for the fact that there are many correct ways to answer a question, not all of which will be covered by any set of reference answers. For example, for the question “What type of car is a Jeep considered?” both “off-road vehicles” and “crossover SUVs” are valid answers.

The competition is divided into four separate tracks: best performing system under 500 MB; best performing system under 6 GB; smallest system to get at least 25% accuracy; and the best performing system with no constraints. The winners of each of these tracks will be invited to present their work during the competition track at NeurIPS 2020, which will be hosted virtually. We will also put each of the winning systems up against human trivia experts (the 2017 NeurIPS Human-Computer competition featured Jeopardy! and Who Wants to Be a Millionaire champions) in a real-time contest at the virtual conference.
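As a rough illustration of how the four tracks relate, the sketch below maps a system's total size and accuracy to the tracks it could enter. The function and track names are hypothetical, and reading the size limits as decimal megabytes and gigabytes is an assumption; the competition site is the authority on how sizes are measured.

```python
MB = 10**6  # assumed decimal units; the official rules may define these differently
GB = 10**9


def eligible_tracks(size_bytes, accuracy):
    """Return the tracks a system qualifies for, given its size and accuracy (0 to 1)."""
    tracks = ["unconstrained"]  # every system can enter the unconstrained track
    if size_bytes < 6 * GB:
        tracks.append("under_6gb")
    if size_bytes < 500 * MB:
        tracks.append("under_500mb")
    if accuracy >= 0.25:
        tracks.append("smallest_at_25_percent")
    return tracks


# A hypothetical 450 MB system at 30% accuracy qualifies for all four tracks.
small_system = eligible_tracks(450 * MB, 0.30)

# A hypothetical 20 GB system at 20% accuracy can only enter the unconstrained track.
large_system = eligible_tracks(20 * GB, 0.20)
```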

Participation
To participate, go to the competition site where you will find the data and evaluation code available for download, as well as dates and instructions on how to participate, and a sign-up form for updates. Along with our academic collaborators, we have provided some example systems to help you get started.

We believe that the field of natural language processing will benefit from a greater exploration and comparison of small system question answering options. We hope that by encouraging the development of very small systems, this competition will pave the way for on-device question answering.

Acknowledgements
Creating this challenge and workshop has been a large team effort including Adam Roberts, Colin Raffel, Chris Alberti, Jordan Boyd-Graber, Jennimaria Palomaki, Kenton Lee, Kelvin Guu, and Michael Collins from Google; as well as Sewon Min and Hannaneh Hajishirzi from the University of Washington; and Danqi Chen from Princeton University.

Source: Google AI Blog


Can You Trust Your Model’s Uncertainty?



In an ideal world, machine learning (ML) methods like deep learning are deployed to make predictions on data from the same distribution as that on which they were trained. But the practical reality can be quite different: camera lenses becoming blurry, sensors degrading, and changes to popular online topics can result in differences between the distribution of the data on which a model was trained and that of the data to which it is applied, leading to what is known as covariate shift. For example, it was recently observed that deep learning models trained to detect pneumonia in chest x-rays would achieve very different levels of accuracy when evaluated on previously unseen hospitals’ data, due in part to subtle differences in image acquisition and processing.

In “Can you trust your model’s uncertainty? Evaluating Predictive Uncertainty Under Dataset Shift”, presented at NeurIPS 2019, we benchmark the uncertainty of state-of-the-art deep learning models as they are exposed to both shifting data distributions and out-of-distribution data. In this work we consider a variety of input modalities, including images, text and online advertising data, exposing these deep learning models to increasingly shifted test data while carefully analyzing the behavior of their predictive probabilities. We also compare a variety of different methods for improving model uncertainty to see which strategies perform best under distribution shift.

What is Out-of-Distribution Data?
Deep learning models provide a probability with each prediction, representing the model confidence or uncertainty. As such, they can express what they don’t know and, correspondingly, abstain from prediction when the data is outside the realm of the original training dataset. In the case of covariate shift, uncertainty would ideally increase proportionally to any decrease in accuracy. A more extreme case is when data are not at all represented in the training set, i.e., when the data are out-of-distribution (OOD). For example, consider what happens when a cat-versus-dog image classifier is shown an image of an airplane. Would the model confidently predict incorrectly or would it assign a low probability to each class? In a related post we recently discussed methods we developed to identify such OOD examples. In this work we instead analyze the predictive uncertainty of models given out-of-distribution and shifted examples to see if the model probabilities reflect their ability to predict on such data.

Quantifying the Quality of Uncertainty
What does it mean for one model to have better representation of its uncertainty than another? While this can be a nuanced question that often is defined by a downstream task, there are ways to quantitatively assess the general quality of probabilistic predictions. For example, the meteorological community has carefully considered this question and developed a set of proper scoring rules that a comparison function for probabilistic weather forecasts should satisfy in order to be well-calibrated, while still rewarding accuracy. We applied several of these proper scoring rules, such as the Brier Score and Negative Log Likelihood (NLL), along with more intuitive heuristics, such as the expected calibration error (ECE), to understand how different ML models dealt with uncertainty under dataset shift.
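For readers who want to see what these three metrics compute, here is a minimal pure-Python sketch. It is not the benchmark's released code: the function names, the choice of 10 equal-width confidence bins for ECE, and the clipping constant in the NLL are assumptions of this example.

```python
import math


def brier_score(probs, labels):
    """Mean squared error between predicted class probabilities and one-hot labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        total += sum((pk - (1.0 if k == y else 0.0)) ** 2 for k, pk in enumerate(p))
    return total / len(labels)


def negative_log_likelihood(probs, labels, eps=1e-12):
    """Average negative log probability assigned to the true class."""
    return -sum(math.log(max(p[y], eps)) for p, y in zip(probs, labels)) / len(labels)


def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted average gap between confidence and accuracy over confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        confidence = max(p)
        predicted = p.index(confidence)
        idx = min(int(confidence * n_bins), n_bins - 1)
        bins[idx].append((confidence, 1.0 if predicted == y else 0.0))
    ece = 0.0
    for members in bins:
        if members:
            avg_conf = sum(c for c, _ in members) / len(members)
            avg_acc = sum(a for _, a in members) / len(members)
            ece += len(members) / len(labels) * abs(avg_conf - avg_acc)
    return ece


# Two predictions made with 90% confidence, only one of which is correct.
probs = [[0.9, 0.1], [0.9, 0.1]]
labels = [0, 1]
scores = (
    brier_score(probs, labels),
    negative_log_likelihood(probs, labels),
    expected_calibration_error(probs, labels),
)
```

In this toy case the model is 90% confident but only 50% accurate, so the ECE is the 0.4 gap between the two.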

Experiments
We analyze the effect of dataset shift on uncertainty across a variety of data modalities, including images, text, online advertising data and genomics. As an example, we illustrate the effect of dataset shift on the ImageNet dataset, a popular image understanding benchmark. ImageNet involves classifying over a million images into 1000 different categories. Some now consider this challenge mostly solved, and have developed harder variants, such as Corrupted ImageNet (or ImageNet-C), in which the data are augmented according to 16 different realistic corruptions, each at 5 different intensities.
We explore how model uncertainty behaves under changes to the data distribution, such as increasing intensities of the image perturbations used in Corrupted ImageNet. Shown here are examples of each type of image corruption, at intensity level 3 (of 5).
We used these corrupted images as examples of shifted data and examined the predictive probabilities of deep learning models as they were exposed to shifts of increasing intensity. Below we show box plots of the resulting accuracy and the ECE for each level of corruption (including uncorrupted test data), where each box aggregates across all corruption types in ImageNet-C. Each color represents a different type of model — a “vanilla” deep neural network used as a baseline, four uncertainty methods (dropout, temperature scaling and our last layer approaches), and an ensemble approach.
Accuracy (top) and expected calibration error (bottom; lower is better) for increasing intensities of dataset shift on ImageNet-C. We observe that the decrease in accuracy is not reflected by an increase in uncertainty of the model, indicated by both accuracy and ECE getting worse.
As the shift intensity increases, the deviation in accuracy across corruption methods for each model increases (increasing box size), as expected, and the accuracy on the whole decreases. Ideally this would be reflected in increasing uncertainty of the model, thus leaving the expected calibration error (ECE) unchanged. However, looking at the lower plot of the ECE, one sees that this is not the case and that calibration generally suffers as well. We observed similar worsening trends for Brier score and NLL indicating that the models are not becoming increasingly unsure with shift, but instead are becoming confidently wrong.

One popular method to improve calibration is known as temperature scaling, a variant of Platt scaling, which involves smoothing the predictions after training, using performance on a held-out validation set. We observed that while this improved calibration on the standard test data, it often made things worse on shifted data! Thus, practitioners applying this technique should be wary of distributional shift.
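A minimal sketch of temperature scaling follows, assuming access to a trained model's logits on a held-out validation set. The grid search over candidate temperatures and all function names are choices made for this illustration; implementations commonly fit the temperature with a gradient-based optimizer instead.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits divided by a temperature; higher temperatures smooth the output."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def validation_nll(logit_rows, labels, temperature):
    """Average negative log likelihood of the true labels at a given temperature."""
    losses = [-math.log(softmax(row, temperature)[y]) for row, y in zip(logit_rows, labels)]
    return sum(losses) / len(losses)


def fit_temperature(val_logits, val_labels, candidates=None):
    """Pick the temperature that minimizes NLL on the held-out validation set."""
    if candidates is None:
        candidates = [0.5 + 0.1 * i for i in range(46)]  # grid from 0.5 to 5.0
    return min(candidates, key=lambda t: validation_nll(val_logits, val_labels, t))


# An overconfident toy model: about 98% confident on every input, but only 75% accurate.
val_logits = [[4.0, 0.0]] * 4
val_labels = [0, 0, 0, 1]
temperature = fit_temperature(val_logits, val_labels)
calibrated = softmax([4.0, 0.0], temperature)
```

Because the temperature is fit on in-distribution validation data, nothing in this procedure accounts for shifted inputs, which is consistent with the caution above.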

Fortunately, one method degrades in uncertainty much more gracefully than others. Deep ensembles (green), which average the predictions of a selection of models, each of which has a different initialization, are a simple strategy that significantly improves robustness to shift and outperforms all other methods tested.
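The averaging step of a deep ensemble is simple enough to sketch directly. The example assumes each ensemble member has already produced class probabilities for the same input; the function name and the toy numbers are hypothetical.

```python
def ensemble_predict(member_probs):
    """Average class probabilities across ensemble members for a single input.

    `member_probs` holds one probability vector per independently initialized model.
    """
    n_members = len(member_probs)
    n_classes = len(member_probs[0])
    return [sum(member[k] for member in member_probs) / n_members for k in range(n_classes)]


# On a familiar input the members agree, so the average stays confident.
agree = ensemble_predict([[0.9, 0.1], [0.8, 0.2], [0.9, 0.1]])

# On an unusual input the members disagree, so the average becomes less confident.
disagree = ensemble_predict([[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]])
```

Disagreement between members is what lowers the averaged confidence, which gives one intuition for why ensembles can report more uncertainty on shifted data.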

Summary and Recommended Best Practices
In our paper, we explored the behavior of state-of-the-art models under dataset shift across images, text, online advertising data and genomics. Our findings were mostly consistent across these different kinds of data. The quality of uncertainty degrades under dataset shift, but there are promising avenues of research to mitigate this. We hope that deep learning users take home the following messages from our study:
  1. Uncertainty under dataset shift is a real concern that needs to be considered when training models.
  2. Improving calibration and accuracy on an in-distribution test set often does not translate to improved calibration on shifted data.
  3. Out of all the methods we considered, deep ensembles are the most robust to dataset shift, and a relatively small ensemble size (e.g., 5) is sufficient. The effectiveness of ensembles presents interesting avenues for improving other approaches.
Improving the predictive uncertainty of deep learning models remains an active area of research in ML. We have released all of the code and model predictions from this benchmark in the hope that it will be useful to the community to drive and evaluate future work on this important topic.

Source: Google AI Blog


Improving Out-of-Distribution Detection in Machine Learning Models



Successful deployment of machine learning systems requires that the system be able to identify data that is anomalous or significantly different from that used in training. This is particularly important for deep neural network classifiers, which might classify such out-of-distribution (OOD) inputs into in-distribution classes with high confidence, and it becomes critical when these predictions inform real-world decisions.

For example, one challenging real-world application of machine learning models is bacteria identification based on genomic sequences. Bacteria detection is crucial for diagnosis and treatment of infectious diseases, such as sepsis, and for identifying foodborne pathogens. New bacterial classes continue to be discovered over the years, and while a neural network classifier trained on the known classes achieves high accuracy as measured through cross-validation, deploying a model is challenging, since real-world data is ever evolving and will inevitably contain genomes from unseen classes (OOD inputs) not present in the training data.
New bacterial classes are gradually discovered over the years. A classifier trained on known classes achieves high accuracy for test inputs belonging to known classes, but can wrongly classify inputs from unknown classes (i.e., out-of-distribution) into known classes with high confidence.
In “Likelihood Ratios for Out-of-Distribution Detection”, presented at NeurIPS 2019, we proposed and released a realistic benchmark dataset of genomic sequences for OOD detection that is inspired by the real-world challenges described above. We tested existing methods for OOD detection using generative models on genomic sequences and found that the likelihood values (i.e., the model's probability that an input comes from the distribution as estimated using in-distribution data) were often in error. This phenomenon has also been observed in recent work on deep generative models of images. We explain this phenomenon through the effect of background statistics and propose a likelihood-ratio based solution that significantly improves the accuracy of OOD detection.

Why Do Density Models Fail At OOD Detection?
To mimic the real problem and systematically evaluate different methods, we built a new bacterial dataset using data sourced from the publicly available NCBI catalog of prokaryotic genome sequences. To mimic sequencing data, we fragmented genomes into short sequences of 250 base pairs, a length commonly generated by current sequencing technology. We then separated in- and out-of-distribution data by the date of discovery, such that bacterial classes discovered before a cutoff time were defined as in-distribution, and those discovered afterward as OOD.
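The two preprocessing steps described above can be sketched as follows. This is a minimal illustration, not the pipeline used to build the benchmark: the function names, the use of non-overlapping windows, the handling of the leftover tail, and the treatment of classes discovered exactly on the cutoff date are all assumptions made for the example.

```python
from datetime import date

READ_LENGTH = 250  # base pairs per fragment, as stated in the text


def fragment_genome(genome, read_length=READ_LENGTH):
    """Cut a genome string into non-overlapping reads of `read_length`.

    Assumption: a trailing piece shorter than `read_length` is dropped.
    """
    last_start = len(genome) - read_length
    return [genome[i:i + read_length]
            for i in range(0, last_start + 1, read_length)]


def split_by_discovery(discovery_dates, cutoff):
    """Split class names into (in_distribution, ood) by discovery date.

    `discovery_dates` maps a class name to a `date`. Classes discovered
    before `cutoff` are in-distribution; the rest are treated as OOD
    (assigning the cutoff date itself to OOD is an assumption).
    """
    in_dist = sorted(n for n, d in discovery_dates.items() if d < cutoff)
    ood = sorted(n for n, d in discovery_dates.items() if d >= cutoff)
    return in_dist, ood
```

For instance, an 800-character genome yields three 250-character reads under these assumptions, and any choice of cutoff date partitions the class names into two disjoint lists.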

We then trained a deep generative model on in-distribution genomic sequences and examined how well the model discriminated between in- and out-of-distribution inputs by plotting their likelihood values. The histogram of the likelihood for OOD sequences largely overlaps with that of in-distribution sequences, indicating that the generative model was unable to distinguish between the two populations for OOD detection. Similar results were shown in earlier work for deep generative models of images. For instance, a PixelCNN++ model trained on images from the Fashion-MNIST dataset (which consists of images of clothing and footwear) assigns higher likelihood to OOD images from the MNIST dataset (which consists of images of digits 0-9).
Left: Histogram of likelihood values for in- and out-of-distribution (OOD) genomic sequences. The likelihood fails to separate in-distribution and OOD genomic sequences. Right: A similar plot for a model trained on Fashion-MNIST and evaluated on MNIST. The model assigns higher likelihood values for OOD (MNIST) than in-distribution images.
When investigating this failure mode, we observed that the likelihood can be confounded by background statistics. To understand the phenomenon more intuitively, assume that an input is composed of two components, (1) a background component characterized by background statistics, and (2) a semantic component characterized by patterns specific to the in-distribution data. For example, an MNIST image can be modeled as background plus semantics. When humans interpret the image, we can easily ignore the background and focus primarily on the semantic information, e.g., the “/” mark in the image below. But the likelihood is calculated for all pixels in an image, including both semantic and background pixels. Though we want to use just the semantic likelihood for decision making, the raw likelihood can be dominated by background.
Left top: Sample images from Fashion-MNIST. Left bottom: Sample images from MNIST. Right: Background and semantic components in an MNIST image.
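The background-plus-semantics picture can be written compactly. As a sketch, write $x_B$ for the background component and $x_S$ for the semantic component, and assume that the two are generated independently. These symbols and the independence assumption are introduced here for illustration and are not stated in the text above.

```latex
\log p(x) = \log p(x_B) + \log p(x_S)
```

When the background term $\log p(x_B)$ is much larger in magnitude than $\log p(x_S)$, differences in the raw likelihood between inputs mostly reflect background statistics rather than semantics.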
Likelihood Ratios For OOD Detection
We propose a likelihood ratio method that removes the effect of background and focuses on semantics. First, we train a background model on perturbed inputs. The method for perturbing the input is inspired by genetic mutations, and proceeds by randomly selecting positions in the input and substituting the value at each with one drawn with equal probability from the possible values. For images, the values are randomly chosen from the 256 possible pixel values, and for DNA sequences, the value is selected from the four possible nucleotides (A, T, C, or G). The right amount of perturbation corrupts the semantic structure in the data, so that the background model captures only the background statistics. Then we compute the likelihood ratio between the full model and the background model, so that the background component is cancelled out and only the likelihood for semantics remains. The likelihood ratio is a background contrastive score, i.e., it captures the significance of the semantics compared to the background.
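The perturbation and scoring steps can be sketched in a few lines of Python. This is an illustrative sketch only: the function names, the `mutation_rate` parameter, and the convention that each position is perturbed independently are assumptions, and the two models are represented as plain callables that return a log-likelihood rather than as trained networks.

```python
import random


def perturb(sequence, mutation_rate, alphabet, rng):
    """Return a perturbed copy of `sequence`.

    Each position is selected independently with probability
    `mutation_rate` and replaced by a symbol drawn uniformly from
    `alphabet` (which may happen to equal the original symbol).
    """
    return [rng.choice(alphabet) if rng.random() < mutation_rate else s
            for s in sequence]


def likelihood_ratio(log_p_full, log_p_background, x):
    """Score x by log p_full(x) - log p_background(x).

    `log_p_full` stands in for the model trained on clean inputs and
    `log_p_background` for the model trained on perturbed inputs.
    """
    return log_p_full(x) - log_p_background(x)


# Hypothetical usage on a short DNA string.
rng = random.Random(0)
noisy_read = perturb(list("ACGTACGTAC"), 0.2, "ACGT", rng)
```

If both models explain the background equally well, the background contribution appears in both log-likelihoods and cancels in the difference, which is the intuition given in the text. How much to perturb is left open here, since the text only says that the amount needs to be chosen appropriately.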

To qualitatively evaluate the difference between the likelihood and likelihood ratio, we plotted their values for each pixel in the Fashion-MNIST and MNIST datasets, creating heatmaps that have the same size as the images. This allows us to visualize which pixels contribute the most to the two terms, respectively. From the log-likelihood heatmaps, we see that the background pixels contribute much more to the likelihood than the semantic pixels. In hindsight, this is not surprising, since background pixels consist mostly of a string of zeros, a pattern very easily learned by the model. A comparison between the MNIST and Fashion-MNIST heatmaps demonstrates why MNIST returns higher likelihood values — it simply has a lot more background pixels! The likelihood ratio instead focuses more on the semantic pixels.
Left: Log-likelihood heatmaps for Fashion-MNIST and MNIST datasets. Right: The same examples showing heatmaps of the likelihood-ratio. Pixels with higher values are of lighter shades. The likelihood is dominated by the “background” pixels, whereas the likelihood ratio focuses on the “semantic” pixels and is thus better for OOD detection.
Our likelihood ratio method corrects the background effect and significantly improves the OOD detection of MNIST images from an AUROC score of 0.089 to 0.994, based on a PixelCNN++ model trained for Fashion-MNIST. When applied to the genomic benchmark dataset, this method achieves state-of-the-art performance on this challenging problem, when compared to 12 other baseline methods.
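To make the AUROC figures easier to interpret, here is a small sketch that computes AUROC directly from its ranking interpretation: the probability that a randomly chosen in-distribution input receives a higher score than a randomly chosen OOD input, with ties counted as one half. Treating in-distribution inputs as the positive class is an assumption of this sketch, and the function name is hypothetical.

```python
def auroc(in_dist_scores, ood_scores):
    """Pairwise AUROC with in-distribution inputs as the positive class.

    Returns the fraction of (in-distribution, OOD) pairs in which the
    in-distribution score is higher; ties contribute 0.5.
    """
    wins = 0.0
    for s_in in in_dist_scores:
        for s_out in ood_scores:
            if s_in > s_out:
                wins += 1.0
            elif s_in == s_out:
                wins += 0.5
    return wins / (len(in_dist_scores) * len(ood_scores))
```

Under this convention, 0.5 corresponds to chance and 1.0 to perfect separation. A value far below 0.5, such as the 0.089 reported for the raw likelihood, means the score ranks OOD inputs above in-distribution inputs most of the time, which matches the observation that the model assigns higher likelihood to MNIST than to Fashion-MNIST.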

For more details, please check out our recent paper at NeurIPS 2019. While our likelihood ratio method reaches state-of-the-art performance on the genomic dataset, it does not yet have high enough accuracy to reach the standards for deployment of the model to real applications. We encourage researchers to contribute their solutions to this important problem and improve the current state-of-the-art. The dataset is available on our GitHub repository.

Acknowledgments
The work described here was authored by Jie Ren, Peter J. Liu, Emily Fertig, Jasper Snoek, Ryan Poplin, Mark A. DePristo, Joshua V. Dillon, Balaji Lakshminarayanan, through a collaboration spanning several teams across Google AI and DeepMind. We are grateful for all the discussions and feedback on this work that we received from the reviewers at NeurIPS 2019, and our colleagues at Google and DeepMind: Alexander A. Alemi, Andreea Gane, Brian Lee, D. Sculley, Eric Jang, Jacob Burnim, Katherine Lee, Matthew D. Hoffman, Noah Fiedel, Rif A. Saurous, Suman Ravuri, Thomas Colthurst, Yaniv Ovadia, along with the Google Brain and TensorFlow teams.

Source: Google AI Blog


Google at NeurIPS 2019



This week, Vancouver hosts the 33rd annual Conference on Neural Information Processing Systems (NeurIPS 2019), the biggest machine learning conference of the year. The conference includes invited talks, demonstrations and presentations of some of the latest in machine learning research. As a Diamond Sponsor of NeurIPS 2019, Google will have a strong presence, with more than 500 Googlers attending in order to contribute to, and learn from, the broader academic research community via talks, posters, workshops, competitions and tutorials. We will be presenting work that pushes the boundaries of what is possible in language understanding, translation, speech recognition and visual & audio perception, with Googlers co-authoring more than 130 accepted papers.

If you are attending NeurIPS 2019, we hope you’ll stop by our booth and chat with our researchers about the projects and opportunities at Google that go into solving the world's most challenging research problems, and to see demonstrations of some of the exciting research we pursue, such as ML-based Flood Forecasting, AI for Social Good, Google Research Football, Google Dataset Search, TF-Agents and much more. You can also learn more about our work being presented in the list below (Google affiliations highlighted in blue).

NeurIPS Foundation Board
Samy Bengio, Corinna Cortes

NeurIPS Advisory Board
John C. Platt, Fernando Pereira, Dale Schuurmans

NeurIPS Program Committee
Program Chair: Hugo Larochelle
Diversity & Inclusion Co-Chair: Katherine Heller
Meetup Chair: Nicolas Le Roux
Party Co-Chair: Pablo Samuel Castro

Senior Area Chairs include: Amir Globerson, Claudio Gentile, Cordelia Schmid, Corinna Cortes, Dale Schuurmans, Elad Hazan, Honglak Lee, Mehryar Mohri, Peter Bartlett, Satyen Kale, Sergey Levine, Surya Ganguli

Area Chairs include: Afshin Rostamizadeh, Alex Kulesza, Amin Karbasi, Andrew Dai, Been Kim, Boqing Gong, Branislav Kveton, Ce Liu, Charles Sutton, Chelsea Finn, Cho-Jui Hsieh, D Sculley, Danny Tarlow, David Held, Denny Zhou, Yann Dauphin, Dustin Tran, Hartmut Neven, Hossein Mobahi, Ilya Tolstikhin, Jasper Snoek, Jean-Philippe Vert, Jeffrey Pennington, Kevin Swersky, Kun Zhang, Kunal Talwar, Lihong Li, Manzil Zaheer, Marc G Bellemare, Marco Cuturi, Maya Gupta, Meg Mitchell, Minmin Chen, Mohammad Norouzi, Moustapha Cisse, Olivier Bachem, Qiang Liu, Rong Ge, Sanjiv Kumar, Sanmi Koyejo, Sebastian Nowozin, Sergei Vassilvitskii, Shivani Agarwal, Slav Petrov, Srinadh Bhojanapalli, Stephen Bach, Timnit Gebru, Tomer Koren, Vitaly Feldman, William Cohen, Nicolas Le Roux

NeurIPS Workshops Program Committee
Yann Dauphin, Honglak Lee, Sebastian Nowozin, Fernanda Viegas

NeurIPS Invited Talk
Social Intelligence
Blaise Aguera y Arcas

Accepted Papers
Memory Efficient Adaptive Optimization
Rohan Anil, Vineet Gupta, Tomer Koren, Yoram Singer

Stand-Alone Self-Attention in Vision Models
Niki Parmar, Prajit Ramachandran, Ashish Vaswani, Irwan Bello, Anselm Levskaya, Jon Shlens

High Fidelity Video Prediction with Large Neural Nets
Ruben Villegas, Arkanath Pathak, Harini Kannan, Dumitru Erhan, Quoc V. Le, Honglak Lee

Unsupervised Learning of Object Structure and Dynamics from Videos
Matthias Minderer, Chen Sun, Ruben Villegas, Forrester Cole, Kevin Murphy, Honglak Lee

GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism
Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, Hyouk Joong Lee, Jiquan Ngiam, Quoc V. Le, Yonghui Wu, Zhifeng Chen

Quadratic Video Interpolation
Xiangyu Xu, Li Si-Yao, Wenxiu Sun, Qian Yin, Ming-Hsuan Yang

Online Stochastic Shortest Path with Bandit Feedback and Unknown Transition Function
Aviv Rosenberg, Yishay Mansour

Individual Regret in Cooperative Nonstochastic Multi-Armed Bandits
Yogev Bar-On, Yishay Mansour

Learning to Screen
Alon Cohen, Avinatan Hassidim, Haim Kaplan, Yishay Mansour, Shay Moran

DualDICE: Behavior-Agnostic Estimation of Discounted Stationary Distribution Corrections
Ofir Nachum, Yinlam Chow, Bo Dai, Lihong Li

A Kernel Loss for Solving the Bellman Equation
Yihao Feng, Lihong Li, Qiang Liu

Accurate Uncertainty Estimation and Decomposition in Ensemble Learning
Jeremiah Liu, John Paisley, Marianthi-Anna Kioumourtzoglou, Brent Coull

Saccader: Improving Accuracy of Hard Attention Models for Vision
Gamaleldin F. Elsayed, Simon Kornblith, Quoc V. Le

Invertible Convolutional Flow
Mahdi Karami, Dale Schuurmans, Jascha Sohl-Dickstein, Laurent Dinh, Daniel Duckworth

Hypothesis Set Stability and Generalization
Dylan J. Foster, Spencer Greenberg, Satyen Kale, Haipeng Luo, Mehryar Mohri, Karthik Sridharan

Bandits with Feedback Graphs and Switching Costs
Raman Arora, Teodor V. Marinov, Mehryar Mohri

Regularized Gradient Boosting
Corinna Cortes, Mehryar Mohri, Dmitry Storcheus

Logarithmic Regret for Online Control
Naman Agarwal, Elad Hazan, Karan Singh

Sampled Softmax with Random Fourier Features
Ankit Singh Rawat, Jiecao Chen, Felix Yu, Ananda Theertha Suresh, Sanjiv Kumar

Multilabel Reductions: What is My Loss Optimising?
Aditya Krishna Menon, Ankit Singh Rawat, Sashank Reddi, Sanjiv Kumar

MetaInit: Initializing Learning by Learning to Initialize
Yann N. Dauphin, Sam Schoenholz

Generalization Bounds for Neural Networks via Approximate Description Length
Amit Daniely, Elad Granot

Variance Reduction of Bipartite Experiments through Correlation Clustering
Jean Pouget-Abadie, Kevin Aydin, Warren Schudy, Kay Brodersen, Vahab Mirrokni

Likelihood Ratios for Out-of-Distribution Detection
Jie Ren, Peter J. Liu, Emily Fertig, Jasper Snoek, Ryan Poplin, Mark A. DePristo, Joshua V. Dillon, Balaji Lakshminarayanan

Can You Trust Your Model’s Uncertainty? Evaluating Predictive Uncertainty Under Dataset Shift
Yaniv Ovadia, Emily Fertig, Jie Jessie Ren, D. Sculley, Josh Dillon, Sebastian Nowozin, Zack Nado, Balaji Lakshminarayanan, Jasper Snoek

Surrogate Objectives for Batch Policy Optimization in One-step Decision Making
Minmin Chen, Ramki Gummadi, Chris Harris, Dale Schuurmans

Globally Optimal Learning for Structured Elliptical Losses
Yoav Wald, Nofar Noy, Gal Elidan, Ami Wiesel

DPPNet: Approximating Determinantal Point Processes with Deep Networks
Zelda Mariet, Yaniv Ovadia, Jasper Snoek

Graph Normalizing Flows
Jenny Liu, Aviral Kumar, Jimmy Ba, Jamie Kiros, Kevin Swersky

When Does Label Smoothing Help?
Rafael Müller, Simon Kornblith, Geoffrey Hinton

On the Role of Inductive Bias From Simulation and the Transfer to the Real World: a new Disentanglement Dataset
Muhammad Waleed Gondal, Manuel Wüthrich, Đorđe Miladinović, Francesco Locatello, Martin Breidt, Valentin Volchkov, Joel Akpo, Olivier Bachem, Bernhard Schölkopf, Stefan Bauer

On the Fairness of Disentangled Representations
Francesco Locatello, Gabriele Abbati, Tom Rainforth, Stefan Bauer, Bernhard Schölkopf, Olivier Bachem

Are Disentangled Representations Helpful for Abstract Visual Reasoning?
Sjoerd van Steenkiste, Francesco Locatello, Jürgen Schmidhuber, Olivier Bachem

Don’t Blame the ELBO! A Linear VAE Perspective on Posterior Collapse
James Lucas, George Tucker, Roger Grosse, Mohammad Norouzi

Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction
Aviral Kumar, Justin Fu, Matthew Soh, George Tucker, Sergey Levine

Optimizing Generalized Rate Metrics with Game Equilibrium
Harikrishna Narasimhan, Andrew Cotter, Maya Gupta

On Making Stochastic Classifiers Deterministic
Andrew Cotter, Harikrishna Narasimhan, Maya Gupta

Discrete Flows: Invertible Generative Models of Discrete Data
Dustin Tran, Keyon Vafa, Kumar Agrawal, Laurent Dinh, Ben Poole

Graph Agreement Models for Semi-Supervised Learning
Otilia Stretcu, Krishnamurthy Viswanathan, Dana Movshovitz-Attias, Emmanouil Platanios, Andrew Tomkins, Sujith Ravi

A Robust Non-Clairvoyant Dynamic Mechanism for Contextual Auctions
Yuan Deng, Sébastien Lahaie, Vahab Mirrokni

Adversarial Robustness through Local Linearization
Chongli Qin, James Martens, Sven Gowal, Dilip Krishnan, Krishnamurthy (Dj) Dvijotham, Alhusein Fawzi, Soham De, Robert Stanforth, Pushmeet Kohli

A Geometric Perspective on Optimal Representations for Reinforcement Learning
Marc G. Bellemare, Will Dabney, Robert Dadashi, Adrien Ali Taiga, Pablo Samuel Castro, Nicolas Le Roux, Dale Schuurmans, Tor Lattimore, Clare Lyle

Online Learning via the Differential Privacy Lens
Jacob Abernethy, Young Hun Jung, Chansoo Lee, Audra McMillan, Ambuj Tewari

Reducing the Variance in Online Optimization by Transporting Past Gradients
Sébastien M. R. Arnold, Pierre-Antoine Manzagol, Reza Babanezhad, Ioannis Mitliagkas, Nicolas Le Roux

Universality and Individuality in Neural Dynamics Across Large Populations of Recurrent Networks
Niru Maheswaranathan, Alex Williams, Matt Golub, Surya Ganguli, David Sussillo

Reverse Engineering Recurrent Networks for Sentiment Classification Reveals Line Attractor Dynamics
Niru Maheswaranathan, Alex H. Williams, Matthew D. Golub, Surya Ganguli, David Sussillo

Strategizing Against No-Regret Learners
Yuan Deng, Jon Schneider, Balasubramanian Sivan

Prior-Free Dynamic Auctions with Low Regret Buyers
Yuan Deng, Jon Schneider, Balasubramanian Sivan

Private Stochastic Convex Optimization with Optimal Rates
Raef Bassily, Vitaly Feldman, Kunal Talwar, Abhradeep Thakurta

Computational Separations between Sampling and Optimization
Kunal Talwar

Momentum-Based Variance Reduction in Non-Convex SGD
Ashok Cutkosky, Francesco Orabona

Kernel Truncated Randomized Ridge Regression: Optimal Rates and Low Noise Acceleration
Kwang-Sung Jun, Ashok Cutkosky, Francesco Orabona

Fast and Flexible Multi-Task Classification using Conditional Neural Adaptive Processes
James Requeima, Jonathan Gordon, John Bronskill, Sebastian Nowozin, Richard E. Turner

Icebreaker: Element-wise Active Information Acquisition with Bayesian Deep Latent Gaussian Model
Wenbo Gong, Sebastian Tschiatschek, Richard E. Turner, Sebastian Nowozin, Jose Miguel Hernandez-Lobato, Cheng Zhang

Multiview Aggregation for Learning Category-Specific Shape Reconstruction
Srinath Sridhar, Davis Rempe, Julien Valentin, Sofien Bouaziz, Leonidas J. Guibas

Visualizing and Measuring the Geometry of BERT
Andy Coenen, Emily Reif, Ann Yuan, Been Kim, Adam Pearce, Fernanda Viégas, Martin Wattenberg

Locality-Sensitive Hashing for f-Divergences: Mutual Information Loss and Beyond
Lin Chen, Hossein Esfandiari, Thomas Fu, Vahab S. Mirrokni

A Benchmark for Interpretability Methods in Deep Neural Networks
Sara Hooker, Dumitru Erhan, Pieter-jan Kindermans, Been Kim

Practical and Consistent Estimation of f-Divergences
Paul Rubenstein, Olivier Bousquet, Josip Djolonga, Carlos Riquelme, Ilya Tolstikhin

Tree-Sliced Variants of Wasserstein Distances
Tam Le, Makoto Yamada, Kenji Fukumizu, Marco Cuturi

Game Design for Eliciting Distinguishable Behavior
Fan Yang, Liu Leqi, Yifan Wu, Zachary Lipton, Pradeep Ravikumar, Tom M Mitchell, William Cohen

Differentially Private Anonymized Histograms
Ananda Theertha Suresh

Locally Private Gaussian Estimation
Matthew Joseph, Janardhan Kulkarni, Jieming Mao, Zhiwei Steven Wu

Exponential Family Estimation via Adversarial Dynamics Embedding
Bo Dai, Zhen Liu, Hanjun Dai, Niao He, Arthur Gretton, Le Song, Dale Schuurmans

Learning to Predict Without Looking Ahead: World Models Without Forward Prediction
C. Daniel Freeman, Luke Metz, David Ha

Adaptive Density Estimation for Generative Models
Thomas Lucas, Konstantin Shmelkov, Karteek Alahari, Cordelia Schmid, Jakob Verbeek

Weight Agnostic Neural Networks
Adam Gaier, David Ha

Retrosynthesis Prediction with Conditional Graph Logic Network
Hanjun Dai, Chengtao Li, Connor Coley, Bo Dai, Le Song

Large Scale Structure of Neural Network Loss Landscapes
Stanislav Fort, Stanislaw Jastrzebski

Off-Policy Evaluation via Off-Policy Classification
Alex Irpan, Kanishka Rao, Konstantinos Bousmalis, Chris Harris, Julian Ibarz, Sergey Levine

Domes to Drones: Self-Supervised Active Triangulation for 3D Human Pose Reconstruction
Aleksis Pirinen, Erik Gartner, Cristian Sminchisescu

Energy-Inspired Models: Learning with Sampler-Induced Distributions
Dieterich Lawson, George Tucker, Bo Dai, Rajesh Ranganath

From Deep Learning to Mechanistic Understanding in Neuroscience: The Structure of Retinal Prediction
Hidenori Tanaka, Aran Nayebi, Niru Maheswaranathan, Lane McIntosh, Stephen Baccus, Surya Ganguli

Language as an Abstraction for Hierarchical Deep Reinforcement Learning
Yiding Jiang, Shixiang Gu, Kevin Murphy, Chelsea Finn

Bayesian Layers: A Module for Neural Network Uncertainty
Dustin Tran, Michael W. Dusenberry, Mark van der Wilk, Danijar Hafner

Adaptive Temporal-Difference Learning for Policy Evaluation with Per-State Uncertainty Estimates
Hugo Penedones, Carlos Riquelme, Damien Vincent, Hartmut Maennel, Timothy Mann, Andre Barreto, Sylvain Gelly, Gergely Neu

A Unified Framework for Data Poisoning Attack to Graph-based Semi-Supervised Learning
Xuanqing Liu, Si Si, Xiaojin Zhu, Yang Li, Cho-Jui Hsieh

MixMatch: A Holistic Approach to Semi-Supervised Learning
David Berthelot, Nicholas Carlini, Ian Goodfellow (work done while at Google), Avital Oliver, Nicolas Papernot, Colin Raffel

SMILe: Scalable Meta Inverse Reinforcement Learning through Context-Conditional Policies
Seyed Kamyar Seyed Ghasemipour, Shixiang (Shane) Gu, Richard Zemel

Limits of Private Learning with Access to Public Data
Noga Alon, Raef Bassily, Shay Moran

Regularized Weighted Low Rank Approximation
Frank Ban, David Woodruff, Richard Zhang

Unsupervised Curricula for Visual Meta-Reinforcement Learning
Allan Jabri, Kyle Hsu, Abhishek Gupta, Benjamin Eysenbach, Sergey Levine, Chelsea Finn

Secretary Ranking with Minimal Inversions
Sepehr Assadi, Eric Balkanski, Renato Paes Leme

Mixtape: Breaking the Softmax Bottleneck Efficiently
Zhilin Yang, Thang Luong, Russ Salakhutdinov, Quoc V. Le

Budgeted Reinforcement Learning in Continuous State Space
Nicolas Carrara, Edouard Leurent, Romain Laroche, Tanguy Urvoy, Odalric-Ambrym Maillard, Olivier Pietquin

From Complexity to Simplicity: Adaptive ES-Active Subspaces for Blackbox Optimization
Krzysztof Choromanski, Aldo Pacchiano, Jack Parker-Holder, Yunhao Tang

Flattening a Hierarchical Clustering through Active Learning
Fabio Vitale, Anand Rajagopalan, Claudio Gentile

Robust Attribution Regularization
Jiefeng Chen, Xi Wu, Vaibhav Rastogi, Yingyu Liang, Somesh Jha

Robustness Verification of Tree-based Models
Hongge Chen, Huan Zhang, Si Si, Yang Li, Duane Boning, Cho-Jui Hsieh

Meta Architecture Search
Albert Shaw, Wei Wei, Weiyang Liu, Le Song, Bo Dai

Contextual Bandits with Cross-Learning
Santiago Balseiro, Negin Golrezaei, Mohammad Mahdian, Vahab Mirrokni, Jon Schneider

Dynamic Incentive-Aware Learning: Robust Pricing in Contextual Auctions
Negin Golrezaei, Adel Javanmard, Vahab Mirrokni

Optimizing Generalized Rate Metrics with Three Players
Harikrishna Narasimhan, Andrew Cotter, Maya Gupta

Noise-Tolerant Fair Classification
Alexandre Louis Lamy, Ziyuan Zhong, Aditya Krishna Menon, Nakul Verma

Towards Automatic Concept-based Explanations
Amirata Ghorbani, James Wexler, James Zou, Been Kim

Locally Private Learning without Interaction Requires Separation
Amit Daniely, Vitaly Feldman

Learning GANs and Ensembles Using Discrepancy
Ben Adlam, Corinna Cortes, Mehryar Mohri, Ningshan Zhang

CondConv: Conditionally Parameterized Convolutions for Efficient Inference
Brandon Yang, Gabriel Bender, Quoc V. Le, Jiquan Ngiam

A Fourier Perspective on Model Robustness in Computer Vision
Dong Yin, Raphael Gontijo Lopes, Jonathon Shlens, Ekin D. Cubuk, Justin Gilmer

Robust Bi-Tempered Logistic Loss Based on Bregman Divergences
Ehsan Amid, Manfred K. Warmuth, Rohan Anil, Tomer Koren

Which Algorithmic Choices Matter at Which Batch Sizes? Insights From a Noisy Quadratic Model
Guodong Zhang, Lala Li, Zachary Nado, James Martens, Sushant Sachdeva, George E. Dahl, Christopher J. Shallue, Roger Grosse

Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent
Jaehoon Lee, Lechao Xiao, Samuel S. Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, Jeffrey Pennington

Abstract Reasoning with Distracting Features
Kecheng Zheng, Zheng-Jun Zha, Wei Wei

Search on the Replay Buffer: Bridging Planning and Reinforcement Learning
Benjamin Eysenbach, Ruslan Salakhutdinov, Sergey Levine

Differentiable Ranking and Sorting Using Optimal Transport
Marco Cuturi, Olivier Teboul, Jean-Philippe Vert

XLNet: Generalized Autoregressive Pretraining for Language Understanding
Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le

Private Learning Implies Online Learning: An Efficient Reduction
Alon Gonen, Elad Hazan, Shay Moran

Evaluating Protein Transfer Learning with TAPE
Roshan Rao, Nicholas Bhattacharya, Neil Thomas, Yan Duan, Peter Chen, John Canny, Pieter Abbeel, Yun Song

Tight Dimensionality Reduction for Sketching Low Degree Polynomial Kernels
Michela Meister, Tamas Sarlos, David P. Woodruff

No Pressure! Addressing the Problem of Local Minima in Manifold Learning Algorithms
Max Vladymyrov

Subspace Detours: Building Transport Plans that are Optimal on Subspace Projections
Boris Muzellec, Marco Cuturi

On the Transfer of Inductive Bias from Simulation to the Real World: a New Disentanglement Dataset
Muhammad Waleed Gondal, Manuel Wüthrich, Đorđe Miladinović, Francesco Locatello, Martin Breidt, Valentin Volchkov, Joel Akpo, Olivier Bachem, Bernhard Schölkopf, Stefan Bauer

Stacked Capsule Autoencoders
Adam R. Kosiorek, Sara Sabour, Yee Whye Teh, Geoffrey E. Hinton

Wasserstein Dependency Measure for Representation Learning
Sherjil Ozair, Corey Lynch, Yoshua Bengio, Aaron van den Oord, Sergey Levine, Pierre Sermanet

Sampling Sketches for Concave Sublinear Functions of Frequencies
Edith Cohen, Ofir Geri

Hamiltonian Neural Networks
Sam Greydanus, Misko Dzamba, Jason Yosinski

Computational Mirrors: Blind Inverse Light Transport by Deep Matrix Factorization
Miika Aittala, Prafull Sharma, Lukas Murmann, Adam B. Yedidia, Gregory W. Wornell, William T. Freeman, Frédo Durand

Transfusion: Understanding Transfer Learning for Medical Imaging
Maithra Raghu, Chiyuan Zhang, Jon Kleinberg, Samy Bengio

Differentially Private Covariance Estimation
Kareem Amin, Travis Dick, Alex Kulesza, Andres Munoz, Sergei Vassilvitskii

Learning Transferable Graph Exploration
Hanjun Dai, Yujia Li, Chenglong Wang, Rishabh Singh, Po-Sen Huang, Pushmeet Kohli

Neural Attribution for Semantic Bug-Localization in Student Programs
Rahul Gupta, Aditya Kanade, Shirish Shevade

PyTorch: An Imperative Style, High-Performance Deep Learning Library
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, Soumith Chintala

Breaking the Glass Ceiling for Embedding-Based Classifiers for Large Output Spaces
Chuan Guo, Ali Mousavi, Xiang Wu, Daniel Holtmann-Rice, Satyen Kale, Sashank Reddi, Sanjiv Kumar

Efficient Rematerialization for Deep Networks
Ravi Kumar, Manish Purohit, Zoya Svitkina, Erik Vee, Joshua R. Wang

Workshops
3rd Conversational AI: Today's Practice and Tomorrow's Potential
Organizers include: Bill Byrne

AI for Humanitarian Assistance and Disaster Response Workshop
Invited Speakers include: Yossi Matias

Bayesian Deep Learning
Organizers include: Kevin P Murphy

Beyond First Order Methods in Machine Learning Systems
Invited Speakers include: Elad Hazan

Biological and Artificial Reinforcement Learning
Invited Speakers include: Igor Mordatch

Context and Compositionality in Biological and Artificial Neural Systems
Invited Speakers include: Kenton Lee

Deep Reinforcement Learning
Organizers include: Chelsea Finn

Document Intelligence
Organizers include: Tania Bedrax Weiss

Federated Learning for Data Privacy and Confidentiality
Organizers include: Jakub Konečný, Brendan McMahan
Invited Speakers include: Françoise Beaufays, Daniel Ramage

Graph Representation Learning
Organizers include: Rianne van den Berg

Human-Centric Machine Learning
Invited Speakers include: Been Kim

Information Theory and Machine Learning
Organizers include: Ben Poole
Invited Speakers include: Alex Alemi

KR2ML - Knowledge Representation and Reasoning Meets Machine Learning
Invited Speakers include: William Cohen

Learning Meaningful Representations of Life
Organizers include: Jasper Snoek, Alexander Wiltschko

Learning Transferable Skills
Invited Speakers include: David Ha

Machine Learning for Creativity and Design
Organizers include: Adam Roberts, Jesse Engel

Machine Learning for Health (ML4H): What Makes Machine Learning in Medicine Different?
Invited Speakers include: Lily Peng, Alan Karthikesalingam, Dale Webster

Machine Learning and the Physical Sciences
Speakers include: Yasaman Bahri, Samuel Schoenholz

ML for Systems
Organizers include: Milad Hashemi, Kevin Swersky, Azalia Mirhoseini, Anna Goldie
Invited Speakers include: Jeff Dean

Optimal Transport for Machine Learning
Organizers include: Marco Cuturi

The Optimization Foundations of Reinforcement Learning
Organizers include: Bo Dai, Nicolas Le Roux, Lihong Li, Dale Schuurmans

Privacy in Machine Learning
Invited Speakers include: Brendan McMahan

Program Transformations for ML
Organizers include: Pascal Lamblin, Alexander Wiltschko, Bart van Merrienboer, Emily Fertig
Invited Speakers include: Skye Wanderman-Milne

Real Neurons & Hidden Units: Future Directions at the Intersection of Neuroscience and Artificial Intelligence
Organizers include: David Sussillo

Robot Learning: Control and Interaction in the Real World
Organizers include: Stefan Schaal

Safety and Robustness in Decision Making
Organizers include: Yinlam Chow

Science Meets Engineering of Deep Learning
Invited Speakers include: Yasaman Bahri, Surya Ganguli, Been Kim

Sets and Partitions
Organizers include: Manzil Zaheer, Andrew McCallum
Invited Speakers include: Amr Ahmed

Tackling Climate Change with ML
Organizers include: John Platt
Invited Speakers include: Jeff Dean

Visually Grounded Interaction and Language
Invited Speakers include: Jason Baldridge

Workshop on Machine Learning with Guarantees
Invited Speakers include: Mehryar Mohri

Tutorials
Representation Learning and Fairness
Organizers include: Moustapha Cisse, Sanmi Koyejo

Source: Google AI Blog


Exploring Massively Multilingual, Massive Neural Machine Translation



“... perhaps the way [of translation] is to descend, from each language, down to the common base of human communication — the real but as yet undiscovered universal language — and then re-emerge by whatever particular route is convenient.” (Warren Weaver, 1949)

Over the last few years there has been enormous progress in the quality of machine translation (MT) systems, breaking language barriers around the world thanks to the developments in neural machine translation (NMT). The success of NMT, however, is owed largely to the large amounts of supervised training data. But what about languages where data is scarce, or even absent? Multilingual NMT, with the inductive bias that “the learning signal from one language should benefit the quality of translation to other languages”, is a potential remedy.

Multilingual machine translation processes multiple languages using a single translation model. The success of multilingual training for data-scarce languages has been demonstrated for automatic speech recognition and text-to-speech systems, and by prior research on multilingual translation [1,2,3]. We previously studied the effect of scaling up the number of languages that can be learned in a single neural network, while controlling the amount of training data per language. But what happens once all constraints are removed? Can we train a single model using all of the available data, despite the huge differences across languages in data size, scripts, complexity and domains?

In “Massively Multilingual Neural Machine Translation in the Wild: Findings and Challenges” and follow-up papers [4,5,6,7], we push the limits of research on multilingual NMT by training a single NMT model on 25+ billion sentence pairs, from 100+ languages to and from English, with 50+ billion parameters. The result is an approach for massively multilingual, massive neural machine translation (M4) that demonstrates large quality improvements on both low- and high-resource languages and can be easily adapted to individual domains/languages, while showing great efficacy on cross-lingual downstream transfer tasks.

Massively Multilingual Machine Translation
Though data skew across language pairs is a great challenge in NMT, it also creates an ideal scenario in which to study transfer, where insights gained through training on one language can be applied to the translation of other languages. On one end of the distribution, there are high-resource languages like French, German and Spanish, for which there are billions of parallel examples, while on the other end, supervised data for low-resource languages such as Yoruba, Sindhi and Hawaiian is limited to a few tens of thousands of examples.
The data distribution over all language pairs (in log scale) and the relative translation quality (BLEU score) of the bilingual baselines trained on each one of these specific language pairs.
Once trained using all of the available data (25+ billion examples from 103 languages), we observe strong positive transfer towards low-resource languages, dramatically improving the translation quality of 30+ languages at the tail of the distribution by an average of 5 BLEU points. Although this effect is already known, it is surprisingly encouraging, considering the comparison is between bilingual baselines (i.e., models trained only on specific language pairs) and a single multilingual model with representational capacity similar to a single bilingual model. This finding hints that massively multilingual models are effective at generalization, and capable of capturing the representational similarity across a large body of languages.
Translation quality comparison of a single massively multilingual model against bilingual baselines that are trained for each one of the 103 language pairs.
In our EMNLP’19 paper [5], we compare the representations of multilingual models across different languages. We find that multilingual models learn shared representations for linguistically similar languages without the need for external constraints, validating long-standing intuitions and empirical results that exploit these similarities. In [6], we further demonstrate the effectiveness of these learned representations on cross-lingual transfer on downstream tasks.
Visualization of the clustering of the encoded representations of all 103 languages, based on representational similarity. Languages are color-coded by their linguistic family.
Building Massive Neural Networks
As we increase the number of low-resource languages in the model, the quality of high-resource language translations starts to decline. This regression is recognized in multi-task setups, arising from inter-task competition and the unidirectional nature of transfer (i.e., from high- to low-resource). While working on better learning and capacity control algorithms to mitigate this negative transfer, we also extend the representational capacity of our neural networks by increasing the number of model parameters, which improves the quality of translation for high-resource languages.

Numerous design choices can be made to scale neural network capacity, including adding more layers or making the hidden representations wider. Continuing our study on training deeper networks for translation, we utilized GPipe [4] to train 128-layer Transformers with over 6 billion parameters. Increasing the model capacity resulted in significantly improved performance across all languages by an average of 5 BLEU points. We also studied other properties of very deep networks, including the depth-width trade-off, trainability challenges and design choices for scaling Transformers to over 1500 layers with 84 billion parameters.

While scaling depth is one approach to increasing model capacity, exploring architectures that can exploit the multi-task nature of the problem is a very plausible complementary way forward. By modifying the Transformer architecture through the substitution of the vanilla feed-forward layers with sparsely-gated mixture of experts, we drastically scale up the model capacity, allowing us to successfully train models that surpass 50 billion parameters, which further improved translation quality across the board.
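To make the idea of a sparsely-gated mixture of experts concrete, here is a minimal sketch in plain Python. It is a generic illustration, not the M4 implementation: the names `moe_layer` and `top_k_gates`, the choice of `k=2`, the ReLU experts and the tiny dimensions are all assumptions made for the example, and refinements such as gating noise or load balancing are left out.

```python
import math
import random

def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]

def top_k_gates(logits, k):
    """Softmax over the k largest logits; every other expert gets weight 0."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    largest = max(logits[i] for i in top)
    exps = {i: math.exp(logits[i] - largest) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def moe_layer(x, gate_weights, experts, k=2):
    """Replace one feed-forward layer by a gated mixture of k active experts."""
    gates = top_k_gates(matvec(gate_weights, x), k)
    out = [0.0] * len(x)
    for i, gate in gates.items():          # only the selected experts are evaluated
        w_in, w_out = experts[i]
        hidden = [max(0.0, h) for h in matvec(w_in, x)]   # ReLU feed-forward expert
        expert_out = matvec(w_out, hidden)
        out = [o + gate * e for o, e in zip(out, expert_out)]
    return out, gates

# Hypothetical toy sizes, chosen only for illustration.
rng = random.Random(0)
d_model, d_hidden, num_experts = 4, 8, 6

def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]

gate_weights = rand_matrix(num_experts, d_model)
experts = [(rand_matrix(d_hidden, d_model), rand_matrix(d_model, d_hidden))
           for _ in range(num_experts)]
x = [rng.gauss(0.0, 1.0) for _ in range(d_model)]
y, gates = moe_layer(x, gate_weights, experts, k=2)
```

The total parameter count grows with the number of experts, while the computation per input grows only with `k`, which is what lets capacity scale without a matching increase in per-example cost.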
Translation quality improvement of a single massively multilingual model as we increase the capacity (number of parameters) compared to 103 individual bilingual baselines.
Making M4 Practical
It is inefficient to train large models with extremely high computational costs for every individual language, domain or transfer task. Instead, we present methods [7] to make these models more practical by using capacity-tunable layers to adapt a new model to specific languages or domains, without altering the original.
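One way to picture adapting a model without altering the original is a small residual layer added on top of frozen weights. The sketch below assumes that form; the post does not spell out the layer design, so the `adapter` function, the bottleneck shape and the zero initialization are illustrative assumptions rather than the method in [7].

```python
def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]

def adapter(hidden, w_down, w_up):
    """Residual bottleneck: hidden + W_up * relu(W_down * hidden)."""
    bottleneck = [max(0.0, v) for v in matvec(w_down, hidden)]
    delta = matvec(w_up, bottleneck)
    return [h + d for h, d in zip(hidden, delta)]

# Output of a frozen layer of the original model (hypothetical values).
hidden = [0.5, -1.0, 2.0]

# Down-projection to a bottleneck of size 2; its size is the tunable capacity.
w_down = [[1.0, 0.0, 0.0],
          [0.0, 0.0, 1.0]]

# With a zero up-projection the adapter is the identity, so the adapted
# model starts out behaving exactly like the original.
w_up_zero = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
unchanged = adapter(hidden, w_down, w_up_zero)

# After (hypothetical) fine-tuning, only the adapter weights differ.
w_up_tuned = [[0.1, 0.0], [0.0, 0.0], [0.0, 0.1]]
adapted = adapter(hidden, w_down, w_up_tuned)
```

Because only `w_down` and `w_up` are trained, one such pair can be stored per language or domain while the large shared model stays untouched.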

Next Steps
At least half of the 7,000 languages currently spoken will no longer exist by the end of this century*. Can multilingual machine translation come to the rescue? We see the M4 approach as a stepping stone towards serving the next 1,000 languages; starting from such multilingual models will allow us to easily extend to new languages, domains and downstream tasks, even when parallel data is unavailable. Indeed the path is rocky, and on the road to universal MT many promising solutions appear to be interdisciplinary. This makes multilingual NMT a plausible test bed for machine learning practitioners and theoreticians interested in exploring the annals of multi-task learning, meta-learning, training dynamics of deep nets and much more. We still have a long way to go.

Acknowledgements
This effort is built on contributions from Naveen Arivazhagan, Dmitry Lepikhin, Melvin Johnson, Maxim Krikun, Mia Chen, Yuan Cao, Yanping Huang, Sneha Kudugunta, Isaac Caswell, Aditya Siddhant, Wei Wang, Roee Aharoni, Sébastien Jean, George Foster, Colin Cherry, Wolfgang Macherey, Zhifeng Chen and Yonghui Wu. We would also like to acknowledge support from the Google Translate, Brain, and Lingvo development teams, Jakob Uszkoreit, Noam Shazeer, Hyouk Joong Lee, Dehao Chen, Youlong Cheng, David Grangier, Colin Raffel, Katherine Lee, Thang Luong, Geoffrey Hinton, Manisha Jain, Pendar Yousefi and Macduff Hughes.


* The Cambridge Handbook of Endangered Languages (Austin and Sallabank, 2011).

Source: Google AI Blog


The NeurIPS 2018 Test of Time Award: The Trade-Offs of Large Scale Learning



Progress in machine learning (ML) is happening so rapidly that it can sometimes feel like any idea or algorithm more than two years old is already outdated or superseded by something better. However, old ideas sometimes remain relevant even when a large fraction of the scientific community has turned away from them. This is often a question of context: an idea which may seem to be a dead end in a particular context may become wildly successful in a different one. In the specific case of deep learning (DL), the growth of both the availability of data and computing power renewed interest in the area and significantly influenced research directions.

The NIPS 2008 paper “The Trade-Offs of Large Scale Learning” by Léon Bottou (then at NEC Labs, now at Facebook AI Research) and Olivier Bousquet (Google AI, Zürich) is a good example of this phenomenon. As the recent recipient of the NeurIPS 2018 Test of Time Award, this seminal work investigated the interplay between data and computation in ML, showing that if one is limited by computing power but can make use of a large dataset, it is more efficient to perform a small amount of computation on many individual training examples rather than to perform extensive computation on a subset of the data. This demonstrated the power of an old algorithm, stochastic gradient descent, which is nowadays used in pretty much all applications of DL.

Optimization and the Challenge of Scale
Many ML algorithms can be thought of as the combination of two main ingredients:
  • A model, which is a set of possible functions that will be used to fit the data.
  • An optimization algorithm which specifies how to find the best function in that set.
Back in the 1990s, the datasets used in ML were much smaller than the ones in use today, and while artificial neural networks had already led to some successes, they were considered hard to train. In the early 2000s, with the introduction of kernel machines (SVMs in particular), neural networks went out of fashion. Simultaneously, attention shifted away from the optimization algorithms that had been used to train neural networks (stochastic gradient descent) toward those used for kernel machines (quadratic programming). One important difference is that in the former case, training examples are used one at a time to perform gradient steps (this is called “stochastic”), while in the latter case, all training examples are used at each iteration (this is called “batch”).
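The difference between a stochastic and a batch update can be sketched on a toy one-parameter least-squares problem. Everything here (the model `y = w * x`, the noise-free data, the learning rate of 0.5 and the function names) is an assumption chosen for illustration. Plain batch gradient descent stands in for the batch case; it is not the quadratic programming solver used for kernel machines.

```python
import random

def batch_step(w, data, lr):
    """One batch gradient step: every example contributes to the gradient."""
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    return w - lr * grad

def stochastic_step(w, example, lr):
    """One stochastic gradient step: a single example drives the update."""
    x, y = example
    return w - lr * 2 * (w * x - y) * x

# Toy data from y = 3x without noise, so both methods can reach w = 3.
rng = random.Random(0)
data = [(x, 3.0 * x) for x in (rng.uniform(-1.0, 1.0) for _ in range(200))]

w_batch = 0.0
for _ in range(50):            # 50 iterations, each touching all 200 examples
    w_batch = batch_step(w_batch, data, lr=0.5)

w_sgd = 0.0
for example in data:           # a single pass, each step touching one example
    w_sgd = stochastic_step(w_sgd, example, lr=0.5)
```

In this sketch the batch loop evaluates 50 × 200 = 10,000 per-example gradients, while the stochastic loop evaluates only 200, one per update.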

As the size of the training sets increased, the efficiency of optimization algorithms to handle large amounts of data became a bottleneck. For example, in the case of quadratic programming, running time scales at least quadratically in the number of examples. In other words, if you double your training set size, your training will take at least 4 times longer. Hence, lots of effort went into trying to make these algorithms scale to larger training sets (see for example Large Scale Kernel Machines).

People who had experience with training neural networks knew that stochastic gradient descent was comparatively easier to scale to large datasets, but unfortunately its convergence is very slow (it takes lots of iterations to reach an accuracy comparable to that of a batch algorithm), so it wasn’t clear that this would be a solution to the scaling problem.

Stochastic Algorithms Scale Better
In the context of ML, the number of iterations needed to optimize the cost function is actually not the main concern: there is no point in perfectly tuning your model since you will essentially “overfit” to the training data. So why not reduce the computational effort that you put into tuning the model and instead spend the effort processing more data?

The work of Léon and Olivier provided a formal study of this phenomenon: by considering access to a large amount of data and assuming the limiting factor is computation, they showed that it is better to perform a minimal amount of computation on each individual training example (thus processing more of them) rather than performing extensive computation on a smaller amount of data.
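The trade-off described above can be illustrated with a small simulation. This is a toy sketch under stated assumptions, not the analysis from the paper: the one-parameter model, the noise level, the budget of 2,000 per-example gradient evaluations, the subset size of 20 and the 1/t step size are all hypothetical choices.

```python
import random

TRUE_W = 3.0

def make_example(rng):
    """Draw one noisy example from y = TRUE_W * x + noise."""
    x = rng.uniform(-1.0, 1.0)
    return x, TRUE_W * x + rng.gauss(0.0, 1.0)

def batch_on_subset(rng, budget, subset_size=20, lr=0.5):
    """Spend the budget on many full-gradient passes over a small subset."""
    data = [make_example(rng) for _ in range(subset_size)]
    w = 0.0
    for _ in range(budget // subset_size):
        grad = sum(2 * (w * x - y) * x for x, y in data) / subset_size
        w -= lr * grad
    return w

def sgd_on_stream(rng, budget):
    """Spend the budget on one cheap gradient step per fresh example."""
    w = 0.0
    for t in range(1, budget + 1):
        x, y = make_example(rng)
        w -= (1.0 / t) * 2 * (w * x - y) * x   # decaying step size 1/t
    return w

def mean_squared_error(method, budget, trials=20):
    """Average squared distance to TRUE_W over several independent runs."""
    rng = random.Random(0)
    return sum((method(rng, budget) - TRUE_W) ** 2 for _ in range(trials)) / trials

BUDGET = 2000   # total number of per-example gradient evaluations allowed
mse_batch = mean_squared_error(batch_on_subset, BUDGET)
mse_sgd = mean_squared_error(sgd_on_stream, BUDGET)
```

Both methods perform the same number of per-example gradient evaluations. The subset approach fits its 20 examples almost perfectly but inherits their noise, whereas the streaming approach never fits any example precisely yet averages over 100 times more data, so `mse_sgd` should come out well below `mse_batch` on this toy problem.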

In doing so, they also demonstrated that among various possible optimization algorithms, stochastic gradient descent is the best. This was confirmed by many experiments and led to a renewed interest in online optimization algorithms which are now in extensive use in ML.

Mysteries Remain
In the following years, many variants of stochastic gradient descent were developed both in the convex case and in the non-convex one (particularly relevant for DL). The most common variant now is the so-called “mini-batch” SGD where one considers a small number (~10-100) of training examples at each iteration, and performs several passes over the training set, with a couple of clever tricks to scale the gradient appropriately. Most ML libraries provide a default implementation of such an algorithm and it is arguably one of the pillars of DL.
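A bare-bones version of mini-batch SGD, with several shuffled passes over the training set, might look like the sketch below. The function name `minibatch_sgd`, the batch size of 32, the five passes and the toy linear model are assumptions for illustration. Averaging the gradient over the batch is just one simple way of scaling it and is not meant to represent the specific tricks alluded to above.

```python
import random

def minibatch_sgd(data, batch_size=32, epochs=5, lr=0.5, seed=0):
    """Fit y = w * x by mini-batch SGD over several shuffled passes."""
    rng = random.Random(seed)
    data = list(data)          # copy so the caller's list is not reordered
    w = 0.0
    for _ in range(epochs):    # each epoch is one pass over the training set
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Average the per-example gradients so the step size does not
            # depend on how many examples are in the batch.
            grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= lr * grad
    return w

# Toy noise-free data from y = 2x.
data_rng = random.Random(1)
train = [(x, 2.0 * x) for x in (data_rng.uniform(-1.0, 1.0) for _ in range(320))]
w_fit = minibatch_sgd(train)
```

With 320 examples and a batch size of 32, each pass performs 10 updates, so the five passes amount to 50 gradient steps in total.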

While this analysis provided a solid foundation for understanding the properties of this algorithm, the amazing and sometimes surprising successes of DL continue to raise many more questions for the scientific community. In particular, the role of this algorithm in the generalization properties of deep networks has been repeatedly demonstrated but is still poorly understood. This means that a lot of fascinating questions are yet to be explored which could lead to a better understanding of the algorithms currently in use and the development of even more efficient algorithms in the future.

The perspective proposed by Léon and Olivier in their collaboration 10 years ago provided a significant boost to the development of the algorithm that is nowadays the workhorse of ML systems that benefit our lives daily, and we offer our sincere congratulations to both authors on this well-deserved award.

Source: Google AI Blog